[ClusterLabs] Pacemaker

2016-11-01 Thread Siwakoti, Ganesh
Hi,


I'm using CentOS release 6.8 (Final) as a KVM guest and have configured three
nodes (PM1.local, PM2.local and PM3.local) with CMAN clustering. Resources run
on two active nodes, and the third node acts as a standby for failover. The
configured resources are groupA and groupB, with 5 resources in each resource
group. groupA normally runs on PM1.local and groupB on PM2.local. If either
resource group fails, it fails over to PM3.local, which may run only one
resource group at a time. groupA should never run on PM2.local and groupB
should never run on PM1.local. groupA and groupB should not move back to their
home nodes from PM3.local automatically; if they need to move back, that should
be done manually. Could someone please help me?
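
In other words, I expect the placement rules to look roughly like this (a rough
sketch in crm shell syntax; the scores and constraint IDs are only illustrative
and are not taken from my actual configuration below):

location groupA-prefers-pm1 groupA 100: pm1.local
location groupA-never-pm2 groupA -inf: pm2.local
location groupB-prefers-pm2 groupB 100: pm2.local
location groupB-never-pm1 groupB -inf: pm1.local
# keep a group where it is after a failover (no automatic fail-back)
rsc_defaults resource-stickiness=INFINITY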


version information:

CentOS release 6.8 (Final) (KVM)

Pacemaker 1.1.14-8.el6_8.1

Cman Version: 6.2.0


I want to be sure whether my configuration will work properly or not.


My cluster configuration and CRM configuration are below:


cluster.conf:

[the cluster.conf XML was stripped from the archived message]

crm_configuration:


node pm1.local \

attributes \

utilization capacity=1

node pm2.local \

attributes \

utilization capacity=1

node pm3.local \

attributes \

utilization capacity=1

primitive asteriskA asterisk \

params binary="/usr/sbin/asterisk" canary_binary=astcanary additional_parameters="-vvvg -I" config="/etc/asterisk_pm1/asterisk.conf" user=root group=root additional_parameters="-vvvg -I" realtime=true maxfiles=32768 \

meta migration-threshold=2 \

op monitor interval=20s start-delay=5s timeout=30s \

op stop interval=0s on-fail=ignore

primitive asteriskB asterisk \

params binary="/usr/sbin/asterisk" canary_binary=astcanary additional_parameters="-vvvg -I" config="/etc/asterisk_pm2/asterisk.conf" user=root group=root additional_parameters="-vvvg -I" realtime=true maxfiles=32768 \

meta migration-threshold=2 \

op monitor interval=20s start-delay=5s timeout=30s \

op stop interval=0s on-fail=ignore

primitive changeSrcIpA ocf:pacemaker:changeSrcIp \

params vip=192.168.12.215 mask=23 device=eth0 \

op start interval=0s timeout=0 \

op monitor interval=10s \

op stop interval=0s on-fail=ignore

primitive changeSrcIpB ocf:pacemaker:changeSrcIp \

params vip=192.168.12.216 mask=23 device=eth0 \

op start interval=0s timeout=0 \

op monitor interval=10s \

op stop interval=0s on-fail=ignore

primitive cronA lsb:crond \

meta migration-threshold=2 \

op monitor interval=20s start-delay=5s timeout=15s \

op stop interval=0s on-fail=ignore

primitive cronB lsb:crond \

meta migration-threshold=2 \

op monitor interval=20s start-delay=5s timeout=15s \

op stop interval=0s on-fail=ignore

primitive vip-local-checkA VIPcheck \

params target_ip=192.168.12.215 count=1 wait=5 \

op start interval=0s on-fail=restart timeout=60s \

op monitor interval=10s timeout=60s \

op stop interval=0s on-fail=ignore timeout=60s \

utilization capacity=1

primitive vip-local-checkB VIPcheck \

params target_ip=192.168.12.216 count=1 wait=5 \

op start interval=0s on-fail=restart timeout=60s \

op monitor interval=10s timeout=60s \

op stop interval=0s on-fail=ignore timeout=60s \

utilization capacity=1

primitive vip-localA IPaddr2 \

params ip=192.168.12.215 cidr_netmask=23 nic=eth0 iflabel=0
broadcast=192.168.13.255 \

op start interval=0s timeout=20s \

op monitor interval=5s timeout=20s \

op stop interval=0s on-fail=ignore

primitive vip-localB IPaddr2 \

params ip=192.168.12.216 cidr_netmask=23 nic=eth0 iflabel=0
broadcast=192.168.13.255 \

op start interval=0s timeout=20s \

op monitor interval=5s timeout=20s \

op stop interval=0s on-fail=ignore

group groupA vip-local-checkA vip-localA changeSrcIpA cronA asteriskA \

meta target-role=Started

group groupB vip-local-checkB vip-localB changeSrcIpB cronB asteriskB \

meta

location location-groupA-avoid-failed-node groupA \

rule -inf: defined fail-count-vip-local-checkA \

rule -inf: defined fail-count-vip-localA \

rule -inf: defined fail-count-changeSrcIpA \

rule -inf: defined fail-count-cronA \

rule -inf: defined fail-count-asteriskA \

rule -inf: defined fail-count-vip-local-checkB \

rule -inf: defined fail-count-vip-localB \

rule -inf: defined fail-count-change

[ClusterLabs] Pacemaker: pgsql

2019-09-24 Thread Shital A
Hello,

We have set up an active-passive cluster using streaming replication on
RHEL 7.5. We are testing Pacemaker for automated failover.
We are seeing the issues below with this setup:

1. When a failover is triggered while data is being added to the primary, by
killing the primary (killall -9 postgres), the standby doesn't come back up in
sync. On the Pacemaker side, crm_mon -Afr shows the standby in DISCONNECT and
HS:alone state.

On the PostgreSQL side, we see the errors below:

< 2019-09-20 17:07:46.266 IST > LOG:  entering standby mode
< 2019-09-20 17:07:46.267 IST > LOG:  database system was not properly shut
down; automatic recovery in progress
< 2019-09-20 17:07:46.270 IST > LOG:  redo starts at 1/680A2188
< 2019-09-20 17:07:46.370 IST > LOG:  consistent recovery state reached at
1/6879D9F8
< 2019-09-20 17:07:46.370 IST > LOG:  database system is ready to accept
read only connections
cp: cannot stat '/var/lib/pgsql/9.6/data/archivedir/000100010068':
No such file or directory
< 2019-09-20 17:07:46.751 IST > LOG:  statement: select pg_is_in_recovery()
< 2019-09-20 17:07:46.782 IST > LOG:  statement: show
synchronous_standby_names
< 2019-09-20 17:07:50.993 IST > LOG:  statement: select pg_is_in_recovery()
< 2019-09-20 17:07:53.395 IST > LOG:  started streaming WAL from primary at
1/6800 on timeline 1
< 2019-09-20 17:07:53.436 IST > LOG:  invalid contrecord length 2662 at
1/6879D9F8
< 2019-09-20 17:07:53.438 IST > FATAL:  terminating walreceiver process due
to administrator command
cp: cannot stat '/var/lib/pgsql/9.6/data/archivedir/0002.history': No
such file or directory
cp: cannot stat '/var/lib/pgsql/9.6/data/archivedir/000100010068':
No such file or directory

When we try to restart PostgreSQL on the standby using pg_ctl restart, the
standby starts syncing.


2. After the standby syncs via pg_ctl restart as mentioned above, we found
that 1-2 records are missing on the standby.

We need help to check:
1. Why does the standby start in the DISCONNECT/HS:alone state?

If you have faced this issue or have relevant knowledge, please let us know.

Thanks.
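
For reference, a sketch of the queries we use to inspect replication state while
debugging (assuming the PostgreSQL 9.6 catalog/function names):

-- on the primary: which standbys are connected, and in which state
SELECT client_addr, state, sync_state FROM pg_stat_replication;

-- on the standby: last WAL location received and replayed
SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();
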
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Pacemaker Shutdown

2020-07-21 Thread Harvey Shepherd
Hi All,

I'm running Pacemaker 2.0.3 on a two-node cluster, controlling 40+ resources 
which are a mixture of clones and other resources that are colocated with the 
master instance of certain clones. I've noticed that if I terminate pacemaker 
on the node that is hosting the master instances of the clones, Pacemaker 
focuses on stopping resources on that node BEFORE failing over to the other 
node, leading to a longer outage than necessary. Is there a way to change this 
behaviour?

These are the actions and logs I saw during the test:

# /etc/init.d/pacemaker stop
Signaling Pacemaker Cluster Manager to terminate

Waiting for cluster services to 
unload..sending 
signal 9 to procs


2020 Jul 22 06:16:50.581 Chassis2 daemon.notice CTR8740 pacemaker. Signaling 
Pacemaker Cluster Manager to terminate
2020 Jul 22 06:16:50.599 Chassis2 daemon.notice CTR8740 pacemaker. Waiting for 
cluster services to unload
2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740 pacemaker-based.6140  
warning: new_event_notification (6140-6141-9): Broken pipe (32)
2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740 pacemaker-based.6140  
warning: Notification of client stonithd/665bde82-cb28-40f7-9132-8321dc2f1992 
failed
2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740 pacemaker-based.6140  
warning: new_event_notification (6140-6143-8): Broken pipe (32)
2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740 pacemaker-based.6140  
warning: Notification of client attrd/a26ca273-3422-4ebe-8cb7-95849b8ff130 
failed
2020 Jul 22 06:18:03.320 Chassis1 daemon.warning CTR8740 
pacemaker-schedulerd.6240  warning: Blind faith: not fencing unseen nodes
2020 Jul 22 06:18:58.941 Chassis2 user.crit CTR8740 supervisor. pacemaker is 
inactive (3).

Regards,
Harvey
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Pacemaker license

2015-10-06 Thread Santosh_Bidaralli
Hello Pacemaker Admins,

We have a query regarding the licensing of the Pacemaker header files.

As per the given link http://clusterlabs.org/wiki/License, it is mentioned that 
"Pacemaker programs are licensed under the 
GPLv2+ 
(version 2 or later of the GPL) and its headers and libraries are under the 
less restrictive 
LGPLv2+ 
(version 2 or later of the LGPL) ."

However, the page
http://clusterlabs.org/doxygen/pacemaker/2927a0f9f25610c331b6a137c846fec27032c9ea/cib_8h.html
states otherwise. The cib.h header file needs to be included in order to configure
Pacemaker using the C API, but the header for cib.h states that it is under the GPL
license. This seems to conflict with the statement regarding the header file license.

In addition, a similar issue was discussed in the past
(http://www.gossamer-threads.com/lists/linuxha/pacemaker/75967), but there are no
additional details on the resolution.

We need your input on licensing to proceed further.

Thanks & Regards
Santosh Bidaralli


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker

2016-11-02 Thread Ken Gaillot
On 11/01/2016 02:31 AM, Siwakoti, Ganesh wrote:
> Hi,
> 
> 
> i'm using CentOS release 6.8 (Final) as a KVM and i configured 3
> nodes(PM1.local,PM2.local and PM3.local), and using
> CMAN clustering. Resources running at two nodes as Active node then
> another one node is for Fail-over resource as a Standby node. Configured
> resources are groupA and groupB, 5 resources are configured in each
> resource group. groupA is Basically run at PM1.local and groupB at
> PM2.local. If fail any resource group it'll fail-over at PM3.local. On
> PM3.local can run only one Resource group at a time. groupA should not
> run at PM2.local and groupB should not run at PM1.local. groupA or
> groupB resources not move back on own nodes from PM3.local
> automatically, if need to move back on own nodes, can move by
> manually.someone pease help me.

Your configuration looks good to me. I don't think you need the -inf
rules for fail-counts; migration-threshold=1 handles that. Are you
seeing any problems?
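
For instance, something like this (a rough sketch in crm shell syntax; pick
whatever threshold you prefer, per resource or as a default) covers the same
case without the per-resource fail-count rules:

# move a resource off a node after one failure there
rsc_defaults migration-threshold=1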

> 
> 
> version information:
> 
> CentOS release 6.8 (Final) (KVM)
> 
> Pacemaker 1.1.14-8.el6_8.1
> 
> Cman Version: 6.2.0
> 
> 
> I want to be sure that my configuration will work properly or not.
> 
> 
> my Cluster configuration and crm_configuration are on below:
> 
> 
> cluster.conf:
> [the quoted cluster.conf XML was stripped in the archive; the surviving
> fragments show fence devices with port="pm1.local", port="pm2.local" and
> port="pm3.local"]
> 
> crm_configuration:
> 
> 
> node pm1.local \
> 
> attributes \
> 
> utilization capacity=1
> 
> node pm2.local \
> 
> attributes \
> 
> utilization capacity=1
> 
> node pm3.local \
> 
> attributes \
> 
> utilization capacity=1
> 
> primitive asteriskA asterisk \
> 
> params binary="/usr/sbin/asterisk" canary_binary=astcanary
> additional_parameters="-vvvg -I" config="/etc/asteri
> 
> sk_pm1/asterisk.conf" user=root group=root additional_parameters="-vvvg
> -I" realtime=true maxfiles=32768 \
> 
> meta migration-threshold=2 \
> 
> op monitor interval=20s start-delay=5s timeout=30s \
> 
> op stop interval=0s on-fail=ignore
> 
> primitive asteriskB asterisk \
> 
> params binary="/usr/sbin/asterisk" canary_binary=astcanary
> additional_parameters="-vvvg -I" config="/etc/asteri
> 
> sk_pm2/asterisk.conf" user=root group=root additional_parameters="-vvvg
> -I" realtime=true maxfiles=32768 \
> 
> meta migration-threshold=2 \
> 
> op monitor interval=20s start-delay=5s timeout=30s \
> 
> op stop interval=0s on-fail=ignore
> 
> primitive changeSrcIpA ocf:pacemaker:changeSrcIp \
> 
> params vip=192.168.12.215 mask=23 device=eth0 \
> 
> op start interval=0s timeout=0 \
> 
> op monitor interval=10s \
> 
> op stop interval=0s on-fail=ignore
> 
> primitive changeSrcIpB ocf:pacemaker:changeSrcIp \
> 
> params vip=192.168.12.216 mask=23 device=eth0 \
> 
> op start interval=0s timeout=0 \
> 
> op monitor interval=10s \
> 
> op stop interval=0s on-fail=ignore
> 
> primitive cronA lsb:crond \
> 
> meta migration-threshold=2 \
> 
> op monitor interval=20s start-delay=5s timeout=15s \
> 
> op stop interval=0s on-fail=ignore
> 
> primitive cronB lsb:crond \
> 
> meta migration-threshold=2 \
> 
> op monitor interval=20s start-delay=5s timeout=15s \
> 
> op stop interval=0s on-fail=ignore
> 
> primitive vip-local-checkA VIPcheck \
> 
> params target_ip=192.168.12.215 count=1 wait=5 \
> 
> op start interval=0s on-fail=restart timeout=60s \
> 
> op monitor interval=10s timeout=60s \
> 
> op stop interval=0s on-fail=ignore timeout=60s \
> 
> utilization capacity=1
> 
> primitive vip-local-checkB VIPcheck \
> 
> params target_ip=192.168.12.216 count=1 wait=5 \
> 
> op start interval=0s on-fail=restart timeout=60s \
> 
> op monitor interval=10s timeout=60s \
> 
> op stop interval=0s on-fail=ignore timeout=60s \
> 
> utilization capacity=1
> 
> primitive vip-localA IPaddr2 \
> 
> params ip=192.168.12.215 cidr_netmask=23 nic=eth0 iflabel=0
> broadcast=192.168.13.255 \
> 
> op start interval=0s timeout=20s \
> 
> op monitor interval=5s timeout=20s \
> 
> op stop interval=0s on-fail=ignore
> 
> primitive vip-localB IPaddr2 \
> 
> params ip=192.168.12.216 cidr_netmask=23 nic=eth0 ifl

Re: [ClusterLabs] Pacemaker

2016-11-11 Thread Siwakoti, Ganesh
Hi Ken Gaillot,

Sorry for the late reply, and thank you for responding.

I have a reason for using the -inf rules.

There are two resource groups across three nodes (2 active, 1 standby).

When the group that normally runs on pm1.local (groupA) fails there, it fails
over to pm3.local, the common node that can host either resource group. If
groupA fails again on pm3.local, it returns to its own node, pm1.local. The
problem is that in this situation, if groupB fails on pm2.local, it will also
fail over to pm3.local even though groupA has already been failing on that
node. That is why I was using the -inf rules: once a resource has failed on a
node, I want to prevent any resource group from failing over to that same node.
I don't know how to define that concisely, so my -inf configuration is long and
lists every resource. Do you have any ideas about this? Thanks in advance.
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] pacemaker fencing

2017-09-05 Thread Papastavros Vaggelis
Dear friends,

I have a two-node (sgw-01 and sgw-02) HA cluster integrated with two APC PDUs
as fence devices:

1) pcs stonith create node2-psu fence_apc ipaddr="10.158.0.162" login="apc"
passwd="apc" port="1" pcmk_host_list="sgw-02" pcmk_host_check="static-list"
action="reboot" power_wait="5" op   monitor interval="60s"
pcs constraint location node2-psu prefers sgw-01=INFINITY


2)pcs stonith create node1-psu fence_apc ipaddr="10.158.0.161" login="apc"
passwd="apc" port="1" pcmk_host_list="sgw-01" pcmk_host_check="static-list"
action="reboot" power_wait="5" op monitor interval="60s"
pcs constraint location node1-psu prefers sgw-02=INFINITY


Fencing is working fine, but I want to deploy the following scenario:

If one node is fenced more than X times during a specified time period, then
change the fence action from reboot to off.

For example, if a node has been rebooted more than 3 times during one hour,
then at the next crash change the fence action from reboot to off.

Is there a proper way to implement the above scenario?

Is it mandatory to use two-level fencing for this case?
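
For context, my current understanding of how two fencing levels would be
registered (a sketch with pcs; the second-level device node1-psu-off with
action="off" is hypothetical and is not part of my configuration above):

pcs stonith create node1-psu-off fence_apc ipaddr="10.158.0.161" login="apc" passwd="apc" port="1" pcmk_host_list="sgw-01" pcmk_host_check="static-list" action="off" power_wait="5"
pcs stonith level add 1 sgw-01 node1-psu
pcs stonith level add 2 sgw-01 node1-psu-off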



Sincerely

Vaggelis Papastavros
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker question

2022-10-04 Thread Jelen, Piotr
Dear ClusterLabs team,

I would like to ask whether it is possible to use a different user
(e.g. cephhauser) to authenticate/set up the cluster, or whether there is
another method of authenticating/setting up the cluster that does not use the
password of a dedicated Pacemaker user such as hacluster?


Best Regards
Piotr Jelen
Senior Systems Platform Engineer

Mastercard
Mountain View, Central Park  | Leopard


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] pacemaker-remote

2023-09-14 Thread Mr.R via Users
Hi all,

In Pacemaker-Remote 2.1.6, the pacemaker package is required
for guest nodes but not for remote nodes. Why is that? What does
pacemaker do there?
After adding a guest node, the pacemaker package does not seem to be
needed. Can I skip installing it in that case?

In testing, remote nodes can be taken offline, but guest nodes cannot.
Is there any way to take them offline? Are there any relevant
failure test cases?


thanks,
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] pacemaker alerts list

2019-07-17 Thread Gershman, Vladimir
Hi,

Is there a list of all possible alerts/events that Pacemaker can send out?
Preferably with criticality levels for the alerts (minor, major, critical).



Thank you,

Vlad
Equipment Management (EM) System Engineer

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Pacemaker: pgsql

2019-09-27 Thread Shital A
On Tue, 24 Sep 2019, 22:20 Shital A,  wrote:

> Hello,
>
> We have setup active-passive cluster using streaming replication on
> Rhel7.5. We are testing pacemaker for automated failover.
> We are seeing below issues with the setup :
>
> 1. When a failover is triggered when data is being added to the primary by
> killing primary (killall -9 postgres), the standby doesnt come up in sync.
> On pacemaker, the crm_mon -Afr shows standby in disconnected and HS:alone
> state.
>
> On postgres, we see below error:
>
> < 2019-09-20 17:07:46.266 IST > LOG:  entering standby mode
> < 2019-09-20 17:07:46.267 IST > LOG:  database system was not properly
> shut down; automatic recovery in progress
> < 2019-09-20 17:07:46.270 IST > LOG:  redo starts at 1/680A2188
> < 2019-09-20 17:07:46.370 IST > LOG:  consistent recovery state reached at
> 1/6879D9F8
> < 2019-09-20 17:07:46.370 IST > LOG:  database system is ready to accept
> read only connections
> cp: cannot stat '/var/lib/pgsql/9.6/data/archivedir/000100010068':
> No such file or directory
> < 2019-09-20 17:07:46.751 IST > LOG:  statement: select pg_is_in_recovery()
> < 2019-09-20 17:07:46.782 IST > LOG:  statement: show
> synchronous_standby_names
> < 2019-09-20 17:07:50.993 IST > LOG:  statement: select pg_is_in_recovery()
> < 2019-09-20 17:07:53.395 IST > LOG:  started streaming WAL from primary
> at 1/6800 on timeline 1
> < 2019-09-20 17:07:53.436 IST > LOG:  invalid contrecord length 2662 at
> 1/6879D9F8
> < 2019-09-20 17:07:53.438 IST > FATAL:  terminating walreceiver process
> due to administrator command
> cp: cannot stat '/var/lib/pgsql/9.6/data/archivedir/0002.history': No
> such file or directory
> cp: cannot stat '/var/lib/pgsql/9.6/data/archivedir/000100010068':
> No such file or directory
>
> When we try to restart postgres on the standby, using pg_ctl restart, the
> standby start syncing.
>
>
> 2. After standby syncs using pg_ctl restart as mentioned above, we found
> out that 1-2 records are missing on the standby.
>
> Need help to check:
> 1. why the standby starts in disconnect, HS:alone state?
>
> f you have faced this issue/have knowledge, please let us know.
>
> Thanks.
>


Hello,

I didn't receive any reply on this issue. I am wondering whether there are no
opinions, or whether Pacemaker with pgsql is not recommended.


Thanks!

>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Pacemaker: pgsql

2019-09-27 Thread Ken Gaillot
On Fri, 2019-09-27 at 19:03 +0530, Shital A wrote:
> 
> 
> On Tue, 24 Sep 2019, 22:20 Shital A, 
> wrote:
> > Hello,
> > 
> > We have setup active-passive cluster using streaming replication on
> > Rhel7.5. We are testing pacemaker for automated failover.
> > We are seeing below issues with the setup :
> > 
> > 1. When a failover is triggered when data is being added to the
> > primary by killing primary (killall -9 postgres), the standby
> > doesnt come up in sync.
> > On pacemaker, the crm_mon -Afr shows standby in disconnected and
> > HS:alone state.
> > 
> > On postgres, we see below error:
> > 
> > < 2019-09-20 17:07:46.266 IST > LOG:  entering standby mode
> > < 2019-09-20 17:07:46.267 IST > LOG:  database system was not
> > properly shut down; automatic recovery in progress
> > < 2019-09-20 17:07:46.270 IST > LOG:  redo starts at 1/680A2188
> > < 2019-09-20 17:07:46.370 IST > LOG:  consistent recovery state
> > reached at 1/6879D9F8
> > < 2019-09-20 17:07:46.370 IST > LOG:  database system is ready to
> > accept read only connections
> > cp: cannot stat
> > '/var/lib/pgsql/9.6/data/archivedir/000100010068': No
> > such file or directory
> > < 2019-09-20 17:07:46.751 IST > LOG:  statement: select
> > pg_is_in_recovery()
> > < 2019-09-20 17:07:46.782 IST > LOG:  statement: show
> > synchronous_standby_names
> > < 2019-09-20 17:07:50.993 IST > LOG:  statement: select
> > pg_is_in_recovery()
> > < 2019-09-20 17:07:53.395 IST > LOG:  started streaming WAL from
> > primary at 1/6800 on timeline 1
> > < 2019-09-20 17:07:53.436 IST > LOG:  invalid contrecord length
> > 2662 at 1/6879D9F8
> > < 2019-09-20 17:07:53.438 IST > FATAL:  terminating walreceiver
> > process due to administrator command
> > cp: cannot stat
> > '/var/lib/pgsql/9.6/data/archivedir/0002.history': No such file
> > or directory
> > cp: cannot stat
> > '/var/lib/pgsql/9.6/data/archivedir/000100010068': No
> > such file or directory
> > 
> > When we try to restart postgres on the standby, using pg_ctl
> > restart, the standby start syncing.
> > 
> > 
> > 2. After standby syncs using pg_ctl restart as mentioned above, we
> > found out that 1-2 records are missing on the standby.
> > 
> > Need help to check:
> > 1. why the standby starts in disconnect, HS:alone state? 
> > 
> > f you have faced this issue/have knowledge, please let us know.
> > 
> > Thanks.
> 
> 
> Hello,
> 
> I didn't  receive any reply on this issue.wondering whether there are
> no opinions or whether pacemaker with pgsql is not recommended?.
> 
> 
> Thanks! 

Hi,

There are quite a few pacemaker+pgsql users active on this list, but
they may not have time to respond at the moment. Most are using the PAF
agent rather than the pgsql agent (see 
https://github.com/ClusterLabs/PAF ).
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker: pgsql

2019-09-27 Thread Jehan-Guillaume de Rorthais
On Fri, 27 Sep 2019 12:14:09 -0500
Ken Gaillot  wrote:

> On Fri, 2019-09-27 at 19:03 +0530, Shital A wrote:
> > 
> > 
> > On Tue, 24 Sep 2019, 22:20 Shital A, 
> > wrote:  
> > > Hello,
> > > 
> > > We have setup active-passive cluster using streaming replication on
> > > Rhel7.5. We are testing pacemaker for automated failover.
> > > We are seeing below issues with the setup :
> > > 
> > > 1. When a failover is triggered when data is being added to the
> > > primary by killing primary (killall -9 postgres), the standby
> > > doesnt come up in sync.
> > > On pacemaker, the crm_mon -Afr shows standby in disconnected and
> > > HS:alone state.
> > > 
> > > On postgres, we see below error:
> > > 
> > > < 2019-09-20 17:07:46.266 IST > LOG:  entering standby mode
> > > < 2019-09-20 17:07:46.267 IST > LOG:  database system was not
> > > properly shut down; automatic recovery in progress
> > > < 2019-09-20 17:07:46.270 IST > LOG:  redo starts at 1/680A2188
> > > < 2019-09-20 17:07:46.370 IST > LOG:  consistent recovery state
> > > reached at 1/6879D9F8
> > > < 2019-09-20 17:07:46.370 IST > LOG:  database system is ready to
> > > accept read only connections
> > > cp: cannot stat
> > > '/var/lib/pgsql/9.6/data/archivedir/000100010068': No
> > > such file or directory
> > > < 2019-09-20 17:07:46.751 IST > LOG:  statement: select
> > > pg_is_in_recovery()
> > > < 2019-09-20 17:07:46.782 IST > LOG:  statement: show
> > > synchronous_standby_names
> > > < 2019-09-20 17:07:50.993 IST > LOG:  statement: select
> > > pg_is_in_recovery()
> > > < 2019-09-20 17:07:53.395 IST > LOG:  started streaming WAL from
> > > primary at 1/6800 on timeline 1
> > > < 2019-09-20 17:07:53.436 IST > LOG:  invalid contrecord length
> > > 2662 at 1/6879D9F8
> > > < 2019-09-20 17:07:53.438 IST > FATAL:  terminating walreceiver
> > > process due to administrator command
> > > cp: cannot stat
> > > '/var/lib/pgsql/9.6/data/archivedir/0002.history': No such file
> > > or directory
> > > cp: cannot stat
> > > '/var/lib/pgsql/9.6/data/archivedir/000100010068': No
> > > such file or directory
> > > 
> > > When we try to restart postgres on the standby, using pg_ctl
> > > restart, the standby start syncing.
> > > 
> > > 
> > > 2. After standby syncs using pg_ctl restart as mentioned above, we
> > > found out that 1-2 records are missing on the standby.
> > > 
> > > Need help to check:
> > > 1. why the standby starts in disconnect, HS:alone state? 
> > > 
> > > f you have faced this issue/have knowledge, please let us know.
> > > 
> > > Thanks.  
> > 
> > 
> > Hello,
> > 
> > I didn't  receive any reply on this issue.wondering whether there are
> > no opinions or whether pacemaker with pgsql is not recommended?.

I did not read your mail because my experience with the pgsql resource agent
is quite old and I lost interest in it. Now that I focus on the details of your
original mail, something looks strange to me: how could a standby lose records?

In a normal situation, a standby is more or less a clone of the primary, no
matter how you kill it. At worst, the clone is just lagging behind, but it
cannot "lose records".

Are you able to reproduce this behavior outside Pacemaker? Just build your
primary and standby, wait for them to replicate, create some activity, then kill
the primary and restart it. If you do lose records, then provide more info about
your whole procedure so it can be reproduced and investigated.
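
For example, something along these lines (a sketch; the pgbench scale, duration
and database name are arbitrary):

# on the primary: generate continuous write activity
pgbench -i -s 10 testdb
pgbench -c 4 -T 120 testdb &
# crash the primary while writes are in flight
killall -9 postgres
# restart it, let replication catch up, then compare on both nodes
psql -At -c "SELECT count(*) FROM pgbench_history;" testdb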

> There are quite a few pacemaker+pgsql users active on this list, but
> they may not have time to respond at the moment. Most are using the PAF
> agent rather than the pgsql agent (see 
> https://github.com/ClusterLabs/PAF ).

This might not be a bug in the pgsql RA; my first guess would be the procedure.
But indeed, the OP might want to have a look at the PAF resource
agent.

Regards,
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker Shutdown

2020-07-22 Thread Andrei Borzenkov
On Wed, Jul 22, 2020 at 9:42 AM Harvey Shepherd <
harvey.sheph...@aviatnet.com> wrote:

> Hi All,
>
> I'm running Pacemaker 2.0.3 on a two-node cluster, controlling 40+
> resources which are a mixture of clones and other resources that are
> colocated with the master instance of certain clones. I've noticed that if
> I terminate pacemaker on the node that is hosting the master instances of
> the clones, Pacemaker focuses on stopping resources on that node BEFORE
> failing over to the other node, leading to a longer outage than necessary.
> Is there a way to change this behaviour?
>
>
Educated guess - you want interleave=true on clone resources.
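
For example (a sketch; "my-clone" is a placeholder for one of your clone
resources, and the command assumes pcs is available):

pcs resource meta my-clone interleave=true
# or, without pcs:
crm_resource --resource my-clone --meta --set-parameter interleave --parameter-value true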



> These are the actions and logs I saw during the test:
>
> # /etc/init.d/pacemaker stop
> Signaling Pacemaker Cluster Manager to terminate
>
> Waiting for cluster services to
> unload..sending
> signal 9 to procs
>
>
> 2020 Jul 22 06:16:50.581 Chassis2 daemon.notice CTR8740 pacemaker.
> Signaling Pacemaker Cluster Manager to terminate
> 2020 Jul 22 06:16:50.599 Chassis2 daemon.notice CTR8740 pacemaker. Waiting
> for cluster services to unload
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: new_event_notification (6140-6141-9): Broken
> pipe (32)
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: Notification of client
> stonithd/665bde82-cb28-40f7-9132-8321dc2f1992 failed
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: new_event_notification (6140-6143-8): Broken
> pipe (32)
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: Notification of client
> attrd/a26ca273-3422-4ebe-8cb7-95849b8ff130 failed
> 2020 Jul 22 06:18:03.320 Chassis1 daemon.warning CTR8740
> pacemaker-schedulerd.6240  warning: Blind faith: not fencing unseen nodes
> 2020 Jul 22 06:18:58.941 Chassis2 user.crit CTR8740 supervisor. pacemaker
> is inactive (3).
>
> Regards,
> Harvey
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Хиль Эдуард

Hi there! I have 2 nodes with Pacemaker 2.0.3 and Corosync 3.0.3 on Ubuntu 20,
plus 1 qdevice. I want to define a new resource for the systemd unit dummy.service:
 
[Unit]
Description=Dummy
[Service]
Restart=on-failure
StartLimitInterval=20
StartLimitBurst=5
TimeoutStartSec=0
RestartSec=5
Environment="HOME=/root"
SyslogIdentifier=dummy
ExecStart=/usr/local/sbin/dummy.sh
[Install]
WantedBy=multi-user.target
 
and /usr/local/sbin/dummy.sh :
 
#!/bin/bash
CNT=0
while true; do
  let CNT++
  echo "hello world $CNT"
  sleep 5
done
 
and then I try to define it with: pcs resource create dummy.service 
systemd:dummy op monitor interval="10s" timeout="15s"
After about 2 seconds node2 reboots. In the logs I see that Pacemaker tried to 
start the unit within those 2 seconds, and it did start, but Pacemaker somehow 
considers it «Timed Out». What am I doing wrong? Logs below.
 
 
Jul 21 15:53:41 node2.local pacemaker-controld[1813]:  notice: Result of probe 
operation for dummy.service on node2.local: 7 (not running) 
Jul 21 15:53:41 node2.local systemd[1]: Reloading.
Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/dbus.socket:5: 
ListenStream= references a path below legacy directory /var/run/, updating 
/var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please update 
the unit file accordingly.
Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/docker.socket:6: 
ListenStream= references a path below legacy directory /var/run/, updating 
/var/run/docker.sock → /run/docker.sock; please update the unit file 
accordingly.
Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up on 
dummy.service start (rc=0): timeout (elapsed=259719ms, remaining=-159719ms)
Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result of start 
operation for dummy.service on node2.local: Timed Out 
Jul 21 15:53:42 node2.local systemd[1]: Started Cluster Controlled dummy.
Jul 21 15:53:42 node2.local dummy[9330]: hello world 1
Jul 21 15:53:42 node2.local systemd-udevd[922]: Network interface NamePolicy= 
disabled on kernel command line, ignoring.
Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting 
fail-count-dummy.service#start_0[node2.local]: (unset) -> INFINITY 
Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting 
last-failure-dummy.service#start_0[node2.local]: (unset) -> 1595336022 
Jul 21 15:53:42 node2.local systemd[1]: Reloading.
Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/dbus.socket:5: 
ListenStream= references a path below legacy directory /var/run/, updating 
/var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please update 
the unit file accordingly.
Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/docker.socket:6: 
ListenStream= references a path below legacy directory /var/run/, updating 
/var/run/docker.sock → /run/docker.sock; please update the unit file 
accordingly.
Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up on 
dummy.service stop (rc=0): timeout (elapsed=317181ms, remaining=-217181ms)
Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result of stop 
operation for dummy.service on node2.local: Timed Out 
Jul 21 15:53:42 node2.local systemd[1]: Stopping Daemon for dummy...
Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting 
fail-count-dummy.service#stop_0[node2.local]: (unset) -> INFINITY 
Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting 
last-failure-dummy.service#stop_0[node2.local]: (unset) -> 1595336022 
Jul 21 15:53:42 node2.local systemd[1]: dummy.service: Succeeded.
Jul 21 15:53:42 node2.local systemd[1]: Stopped Daemon for dummy.
... lost connection (node rebooting)
 
 ___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker Shutdown

2020-07-22 Thread Reid Wahl
On Tue, Jul 21, 2020 at 11:42 PM Harvey Shepherd <
harvey.sheph...@aviatnet.com> wrote:

> Hi All,
>
> I'm running Pacemaker 2.0.3 on a two-node cluster, controlling 40+
> resources which are a mixture of clones and other resources that are
> colocated with the master instance of certain clones. I've noticed that if
> I terminate pacemaker on the node that is hosting the master instances of
> the clones, Pacemaker focuses on stopping resources on that node BEFORE
> failing over to the other node, leading to a longer outage than necessary.
> Is there a way to change this behaviour?
>

Hi, Harvey.

As you likely know, a given active/passive resource will have to
stop on one node before it can start on another node, and the same goes for
a promoted clone instance having to demote on one node before it can
promote on another. There are exceptions for clone instances and for
promotable clones with promoted-max > 1 ("allow more than one master
instance"). A resource that's configured to run on one node at a time
should not try to run on two nodes during failover.

With that in mind, what exactly are you wanting to happen? Is the problem
that all resources are stopping on node 1 before *any* of them start on
node 2? Or that you want Pacemaker shutdown to kill the processes on node 1
instead of cleanly shutting them down? Or something different?

These are the actions and logs I saw during the test:
>

Ack. This seems like it's just telling us that Pacemaker is going through a
graceful shutdown. The info more relevant to the resource stop/start order
would be in /var/log/pacemaker/pacemaker.log (or less detailed in
/var/log/messages) on the DC.

# /etc/init.d/pacemaker stop
> Signaling Pacemaker Cluster Manager to terminate
>
> Waiting for cluster services to
> unload..sending
> signal 9 to procs
>
>
> 2020 Jul 22 06:16:50.581 Chassis2 daemon.notice CTR8740 pacemaker.
> Signaling Pacemaker Cluster Manager to terminate
> 2020 Jul 22 06:16:50.599 Chassis2 daemon.notice CTR8740 pacemaker. Waiting
> for cluster services to unload
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: new_event_notification (6140-6141-9): Broken
> pipe (32)
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: Notification of client
> stonithd/665bde82-cb28-40f7-9132-8321dc2f1992 failed
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: new_event_notification (6140-6143-8): Broken
> pipe (32)
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: Notification of client
> attrd/a26ca273-3422-4ebe-8cb7-95849b8ff130 failed
> 2020 Jul 22 06:18:03.320 Chassis1 daemon.warning CTR8740
> pacemaker-schedulerd.6240  warning: Blind faith: not fencing unseen nodes
> 2020 Jul 22 06:18:58.941 Chassis2 user.crit CTR8740 supervisor. pacemaker
> is inactive (3).
>
> Regards,
> Harvey
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker Shutdown

2020-07-22 Thread Harvey Shepherd
Thanks for your response Reid. What you say makes sense, and under normal 
circumstances if a resource failed, I'd want all of its dependents to be 
stopped cleanly before restarting the failed resource. However if pacemaker is 
shutting down on a node (e.g. due to a restart request), then I just want to 
failover as fast as possible, so an unclean kill is fine. At the moment the 
shutdown process is taking 2 mins. I was just wondering if there was a way to 
do this.

Regards,
Harvey


From: Users  on behalf of Reid Wahl 

Sent: 23 July 2020 08:05
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: EXTERNAL: Re: [ClusterLabs] Pacemaker Shutdown


On Tue, Jul 21, 2020 at 11:42 PM Harvey Shepherd 
mailto:harvey.sheph...@aviatnet.com>> wrote:
Hi All,

I'm running Pacemaker 2.0.3 on a two-node cluster, controlling 40+ resources 
which are a mixture of clones and other resources that are colocated with the 
master instance of certain clones. I've noticed that if I terminate pacemaker 
on the node that is hosting the master instances of the clones, Pacemaker 
focuses on stopping resources on that node BEFORE failing over to the other 
node, leading to a longer outage than necessary. Is there a way to change this 
behaviour?

Hi, Harvey.

As you likely know, a given resource active/passive resource will have to stop 
on one node before it can start on another node, and the same goes for a 
promoted clone instance having to demote on one node before it can promote on 
another. There are exceptions for clone instances and for promotable clones 
with promoted-max > 1 ("allow more than one master instance"). A resource 
that's configured to run on one node at a time should not try to run on two 
nodes during failover.

With that in mind, what exactly are you wanting to happen? Is the problem that 
all resources are stopping on node 1 before any of them start on node 2? Or 
that you want Pacemaker shutdown to kill the processes on node 1 instead of 
cleanly shutting them down? Or something different?

These are the actions and logs I saw during the test:

Ack. This seems like it's just telling us that Pacemaker is going through a 
graceful shutdown. The info more relevant to the resource stop/start order 
would be in /var/log/pacemaker/pacemaker.log (or less detailed in 
/var/log/messages) on the DC.

# /etc/init.d/pacemaker stop
Signaling Pacemaker Cluster Manager to terminate

Waiting for cluster services to 
unload..sending 
signal 9 to procs


2020 Jul 22 06:16:50.581 Chassis2 daemon.notice CTR8740 pacemaker. Signaling 
Pacemaker Cluster Manager to terminate
2020 Jul 22 06:16:50.599 Chassis2 daemon.notice CTR8740 pacemaker. Waiting for 
cluster services to unload
2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740 pacemaker-based.6140  
warning: new_event_notification (6140-6141-9): Broken pipe (32)
2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740 pacemaker-based.6140  
warning: Notification of client stonithd/665bde82-cb28-40f7-9132-8321dc2f1992 
failed
2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740 pacemaker-based.6140  
warning: new_event_notification (6140-6143-8): Broken pipe (32)
2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740 pacemaker-based.6140  
warning: Notification of client attrd/a26ca273-3422-4ebe-8cb7-95849b8ff130 
failed
2020 Jul 22 06:18:03.320 Chassis1 daemon.warning CTR8740 
pacemaker-schedulerd.6240  warning: Blind faith: not fencing unseen nodes
2020 Jul 22 06:18:58.941 Chassis2 user.crit CTR8740 supervisor. pacemaker is 
inactive (3).

Regards,
Harvey
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


--
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker Shutdown

2020-07-22 Thread Reid Wahl
Thanks for the clarification. As far as I'm aware, there's no way to do
this at the Pacemaker level during a Pacemaker shutdown. It would require
uncleanly killing all resources, which doesn't make sense at the Pacemaker
level.

Pacemaker only knows how to stop a resource by running the resource agent's
stop operation. Even if Pacemaker wanted to kill a resource uncleanly for
speed, the way to do so for each resource would depend on the type of
resource. For example, an IPaddr2 resource doesn't represent a running
process that can be killed; `ip addr del` would be necessary.

If we went the route of killing the Pacemaker daemon entirely, rather than
relying on it to stop resources, then that wouldn't guarantee the node has
stopped using the actual resources before the failover node tries to take
over. For example, for a Filesystem, the FS could still be mounted after
Pacemaker is killed.

The only ways to know with certainty that node 1 has stopped using cluster
resources so that node 2 can safely take them over are:

   1. gracefully stop them, or
   2. fence/reboot node 1

With that being said, if you don't mind node 1 being fenced to initiate a
faster failover, then you could fence it from node 2.
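
For example, initiating that manually from node 2 would look something like
this (a sketch; the node name is a placeholder):

pcs stonith fence node1
# or, without pcs:
stonith_admin --reboot node1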

Others on the list may think of something I haven't considered here.

On Wed, Jul 22, 2020 at 2:43 PM Harvey Shepherd <
harvey.sheph...@aviatnet.com> wrote:

> Thanks for your response Reid. What you say makes sense, and under normal
> circumstances if a resource failed, I'd want all of its dependents to be
> stopped cleanly before restarting the failed resource. However if pacemaker
> is shutting down on a node (e.g. due to a restart request), then I just
> want to failover as fast as possible, so an unclean kill is fine. At the
> moment the shutdown process is taking 2 mins. I was just wondering if there
> was a way to do this.
>
> Regards,
> Harvey
>
> --
> *From:* Users  on behalf of Reid Wahl <
> nw...@redhat.com>
> *Sent:* 23 July 2020 08:05
> *To:* Cluster Labs - All topics related to open-source clustering
> welcomed 
> *Subject:* EXTERNAL: Re: [ClusterLabs] Pacemaker Shutdown
>
>
> On Tue, Jul 21, 2020 at 11:42 PM Harvey Shepherd <
> harvey.sheph...@aviatnet.com> wrote:
>
> Hi All,
>
> I'm running Pacemaker 2.0.3 on a two-node cluster, controlling 40+
> resources which are a mixture of clones and other resources that are
> colocated with the master instance of certain clones. I've noticed that if
> I terminate pacemaker on the node that is hosting the master instances of
> the clones, Pacemaker focuses on stopping resources on that node BEFORE
> failing over to the other node, leading to a longer outage than necessary.
> Is there a way to change this behaviour?
>
>
> Hi, Harvey.
>
> As you likely know, a given resource active/passive resource will have to
> stop on one node before it can start on another node, and the same goes for
> a promoted clone instance having to demote on one node before it can
> promote on another. There are exceptions for clone instances and for
> promotable clones with promoted-max > 1 ("allow more than one master
> instance"). A resource that's configured to run on one node at a time
> should not try to run on two nodes during failover.
>
> With that in mind, what exactly are you wanting to happen? Is the problem
> that all resources are stopping on node 1 before *any* of them start on
> node 2? Or that you want Pacemaker shutdown to kill the processes on node 1
> instead of cleanly shutting them down? Or something different?
>
> These are the actions and logs I saw during the test:
>
>
> Ack. This seems like it's just telling us that Pacemaker is going through
> a graceful shutdown. The info more relevant to the resource stop/start
> order would be in /var/log/pacemaker/pacemaker.log (or less detailed in
> /var/log/messages) on the DC.
>
> # /etc/init.d/pacemaker stop
> Signaling Pacemaker Cluster Manager to terminate
>
> Waiting for cluster services to
> unload..sending
> signal 9 to procs
>
>
> 2020 Jul 22 06:16:50.581 Chassis2 daemon.notice CTR8740 pacemaker.
> Signaling Pacemaker Cluster Manager to terminate
> 2020 Jul 22 06:16:50.599 Chassis2 daemon.notice CTR8740 pacemaker. Waiting
> for cluster services to unload
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: new_event_notification (6140-6141-9): Broken
> pipe (32)
> 2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740
> pacemaker-based.6140  warning: Notification of client
> stonithd/665bde82-cb28-40f7-9132-8321dc2f1992 failed
> 2020 Jul 22 06:18:01.794 

Re: [ClusterLabs] Pacemaker Shutdown

2020-07-22 Thread Harvey Shepherd
Fencing could work. Thanks again Reid.


From: Users  on behalf of Reid Wahl 

Sent: 23 July 2020 10:10
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: EXTERNAL: Re: [ClusterLabs] Pacemaker Shutdown

Thanks for the clarification. As far as I'm aware, there's no way to do this at 
the Pacemaker level during a Pacemaker shutdown. It would require uncleanly 
killing all resources, which doesn't make sense at the Pacemaker level.

Pacemaker only knows how to stop a resource by running the resource agent's 
stop operation. Even if Pacemaker wanted to kill a resource uncleanly for 
speed, the way to do so for each resource would depend on the type of resource. 
For example, an IPaddr2 resource doesn't represent a running process that can 
be killed; `ip addr del` would be necessary.

If we went the route of killing the Pacemaker daemon entirely, rather than 
relying on it to stop resources, then that wouldn't guarantee the node has 
stopped using the actual resources before the failover node tries to take over. 
For example, for a Filesystem, the FS could still be mounted after Pacemaker is 
killed.

The only ways to know with certainty that node 1 has stopped using cluster 
resources so that node 2 can safely take them over are:

  1.  gracefully stop them, or
  2.  fence/reboot node 1

With that being said, if you don't mind node 1 being fenced to initiate a 
faster failover, then you could fence it from node 2.

Others on the list may think of something I haven't considered here.

On Wed, Jul 22, 2020 at 2:43 PM Harvey Shepherd 
mailto:harvey.sheph...@aviatnet.com>> wrote:
Thanks for your response Reid. What you say makes sense, and under normal 
circumstances if a resource failed, I'd want all of its dependents to be 
stopped cleanly before restarting the failed resource. However if pacemaker is 
shutting down on a node (e.g. due to a restart request), then I just want to 
failover as fast as possible, so an unclean kill is fine. At the moment the 
shutdown process is taking 2 mins. I was just wondering if there was a way to 
do this.

Regards,
Harvey


From: Users 
mailto:users-boun...@clusterlabs.org>> on behalf 
of Reid Wahl mailto:nw...@redhat.com>>
Sent: 23 July 2020 08:05
To: Cluster Labs - All topics related to open-source clustering welcomed 
mailto:users@clusterlabs.org>>
Subject: EXTERNAL: Re: [ClusterLabs] Pacemaker Shutdown


On Tue, Jul 21, 2020 at 11:42 PM Harvey Shepherd 
mailto:harvey.sheph...@aviatnet.com>> wrote:
Hi All,

I'm running Pacemaker 2.0.3 on a two-node cluster, controlling 40+ resources 
which are a mixture of clones and other resources that are colocated with the 
master instance of certain clones. I've noticed that if I terminate pacemaker 
on the node that is hosting the master instances of the clones, Pacemaker 
focuses on stopping resources on that node BEFORE failing over to the other 
node, leading to a longer outage than necessary. Is there a way to change this 
behaviour?

Hi, Harvey.

As you likely know, a given resource active/passive resource will have to stop 
on one node before it can start on another node, and the same goes for a 
promoted clone instance having to demote on one node before it can promote on 
another. There are exceptions for clone instances and for promotable clones 
with promoted-max > 1 ("allow more than one master instance"). A resource 
that's configured to run on one node at a time should not try to run on two 
nodes during failover.

With that in mind, what exactly are you wanting to happen? Is the problem that 
all resources are stopping on node 1 before any of them start on node 2? Or 
that you want Pacemaker shutdown to kill the processes on node 1 instead of 
cleanly shutting them down? Or something different?

These are the actions and logs I saw during the test:

Ack. This seems like it's just telling us that Pacemaker is going through a 
graceful shutdown. The info more relevant to the resource stop/start order 
would be in /var/log/pacemaker/pacemaker.log (or less detailed in 
/var/log/messages) on the DC.

# /etc/init.d/pacemaker stop
Signaling Pacemaker Cluster Manager to terminate

Waiting for cluster services to 
unload..sending 
signal 9 to procs


2020 Jul 22 06:16:50.581 Chassis2 daemon.notice CTR8740 pacemaker. Signaling 
Pacemaker Cluster Manager to terminate
2020 Jul 22 06:16:50.599 Chassis2 daemon.notice CTR8740 pacemaker. Waiting for 
cluster services to unload
2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740 pacemaker-based.6140  
warning: new_event_notification (6140-6141-9): Broken pipe (32)
2020 Jul 22 06:18:01.794 Chassis2 daemon.warning CTR8740 pacemaker-based.6140  
warning: Notification of client stonithd/665bde82-cb28-40f7-9132-

[ClusterLabs] pacemaker startup problem

2020-07-24 Thread Gabriele Bulfon
Hello,
 
after a long time I'm back to running heartbeat/pacemaker/corosync on our 
XStreamOS/illumos distro.
I rebuilt the original components I built in 2016 on our latest release (probably 
a bit outdated, but I want to start from where I left off).
It looks like Pacemaker is having trouble starting up, showing these logs:
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Jul 24 18:21:32 [971] crmd: info: crm_log_init: Changed active directory to 
/sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [971] crmd: info: main: CRM Git Version: 1.1.15 (e174ec8)
Jul 24 18:21:32 [971] crmd: info: do_log: Input I_STARTUP received in state 
S_STARTING from crmd_init
Jul 24 18:21:32 [969] lrmd: info: crm_log_init: Changed active directory to 
/sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [968] stonith-ng: info: crm_log_init: Changed active directory 
to /sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Verifying cluster 
type: 'heartbeat'
Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Assuming an active 
'heartbeat' cluster
Jul 24 18:21:32 [968] stonith-ng: notice: crm_cluster_connect: Connecting to 
cluster infrastructure: heartbeat
Jul 24 18:21:32 [969] lrmd: error: mainloop_add_ipc_server: Could not start 
lrmd IPC server: Operation not supported (-48)
Jul 24 18:21:32 [969] lrmd: error: main: Failed to create IPC server: shutting 
down and inhibiting respawn
Jul 24 18:21:32 [969] lrmd: info: crm_xml_cleanup: Cleaning up memory from 
libxml2
Jul 24 18:21:32 [971] crmd: info: get_cluster_type: Verifying cluster type: 
'heartbeat'
Jul 24 18:21:32 [971] crmd: info: get_cluster_type: Assuming an active 
'heartbeat' cluster
Jul 24 18:21:32 [971] crmd: info: start_subsystem: Starting sub-system "pengine"
Jul 24 18:21:32 [968] stonith-ng: info: crm_get_peer: Created entry 
25bc5492-a49e-40d7-ae60-fd8f975a294a/80886f0 for node xstorage1/0 (1 total)
Jul 24 18:21:32 [968] stonith-ng: info: crm_get_peer: Node 0 has uuid 
d426a730-5229-6758-853a-99d4d491514a
Jul 24 18:21:32 [968] stonith-ng: info: register_heartbeat_conn: Hostname: 
xstorage1
Jul 24 18:21:32 [968] stonith-ng: info: register_heartbeat_conn: UUID: 
d426a730-5229-6758-853a-99d4d491514a
Jul 24 18:21:32 [970] attrd: notice: crm_cluster_connect: Connecting to cluster 
infrastructure: heartbeat
Jul 24 18:21:32 [970] attrd: error: mainloop_add_ipc_server: Could not start 
attrd IPC server: Operation not supported (-48)
Jul 24 18:21:32 [970] attrd: error: attrd_ipc_server_init: Failed to create 
attrd servers: exiting and inhibiting respawn.
Jul 24 18:21:32 [970] attrd: warning: attrd_ipc_server_init: Verify pacemaker 
and pacemaker_remote are not both enabled.
Jul 24 18:21:32 [972] pengine: info: crm_log_init: Changed active directory to 
/sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [972] pengine: error: mainloop_add_ipc_server: Could not start 
pengine IPC server: Operation not supported (-48)
Jul 24 18:21:32 [972] pengine: error: main: Failed to create IPC server: 
shutting down and inhibiting respawn
Jul 24 18:21:32 [972] pengine: info: crm_xml_cleanup: Cleaning up memory from 
libxml2
Jul 24 18:21:33 [971] crmd: info: do_cib_control: Could not connect to the CIB 
service: Transport endpoint is not connected
Jul 24 18:21:33 [971] crmd: warning: do_cib_control: Couldn't complete CIB 
registration 1 times... pause and retry
Jul 24 18:21:33 [971] crmd: error: crmd_child_exit: Child process pengine 
exited (pid=972, rc=100)
Jul 24 18:21:35 [971] crmd: info: crm_timer_popped: Wait Timer (I_NULL) just 
popped (2000ms)
Jul 24 18:21:36 [971] crmd: info: do_cib_control: Could not connect to the CIB 
service: Transport endpoint is not connected
Jul 24 18:21:36 [971] crmd: warning: do_cib_control: Couldn't complete CIB 
registration 2 times... pause and retry
Jul 24 18:21:38 [971] crmd: info: crm_timer_popped: Wait Timer (I_NULL) just 
popped (2000ms)
Jul 24 18:21:39 [971] crmd: info: do_cib_control: Could not connect to the CIB 
service: Transport endpoint is not connected
Jul 24 18:21:39 [971] crmd: warning: do_cib_control: Couldn't complete CIB 
registration 3 times... pause and retry
Jul 24 18:21:41 [971] crmd: info: crm_timer_popped: Wait Timer (I_NULL) just 
popped (2000ms)
Jul 24 18:21:42 [971] crmd: info: do_cib_control: Could not connect to the CIB 
service: Transport endpoint is not connected
Jul 24 18:21:42 [971] crmd: warning: do_cib_control: Couldn't complete CIB 
registration 4 times... pause and retry
Jul 24 18:21:42 [968] stonith-ng: error: setup_cib: Could not connect to the 
CIB service: Transport endpoint is not connected (-134)
Jul 24 18:21:42 [968] stonith-ng: error: mainloop_add_ipc_server: Could not 
start stonith-ng IPC server: Operation not supported (-48)
Jul 24 18:21:42 [968] stonith-ng: error: stonith_ipc_server_init: Failed to 
create stonith-ng servers: exiting and inhibiting respawn.
Jul 2

[ClusterLabs] pacemaker alerts node_selector

2020-11-25 Thread vockinger
Hi, I would like to trigger an external script if something happens on a
specific node.

In the documentation of alerts, I can see  but whatever I
put into the XML, it's not working...

 


Can anybody send me an example about the right syntax ?
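For reference, the alert syntax documented in Pacemaker Explained looks roughly
like the sketch below (the ids, script path, and recipient value are
placeholders). Note that the <select> block only narrows the kind of events
delivered (node events, fencing events, resource events, attribute changes);
reacting to one specific node is usually done inside the script itself, for
example by checking the CRM_alert_node environment variable:

  <configuration>
    <alerts>
      <alert id="my-alert" path="/usr/local/bin/my_alert.sh">
        <select>
          <select_nodes/>
          <select_fencing/>
        </select>
        <recipient id="my-alert-recipient" value="/var/log/my_alerts.log"/>
      </alert>
    </alerts>
  </configuration>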

 

Thank you very much..

 

Best regards, Alfred

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Pacemaker Cluster help

2021-05-27 Thread Nathan Mazarelo
Is there a way to have pacemaker resource groups failover if all floating IP 
resources are unavailable?

I want to have multiple floating IPs in a resource group that will only 
failover if all IPs cannot work. Each floating IP is on a different subnet and 
can be used by the application I have. If a floating IP is unavailable it will 
use the next available floating IP.
Resource Group: floating_IP
    floating-IP
    floating-IP2
    floating-IP3
For example, right now if a floating-IP resource fails the whole resource group 
will failover to a different node. What I want is to have pacemaker failover 
the resource group only if all three resources are unavailable. Is this 
possible?

Thanks, Nathan
 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Pacemaker and python3

2015-06-24 Thread Robert Kuska
Hello everyone,

I am Robert Kuska from the Fedora Project. I am a Python co-maintainer
and co-owner of the "Python 3 as Default" change, which aims to provide
python3-only packages by default across different Fedora platform
releases [0].

The reason I am contacting you is that pacemaker is part of our default
installation on the Fedora Server release, and to keep it there we need
pacemaker to support python3. Do you plan to invest any time in python3
support in the near future (meaning months; specifically, we need it
before the 1st of September)?

Thank you for your time.


[0]https://fedoraproject.org/wiki/Changes/Python_3_as_Default

--
Robert Kuska
{rkuska}


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker failover failure

2015-07-01 Thread alex austin
Hi all,

I have configured a virtual IP and redis in master-slave with corosync and
pacemaker. If redis fails, the failover is successful and redis gets
promoted on the other node. However, if pacemaker itself fails on the active
node, the failover is not performed. Is there anything I missed in the
configuration?

Here's my configuration (i have hashed the ip address out):

node host1.com

node host2.com

primitive ClusterIP IPaddr2 \

params ip=xxx.xxx.xxx.xxx cidr_netmask=23 \

op monitor interval=1s timeout=20s \

op start interval=0 timeout=20s \

op stop interval=0 timeout=20s \

meta is-managed=true target-role=Started resource-stickiness=500

primitive redis redis \

meta target-role=Master is-managed=true \

op monitor interval=1s role=Master timeout=5s on-fail=restart

ms redis_clone redis \

meta notify=true is-managed=true ordered=false interleave=false
globally-unique=false target-role=Master migration-threshold=1

colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master

colocation ip-on-redis inf: ClusterIP redis_clone:Master

property cib-bootstrap-options: \

dc-version=1.1.11-97629de \

cluster-infrastructure="classic openais (with plugin)" \

expected-quorum-votes=2 \

stonith-enabled=false

property redis_replication: \

redis_REPL_INFO=host.com


thank you in advance


Kind regards,


Alex
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker build error

2015-11-03 Thread Jim Van Oosten


I am getting a compile error when building Pacemaker on Linux version
2.6.32-431.el6.x86_64.

The build commands:

git clone git://github.com/ClusterLabs/pacemaker.git
cd pacemaker
./autogen.sh && ./configure --prefix=/usr --sysconfdir=/etc
make
make install

The compile error:

Making install in services
gmake[2]: Entering directory
`/tmp/software/HA_linux/pacemaker/lib/services'
  CC   libcrmservice_la-services.lo
services.c: In function 'resources_action_create':
services.c:153: error: 'svc_action_private_t' has no member named 'pending'
services.c: In function 'services_action_create_generic':
services.c:340: error: 'svc_action_private_t' has no member named 'pending'
gmake[2]: *** [libcrmservice_la-services.lo] Error 1
gmake[2]: Leaving directory `/tmp/software/HA_linux/pacemaker/lib/services'
gmake[1]: *** [install-recursive] Error 1
gmake[1]: Leaving directory `/tmp/software/HA_linux/pacemaker/lib'
make: *** [install-recursive] Error 1


The pending field that services.c is attempting to set is conditioned on
the SUPPORT_DBUS flag in services_private.h.

pacemaker/lib/services/services_private.h

   
#if SUPPORT_DBUS
    DBusPendingCall *pending;
    unsigned timerid;
#endif

Am I building Pacemaker incorrectly, or should I open a defect for this
problem?
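Reading the compile errors together with the header excerpt above, the pending
member only exists when SUPPORT_DBUS is defined, so the references in services.c
presumably need the same guard. A rough sketch of the idea (the exact expression
op->opaque->pending is an assumption based on the error messages, not the actual
upstream fix):

  #if SUPPORT_DBUS
      /* 'pending' is only part of svc_action_private_t when DBus support is built in */
      op->opaque->pending = NULL;
  #endif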

Jim VanOosten
jimvo at  us.ibm.com
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker license

2016-01-11 Thread Andrew Beekhof

> On 6 Oct 2015, at 9:39 AM, santosh_bidara...@dell.com wrote:
> 
> Dell - Internal Use - Confidential 
> 
> Hello Pacemaker Admins,
>  
> We have a query regarding licensing for pacemaker header files
>  
> As per the given link http://clusterlabs.org/wiki/License, it is mentioned 
> that “Pacemaker programs are licensed under the GPLv2+ (version 2 or later of 
> the GPL) and its headers and libraries are under the less restrictive LGPLv2+ 
> (version 2 or later of the LGPL) .”
>  
> However, website link 
> http://clusterlabs.org/doxygen/pacemaker/2927a0f9f25610c331b6a137c846fec27032c9ea/cib_8h.html,
>  states otherwise. 
> Cib.h header file needed to be included in order to configure pacemaker using 
> C API. But the header file for cib.h states that the header file is under GPL 
> license
> This seems to be conflicting the statement regarding header file license.
>  
> In addition, which similar issue has been discussed in the past 
> http://www.gossamer-threads.com/lists/linuxha/pacemaker/75967, no additional 
> details on the resolution.

I thought that was a pretty clear statement, but you’re correct that the 
licences were not changed.

Does this satisfy?

   https://github.com/beekhof/pacemaker/commit/6de9fde

>  
> Need your inputs on licensing to proceed further.
>  
> Thanks & Regards
> Santosh Bidaralli
>  
>  
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker 1.1.14 released

2016-01-14 Thread Ken Gaillot
ClusterLabs is proud to announce the latest release of the Pacemaker
cluster resource manager, version 1.1.14. The source code is available at:

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.14

This version introduces some valuable new features:

* Resources will now start as soon as their state has been confirmed on
all nodes and all dependencies have been satisfied, rather than waiting
for the state of all resources to be confirmed. This allows for faster
startup of some services, and more even startup load.

* Fencing topology levels can now be applied to all nodes whose name
matches a configurable pattern, or that have a configurable node attribute.

* When a fencing topology level has multiple devices, reboots are now
automatically mapped to all-off-then-all-on, allowing much simplified
configuration of redundant power supplies.

* Guest nodes can now be included in groups, which simplifies the common
Pacemaker Remote use case of grouping a storage device, filesystem and VM.

* Clone resources have a new clone-min metadata option, specifying that
a certain number of instances must be running before any dependent
resources can run. This is particularly useful for services behind a
virtual IP and haproxy, as is often done with OpenStack.

As usual, the release includes many bugfixes and minor enhancements. For
a more detailed list of changes, see the change log:

https://github.com/ClusterLabs/pacemaker/blob/1.1/ChangeLog

Feedback is invited and welcome.
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker license

2016-01-19 Thread Jan Pokorný
On 12/01/16 11:27 +1100, Andrew Beekhof wrote:
>> On 6 Oct 2015, at 9:39 AM, santosh_bidara...@dell.com wrote:
>> As per the given link http://clusterlabs.org/wiki/License, it is
>> mentioned that “Pacemaker programs are licensed under the GPLv2+
>> (version 2 or later of the GPL) and its headers and libraries are
>> under the less restrictive LGPLv2+ (version 2 or later of the LGPL)
>> .”
>>  
>> However, website link
>> http://clusterlabs.org/doxygen/pacemaker/2927a0f9f25610c331b6a137c846fec27032c9ea/cib_8h.html,
>> states otherwise.  Cib.h header file needed to be included in order
>> to configure pacemaker using C API. But the header file for cib.h
>> states that the header file is under GPL license This seems to be
>> conflicting the statement regarding header file license.
>>  
>> In addition, which similar issue has been discussed in the past
>> http://www.gossamer-threads.com/lists/linuxha/pacemaker/75967, no
>> additional details on the resolution.
> 
> I thought that was a pretty clear statement, but you’re correct that the 
> licences were not changed.
> 
> Does this satisfy?
> 
>https://github.com/beekhof/pacemaker/commit/6de9fde

Just a reminder we should review the licenses as declared at
particular subpackages within the authoritative specfile.

For instance, pacemaker-cli should be GPLv2+ only, AFAICT.

And the associated license texts to be distributed along should
reflect the reality, too:
https://github.com/jnpkrn/pacemaker/commit/e5210be68779529407316c5c602e4edc2c2d75c0

-- 
Jan (Poki)


pgpgPFMkVvhLk.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker license

2016-01-24 Thread Andrew Beekhof

> On 20 Jan 2016, at 4:06 AM, Jan Pokorný  wrote:
> 
> On 12/01/16 11:27 +1100, Andrew Beekhof wrote:
>>> On 6 Oct 2015, at 9:39 AM, santosh_bidara...@dell.com wrote:
>>> As per the given link http://clusterlabs.org/wiki/License, it is
>>> mentioned that “Pacemaker programs are licensed under the GPLv2+
>>> (version 2 or later of the GPL) and its headers and libraries are
>>> under the less restrictive LGPLv2+ (version 2 or later of the LGPL)
>>> .”
>>> 
>>> However, website link
>>> http://clusterlabs.org/doxygen/pacemaker/2927a0f9f25610c331b6a137c846fec27032c9ea/cib_8h.html,
>>> states otherwise.  Cib.h header file needed to be included in order
>>> to configure pacemaker using C API. But the header file for cib.h
>>> states that the header file is under GPL license This seems to be
>>> conflicting the statement regarding header file license.
>>> 
>>> In addition, which similar issue has been discussed in the past
>>> http://www.gossamer-threads.com/lists/linuxha/pacemaker/75967, no
>>> additional details on the resolution.
>> 
>> I thought that was a pretty clear statement, but you’re correct that the 
>> licences were not changed.
>> 
>> Does this satisfy?
>> 
>>   https://github.com/beekhof/pacemaker/commit/6de9fde
> 
> Just a reminder we should review the licenses as declared at
> particular subpackages within the authoritative specfile.
> 
> For instance, pacemaker-cli should be GPLv2+ only, AFAICT.
> 
> And the associated license texts to be distributed along should
> reflect the reality, too:
> https://github.com/jnpkrn/pacemaker/commit/e5210be68779529407316c5c602e4edc2c2d75c0

agreed, can you merge that please?

> 
> -- 
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker startup-fencing

2016-03-19 Thread Ferenc Wágner
Hi,

Pacemaker explained says about this cluster option:

Advanced Use Only: Should the cluster shoot unseen nodes? Not using
the default is very unsafe!

1. What are those "unseen" nodes?

And a possibly related question:

2. If I've got UNCLEAN (offline) nodes, is there a way to clean them up,
   so that they don't get fenced when I switch them on?  I mean without
   removing the node altogether, to keep its capacity settings for
   example.

And some more about fencing:

3. What's the difference in cluster behavior between
   - stonith-enabled=FALSE (9.3.2: how often will the stop operation be 
retried?)
   - having no configured STONITH devices (resources won't be started, right?)
   - failing to STONITH with some error (on every node)
   - timing out the STONITH operation
   - manual fencing

4. What's the modern way to do manual fencing?  (stonith_admin
   --confirm + what?  I ask because meatware.so comes from
   cluster-glue and uses the old API).
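For question 4, a hedged sketch of how a manual fence is usually acknowledged
nowadays (the node name is a placeholder, and this should only be issued once
the node is verified to really be down):

   # acknowledge that node3 has been manually fenced
   stonith_admin --confirm node3
   # or, with pcs:
   pcs stonith confirm node3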
-- 
Thanks,
Feri

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] pacemaker and fence_sanlock

2016-05-11 Thread Da Shi Cao
Dear all,

I'm just beginning to use pacemaker+corosync as our HA solution on Linux, but I 
got stuck at the stage of configuring fencing.

Pacemaker 1.1.15,  Corosync Cluster Engine, version '2.3.5.46-d245', and 
sanlock 3.3.0 (built May 10 2016 05:13:12)

I have the following questions:

1. stonith_admin --list-installed will only list two agents: fence_pcmk, 
fence_legacy before sanlock is compiled and installed under /usr/local. But 
after "make install" of sanlock, stonith_admin --list-installed will list:

 fence_sanlockd
 fence_sanlock
 fence_pcmk
 fence_legacy
 It is weird and I wonder what makes stonith_admin know about fence_sanlock?

2. How do I configure fencing with fence_sanlock in pacemaker? I've tried to
create a new resource to do the unfencing for each node, but the resource start
will fail since the fence_sanlock agent has no monitor operation, and the
resource manager fires a monitor once after the start to make sure it has
been started OK.

3. How do I create a fencing resource to do the fencing by sanlock? This I've not
tried yet, but I wonder which node(s) of the majority partition will initiate the
fence operations against the nodes without quorum.

Thank you very much.
Dashi Cao
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker with Zookeeper??

2016-05-13 Thread Nguyen Xuan. Hai

Hi,

I have an idea: use Pacemaker with Zookeeper (instead of Corosync). Is
it possible?

Has anyone investigated this before?

Thanks for your help!
Hai Nguyen


--
This mail was scanned by BitDefender
For more information please visit http://www.bitdefender.com


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker 1.1.15 released

2016-06-21 Thread Ken Gaillot
ClusterLabs is proud to announce the latest release of the Pacemaker
cluster resource manager, version 1.1.15. The source code is available at:

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.15
The most significant enhancements since version 1.1.14 are:

* A new "alerts" section of the CIB allows you to configure scripts that
will be called after significant cluster events. Sample scripts are
installed in /usr/share/pacemaker/alerts.
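As a rough illustration of wiring one of those samples in with pcs (the alert
id, sample script name, and recipient value here are assumptions, and pcs alert
syntax has varied a bit across versions):

  pcs alert create path=/usr/share/pacemaker/alerts/alert_file.sh.sample id=log_alert
  pcs alert recipient add log_alert value=/var/log/pcmk_alerts.log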

* A new pcmk_action_limit option for fence devices allows multiple fence
actions to be executed concurrently. It defaults to 1 to preserve
existing behavior (i.e. serial execution of fence actions).

* Pacemaker Remote support has been improved. Most noticeably, if
pacemaker_remote is stopped without disabling the remote resource first,
any resources will be moved off the node (previously, the node would get
fenced). This allows easier software updates on remote nodes, since
updates often involve restarting the daemon.

* You may notice some files have moved from the pacemaker package to
pacemaker-cli, including most ocf:pacemaker resource agents, the
logrotate configuration, the XML schemas and the SNMP MIB. This allows
Pacemaker Remote nodes to work better when the full pacemaker package is
not installed.

* Have you ever wondered why a resource is not starting when you think
it should? crm_mon will now show why a resource is stopped, for example,
because it is unmanaged, or disabled in the configuration.

* In 1.1.14, the controld resource agent was modified to return a
monitor error when DLM is in the "wait fencing" state. This turned out
to be too aggressive, resulting in fencing the monitored node
unnecessarily if a slow fencing operation against another node was in
progress. The agent now does additional checking to determine whether to
return an error or not.

* Four significant regressions have been fixed. Compressed CIBs larger
than 1MB are again supported (a regression since 1.1.14), fenced unseen
nodes properly are not marked as unclean (also since 1.1.14),
have-watchdog is detected properly rather than always true (also since
1.1.14) and failures of multiple-level monitor checks should again cause
the resource to fail (since 1.1.10).

As usual, the release includes many bugfixes and minor enhancements. For
a more detailed list of changes, see the change log:

https://github.com/ClusterLabs/pacemaker/blob/1.1/ChangeLog

Everyone is encouraged to download, compile and test the new release. We
do many regression tests and simulations, but we can't cover all
possible use cases, so your feedback is important and appreciated.

Many thanks to all contributors of source code to this release,
including Andrew Beekhof, Bin Liu, Christian Schneider, Christoph Berg,
David Shane Holden, Ferenc Wágner, Gao Yan, Hideo Yamauchi, Jan Pokorný,
Ken Gaillot, Klaus Wenninger, Kostiantyn Ponomarenko, Kristoffer
Grönlund, Lars Ellenberg, Michal Koutný, Nakahira Kazutomo, Oyvind
Albrigtsen, Ruben Kerkhof, and Yusuke Iida. Apologies if I have
overlooked anyone.
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] pacemaker validate-with

2016-08-24 Thread Gabriele Bulfon
Hi,
now I've got my pacemaker 1.1.14 and corosync 2.4.1 working together running an 
empty configuration of just 2 nodes.
So I run crm configure but I get this error:
ERROR: CIB not supported: validator 'pacemaker-2.4', release '3.0.10'
ERROR: You may try the upgrade command
ERROR: configure: Missing requirements
Looking around I found this :
http://unix.stackexchange.com/questions/269635/cib-not-supported-validator-pacemaker-2-0-release-3-0-9
but before going on this way, my questions are:
- why does the CIB ask for a 'pacemaker-2.4' validator when the latest pacemaker
available is 1.1.15?
- what is release '3.0.10'? Of what?
- may I deliver a preconfigured configuration file to have it work fine with
pacemaker 1.1.14?
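For reference, the "upgrade command" the error message refers to is usually one
of the following (a sketch; take a CIB backup first, e.g. with cibadmin --query):

  cibadmin --upgrade --force
  # or, from the crm shell:
  crm configure upgrade force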
Thanks for any help.
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker quorum behavior

2016-09-08 Thread Scott Greenlese

Hi all...

I have a few very basic questions for the group.

I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100 VirtualDomain
pacemaker-remote nodes
plus 100 "opaque" VirtualDomain resources. The cluster is configured to be
'symmetric' and I have no
location constraints on the 200 VirtualDomain resources (other than to
prevent the opaque guests
from running on the pacemaker remote node resources).  My quorum is set as:

quorum {
provider: corosync_votequorum
}

As an experiment, I powered down one LPAR in the cluster, leaving 4 powered
up with the pcsd service running but corosync/pacemaker stopped (pcs cluster
stop --all) on the 4 survivors. I then started pacemaker/corosync on a single
cluster node (pcs cluster start), and this resulted in the 200 VirtualDomain
resources activating on that single node.
This was not what I was expecting. I assumed that no resources would
activate/start on any cluster node
until 3 out of the 5 total cluster nodes had pacemaker/corosync running.
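For reference, the votequorum option usually involved in "do nothing until
enough nodes have been seen" behavior is wait_for_all; a corosync.conf sketch
(shown as an assumption about the relevant knob, not as this cluster's actual
configuration):

  quorum {
      provider: corosync_votequorum
      wait_for_all: 1
  }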

After starting pacemaker/corosync on the single host (zs95kjpcs1), this is
what I see :

[root@zs95kj VD]# date;pcs status |less
Wed Sep  7 15:51:17 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep  7 15:51:18 2016  Last change: Wed Sep  7
15:30:12 2016 by hacluster via crmd on zs93kjpcs1
Stack: corosync
Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition
with quorum
106 nodes and 304 resources configured

Node zs93KLpcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Online: [ zs95kjpcs1 ]
OFFLINE: [ zs90kppcs1 ]

.
.
.
PCSD Status:
  zs93kjpcs1: Online
  zs95kjpcs1: Online
  zs95KLpcs1: Online
  zs90kppcs1: Offline
  zs93KLpcs1: Online

So, what exactly constitutes an "Online" vs. "Offline" cluster node w.r.t.
quorum calculation? Seems like in my case it's "pending" on 3 nodes,
so where does that fall? And why "pending"? What does that mean?

Also, what exactly is the cluster's expected reaction to quorum loss?
Cluster resources will be stopped or something else?

Where can I find this documentation?

Thanks!

Scott Greenlese -  IBM Solution Test Team.



Scott Greenlese ... IBM Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com
  PHONE:  8/293-7301 (845-433-7301)M/S:  POK 42HA/P966
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] pacemaker compile error

2016-10-27 Thread ferdinando
hello,

I'm trying to install a pacemaker cluster in a testing environment
(2 nodes, Fedora release 24, 4.7.9-200.fc24.x86_64). I have problems
compiling the last commit (19c0d74717fb1e9701d51b206823a3386a114caa)
of pacemaker. Chunk of the log:

===

pacemaker configuration:
  Version                 = 1.1.15 (Build: 19c0d74)
  Features                = generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd snmp libesmtp

  Prefix                  = /usr
  Executables             = /usr/sbin
  Man pages               = /usr/share/man
  Libraries               = /usr/lib64
  Header files            = /usr/include
  Arch-independent files  = /usr/share
  State information       = /var
  System configuration    = /etc

  Use system LTDL         = yes

  HA group name           = haclient
  HA user name            = hacluster

  CFLAGS                  = -g -O2 -I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include -ggdb -fgnu89-inline -Wall -Waggregate-return -Wbad-function-cast -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wmissing-prototypes -Wmissing-declarations -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wwrite-strings -Wunused-but-set-variable -Wformat=2 -fstack-protector-strong
  CFLAGS_HARDENED_EXE     = -fPIE
  CFLAGS_HARDENED_LIB     =
  LDFLAGS_HARDENED_EXE    = -Wl,-z,relro -pie -Wl,-z,now -Wl,--as-needed
  LDFLAGS_HARDENED_LIB    = -Wl,-z,relro -Wl,-z,now -Wl,--as-needed
  Libraries               = -lgnutls -lqb -lcorosync_common -lqb -lqb -ldl -lrt -lpthread -lbz2 -lxslt -lxml2 -luuid -lpam -lrt -ldl -lglib-2.0 -lltdl
  Stack Libraries         = -lqb -ldl -lrt -lpthread -lcpg -lcfg -lcmap -lquorum
===
.

  CC       libcrmcommon_la-mainloop.lo
mainloop.c:406:8: error: unknown type name 'qb_array_t'
 static qb_array_t *gio_map = NULL;
        ^~
mainloop.c: In function 'mainloop_cleanup':
mainloop.c:412:9: error: implicit declaration of function 'qb_array_free' [-Werror=implicit-function-declaration]
     qb_array_free(gio_map);
     ^
mainloop.c:412:9: error: nested extern declaration of 'qb_array_free' [-Werror=nested-externs]
mainloop.c: In function 'gio_poll_dispatch_update':
mainloop.c:465:11: error: implicit declaration of function 'qb_array_index' [-Werror=implicit-function-declaration]
     res = qb_array_index(gio_map, fd, (void **)&adaptor);
           ^~
mainloop.c:465:5: error: nested extern declaration of 'qb_array_index' [-Werror=nested-externs]
     res = qb_array_index(gio_map, fd, (void **)&adaptor);
     ^~~
mainloop.c: In function 'mainloop_add_ipc_server':
mainloop.c:594:19: error: implicit declaration of function 'qb_array_create_2' [-Werror=implicit-function-declaration]
     gio_map = qb_array_create_2(64, sizeof(struct gio_to_qb_poll), 1);
               ^
mainloop.c:594:9: error: nested extern declaration of 'qb_array_create_2' [-Werror=nested-externs]
     gio_map = qb_array_create_2(64, sizeof(struct gio_to_qb_poll), 1);
     ^~~
mainloop.c:594:17: error: assignment makes pointer from integer without a cast [-Werror=int-conversion]
     gio_map = qb_array_create_2(64, sizeof(struct gio_to_qb_poll), 1);
             ^
cc1: all warnings being treated as errors
Makefile:754: recipe for target 'libcrmcommon_la-mainloop.lo' failed
gmake[2]: *** [libcrmcommon_la-mainloop.lo] Error 1
gmake[2]: Leaving directory '/usr/local/src/pacemaker/lib/common'
Makefile:565: recipe for target 'all-recursive' failed
gmake[1]: *** [all-recursive] Error 1
gmake[1]: Leaving directory '/usr/local/src/pacemaker/lib'
Makefile:1242: recipe for target 'core' failed
make: *** [core] Error 1

git log
commit 19c0d74717fb1e9701d51b206823a3386a114caa
Merge: 722276c a22b02e
Author: Ken Gaillot 
Date:   Tue Oct 25 11:59:10 2016 -0500

    Merge pull request #1160 from wenningerk/fix_atomic_attrd

    fix usage of HAVE_ATOMIC_ATTRD in attrd_updater
===

libqb compile is fine:

libqb.pc
===
prefix=/usr
exec_prefix=${prefix}
libdir=/usr/lib64
includedir=${prefix}/include

Name: libqb
Version: 1.0.0.53-026a
Description: libqb
Requires:
Libs: -L${libdir} -lqb -ldl -lrt -lpthread
Cflags: -I${includedir}
===

/usr/include/qb/
-rw-r--r-- 1 root root 8.6K Oct 27 18:26 qblist.h
-rw-r--r-- 1 root root 4.6K Oct 27 18:26 qbhdb.h
-rw-r--r-- 1 root root 3.0K Oct 27 18:26 qbdefs.h
-rw-r--r-- 1 root root 7.2K Oct 27 18:26 qbutil.h
-rw-r--r-- 1 root root 8.5K Oct 27 18:26 qbrb.h
-rw-r--r-- 1 root root 7.3K Oct 27 18:26 qbmap.h
-rw-r--r-- 1 root root 7.6K Oct 27 18:26 qbloop.h
-rw-r--r-- 1 root root 22K Oct 27 18:26 qblog.h
-rw-r--r-- 1 root root 14K Oct 27 18:26 qbipcs.h
-rw-r--r-- 1 root root 1.8K Oct 27 18:26 qbipc_common.h
-rw-r--r-- 1 root root 8.0K Oct 27 18:26 qbipcc.h
-rw-r--r-- 1 root root 1.5K Oct 27 18:26 qbconfig.h
-rw-r--r-- 1 root root 7.1K Oct 27 18:26 qbatomic.h
-rw-r--r-- 1 root root 3.2K Oct 27 18:26 qbarray.h

===

qb

[ClusterLabs] Pacemaker 1.1.16 released

2016-11-30 Thread Ken Gaillot
ClusterLabs is proud to announce the latest release of the Pacemaker
cluster resource manager, version 1.1.16. The source code is available at:

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.16
The most significant enhancements in this release are:

* rsc-pattern may now be used instead of rsc in location constraints, to
allow a single location constraint to apply to all resources whose names
match a regular expression. Sed-like %0 - %9 backreferences let
submatches be used in node attribute names in rules.
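A hedged sketch of what such a constraint could look like in the CIB XML (the
id, pattern, node name, and score below are invented for illustration):

  <rsc_location id="ban-db-from-node3" rsc-pattern="^db-.*" node="node3" score="-INFINITY"/>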

* The new ocf:pacemaker:attribute resource agent sets a node attribute
according to whether the resource is running or stopped. This may be
useful in combination with attribute-based rules to model dependencies
that simple constraints can't handle.

* Pacemaker's existing "node health" feature allows resources to move
off nodes that become unhealthy. Now, when using
node-health-strategy=progressive, a new cluster property
node-health-base will be used as the initial health score of newly
joined nodes (defaulting to 0, which is the previous behavior). This
allows a node to be treated as "healthy" even if it has some "yellow"
health attributes, which can be useful to allow clones to run on such nodes.

* Previously, the OCF_RESKEY_CRM_meta_notify_active_* variables were not
properly passed to multistate resources with notification enabled. This
has been fixed. To help resource agents detect when the fix is
available, the CRM feature set has been incremented. (Whenever the
feature set changes, mixed-version clusters are supported only during
rolling upgrades -- nodes with an older version will not be allowed to
rejoin once they shut down.)

* Watchdog-based fencing using sbd now works better on remote nodes.
This capability still likely has some limitations, however.

* The build process now takes advantage of various compiler features
(RELRO, PIE, as-needed linking, etc.) that enhance security and start-up
performance. See the "Hardening flags" comments in the configure.ac file
for more details.

* Python 3 compatibility: The Pacemaker project now targets
compatibility with both python 2 (versions 2.6 and later) and python 3
(versions 3.2 and later). All of the project's python code now meets
this target, with the exception of CTS, which is still python 2 only.

* The Pacemaker coding guidelines have been replaced by a more
comprehensive addition to the documentation set, "Pacemaker
Development". It is intended for developers working on the Pacemaker
code base itself, rather than external code such as resource agents. A
copy is viewable at
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Development/

As usual, the release includes many bugfixes, including a fix for a
serious security vulnerability (CVE-2016-7035). For a more detailed list
of changes, see the change log:

https://github.com/ClusterLabs/pacemaker/blob/1.1/ChangeLog

Many thanks to all contributors of source code to this release,
including Andrew Beekhof, Bin Liu, Christian Schneider, Christoph Berg,
David Shane Holden, Ferenc Wágner, Yan Gao, Hideo Yamauchi, Jan Pokorný,
Ken Gaillot, Klaus Wenninger, Kostiantyn Ponomarenko, Kristoffer
Grönlund, Lars Ellenberg, Masatake Yamato, Michal Koutný, Nakahira
Kazutomo, Nate Clark, Nishanth Aravamudan, Oyvind Albrigtsen, Ruben
Kerkhof, Tim Bishop, Vladislav Bogdanov and Yusuke Iida. Apologies if I
have overlooked anyone.
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker 1.1.17 released

2017-07-06 Thread Ken Gaillot
ClusterLabs is proud to announce the latest release of the Pacemaker
cluster resource manager, version 1.1.17. The source code is available at:

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.17

The most significant enhancements in this release are:

* A new "bundle" resource type simplifies launching resources inside
Docker containers. This feature is considered experimental for this
release. It was discussed in detail previously:

  http://lists.clusterlabs.org/pipermail/users/2017-April/005380.html

A walk-through is available on the ClusterLabs wiki for anyone who wants
to experiment with the feature:

  http://wiki.clusterlabs.org/wiki/Bundle_Walk-Through

* A new environment variable PCMK_node_start_state can specify that a
node should start in standby mode. It was also discussed previously:

  http://lists.clusterlabs.org/pipermail/users/2017-April/005607.html

* The "crm_resource --cleanup" and "crm_failcount" commands can now
operate on a single operation type (previously, they could only operate
on all operations at once). This is part of an underlying switch to
tracking failure counts per operation, also discussed previously:

  http://lists.clusterlabs.org/pipermail/users/2017-April/005391.html

* Several command-line tools have new options, including "crm_resource
--validate" to run a resource agent's validate-all action,
"stonith_admin --list-targets" to list all potential targets of a fence
device, and "crm_attribute --pattern" to update or delete all node
attributes matching a regular expression

* The cluster's handling of fence failures has been improved. Among the
changes, a new "stonith-max-attempts" cluster option specifies how many
times fencing can fail for a target before the cluster will no longer
immediately re-attempt it (previously hard-coded at 10).
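For example, a sketch using pcs (the value is arbitrary):

  pcs property set stonith-max-attempts=5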

* The new release has scalability improvements for large clusters. Among
the changes, a new "cluster-ipc-limit" cluster option specifies how
large the IPC queue between pacemaker daemons can grow.

* Location constraints using rules may now compare a node attribute
against a resource parameter, using the new "value-source" field.
Previously, node attributes could only be compared against literal
values. This is most useful in combination with rsc-pattern to apply the
constraint to multiple resources.

As usual, to support the new features, the CRM feature set has been
incremented. This means that mixed-version clusters are supported only
during a rolling upgrade -- nodes with an older version will not be
allowed to rejoin once they shut down.

For a more detailed list of bug fixes and other changes, see the change log:

https://github.com/ClusterLabs/pacemaker/blob/1.1/ChangeLog

Many thanks to all contributors of source code to this release,
including Alexandra Zhuravleva, Andrew Beekhof, Aravind Kumar, Eric
Marques, Ferenc Wágner, Yan Gao, Hayley Swimelar, Hideo Yamauchi, Igor
Tsiglyar, Jan Pokorný, Jehan-Guillaume de Rorthais, Ken Gaillot, Klaus
Wenninger, Kristoffer Grönlund, Michal Koutný, Nate Clark, Patrick
Hemmer, Sergey Mishin, Vladislav Bogdanov, and Yusuke Iida. Apologies if
I have overlooked anyone.
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker in Azure

2017-08-24 Thread Eric Robinson
I deployed a couple of cluster nodes in Azure and found out right away that 
floating a virtual IP address between nodes does not work because Azure does 
not honor IP changes made from within the VMs. IP changes must be made to 
virtual NICs in the Azure portal itself. Anybody know of an easy way around 
this limitation?

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemaker fencing

2017-09-05 Thread Klaus Wenninger
On 09/05/2017 02:43 PM, Papastavros Vaggelis wrote:
> Dear friends ,
>
> I have two_nodes (sgw-01 and sgw-02) HA cluster integrated with two
> APC PDUs as fence devices
>
> 1) pcs stonith create node2-psu fence_apc ipaddr="10.158.0.162"
> login="apc" passwd="apc" port="1" pcmk_host_list="sgw-02"
> pcmk_host_check="static-list" action="reboot" power_wait="5" op  
> monitor interval="60s"
> pcs constraint location node2-psu prefers sgw-01=INFINITY
>
>
> 2)pcs stonith create node1-psu fence_apc ipaddr="10.158.0.161"
> login="apc" passwd="apc" port="1" pcmk_host_list="sgw-01"
> pcmk_host_check="static-list" action="reboot" power_wait="5" op
> monitor interval="60s"
> pcs constraint location node1-psu prefers sgw-02=INFINITY
>
>
> Fencing is working fine but i want to deploy the following scenario :
>
> If one node fenced more than x times during a specified time period
> then change the fence action from reboot to stop.
>
> For example if the node rebooted more than 3 times during one hour
> then at the next crash change the fence action from reboot to off.
>
> Is there a proper way to implement the above scenario  ?

I'm not aware of direct support for this scenario.

But what I could imagine is:
Configure 2 fencing resources for the same stonith device on 2 different
levels.
Before these, on the same levels, you could put a fence agent (you would
have to write it yourself - maybe derive it from fence_dummy in the
pacemaker package). In this agent you can do the counting, and you would
either succeed, making pacemaker proceed on the same level, or fail,
making pacemaker try the next level.
Beware that with multiple agents on one level pacemaker always does
on/off and no reboot.
But for the higher-level instance you can map the on-action to reboot
and the off-action to metadata, while for the lower-priority level you
would just map the on-action to metadata (to make it do nothing).
 
Regards,
Klaus
>
> is it mandatory to use two level fencing  for the above case ?
>
>
>
> Sincerely
>
> Vaggelis Papastavros
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemaker fencing

2017-09-05 Thread Ken Gaillot
On Tue, 2017-09-05 at 14:59 +0200, Klaus Wenninger wrote:
> On 09/05/2017 02:43 PM, Papastavros Vaggelis wrote:
> 
> > Dear friends ,
> > 
> > 
> > I have two_nodes (sgw-01 and sgw-02) HA cluster integrated with two
> > APC PDUs as fence devices
> > 
> > 1) pcs stonith create node2-psu fence_apc ipaddr="10.158.0.162"
> > login="apc" passwd="apc" port="1" pcmk_host_list="sgw-02"
> > pcmk_host_check="static-list" action="reboot" power_wait="5" op
> > monitor interval="60s"
> > pcs constraint location node2-psu prefers sgw-01=INFINITY
> > 
> > 
> > 2)pcs stonith create node1-psu fence_apc ipaddr="10.158.0.161"
> > login="apc" passwd="apc" port="1" pcmk_host_list="sgw-01"
> > pcmk_host_check="static-list" action="reboot" power_wait="5" op
> > monitor interval="60s"

Be aware that "action" is not the correct parameter to use here -- it
should never be specified in the resource configuration (newer versions
of pcs and pacemaker will give an error if you do). It's an internal
parameter that pacemaker sends to the agent.

What you're looking for is either the stonith-action cluster property
(which will apply to all fence devices in the cluster) or the
pcmk_reboot_action resource parameter.
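As a rough sketch of where those two settings live (using the resource name
from the configuration above; "off" is just an example value):

  # per-device: make this device's reboot requests act as "off"
  pcs stonith update node1-psu pcmk_reboot_action=off

  # or cluster-wide:
  pcs property set stonith-action=off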

> > pcs constraint location node1-psu prefers sgw-02=INFINITY
> > 
> > 
> > 
> > 
> > Fencing is working fine but i want to deploy the following
> > scenario :
> > 
> > 
> > If one node fenced more than x times during a specified time period
> > then change the fence action from reboot to stop. 
> > 
> > 
> > 
> > For example if the node rebooted more than 3 times during one hour
> > then at the next crash change the fence action from reboot to off.
> > 
> > 
> > Is there a proper way to implement the above scenario  ? 
> > 
> 
> I'm not aware of a direct support of this scenario.
> 
> But what I could imagine is:
> Configure 2 fencing-resources for the same stonith-device on 2
> different levels.
> Before these on the same levels you could put a fence-agent (would
> have to do it yourself - maybe
> derive from fence_dummy in pacemaker package). In this agent you can
> do the counting and
> you would succeed making pacemaker proceed on the same level or fail
> to make pacemaker
> try the next level.
> Beware that with multiple agents on one level pacemaker always does
> on/off and no reboot.
> But for the higher level instance you can map the on-action to reboot
> and the off-action to metadata.
> While for the lower prio level you would just map the on-action to
> metadata (to make it do nothing).
>  
> Regards,
> Klaus
> > 
> > 
> > is it mandatory to use two level fencing  for the above case ? 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Sincerely
> > 
> > 
> > Vaggelis Papastavros 

-- 
Ken Gaillot 





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker 1.1.18 released

2017-11-14 Thread Ken Gaillot
ClusterLabs announces the release of version 1.1.18 of the Pacemaker
cluster resource manager. The source code is available at:

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.18

This is expected to be the last actively developed release in the 1.1
line. Development will now begin on Pacemaker 2.0.0.

The most significant new features in 1.1.18 are:

* Warnings will be logged when legacy configuration syntax planned to
be removed in 2.0.0 is used.

* Bundles are no longer considered experimental. They support all
constraint types, and they now support rkt as well as Docker
containers. Many bug fixes and enhancements have been made.
 
* Alerts may now be filtered so that alert agents are called only for
desired alert types, and (as an experimental feature) it is now
possible to receive alerts for transient node attribute changes.

* Status output (from crm_mon, pengine logs, and a new crm_resource --
why option) now has more details about why resources are in a certain
state.

As usual, to support the new features, the CRM feature set has been
incremented. This means that mixed-version clusters are supported only
during a rolling upgrade -- nodes with an older version will not be
allowed to rejoin once they shut down.

For a more detailed list of bug fixes and other changes, see the change
log:

https://github.com/ClusterLabs/pacemaker/blob/1.1/ChangeLog

Everyone is encouraged to download, compile and test the new release.
Your feedback is important and appreciated.

Many thanks to all contributors of source code to this release,
including Andrew Beekhof, Aravind Kumar, Artur Novik, Bin Liu, Ferenc
Wágner, Helmut Grohne, Hideo Yamauchi, Igor Tsiglyar, Jan
Pokorný, Kazunori INOUE, Keisuke MORI, Ken Gaillot, Klaus Wenninger,
Nye Liu, Tomer Azran, Valentin Vidic, and Yan Gao.
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] pacemaker self stonith

2017-11-30 Thread Hauke Homburg
Hello list,

I am searching for a possibility for a pacemaker node to stonith itself.

The reason is I need to check whether the pacemaker node can still reach the
network outside the local network, because of a network outage.

I can't connect to an ILO interface or the like.

I am considering a bash script (ping the outside world and trigger a hardware
reset if it fails):

if ! ping -c 1 8.8.8.8 >/dev/null 2>&1; then $Hardware_reset; fi

put this in the crontab and run it every minute.
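For comparison, a hedged sketch of the more common in-cluster approach: a cloned
ocf:pacemaker:ping resource plus a connectivity-based location rule, which moves
or stops resources when the outside world is unreachable rather than resetting
the node (the resource, clone, and constraint names, and the protected resource
my_rsc, are placeholders; actual self-fencing would normally be handled with
sbd/watchdog rather than a cron job):

  primitive p_ping ocf:pacemaker:ping \
      params host_list="8.8.8.8" multiplier=1000 dampen=5s \
      op monitor interval=15s timeout=60s
  clone cl_ping p_ping
  location l_connectivity my_rsc \
      rule -inf: not_defined pingd or pingd lte 0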

Thanks for the help

Hauke



-- 
www.w3-creative.de

www.westchat.de

https://friendica.westchat.de/profile/hauke


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker PostgreSQL cluster

2018-05-29 Thread Salvatore D'angelo
Hi All,

I am new to this list. I am working on a project that uses a cluster composed
of 3 nodes (with Ubuntu 14.04 trusty) on which we run PostgreSQL managed as
master/slaves.
We use Pacemaker/Corosync to manage this cluster. In addition, we have a
two-node GlusterFS where we store backups and WAL files.
Currently the versions of our components are quite old, we have:
Pacemaker 1.1.14
Corosync 2.3.5

and we want to move to a new version of Pacemaker but I have some doubts.

1. I noticed there is a 2.0.0 candidate release, so it could be convenient for us
to move to this release. When will the final release be published? Is it more
convenient to move to 2.0.0 or to 1.1.18?
2. I read some documentation about upgrade and since we want 0 ms downtime I 
think the Rolling Upgrade (node by node) is the better approach. We migrate a 
node and in the meantime the other two nodes are still active. The problem is 
that I do not know if I can have a mix of 1.1.14 and 1.1.18 (or 2.0.0) nodes. 
The documentation does not clarify it or at least it was not clear to me. Is 
this possible?
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ap-upgrade.html
 

https://wiki.clusterlabs.org/wiki/Upgrade 

3. I need to upgrade pacemaker/corosync on Ubuntu 14.04. I noticed for 1.1.18
there are Ubuntu packages available. What about 2.0.0? Is it possible to create
Ubuntu packages in some way?
4. Where can I find the list of (Ubuntu) dependencies required by
pacemaker/corosync for 1.1.18 and 2.0.0?

Thanks in advance for your help.
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker alert framework

2018-07-06 Thread Ian Underhill
requirement:
when a resource fails, perform an action (run a script on all nodes within
the cluster) before the resource is relocated, i.e. gather information on
why the resource failed.

what I have looked into:
1) Use the monitor call within the resource to SSH to all nodes, again SSH
config needed.
2) Alert framework: this only seems to be triggered for nodes involved in
the relocation of the resource, i.e. if the resource moves from node1 to node2,
node3 doesn't know. So back to the SSH solution :(
3) Sending a custom alert to all nodes in the cluster? Is this possible?
I haven't found a way.

only solution I have:
1) use SSH within an alert monitor (stop) to SSH onto all nodes to perform
the action, the nodes could be configured using the alert monitors
recipients, but I would still need to config SSH users and certs etc.
 1.a) this doesn't seem to be usable if the resource is relocated back
to the same node, as the start\stop alerts are run at the "same time", i.e.
I need to delay the start till the SSH has completed.

what I would like:
1) delay the start\relocation of the resource until the information from
all nodes is complete, using only pacemaker behaviour\config

any ideas?

Thanks

/Ian.
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker 1.1.19 released

2018-07-11 Thread Ken Gaillot
Source code for the final release of Pacemaker version 1.1.19 is
available at:

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.19

This is a maintenance release that backports selected fixes and
features from the 2.0.0 version. The 1.1 series is no longer actively
maintained, but is still supported for users (and distributions) who
want to keep support for features dropped by the 2.0 series (such as
CMAN or heartbeat as the cluster layer).

The most significant changes in this release are:

* stonith_admin has a new --validate option to validate option
configurations

* .mount, .path, and .timer unit files are now supported as "systemd:"-
class resources

* 5 regressions in 1.1.17 and 1.1.18 have been fixed

For a more detailed list of bug fixes and other changes, see the change
log:

https://github.com/ClusterLabs/pacemaker/blob/1.1/ChangeLog

Many thanks to all contributors of source code to this release,
including Andrew Beekhof, Gao,Yan, Hideo Yamauchi, Jan Pokorný, Ken
Gaillot, and Klaus Wenninger.
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker startup retries

2018-08-30 Thread Cesar Hernandez
Hi

I have a two-node corosync+pacemaker cluster which, when only one node is
started, fences the other node. That's OK, as it is the default behaviour since
"startup-fencing" defaults to true.
But the other node is rebooted 3 times, and then the remaining node starts
resources and doesn't fence the node anymore.

How can I change these 3 attempts to, for example, 1 reboot, or more, say 5? I use a
custom fencing script, so I'm sure these retries are not done by the script but by
pacemaker, and I also see the reboot operations in the logs:

Aug 30 17:22:08 [12978] 1   crmd:   notice: te_fence_node:  
Executing reboot fencing operation (81) on 2 (timeout=18)
Aug 30 17:22:31 [12978] 1   crmd:   notice: te_fence_node:  
Executing reboot fencing operation (87) on 2 (timeout=18)
Aug 30 17:22:48 [12978] 1   crmd:   notice: te_fence_node:  
Executing reboot fencing operation (89) on 2 (timeout=18)



Software versions:

corosync-1.4.8
crmsh-2.1.5
libqb-0.17.2
Pacemaker-1.1.14
resource-agents-3.9.6
Reusable-Cluster-Components-glue--glue-1.0.12

Some parameters:

property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0e5e \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes=2 \
stonith-enabled=true \
no-quorum-policy=ignore \
default-resource-stickiness=200 \
stonith-timeout=180s \
last-lrm-refresh=1534489943


Thanks

César Hernández Bañó


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker managing Keycloak

2022-01-28 Thread Philip Alesio
Hi Everyone,

I'm attempting to create a failover cluster that uses Postgresql and
Keycloak and am having difficulty getting Keycloak running.  Keycloak is
using a Postgresql database.  In one case I'm using DRBD to replicate the
data and in another case I'm using Postgresql.  The failure, in both cases,
is that Keycloak fails to connect to the database.  In both cases Pacemaker
is running with the Postgresql resource when I add the Keycloak resource.
If I "docker run" Keyclock, not adding it as a Pacemaker resource, Keycloak
starts and connects to the database.

Below adds Keycloak as a Pacemaker resource:

pcs cluster cib  cluster1.xml

pcs -f cluster1.xml resource create p_keycloak
ocf:heartbeat:docker image=jboss/keycloak name=keycloak run_opts="-d
-e KEYCLOAK_USER=admin -e KEYCLOAK_PASSWORD=admin -e DB_ADDR=postgres
-e DB_VENDOR=postgres -e DB_USER=postgres -e DB_PASSWORD=postgres -e
DB_DATABASE=keycloak_db -e JDBC_PARAMS=useSSL=false -p 8080:8080 -e
DB_ADDR=postgres -e DB_PORT='5432' --network=cluster1dkrnet" op monitor
interval=60s

pcs -f cluster1.xml resource group add g_receiver p_keycloak

pcs cluster cib-push  cluster1.xml --config



Below creates a Keycloak container that is not managed by Pacemaker:

docker run --name keycloak -e KEYCLOAK_USER=admin -e
KEYCLOAK_PASSWORD=admin -e DB_ADDR=postgres -e DB_VENDOR=postgres -e
DB_USER=postgres -e DB_PASSWORD=postgres -e DB_DATABASE=keycloak_db -e
JDBC_PARAMS=useSSL=false -p 8080:8080 -e DB_ADDR=postgres
-e DB_PORT='5432' --network=cluster1dkrnet jboss/keycloak

 Does anyone have experience with Pacemaker with Keyclock and/or if there
are any thoughts about why Keycloak is not connecting to the Postgresql
database?


Thanks in advance.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker question

2022-10-04 Thread Ken Gaillot
Yes, see ACLs:

https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/singlehtml/index.html#document-acls
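A minimal sketch of the ACL mechanism referenced above (the role name,
permission, and user below are made-up examples; the system user has to exist
and be in the haclient group before ACLs apply to it):

  pcs acl role create operator-role description="read-only access" read xpath /cib
  pcs acl user create cephhauser operator-role
  pcs acl enable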

On Mon, 2022-10-03 at 15:51 +, Jelen, Piotr wrote:
> Dear Clusterlabs team , 
> 
> I would like to ask you if there is some possibility to use a
> different user (e.g. cephhauser) to authenticate/set up the cluster, or
> whether there is another method to authenticate/set up the cluster without
> using the password of the dedicated pacemaker user such as hacluster?
> 
> 
> Best Regards
> Piotr Jelen
> Senior Systems Platform Engineer
>  
> Mastercard
> Mountain View, Central Park  | Leopard
>  
> 
>  
> CONFIDENTIALITY NOTICE This e-mail message and any attachments are
> only for the use of the intended recipient and may contain
> information that is privileged, confidential or exempt from
> disclosure under applicable law. If you are not the intended
> recipient, any disclosure, distribution or other use of this e-mail
> message or attachments is prohibited. If you have received this e-
> mail message in error, please delete and notify the sender
> immediately. Thank you.
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker question

2022-10-05 Thread Tomas Jelinek

Hi,

If you are using pcs to setup your cluster, then the answer is no. I'm 
not sure about crm shell / hawk. Once you have a cluster, you can use 
users other than hacluster as Ken pointed out.


Regards,
Tomas


Dne 04. 10. 22 v 16:06 Ken Gaillot napsal(a):

Yes, see ACLs:

https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/singlehtml/index.html#document-acls

On Mon, 2022-10-03 at 15:51 +, Jelen, Piotr wrote:

Dear Clusterlabs team ,

I would like to ask you if there is some possibility to use a
different user (e.g. cephhauser) to authenticate/set up the cluster, or
whether there is another method to authenticate/set up the cluster without
using the password of the dedicated pacemaker user such as hacluster?


Best Regards
Piotr Jelen
Senior Systems Platform Engineer
  
Mastercard

Mountain View, Central Park  | Leopard
  

  
CONFIDENTIALITY NOTICE This e-mail message and any attachments are

only for the use of the intended recipient and may contain
information that is privileged, confidential or exempt from
disclosure under applicable law. If you are not the intended
recipient, any disclosure, distribution or other use of this e-mail
message or attachments is prohibited. If you have received this e-
mail message in error, please delete and notify the sender
immediately. Thank you.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Pacemaker fatal shutdown

2023-07-19 Thread Priyanka Balotra
Hi All,
I am using SLES 15 SP4. One of the nodes of the cluster was brought down and
booted up after some time. The Pacemaker service came up first, but later it
faced a fatal shutdown. Due to that, the crm service is down.

The logs from /var/log/pacemaker/pacemaker.log are as follows:

Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (pcmk_child_exit)
 warning: Shutting cluster down because pacemaker-controld[15962] had
fatal failure
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
(pcmk_shutdown_worker)   notice: Shutting down Pacemaker
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
(pcmk_shutdown_worker)   debug: pacemaker-controld confirmed stopped
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (stop_child)
notice: Stopping pacemaker-schedulerd | sent signal 15 to process 15961
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
(crm_signal_dispatch)notice: Caught 'Terminated' signal | 15 (invoking
handler)
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
(qb_ipcs_us_withdraw)info: withdrawing server sockets
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (qb_ipcs_unref)
 debug: qb_ipcs_unref() - destroying
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_xml_cleanup)
 info: Cleaning up memory from libxml2
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_exit)
info: Exiting pacemaker-schedulerd | with status 0
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(qb_ipcs_event_sendv)debug: new_event_notification
(/dev/shm/qb-15957-15962-12-RDPw6O/qb): Broken pipe (32)
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(cib_notify_send_one)warning: Could not notify client crmd: Broken pipe
| id=e29d175e-7e91-4b6a-bffb-fabfdd7a33bf
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(cib_process_request)info: Completed cib_delete operation for section
//node_state[@uname='FILE-2']/*: OK (rc=0, origin=FILE-6/crmd/74,
version=0.24.75)
Jul 17 14:18:20.093 FILE-2 pacemaker-fenced[15958]
(xml_patch_version_check)debug: Can apply patch 0.24.75 to 0.24.74
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (pcmk_child_exit)
 info: pacemaker-schedulerd[15961] exited with status 0 (OK)
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(cib_process_request)info: Completed cib_modify operation for section
status: OK (rc=0, origin=FILE-6/crmd/75, version=0.24.75)
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
(pcmk_shutdown_worker)   debug: pacemaker-schedulerd confirmed stopped
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (stop_child)
notice: Stopping pacemaker-attrd | sent signal 15 to process 15960
Jul 17 14:18:20.093 FILE-2 pacemaker-attrd [15960]
(crm_signal_dispatch)notice: Caught 'Terminated' signal | 15 (invoking
handler)

Could you please help me understand the issue here?

Regards
Priyanka
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker-remote

2023-09-18 Thread Ken Gaillot
On Thu, 2023-09-14 at 18:28 +0800, Mr.R via Users wrote:
> Hi all,
>
> In Pacemaker-Remote 2.1.6, the pacemaker package is required
> for guest nodes and not for remote nodes. Why is that? What does
> pacemaker do?
> After adding a guest node, the pacemaker package does not seem to be
> needed. Can I skip installing it here?

I'm not sure what's requiring it in your environment. There's no
dependency in the upstream RPM at least.

The pacemaker package does have the crm_master script needed by some
resource agents, so you will need it if you use any of those. (That
script should have been moved to the pacemaker-cli package in 2.1.3,
oops ...)

> After testing, remote nodes can be taken offline, but guest nodes
> cannot. Is there any way to take them offline? Are there relevant
> failure test cases?
> 
> thanks,

To make a guest node offline, stop the resource that creates it.
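
For example (the resource name here is hypothetical), if the guest node is
created by a VirtualDomain resource called vm-guest1:

pcs resource disable vm-guest1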
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Pacemaker 2.1.8 released

2024-08-12 Thread Ken Gaillot
Hi all,

The final release of Pacemaker 2.1.8 is now available at:

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.1.8

This release includes a significant number of bug fixes (including 10
regression fixes) and a few new features. It also deprecates some
obscure features and many C APIs in preparation for Pacemaker 3.0.0
dropping support for them later this year.

See the link above for more details.

Many thanks to all contributors to this release, including bixiaoyan1,
Chris Lumens, Ferenc Wágner, Gao,Yan, Grace Chin, Ken Gaillot, Klaus
Wenninger, liupei, Oyvind Albrigtsen, Reid Wahl, tomyouyou, wangluwei,
xin liang, and xuezhixin.

-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker alerts list

2019-07-17 Thread Ken Gaillot
On Tue, 2019-07-16 at 13:53 +, Gershman, Vladimir wrote:
> Hi,
>  
> Is there a list of all possible alerts/events that Pacemaker can
> send out? Preferably with criticality levels for the alerts (minor,
> major, critical).

I'm not sure whether you're using "alerts" in a general sense here, or
specifically about Pacemaker's alert configuration:

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#idm140043888101536

If the latter, the "Writing an Alert Agent" section of that link lists
all the possible alert types. The criticality would be derived from
CRM_alert_desc, CRM_alert_rc, and CRM_alert_status.
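
For instance, registering an alert agent and a recipient with pcs (the path
and IDs here are only examples):

pcs alert create path=/usr/local/bin/my_alert.sh id=my_alert
pcs alert recipient add my_alert value=/var/log/my_alert.log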


>  
>  
>  
> Thank you,
> 
> Vlad
> Equipment Management (EM) System Engineer
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker alerts list

2019-07-18 Thread Jan Pokorný
On 17/07/19 19:07 +, Gershman, Vladimir wrote:
> This would be for Pacemaker.
> 
> It seems like the alerts in the link you sent refer to numeric codes,
> so where would I see all the codes and their meanings? This would
> allow me to select what I need to monitor.

Unfortunately, we currently can't make do without direct code
references, but keep in mind that we are already in expert territory
when the usefulness of alerts is to be maxed out (and one is then
also responsible for updating such a tight wrapping accordingly if
or when incompatible changes arrive in new versions -- presumably
any slightly more disruptive release would designate that in the
versioning scheme [non-minor component being incremented] and
perhaps note it in the release notes as well).

That being said, more detailed documentation, perhaps accompanied
by firmer assurances as to the details of the so far vaguely specified
informative data items attached to the "unit" of alert, may arrive
in the future, and I bet contributions to make that happen faster
are warmly welcome, especially when driven by real production
needs.

> For example:

[intentionally reordered]

> CRM_alert_status:
>   A numerical code used by Pacemaker to represent the operation
>   result (resource alerts only)  

See
https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.2/include/crm/services.h#L118-L129

> CRM_alert_desc:
>   Detail about event. For node alerts, this is the node's current
>   state (member or lost).

That's literal, see
https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.2/include/crm/cluster.h#L30-L31

>   For fencing alerts, this is a summary of the requested fencing
>   operation, including origin, target, and fencing operation error
>   code, if any.

This would indeed require extensive parsing of the generated string
for fields that are not present as standalone variables (here, node
to be fenced that is also available separately via CRM_alert_node):

https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.2/daemons/controld/controld_execd_state.c#L805-L809

>   For resource alerts, this is a readable string equivalent of
>   CRM_alert_status.  

See the first link above; the translation from numeric codes is rather
symbolic, though:

https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.2/include/crm/services.h#L331-L340

(but may denote that some codes from the full enumeration are strictly
internal, based on a simple reasoning about the coverage, not sure)

Plus there's an exception for operations already known to have finished,
for which the exit status from the actual agent's execution is reproduced
here in words, and luckily, that's actually documented:

https://github.com/ClusterLabs/resource-agents/blob/v4.3.0/doc/dev-guides/ra-dev-guide.asc#return-codes

> CRM_alert_target_rc:
>   The expected numerical return code of the operation (resource
>   alerts only)  

This appears to be primarily bound to the OCF codes referred to just above.

* * *

Hopefully that's enough to get you started with your own exploration.
Initially, I'd also suggest attaching your own dump-all alert handler
to get real hands-on experience with the data at your disposal that can
then be leveraged in your true handler.
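
A dump-all handler of that sort can be as small as the following sketch (the
log path is arbitrary); it simply records every CRM_alert_* variable the
cluster passes in for each event:

#!/bin/sh
# Append a timestamp and all CRM_alert_* environment variables per event.
{
  date
  env | grep '^CRM_alert_' | sort
  echo '---'
} >> /var/log/crm_alert_dump.log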

-- 
Jan (Poki)


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] pacemaker resources under systemd

2019-08-27 Thread Ulrich Windl
Hi!

Systemd thinks it's the boss, doing what it wants: today I noticed that all
resources run inside the control group "pacemaker.service", like this:
  ├─pacemaker.service
  │ ├─ 26582 isredir-ML1: listening on 172.20.17.238/12503 (2/1)
  │ ├─ 26601 /usr/bin/perl -w /usr/sbin/ldirectord /etc/ldirectord/mail.conf
start
  │ ├─ 26628 ldirectord tcp:172.20.17.238:25
  │ ├─ 28963 isredir-DS1: handling 172.20.16.33/10475 -- 172.20.17.200/389
  │ ├─ 40548 /usr/sbin/pacemakerd -f
  │ ├─ 40550 /usr/lib/pacemaker/cib
  │ ├─ 40551 /usr/lib/pacemaker/stonithd
  │ ├─ 40552 /usr/lib/pacemaker/lrmd
  │ ├─ 40553 /usr/lib/pacemaker/attrd
  │ ├─ 40554 /usr/lib/pacemaker/pengine
  │ ├─ 40555 /usr/lib/pacemaker/crmd
  │ ├─ 53948 isredir-DS2: handling 172.20.16.33/10570 -- 172.20.17.201/389
  │ ├─ 92472 isredir-DS1: listening on 172.20.17.204/12511 (13049/3)
...

(that "isredir" stuff is my own resource that forks processes and creates
threads on demand, thus modifying process (and thread) titles to help
understanding what's going on...)

My resources are started via OCF RA (shell script), not a systemd unit.

Wouldn't it make much more sense if each resource ran in its own control
group? I mean: if systemd thinks everything MUST run in some control group, why
not pick the "correct" one? Having the pacemaker infrastructure in the same
control group as all the resources seems to be a bad idea IMHO.

The other "discussable feature" is "high PIDs" like "92472". While port
numbers are still 16 bit (in IPv4 at least), I see little sense in having
millions of processes or threads.

Regards,
Ulrich

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] pacemaker-controld getting respawned

2020-01-04 Thread S Sathish S
Hi Team,

The pacemaker-controld process is getting restarted frequently; the reason for
the failure is a disconnect from the CIB / internal error, or high CPU on the
system. The same has been recorded in our system logs. Please find the
pacemaker and corosync versions installed on the system below.

Kindly let us know why we are getting the errors below on the system.

corosync-2.4.4 -->  https://github.com/corosync/corosync/tree/v2.4.4
pacemaker-2.0.2 --> 
https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2

[root@vmc0621 ~]# ps -eo pid,lstart,cmd  | grep -iE 'corosync|pacemaker' | grep 
-v grep
2039 Wed Dec 25 15:56:15 2019 corosync
3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-schedulerd
25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-controld


In system message logs :

Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4419 
failed: Timer expired (-62)
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4420 
failed: Timer expired (-62)
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_ERROR received 
in state S_IDLE from crmd_node_update_complete
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: State transition 
S_IDLE -> S_RECOVERY
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Fast-tracking 
shutdown in response to errors
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: warning: Not voting in 
election, we're in state S_RECOVERY
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_ERROR received 
in state S_RECOVERY from node_list_update_callback
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Input I_TERMINATE 
received in state S_RECOVERY from do_recover
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Stopped 0 recurring 
operations at shutdown (12 remaining)
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:241 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:261 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:249 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:258 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:253 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:250 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:244 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_OCC:237 (XXX_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:264 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:270 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:238 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Recurring action 
XXX_vmc0621:267 (XXX_vmc0621_monitor_1) incomplete at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: 12 resources were 
active at shutdown
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Disconnected from the 
executor
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Disconnected from 
Corosync
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: notice: Disconnected from the 
CIB manager
Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Could not recover from 
internal error
Dec 30 10:02:37 vmc0621 pacemakerd[3048]: error: pacemaker-controld[7517] 
exited with status 1 (Error occurred)
Dec 30 10:02:37 vmc0621 pacemakerd[3048]: notice: Respawning failed child 
process: pacemaker-controld

Please let us know if any further logs required from our end.

Thanks and Regards,
S Sathish S
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Klaus Wenninger
On 7/22/20 9:59 AM, Хиль Эдуард wrote:
> Hi there! I have 2 nodes with Pacemaker 2.0.3, corosync 3.0.3 on
> ubuntu 20 + 1 qdevice. I want to define new resource as systemd
> unit dummy.service:
>  
> [Unit]
> Description=Dummy
> [Service]
Type=simple

That could do the trick. Actually I thought simple would be
the default but ...

Klaus
> Restart=on-failure
> StartLimitInterval=20
> StartLimitBurst=5
> TimeoutStartSec=0
> RestartSec=5
> Environment="HOME=/root"
> SyslogIdentifier=dummy
> ExecStart=/usr/local/sbin/dummy.sh
> [Install]
> WantedBy=multi-user.target
>  
> and /usr/local/sbin/dummy.sh :
>  
> #!/bin/bash
> CNT=0
> while true; do
>   let CNT++
>   echo "hello world $CNT"
>   sleep 5
> done
>  
> and then i try to define it with: pcs resource create dummy.service
> systemd:dummy op monitor interval="10s" timeout="15s"
> after 2 seconds node2 reboot. In logs i see pacemaker in 2 seconds
> tried to start this unit, and it started, but pacemaker somehow think
> he is «Timed Out» . What i am doing wrong? Logs below.
>  
>  
> Jul 21 15:53:41 node2.local pacemaker-controld[1813]:  notice: Result
> of probe operation for dummy.service on node2.local: 7 (not running) 
> Jul 21 15:53:41 node2.local systemd[1]: Reloading.
> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/dbus.socket:5: ListenStream= references a path
> below legacy directory /var/run/, updating
> /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please
> update the unit file accordingly.
> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/docker.socket:6: ListenStream= references a path
> below legacy directory /var/run/, updating /var/run/docker.sock →
> /run/docker.sock; please update the unit file accordingly.
> Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up
> on dummy.service start (rc=0): timeout (elapsed=259719ms,
> remaining=-159719ms)
> Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result
> of start operation for dummy.service on node2.local: Timed Out 
> Jul 21 15:53:42 node2.local systemd[1]: Started Cluster Controlled dummy.
> Jul 21 15:53:42 node2.local dummy[9330]: hello world 1
> Jul 21 15:53:42 node2.local systemd-udevd[922]: Network interface
> NamePolicy= disabled on kernel command line, ignoring.
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> fail-count-dummy.service#start_0[node2.local]: (unset) -> INFINITY 
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> last-failure-dummy.service#start_0[node2.local]: (unset) -> 1595336022 
> Jul 21 15:53:42 node2.local systemd[1]: Reloading.
> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/dbus.socket:5: ListenStream= references a path
> below legacy directory /var/run/, updating
> /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please
> update the unit file accordingly.
> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/docker.socket:6: ListenStream= references a path
> below legacy directory /var/run/, updating /var/run/docker.sock →
> /run/docker.sock; please update the unit file accordingly.
> Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up
> on dummy.service stop (rc=0): timeout (elapsed=317181ms,
> remaining=-217181ms)
> Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result
> of stop operation for dummy.service on node2.local: Timed Out 
> Jul 21 15:53:42 node2.local systemd[1]: Stopping Daemon for dummy...
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> fail-count-dummy.service#stop_0[node2.local]: (unset) -> INFINITY 
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> last-failure-dummy.service#stop_0[node2.local]: (unset) -> 1595336022 
> Jul 21 15:53:42 node2.local systemd[1]: dummy.service: Succeeded.
> Jul 21 15:53:42 node2.local systemd[1]: Stopped Daemon for dummy.
> ... lost connection (node rebooting)
>  
>  
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Andrei Borzenkov
On Wed, Jul 22, 2020 at 10:59 AM Хиль Эдуард  wrote:

> Hi there! I have 2 nodes with Pacemaker 2.0.3, corosync 3.0.3 on ubuntu 20
> + 1 qdevice. I want to define new resource as systemd unit dummy.service:
>
> [Unit]
> Description=Dummy
> [Service]
> Restart=on-failure
> StartLimitInterval=20
> StartLimitBurst=5
> TimeoutStartSec=0
> RestartSec=5
> Environment="HOME=/root"
> SyslogIdentifier=dummy
> ExecStart=/usr/local/sbin/dummy.sh
> [Install]
> WantedBy=multi-user.target
>
> and /usr/local/sbin/dummy.sh :
>
> #!/bin/bash
> CNT=0
> while true; do
>   let CNT++
>   echo "hello world $CNT"
>   sleep 5
> done
>
> and then i try to define it with: pcs resource create dummy.service
> systemd:dummy op monitor interval="10s" timeout="15s"
> after 2 seconds node2 reboot.
>

Node reboots because stop operation failed, no start.



> In logs i see pacemaker in 2 seconds tried to start this unit, and it
> started, but pacemaker somehow think he is «Timed Out» . What i am doing
> wrong? Logs below.
>
>
> Jul 21 15:53:41 node2.local pacemaker-controld[1813]:  notice: Result of
> probe operation for dummy.service on node2.local: 7 (not running)
> Jul 21 15:53:41 node2.local systemd[1]: Reloading.
> Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/dbus.socket:5:
> ListenStream= references a path below legacy directory /var/run/, updating
> /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please
> update the unit file accordingly.
> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/docker.socket:6: ListenStream= references a path below
> legacy directory /var/run/, updating /var/run/docker.sock →
> /run/docker.sock; please update the unit file accordingly.
> Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up on
> dummy.service start (rc=0): timeout (elapsed=259719ms, remaining=-159719ms)
> Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result of
> start operation for dummy.service on node2.local: Timed Out
> Jul 21 15:53:42 node2.local systemd[1]: Started Cluster Controlled dummy.
> Jul 21 15:53:42 node2.local dummy[9330]: hello world 1
> Jul 21 15:53:42 node2.local systemd-udevd[922]: Network interface
> NamePolicy= disabled on kernel command line, ignoring.
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> fail-count-dummy.service#start_0[node2.local]: (unset) -> INFINITY
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> last-failure-dummy.service#start_0[node2.local]: (unset) -> 1595336022
> Jul 21 15:53:42 node2.local systemd[1]: Reloading.
> Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/dbus.socket:5:
> ListenStream= references a path below legacy directory /var/run/, updating
> /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please
> update the unit file accordingly.
> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/docker.socket:6: ListenStream= references a path below
> legacy directory /var/run/, updating /var/run/docker.sock →
> /run/docker.sock; please update the unit file accordingly.
> Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up on
> dummy.service stop (rc=0): timeout (elapsed=317181ms, remaining=-217181ms)
>

317181ms == 5 minutes. Barring pacemaker bug, you need to show pacemaker
log since the very first start operation so we can see proper timing.
Seeing that systemd was reloaded in between, it is quite possible that
systemd lost track of pending job so any client waiting for confirmation
hangs forever. Such problems were known, not sure what current status is
(if it ever was fixed).



> Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result of
> stop operation for dummy.service on node2.local: Timed Out
> Jul 21 15:53:42 node2.local systemd[1]: Stopping Daemon for dummy...
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> fail-count-dummy.service#stop_0[node2.local]: (unset) -> INFINITY
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> last-failure-dummy.service#stop_0[node2.local]: (unset) -> 1595336022
> Jul 21 15:53:42 node2.local systemd[1]: dummy.service: Succeeded.
> Jul 21 15:53:42 node2.local systemd[1]: Stopped Daemon for dummy.
> ... lost connection (node rebooting)
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Хиль Эдуард

Hi Klaus! Thank you for your attention, but it doesn't work. I have added
Type=simple and there is no change. I think the problem is not in the service.
As we can see from the logs, the service is starting (Jul 21 15:53:42
node2.local dummy[9330]: hello world 1) but for some reason pacemaker doesn't
see it (Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result of
stop operation for dummy.service on node2.local: Timed Out) and it draws its
conclusions within 2 seconds (from 15:53:41 to 15:53:42), and I have no idea
what to do :(

  
>Wednesday, July 22, 2020, 13:15 +05:00 from Klaus Wenninger :
> 
>On 7/22/20 9:59 AM, Хиль Эдуард wrote:
>>Hi there! I have 2 nodes with Pacemaker 2.0.3, corosync 3.0.3 on ubuntu 20 + 
>>1 qdevice. I want to define new resource as systemd unit dummy.service:
>> 
>>[Unit]
>>Description=Dummy
>>[Service] 
>Type=simple
>
>That could do the trick. Actually I thought simple would be
>the default but ...
>
>Klaus
>>Restart=on-failure
>>StartLimitInterval=20
>>StartLimitBurst=5
>>TimeoutStartSec=0
>>RestartSec=5
>>Environment="HOME=/root"
>>SyslogIdentifier=dummy
>>ExecStart=/usr/local/sbin/dummy.sh
>>[Install]
>>WantedBy=multi-user.target
>> 
>>and /usr/local/sbin/dummy.sh :
>> 
>>#!/bin/bash
>>CNT=0
>>while true; do
>>  let CNT++
>>  echo "hello world $CNT"
>>  sleep 5
>>done
>> 
>>and then i try to define it with: pcs resource create dummy.service 
>>systemd:dummy op monitor interval="10s" timeout="15s"
>>after 2 seconds node2 reboot. In logs i see pacemaker in 2 seconds tried to 
>>start this unit, and it started, but pacemaker somehow think he is «Timed 
>>Out» . What i am doing wrong? Logs below.
>> 
>> 
>>Jul 21 15:53:41 node2.local pacemaker-controld[1813]:  notice: Result of 
>>probe operation for dummy.service on node2.local: 7 (not running) 
>>Jul 21 15:53:41 node2.local systemd[1]: Reloading.
>>Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/dbus.socket:5: 
>>ListenStream= references a path below legacy directory /var/run/, updating 
>>/var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please update 
>>the unit file accordingly.
>>Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/docker.socket:6: 
>>ListenStream= references a path below legacy directory /var/run/, updating 
>>/var/run/docker.sock → /run/docker.sock; please update the unit file 
>>accordingly.
>>Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up on 
>>dummy.service start (rc=0): timeout (elapsed=259719ms, remaining=-159719ms)
>>Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result of start 
>>operation for dummy.service on node2.local: Timed Out 
>>Jul 21 15:53:42 node2.local systemd[1]: Started Cluster Controlled dummy.
>>Jul 21 15:53:42 node2.local dummy[9330]: hello world 1
>>Jul 21 15:53:42 node2.local systemd-udevd[922]: Network interface NamePolicy= 
>>disabled on kernel command line, ignoring.
>>Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting 
>>fail-count-dummy.service#start_0[node2.local]: (unset) -> INFINITY 
>>Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting 
>>last-failure-dummy.service#start_0[node2.local]: (unset) -> 1595336022 
>>Jul 21 15:53:42 node2.local systemd[1]: Reloading.
>>Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/dbus.socket:5: 
>>ListenStream= references a path below legacy directory /var/run/, updating 
>>/var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please update 
>>the unit file accordingly.
>>Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/docker.socket:6: 
>>ListenStream= references a path below legacy directory /var/run/, updating 
>>/var/run/docker.sock → /run/docker.sock; please update the unit file 
>>accordingly.
>>Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up on 
>>dummy.service stop (rc=0): timeout (elapsed=317181ms, remaining=-217181ms)
>>Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result of stop 
>>operation for dummy.service on node2.local: Timed Out 
>>Jul 21 15:53:42 node2.local systemd[1]: Stopping Daemon for dummy...
>>Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting 
>>fail-count-dummy.service#stop_0[node2.local]: (unset) -> INFINITY 
>>Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting 
>>last-failure-dummy.service#stop_0[node2.local]: (unset) -> 1595336022 
>>Jul 21 15:53:42 node2.local systemd[1]: dummy.service: Succeeded.
>>Jul 21 15:53:42 node2.local systemd[1]: Stopped Daemon for dummy.
>>... lost connection (node rebooting)
>> 
>>   
>> 
>>___
>>Manage your subscription:
>>https://lists.clusterlabs.org/mailman/listinfo/users
>>
>>ClusterLabs home: https://www.clusterlabs.org/
>> 
 
   
 ___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Хиль Эдуард

Hey, Andrei! Thanks for your time!
And is there really no chance to do something about it? :(
The pacemaker log is below.
 
 
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_process_request)    
 info: Forwarding cib_apply_diff operation for section 'all' to all 
(origin=local/cibadmin/2)
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: Diff: --- 0.130.94 2
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: Diff: +++ 0.131.0 29b403fcf3c8d30705dceac1ba701963
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: +  /cib:  @epoch=131, @num_updates=0
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: ++ /cib/configuration/resources:  
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: ++                                  
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: ++                                    
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: ++                                    
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: ++                                    
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: ++                                  
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: ++                                
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_process_request)    
 info: Completed cib_apply_diff operation for section 'all': OK (rc=0, 
origin=node2.local/cibadmin/2, version=0.131.0)
Jul 22 12:38:36 node2.local pacemaker-fenced    [1720] 
(update_cib_stonith_devices_v2)     info: Updating device list from the cib: 
create resources
Jul 22 12:38:36 node2.local pacemaker-fenced    [1720] (cib_devices_update)     
info: Updating devices to version 0.131.0
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_file_backup)     
info: Archived previous version as /var/lib/pacemaker/cib/cib-20.raw
Jul 22 12:38:36 node2.local pacemaker-fenced    [1720] (cib_device_update)     
info: Device mfs4.stonith has been disabled on node2.local: score=-INFINITY
Jul 22 12:38:36 node2.local pacemaker-based     [1719] 
(cib_file_write_with_digest)     info: Wrote version 0.131.0 of the CIB to disk 
(digest: 8a11f99f10fb5b69aee4da9460d9134b)
Jul 22 12:38:36 node2.local pacemaker-based     [1719] 
(cib_file_write_with_digest)     info: Reading cluster configuration file 
/var/lib/pacemaker/cib/cib.Ap1Vqv (digest: /var/lib/pacemaker/cib/cib.aiva1s)
Jul 22 12:38:36 node2.local pacemaker-execd     [1721] 
(process_lrmd_get_rsc_info)     info: Agent information for 'dummy.service' not 
in cache
Jul 22 12:38:36 node2.local pacemaker-execd     [1721] 
(process_lrmd_rsc_register)     info: Cached agent information for 
'dummy.service'
Jul 22 12:38:36 node2.local pacemaker-controld  [1724] (do_lrm_rsc_op)     
info: Performing key=13:23:7:76f4932e-716b-45b8-8fed-a20c3806df8a 
op=dummy.service_monitor_0
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_process_request)    
 info: Forwarding cib_modify operation for section status to all 
(origin=local/crmd/60)
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: Diff: --- 0.131.0 2
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: Diff: +++ 0.131.1 (null)
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: +  /cib:  @num_updates=1
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: ++ /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources:  

Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: ++                                                                

Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_process_request)    
 info: Completed cib_modify operation for section status: OK (rc=0, 
origin=node1.local/crmd/240, version=0.131.1)
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: Diff: --- 0.131.1 2
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: Diff: +++ 0.131.2 (null)
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: +  /cib:  @num_updates=2
Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources:  

Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
info: ++                                                                

Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_process_request)    
 info: Completed cib_modify operation for section status: OK (rc=0, 
origin=node2.local/crmd/60, version=0.131.2)
Jul 22 12:38:36 node2.local pacemaker-controld  [1724] (process_lrm_event)     
notice: R

Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Ken Gaillot
On Wed, 2020-07-22 at 10:59 +0300, Хиль  Эдуард wrote:
> Hi there! I have 2 nodes with Pacemaker 2.0.3, corosync 3.0.3 on
> ubuntu 20 + 1 qdevice. I want to define new resource as systemd
> unit dummy.service :
>  
> [Unit]
> Description=Dummy
> [Service]
> Restart=on-failure
> StartLimitInterval=20
> StartLimitBurst=5
> TimeoutStartSec=0
> RestartSec=5
> Environment="HOME=/root"
> SyslogIdentifier=dummy
> ExecStart=/usr/local/sbin/dummy.sh
> [Install]
> WantedBy=multi-user.target
>  
> and /usr/local/sbin/dummy.sh :
>  
> #!/bin/bash
> CNT=0
> while true; do
>   let CNT++
>   echo "hello world $CNT"
>   sleep 5
> done
>  
> and then i try to define it with: pcs resource create dummy.service
> systemd:dummy op monitor interval="10s" timeout="15s"
> after 2 seconds node2 reboot. In logs i see pacemaker in 2 seconds
> tried to start this unit, and it started, but pacemaker somehow think
> he is «Timed Out» . What i am doing wrong? Logs below.

The start is timing out because the ExecStart script never returns.

systemd starts processes but it doesn't daemonize them -- the script is
responsible for doing that itself. You can search online for more
details about daemonization, but most importantly you want to run your
daemon as a subprocess in the background and have your main process
return as soon as the daemon is ready for service.


> Jul 21 15:53:41 node2.local pacemaker-controld[1813]:  notice: Result
> of probe operation for dummy.service on node2.local: 7 (not running) 
> Jul 21 15:53:41 node2.local systemd[1]: Reloading.
> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/dbus.socket:5: ListenStream= references a path
> below legacy directory /var/run/, updating
> /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please
> update the unit file accordingly.
> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/docker.socket:6: ListenStream= references a path
> below legacy directory /var/run/, updating /var/run/docker.sock →
> /run/docker.sock; please update the unit file accordingly.
> Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up
> on dummy.service start (rc=0): timeout (elapsed=259719ms, remaining=-
> 159719ms)
> Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result
> of start operation for dummy.service on node2.local: Timed Out 
> Jul 21 15:53:42 node2.local systemd[1]: Started Cluster Controlled
> dummy.
> Jul 21 15:53:42 node2.local dummy[9330]: hello world 1
> Jul 21 15:53:42 node2.local systemd-udevd[922]: Network interface
> NamePolicy= disabled on kernel command line, ignoring.
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> fail-count-dummy.service#start_0[node2.local]: (unset) -> INFINITY 
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> last-failure-dummy.service#start_0[node2.local]: (unset) ->
> 1595336022 
> Jul 21 15:53:42 node2.local systemd[1]: Reloading.
> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/dbus.socket:5: ListenStream= references a path
> below legacy directory /var/run/, updating
> /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please
> update the unit file accordingly.
> Jul 21 15:53:42 node2.local systemd[1]:
> /lib/systemd/system/docker.socket:6: ListenStream= references a path
> below legacy directory /var/run/, updating /var/run/docker.sock →
> /run/docker.sock; please update the unit file accordingly.
> Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up
> on dummy.service stop (rc=0): timeout (elapsed=317181ms, remaining=-
> 217181ms)
> Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result
> of stop operation for dummy.service on node2.local: Timed Out 
> Jul 21 15:53:42 node2.local systemd[1]: Stopping Daemon for dummy...
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> fail-count-dummy.service#stop_0[node2.local]: (unset) -> INFINITY 
> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> last-failure-dummy.service#stop_0[node2.local]: (unset) ->
> 1595336022 
> Jul 21 15:53:42 node2.local systemd[1]: dummy.service: Succeeded.
> Jul 21 15:53:42 node2.local systemd[1]: Stopped Daemon for dummy.
> ... lost connection (node rebooting)
>  
>  
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Andrei Borzenkov
On Wed, Jul 22, 2020 at 4:58 PM Ken Gaillot  wrote:

> On Wed, 2020-07-22 at 10:59 +0300, Хиль  Эдуард wrote:
> > Hi there! I have 2 nodes with Pacemaker 2.0.3, corosync 3.0.3 on
> > ubuntu 20 + 1 qdevice. I want to define new resource as systemd
> > unit dummy.service :
> >
> > [Unit]
> > Description=Dummy
> > [Service]
> > Restart=on-failure
> > StartLimitInterval=20
> > StartLimitBurst=5
> > TimeoutStartSec=0
> > RestartSec=5
> > Environment="HOME=/root"
> > SyslogIdentifier=dummy
> > ExecStart=/usr/local/sbin/dummy.sh
> > [Install]
> > WantedBy=multi-user.target
> >
> > and /usr/local/sbin/dummy.sh :
> >
> > #!/bin/bash
> > CNT=0
> > while true; do
> >   let CNT++
> >   echo "hello world $CNT"
> >   sleep 5
> > done
> >
> > and then i try to define it with: pcs resource create dummy.service
> > systemd:dummy op monitor interval="10s" timeout="15s"
> > after 2 seconds node2 reboot. In logs i see pacemaker in 2 seconds
> > tried to start this unit, and it started, but pacemaker somehow think
> > he is «Timed Out» . What i am doing wrong? Logs below.
>
> The start is timing out because the ExecStart script never returns.
>
>
Type=simple does not expect the script to go into the background. Quite the
contrary - systemd expects the ExecStart command to keep running in the
foreground; going into the background would be interpreted as "service
terminated".

To quote systemd: "the service manager will consider the unit started
immediately after the main service process has been forked off. It is
expected that the process configured with ExecStart= is the main process of
the service".



> systemd starts processes but it doesn't daemonize them -- the script is
> responsible for doing that itself.


Only for Type=forking
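
For contrast, a rough sketch (a hypothetical wrapper, not the poster's unit)
of what Type=forking would expect: ExecStart has to put the real work into
the background and exit, leaving a PID file behind.

[Service]
Type=forking
PIDFile=/run/dummy.pid
ExecStart=/usr/local/sbin/dummy-wrapper.sh

# where dummy-wrapper.sh does something like:
#   /usr/local/sbin/dummy.sh &
#   echo $! > /run/dummy.pid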



> You can search online for more
> details about daemonization, but most importantly you want to run your
> daemon as a subprocess in the background and have your main process
> return as soon as the daemon is ready for service.
>
>
> > Jul 21 15:53:41 node2.local pacemaker-controld[1813]:  notice: Result
> > of probe operation for dummy.service on node2.local: 7 (not running)
> > Jul 21 15:53:41 node2.local systemd[1]: Reloading.
> > Jul 21 15:53:42 node2.local systemd[1]:
> > /lib/systemd/system/dbus.socket:5: ListenStream= references a path
> > below legacy directory /var/run/, updating
> > /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please
> > update the unit file accordingly.
> > Jul 21 15:53:42 node2.local systemd[1]:
> > /lib/systemd/system/docker.socket:6: ListenStream= references a path
> > below legacy directory /var/run/, updating /var/run/docker.sock →
> > /run/docker.sock; please update the unit file accordingly.
> > Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up
> > on dummy.service start (rc=0): timeout (elapsed=259719ms, remaining=-
> > 159719ms)
> > Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result
> > of start operation for dummy.service on node2.local: Timed Out
> > Jul 21 15:53:42 node2.local systemd[1]: Started Cluster Controlled
> > dummy.
> > Jul 21 15:53:42 node2.local dummy[9330]: hello world 1
> > Jul 21 15:53:42 node2.local systemd-udevd[922]: Network interface
> > NamePolicy= disabled on kernel command line, ignoring.
> > Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> > fail-count-dummy.service#start_0[node2.local]: (unset) -> INFINITY
> > Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> > last-failure-dummy.service#start_0[node2.local]: (unset) ->
> > 1595336022
> > Jul 21 15:53:42 node2.local systemd[1]: Reloading.
> > Jul 21 15:53:42 node2.local systemd[1]:
> > /lib/systemd/system/dbus.socket:5: ListenStream= references a path
> > below legacy directory /var/run/, updating
> > /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please
> > update the unit file accordingly.
> > Jul 21 15:53:42 node2.local systemd[1]:
> > /lib/systemd/system/docker.socket:6: ListenStream= references a path
> > below legacy directory /var/run/, updating /var/run/docker.sock →
> > /run/docker.sock; please update the unit file accordingly.
> > Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up
> > on dummy.service stop (rc=0): timeout (elapsed=317181ms, remaining=-
> > 217181ms)
> > Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result
> > of stop operation for dummy.service on node2.local: Timed Out
> > Jul 21 15:53:42 node2.local systemd[1]: Stopping Daemon for dummy...
> > Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> > fail-count-dummy.service#stop_0[node2.local]: (unset) -> INFINITY
> > Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting
> > last-failure-dummy.service#stop_0[node2.local]: (unset) ->
> > 1595336022
> > Jul 21 15:53:42 node2.local systemd[1]: dummy.service: Succeeded.
> > Jul 21 15:53:42 node2.local systemd[1]: Stopped Daemon for dummy.
> > ... lost connection (node rebooting)
> >
> >
> > _

Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Andrei Borzenkov
On 22.07.2020 12:46, Хиль Эдуард wrote:
> 
> Hey, Andrei! Thanx for ur time!
> A-a-and there is no chance to do something? :( 
> The pacemaker’s log below.
>  

Resource was started:

...
> Jul 22 12:38:36 node2.local pacemaker-execd     [1721] (log_execute)     
> info: executing - rsc:dummy.service action:start call_id:76
> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: Diff: --- 0.131.4 2
> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: Diff: +++ 0.131.5 (null)
> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: +  /cib:  @num_updates=5
> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: +  
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']:
>   @operation_key=dummy.service_start_0, @operation=start, 
> @transition-key=164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, 
> @transition-magic=-1:193;164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, 
> @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1595410716, 
> @last-run=1595410716, @e
> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_process_request)  
>    info: Completed cib_modify operation for section status: OK (rc=0, 
> origin=node2.local/crmd/62, version=0.131.5)
> Jul 22 12:38:36 node2.local pacemaker-execd     [1721] (systemd_exec_result)  
>    info: Call to start passed: /org/freedesktop/systemd1/job/703
> Jul 22 12:38:38 node2.local pacemaker-controld  [1724] (process_lrm_event)    
>  notice: Result of start operation for dummy.service on node2.local: 0 (ok) | 
> call=76 key=dummy.service_start_0 confirmed=true cib-update=63

So start operation at least was successfully completed.

> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request)  
>    info: Forwarding cib_modify operation for section status to all 
> (origin=local/crmd/63)
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: Diff: --- 0.131.5 2
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: Diff: +++ 0.131.6 (null)
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: +  /cib:  @num_updates=6
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: +  
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']:
>   @transition-magic=0:0;164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, 
> @call-id=76, @rc-code=0, @op-status=0, @last-rc-change=1986, @last-run=1986, 
> @exec-time=-587720, @queue-time=59
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request)  
>    info: Completed cib_modify operation for section status: OK (rc=0, 
> origin=node2.local/crmd/63, version=0.131.6)
> Jul 22 12:38:38 node2.local pacemaker-controld  [1724] (do_lrm_rsc_op)     
> info: Performing key=165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a 
> op=dummy.service_monitor_6
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request)  
>    info: Forwarding cib_modify operation for section status to all 
> (origin=local/crmd/64)
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: Diff: --- 0.131.6 2
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: Diff: +++ 0.131.7 (null)
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: +  /cib:  @num_updates=7
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: ++ 
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']:
>    operation_key="dummy.service_monitor_6" operation="monitor" 
> crm-debug-origin="do_update_resource" crm_feature_set="3.2.0" 
> transition-key="165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a" 
> transition-magic="-1:193;165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a" 
> exit-reason="" on_
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request)  
>    info: Completed cib_modify operation for section status: OK (rc=0, 
> origin=node2.local/crmd/64, version=0.131.7)
> Jul 22 12:38:38 node2.local pacemaker-controld  [1724] (process_lrm_event)    
>  notice: Result of monitor operation for dummy.service on node2.local: 0 (ok) 
> | call=77 key=dummy.service_monitor_6 confirmed=false cib-update=65

And monitor confirmed that it was started

> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request)  
>    info: Forwarding cib_modify operation for section status to all 
> (origin=local/crmd/65)
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: Diff: --- 0.131.7 2
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
> info: Diff: +++ 0.131.8 (null)
> Jul 22 1

Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Reid Wahl
On Wed, Jul 22, 2020 at 10:57 AM Andrei Borzenkov 
wrote:

> On 22.07.2020 12:46, Хиль Эдуард wrote:
> >
> > Hey, Andrei! Thanx for ur time!
> > A-a-and there is no chance to do something? :(
> > The pacemaker’s log below.
> >
>
> Resource was started:
>
> ...
> > Jul 22 12:38:36 node2.local pacemaker-execd [1721] (log_execute)
>  info: executing - rsc:dummy.service action:start call_id:76
> > Jul 22 12:38:36 node2.local pacemaker-based [1719] (cib_perform_op)
> info: Diff: --- 0.131.4 2
> > Jul 22 12:38:36 node2.local pacemaker-based [1719] (cib_perform_op)
> info: Diff: +++ 0.131.5 (null)
> > Jul 22 12:38:36 node2.local pacemaker-based [1719] (cib_perform_op)
> info: +  /cib:  @num_updates=5
> > Jul 22 12:38:36 node2.local pacemaker-based [1719] (cib_perform_op)
> info: +
>  
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']:
>  @operation_key=dummy.service_start_0, @operation=start,
> @transition-key=164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a,
> @transition-magic=-1:193;164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a,
> @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1595410716,
> @last-run=1595410716, @e
> > Jul 22 12:38:36 node2.local pacemaker-based [1719]
> (cib_process_request) info: Completed cib_modify operation for section
> status: OK (rc=0, origin=node2.local/crmd/62, version=0.131.5)
> > Jul 22 12:38:36 node2.local pacemaker-execd [1721]
> (systemd_exec_result) info: Call to start passed:
> /org/freedesktop/systemd1/job/703
> > Jul 22 12:38:38 node2.local pacemaker-controld  [1724]
> (process_lrm_event) notice: Result of start operation for dummy.service
> on node2.local: 0 (ok) | call=76 key=dummy.service_start_0 confirmed=true
> cib-update=63
>
> So start operation at least was successfully completed.
>
> > Jul 22 12:38:38 node2.local pacemaker-based [1719]
> (cib_process_request) info: Forwarding cib_modify operation for section
> status to all (origin=local/crmd/63)
> > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op)
> info: Diff: --- 0.131.5 2
> > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op)
> info: Diff: +++ 0.131.6 (null)
> > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op)
> info: +  /cib:  @num_updates=6
> > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op)
> info: +
>  
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']:
>  @transition-magic=0:0;164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a,
> @call-id=76, @rc-code=0, @op-status=0, @last-rc-change=1986,
> @last-run=1986, @exec-time=-587720, @queue-time=59
> > Jul 22 12:38:38 node2.local pacemaker-based [1719]
> (cib_process_request) info: Completed cib_modify operation for section
> status: OK (rc=0, origin=node2.local/crmd/63, version=0.131.6)
> > Jul 22 12:38:38 node2.local pacemaker-controld  [1724] (do_lrm_rsc_op)
> info: Performing key=165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a
> op=dummy.service_monitor_6
> > Jul 22 12:38:38 node2.local pacemaker-based [1719]
> (cib_process_request) info: Forwarding cib_modify operation for section
> status to all (origin=local/crmd/64)
> > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op)
> info: Diff: --- 0.131.6 2
> > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op)
> info: Diff: +++ 0.131.7 (null)
> > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op)
> info: +  /cib:  @num_updates=7
> > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op)
> info: ++
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']:
>   operation_key="dummy.service_monitor_6" operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.2.0"
> transition-key="165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a"
> transition-magic="-1:193;165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a"
> exit-reason="" on_
> > Jul 22 12:38:38 node2.local pacemaker-based [1719]
> (cib_process_request) info: Completed cib_modify operation for section
> status: OK (rc=0, origin=node2.local/crmd/64, version=0.131.7)
> > Jul 22 12:38:38 node2.local pacemaker-controld  [1724]
> (process_lrm_event) notice: Result of monitor operation for
> dummy.service on node2.local: 0 (ok) | call=77
> key=dummy.service_monitor_6 confirmed=false cib-update=65
>
> And monitor confirmed that it was started
>
> > Jul 22 12:38:38 node2.local pacemaker-based [1719]
> (cib_process_request) info: Forwarding cib_modify operation for section
> status to all (origin=local/crmd/65)
> > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op)
> info: Diff: --- 0.131.7 2
> > Jul 22 12:38:38 node2.local p

Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Ken Gaillot
On Wed, 2020-07-22 at 17:04 +0300, Andrei Borzenkov wrote:
> 
> 
> On Wed, Jul 22, 2020 at 4:58 PM Ken Gaillot 
> wrote:
> > On Wed, 2020-07-22 at 10:59 +0300, Хиль  Эдуард wrote:
> > > Hi there! I have 2 nodes with Pacemaker 2.0.3, corosync 3.0.3 on
> > > ubuntu 20 + 1 qdevice. I want to define new resource as systemd
> > > unit dummy.service :
> > >  
> > > [Unit]
> > > Description=Dummy
> > > [Service]
> > > Restart=on-failure
> > > StartLimitInterval=20
> > > StartLimitBurst=5
> > > TimeoutStartSec=0
> > > RestartSec=5
> > > Environment="HOME=/root"
> > > SyslogIdentifier=dummy
> > > ExecStart=/usr/local/sbin/dummy.sh
> > > [Install]
> > > WantedBy=multi-user.target
> > >  
> > > and /usr/local/sbin/dummy.sh :
> > >  
> > > #!/bin/bash
> > > CNT=0
> > > while true; do
> > >   let CNT++
> > >   echo "hello world $CNT"
> > >   sleep 5
> > > done
> > >  
> > > and then i try to define it with: pcs resource create
> > dummy.service
> > > systemd:dummy op monitor interval="10s" timeout="15s"
> > > after 2 seconds node2 reboot. In logs i see pacemaker in 2
> > seconds
> > > tried to start this unit, and it started, but pacemaker somehow
> > think
> > > he is «Timed Out» . What i am doing wrong? Logs below.
> > 
> > The start is timing out because the ExecStart script never returns.
> > 
> 
> Type=simple does not expect the script to go into the background. Quite the
> contrary: systemd expects the ExecStart command to remain in the foreground;
> going into the background would be interpreted as "service terminated".
> 
> To quote systemd: "the service manager will consider the unit started
> immediately after the main service process has been forked off. It is
> expected that the process configured with ExecStart= is the main
> process of the service".
> 
>  
> > systemd starts processes but it doesn't daemonize them -- the
> > script is
> > responsible for doing that itself. 
> 
> Only for Type=forking

Ah, my bad, sorry for the noise :)
 
> > You can search online for more
> > details about daemonization, but most importantly you want to run
> > your
> > daemon as a subprocess in the background and have your main process
> > return as soon as the daemon is ready for service.
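
To illustrate the distinction discussed above (hypothetical snippets, not the
poster's actual files): with Type=simple the ExecStart command is expected to
stay in the foreground, which is what dummy.sh already does. Only a
Type=forking unit would need the start command to daemonize and return, along
these lines:

    [Unit]
    Description=Dummy (hypothetical forking variant)

    [Service]
    Type=forking
    PIDFile=/run/dummy.pid
    ExecStart=/usr/local/sbin/dummy-daemonize.sh

    [Install]
    WantedBy=multi-user.target

with /usr/local/sbin/dummy-daemonize.sh being a hypothetical wrapper such as:

    #!/bin/bash
    /usr/local/sbin/dummy.sh &      # run the worker in the background
    echo $! > /run/dummy.pid        # record its PID for systemd
    exit 0                          # return so the forking-style start completes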
> > 
> > 
> > > Jul 21 15:53:41 node2.local pacemaker-controld[1813]:  notice:
> > Result
> > > of probe operation for dummy.service on node2.local: 7 (not
> > running) 
> > > Jul 21 15:53:41 node2.local systemd[1]: Reloading.
> > > Jul 21 15:53:42 node2.local systemd[1]:
> > > /lib/systemd/system/dbus.socket:5: ListenStream= references a
> > path
> > > below legacy directory /var/run/, updating
> > > /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket;
> > please
> > > update the unit file accordingly.
> > > Jul 21 15:53:42 node2.local systemd[1]:
> > > /lib/systemd/system/docker.socket:6: ListenStream= references a
> > path
> > > below legacy directory /var/run/, updating /var/run/docker.sock →
> > > /run/docker.sock; please update the unit file accordingly.
> > > Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice:
> > Giving up
> > > on dummy.service start (rc=0): timeout (elapsed=259719ms,
> > remaining=-
> > > 159719ms)
> > > Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error:
> > Result
> > > of start operation for dummy.service on node2.local: Timed Out 
> > > Jul 21 15:53:42 node2.local systemd[1]: Started Cluster
> > Controlled
> > > dummy.
> > > Jul 21 15:53:42 node2.local dummy[9330]: hello world 1
> > > Jul 21 15:53:42 node2.local systemd-udevd[922]: Network interface
> > > NamePolicy= disabled on kernel command line, ignoring.
> > > Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice:
> > Setting
> > > fail-count-dummy.service#start_0[node2.local]: (unset) ->
> > INFINITY 
> > > Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice:
> > Setting
> > > last-failure-dummy.service#start_0[node2.local]: (unset) ->
> > > 1595336022 
> > > Jul 21 15:53:42 node2.local systemd[1]: Reloading.
> > > Jul 21 15:53:42 node2.local systemd[1]:
> > > /lib/systemd/system/dbus.socket:5: ListenStream= references a
> > path
> > > below legacy directory /var/run/, updating
> > > /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket;
> > please
> > > update the unit file accordingly.
> > > Jul 21 15:53:42 node2.local systemd[1]:
> > > /lib/systemd/system/docker.socket:6: ListenStream= references a
> > path
> > > below legacy directory /var/run/, updating /var/run/docker.sock →
> > > /run/docker.sock; please update the unit file accordingly.
> > > Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice:
> > Giving up
> > > on dummy.service stop (rc=0): timeout (elapsed=317181ms,
> > remaining=-
> > > 217181ms)
> > > Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error:
> > Result
> > > of stop operation for dummy.service on node2.local: Timed Out 
> > > Jul 21 15:53:42 node2.local systemd[1]: Stopping Daemon for
> > dummy...
> > > Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice:
> > Setting
> > > fail-count-du

Re: [ClusterLabs] pacemaker systemd resource

2020-07-23 Thread Хиль Эдуард


Thanks Andrei, and to all of you guys, for your time; I appreciate it!

Yeah, it’s very sad to see that. Looks like a bug described here:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1869751
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1881762
Well, for me there is no way forward but to change the OS from Ubuntu to something 
else, because I am very disappointed that there are such critical bugs :(
  
>Среда, 22 июля 2020, 22:57 +05:00 от Andrei Borzenkov :
> 
>22.07.2020 12:46, Хиль Эдуард пишет:
>>
>> Hey, Andrei! Thanx for ur time!
>> A-a-and there is no chance to do something? :( 
>> The pacemaker’s log below.
>>  
>
>Resource was started:
>
>...
>> Jul 22 12:38:36 node2.local pacemaker-execd     [1721] (log_execute)     
>> info: executing - rsc:dummy.service action:start call_id:76
>> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: Diff: --- 0.131.4 2
>> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: Diff: +++ 0.131.5 (null)
>> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: +  /cib:  @num_updates=5
>> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: +  
>> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']:
>>  @operation_key=dummy.service_start_0, @operation=start, 
>> @transition-key=164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, 
>> @transition-magic=-1:193;164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, 
>> @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1595410716, 
>> @last-run=1595410716, @e
>> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_process_request) 
>>     info: Completed cib_modify operation for section status: OK (rc=0, 
>> origin=node2.local/crmd/62, version=0.131.5)
>> Jul 22 12:38:36 node2.local pacemaker-execd     [1721] (systemd_exec_result) 
>>     info: Call to start passed: /org/freedesktop/systemd1/job/703
>> Jul 22 12:38:38 node2.local pacemaker-controld  [1724] (process_lrm_event)   
>>   notice: Result of start operation for dummy.service on node2.local: 0 (ok) 
>> | call=76 key=dummy.service_start_0 confirmed=true cib-update=63
>
>So start operation at least was successfully completed.
>
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request) 
>>     info: Forwarding cib_modify operation for section status to all 
>> (origin=local/crmd/63)
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: Diff: --- 0.131.5 2
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: Diff: +++ 0.131.6 (null)
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: +  /cib:  @num_updates=6
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: +  
>> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']:
>>   @transition-magic=0:0;164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, 
>> @call-id=76, @rc-code=0, @op-status=0, @last-rc-change=1986, @last-run=1986, 
>> @exec-time=-587720, @queue-time=59
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request) 
>>     info: Completed cib_modify operation for section status: OK (rc=0, 
>> origin=node2.local/crmd/63, version=0.131.6)
>> Jul 22 12:38:38 node2.local pacemaker-controld  [1724] (do_lrm_rsc_op)     
>> info: Performing key=165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a 
>> op=dummy.service_monitor_6
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request) 
>>     info: Forwarding cib_modify operation for section status to all 
>> (origin=local/crmd/64)
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: Diff: --- 0.131.6 2
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: Diff: +++ 0.131.7 (null)
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: +  /cib:  @num_updates=7
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     
>> info: ++ 
>> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']:
>>   > operation_key="dummy.service_monitor_6" operation="monitor" 
>> crm-debug-origin="do_update_resource" crm_feature_set="3.2.0" 
>> transition-key="165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a" 
>> transition-magic="-1:193;165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a" 
>> exit-reason="" on_
>> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request) 
>>     info: Completed cib_modify operation for section status: OK (rc=0, 
>> origin=node2.local/crmd/64, version=0.131.7)
>> Jul 22 12:38:38 node2.local pacemaker-controld  [1724] (process_lrm_event)   
>>   notice: Result of monitor operation for dummy.service on node

Re: [ClusterLabs] pacemaker startup problem

2020-07-24 Thread Ken Gaillot
On Fri, 2020-07-24 at 18:34 +0200, Gabriele Bulfon wrote:
> Hello,
>  
> after a long time I'm back to run heartbeat/pacemaker/corosync on our
> XStreamOS/illumos distro.
> I rebuilt the original components I did in 2016 on our latest release
> (probably a bit outdated, but I want to start from where I left).
> Looks like pacemaker is having trouble starting up showin this logs:
> 
> Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
> Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
> Jul 24 18:21:32 [971] crmd: info: crm_log_init: Changed active
> directory to /sonicle/var/cluster/lib/pacemaker/cores
> Jul 24 18:21:32 [971] crmd: info: main: CRM Git Version: 1.1.15
> (e174ec8)
> Jul 24 18:21:32 [971] crmd: info: do_log: Input I_STARTUP received in
> state S_STARTING from crmd_init
> Jul 24 18:21:32 [969] lrmd: info: crm_log_init: Changed active
> directory to /sonicle/var/cluster/lib/pacemaker/cores
> Jul 24 18:21:32 [968] stonith-ng: info: crm_log_init: Changed active
> directory to /sonicle/var/cluster/lib/pacemaker/cores
> Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Verifying
> cluster type: 'heartbeat'
> Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Assuming an
> active 'heartbeat' cluster
> Jul 24 18:21:32 [968] stonith-ng: notice: crm_cluster_connect:
> Connecting to cluster infrastructure: heartbeat


> Jul 24 18:21:32 [969] lrmd: error: mainloop_add_ipc_server: Could not
> start lrmd IPC server: Operation not supported (-48)

This is repeated for all the subdaemons ... the error is coming from
qb_ipcs_run(), which suggests the issue is an invalid PCMK_ipc_type
for illumos. If you set it to "socket" it should work.
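
For example, one common way to set this persistently is in the Pacemaker
environment file; the path below is the usual Linux location and only an
assumption for an illumos build (there it may instead belong in the SMF
service environment):

    # e.g. /etc/sysconfig/pacemaker (or wherever your build reads its environment)
    PCMK_ipc_type=socket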


> Jul 24 18:21:32 [969] lrmd: error: main: Failed to create IPC server:
> shutting down and inhibiting respawn
> Jul 24 18:21:32 [969] lrmd: info: crm_xml_cleanup: Cleaning up memory
> from libxml2
> Jul 24 18:21:32 [971] crmd: info: get_cluster_type: Verifying cluster
> type: 'heartbeat'
> Jul 24 18:21:32 [971] crmd: info: get_cluster_type: Assuming an
> active 'heartbeat' cluster
> Jul 24 18:21:32 [971] crmd: info: start_subsystem: Starting sub-
> system "pengine"
> Jul 24 18:21:32 [968] stonith-ng: info: crm_get_peer: Created entry
> 25bc5492-a49e-40d7-ae60-fd8f975a294a/80886f0 for node xstorage1/0 (1
> total)
> Jul 24 18:21:32 [968] stonith-ng: info: crm_get_peer: Node 0 has uuid
> d426a730-5229-6758-853a-99d4d491514a
> Jul 24 18:21:32 [968] stonith-ng: info: register_heartbeat_conn:
> Hostname: xstorage1
> Jul 24 18:21:32 [968] stonith-ng: info: register_heartbeat_conn:
> UUID: d426a730-5229-6758-853a-99d4d491514a
> Jul 24 18:21:32 [970] attrd: notice: crm_cluster_connect: Connecting
> to cluster infrastructure: heartbeat
> Jul 24 18:21:32 [970] attrd: error: mainloop_add_ipc_server: Could
> not start attrd IPC server: Operation not supported (-48)
> Jul 24 18:21:32 [970] attrd: error: attrd_ipc_server_init: Failed to
> create attrd servers: exiting and inhibiting respawn.
> Jul 24 18:21:32 [970] attrd: warning: attrd_ipc_server_init: Verify
> pacemaker and pacemaker_remote are not both enabled.
> Jul 24 18:21:32 [972] pengine: info: crm_log_init: Changed active
> directory to /sonicle/var/cluster/lib/pacemaker/cores
> Jul 24 18:21:32 [972] pengine: error: mainloop_add_ipc_server: Could
> not start pengine IPC server: Operation not supported (-48)
> Jul 24 18:21:32 [972] pengine: error: main: Failed to create IPC
> server: shutting down and inhibiting respawn
> Jul 24 18:21:32 [972] pengine: info: crm_xml_cleanup: Cleaning up
> memory from libxml2
> Jul 24 18:21:33 [971] crmd: info: do_cib_control: Could not connect
> to the CIB service: Transport endpoint is not connected
> Jul 24 18:21:33 [971] crmd: warning: do_cib_control: Couldn't
> complete CIB registration 1 times... pause and retry
> Jul 24 18:21:33 [971] crmd: error: crmd_child_exit: Child process
> pengine exited (pid=972, rc=100)
> Jul 24 18:21:35 [971] crmd: info: crm_timer_popped: Wait Timer
> (I_NULL) just popped (2000ms)
> Jul 24 18:21:36 [971] crmd: info: do_cib_control: Could not connect
> to the CIB service: Transport endpoint is not connected
> Jul 24 18:21:36 [971] crmd: warning: do_cib_control: Couldn't
> complete CIB registration 2 times... pause and retry
> Jul 24 18:21:38 [971] crmd: info: crm_timer_popped: Wait Timer
> (I_NULL) just popped (2000ms)
> Jul 24 18:21:39 [971] crmd: info: do_cib_control: Could not connect
> to the CIB service: Transport endpoint is not connected
> Jul 24 18:21:39 [971] crmd: warning: do_cib_control: Couldn't
> complete CIB registration 3 times... pause and retry
> Jul 24 18:21:41 [971] crmd: info: crm_timer_popped: Wait Timer
> (I_NULL) just popped (2000ms)
> Jul 24 18:21:42 [971] crmd: info: do_cib_control: Could not connect
> to the CIB service: Transport endpoint is not connected
> Jul 24 18:21:42 [971] crmd: warning: do_cib_control: Couldn't
> complete CIB registration 4 times... pause and retry
> Jul 24 18:21:4

Re: [ClusterLabs] pacemaker startup problem

2020-07-26 Thread Gabriele Bulfon
Thanks, I ran it manually, so I got those errors; when run from the service
script it correctly sets PCMK_ipc_type to socket.
 
But now I see these:
Jul 26 11:08:16 [4039] pacemakerd: info: crm_log_init: Changed active directory 
to /sonicle/var/cluster/lib/pacemaker/cores
Jul 26 11:08:16 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 1s
Jul 26 11:08:17 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 2s
Jul 26 11:08:19 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 3s
Jul 26 11:08:22 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 4s
Jul 26 11:08:26 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 5s
Jul 26 11:08:31 [4039] pacemakerd: warning: mcp_read_config: Could not connect 
to Cluster Configuration Database API, error 2
Jul 26 11:08:31 [4039] pacemakerd: notice: main: Could not obtain corosync 
config data, exiting
Jul 26 11:08:31 [4039] pacemakerd: info: crm_xml_cleanup: Cleaning up memory 
from libxml2
 
So I think I need to start corosync first (right?) but it dies with this:
 
Jul 26 11:07:06 [4027] xstorage1 corosync notice [MAIN ] Corosync Cluster 
Engine ('2.4.1'): started and ready to provide service.
Jul 26 11:07:06 [4027] xstorage1 corosync info [MAIN ] Corosync built-in 
features: bindnow
Jul 26 11:07:06 [4027] xstorage1 corosync notice [TOTEM ] Initializing 
transport (UDP/IP Multicast).
Jul 26 11:07:06 [4027] xstorage1 corosync notice [TOTEM ] Initializing 
transmit/receive security (NSS) crypto: none hash: none
Jul 26 11:07:06 [4027] xstorage1 corosync notice [TOTEM ] The network interface 
[10.100.100.1] is now up.
Jul 26 11:07:06 [4027] xstorage1 corosync notice [SERV ] Service engine loaded: 
corosync configuration map access [0]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [YKD ] Service engine loaded: 
corosync configuration service [1]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [YKD ] Service engine loaded: 
corosync cluster closed process group service v1.01 [2]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [YKD ] Service engine loaded: 
corosync profile loading service [4]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [QUORUM] Using quorum provider 
corosync_votequorum
Jul 26 11:07:06 [4027] xstorage1 corosync crit [QUORUM] Quorum provider: 
corosync_votequorum failed to initialize.
Jul 26 11:07:06 [4027] xstorage1 corosync error [SERV ] Service engine 
'corosync_quorum' failed to load for reason 'configuration error: nodelist or 
quorum.expected_votes must be configured!'
Jul 26 11:07:06 [4027] xstorage1 corosync error [MAIN ] Corosync Cluster Engine 
exiting with status 20 at 
/data/sources/sonicle/xstream-storage-gate/components/cluster/corosync/corosync-2.4.1/exec/service.c:356.
My corosync conf has nodelist configured! Here it is:
 
service {
    ver: 1
    name: pacemaker
    use_mgmtd: no
    use_logd: no
}

totem {
    version: 2
    crypto_cipher: none
    crypto_hash: none
    interface {
        ringnumber: 0
        bindnetaddr: 10.100.100.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
        ttl: 1
    }
}

nodelist {
    node {
        ring0_addr: xstorage1
        nodeid: 1
    }
    node {
        ring0_addr: xstorage2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    logfile: /sonicle/var/log/cluster/corosync.log
    to_syslog: no
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
 
 
 
 
Sonicle S.r.l. 
: 
http://www.sonicle.com
Music: 
http://www.gabrielebulfon.com
Quantum Mechanics : 
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Ken Gaillot
A: Cluster Labs - All topics related to open-source clustering welcomed
Data: 25 luglio 2020 0.46.52 CEST
Oggetto: Re: [ClusterLabs] pacemaker startup problem
On Fri, 2020-07-24 at 18:34 +0200, Gabriele Bulfon wrote:
Hello,
after a long time I'm back to run heartbeat/pacemaker/corosync on our
XStreamOS/illumos distro.
I rebuilt the original components I did in 2016 on our latest release
(probably a bit outdated, but I want to start from where I left).
Looks like pacemaker is having trouble starting up showin this logs:
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Jul 24 18:21:32 [971] crmd: info: crm_log_init: Changed active
directory to /sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [971] crmd: info: main: CRM Git Version: 1.1.15
(e174ec8)
Jul 24 18:21:32 [971] crmd: info: 

Re: [ClusterLabs] pacemaker startup problem

2020-07-26 Thread Gabriele Bulfon
Sorry, I was using the wrong hostnames for those networks; using the debug log I
found it was not finding "this node" in the conf file.
 
Gabriele
 
 
Sonicle S.r.l. 
: 
http://www.sonicle.com
Music: 
http://www.gabrielebulfon.com
Quantum Mechanics : 
http://www.cdbaby.com/cd/gabrielebulfon

Re: [ClusterLabs] pacemaker startup problem

2020-07-26 Thread Gabriele Bulfon
Sorry, actually the problem is not gone yet.
Now corosync and pacemaker are running happily, but those IPC errors are coming 
out of heartbeat and crmd as soon as I start it.
The pacemakerd process has PCMK_ipc_type=socket, what's wrong with heartbeat or 
crmd?
 
Here's the env of the process:
 
sonicle@xstorage1:/sonicle/etc/cluster/ha.d# penv 4222
4222: /usr/sbin/pacemakerd
envp[0]: PCMK_respawned=true
envp[1]: PCMK_watchdog=false
envp[2]: HA_LOGFACILITY=none
envp[3]: HA_logfacility=none
envp[4]: PCMK_logfacility=none
envp[5]: HA_logfile=/sonicle/var/log/cluster/corosync.log
envp[6]: PCMK_logfile=/sonicle/var/log/cluster/corosync.log
envp[7]: HA_debug=0
envp[8]: PCMK_debug=0
envp[9]: HA_quorum_type=corosync
envp[10]: PCMK_quorum_type=corosync
envp[11]: HA_cluster_type=corosync
envp[12]: PCMK_cluster_type=corosync
envp[13]: HA_use_logd=off
envp[14]: PCMK_use_logd=off
envp[15]: HA_mcp=true
envp[16]: PCMK_mcp=true
envp[17]: HA_LOGD=no
envp[18]: LC_ALL=C
envp[19]: PCMK_service=pacemakerd
envp[20]: PCMK_ipc_type=socket
envp[21]: SMF_ZONENAME=global
envp[22]: PWD=/
envp[23]: SMF_FMRI=svc:/sonicle/xstream/cluster/pacemaker:default
envp[24]: _=/usr/sbin/pacemakerd
envp[25]: TZ=Europe/Rome
envp[26]: LANG=en_US.UTF-8
envp[27]: SMF_METHOD=start
envp[28]: SHLVL=2
envp[29]: PATH=/usr/sbin:/usr/bin
envp[30]: SMF_RESTARTER=svc:/system/svc/restarter:default
envp[31]: A__z="*SHLVL
 
 
Here are crmd complaints:
 
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: Node 
xstorage1 state is now member
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Could not 
start crmd IPC server: Operation not supported (-48)
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Failed to 
create IPC server: shutting down and inhibiting respawn
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: The 
local CRM is operational
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Input 
I_ERROR received in state S_STARTING from do_started
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: State 
transition S_STARTING -> S_RECOVERY
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning: 
Fast-tracking shutdown in response to errors
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning: Input 
I_PENDING received in state S_RECOVERY from do_started
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Input 
I_TERMINATE received in state S_RECOVERY from do_recover
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: 
Disconnected from the LRM
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Child 
process pengine exited (pid=4316, rc=100)
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Could not 
recover from internal error
Jul 26 11:39:07 xstorage1 heartbeat: [ID 996084 daemon.warning] [4275]: WARN: 
Managed /usr/libexec/pacemaker/crmd process 4315 exited with return code 201.
 
 
Sonicle S.r.l. 
: 
http://www.sonicle.com
Music: 
http://www.gabrielebulfon.com
Quantum Mechanics : 
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Ken Gaillot
A: Cluster Labs - All topics related to open-source clustering welcomed
Data: 25 luglio 2020 0.46.52 CEST
Oggetto: Re: [ClusterLabs] pacemaker startup problem
On Fri, 2020-07-24 at 18:34 +0200, Gabriele Bulfon wrote:
Hello,
after a long time I'm back to run heartbeat/pacemaker/corosync on our
XStreamOS/illumos distro.
I rebuilt the original components I did in 2016 on our latest release
(probably a bit outdated, but I want to start from where I left).
Looks like pacemaker is having trouble starting up showin this logs:
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Jul 24 18:21:32 [971] crmd: info: crm_log_init: Changed active
directory to /sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [971] crmd: info: main: CRM Git Version: 1.1.15
(e174ec8)
Jul 24 18:21:32 [971] crmd: info: do_log: Input I_STARTUP received in
state S_STARTING from crmd_init
Jul 24 18:21:32 [969] lrmd: info: crm_log_init: Changed active
directory to /sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [968] stonith-ng: info: crm_log_init: Changed active
directory to /sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Verifying
cluster type: 'heartbeat'
Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Assuming an
active 'heartbeat' cluster
Jul 24 18:21:32 [968] stonith-ng: notice: crm_cluster_connect:
Connecting to cluster infrastructure: heartbeat
Jul 24 18:21:32 [969] lrmd: error: mainloop_add_ipc_server: Could not
start lrmd IPC server: Operation not supported (-48)
This is repeated for all the subdaemons ... the error is coming from
qb_ipcs

Re: [ClusterLabs] pacemaker startup problem

2020-07-26 Thread Reid Wahl
Hmm. If it's reading PCMK_ipc_type and matching the server type to
QB_IPC_SOCKET, then the only other place I see it could be coming from is
qb_ipc_auth_creds.

qb_ipcs_run -> qb_ipcs_us_publish -> qb_ipcs_us_connection_acceptor ->
qb_ipcs_uc_recv_and_auth -> process_auth -> qb_ipc_auth_creds ->

static int32_t
qb_ipc_auth_creds(struct ipc_auth_data *data)
{
...
#ifdef HAVE_GETPEERUCRED
/*
 * Solaris and some BSD systems
...
#elif defined(HAVE_GETPEEREID)
/*
* Usually MacOSX systems
...
#elif defined(SO_PASSCRED)
/*
* Usually Linux systems
...
#else /* no credentials */
data->ugp.pid = 0;
data->ugp.uid = 0;
data->ugp.gid = 0;
res = -ENOTSUP;
#endif /* no credentials */

return res;

I'll leave it to Ken to say whether that's likely and what it implies if so.

On Sun, Jul 26, 2020 at 2:53 AM Gabriele Bulfon  wrote:

> Sorry, actually the problem is not gone yet.
> Now corosync and pacemaker are running happily, but those IPC errors are
> coming out of heartbeat and crmd as soon as I start it.
> The pacemakerd process has PCMK_ipc_type=socket, what's wrong with
> heartbeat or crmd?
>
> Here's the env of the process:
>
> sonicle@xstorage1:/sonicle/etc/cluster/ha.d# penv 4222
> 4222: /usr/sbin/pacemakerd
> envp[0]: PCMK_respawned=true
> envp[1]: PCMK_watchdog=false
> envp[2]: HA_LOGFACILITY=none
> envp[3]: HA_logfacility=none
> envp[4]: PCMK_logfacility=none
> envp[5]: HA_logfile=/sonicle/var/log/cluster/corosync.log
> envp[6]: PCMK_logfile=/sonicle/var/log/cluster/corosync.log
> envp[7]: HA_debug=0
> envp[8]: PCMK_debug=0
> envp[9]: HA_quorum_type=corosync
> envp[10]: PCMK_quorum_type=corosync
> envp[11]: HA_cluster_type=corosync
> envp[12]: PCMK_cluster_type=corosync
> envp[13]: HA_use_logd=off
> envp[14]: PCMK_use_logd=off
> envp[15]: HA_mcp=true
> envp[16]: PCMK_mcp=true
> envp[17]: HA_LOGD=no
> envp[18]: LC_ALL=C
> envp[19]: PCMK_service=pacemakerd
> envp[20]: PCMK_ipc_type=socket
> envp[21]: SMF_ZONENAME=global
> envp[22]: PWD=/
> envp[23]: SMF_FMRI=svc:/sonicle/xstream/cluster/pacemaker:default
> envp[24]: _=/usr/sbin/pacemakerd
> envp[25]: TZ=Europe/Rome
> envp[26]: LANG=en_US.UTF-8
> envp[27]: SMF_METHOD=start
> envp[28]: SHLVL=2
> envp[29]: PATH=/usr/sbin:/usr/bin
> envp[30]: SMF_RESTARTER=svc:/system/svc/restarter:default
> envp[31]: A__z="*SHLVL
>
>
> Here are crmd complaints:
>
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
> Node xstorage1 state is now member
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
> Could not start crmd IPC server: Operation not supported (-48)
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
> Failed to create IPC server: shutting down and inhibiting respawn
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
> The local CRM is operational
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
> Input I_ERROR received in state S_STARTING from do_started
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
> State transition S_STARTING -> S_RECOVERY
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning:
> Fast-tracking shutdown in response to errors
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning:
> Input I_PENDING received in state S_RECOVERY from do_started
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
> Input I_TERMINATE received in state S_RECOVERY from do_recover
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
> Disconnected from the LRM
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
> Child process pengine exited (pid=4316, rc=100)
> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
> Could not recover from internal error
> Jul 26 11:39:07 xstorage1 heartbeat: [ID 996084 daemon.warning] [4275]:
> WARN: Managed /usr/libexec/pacemaker/crmd process 4315 exited with return
> code 201.
>
>
>
>
> *Sonicle S.r.l. *: http://www.sonicle.com
> *Music: *http://www.gabrielebulfon.com
> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
>
>
>
>
> --
>
> Da: Ken Gaillot 
> A: Cluster Labs - All topics related to open-source clustering welcomed <
> users@clusterlabs.org>
> Data: 25 luglio 2020 0.46.52 CEST
> Oggetto: Re: [ClusterLabs] pacemaker startup problem
>
> On Fri, 2020-07-24 at 18:34 +0200, Gabriele Bulfon wrote:
> > Hello,
> >
> > after a long time I'm b

Re: [ClusterLabs] pacemaker startup problem

2020-07-26 Thread Reid Wahl
Illumos might have getpeerucred, which can also set errno to ENOTSUP.
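
As a quick check for that, a minimal test program (my own sketch, assuming an
illumos/Solaris-style <ucred.h>; this is not code from libqb) could call
getpeerucred() on one end of an AF_UNIX socketpair and see whether it fails
with ENOTSUP:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/socket.h>
    #include <ucred.h>

    int main(void)
    {
        int sv[2];
        ucred_t *uc = NULL;

        /* A connected AF_UNIX pair, the same kind of socket libqb IPC uses. */
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
            perror("socketpair");
            return 1;
        }
        if (getpeerucred(sv[0], &uc) != 0) {
            fprintf(stderr, "getpeerucred: %s\n", strerror(errno));
            return 1;
        }
        printf("peer euid=%ld pid=%ld\n",
               (long)ucred_geteuid(uc), (long)ucred_getpid(uc));
        ucred_free(uc);
        return 0;
    }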

On Sun, Jul 26, 2020 at 3:25 AM Reid Wahl  wrote:

> Hmm. If it's reading PCMK_ipc_type and matching the server type to
> QB_IPC_SOCKET, then the only other place I see it could be coming from is
> qb_ipc_auth_creds.
>
> qb_ipcs_run -> qb_ipcs_us_publish -> qb_ipcs_us_connection_acceptor ->
> qb_ipcs_uc_recv_and_auth -> process_auth -> qb_ipc_auth_creds ->
>
> static int32_t
> qb_ipc_auth_creds(struct ipc_auth_data *data)
> {
> ...
> #ifdef HAVE_GETPEERUCRED
> /*
>  * Solaris and some BSD systems
> ...
> #elif defined(HAVE_GETPEEREID)
> /*
> * Usually MacOSX systems
> ...
> #elif defined(SO_PASSCRED)
> /*
> * Usually Linux systems
> ...
> #else /* no credentials */
> data->ugp.pid = 0;
> data->ugp.uid = 0;
> data->ugp.gid = 0;
> res = -ENOTSUP;
> #endif /* no credentials */
>
> return res;
>
> I'll leave it to Ken to say whether that's likely and what it implies if
> so.
>
> On Sun, Jul 26, 2020 at 2:53 AM Gabriele Bulfon 
> wrote:
>
>> Sorry, actually the problem is not gone yet.
>> Now corosync and pacemaker are running happily, but those IPC errors are
>> coming out of heartbeat and crmd as soon as I start it.
>> The pacemakerd process has PCMK_ipc_type=socket, what's wrong with
>> heartbeat or crmd?
>>
>> Here's the env of the process:
>>
>> sonicle@xstorage1:/sonicle/etc/cluster/ha.d# penv 4222
>> 4222: /usr/sbin/pacemakerd
>> envp[0]: PCMK_respawned=true
>> envp[1]: PCMK_watchdog=false
>> envp[2]: HA_LOGFACILITY=none
>> envp[3]: HA_logfacility=none
>> envp[4]: PCMK_logfacility=none
>> envp[5]: HA_logfile=/sonicle/var/log/cluster/corosync.log
>> envp[6]: PCMK_logfile=/sonicle/var/log/cluster/corosync.log
>> envp[7]: HA_debug=0
>> envp[8]: PCMK_debug=0
>> envp[9]: HA_quorum_type=corosync
>> envp[10]: PCMK_quorum_type=corosync
>> envp[11]: HA_cluster_type=corosync
>> envp[12]: PCMK_cluster_type=corosync
>> envp[13]: HA_use_logd=off
>> envp[14]: PCMK_use_logd=off
>> envp[15]: HA_mcp=true
>> envp[16]: PCMK_mcp=true
>> envp[17]: HA_LOGD=no
>> envp[18]: LC_ALL=C
>> envp[19]: PCMK_service=pacemakerd
>> envp[20]: PCMK_ipc_type=socket
>> envp[21]: SMF_ZONENAME=global
>> envp[22]: PWD=/
>> envp[23]: SMF_FMRI=svc:/sonicle/xstream/cluster/pacemaker:default
>> envp[24]: _=/usr/sbin/pacemakerd
>> envp[25]: TZ=Europe/Rome
>> envp[26]: LANG=en_US.UTF-8
>> envp[27]: SMF_METHOD=start
>> envp[28]: SHLVL=2
>> envp[29]: PATH=/usr/sbin:/usr/bin
>> envp[30]: SMF_RESTARTER=svc:/system/svc/restarter:default
>> envp[31]: A__z="*SHLVL
>>
>>
>> Here are crmd complaints:
>>
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> Node xstorage1 state is now member
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Could not start crmd IPC server: Operation not supported (-48)
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Failed to create IPC server: shutting down and inhibiting respawn
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> The local CRM is operational
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Input I_ERROR received in state S_STARTING from do_started
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> State transition S_STARTING -> S_RECOVERY
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning:
>> Fast-tracking shutdown in response to errors
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning:
>> Input I_PENDING received in state S_RECOVERY from do_started
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Input I_TERMINATE received in state S_RECOVERY from do_recover
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> Disconnected from the LRM
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Child process pengine exited (pid=4316, rc=100)
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Could not recover from internal error
>> Jul 26 11:39:07 xstorage1 heartbeat: [ID 996084 daemon.warning] [4275]:
>> WARN: Managed /usr/libexec/pacemaker/crmd process 4315 exited with return
>> code 201.
>>
>>
>>
>>
>> *Sonicle S.r.l. *

Re: [ClusterLabs] pacemaker startup problem

2020-07-27 Thread Gabriele Bulfon
Solved this: it turns out I don't need the heartbeat component and service
running at all. I just use corosync and pacemaker, and this seems to work.
Now going on with the crm configuration.
 
Thanks!
Gabriele
 
 
Sonicle S.r.l. 
: 
http://www.sonicle.com
Music: 
http://www.gabrielebulfon.com
Quantum Mechanics : 
http://www.cdbaby.com/cd/gabrielebulfon

Re: [ClusterLabs] Pacemaker not starting

2020-09-23 Thread Reid Wahl
Please also share /etc/cluster/cluster.conf. Do you have `two_node="1"
expected_votes="1"` in the <cman> element of cluster.conf?

This is technically a cman startup issue. Pacemaker is waiting for
cman to start and form quorum through corosync first.
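
For reference, a minimal sketch of the relevant cluster.conf fragment (cluster
and node names are placeholders, and the fencing sections are omitted):

    <cluster name="mysqlcluster" config_version="1">
      <cman two_node="1" expected_votes="1"/>
      <clusternodes>
        <clusternode name="node1.example.com" nodeid="1"/>
        <clusternode name="node2.example.com" nodeid="2"/>
      </clusternodes>
    </cluster>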

On Wed, Sep 23, 2020 at 9:55 AM Strahil Nikolov  wrote:
>
> What is the output of 'corosync-quorumtool -s' on both nodes ?
> What is your cluster's configuration :
>
> 'crm configure show' or 'pcs config'
>
>
> Best Regards,
> Strahil Nikolov
>
>
>
>
>
>
> В сряда, 23 септември 2020 г., 16:07:16 Гринуич+3, Ambadas Kawle 
>  написа:
>
>
>
>
>
> Hello All
>
> We have 2 node with Mysql cluster and we are not able to start pacemaker on 
> one of the node (slave node)
> We are getting error "waiting for quorum... timed-out waiting for cluster"
>
> Following are package detail
> pacemaker pacemaker-1.1.15-5.el6.x86_64
> pacemaker-libs-1.1.15-5.el6.x86_64
> pacemaker-cluster-libs-1.1.15-5.el6.x86_64
> pacemaker-cli-1.1.15-5.el6.x86_64
>
> Corosync corosync-1.4.7-6.el6.x86_64
> corosynclib-1.4.7-6.el6.x86_64
>
> Mysql mysql-5.1.73-7.el6.x86_64
> "mysql-connector-python-2.0.4-1.el6.noarch
>
> Your help is appreciated. Thanks, Ambadas Kawle
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker not starting

2020-09-23 Thread Strahil Nikolov
What is the output of 'corosync-quorumtool -s' on both nodes ?
What is your cluster's configuration :

'crm configure show' or 'pcs config'


Best Regards,
Strahil Nikolov






В сряда, 23 септември 2020 г., 16:07:16 Гринуич+3, Ambadas Kawle 
 написа: 





Hello All

We have 2 node with Mysql cluster and we are not able to start pacemaker on one 
of the node (slave node)
We are getting error "waiting for quorum... timed-out waiting for cluster"

Following are package detail
pacemaker pacemaker-1.1.15-5.el6.x86_64
pacemaker-libs-1.1.15-5.el6.x86_64
pacemaker-cluster-libs-1.1.15-5.el6.x86_64
pacemaker-cli-1.1.15-5.el6.x86_64

Corosync corosync-1.4.7-6.el6.x86_64
corosynclib-1.4.7-6.el6.x86_64

Mysql mysql-5.1.73-7.el6.x86_64
"mysql-connector-python-2.0.4-1.el6.noarch

Your help is appreciated. Thanks, Ambadas Kawle
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker not starting

2020-09-25 Thread Ambadas Kawle
Hello Team


Please help me to solve this problem

Thanks

On Wed, 23 Sep 2020, 11:41 am Ambadas Kawle,  wrote:

> Hello All
>
> We have 2 node with Mysql cluster and we are not able to start pacemaker on 
> one of the node (slave node)
> We are getting error "waiting for quorum... timed-out waiting for cluster"
>
> Following are package detail
> pacemaker pacemaker-1.1.15-5.el6.x86_64
> pacemaker-libs-1.1.15-5.el6.x86_64
> pacemaker-cluster-libs-1.1.15-5.el6.x86_64
> pacemaker-cli-1.1.15-5.el6.x86_64
>
> Corosync corosync-1.4.7-6.el6.x86_64
> corosynclib-1.4.7-6.el6.x86_64
>
> Mysql mysql-5.1.73-7.el6.x86_64
> "mysql-connector-python-2.0.4-1.el6.noarch
>
> Your help is appreciated
>
>
> Thanks
>
> Ambadas kawle
>
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker not starting

2020-09-25 Thread Klaus Wenninger
On 9/24/20 2:53 PM, Ambadas Kawle wrote:
> Hello Team 
>
>
> Please help me to solve this problem
You have to provide us with some information about your
cluster setup so that we can help.
That is why we had asked you for the content
of /etc/cluster/cluster.conf and the output
of 'pcs config'.

Klaus
>
> Thanks 
>
> On Wed, 23 Sep 2020, 11:41 am Ambadas Kawle,  > wrote:
>
> Hello All
>
> We have 2 node with Mysql cluster and we are not able to start pacemaker 
> on one of the node (slave node)
> We are getting error "waiting for quorum... timed-out waiting for cluster"
>
> Following are package detail
> pacemaker pacemaker-1.1.15-5.el6.x86_64
> pacemaker-libs-1.1.15-5.el6.x86_64
> pacemaker-cluster-libs-1.1.15-5.el6.x86_64
> pacemaker-cli-1.1.15-5.el6.x86_64
> 
> Corosync corosync-1.4.7-6.el6.x86_64
> corosynclib-1.4.7-6.el6.x86_64
> 
> Mysql mysql-5.1.73-7.el6.x86_64
> "mysql-connector-python-2.0.4-1.el6.noarch
>
> Your help is appreciated
>
> Thanks 
>
> Ambadas kawle
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker not starting

2020-09-29 Thread Klaus Wenninger
On 9/29/20 10:04 AM, Ambadas Kawle wrote:
> Hello 
>
> Please find below attached ss for cluster configuration file.
As Reid already pointed out, your issue is probably a
"pre pacemaker start" thing.
Regarding pacemaker clusters running with cman, there
are probably others who can give you better advice off
the top of their heads. Honestly I don't remember
ever having set up a 2-node cluster on that config, but
your cman section is empty instead of showing the
'two_node="1" expected_votes="1"' Reid had mentioned.
Could you please provide text files in the future
instead of graphical screenshots?
Your test of crm_mon shows - somewhat expectedly - that
pacemaker is really not running. That is why it is
not able to connect.
As with most of the tools, you can make crm_mon read
the CIB from a file instead of connecting to pacemaker
to get it via IPC.
Not that I would expect too much insight from it, but ...:

  CIB_file=/var/lib/pacemaker/cib/cib.xml crm_mon

Klaus
>
> On Fri, 25 Sep 2020, 8:46 pm Klaus Wenninger,  > wrote:
>
> On 9/24/20 2:53 PM, Ambadas Kawle wrote:
>> Hello Team 
>>
>>
>> Please help me to solve this problem
> You have to provide us with some information about your
> cluster setup so that we can help.
> That is why we had asked you for the content
> of /etc/cluster/cluster.conf and the output
> of 'pcs config'.
>
> Klaus
>>
>> Thanks 
>>
>> On Wed, 23 Sep 2020, 11:41 am Ambadas Kawle, > > wrote:
>>
>> Hello All
>>
>> We have 2 node with Mysql cluster and we are not able to start 
>> pacemaker on one of the node (slave node)
>> We are getting error "waiting for quorum... timed-out waiting for 
>> cluster"
>>
>> Following are package detail
>> pacemaker pacemaker-1.1.15-5.el6.x86_64
>> pacemaker-libs-1.1.15-5.el6.x86_64
>> pacemaker-cluster-libs-1.1.15-5.el6.x86_64
>> pacemaker-cli-1.1.15-5.el6.x86_64
>> 
>> Corosync corosync-1.4.7-6.el6.x86_64
>> corosynclib-1.4.7-6.el6.x86_64
>> 
>> Mysql mysql-5.1.73-7.el6.x86_64
>> "mysql-connector-python-2.0.4-1.el6.noarch
>>
>> Your help is appreciated
>>
>> Thanks 
>>
>> Ambadas kawle
>>
>>
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker not starting

2020-09-29 Thread Ambadas Kawle
Hello

Please find below attached ss for cluster configuration file.

On Fri, 25 Sep 2020, 8:46 pm Klaus Wenninger,  wrote:

> On 9/24/20 2:53 PM, Ambadas Kawle wrote:
>
> Hello Team
>
>
> Please help me to solve this problem
>
> You have to provide us with some information about your
> cluster setup so that we can help.
> That is why we had asked you for the content
> of /etc/cluster/cluster.conf and the output
> of 'pcs config'.
>
> Klaus
>
>
> Thanks
>
> On Wed, 23 Sep 2020, 11:41 am Ambadas Kawle,  wrote:
>
>> Hello All
>>
>> We have 2 node with Mysql cluster and we are not able to start pacemaker on 
>> one of the node (slave node)
>> We are getting error "waiting for quorum... timed-out waiting for cluster"
>>
>> Following are package detail
>> pacemaker pacemaker-1.1.15-5.el6.x86_64
>> pacemaker-libs-1.1.15-5.el6.x86_64
>> pacemaker-cluster-libs-1.1.15-5.el6.x86_64
>> pacemaker-cli-1.1.15-5.el6.x86_64
>>
>> Corosync corosync-1.4.7-6.el6.x86_64
>> corosynclib-1.4.7-6.el6.x86_64
>>
>> Mysql mysql-5.1.73-7.el6.x86_64
>> "mysql-connector-python-2.0.4-1.el6.noarch
>>
>> Your help is appreciated
>>
>>  Thanks
>>
>> Ambadas kawle
>>
>>
> ___
> Manage your subscription:https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker alerts node_selector

2020-11-25 Thread Reid Wahl
What version of Pacemaker are you using, and how does it behave?

Depending on the error/misbehavior you're experiencing, this might
have been me. Looks like in commit bd451763[1], I copied the
alerts-2.9.rng[2] schema instead of the alerts-2.10.rng[3] schema.

[1] https://github.com/ClusterLabs/pacemaker/commit/bd451763
[2] https://github.com/ClusterLabs/pacemaker/blob/master/xml/alerts-2.9.rng
[3] https://github.com/ClusterLabs/pacemaker/blob/master/xml/alerts-2.10.rng

On Wed, Nov 25, 2020 at 9:31 AM  wrote:
>
> Hi, I would like to trigger an external script, if something happens on a 
> specific node.
>
>
>
> In the documentation of alerts, I can see <select_nodes>, but whatever I put
> into the XML, it's not working...
>
>
>
> configuration>
>
> 
>
> 
>
> 
>
>   
>
>   
>
> 
>
> 
>
> 
>
> 
>
> 
>
> Can anybody send me an example about the right syntax ?
>
>
>
> Thank you very much……
>
>
>
> Best regards, Alfred
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker alerts node_selector

2020-11-25 Thread vockinger
Hi, thank you for your reply.

I tried it this way:


  
  
hana_node_1  
  

  
  


  
  



During the save, the select is reset to an empty element:
  

 


Do I need to specify, in addition to select_nodes, the <select_attributes> section?

Thank you, Alfred


-Ursprüngliche Nachricht-
Von: Users  Im Auftrag von Reid Wahl
Gesendet: Donnerstag, 26. November 2020 05:30
An: Cluster Labs - All topics related to open-source clustering welcomed 

Betreff: Re: [ClusterLabs] pacemaker alerts node_selector

What version of Pacemaker are you using, and how does it behave?

Depending on the error/misbehavior you're experiencing, this might have been 
me. Looks like in commit bd451763[1], I copied the alerts-2.9.rng[2] schema 
instead of the alerts-2.10.rng[3] schema.

[1] https://github.com/ClusterLabs/pacemaker/commit/bd451763
[2] https://github.com/ClusterLabs/pacemaker/blob/master/xml/alerts-2.9.rng
[3] https://github.com/ClusterLabs/pacemaker/blob/master/xml/alerts-2.10.rng

On Wed, Nov 25, 2020 at 9:31 AM  wrote:
>
> Hi, I would like to trigger an external script, if something happens on a 
> specific node.
>
>
>
> In the documentation of alerts, i can see  but whatever I put 
> into the XML, it’s not working…..
>
>
>
> configuration>
>
> 
>
> 
>
> 
>
>   
>
>   
>
> 
>
>  value="someu...@example.com"/>
>
> 
>
> 
>
> 
>
> Can anybody send me an example about the right syntax ?
>
>
>
> Thank you very much……
>
>
>
> Best regards, Alfred
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



--
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat CEE - Platform Support Delivery - 
ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pacemaker alerts node_selector

2020-11-25 Thread Reid Wahl
I created https://github.com/ClusterLabs/pacemaker/pull/2241 to
correct the schema mistake.

On Wed, Nov 25, 2020 at 10:51 PM  wrote:
>
> Hi, thank you for your reply.
>
> I tried it this way:
>
> 
>path="/usr/share/pacemaker/alerts/test_alert.sh">
>   
> hana_node_1
>   
> 
>id="test_alert_1-instance_attributes-HANASID"/>
>id="test_alert_1-instance_attributes-AVZone"/>
> 
>  value="/usr/share/pacemaker/alerts/test_alert.sh"/>
>   
>   
> 
>
>
> During the save the select is been reset to
>   
>   
>  

The schema shows that <select_nodes> has to be empty.
  


> Do I need to specify in addition to select_nodes the section  name="select_attributes">

The <select_attributes> element configures the agent to receive alerts
when a node attribute changes.

For a bit more detail on how these  values work, see:
  - 
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_alert_filters

So it doesn't seem like this would be the way to configure alerts for
a particular node, which is what you've said you want to do.

I'm not very familiar with alerts off the top of my head, so I would
have to research this further unless someone else jumps in to answer
first. However, based on a cursory reading of the doc, it looks like
the <select> attributes do not provide a way to filter by a
particular node. The <select_attributes> element does allow you to
filter by node **attribute**. But the <select_nodes> element simply
filters "node events" in general, rather than filtering by node.
(Anyone correct me if I'm wrong.)
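
For reference, the alert filter syntax documented in Pacemaker Explained looks
roughly like the following (ids and the recipient value are placeholders; note
that <select_nodes/> stays empty and selects node events in general, not one
particular node):

    <alerts>
      <alert id="test_alert_1" path="/usr/share/pacemaker/alerts/test_alert.sh">
        <select>
          <select_nodes/>
        </select>
        <recipient id="test_alert_1-recipient" value="/var/log/test_alert.log"/>
      </alert>
    </alerts>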

>
> Thank you, Alfred
>
>
> -Ursprüngliche Nachricht-----
> Von: Users  Im Auftrag von Reid Wahl
> Gesendet: Donnerstag, 26. November 2020 05:30
> An: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Betreff: Re: [ClusterLabs] pacemaker alerts node_selector
>
> What version of Pacemaker are you using, and how does it behave?
>
> Depending on the error/misbehavior you're experiencing, this might have been 
> me. Looks like in commit bd451763[1], I copied the alerts-2.9.rng[2] schema 
> instead of the alerts-2.10.rng[3] schema.
>
> [1] https://github.com/ClusterLabs/pacemaker/commit/bd451763
> [2] https://github.com/ClusterLabs/pacemaker/blob/master/xml/alerts-2.9.rng
> [3] https://github.com/ClusterLabs/pacemaker/blob/master/xml/alerts-2.10.rng
>
> On Wed, Nov 25, 2020 at 9:31 AM  wrote:
> >
> > Hi, I would like to trigger an external script, if something happens on a 
> > specific node.
> >
> >
> >
> > In the documentation of alerts, I can see select_nodes, but whatever I
> > put into the XML, it's not working...
> >
> >
> >
> > [The alert configuration XML was stripped by the list archive; it defined
> > an alert with a recipient value="someu...@example.com".]
> >
> > Can anybody send me an example of the right syntax?
> >
> >
> >
> > Thank you very much...
> >
> >
> >
> > Best regards, Alfred
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat CEE - Platform Support Delivery 
> - ClusterHA
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Pacemaker 1.1 support period

2021-02-04 Thread Andrei Zheregelia
Hello,

I would like to clarify the support period for Pacemaker 1.1.
Until what year is it planned to backport bug fixes into the 1.1 branch and
create releases?

-- 

Best Regards,
Andrei

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker Cluster help

2021-05-27 Thread Andrei Borzenkov
On 27.05.2021 15:36, Nathan Mazarelo wrote:
> Is there a way to have pacemaker resource groups failover if all floating IP 
> resources are unavailable?
> 
> I want to have multiple floating IPs in a resource group that will only 
> failover if all IPs cannot work. Each floating IP is on a different subnet 
> and can be used by the application I have. If a floating IP is unavailable it 
> will use the next available floating IP.
> Resource Group: floating_IP
> 
> floating-IP
> 
> floating-IP2
> 
> floating-IP3  
> For example, right now if a floating-IP resource fails the whole resource 
> group will failover to a different node. What I want is to have pacemaker 
> failover the resource group only if all three resources are unavailable. Is 
> this possible?
> 

Yes. See "Moving Resources Due to Connectivity Changes" in pacemaker
explained.
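
Roughly, that section describes an ocf:pacemaker:ping clone plus a location
rule on the pingd node attribute it maintains. An untested sketch (the ping
targets are placeholders; with three targets and multiplier=1000, full
connectivity scores 3000):

  pcs resource create ping ocf:pacemaker:ping \
      host_list="192.168.1.1 192.168.2.1 192.168.3.1" \
      dampen=5s multiplier=1000 op monitor interval=10s clone
  pcs constraint location floating_IP rule score=-INFINITY \
      pingd lt 3000 or not_defined pingd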
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker Cluster help

2021-06-01 Thread kgaillot
On Thu, 2021-05-27 at 20:46 +0300, Andrei Borzenkov wrote:
> On 27.05.2021 15:36, Nathan Mazarelo wrote:
> > Is there a way to have pacemaker resource groups failover if all
> > floating IP resources are unavailable?
> > 
> > I want to have multiple floating IPs in a resource group that will
> > only failover if all IPs cannot work. Each floating IP is on a
> > different subnet and can be used by the application I have. If a
> > floating IP is unavailable it will use the next available floating
> > IP.
> > Resource Group: floating_IP
> > 
> > floating-IP
> > 
> > floating-IP2
> > 
> > floating-IP3  
> > For example, right now if a floating-IP resource fails the whole
> > resource group will failover to a different node. What I want is to
> > have pacemaker failover the resource group only if all three
> > resources are unavailable. Is this possible?
> > 
> 
> Yes. See "Moving Resources Due to Connectivity Changes" in pacemaker
> explained.

I don't think that will work when the IP resources themselves are
what's desired to be affected.

My first thought is that a resource group is probably not the right
model, since there is not likely to be an ordering relationship among
the IPs, just colocation. I'd use separate colocations for IP2 and IP3
with IP1 instead. However, that is not completely symmetrical -- if IP1
*can't* be assigned to a node for any reason (e.g. meeting its failure
threshold on all nodes), then the other IPs can't either.
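
As a rough, untested sketch with the resource names from the original post,
that would be something like:

  pcs constraint colocation add floating-IP2 with floating-IP INFINITY
  pcs constraint colocation add floating-IP3 with floating-IP INFINITY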

To keep the IPs failing over as soon as one of them fails, the closest
approach I can think of is the new critical resource feature, which is
just coming out in the 2.1.0 release and so probably not an option
here. Marking IP2 and IP3 as noncritical would allow those to stop on
failure, and only if IP1 also failed would they be started elsewhere.
However again it's not completely symmetric, all IPs would fail over if
IP1 fails.
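
If I'm reading that feature right, marking them noncritical would be a
meta-attribute update along these lines (untested, 2.1.0+ only, and assuming
the meta-attribute is indeed named "critical"):

  pcs resource meta floating-IP2 critical=false
  pcs resource meta floating-IP3 critical=false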

Basically, there's no way to treat a set of resources exactly equally.
Pacemaker has to assign one of them to a node first, then assign the
others relative to it.

There are some feature requests that are related, but no one's
volunteered to do them yet:

 https://bugs.clusterlabs.org/show_bug.cgi?id=5052
 https://bugs.clusterlabs.org/show_bug.cgi?id=5320

-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pacemaker Cluster help

2021-06-01 Thread Andrei Borzenkov
On 01.06.2021 18:20, kgail...@redhat.com wrote:
> On Thu, 2021-05-27 at 20:46 +0300, Andrei Borzenkov wrote:
>> On 27.05.2021 15:36, Nathan Mazarelo wrote:
>>> Is there a way to have pacemaker resource groups failover if all
>>> floating IP resources are unavailable?
>>> 
>>> I want to have multiple floating IPs in a resource group that will
>>> only failover if all IPs cannot work. Each floating IP is on a
>>> different subnet and can be used by the application I have. If a
>>> floating IP is unavailable it will use the next available floating
>>> IP.
>>> Resource Group: floating_IP
>>>
>>> floating-IP
>>>
>>> floating-IP2
>>>
>>> floating-IP3  
>>> For example, right now if a floating-IP resource fails the whole
>>> resource group will failover to a different node. What I want is to
>>> have pacemaker failover the resource group only if all three
>>> resources are unavailable. Is this possible?
>>>
>>
>> Yes. See "Moving Resources Due to Connectivity Changes" in pacemaker
>> explained.
> 
> I don't think that will work when the IP resources themselves are
> what's desired to be affected.
> 

I guess this needs a more precise explanation from the OP of what "floating
IP is unavailable" means. Personally I do not see any point in having a local
IP without connectivity. If the question is really just "fail only if all
resources failed" then the obvious answer is a resource set with
require-all=false, and it does not matter what type the resources are.
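
An untested sketch of that, reusing the resource names from the original
post and a placeholder "myapp" for whatever depends on the IPs:

  pcs constraint order set floating-IP floating-IP2 floating-IP3 \
      sequential=false require-all=false set myapp

Here sequential=false leaves the IPs unordered among themselves, and
require-all=false means the second set only waits for at least one member
of the first set.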

> My first thought is that a resource group is probably not the right
> model, since there is not likely to be an ordering relationship among
> the IPs, just colocation. I'd use separate colocations for IP2 and IP3
> with IP1 instead. However, that is not completely symmetrical -- if IP1
> *can't* be assigned to a node for any reason (e.g. meeting its failure
> threshold on all nodes), then the other IPs can't either.
> 
> To keep the IPs failing over as soon as one of them fails, the closest
> approach I can think of is the new critical resource feature, which is
> just coming out in the 2.1.0 release and so probably not an option
> here. Marking IP2 and IP3 as noncritical would allow those to stop on
> failure, and only if IP1 also failed would they be started elsewhere.
> However again it's not completely symmetric, all IPs would fail over if
> IP1 fails.
> 
> Basically, there's no way to treat a set of resources exactly equally.
> Pacemaker has to assign one of them to a node first, then assign the
> others relative to it.
> 
> There are some feature requests that are related, but no one's
> volunteered to do them yet:
> 
>  https://bugs.clusterlabs.org/show_bug.cgi?id=5052
>  https://bugs.clusterlabs.org/show_bug.cgi?id=5320
> 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Pacemaker alerts log duplication.

2021-07-08 Thread Amol Shinde
Hello everyone!!!
Hope you are doing well.
I need some help regarding Pacemaker alerts. I have a 36-node cluster setup
with some IP and dummy resources. I have also deployed an alert script for the
cluster that monitors the nodes and resources and generates alerts when events
occur. The alert script is present on all nodes and sends the captured alert to
a Web-UI using a message bus. So, for example, when a node goes offline,
Pacemaker triggers the alert agent script on the other nodes in the cluster and
logs the event as "Node is lost". This message is then sent to the message bus
by the script.

The problem is that since the alert is triggered on every node, the agent script
sends multiple duplicate log messages to the message bus. Duplicate log messages
from all the live nodes are reported to the Web-UI, clogging up the interface,
making it difficult to parse, and ruining the user experience.

Is there any way in Pacemaker itself to have it call the agent on just one node
and log the message when an event occurs, rather than calling the agent on all
live nodes within the cluster? For example, when a node goes offline, the agent
would be triggered on only one of the live nodes in the cluster, generating a
single log entry rather than multiple duplicate logs for the same event.

Thank you.

Regards,
Amol Shinde


Seagate Internal
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Pacemaker problems with pingd

2021-08-04 Thread Janusz Jaskiewicz
Hello.

Please forgive the length of this email, but I wanted to provide as much
detail as possible.

I'm trying to set up a cluster of two nodes for my service.
I have a problem with a scenario where the network between two nodes gets
broken and they can no longer see each other.
This causes split-brain.
I know that the proper way of implementing this would be to employ STONITH, but
it is not feasible for me now (I don't have the necessary hardware support, and
I don't want to introduce another point of failure with shared-storage-based
STONITH).

In order to work around the split-brain scenario, I introduced pingd into my
cluster, which in theory should do what I expect.
pingd pings a network device, so when the NIC is broken on one of my nodes,
this node should not run the resources because pingd would fail for it.

pingd resource is configured to update the value of variable 'pingd'
(interval: 5s, dampen: 3s, multiplier:1000).
Based on the value of pingd I have a location constraint which sets score
to -INFINITY for resource DimProdClusterIP when 'pingd' is not 1000.
All other resources are colocated with DimProdClusterIP, and
DimProdClusterIP should start before all other resources.
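
For reference, a rule like that is typically created with something along
these lines (untested, but it corresponds to the constraint dump shown
further below):

  pcs constraint location DimProdClusterIP rule score=-INFINITY pingd ne 1000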

Based on that setup I would expect that when the resources run on dimprod01
and I disconnect dimprod02 from the network, the resources will not start
on dimprod02.
Unfortunately I see that after a token interval + consensus interval my
resources are brought up for a moment and then go down again.
This is undesirable, as it causes DRBD split-brain inconsistency and
cluster IP may also be taken over by the node which is down.

I tried to debug it, but I can't figure out why it doesn't work.
I would appreciate any help/pointers.


Following are some details of my setup and snippet of pacemaker logs with
comments:

Setup details:

pcs status:
Cluster name: dimprodcluster
Cluster Summary:
  * Stack: corosync
  * Current DC: dimprod02 (version 2.0.5-9.el8_4.1-ba59be7122) - partition
with quorum
  * Last updated: Tue Aug  3 08:20:32 2021
  * Last change:  Mon Aug  2 18:24:39 2021 by root via cibadmin on dimprod01
  * 2 nodes configured
  * 8 resource instances configured

Node List:
  * Online: [ dimprod01 dimprod02 ]

Full List of Resources:
  * DimProdClusterIP (ocf::heartbeat:IPaddr2): Started dimprod01
  * WyrDimProdServer (systemd:wyr-dim): Started dimprod01
  * Clone Set: WyrDimProdServerData-clone [WyrDimProdServerData]
(promotable):
* Masters: [ dimprod01 ]
* Slaves: [ dimprod02 ]
  * WyrDimProdFS (ocf::heartbeat:Filesystem): Started dimprod01
  * DimTestClusterIP (ocf::heartbeat:IPaddr2): Started dimprod01
  * Clone Set: ping-clone [ping]:
* Started: [ dimprod01 dimprod02 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


pcs constraint
Location Constraints:
  Resource: DimProdClusterIP
Constraint: location-DimProdClusterIP
  Rule: score=-INFINITY
Expression: pingd ne 1000
Ordering Constraints:
  start DimProdClusterIP then promote WyrDimProdServerData-clone
(kind:Mandatory)
  promote WyrDimProdServerData-clone then start WyrDimProdFS
(kind:Mandatory)
  start WyrDimProdFS then start WyrDimProdServer (kind:Mandatory)
  start WyrDimProdServer then start DimTestClusterIP (kind:Mandatory)
Colocation Constraints:
  WyrDimProdServer with DimProdClusterIP (score:INFINITY)
  DimTestClusterIP with DimProdClusterIP (score:INFINITY)
  WyrDimProdServerData-clone with DimProdClusterIP (score:INFINITY)
(with-rsc-role:Master)
  WyrDimProdFS with DimProdClusterIP (score:INFINITY)
Ticket Constraints:


pcs resource config ping
 Resource: ping (class=ocf provider=pacemaker type=ping)
  Attributes: dampen=3s host_list=193.30.22.33 multiplier=1000
  Operations: monitor interval=5s timeout=4s (ping-monitor-interval-5s)
  start interval=0s timeout=60s (ping-start-interval-0s)
  stop interval=0s timeout=5s (ping-stop-interval-0s)



cat /etc/corosync/corosync.conf
totem {
version: 2
cluster_name: dimprodcluster
transport: knet
crypto_cipher: aes256
crypto_hash: sha256
token: 1
interface {
knet_ping_interval: 1000
knet_ping_timeout: 1000
}
}

nodelist {
node {
ring0_addr: dimprod01
name: dimprod01
nodeid: 1
}

node {
ring0_addr: dimprod02
name: dimprod02
nodeid: 2
}
}

quorum {
provider: corosync_votequorum
two_node: 1
}

logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
timestamp: on
debug:on
}



Logs:
When the network is connected 'pingd' takes value of 1000:

Aug 03 08:23:01 dimprod02.my.clustertest.com pacemaker-attrd [2827046]
(attrd_client_update) debug: Broadcasting pingd[dimprod02]=1000 (writer)
Aug 03 08:23:01 dimprod02.my.clustertest.com attrd_updater   [3369856]
(pcmk__node_attr_request) debug: Asked pacemaker-attrd to update pingd=1000
for dimprod02: OK (0)

Re: [ClusterLabs] Pacemaker failover failure

2015-07-01 Thread alex austin
So I noticed that if I kill redis on one node, it starts on the other, no
problem, but if I actually kill pacemaker itself on one node, the other
doesn't "sense" it so it doesn't fail over.



On Wed, Jul 1, 2015 at 12:42 PM, alex austin  wrote:

> Hi all,
>
> I have configured a virtual ip and redis in master-slave with corosync
> pacemaker. If redis fails, then the failover is successful, and redis gets
> promoted on the other node. However if pacemaker itself fails on the active
> node, the failover is not performed. Is there anything I missed in the
> configuration?
>
> Here's my configuration (i have hashed the ip address out):
>
> node host1.com
>
> node host2.com
>
> primitive ClusterIP IPaddr2 \
>
> params ip=xxx.xxx.xxx.xxx cidr_netmask=23 \
>
> op monitor interval=1s timeout=20s \
>
> op start interval=0 timeout=20s \
>
> op stop interval=0 timeout=20s \
>
> meta is-managed=true target-role=Started resource-stickiness=500
>
> primitive redis redis \
>
> meta target-role=Master is-managed=true \
>
> op monitor interval=1s role=Master timeout=5s on-fail=restart
>
> ms redis_clone redis \
>
> meta notify=true is-managed=true ordered=false interleave=false
> globally-unique=false target-role=Master migration-threshold=1
>
> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
>
> colocation ip-on-redis inf: ClusterIP redis_clone:Master
>
> property cib-bootstrap-options: \
>
> dc-version=1.1.11-97629de \
>
> cluster-infrastructure="classic openais (with plugin)" \
>
> expected-quorum-votes=2 \
>
> stonith-enabled=false
>
> property redis_replication: \
>
> redis_REPL_INFO=host.com
>
>
> thank you in advance
>
>
> Kind regards,
>
>
> Alex
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker failover failure

2015-07-01 Thread Nekrasov, Alexander
stonith-enabled=false
this might be the issue. The way peer node death is resolved, the surviving 
node must call STONITH on the peer. If it's disabled it might not be able to 
resolve the event

Alex

From: alex austin [mailto:alexixa...@gmail.com]
Sent: Wednesday, July 01, 2015 9:51 AM
To: Users@clusterlabs.org
Subject: Re: [ClusterLabs] Pacemaker failover failure

So I noticed that if I kill redis on one node, it starts on the other, no 
problem, but if I actually kill pacemaker itself on one node, the other doesn't 
"sense" it so it doesn't fail over.



On Wed, Jul 1, 2015 at 12:42 PM, alex austin 
mailto:alexixa...@gmail.com>> wrote:
Hi all,

I have configured a virtual ip and redis in master-slave with corosync 
pacemaker. If redis fails, then the failover is successful, and redis gets 
promoted on the other node. However if pacemaker itself fails on the active 
node, the failover is not performed. Is there anything I missed in the 
configuration?

Here's my configuration (i have hashed the ip address out):


node host1.com<http://host1.com>

node host2.com<http://host2.com>

primitive ClusterIP IPaddr2 \

params ip=xxx.xxx.xxx.xxx cidr_netmask=23 \

op monitor interval=1s timeout=20s \

op start interval=0 timeout=20s \

op stop interval=0 timeout=20s \

meta is-managed=true target-role=Started resource-stickiness=500

primitive redis redis \

meta target-role=Master is-managed=true \

op monitor interval=1s role=Master timeout=5s on-fail=restart

ms redis_clone redis \

meta notify=true is-managed=true ordered=false interleave=false 
globally-unique=false target-role=Master migration-threshold=1

colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master

colocation ip-on-redis inf: ClusterIP redis_clone:Master

property cib-bootstrap-options: \

dc-version=1.1.11-97629de \

cluster-infrastructure="classic openais (with plugin)" \

expected-quorum-votes=2 \

stonith-enabled=false

property redis_replication: \

redis_REPL_INFO=host.com<http://host.com>



thank you in advance



Kind regards,



Alex

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker failover failure

2015-07-01 Thread alex austin
So I did another test:

two nodes: node1 and node2

Case: node1 is the active node,
node2 is passive.

If I killall -9 pacemakerd corosync on node1, the services do not fail over
to node2, but if I then start corosync and pacemaker on node1 again, it fails
over to node2.

Where am I going wrong?

Alex

On Wed, Jul 1, 2015 at 12:42 PM, alex austin  wrote:

> Hi all,
>
> I have configured a virtual ip and redis in master-slave with corosync
> pacemaker. If redis fails, then the failover is successful, and redis gets
> promoted on the other node. However if pacemaker itself fails on the active
> node, the failover is not performed. Is there anything I missed in the
> configuration?
>
> Here's my configuration (i have hashed the ip address out):
>
> node host1.com
>
> node host2.com
>
> primitive ClusterIP IPaddr2 \
>
> params ip=xxx.xxx.xxx.xxx cidr_netmask=23 \
>
> op monitor interval=1s timeout=20s \
>
> op start interval=0 timeout=20s \
>
> op stop interval=0 timeout=20s \
>
> meta is-managed=true target-role=Started resource-stickiness=500
>
> primitive redis redis \
>
> meta target-role=Master is-managed=true \
>
> op monitor interval=1s role=Master timeout=5s on-fail=restart
>
> ms redis_clone redis \
>
> meta notify=true is-managed=true ordered=false interleave=false
> globally-unique=false target-role=Master migration-threshold=1
>
> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
>
> colocation ip-on-redis inf: ClusterIP redis_clone:Master
>
> property cib-bootstrap-options: \
>
> dc-version=1.1.11-97629de \
>
> cluster-infrastructure="classic openais (with plugin)" \
>
> expected-quorum-votes=2 \
>
> stonith-enabled=false
>
> property redis_replication: \
>
> redis_REPL_INFO=host.com
>
>
> thank you in advance
>
>
> Kind regards,
>
>
> Alex
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker failover failure

2015-07-01 Thread alex austin
I have now configured stonith-enabled=true. What device should I use for
fencing, given that it's a virtual machine and I don't have access to its
configuration? Would fence_pcmk do? If so, what parameters should I
configure for it to work properly?

This is my new config:


node dcwbpvmuas004.edc.nam.gm.com \

attributes standby=off

node dcwbpvmuas005.edc.nam.gm.com \

attributes standby=off

primitive ClusterIP IPaddr2 \

params ip=198.208.86.242 cidr_netmask=23 \

op monitor interval=1s timeout=20s \

op start interval=0 timeout=20s \

op stop interval=0 timeout=20s \

meta is-managed=true target-role=Started resource-stickiness=500

primitive pcmk-fencing stonith:fence_pcmk \

params pcmk_host_list="dcwbpvmuas004.edc.nam.gm.com
dcwbpvmuas005.edc.nam.gm.com" \

op monitor interval=10s \

meta target-role=Started

primitive redis redis \

meta target-role=Master is-managed=true \

op monitor interval=1s role=Master timeout=5s on-fail=restart

ms redis_clone redis \

meta notify=true is-managed=true ordered=false interleave=false
globally-unique=false target-role=Master migration-threshold=1

colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master

colocation ip-on-redis inf: ClusterIP redis_clone:Master

colocation pcmk-fencing-on-redis inf: pcmk-fencing redis_clone:Master

property cib-bootstrap-options: \

dc-version=1.1.11-97629de \

cluster-infrastructure="classic openais (with plugin)" \

expected-quorum-votes=2 \

stonith-enabled=true

property redis_replication: \

redis_REPL_INFO=dcwbpvmuas005.edc.nam.gm.com

On Wed, Jul 1, 2015 at 2:53 PM, Nekrasov, Alexander <
alexander.nekra...@emc.com> wrote:

> stonith-enabled=false
>
> this might be the issue. The way peer node death is resolved, the
> surviving node must call STONITH on the peer. If it's disabled it might not
> be able to resolve the event
>
>
>
> Alex
>
>
>
> *From:* alex austin [mailto:alexixa...@gmail.com]
> *Sent:* Wednesday, July 01, 2015 9:51 AM
> *To:* Users@clusterlabs.org
> *Subject:* Re: [ClusterLabs] Pacemaker failover failure
>
>
>
> So I noticed that if I kill redis on one node, it starts on the other, no
> problem, but if I actually kill pacemaker itself on one node, the other
> doesn't "sense" it so it doesn't fail over.
>
>
>
>
>
>
>
> On Wed, Jul 1, 2015 at 12:42 PM, alex austin  wrote:
>
> Hi all,
>
>
>
> I have configured a virtual ip and redis in master-slave with corosync
> pacemaker. If redis fails, then the failover is successful, and redis gets
> promoted on the other node. However if pacemaker itself fails on the active
> node, the failover is not performed. Is there anything I missed in the
> configuration?
>
>
>
> Here's my configuration (i have hashed the ip address out):
>
>
>
> node host1.com
>
> node host2.com
>
> primitive ClusterIP IPaddr2 \
>
> params ip=xxx.xxx.xxx.xxx cidr_netmask=23 \
>
> op monitor interval=1s timeout=20s \
>
> op start interval=0 timeout=20s \
>
> op stop interval=0 timeout=20s \
>
> meta is-managed=true target-role=Started resource-stickiness=500
>
> primitive redis redis \
>
> meta target-role=Master is-managed=true \
>
> op monitor interval=1s role=Master timeout=5s on-fail=restart
>
> ms redis_clone redis \
>
> meta notify=true is-managed=true ordered=false interleave=false
> globally-unique=false target-role=Master migration-threshold=1
>
> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
>
> colocation ip-on-redis inf: ClusterIP redis_clone:Master
>
> property cib-bootstrap-options: \
>
> dc-version=1.1.11-97629de \
>
> cluster-infrastructure="classic openais (with plugin)" \
>
> expected-quorum-votes=2 \
>
> stonith-enabled=false
>
> property redis_replication: \
>
> redis_REPL_INFO=host.com
>
>
>
> thank you in advance
>
>
>
> Kind regards,
>
>
>
> Alex
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker failover failure

2015-07-01 Thread Ken Gaillot
On 07/01/2015 08:57 AM, alex austin wrote:
> I have now configured stonith-enabled=true. What device should I use for
> fencing given the fact that it's a virtual machine but I don't have access
> to its configuration. would fence_pcmk do? if so, what parameters should I
> configure for it to work properly?

No, fence_pcmk is not for using in pacemaker, but for using in RHEL6's
CMAN to redirect its fencing requests to pacemaker.

For a virtual machine, ideally you'd use fence_virtd running on the
physical host, but I'm guessing from your comment that you can't do
that. Does whoever provides your VM also provide an API for controlling
it (starting/stopping/rebooting)?
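
If fence_virtd/fence_xvm were an option, the guest side would just be a
stonith primitive. A rough, untested sketch in crm syntax, with made-up
host-to-domain mappings (it also requires fence_virtd to be set up on the
physical host):

primitive vm-fencing stonith:fence_xvm \
        params pcmk_host_map="host1.com:host1-vm;host2.com:host2-vm" \
        op monitor interval=60s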

Regarding your original problem, it sounds like the surviving node
doesn't have quorum. What version of corosync are you using? If you're
using corosync 2, you need "two_node: 1" in corosync.conf, in addition
to configuring fencing in pacemaker.
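
For reference, with corosync 2 that is just the quorum section of
corosync.conf:

quorum {
    provider: corosync_votequorum
    two_node: 1
}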

> This is my new config:
> 
> 
> node dcwbpvmuas004.edc.nam.gm.com \
> 
> attributes standby=off
> 
> node dcwbpvmuas005.edc.nam.gm.com \
> 
> attributes standby=off
> 
> primitive ClusterIP IPaddr2 \
> 
> params ip=198.208.86.242 cidr_netmask=23 \
> 
> op monitor interval=1s timeout=20s \
> 
> op start interval=0 timeout=20s \
> 
> op stop interval=0 timeout=20s \
> 
> meta is-managed=true target-role=Started resource-stickiness=500
> 
> primitive pcmk-fencing stonith:fence_pcmk \
> 
> params pcmk_host_list="dcwbpvmuas004.edc.nam.gm.com
> dcwbpvmuas005.edc.nam.gm.com" \
> 
> op monitor interval=10s \
> 
> meta target-role=Started
> 
> primitive redis redis \
> 
> meta target-role=Master is-managed=true \
> 
> op monitor interval=1s role=Master timeout=5s on-fail=restart
> 
> ms redis_clone redis \
> 
> meta notify=true is-managed=true ordered=false interleave=false
> globally-unique=false target-role=Master migration-threshold=1
> 
> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
> 
> colocation ip-on-redis inf: ClusterIP redis_clone:Master
> 
> colocation pcmk-fencing-on-redis inf: pcmk-fencing redis_clone:Master
> 
> property cib-bootstrap-options: \
> 
> dc-version=1.1.11-97629de \
> 
> cluster-infrastructure="classic openais (with plugin)" \
> 
> expected-quorum-votes=2 \
> 
> stonith-enabled=true
> 
> property redis_replication: \
> 
> redis_REPL_INFO=dcwbpvmuas005.edc.nam.gm.com
> 
> On Wed, Jul 1, 2015 at 2:53 PM, Nekrasov, Alexander <
> alexander.nekra...@emc.com> wrote:
> 
>> stonith-enabled=false
>>
>> this might be the issue. The way peer node death is resolved, the
>> surviving node must call STONITH on the peer. If it's disabled it might not
>> be able to resolve the event
>>
>>
>>
>> Alex
>>
>>
>>
>> *From:* alex austin [mailto:alexixa...@gmail.com]
>> *Sent:* Wednesday, July 01, 2015 9:51 AM
>> *To:* Users@clusterlabs.org
>> *Subject:* Re: [ClusterLabs] Pacemaker failover failure
>>
>>
>>
>> So I noticed that if I kill redis on one node, it starts on the other, no
>> problem, but if I actually kill pacemaker itself on one node, the other
>> doesn't "sense" it so it doesn't fail over.
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Jul 1, 2015 at 12:42 PM, alex austin  wrote:
>>
>> Hi all,
>>
>>
>>
>> I have configured a virtual ip and redis in master-slave with corosync
>> pacemaker. If redis fails, then the failover is successful, and redis gets
>> promoted on the other node. However if pacemaker itself fails on the active
>> node, the failover is not performed. Is there anything I missed in the
>> configuration?
>>
>>
>>
>> Here's my configuration (i have hashed the ip address out):
>>
>>
>>
>> node host1.com
>>
>> node host2.com
>>
>> primitive ClusterIP IPaddr2 \
>>
>> params ip=xxx.xxx.xxx.xxx cidr_netmask=23 \
>>
>> op monitor interval=1s timeout=20s \
>>
>> op start interval=0 timeout=20s \
>>
>> op stop interval=0 timeout=20s \
>>
>> meta is-managed=true target-role=Started resource-stickiness=500
>>
>> primitive redis redis \
>>
>> meta target-role=Master is-managed=true \
>>
>> op monitor interval=1s role=Master timeout=5s on-fail=restart
>>
>> ms redis_clone redis \
>>
>> meta notify=

Re: [ClusterLabs] Pacemaker failover failure

2015-07-01 Thread alex austin
This is what crm_mon shows


Last updated: Wed Jul  1 10:35:40 2015

Last change: Wed Jul  1 09:52:46 2015

Stack: classic openais (with plugin)

Current DC: host2 - partition with quorum

Version: 1.1.11-97629de

2 Nodes configured, 2 expected votes

4 Resources configured



Online: [ host1 host2 ]


ClusterIP (ocf::heartbeat:IPaddr2): Started host2

 Master/Slave Set: redis_clone [redis]

 Masters: [ host2 ]

 Slaves: [ host1 ]

pcmk-fencing(stonith:fence_pcmk):   Started host2

On Wed, Jul 1, 2015 at 3:37 PM, alex austin  wrote:

> I am running version 1.4.7 of corosync
>
>
>
> On Wed, Jul 1, 2015 at 3:25 PM, Ken Gaillot  wrote:
>
>> On 07/01/2015 08:57 AM, alex austin wrote:
>> > I have now configured stonith-enabled=true. What device should I use for
>> > fencing given the fact that it's a virtual machine but I don't have
>> access
>> > to its configuration. would fence_pcmk do? if so, what parameters
>> should I
>> > configure for it to work properly?
>>
>> No, fence_pcmk is not for using in pacemaker, but for using in RHEL6's
>> CMAN to redirect its fencing requests to pacemaker.
>>
>> For a virtual machine, ideally you'd use fence_virtd running on the
>> physical host, but I'm guessing from your comment that you can't do
>> that. Does whoever provides your VM also provide an API for controlling
>> it (starting/stopping/rebooting)?
>>
>> Regarding your original problem, it sounds like the surviving node
>> doesn't have quorum. What version of corosync are you using? If you're
>> using corosync 2, you need "two_node: 1" in corosync.conf, in addition
>> to configuring fencing in pacemaker.
>>
>> > This is my new config:
>> >
>> >
>> > node dcwbpvmuas004.edc.nam.gm.com \
>> >
>> > attributes standby=off
>> >
>> > node dcwbpvmuas005.edc.nam.gm.com \
>> >
>> > attributes standby=off
>> >
>> > primitive ClusterIP IPaddr2 \
>> >
>> > params ip=198.208.86.242 cidr_netmask=23 \
>> >
>> > op monitor interval=1s timeout=20s \
>> >
>> > op start interval=0 timeout=20s \
>> >
>> > op stop interval=0 timeout=20s \
>> >
>> > meta is-managed=true target-role=Started resource-stickiness=500
>> >
>> > primitive pcmk-fencing stonith:fence_pcmk \
>> >
>> > params pcmk_host_list="dcwbpvmuas004.edc.nam.gm.com
>> > dcwbpvmuas005.edc.nam.gm.com" \
>> >
>> > op monitor interval=10s \
>> >
>> > meta target-role=Started
>> >
>> > primitive redis redis \
>> >
>> > meta target-role=Master is-managed=true \
>> >
>> > op monitor interval=1s role=Master timeout=5s on-fail=restart
>> >
>> > ms redis_clone redis \
>> >
>> > meta notify=true is-managed=true ordered=false interleave=false
>> > globally-unique=false target-role=Master migration-threshold=1
>> >
>> > colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
>> >
>> > colocation ip-on-redis inf: ClusterIP redis_clone:Master
>> >
>> > colocation pcmk-fencing-on-redis inf: pcmk-fencing redis_clone:Master
>> >
>> > property cib-bootstrap-options: \
>> >
>> > dc-version=1.1.11-97629de \
>> >
>> > cluster-infrastructure="classic openais (with plugin)" \
>> >
>> > expected-quorum-votes=2 \
>> >
>> > stonith-enabled=true
>> >
>> > property redis_replication: \
>> >
>> > redis_REPL_INFO=dcwbpvmuas005.edc.nam.gm.com
>> >
>> > On Wed, Jul 1, 2015 at 2:53 PM, Nekrasov, Alexander <
>> > alexander.nekra...@emc.com> wrote:
>> >
>> >> stonith-enabled=false
>> >>
>> >> this might be the issue. The way peer node death is resolved, the
>> >> surviving node must call STONITH on the peer. If it's disabled it
>> might not
>> >> be able to resolve the event
>> >>
>> >>
>> >>
>> >> Alex
>> >>
>> >>
>> >>
>> >> *From:* alex austin [mailto:alexixa...@gmail.com]
>> >> *Sent:* Wednesday, July 01, 2015 9:51 AM
>> >> *To:* Users@clusterlabs.org
>> >> *Subject:* Re: [ClusterLabs] Pacemaker failover failure
>> >>
>> 

Re: [ClusterLabs] Pacemaker failover failure

2015-07-01 Thread alex austin
I am running version 1.4.7 of corosync



On Wed, Jul 1, 2015 at 3:25 PM, Ken Gaillot  wrote:

> On 07/01/2015 08:57 AM, alex austin wrote:
> > I have now configured stonith-enabled=true. What device should I use for
> > fencing given the fact that it's a virtual machine but I don't have
> access
> > to its configuration. would fence_pcmk do? if so, what parameters should
> I
> > configure for it to work properly?
>
> No, fence_pcmk is not for using in pacemaker, but for using in RHEL6's
> CMAN to redirect its fencing requests to pacemaker.
>
> For a virtual machine, ideally you'd use fence_virtd running on the
> physical host, but I'm guessing from your comment that you can't do
> that. Does whoever provides your VM also provide an API for controlling
> it (starting/stopping/rebooting)?
>
> Regarding your original problem, it sounds like the surviving node
> doesn't have quorum. What version of corosync are you using? If you're
> using corosync 2, you need "two_node: 1" in corosync.conf, in addition
> to configuring fencing in pacemaker.
>
> > This is my new config:
> >
> >
> > node dcwbpvmuas004.edc.nam.gm.com \
> >
> > attributes standby=off
> >
> > node dcwbpvmuas005.edc.nam.gm.com \
> >
> > attributes standby=off
> >
> > primitive ClusterIP IPaddr2 \
> >
> > params ip=198.208.86.242 cidr_netmask=23 \
> >
> > op monitor interval=1s timeout=20s \
> >
> > op start interval=0 timeout=20s \
> >
> > op stop interval=0 timeout=20s \
> >
> > meta is-managed=true target-role=Started resource-stickiness=500
> >
> > primitive pcmk-fencing stonith:fence_pcmk \
> >
> > params pcmk_host_list="dcwbpvmuas004.edc.nam.gm.com
> > dcwbpvmuas005.edc.nam.gm.com" \
> >
> > op monitor interval=10s \
> >
> > meta target-role=Started
> >
> > primitive redis redis \
> >
> > meta target-role=Master is-managed=true \
> >
> > op monitor interval=1s role=Master timeout=5s on-fail=restart
> >
> > ms redis_clone redis \
> >
> > meta notify=true is-managed=true ordered=false interleave=false
> > globally-unique=false target-role=Master migration-threshold=1
> >
> > colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
> >
> > colocation ip-on-redis inf: ClusterIP redis_clone:Master
> >
> > colocation pcmk-fencing-on-redis inf: pcmk-fencing redis_clone:Master
> >
> > property cib-bootstrap-options: \
> >
> > dc-version=1.1.11-97629de \
> >
> > cluster-infrastructure="classic openais (with plugin)" \
> >
> > expected-quorum-votes=2 \
> >
> > stonith-enabled=true
> >
> > property redis_replication: \
> >
> > redis_REPL_INFO=dcwbpvmuas005.edc.nam.gm.com
> >
> > On Wed, Jul 1, 2015 at 2:53 PM, Nekrasov, Alexander <
> > alexander.nekra...@emc.com> wrote:
> >
> >> stonith-enabled=false
> >>
> >> this might be the issue. The way peer node death is resolved, the
> >> surviving node must call STONITH on the peer. If it's disabled it might
> not
> >> be able to resolve the event
> >>
> >>
> >>
> >> Alex
> >>
> >>
> >>
> >> *From:* alex austin [mailto:alexixa...@gmail.com]
> >> *Sent:* Wednesday, July 01, 2015 9:51 AM
> >> *To:* Users@clusterlabs.org
> >> *Subject:* Re: [ClusterLabs] Pacemaker failover failure
> >>
> >>
> >>
> >> So I noticed that if I kill redis on one node, it starts on the other,
> no
> >> problem, but if I actually kill pacemaker itself on one node, the other
> >> doesn't "sense" it so it doesn't fail over.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Jul 1, 2015 at 12:42 PM, alex austin 
> wrote:
> >>
> >> Hi all,
> >>
> >>
> >>
> >> I have configured a virtual ip and redis in master-slave with corosync
> >> pacemaker. If redis fails, then the failover is successful, and redis
> gets
> >> promoted on the other node. However if pacemaker itself fails on the
> active
> >> node, the failover is not performed. Is there anything I missed in the
> >> configuration?
> >>
> >>
> >>
> >> Here's my con

Re: [ClusterLabs] Pacemaker failover failure

2015-07-01 Thread Ken Gaillot
On 07/01/2015 09:39 AM, alex austin wrote:
> This is what crm_mon shows
> 
> 
> Last updated: Wed Jul  1 10:35:40 2015
> 
> Last change: Wed Jul  1 09:52:46 2015
> 
> Stack: classic openais (with plugin)
> 
> Current DC: host2 - partition with quorum
> 
> Version: 1.1.11-97629de
> 
> 2 Nodes configured, 2 expected votes
> 
> 4 Resources configured
> 
> 
> 
> Online: [ host1 host2 ]
> 
> 
> ClusterIP (ocf::heartbeat:IPaddr2): Started host2
> 
>  Master/Slave Set: redis_clone [redis]
> 
>  Masters: [ host2 ]
> 
>  Slaves: [ host1 ]
> 
> pcmk-fencing(stonith:fence_pcmk):   Started host2
> 
> On Wed, Jul 1, 2015 at 3:37 PM, alex austin  wrote:
> 
>> I am running version 1.4.7 of corosync

If you can't upgrade to corosync 2 (which has many improvements), you'll
need to set the no-quorum-policy=ignore cluster option.

Proper fencing is necessary to avoid a split-brain situation, which can
corrupt your data.
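
With the crm shell you are using, setting that option would be something
like:

crm configure property no-quorum-policy=ignore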

>> On Wed, Jul 1, 2015 at 3:25 PM, Ken Gaillot  wrote:
>>
>>> On 07/01/2015 08:57 AM, alex austin wrote:
>>>> I have now configured stonith-enabled=true. What device should I use for
>>>> fencing given the fact that it's a virtual machine but I don't have
>>> access
>>>> to its configuration. would fence_pcmk do? if so, what parameters
>>> should I
>>>> configure for it to work properly?
>>>
>>> No, fence_pcmk is not for using in pacemaker, but for using in RHEL6's
>>> CMAN to redirect its fencing requests to pacemaker.
>>>
>>> For a virtual machine, ideally you'd use fence_virtd running on the
>>> physical host, but I'm guessing from your comment that you can't do
>>> that. Does whoever provides your VM also provide an API for controlling
>>> it (starting/stopping/rebooting)?
>>>
>>> Regarding your original problem, it sounds like the surviving node
>>> doesn't have quorum. What version of corosync are you using? If you're
>>> using corosync 2, you need "two_node: 1" in corosync.conf, in addition
>>> to configuring fencing in pacemaker.
>>>
>>>> This is my new config:
>>>>
>>>>
>>>> node dcwbpvmuas004.edc.nam.gm.com \
>>>>
>>>> attributes standby=off
>>>>
>>>> node dcwbpvmuas005.edc.nam.gm.com \
>>>>
>>>> attributes standby=off
>>>>
>>>> primitive ClusterIP IPaddr2 \
>>>>
>>>> params ip=198.208.86.242 cidr_netmask=23 \
>>>>
>>>> op monitor interval=1s timeout=20s \
>>>>
>>>> op start interval=0 timeout=20s \
>>>>
>>>> op stop interval=0 timeout=20s \
>>>>
>>>> meta is-managed=true target-role=Started resource-stickiness=500
>>>>
>>>> primitive pcmk-fencing stonith:fence_pcmk \
>>>>
>>>> params pcmk_host_list="dcwbpvmuas004.edc.nam.gm.com
>>>> dcwbpvmuas005.edc.nam.gm.com" \
>>>>
>>>> op monitor interval=10s \
>>>>
>>>> meta target-role=Started
>>>>
>>>> primitive redis redis \
>>>>
>>>> meta target-role=Master is-managed=true \
>>>>
>>>> op monitor interval=1s role=Master timeout=5s on-fail=restart
>>>>
>>>> ms redis_clone redis \
>>>>
>>>> meta notify=true is-managed=true ordered=false interleave=false
>>>> globally-unique=false target-role=Master migration-threshold=1
>>>>
>>>> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
>>>>
>>>> colocation ip-on-redis inf: ClusterIP redis_clone:Master
>>>>
>>>> colocation pcmk-fencing-on-redis inf: pcmk-fencing redis_clone:Master
>>>>
>>>> property cib-bootstrap-options: \
>>>>
>>>> dc-version=1.1.11-97629de \
>>>>
>>>> cluster-infrastructure="classic openais (with plugin)" \
>>>>
>>>> expected-quorum-votes=2 \
>>>>
>>>> stonith-enabled=true
>>>>
>>>> property redis_replication: \
>>>>
>>>> redis_REPL_INFO=dcwbpvmuas005.edc.nam.gm.com
>>>>
>>>> On Wed, Jul 1, 2015 at 2:53 PM, Nekrasov, Alexander <
>>>> alexander.nekra...@emc.com

Re: [ClusterLabs] Pacemaker failover failure

2015-07-02 Thread alex austin
Thank you!

However, what is proper fencing in this situation?

Kind Regards,

Alex

On Wed, Jul 1, 2015 at 11:30 PM, Ken Gaillot  wrote:

> On 07/01/2015 09:39 AM, alex austin wrote:
> > This is what crm_mon shows
> >
> >
> > Last updated: Wed Jul  1 10:35:40 2015
> >
> > Last change: Wed Jul  1 09:52:46 2015
> >
> > Stack: classic openais (with plugin)
> >
> > Current DC: host2 - partition with quorum
> >
> > Version: 1.1.11-97629de
> >
> > 2 Nodes configured, 2 expected votes
> >
> > 4 Resources configured
> >
> >
> >
> > Online: [ host1 host2 ]
> >
> >
> > ClusterIP (ocf::heartbeat:IPaddr2): Started host2
> >
> >  Master/Slave Set: redis_clone [redis]
> >
> >  Masters: [ host2 ]
> >
> >  Slaves: [ host1 ]
> >
> > pcmk-fencing(stonith:fence_pcmk):   Started host2
> >
> > On Wed, Jul 1, 2015 at 3:37 PM, alex austin 
> wrote:
> >
> >> I am running version 1.4.7 of corosync
>
> If you can't upgrade to corosync 2 (which has many improvements), you'll
> need to set the no-quorum-policy=ignore cluster option.
>
> Proper fencing is necessary to avoid a split-brain situation, which can
> corrupt your data.
>
> >> On Wed, Jul 1, 2015 at 3:25 PM, Ken Gaillot 
> wrote:
> >>
> >>> On 07/01/2015 08:57 AM, alex austin wrote:
> >>>> I have now configured stonith-enabled=true. What device should I use
> for
> >>>> fencing given the fact that it's a virtual machine but I don't have
> >>> access
> >>>> to its configuration. would fence_pcmk do? if so, what parameters
> >>> should I
> >>>> configure for it to work properly?
> >>>
> >>> No, fence_pcmk is not for using in pacemaker, but for using in RHEL6's
> >>> CMAN to redirect its fencing requests to pacemaker.
> >>>
> >>> For a virtual machine, ideally you'd use fence_virtd running on the
> >>> physical host, but I'm guessing from your comment that you can't do
> >>> that. Does whoever provides your VM also provide an API for controlling
> >>> it (starting/stopping/rebooting)?
> >>>
> >>> Regarding your original problem, it sounds like the surviving node
> >>> doesn't have quorum. What version of corosync are you using? If you're
> >>> using corosync 2, you need "two_node: 1" in corosync.conf, in addition
> >>> to configuring fencing in pacemaker.
> >>>
> >>>> This is my new config:
> >>>>
> >>>>
> >>>> node dcwbpvmuas004.edc.nam.gm.com \
> >>>>
> >>>> attributes standby=off
> >>>>
> >>>> node dcwbpvmuas005.edc.nam.gm.com \
> >>>>
> >>>> attributes standby=off
> >>>>
> >>>> primitive ClusterIP IPaddr2 \
> >>>>
> >>>> params ip=198.208.86.242 cidr_netmask=23 \
> >>>>
> >>>> op monitor interval=1s timeout=20s \
> >>>>
> >>>> op start interval=0 timeout=20s \
> >>>>
> >>>> op stop interval=0 timeout=20s \
> >>>>
> >>>> meta is-managed=true target-role=Started
> resource-stickiness=500
> >>>>
> >>>> primitive pcmk-fencing stonith:fence_pcmk \
> >>>>
> >>>> params pcmk_host_list="dcwbpvmuas004.edc.nam.gm.com
> >>>> dcwbpvmuas005.edc.nam.gm.com" \
> >>>>
> >>>> op monitor interval=10s \
> >>>>
> >>>> meta target-role=Started
> >>>>
> >>>> primitive redis redis \
> >>>>
> >>>> meta target-role=Master is-managed=true \
> >>>>
> >>>> op monitor interval=1s role=Master timeout=5s on-fail=restart
> >>>>
> >>>> ms redis_clone redis \
> >>>>
> >>>> meta notify=true is-managed=true ordered=false
> interleave=false
> >>>> globally-unique=false target-role=Master migration-threshold=1
> >>>>
> >>>> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
> >>>>
> >>>> colocation ip-on-redis inf: ClusterIP redis_clone:Master
> >>>>
> >>>

Re: [ClusterLabs] Pacemaker failover failure

2015-07-02 Thread Digimer
> <http://dcwbpvmuas005.edc.nam.gm.com>" \
> >>>>
> >>>> op monitor interval=10s \
> >>>>
> >>>> meta target-role=Started
> >>>>
> >>>> primitive redis redis \
> >>>>
> >>>> meta target-role=Master is-managed=true \
> >>>>
> >>>> op monitor interval=1s role=Master timeout=5s
> on-fail=restart
> >>>>
> >>>> ms redis_clone redis \
> >>>>
> >>>> meta notify=true is-managed=true ordered=false
> interleave=false
> >>>> globally-unique=false target-role=Master migration-threshold=1
> >>>>
> >>>> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
> >>>>
> >>>> colocation ip-on-redis inf: ClusterIP redis_clone:Master
> >>>>
> >>>> colocation pcmk-fencing-on-redis inf: pcmk-fencing
> redis_clone:Master
> >>>>
> >>>> property cib-bootstrap-options: \
> >>>>
> >>>> dc-version=1.1.11-97629de \
> >>>>
> >>>> cluster-infrastructure="classic openais (with plugin)" \
> >>>>
> >>>> expected-quorum-votes=2 \
> >>>>
> >>>> stonith-enabled=true
> >>>>
> >>>> property redis_replication: \
> >>>>
> >>>> redis_REPL_INFO=dcwbpvmuas005.edc.nam.gm.com
> <http://dcwbpvmuas005.edc.nam.gm.com>
> >>>>
> >>>> On Wed, Jul 1, 2015 at 2:53 PM, Nekrasov, Alexander <
> >>>> alexander.nekra...@emc.com <mailto:alexander.nekra...@emc.com>>
> wrote:
> >>>>
> >>>>> stonith-enabled=false
> >>>>>
> >>>>> this might be the issue. The way peer node death is resolved, the
> >>>>> surviving node must call STONITH on the peer. If it's disabled it
> >>> might not
> >>>>> be able to resolve the event
> >>>>>
> >>>>>
> >>>>>
> >>>>> Alex
> >>>>>
> >>>>>
> >>>>>
> >>>>> *From:* alex austin [mailto:alexixa...@gmail.com
> <mailto:alexixa...@gmail.com>]
> >>>>> *Sent:* Wednesday, July 01, 2015 9:51 AM
> >>>>> *To:* Users@clusterlabs.org <mailto:Users@clusterlabs.org>
> >>>>> *Subject:* Re: [ClusterLabs] Pacemaker failover failure
> >>>>>
> >>>>>
> >>>>>
> >>>>> So I noticed that if I kill redis on one node, it starts on
> the other,
> >>> no
> >>>>> problem, but if I actually kill pacemaker itself on one node,
> the other
> >>>>> doesn't "sense" it so it doesn't fail over.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Jul 1, 2015 at 12:42 PM, alex austin
> mailto:alexixa...@gmail.com>>
> >>> wrote:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>>
> >>>>>
> >>>>> I have configured a virtual ip and redis in master-slave with
> corosync
> >>>>> pacemaker. If redis fails, then the failover is successful,
> and redis
> >>> gets
> >>>>> promoted on the other node. However if pacemaker itself fails
> on the
> >>> active
> >>>>> node, the failover is not performed. Is there anything I
> missed in the
> >>>>> configuration?
> >>>>>
> >>>>>
> >>>>>
> >>>>> Here's my configuration (i have hashed the ip address out):
> >>>>>
> >>>>>
> >>>>>
> >>>>> node host1.com <http://host1.com>
> >>>>>
> >>>>> node host2.com <http://host2.com>
> >>>>>
> >>>>> primitive ClusterIP IPaddr2 \
> >>>>>
> >>>>> params ip=xxx.xxx.xxx.xxx cidr_netmask=23 \
> >>>>>
> >>>>> op monitor interval=1s timeout=20s \
> >>>>>
> >>>>> op start interval=0 timeout=20s \
> >>>>>
> >>>>> op stop interval=0 timeout=20s \
> >>>>>
> >>>>> meta is-managed=true target-role=Started resource-stickiness=500
> >>>>>
> >>>>> primitive redis redis \
> >>>>>
> >>>>> meta target-role=Master is-managed=true \
> >>>>>
> >>>>> op monitor interval=1s role=Master timeout=5s on-fail=restart
> >>>>>
> >>>>> ms redis_clone redis \
> >>>>>
> >>>>> meta notify=true is-managed=true ordered=false interleave=false
> >>>>> globally-unique=false target-role=Master migration-threshold=1
> >>>>>
> >>>>> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
> >>>>>
> >>>>> colocation ip-on-redis inf: ClusterIP redis_clone:Master
> >>>>>
> >>>>> property cib-bootstrap-options: \
> >>>>>
> >>>>> dc-version=1.1.11-97629de \
> >>>>>
> >>>>> cluster-infrastructure="classic openais (with plugin)" \
> >>>>>
> >>>>> expected-quorum-votes=2 \
> >>>>>
> >>>>> stonith-enabled=false
> >>>>>
> >>>>> property redis_replication: \
> >>>>>
> >>>>> redis_REPL_INFO=host.com <http://host.com>
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


  1   2   3   4   5   6   7   8   >