[ClusterLabs] Q: About a false negative of storage_mon

2022-08-02 Thread
Hi,

Since O_DIRECT is not specified in open() [1], the read may be served from the
buffer cache and result in a false negative. I fear that this risk increases in
environments with a large buffer cache that run disk-reading applications
such as databases.

So I think it is better to specify O_RDONLY|O_DIRECT. What do you think?
(In that case, the lseek() processing becomes unnecessary.)

# I am ready to create a patch that works with O_DIRECT. Alternatively, I would
# also be fine with a change that adds a new O_DIRECT inspection mode
# (i.e. a new storage_mon option) while keeping the current inspection process.

[1] 
https://github.com/ClusterLabs/resource-agents/blob/main/tools/storage_mon.c#L47-L90
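
To illustrate the idea, here is a minimal sketch of an O_DIRECT read probe.
This is illustration only, not the actual patch: the device path is a
placeholder, 512 bytes is an assumed logical block size (real code should
query it), and the rest of storage_mon's logic is not shown.

#define _GNU_SOURCE             /* O_DIRECT is a Linux extension */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "/dev/sdX"; /* placeholder device */
    const size_t blksz = 512;   /* assumed logical block size (cf. BLKSSZGET) */
    void *buf = NULL;
    int rc = 1;
    int fd;

    /* O_DIRECT bypasses the buffer cache, so the read must be answered by
     * the device itself; reading from offset 0 also makes lseek() unnecessary. */
    fd = open(dev, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* O_DIRECT requires the buffer address, length and file offset to be
     * aligned to the logical block size. */
    if (posix_memalign(&buf, blksz, blksz) != 0) {
        perror("posix_memalign");
        close(fd);
        return 1;
    }

    if (read(fd, buf, blksz) == (ssize_t) blksz) {
        rc = 0;                 /* the device answered: treat as healthy */
    } else {
        perror("read");         /* timeout/EIO handling omitted here */
    }

    free(buf);
    close(fd);
    return rc;
}

Because the read cannot be satisfied from the buffer cache, a cached-but-broken
device can no longer produce a false "healthy" result.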

Best Regards,
Kazunori INOUE





Re: [ClusterLabs] Q: Is there any plan for pcs to support corosync-notifyd?

2021-03-24 Thread
On Thu, Mar 18, 2021 at 6:31 PM Jehan-Guillaume de Rorthais
 wrote:
>
> On Thu, 18 Mar 2021 17:29:59 +0900
> 井上和徳  wrote:
>
> > On Tue, Mar 16, 2021 at 10:23 PM Jehan-Guillaume de Rorthais
> >  wrote:
> > >
> > > > On Tue, 16 Mar 2021, 09:58 井上和徳,  wrote:
> > > >
> > > > > Hi!
> > > > >
> > > > > Cluster (corosync and pacemaker) can be started with pcs,
> > > > > but corosync-notifyd needs to be started separately with systemctl,
> > > > > which is not easy to use.
> > >
> > > Maybe you can add to the [Install] section of corosync-notifyd a 
> > > dependency
> > > with corosync? Eg.:
> > >
> > >   WantedBy=corosync.service
> > >
> > > (use systemctl edit corosync-notifyd)
> > >
> > > Then re-enable the service (without starting it by hands).
> >
> > I appreciate your proposal. How to use WantedBy was helpful!
> > However, since I want to start the cluster (corosync, pacemaker) only
> > manually, it is unacceptable to start corosync along with corosync-notifyd 
> > at
> > OS boot time.
>
> This is perfectly fine.
>
> I suppose corosync-notifyd is starting because the default service config has:
>
>   [Install]
>   WantedBy=multi-user.target
>
> If you want corosync-notifyd to be enabled ONLY on corosync startup, but not on
> system startup, you have to remove this startup dependency on the "multi-user"
> target. So, your drop-in setup of corosync-notifyd should be (remove leading
> spaces):
>
>   cat <<EOF
>   [Install]
>   WantedBy=
>   WantedBy=corosync.service
>   EOF
>

Oh, that makes sense!
With this setting, it looks like I can achieve what I want.
Thank you.

> The first empty WantedBy= removes any pre-existing dependency.
>
> Then disable/enable corosync-notifyd again to install the new dependency and
> remove old ones. It should only create ONE link in
> "/etc/systemd/system/corosync.service.wants/corosync-notifyd.service",
> but NOT in "/etc/systemd/system/multi-user.target.wants/".
>
> Regards,




Re: [ClusterLabs] Q: Is there any plan for pcs to support corosync-notifyd?

2021-03-18 Thread
On Tue, Mar 16, 2021 at 11:24 PM Ken Gaillot  wrote:
>
> On Tue, 2021-03-16 at 13:24 +0100, damiano giuliani wrote:
> > Could be an idea to  let your cluster start it on all your nodes
> >
> > Br
>
> That does sound like a good idea. That way the cluster can monitor it
> (with the equivalent of systemctl status) and restart it if needed, and
> constraints can be used if some other resource depends on it. Something
> like:
>
> pcs resource create notifyd systemd:corosync-notifyd clone

I appreciate your proposal.

With this method, the traps at corosync startup are mostly not sent. I will ask
the user for their opinion.
# pcs cluster start --all
rhel83-1: Starting Cluster...
rhel83-2: Starting Cluster...
rhel83-3: Starting Cluster...

Mar 18 12:48:47 cent83 snmptrapd[32754]: (snip) rhel83-2 (snip)
STRING: "quorate"
Mar 18 12:48:47 cent83 snmptrapd[32754]: (snip) rhel83-3 (snip)
STRING: "quorate"
Mar 18 12:48:47 cent83 snmptrapd[32754]: (snip) rhel83-1 (snip)
STRING: "quorate"

For your information, when corosync and corosync-notifyd are started at the
same time, the following traps are sent.
# for i in rhel83-{1..3}; do ssh -f $i 'systemctl start corosync-notifyd'; done

Mar 18 16:24:07 cent83 snmptrapd[33312]: (snip) rhel83-2 (snip)
STRING: "operational"
Mar 18 16:24:07 cent83 snmptrapd[33312]: (snip) rhel83-2 (snip)
STRING: "operational"
Mar 18 16:24:07 cent83 snmptrapd[33312]: (snip) rhel83-3 (snip)
STRING: "operational"
Mar 18 16:24:07 cent83 snmptrapd[33312]: (snip) rhel83-3 (snip)
STRING: "operational"
Mar 18 16:24:07 cent83 snmptrapd[33312]: (snip) rhel83-3 (snip)
STRING: "operational"
Mar 18 16:24:07 cent83 snmptrapd[33312]: (snip) rhel83-3 (snip)
STRING: "operational"
Mar 18 16:24:08 cent83 snmptrapd[33312]: (snip) rhel83-1 (snip)
STRING: "operational"
Mar 18 16:24:08 cent83 snmptrapd[33312]: (snip) rhel83-1 (snip)
STRING: "operational"
Mar 18 16:24:08 cent83 snmptrapd[33312]: (snip) rhel83-1 (snip)
STRING: "operational"
Mar 18 16:24:08 cent83 snmptrapd[33312]: (snip) rhel83-1 (snip)
STRING: "operational"
Mar 18 16:24:08 cent83 snmptrapd[33312]: (snip) rhel83-2 (snip)
STRING: "operational"
Mar 18 16:24:08 cent83 snmptrapd[33312]: (snip) rhel83-2 (snip)
STRING: "operational"
Mar 18 16:24:11 cent83 snmptrapd[33312]: (snip) rhel83-1 (snip) STRING: "joined"
Mar 18 16:24:11 cent83 snmptrapd[33312]: (snip) rhel83-1 (snip)
STRING: "quorate"
Mar 18 16:24:11 cent83 snmptrapd[33312]: (snip) rhel83-1 (snip) STRING: "joined"
Mar 18 16:24:11 cent83 snmptrapd[33312]: (snip) rhel83-3 (snip) STRING: "joined"
Mar 18 16:24:11 cent83 snmptrapd[33312]: (snip) rhel83-3 (snip)
STRING: "quorate"
Mar 18 16:24:11 cent83 snmptrapd[33312]: (snip) rhel83-3 (snip) STRING: "joined"
Mar 18 16:24:11 cent83 snmptrapd[33312]: (snip) rhel83-2 (snip) STRING: "joined"
Mar 18 16:24:11 cent83 snmptrapd[33312]: (snip) rhel83-2 (snip)
STRING: "quorate"
Mar 18 16:24:11 cent83 snmptrapd[33312]: (snip) rhel83-2 (snip) STRING: "joined"

> > On Tue, 16 Mar 2021, 09:58 井上和徳,  wrote:
> > > Hi!
> > >
> > > Cluster (corosync and pacemaker) can be started with pcs,
> > > but corosync-notifyd needs to be started separately with systemctl,
> > > which is not easy to use.
> > >
> > > # pcs cluster start --all
> > > rhel83-1: Starting Cluster...
> > > rhel83-2: Starting Cluster...
> > > rhel83-3: Starting Cluster...
> > > # ssh rhel83-1 systemctl start corosync-notifyd
> > > # ssh rhel83-2 systemctl start corosync-notifyd
> > > # ssh rhel83-3 systemctl start corosync-notifyd
> > >
> > > Is there any plan for pcs to support corosync-notifyd?
> > >
> > > Regards,
> > > Kazunori INOUE
> --
> Ken Gaillot 




Re: [ClusterLabs] Q: Is there any plan for pcs to support corosync-notifyd?

2021-03-18 Thread
On Tue, Mar 16, 2021 at 10:23 PM Jehan-Guillaume de Rorthais
 wrote:
>
> > On Tue, 16 Mar 2021, 09:58 井上和徳,  wrote:
> >
> > > Hi!
> > >
> > > Cluster (corosync and pacemaker) can be started with pcs,
> > > but corosync-notifyd needs to be started separately with systemctl,
> > > which is not easy to use.
>
> Maybe you can add to the [Install] section of corosync-notifyd a dependency
> with corosync? Eg.:
>
>   WantedBy=corosync.service
>
> (use systemctl edit corosync-notifyd)
>
> Then re-enable the service (without starting it by hands).

I appreciate your proposal. How to use WantedBy was helpful!
However, since I want to start the cluster (corosync, pacemaker) only manually,
it is unacceptable to start corosync along with corosync-notifyd at OS
boot time.

Disabling the dependency (Requires=corosync.service) on corosync-notifyd.service
will prevent corosync from starting at OS boot, but will cause corosync-notifyd
to fail, which may confuse users.
# systemctl reboot
 :
# grep "notifyd.*error" /var/log/messages
Mar 18 12:09:54 rhel83-1  notifyd[949]:[error] Failed to
initialize the cmap API. Error 2
Mar 18 12:09:54 rhel83-1  notifyd[978]:[error] Failed to
initialize the cmap API. Error 2
Mar 18 12:09:55 rhel83-1  notifyd[1014]:[error] Failed to
initialize the cmap API. Error 2
Mar 18 12:09:55 rhel83-1  notifyd[1021]:[error] Failed to
initialize the cmap API. Error 2
Mar 18 12:09:55 rhel83-1  notifyd[1025]:[error] Failed to
initialize the cmap API. Error 2

Regards,
Kazunori INOUE




[ClusterLabs] Q: Is there any plan for pcs to support corosync-notifyd?

2021-03-16 Thread
Hi!

Cluster (corosync and pacemaker) can be started with pcs,
but corosync-notifyd needs to be started separately with systemctl,
which is not easy to use.

# pcs cluster start --all
rhel83-1: Starting Cluster...
rhel83-2: Starting Cluster...
rhel83-3: Starting Cluster...
# ssh rhel83-1 systemctl start corosync-notifyd
# ssh rhel83-2 systemctl start corosync-notifyd
# ssh rhel83-3 systemctl start corosync-notifyd

Is there any plan for pcs to support corosync-notifyd?

Regards,
Kazunori INOUE




Re: [ClusterLabs] Q: Starting from pcs-0.10.6-4.el8, logs of pcsd are now output to syslog

2021-01-07 Thread
Hi,

Thanks for your reply.
I've backported ddb0d3f to pcs-0.10.6-4.el8 and confirmed that this
bug is fixed.

Thanks,
Kazunori INOUE

On Fri, Jan 8, 2021 at 1:49 AM Tomas Jelinek  wrote:
>
> Hi,
>
> It took us some time to figure this out, sorry about that.
>
> The behavior you see is not intended, it is a bug. The bug originates in
> commit 966959ac54d80c4cdeeb0fac40dc7ea60c1a0a82, more specifically in
> this line in pcs/run.py:
> from pcs.app import main as cli
>
> The pcs/app.py file is responsible for starting pcs CLI. The pcs/run.py
> file is responsible for starting pcs daemon, pcs CLI and pcs SNMP agent
> in a unified way and it serves as an entry point. Importing pcs.app
> caused logging.basicConfig(), which is called in pcs/app.py, to be
> executed when starting pcs daemon and pcs SNMP agent. This
> unintentionally configured loggers in pcs daemon and pcs SNMP agent to
> log into stderr and those logs then got propagated to system log and
> /var/log/messages.
>
> To fix the bug, logging.basicConfig() should be removed from pcs/app.py,
> the line is actually not needed at all. The fix is available upstream in
> commit ddb0d3fed3273181356cd638d724b891ecd78263.
>
>
> Regards,
> Tomas
>
>
> On 23. 12. 20 at 5:25, 井上和徳 wrote:
> > Hi!
> >
> > Is it intended behavior that pcsd (pcs-0.10.6-4.el8) outputs logs to syslog?
> > If it is intended, which change/commit is it due to?
> >
> > # rpm -q pcs
> > pcs-0.10.6-4.el8.x86_64
> > #
> > # cat /var/log/pcsd/pcsd.log
> > I, [2020-12-23T13:17:36.748 #00010] INFO -- : Starting Daemons
> > I, [2020-12-23T13:17:36.749 #00010] INFO -- : Running:
> > /usr/sbin/pcs cluster start
> > I, [2020-12-23T13:17:36.749 #00010] INFO -- : CIB USER: hacluster, 
> > groups:
> > I, [2020-12-23T13:17:39.749 #00010] INFO -- : Return Value: 0
> > I, [2020-12-23T13:17:39.749 #0] INFO -- : 200 GET
> > /remote/cluster_start (192.168.122.247) 3243.41ms
> > I, [2020-12-23T13:18:47.049 #0] INFO -- : 200 GET
> > /remote/get_configs?cluster_name=my_cluster (192.168.122.140) 4.05ms
> > I, [2020-12-23T13:18:47.060 #00011] INFO -- : Config files sync started
> > I, [2020-12-23T13:18:47.061 #00011] INFO -- : SRWT Node: rhel83-2
> > Request: get_configs
> > I, [2020-12-23T13:18:47.061 #00011] INFO -- : Connecting to:
> > https://192.168.122.140:2224/remote/get_configs?cluster_name=my_cluster
> > I, [2020-12-23T13:18:47.061 #00011] INFO -- : SRWT Node: rhel83-1
> > Request: get_configs
> > I, [2020-12-23T13:18:47.061 #00011] INFO -- : Connecting to:
> > https://192.168.122.247:2224/remote/get_configs?cluster_name=my_cluster
> > I, [2020-12-23T13:18:47.061 #00011] INFO -- : Config files sync finished
> > #
> > # grep pcs /var/log/messages
> > Dec 23 13:17:39 rhel83-2 
> > pcsd[600350]:INFO:pcs.daemon:Starting Daemons
> > Dec 23 13:17:39 rhel83-2 
> > pcsd[600350]:INFO:pcs.daemon:Running: /usr/sbin/pcs cluster start
> > Dec 23 13:17:39 rhel83-2 
> > pcsd[600350]:INFO:pcs.daemon:CIB USER: hacluster, groups:
> > Dec 23 13:17:39 rhel83-2 
> > pcsd[600350]:INFO:pcs.daemon:Return Value: 0
> > Dec 23 13:17:39 rhel83-2 
> > pcsd[600350]:INFO:tornado.access:200 GET /remote/cluster_start
> > (192.168.122.247) 3243.41ms
> > Dec 23 13:18:47 rhel83-2 
> > pcsd[600350]:INFO:tornado.access:200 GET
> > /remote/get_configs?cluster_name=my_cluster (192.168.122.140) 4.05ms
> > Dec 23 13:18:47 rhel83-2 
> > pcsd[600350]:INFO:pcs.daemon:Config files sync started
> > Dec 23 13:18:47 rhel83-2 
> > pcsd[600350]:INFO:pcs.daemon:SRWT Node: rhel83-2 Request: get_configs
> > Dec 23 13:18:47 rhel83-2 
> > pcsd[600350]:INFO:pcs.daemon:Connecting to:
> > https://192.168.122.140:2224/remote/get_configs?cluster_name=my_cluster
> > Dec 23 13:18:47 rhel83-2 
> > pcsd[600350]:INFO:pcs.daemon:SRWT Node: rhel83-1 Request: get_configs
> > Dec 23 13:18:47 rhel83-2 
> > pcsd[600350]:INFO:pcs.daemon:Connecting to:
> > https://192.168.122.247:2224/remote/get_configs?cluster_name=my_cluster
> > Dec 23 13:18:47 rhel83-2 
> > pcsd[600350]:INFO:pcs.daemon:Config files sync finished
> > #
> >
> > Up to pcs-0.10.4-6.el8, there was no output to syslog.
> >
> > # rpm -q pcs
> > pcs-0.10.4-6.el8.x86_64
> > #
> > # cat /var/log/pcsd/pcsd.log
> > I, [2020-12-23T13:17:36.059 #01200] INFO -- : Starting Daemons
> > I, [2020-12-23T13:17:36.060 #01200] INFO -- : Running:
> > /usr/sbin/pcs cluster start
> > I, [2020-12-23T13:17:36.060 #01200] INFO -- : CIB USER: hacluster,

[ClusterLabs] Q: Starting from pcs-0.10.6-4.el8, logs of pcsd are now output to syslog

2020-12-22 Thread
Hi!

Is it intended behavior that pcsd (pcs-0.10.6-4.el8) outputs logs to syslog?
If it is intended, which change/commit is it due to?

# rpm -q pcs
pcs-0.10.6-4.el8.x86_64
#
# cat /var/log/pcsd/pcsd.log
I, [2020-12-23T13:17:36.748 #00010] INFO -- : Starting Daemons
I, [2020-12-23T13:17:36.749 #00010] INFO -- : Running:
/usr/sbin/pcs cluster start
I, [2020-12-23T13:17:36.749 #00010] INFO -- : CIB USER: hacluster, groups:
I, [2020-12-23T13:17:39.749 #00010] INFO -- : Return Value: 0
I, [2020-12-23T13:17:39.749 #0] INFO -- : 200 GET
/remote/cluster_start (192.168.122.247) 3243.41ms
I, [2020-12-23T13:18:47.049 #0] INFO -- : 200 GET
/remote/get_configs?cluster_name=my_cluster (192.168.122.140) 4.05ms
I, [2020-12-23T13:18:47.060 #00011] INFO -- : Config files sync started
I, [2020-12-23T13:18:47.061 #00011] INFO -- : SRWT Node: rhel83-2
Request: get_configs
I, [2020-12-23T13:18:47.061 #00011] INFO -- : Connecting to:
https://192.168.122.140:2224/remote/get_configs?cluster_name=my_cluster
I, [2020-12-23T13:18:47.061 #00011] INFO -- : SRWT Node: rhel83-1
Request: get_configs
I, [2020-12-23T13:18:47.061 #00011] INFO -- : Connecting to:
https://192.168.122.247:2224/remote/get_configs?cluster_name=my_cluster
I, [2020-12-23T13:18:47.061 #00011] INFO -- : Config files sync finished
#
# grep pcs /var/log/messages
Dec 23 13:17:39 rhel83-2 
pcsd[600350]:INFO:pcs.daemon:Starting Daemons
Dec 23 13:17:39 rhel83-2 
pcsd[600350]:INFO:pcs.daemon:Running: /usr/sbin/pcs cluster start
Dec 23 13:17:39 rhel83-2 
pcsd[600350]:INFO:pcs.daemon:CIB USER: hacluster, groups:
Dec 23 13:17:39 rhel83-2 
pcsd[600350]:INFO:pcs.daemon:Return Value: 0
Dec 23 13:17:39 rhel83-2 
pcsd[600350]:INFO:tornado.access:200 GET /remote/cluster_start
(192.168.122.247) 3243.41ms
Dec 23 13:18:47 rhel83-2 
pcsd[600350]:INFO:tornado.access:200 GET
/remote/get_configs?cluster_name=my_cluster (192.168.122.140) 4.05ms
Dec 23 13:18:47 rhel83-2 
pcsd[600350]:INFO:pcs.daemon:Config files sync started
Dec 23 13:18:47 rhel83-2 
pcsd[600350]:INFO:pcs.daemon:SRWT Node: rhel83-2 Request: get_configs
Dec 23 13:18:47 rhel83-2 
pcsd[600350]:INFO:pcs.daemon:Connecting to:
https://192.168.122.140:2224/remote/get_configs?cluster_name=my_cluster
Dec 23 13:18:47 rhel83-2 
pcsd[600350]:INFO:pcs.daemon:SRWT Node: rhel83-1 Request: get_configs
Dec 23 13:18:47 rhel83-2 
pcsd[600350]:INFO:pcs.daemon:Connecting to:
https://192.168.122.247:2224/remote/get_configs?cluster_name=my_cluster
Dec 23 13:18:47 rhel83-2 
pcsd[600350]:INFO:pcs.daemon:Config files sync finished
#

Up to pcs-0.10.4-6.el8, there was no output to syslog.

# rpm -q pcs
pcs-0.10.4-6.el8.x86_64
#
# cat /var/log/pcsd/pcsd.log
I, [2020-12-23T13:17:36.059 #01200] INFO -- : Starting Daemons
I, [2020-12-23T13:17:36.060 #01200] INFO -- : Running:
/usr/sbin/pcs cluster start
I, [2020-12-23T13:17:36.060 #01200] INFO -- : CIB USER: hacluster, groups:
I, [2020-12-23T13:17:38.060 #01200] INFO -- : Return Value: 0
I, [2020-12-23T13:17:38.060 #0] INFO -- : 200 GET
/remote/cluster_start (192.168.122.247) 1650.38ms
I, [2020-12-23T13:18:46.991 #0] INFO -- : 200 GET
/remote/get_configs?cluster_name=my_cluster (192.168.122.140) 5.06ms
I, [2020-12-23T13:18:58.053 #0] INFO -- : 200 GET
/remote/get_configs?cluster_name=my_cluster (192.168.122.247) 5.74ms
I, [2020-12-23T13:18:58.064 #01201] INFO -- : Config files sync started
I, [2020-12-23T13:18:58.064 #01201] INFO -- : SRWT Node: rhel83-2
Request: get_configs
I, [2020-12-23T13:18:58.064 #01201] INFO -- : Connecting to:
https://192.168.122.140:2224/remote/get_configs?cluster_name=my_cluster
I, [2020-12-23T13:18:58.064 #01201] INFO -- : SRWT Node: rhel83-1
Request: get_configs
I, [2020-12-23T13:18:58.064 #01201] INFO -- : Connecting to:
https://192.168.122.247:2224/remote/get_configs?cluster_name=my_cluster
I, [2020-12-23T13:18:58.065 #01201] INFO -- : Config files sync finished
#
# grep pcs /var/log/messages
#

Regards,
Kazunori INOUE




Re: [ClusterLabs] About the log indicating RA execution

2020-07-02 Thread
Thanks for the comments, everyone.

I'll go over the details and create an improved pull request based on
proposal (2).

Best Regards,
Kazunori INOUE

On Fri, Jul 3, 2020 at 12:09 AM Ken Gaillot  wrote:
>
> On Thu, 2020-07-02 at 18:12 +0900, 井上和徳 wrote:
> > Hi all,
> >
> > We think it is desirable to output the log indicating the start and
> > finish of RA execution to syslog on the same node. (End users
> > monitoring the syslog are requesting that the output be on the same
> > node.)
> >
> > Currently, the start[1] and finish[2] logs may be output by different
> > nodes.
> >
> > Cluster Summary:
> >   * Stack: corosync
> >   * Current DC: r81-2 (version 2.0.4-556cef416) - partition with
> > quorum
> >   * Last updated: Thu Jul  2 12:42:17 2020
> >   * Last change:  Thu Jul  2 12:42:13 2020 by hacluster via crmd on
> > r81-2
> >   * 2 nodes configured
> >   * 3 resource instances configured
> >
> > Node List:
> >   * Online: [ r81-1 r81-2 ]
> >
> > Full List of Resources:
> >   * dummy1  (ocf::pacemaker:Dummy):  Started r81-1
> >   * fence1-ipmilan  (stonith:fence_ipmilan): Started r81-
> > 2
> >   * fence2-ipmilan  (stonith:fence_ipmilan): Started r81-
> > 1
> >
> > *1
> > Jul  2 12:42:15 r81-2 pacemaker-controld[18009]:
> >  notice: Initiating start operation dummy1_start_0 on r81-1
> > *2
> > Jul  2 12:42:15 r81-1 pacemaker-controld[10109]:
> >  notice: Result of start operation for dummy1 on r81-1: ok
>
> Some background for readers who might not be familiar:
>
> The "Initiating" message comes from the Designated Controller (DC)
> node, which decides what actions need to be done, then asks the
> appropriate nodes to do them.
>
> The "Result of" message comes from the node executing the action.
>
> > As a suggestion,
> >
> > 1) change the following log levels to NOTICE and output the start and
> >finish logs to syslog on the node where RA was executed.
> >
> > Jul 02 12:42:15 r81-1 pacemaker-execd [10106] (log_execute)
> >   info: executing - rsc:dummy1 action:start call_id:10
> > Jul 02 12:42:15 r81-1 pacemaker-execd [10106] (log_finished)
> >   info: dummy1 start (call 10, PID 10164) exited with status 0
> > (execution time 91ms, queue time 0ms)
> >
> > 2) alternatively, change the following log levels to NOTICE and
> >output a log indicating the finish at the DC node.
> >
> > Jul 02 12:42:15 r81-2 pacemaker-controld  [18009]
> > (process_graph_event)
> >   info: Transition 2 action 7 (dummy1_start_0 on r81-1) confirmed: ok
> > > rc=0 call-id=10
>
> It's a tricky balance deciding which logs go to syslog (by default
> that's "notice" level and higher) and which go only to the pacemaker
> detail log. Pacemaker logs can already be overwhelming, but clustering
> is inherently complex and lots of information is needed to understand
> problems in depth.
>
> We want enough information in the syslog for the average user to
> understand what the cluster is doing and troubleshoot problems with the
> clustered services, while the detail log helps with more complex
> problems, especially with the cluster software itself.
>
> Of your two recommendations, I like (2) better, because it reports the
> complete round-trip of the whole execution process. If something goes
> wrong on the DC side, (1) might not give any information about it.
>
> However I'd want to reword the message in (2) to be more user-friendly,
> something like:
>
> (in syslog)
> Received result of start operation for dummy1 on r81-1: ok
>
> (in detail log)
> Received result of start operation for dummy1 on r81-1: ok | Transition
> 2 action 7 (dummy1_start_0) rc=0 call-id=10
>
>
> > What do you think about this? (Do you have a better idea?)
> >
> > Best Regards,
> > Kazunori INOUE
> --
> Ken Gaillot 
>


[ClusterLabs] About the log indicating RA execution

2020-07-02 Thread
Hi all,

We think it is desirable to output the log indicating the start and
finish of RA execution to syslog on the same node. (End users
monitoring the syslog are requesting that the output be on the same
node.)

Currently, the start[1] and finish[2] logs may be output by different
nodes.

Cluster Summary:
  * Stack: corosync
  * Current DC: r81-2 (version 2.0.4-556cef416) - partition with quorum
  * Last updated: Thu Jul  2 12:42:17 2020
  * Last change:  Thu Jul  2 12:42:13 2020 by hacluster via crmd on r81-2
  * 2 nodes configured
  * 3 resource instances configured

Node List:
  * Online: [ r81-1 r81-2 ]

Full List of Resources:
  * dummy1  (ocf::pacemaker:Dummy):  Started r81-1
  * fence1-ipmilan  (stonith:fence_ipmilan): Started r81-2
  * fence2-ipmilan  (stonith:fence_ipmilan): Started r81-1

*1
Jul  2 12:42:15 r81-2 pacemaker-controld[18009]:
 notice: Initiating start operation dummy1_start_0 on r81-1
*2
Jul  2 12:42:15 r81-1 pacemaker-controld[10109]:
 notice: Result of start operation for dummy1 on r81-1: ok

As a suggestion,

1) change the following log levels to NOTICE and output the start and
   finish logs to syslog on the node where RA was executed.

Jul 02 12:42:15 r81-1 pacemaker-execd [10106] (log_execute)
  info: executing - rsc:dummy1 action:start call_id:10
Jul 02 12:42:15 r81-1 pacemaker-execd [10106] (log_finished)
  info: dummy1 start (call 10, PID 10164) exited with status 0
(execution time 91ms, queue time 0ms)

2) alternatively, change the following log levels to NOTICE and
   output a log indicating the finish at the DC node.

Jul 02 12:42:15 r81-2 pacemaker-controld  [18009] (process_graph_event)
  info: Transition 2 action 7 (dummy1_start_0 on r81-1) confirmed: ok
| rc=0 call-id=10

What do you think about this? (Do you have a better idea?)

Best Regards,
Kazunori INOUE




Re: [ClusterLabs] [Question] About clufter's Corosync 3 support

2019-07-18 Thread
On Thu, Jul 18, 2019 at 22:42 Jan Pokorný wrote:

> On 18/07/19 18:08 +0900, 井上和徳 wrote:
> > 'pcs config export' fails in RHEL 8.0 environment because clufter
> > does not support Corosync 3 (Camelback).
> > How is the state of progress? https://pagure.io/clufter/issue/4
>
> As much as I'd want to, time budget hasn't been allocated for this,
> pacemaker and other things keep me busy.  I need to review all the
> configuration changes and work on direct and inter (one-way only,
> actually) conversion procedures.
>
> I hope to get to that later this year.
>
> Any consumable form of said analysis would ease that task for me.
>

Thank you for the information.


> > [root@r80-1 ~]# cat /etc/redhat-release
> > Red Hat Enterprise Linux release 8.0 (Ootpa)
> >
> > [root@r80-1 ~]# dnf list | grep -E "corosync|pcs|clufter" | grep @
> > clufter-bin.x86_64        0.77.1-5.el8    @rhel-ha
> > clufter-cli.noarch        0.77.1-5.el8    @rhel-ha
> > clufter-common.noarch     0.77.1-5.el8    @rhel-ha
> > corosync.x86_64           3.0.0-2.el8     @rhel-ha
> > corosynclib.x86_64        3.0.0-2.el8     @rhel-appstream
> > pcs.x86_64                0.10.1-4.el8    @rhel-ha
> > python3-clufter.noarch    0.77.1-5.el8    @rhel-ha
> >
> > [root@r80-1 ~]# pcs config export pcs-commands
> > Error: unable to export cluster configuration: 'pcs2pcscmd-camelback'
> > Try using --interactive to solve the issues manually, --debug to get more
> > information, --force to override.
> >
> > [root@r80-1 ~]# clufter -l | grep pcs2pcscmd
> >   pcs2pcscmd-flatiron  (Corosync/CMAN,Pacemaker) cluster cfg. -> reinstating
> >   pcs2pcscmd-needle    (Corosync v2,Pacemaker) cluster cfg. -> reinstating
> >   pcs2pcscmd   alias for pcs2pcscmd-needle
>
> --
> Jan (Poki)

[ClusterLabs] [Question] About clufter's Corosync 3 support

2019-07-18 Thread
Hi,

'pcs config export' fails in RHEL 8.0 environment because clufter does not
support Corosync 3 (Camelback).
How is the state of progress? https://pagure.io/clufter/issue/4

[root@r80-1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.0 (Ootpa)

[root@r80-1 ~]# dnf list | grep -E "corosync|pcs|clufter" | grep @
clufter-bin.x86_64        0.77.1-5.el8    @rhel-ha
clufter-cli.noarch        0.77.1-5.el8    @rhel-ha
clufter-common.noarch     0.77.1-5.el8    @rhel-ha
corosync.x86_64           3.0.0-2.el8     @rhel-ha
corosynclib.x86_64        3.0.0-2.el8     @rhel-appstream
pcs.x86_64                0.10.1-4.el8    @rhel-ha
python3-clufter.noarch    0.77.1-5.el8    @rhel-ha

[root@r80-1 ~]# pcs config export pcs-commands
Error: unable to export cluster configuration: 'pcs2pcscmd-camelback'
Try using --interactive to solve the issues manually, --debug to get more
information, --force to override.

[root@r80-1 ~]# clufter -l | grep pcs2pcscmd
  pcs2pcscmd-flatiron  (Corosync/CMAN,Pacemaker) cluster cfg. -> reinstating
  pcs2pcscmd-needle    (Corosync v2,Pacemaker) cluster cfg. -> reinstating
  pcs2pcscmd   alias for pcs2pcscmd-needle

Best regards,
Kazunori INOUE

Re: [ClusterLabs] Questions about SBD behavior

2018-06-25 Thread
> -Original Message-
> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
> Sent: Wednesday, June 13, 2018 6:40 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed; 井上 和
> 徳
> Subject: Re: [ClusterLabs] Questions about SBD behavior
> 
> On 06/13/2018 10:58 AM, 井上 和徳 wrote:
> > Thanks for the response.
> >
> > As of v1.3.1 and later, I recognized that real quorum is necessary.
> > I also read this:
> >
> https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker#Watchdog-based_self
> -fencing_with_resource_recovery
> >
> > As related to this specification, in order to use pacemaker-2.0,
> > we are confirming the following known issue.
> >
> > * When SIGSTOP is sent to the pacemaker process, no failure of the
> >   resource will be detected.
> >   https://lists.clusterlabs.org/pipermail/users/2016-September/011146.html
> >   https://lists.clusterlabs.org/pipermail/users/2016-October/011429.html
> >
> >   I expected that it was being handled by SBD, but no one detected
> >   that the following process was frozen. Therefore, no failure of
> >   the resource was detected either.
> >   - pacemaker-based
> >   - pacemaker-execd
> >   - pacemaker-attrd
> >   - pacemaker-schedulerd
> >   - pacemaker-controld
> >
> >   I confirmed this, but I couldn't read about the correspondence
> >   situation.
> >
> https://wiki.clusterlabs.org/w/images/1/1a/Recent_Work_and_Future_Plans_for_SB
> D_1.1.pdf
> You are right. The issue was known as when I created these slides.
> So a plan for improving the observation of the pacemaker-daemons
> should have gone into that probably.
> 

It's good news that there is a plan to improve.
So I registered it as a memorandum in CLBZ:
https://bugs.clusterlabs.org/show_bug.cgi?id=5356

Best Regards

> Thanks for bringing this to the table.
> Guess the issue got a little bit neglected recently.
> 
> >
> > As a result of our discussion, we want SBD to detect it and reset the
> > machine.
> 
> Implementation wise I would go for some kind of a split
> solution between pacemaker & SBD. Thinking of Pacemaker
> observing the sub-daemons by itself while there would be
> some kind of a heartbeat (implicitly via corosync or explicitly)
> between pacemaker & SBD that assures this internal
> observation is doing it's job properly.
> 
> >
> > Also, for users who do not have shared disk or qdevice,
> > we need an option to work even without real quorum.
> > (fence races are going to avoid with delay attribute:
> >  https://access.redhat.com/solutions/91653
> >  https://access.redhat.com/solutions/1293523)
> I'm not sure if I get your point here.
> Watchdog-fencing on a 2-node-cluster without
> additional qdevice or shared disk is like denying
> the laws of physics in my mind.
> At the moment I don't see why auto_tie_breaker
> wouldn't work on a 4-node and up cluster here.
> 
> Regards,
> Klaus
> >
> > Best Regards,
> > Kazunori INOUE
> >
> >> -Original Message-
> >> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Klaus 
> >> Wenninger
> >> Sent: Friday, May 25, 2018 4:08 PM
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] Questions about SBD behavior
> >>
> >> On 05/25/2018 07:31 AM, 井上 和徳 wrote:
> >>> Hi,
> >>>
> >>> I am checking the watchdog function of SBD (without shared block-device).
> >>> In a two-node cluster, if one cluster is stopped, watchdog is triggered 
> >>> on the
> >> remaining node.
> >>> Is this the designed behavior?
> >> SBD without a shared block-device doesn't really make sense on
> >> a two-node cluster.
> >> The basic idea is - e.g. in a case of a networking problem -
> >> that a cluster splits up in a quorate and a non-quorate partition.
> >> The quorate partition stays over while SBD guarantees a
> >> reliable watchdog-based self-fencing of the non-quorate partition
> >> within a defined timeout.
> >> This idea of course doesn't work with just 2 nodes.
> >> Taking quorum info from the 2-node feature of corosync (automatically
> >> switching on wait-for-all) doesn't help in this case but instead
> >> would lead to split-brain.
> >> What you can do - and what e.g. pcs does automatically - is enable
> >> the auto-tie-breaker instead of two-node in corosync. But that
> >> still doesn't give you a higher availability than the one of the
&

Re: [ClusterLabs] Questions about SBD behavior

2018-06-13 Thread
Thanks for the response.

As of v1.3.1 and later, I recognized that real quorum is necessary.
I also read this:
https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker#Watchdog-based_self-fencing_with_resource_recovery

Related to this specification, we are checking the following known issue
in preparation for using pacemaker-2.0.

* When SIGSTOP is sent to the pacemaker process, no failure of the
  resource will be detected.
  https://lists.clusterlabs.org/pipermail/users/2016-September/011146.html
  https://lists.clusterlabs.org/pipermail/users/2016-October/011429.html

  I expected that this would be handled by SBD, but nothing detected
  that the following processes were frozen. Therefore, no failure of
  the resource was detected either.
  - pacemaker-based
  - pacemaker-execd
  - pacemaker-attrd
  - pacemaker-schedulerd
  - pacemaker-controld

  I confirmed this, but I couldn't read about the correspondence
  situation.
  
https://wiki.clusterlabs.org/w/images/1/1a/Recent_Work_and_Future_Plans_for_SBD_1.1.pdf

As a result of our discussion, we want SBD to detect it and reset the
machine.

Also, for users who do not have a shared disk or qdevice,
we need an option that works even without real quorum.
(fence races are avoided with the delay attribute:
 https://access.redhat.com/solutions/91653
 https://access.redhat.com/solutions/1293523)

Best Regards,
Kazunori INOUE

> -Original Message-
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Klaus 
> Wenninger
> Sent: Friday, May 25, 2018 4:08 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Questions about SBD behavior
> 
> On 05/25/2018 07:31 AM, 井上 和徳 wrote:
> > Hi,
> >
> > I am checking the watchdog function of SBD (without shared block-device).
> > In a two-node cluster, if one cluster is stopped, watchdog is triggered on 
> > the
> remaining node.
> > Is this the designed behavior?
> 
> SBD without a shared block-device doesn't really make sense on
> a two-node cluster.
> The basic idea is - e.g. in a case of a networking problem -
> that a cluster splits up in a quorate and a non-quorate partition.
> The quorate partition stays over while SBD guarantees a
> reliable watchdog-based self-fencing of the non-quorate partition
> within a defined timeout.
> This idea of course doesn't work with just 2 nodes.
> Taking quorum info from the 2-node feature of corosync (automatically
> switching on wait-for-all) doesn't help in this case but instead
> would lead to split-brain.
> What you can do - and what e.g. pcs does automatically - is enable
> the auto-tie-breaker instead of two-node in corosync. But that
> still doesn't give you a higher availability than the one of the
> winner of auto-tie-breaker. (Maybe interesting if you are going
> for a load-balancing-scenario that doesn't affect availability or
> for a transient state while setting up a cluster node-by-node ...)
> What you can do though is using qdevice to still have 'real-quorum'
> info with just 2 full cluster-nodes.
> 
> There was quite a lot of discussion round this topic on this
> thread previously if you search the history.
> 
> Regards,
> Klaus


[ClusterLabs] Questions about SBD behavior

2018-05-24 Thread
Hi,

I am checking the watchdog function of SBD (without shared block-device).
In a two-node cluster, if the cluster is stopped on one node, the watchdog is
triggered on the remaining node.
Is this the designed behavior?


[vmrh75b]# cat /etc/corosync/corosync.conf
(snip)
quorum {
provider: corosync_votequorum
two_node: 1
}

[vmrh75b]# cat /etc/sysconfig/sbd
# This file has been generated by pcs.
SBD_DELAY_START=no
## SBD_DEVICE="/dev/vdb1"
SBD_OPTS="-vvv"
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

[vmrh75b]# crm_mon -r1
Stack: corosync
Current DC: vmrh75a (version 2.0.0-0.1.rc4.el7-2.0.0-rc4) - partition with 
quorum
Last updated: Fri May 25 13:36:07 2018
Last change: Fri May 25 13:35:22 2018 by root via cibadmin on vmrh75a

2 nodes configured
0 resources configured

Online: [ vmrh75a vmrh75b ]

No resources

[vmrh75b]# pcs property show
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: my_cluster
 dc-version: 2.0.0-0.1.rc4.el7-2.0.0-rc4
 have-watchdog: true
 stonith-enabled: false

[vmrh75b]# ps -ef | egrep "sbd|coro|pace"
root  2169 1  0 13:34 ?00:00:00 sbd: inquisitor
root  2170  2169  0 13:34 ?00:00:00 sbd: watcher: Pacemaker
root  2171  2169  0 13:34 ?00:00:00 sbd: watcher: Cluster
root  2172 1  0 13:34 ?00:00:00 corosync
root  2179 1  0 13:34 ?00:00:00 /usr/sbin/pacemakerd -f
haclust+  2180  2179  0 13:34 ?00:00:00 
/usr/libexec/pacemaker/pacemaker-based
root  2181  2179  0 13:34 ?00:00:00 
/usr/libexec/pacemaker/pacemaker-fenced
root  2182  2179  0 13:34 ?00:00:00 
/usr/libexec/pacemaker/pacemaker-execd
haclust+  2183  2179  0 13:34 ?00:00:00 
/usr/libexec/pacemaker/pacemaker-attrd
haclust+  2184  2179  0 13:34 ?00:00:00 
/usr/libexec/pacemaker/pacemaker-schedulerd
haclust+  2185  2179  0 13:34 ?00:00:00 
/usr/libexec/pacemaker/pacemaker-controld

[vmrh75b]# pcs cluster stop vmrh75a
vmrh75a: Stopping Cluster (pacemaker)...
vmrh75a: Stopping Cluster (corosync)...

[vmrh75b]# tail -F /var/log/messages
May 25 13:37:00 vmrh75b pacemaker-controld[2185]: notice: Our peer on the DC 
(vmrh75a) is dead
May 25 13:37:00 vmrh75b pacemaker-controld[2185]: notice: State transition 
S_NOT_DC -> S_ELECTION
May 25 13:37:00 vmrh75b pacemaker-controld[2185]: notice: State transition 
S_ELECTION -> S_INTEGRATION
May 25 13:37:00 vmrh75b pacemaker-attrd[2183]: notice: Node vmrh75a state is 
now lost
May 25 13:37:00 vmrh75b pacemaker-attrd[2183]: notice: Removing all vmrh75a 
attributes for peer loss
May 25 13:37:00 vmrh75b pacemaker-attrd[2183]: notice: Lost attribute writer 
vmrh75a
May 25 13:37:00 vmrh75b pacemaker-attrd[2183]: notice: Purged 1 peer with id=1 
and/or uname=vmrh75a from the membership cache
May 25 13:37:00 vmrh75b pacemaker-fenced[2181]: notice: Node vmrh75a state is 
now lost
May 25 13:37:00 vmrh75b pacemaker-fenced[2181]: notice: Purged 1 peer with id=1 
and/or uname=vmrh75a from the membership cache
May 25 13:37:00 vmrh75b pacemaker-based[2180]: notice: Node vmrh75a state is 
now lost
May 25 13:37:00 vmrh75b pacemaker-based[2180]: notice: Purged 1 peer with id=1 
and/or uname=vmrh75a from the membership cache
May 25 13:37:00 vmrh75b pacemaker-controld[2185]: warning: Input I_ELECTION_DC 
received in state S_INTEGRATION from do_election_check
May 25 13:37:01 vmrh75b sbd[2171]:   cluster:  warning: set_servant_health: 
Connected to corosync but requires both nodes present
May 25 13:37:01 vmrh75b sbd[2171]:   cluster:  warning: notify_parent: 
Notifying parent: UNHEALTHY (6)
May 25 13:37:01 vmrh75b sbd[2169]: warning: inquisitor_child: cluster health 
check: UNHEALTHY
May 25 13:37:01 vmrh75b sbd[2169]: warning: inquisitor_child: Servant cluster 
is outdated (age: 226)
May 25 13:37:01 vmrh75b sbd[2170]:  pcmk:   notice: unpack_config: Watchdog 
will be used via SBD if fencing is required
May 25 13:37:01 vmrh75b sbd[2170]:  pcmk: info: 
determine_online_status: Node vmrh75b is online
May 25 13:37:01 vmrh75b sbd[2170]:  pcmk: info: unpack_node_loop: Node 
2 is already processed
May 25 13:37:01 vmrh75b sbd[2170]:  pcmk: info: unpack_node_loop: Node 
2 is already processed
May 25 13:37:01 vmrh75b sbd[2171]:   cluster:  warning: notify_parent: 
Notifying parent: UNHEALTHY (6)
May 25 13:37:01 vmrh75b corosync[2172]: [TOTEM ] A new membership 
(192.168.28.132:5712) was formed. Members left: 1
May 25 13:37:01 vmrh75b corosync[2172]: [QUORUM] Members[1]: 2
May 25 13:37:01 vmrh75b corosync[2172]: [MAIN  ] Completed service 
synchronization, ready to provide service.
May 25 13:37:01 vmrh75b pacemakerd[2179]: notice: Node vmrh75a state is now lost
May 25 13:37:01 vmrh75b pacemaker-controld[2185]: notice: Node vmrh75a state is 
now lost
May 25 13:37:01 vmrh75b pacemaker-controld[2185]: warning: Stonith/shutdown of 
node vmrh75a was not expected
May 25 13:37:02 vmrh75b sbd[2171]:   cluster:  warning: n

Re: [ClusterLabs] PCMK_node_start_state=standby sometimes does not work

2017-12-05 Thread
Hi Ken,

Thank you for your comment. ("cibadmin --empty" is interesting.)

I registered in CLBZ :
https://bugs.clusterlabs.org/show_bug.cgi?id=5331

Best Regards

> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Saturday, December 02, 2017 8:02 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] PCMK_node_start_state=standby sometimes does not 
> work
> 
> On Tue, 2017-11-28 at 09:36 +, 井上 和徳 wrote:
> > Hi,
> >
> > Sometimes a node with 'PCMK_node_start_state=standby' will start up
> > Online.
> >
> > [ reproduction scenario ]
> >  * Set 'PCMK_node_start_state=standby' to /etc/sysconfig/pacemaker.
> >  * Delete cib (/var/lib/pacemaker/cib/*).
> >  * Start pacemaker at the same time on 2 nodes.
> >   # for i in rhel74-1 rhel74-3 ; do ssh -f $i systemctl start
> > pacemaker ; done
> >
> > [ actual result ]
> >  * crm_mon
> >   Stack: corosync
> >   Current DC: rhel74-3 (version 1.1.18-2b07d5c) - partition with
> > quorum
> >   Last change: Wed Nov 22 06:22:50 2017 by hacluster via crmd on
> > rhel74-3
> >
> >   2 nodes configured
> >   0 resources configured
> >
> >   Node rhel74-3: standby
> >   Online: [ rhel74-1 ]
> >
> >  * cib.xml
> >   
> > 
> > 
> >   
> >  > value="on"/>
> >   
> > 
> >   
> >
> >  * pacemaker.log
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: (cib_native.c:462 )
> > warning: cib_native_perform_op_delegate:Call failed: No such
> > device or address
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320
> > )info: update_attr_delegate:Update    > id="3232261507">
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320
> > )info: update_attr_delegate:Update  > es id="nodes-3232261507">
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320
> > )info: update_attr_delegate:Update    > id="nodes-3232261507-standby" name="standby" value="on"/>
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320
> > )info: update_attr_delegate:Update  > tes>
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320
> > )info: update_attr_delegate:Update   
> >
> >  * I attached crm_report to GitHub (too big to attach to this email),
> > so look at it.
> >    https://github.com/inouekazu/pcmk_report/blob/master/pcmk-Wed-22-N
> > ov-2017.tar.bz2
> >
> >
> > I think that the additional timing of *1 and
> > *2 is the cause.
> > *1 '
> > *2 
> >   > value="on"/>
> >
> > I expect to be fixed, but if it's difficult, I have two questions.
> > 1) Does this only occur if there is no cib.xml (in other words, there
> > is no <nodes> element)?
> 
> I believe so. I think this is the key message:
> 
> Nov 22 06:22:50 [20750] rhel74-1cib: ( callbacks.c:1101  )
> warning: cib_process_request:Completed cib_modify operation for
> section nodes: No such device or address (rc=-6, origin=rhel74-
> 1/crmd/12, version=0.3.0)
> 
> PCMK_node_start_state works by setting the "standby" node attribute in
> the CIB. However, it does this via a "modify" command that assumes the
> > <nodes> tag already exists.
> 
> If there is no CIB, pacemaker will quickly create one -- but in this
> case, the node tries to set the attribute before that's happened.
> 
> Hopefully we can come up with a fix. If you want, you can file a bug
> report at bugs.clusterlabs.org, to track the progress.
> 
> > 2) Is there any workaround other than "Do not start at the same
> > time"?
> >
> > Best Regards
> 
> Before starting pacemaker, if /var/lib/pacemaker/cib is empty, you can
> create a skeleton CIB with:
> 
>  cibadmin --empty > /var/lib/pacemaker/cib/cib.xml
> 
> That will include an empty <nodes> tag, and the modify command should
> work when pacemaker starts.
> --
> Ken Gaillot 
> 


[ClusterLabs] PCMK_node_start_state=standby sometimes does not work

2017-11-28 Thread
Hi,

Sometimes a node with 'PCMK_node_start_state=standby' will start up Online.

[ reproduction scenario ]
 * Set 'PCMK_node_start_state=standby' to /etc/sysconfig/pacemaker.
 * Delete cib (/var/lib/pacemaker/cib/*).
 * Start pacemaker at the same time on 2 nodes.
  # for i in rhel74-1 rhel74-3 ; do ssh -f $i systemctl start pacemaker ; done

[ actual result ]
 * crm_mon
  Stack: corosync
  Current DC: rhel74-3 (version 1.1.18-2b07d5c) - partition with quorum
  Last change: Wed Nov 22 06:22:50 2017 by hacluster via crmd on rhel74-3

  2 nodes configured
  0 resources configured

  Node rhel74-3: standby
  Online: [ rhel74-1 ]

 * cib.xml
  


  

  

  

 * pacemaker.log
  Nov 22 06:22:50 [20755] rhel74-1   crmd: (cib_native.c:462 ) warning: 
cib_native_perform_op_delegate: Call failed: No such device or address
  Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320 )info: 
update_attr_delegate:   Update   
  Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320 )info: 
update_attr_delegate:   Update 
  Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320 )info: 
update_attr_delegate:   Update   
  Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320 )info: 
update_attr_delegate:   Update 
  Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320 )info: 
update_attr_delegate:   Update   

 * I attached crm_report to GitHub (too big to attach to this email), so look 
at it.
   
https://github.com/inouekazu/pcmk_report/blob/master/pcmk-Wed-22-Nov-2017.tar.bz2


I think that the additional timing of *1 and 
*2 is the cause.
*1 '
*2 
 

I hope this will be fixed, but if that is difficult, I have two questions.
1) Does this only occur if there is no cib.xml (in other words, there is no 
<nodes> element)?
2) Is there any workaround other than "Do not start at the same time"?

Best Regards



Re: [ClusterLabs] Updated attribute is not displayed in crm_mon

2017-08-17 Thread
I confirmed that the problem was fixed.
Many thanks!

> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Thursday, August 17, 2017 12:25 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Updated attribute is not displayed in crm_mon
> 
> I have a fix for this issue ready. I am running some tests on it, then
> will merge it in the upstream master branch, to become part of the next
> release.
> 
> The fix is to clear the transient attributes from the CIB when attrd
> starts, rather than when the crmd completes its first join. This
> eliminates the window where attributes can be set before the CIB is
> cleared.
> 
> On Tue, 2017-08-15 at 08:42 +, 井上 和徳 wrote:
> > Hi Ken,
> >
> > Thanks for the explanation.
> >
> > As an additional information, we are using Daemon(*1) that registers
> > Corosync's ring status as attributes, so I want to avoid events where
> > attributes are not displayed.
> >
> > *1 It's a ifcheckd that always running, not a resource. and registers
> >attributes when Pacemaker is running.
> >( https://github.com/linux-ha-japan/pm_extras/tree/master/tools )
> >Attribute example :
> >
> >Node Attributes:
> >* Node rhel73-1:
> >+ ringnumber_0  : 192.168.101.131 is UP
> >+ ringnumber_1  : 192.168.102.131 is UP
> >* Node rhel73-2:
> >+ ringnumber_0  : 192.168.101.132 is UP
> >+ ringnumber_1  : 192.168.102.132 is UP
> >
> > Regards,
> > Kazunori INOUE
> >
> > > -Original Message-
> > > From: Ken Gaillot [mailto:kgail...@redhat.com]
> > > Sent: Tuesday, August 15, 2017 2:42 AM
> > > To: Cluster Labs - All topics related to open-source clustering welcomed
> > > Subject: Re: [ClusterLabs] Updated attribute is not displayed in crm_mon
> > >
> > > On Mon, 2017-08-14 at 12:33 -0500, Ken Gaillot wrote:
> > > > On Wed, 2017-08-02 at 09:59 +, 井上 和徳 wrote:
> > > > > Hi,
> > > > >
> > > > > In Pacemaker-1.1.17, the attribute updated while starting pacemaker 
> > > > > is not displayed in crm_mon.
> > > > > In Pacemaker-1.1.16, it is displayed and results are different.
> > > > >
> > > > > https://github.com/ClusterLabs/pacemaker/commit/fe44f400a3116a158ab331a92a49a4ad8937170d
> > > > > This commit is the cause, but the following result (3.) is expected 
> > > > > behavior?
> > > >
> > > > This turned out to be an odd one. The sequence of events is:
> > > >
> > > > 1. When the node leaves the cluster, the DC (correctly) wipes all its
> > > > transient attributes from attrd and the CIB.
> > > >
> > > > 2. Pacemaker is newly started on the node, and a transient attribute is
> > > > set before the node joins the cluster.
> > > >
> > > > 3. The node joins the cluster, and its transient attributes (including
> > > > the new value) are sync'ed with the rest of the cluster, in both attrd
> > > > and the CIB. So far, so good.
> > > >
> > > > 4. Because this is the node's first join since its crmd started, its
> > > > crmd wipes all of its transient attributes again. The idea is that the
> > > > node may have restarted so quickly that the DC hasn't yet done it (step
> > > > 1 here), so clear them now to avoid any problems with old values.
> > > > However, the crmd wipes only the CIB -- not attrd (arguably a bug).
> > >
> > > Whoops, clarification: the node may have restarted so quickly that
> > > corosync didn't notice it left, so the DC would never have gotten the
> > > "peer lost" message that triggers wiping its transient attributes.
> > >
> > > I suspect the crmd wipes only the CIB in this case because we assumed
> > > attrd would be empty at this point -- missing exactly this case where a
> > > value was set between start-up and first join.
> > >
> > > > 5. With the older pacemaker version, both the joining node and the DC
> > > > would request a full write-out of all values from attrd. Because step 4
> > > > only wiped the CIB, this ends up restoring the new value. With the newer
> > > > pacemaker version, this step is no longer done, so the value winds up
> > > > staying in

Re: [ClusterLabs] Updated attribute is not displayed in crm_mon

2017-08-15 Thread
Hi Ken,

Thanks for the explanation.

As additional information, we are using a daemon (*1) that registers
Corosync's ring status as attributes, so I want to avoid cases where the
attributes are not displayed.

*1 ifcheckd is always running (it is not a resource) and registers the
   attributes while Pacemaker is running.
   ( https://github.com/linux-ha-japan/pm_extras/tree/master/tools )
   Attribute example :

   Node Attributes:
   * Node rhel73-1:
   + ringnumber_0  : 192.168.101.131 is UP
   + ringnumber_1  : 192.168.102.131 is UP
   * Node rhel73-2:
   + ringnumber_0  : 192.168.101.132 is UP
   + ringnumber_1  : 192.168.102.132 is UP

Regards,
Kazunori INOUE

> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Tuesday, August 15, 2017 2:42 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Updated attribute is not displayed in crm_mon
> 
> On Mon, 2017-08-14 at 12:33 -0500, Ken Gaillot wrote:
> > On Wed, 2017-08-02 at 09:59 +, 井上 和徳 wrote:
> > > Hi,
> > >
> > > In Pacemaker-1.1.17, the attribute updated while starting pacemaker is 
> > > not displayed in crm_mon.
> > > In Pacemaker-1.1.16, it is displayed and results are different.
> > >
> > > https://github.com/ClusterLabs/pacemaker/commit/fe44f400a3116a158ab331a92a49a4ad8937170d
> > > This commit is the cause, but the following result (3.) is expected 
> > > behavior?
> >
> > This turned out to be an odd one. The sequence of events is:
> >
> > 1. When the node leaves the cluster, the DC (correctly) wipes all its
> > transient attributes from attrd and the CIB.
> >
> > 2. Pacemaker is newly started on the node, and a transient attribute is
> > set before the node joins the cluster.
> >
> > 3. The node joins the cluster, and its transient attributes (including
> > the new value) are sync'ed with the rest of the cluster, in both attrd
> > and the CIB. So far, so good.
> >
> > 4. Because this is the node's first join since its crmd started, its
> > crmd wipes all of its transient attributes again. The idea is that the
> > node may have restarted so quickly that the DC hasn't yet done it (step
> > 1 here), so clear them now to avoid any problems with old values.
> > However, the crmd wipes only the CIB -- not attrd (arguably a bug).
> 
> Whoops, clarification: the node may have restarted so quickly that
> corosync didn't notice it left, so the DC would never have gotten the
> "peer lost" message that triggers wiping its transient attributes.
> 
> I suspect the crmd wipes only the CIB in this case because we assumed
> attrd would be empty at this point -- missing exactly this case where a
> value was set between start-up and first join.
> 
> > 5. With the older pacemaker version, both the joining node and the DC
> > would request a full write-out of all values from attrd. Because step 4
> > only wiped the CIB, this ends up restoring the new value. With the newer
> > pacemaker version, this step is no longer done, so the value winds up
> > staying in attrd but not in CIB (until the next write-out naturally
> > occurs).
> >
> > I don't have a solution yet, but step 4 is clearly the problem (rather
> > than the new code that skips step 5, which is still a good idea
> > performance-wise). I'll keep working on it.
> >
> > > [test case]
> > > 1. Start pacemaker on two nodes at the same time and update the attribute 
> > > during startup.
> > >In this case, the attribute is displayed in crm_mon.
> > >
> > >[root@node1 ~]# ssh -f node1 'systemctl start pacemaker ; 
> > > attrd_updater -n KEY -U V-1' ; \
> > >ssh -f node3 'systemctl start pacemaker ; 
> > > attrd_updater -n KEY -U V-3'
> > >[root@node1 ~]# crm_mon -QA1
> > >Stack: corosync
> > >Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with 
> > > quorum
> > >
> > >2 nodes configured
> > >0 resources configured
> > >
> > >Online: [ node1 node3 ]
> > >
> > >No active resources
> > >
> > >
> > >Node Attributes:
> > >* Node node1:
> > >+ KEY   : V-1
> > >* Node node3:
> > >+ KEY   : V-3
> > >
> > >
> > > 2. Restart pacemaker on node

[ClusterLabs] Updated attribute is not displayed in crm_mon

2017-08-02 Thread
Hi,

In Pacemaker-1.1.17, an attribute updated while pacemaker is starting is not
displayed in crm_mon.
In Pacemaker-1.1.16 it is displayed, so the results differ.

https://github.com/ClusterLabs/pacemaker/commit/fe44f400a3116a158ab331a92a49a4ad8937170d
This commit is the cause, but is the following result (3.) the expected behavior?

[test case]
1. Start pacemaker on two nodes at the same time and update the attribute 
during startup.
   In this case, the attribute is displayed in crm_mon.

   [root@node1 ~]# ssh -f node1 'systemctl start pacemaker ; attrd_updater -n 
KEY -U V-1' ; \
   ssh -f node3 'systemctl start pacemaker ; attrd_updater -n 
KEY -U V-3'
   [root@node1 ~]# crm_mon -QA1
   Stack: corosync
   Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with quorum

   2 nodes configured
   0 resources configured

   Online: [ node1 node3 ]

   No active resources


   Node Attributes:
   * Node node1:
   + KEY   : V-1
   * Node node3:
   + KEY   : V-3


2. Restart pacemaker on node1, and update the attribute during startup.

   [root@node1 ~]# systemctl stop pacemaker
   [root@node1 ~]# systemctl start pacemaker ; attrd_updater -n KEY -U V-10


3. The attribute is registered in attrd but it is not registered in CIB,
   so the updated attribute is not displayed in crm_mon.

   [root@node1 ~]# attrd_updater -Q -n KEY -A
   name="KEY" host="node3" value="V-3"
   name="KEY" host="node1" value="V-10"

   [root@node1 ~]# crm_mon -QA1
   Stack: corosync
   Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with quorum

   2 nodes configured
   0 resources configured

   Online: [ node1 node3 ]

   No active resources


   Node Attributes:
   * Node node1:
   * Node node3:
   + KEY   : V-3


Best Regards



Re: [ClusterLabs] Node attribute disappears when pacemaker is started

2017-06-13 Thread
Hi Ken,

Thank you for explaining the behavior.
I will look into a procedure that specifies nodeid entries in corosync.conf.
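For example, a nodelist with explicit nodeid entries would look roughly like this
(a sketch only; ring0_addr must be each node's actual address or resolvable
hostname, and the values below are just the node names used in this thread):

nodelist {
    node {
        ring0_addr: rhel73-1
        nodeid: 1
    }
    node {
        ring0_addr: rhel73-2
        nodeid: 2
    }
}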

Regards,
Kazunori INOUE

> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Thursday, June 08, 2017 11:43 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Node attribute disappears when pacemaker is started
> 
> Hi,
> 
> Looking at the incident around May 26 16:40:00, here is what happens:
> 
> You are setting the attribute for rhel73-2 from rhel73-1, while rhel73-2
> is not part of cluster from rhel73-1's point of view.
> 
> The crm shell sets the node attribute for rhel73-2 with a CIB
> modification that starts like this:
> 
> ++ /cib/configuration/nodes:  <node id="rhel73-2" uname="rhel73-2"/>
> 
> Note that the node ID is the same as its name. The CIB accepts the
> change (because you might be adding the proper node later). The crmd
> knows that this is not currently valid:
> 
> May 26 16:39:39 rhel73-1 crmd[2908]:   error: Invalid node id: rhel73-2
> 
> When rhel73-2 joins the cluster, rhel73-1 learns its node ID, and it
> removes the existing (invalid) rhel73-2 entry, including its attributes,
> because it assumes that the entry is for an older node that has been
> removed.
> 
> I believe attributes can be set for a node that's not in the cluster
> only if the node IDs are specified explicitly in corosync.conf.
> 
> You may want to mention the issue to the crm shell developers. It should
> probably at least warn if the node isn't known.
> 
> 
> On 05/31/2017 09:35 PM, 井上 和徳 wrote:
> > Hi Ken,
> >
> > I'm sorry. Attachment size was too large.
> > I attached it to GitHub, so look at it.
> > https://github.com/inouekazu/pcmk_report/blob/master/pcmk-Fri-26-May-2017.tar.bz2
> >
> >> -Original Message-
> >> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >> Sent: Thursday, June 01, 2017 8:43 AM
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] Node attribute disappears when pacemaker is 
> >> started
> >>
> >> On 05/26/2017 03:21 AM, 井上 和徳 wrote:
> >>> Hi Ken,
> >>>
> >>> I got crm_report.
> >>>
> >>> Regards,
> >>> Kazunori INOUE
> >>
> >> I don't think it attached -- my mail client says it's 0 bytes.
> >>
> >>>> -Original Message-
> >>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>>> Sent: Friday, May 26, 2017 4:23 AM
> >>>> To: users@clusterlabs.org
> >>>> Subject: Re: [ClusterLabs] Node attribute disappears when pacemaker is 
> >>>> started
> >>>>
> >>>> On 05/24/2017 05:13 AM, 井上 和徳 wrote:
> >>>>> Hi,
> >>>>>
> >>>>> After loading the node attribute, when I start pacemaker of that node, 
> >>>>> the attribute disappears.
> >>>>>
> >>>>> 1. Start pacemaker on node1.
> >>>>> 2. Load configure containing node attribute of node2.
> >>>>>(I use multicast addresses in corosync, so did not set "nodelist 
> >>>>> {nodeid: }" in corosync.conf.)
> >>>>> 3. Start pacemaker on node2, the node attribute that should have been 
> >>>>> load disappears.
> >>>>>Is this specifications ?
> >>>>
> >>>> Hi,
> >>>>
> >>>> No, this should not happen for a permanent node attribute.
> >>>>
> >>>> Transient node attributes (status-attr in crm shell) are erased when the
> >>>> node starts, so it would be expected in that case.
> >>>>
> >>>> I haven't been able to reproduce this with a permanent node attribute.
> >>>> Can you attach logs from both nodes around the time node2 is started?
> >>>>
> >>>>>
> >>>>> 1.
> >>>>> [root@rhel73-1 ~]# systemctl start corosync;systemctl start pacemaker
> >>>>> [root@rhel73-1 ~]# crm configure show
> >>>>> node 3232261507: rhel73-1
> >>>>> property cib-bootstrap-options: \
> >>>>>   have-watchdog=false \
> >>>>>   dc-version=1.1.17-0.1.rc2.el7-524251c \
> >>>>>   cluster-infrastructure=corosync
> >>>>>
> >>>>> 2.
> >>>>> [root@rhel73-1 ~]# cat rhel73-2.crm
> >>>>> node rhel73-2 \
> >>>>>   utilization capacity="2"

Re: [ClusterLabs] Node attribute disappears when pacemaker is started

2017-05-31 Thread
Hi Ken,

I'm sorry, the attachment was too large.
I have uploaded it to GitHub instead; please take a look:
https://github.com/inouekazu/pcmk_report/blob/master/pcmk-Fri-26-May-2017.tar.bz2

> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Thursday, June 01, 2017 8:43 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Node attribute disappears when pacemaker is started
> 
> On 05/26/2017 03:21 AM, 井上 和徳 wrote:
> > Hi Ken,
> >
> > I got crm_report.
> >
> > Regards,
> > Kazunori INOUE
> 
> I don't think it attached -- my mail client says it's 0 bytes.
> 
> >> -Original Message-
> >> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >> Sent: Friday, May 26, 2017 4:23 AM
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] Node attribute disappears when pacemaker is 
> >> started
> >>
> >> On 05/24/2017 05:13 AM, 井上 和徳 wrote:
> >>> Hi,
> >>>
> >>> After loading the node attribute, when I start pacemaker of that node, 
> >>> the attribute disappears.
> >>>
> >>> 1. Start pacemaker on node1.
> >>> 2. Load configure containing node attribute of node2.
> >>>(I use multicast addresses in corosync, so did not set "nodelist 
> >>> {nodeid: }" in corosync.conf.)
> >>> 3. Start pacemaker on node2, the node attribute that should have been 
> >>> load disappears.
> >>>Is this specifications ?
> >>
> >> Hi,
> >>
> >> No, this should not happen for a permanent node attribute.
> >>
> >> Transient node attributes (status-attr in crm shell) are erased when the
> >> node starts, so it would be expected in that case.
> >>
> >> I haven't been able to reproduce this with a permanent node attribute.
> >> Can you attach logs from both nodes around the time node2 is started?
> >>
> >>>
> >>> 1.
> >>> [root@rhel73-1 ~]# systemctl start corosync;systemctl start pacemaker
> >>> [root@rhel73-1 ~]# crm configure show
> >>> node 3232261507: rhel73-1
> >>> property cib-bootstrap-options: \
> >>>   have-watchdog=false \
> >>>   dc-version=1.1.17-0.1.rc2.el7-524251c \
> >>>   cluster-infrastructure=corosync
> >>>
> >>> 2.
> >>> [root@rhel73-1 ~]# cat rhel73-2.crm
> >>> node rhel73-2 \
> >>>   utilization capacity="2" \
> >>>   attributes attrname="attr2"
> >>>
> >>> [root@rhel73-1 ~]# crm configure load update rhel73-2.crm
> >>> [root@rhel73-1 ~]# crm configure show
> >>> node 3232261507: rhel73-1
> >>> node rhel73-2 \
> >>>   utilization capacity=2 \
> >>>   attributes attrname=attr2
> >>> property cib-bootstrap-options: \
> >>>   have-watchdog=false \
> >>>   dc-version=1.1.17-0.1.rc2.el7-524251c \
> >>>   cluster-infrastructure=corosync
> >>>
> >>> 3.
> >>> [root@rhel73-1 ~]# ssh rhel73-2 'systemctl start corosync;systemctl start 
> >>> pacemaker'
> >>> [root@rhel73-1 ~]# crm configure show
> >>> node 3232261507: rhel73-1
> >>> node 3232261508: rhel73-2
> >>> property cib-bootstrap-options: \
> >>>   have-watchdog=false \
> >>>   dc-version=1.1.17-0.1.rc2.el7-524251c \
> >>>   cluster-infrastructure=corosync
> >>>
> >>> Regards,
> >>> Kazunori INOUE
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

2017-05-26 Thread
Hi Ken,

I found the cause.

When fencing is executed, stonithd sends both a result and a notification to crmd.
https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/fencing/remote.c#L402-L406

- when "result" is sent (calling do_local_reply()), too many stonith failures 
is checked in too_many_st_failures().
  
https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L638-L669
- when "notification" is sent (calling do_stonith_notify()), the number of 
failures is incremented in st_fail_count_increment().
  
https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L704-L726
From this, since checking is done before incrementing, the number of failures 
in "Too many failures (10) to fence" log does not match the number of actual 
failures.
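
To illustrate the off-by-one, here is a minimal standalone sketch (plain C, not
Pacemaker code; the names only mirror the functions above):

/*
 * Sketch of the ordering problem: the threshold is read before the counter
 * is bumped, so with a threshold of 10 the warning only appears on the
 * 11th failure.
 */
#include <stdbool.h>
#include <stdio.h>

static int st_failures = 0;            /* stands in for the per-target fail count */
static const int max_attempts = 10;    /* default stonith-max-attempts */

static bool too_many_failures(void)    /* stands in for too_many_st_failures() */
{
    return st_failures >= max_attempts;
}

int main(void)
{
    for (int attempt = 1; attempt <= 11; attempt++) {
        if (too_many_failures()) {     /* check runs first ... */
            printf("attempt %d: Too many failures (%d) to fence\n",
                   attempt, st_failures);
        }
        st_failures++;                 /* ... increment runs second
                                          (stands in for st_fail_count_increment()) */
    }
    return 0;   /* only attempt 11 prints the warning, matching the logs */
}

Reordering the two messages in stonithd, as in the patch below, makes the
increment reach crmd before the check is performed.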

I confirmed that the following change produces the expected result.

# git diff
diff --git a/fencing/remote.c b/fencing/remote.c
index 4a47d49..3ff324e 100644
--- a/fencing/remote.c
+++ b/fencing/remote.c
@@ -399,12 +399,12 @@ handle_local_reply_and_notify(remote_fencing_op_t * op, xmlNode * data, int rc)
     reply = stonith_construct_reply(op->request, NULL, data, rc);
     crm_xml_add(reply, F_STONITH_DELEGATE, op->delegate);
 
-    /* Send fencing OP reply to local client that initiated fencing */
-    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
-
     /* bcast to all local clients that the fencing operation happend */
     do_stonith_notify(0, T_STONITH_NOTIFY_FENCE, rc, notify_data);
 
+    /* Send fencing OP reply to local client that initiated fencing */
+    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
+
     /* mark this op as having notify's already sent */
     op->notify_sent = TRUE;
     free_xml(reply);

Regards,
Kazunori INOUE

> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Wednesday, May 17, 2017 11:09 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not 
> accurate
> 
> On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
> > On 05/17/2017 11:28 AM, 井上 和徳 wrote:
> >> Hi,
> >> I'm testing Pacemaker-1.1.17-rc1.
> >> The number of failures in "Too many failures (10) to fence" log does not 
> >> match the number of actual failures.
> >
> > Well it kind of does as after 10 failures it doesn't try fencing again
> > so that is what
> > failures stay at ;-)
> > Of course it still sees the need to fence but doesn't actually try.
> >
> > Regards,
> > Klaus
> 
> This feature can be a little confusing: it doesn't prevent all further
> fence attempts of the target, just *immediate* fence attempts. Whenever
> the next transition is started for some other reason (a configuration or
> state change, cluster-recheck-interval, node failure, etc.), it will try
> to fence again.
> 
> Also, it only checks this threshold if it's aborting a transition
> *because* of this fence failure. If it's aborting the transition for
> some other reason, the number can go higher than the threshold. That's
> what I'm guessing happened here.
> 
> >> After the 11th time fence failure, "Too many failures (10) to fence" is 
> >> output.
> >> Incidentally, stonith-max-attempts has not been set, so it is 10 by 
> >> default..
> >>
> >> [root@x3650f log]# egrep "Requesting fencing|error: Operation 
> >> reboot|Stonith failed|Too many failures"
> >> ##Requesting fencing : 1st time
> >> May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) 
> >> of node rhel73-2
> >> May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> >> rhel73-2 by rhel73-1 for
> crmd.5269@rhel73-1.8415167d: No data available
> >> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> >> failed
> >> ## 2nd time
> >> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) 
> >> of node rhel73-2
> >> May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> >> rhel73-2 by rhel73-1 for
> crmd.5269@rhel73-1.53d3592a: No data available
> >> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> >> failed
> >> ## 3rd time
> >> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) 
> >> of node rhel73-2
> >> May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operati

Re: [ClusterLabs] Node attribute disappears when pacemaker is started

2017-05-26 Thread
Hi Ken,

I got the crm_report (attached).

Regards,
Kazunori INOUE

> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Friday, May 26, 2017 4:23 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Node attribute disappears when pacemaker is started
> 
> On 05/24/2017 05:13 AM, 井上 和徳 wrote:
> > Hi,
> >
> > After loading the node attribute, when I start pacemaker of that node, the 
> > attribute disappears.
> >
> > 1. Start pacemaker on node1.
> > 2. Load configure containing node attribute of node2.
> >(I use multicast addresses in corosync, so did not set "nodelist 
> > {nodeid: }" in corosync.conf.)
> > 3. Start pacemaker on node2, the node attribute that should have been load 
> > disappears.
> >Is this specifications ?
> 
> Hi,
> 
> No, this should not happen for a permanent node attribute.
> 
> Transient node attributes (status-attr in crm shell) are erased when the
> node starts, so it would be expected in that case.
> 
> I haven't been able to reproduce this with a permanent node attribute.
> Can you attach logs from both nodes around the time node2 is started?
> 
> >
> > 1.
> > [root@rhel73-1 ~]# systemctl start corosync;systemctl start pacemaker
> > [root@rhel73-1 ~]# crm configure show
> > node 3232261507: rhel73-1
> > property cib-bootstrap-options: \
> >   have-watchdog=false \
> >   dc-version=1.1.17-0.1.rc2.el7-524251c \
> >   cluster-infrastructure=corosync
> >
> > 2.
> > [root@rhel73-1 ~]# cat rhel73-2.crm
> > node rhel73-2 \
> >   utilization capacity="2" \
> >   attributes attrname="attr2"
> >
> > [root@rhel73-1 ~]# crm configure load update rhel73-2.crm
> > [root@rhel73-1 ~]# crm configure show
> > node 3232261507: rhel73-1
> > node rhel73-2 \
> >   utilization capacity=2 \
> >   attributes attrname=attr2
> > property cib-bootstrap-options: \
> >   have-watchdog=false \
> >   dc-version=1.1.17-0.1.rc2.el7-524251c \
> >   cluster-infrastructure=corosync
> >
> > 3.
> > [root@rhel73-1 ~]# ssh rhel73-2 'systemctl start corosync;systemctl start 
> > pacemaker'
> > [root@rhel73-1 ~]# crm configure show
> > node 3232261507: rhel73-1
> > node 3232261508: rhel73-2
> > property cib-bootstrap-options: \
> >   have-watchdog=false \
> >   dc-version=1.1.17-0.1.rc2.el7-524251c \
> >   cluster-infrastructure=corosync
> >
> > Regards,
> > Kazunori INOUE
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


Attachment: pcmk-Fri-26-May-2017.tar.bz2
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Node attribute disappears when pacemaker is started

2017-05-24 Thread
Hi,

After I load a node attribute and then start pacemaker on that node, the 
attribute disappears.

1. Start pacemaker on node1.
2. Load a configuration containing a node attribute for node2.
   (I use multicast addresses in corosync, so I did not set "nodelist {nodeid: }" 
in corosync.conf.)
3. Start pacemaker on node2; the node attribute that was just loaded disappears.
   Is this the expected behavior?

1.
[root@rhel73-1 ~]# systemctl start corosync;systemctl start pacemaker
[root@rhel73-1 ~]# crm configure show
node 3232261507: rhel73-1
property cib-bootstrap-options: \
  have-watchdog=false \
  dc-version=1.1.17-0.1.rc2.el7-524251c \
  cluster-infrastructure=corosync

2.
[root@rhel73-1 ~]# cat rhel73-2.crm
node rhel73-2 \
  utilization capacity="2" \
  attributes attrname="attr2"

[root@rhel73-1 ~]# crm configure load update rhel73-2.crm
[root@rhel73-1 ~]# crm configure show
node 3232261507: rhel73-1
node rhel73-2 \
  utilization capacity=2 \
  attributes attrname=attr2
property cib-bootstrap-options: \
  have-watchdog=false \
  dc-version=1.1.17-0.1.rc2.el7-524251c \
  cluster-infrastructure=corosync

3.
[root@rhel73-1 ~]# ssh rhel73-2 'systemctl start corosync;systemctl start pacemaker'
[root@rhel73-1 ~]# crm configure show
node 3232261507: rhel73-1
node 3232261508: rhel73-2
property cib-bootstrap-options: \
  have-watchdog=false \
  dc-version=1.1.17-0.1.rc2.el7-524251c \
  cluster-infrastructure=corosync
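
For reference, the numeric IDs shown above (e.g. 3232261507) are the node IDs
that corosync auto-assigns from the ring address when no nodelist/nodeid is
configured; they can be inspected at runtime, roughly like this (a sketch; the
exact cmap key names vary by corosync version):

[root@rhel73-1 ~]# corosync-cmapctl | grep -i members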

Regards,
Kazunori INOUE

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

2017-05-18 Thread
Hi Ken,
Thank you for your comment.
I'll check the behavior with that in mind.
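
As a rough plan for that check (a sketch only; the log path is an assumption,
the property is set with the crm shell used elsewhere in this thread, and the
grep patterns match the messages quoted below):

crm configure property stonith-max-attempts=3
grep -c "Requesting fencing (reboot) of node rhel73-2" /var/log/messages
grep "Too many failures" /var/log/messages

Comparing the first count with the warnings from the second should show whether
the count can exceed the threshold, as you describe.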

> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Wednesday, May 17, 2017 11:09 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not 
> accurate
> 
> On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
> > On 05/17/2017 11:28 AM, 井上 和徳 wrote:
> >> Hi,
> >> I'm testing Pacemaker-1.1.17-rc1.
> >> The number of failures in "Too many failures (10) to fence" log does not 
> >> match the number of actual failures.
> >
> > Well it kind of does as after 10 failures it doesn't try fencing again
> > so that is what
> > failures stay at ;-)
> > Of course it still sees the need to fence but doesn't actually try.
> >
> > Regards,
> > Klaus
> 
> This feature can be a little confusing: it doesn't prevent all further
> fence attempts of the target, just *immediate* fence attempts. Whenever
> the next transition is started for some other reason (a configuration or
> state change, cluster-recheck-interval, node failure, etc.), it will try
> to fence again.
> 
> Also, it only checks this threshold if it's aborting a transition
> *because* of this fence failure. If it's aborting the transition for
> some other reason, the number can go higher than the threshold. That's
> what I'm guessing happened here.
> 
> >> After the 11th time fence failure, "Too many failures (10) to fence" is 
> >> output.
> >> Incidentally, stonith-max-attempts has not been set, so it is 10 by 
> >> default..
> >>
> >> [root@x3650f log]# egrep "Requesting fencing|error: Operation 
> >> reboot|Stonith failed|Too many failures"
> >> ##Requesting fencing : 1st time
> >> May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) 
> >> of node rhel73-2
> >> May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> >> rhel73-2 by rhel73-1 for
> crmd.5269@rhel73-1.8415167d: No data available
> >> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> >> failed
> >> ## 2nd time
> >> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) 
> >> of node rhel73-2
> >> May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> >> rhel73-2 by rhel73-1 for
> crmd.5269@rhel73-1.53d3592a: No data available
> >> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> >> failed
> >> ## 3rd time
> >> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) 
> >> of node rhel73-2
> >> May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> >> rhel73-2 by rhel73-1 for
> crmd.5269@rhel73-1.9177cb76: No data available
> >> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> >> failed
> >> ## 4th time
> >> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) 
> >> of node rhel73-2
> >> May 12 05:56:05 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> >> rhel73-2 by rhel73-1 for
> crmd.5269@rhel73-1.946531cb: No data available
> >> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> >> failed
> >> ## 5th time
> >> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) 
> >> of node rhel73-2
> >> May 12 05:57:10 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> >> rhel73-2 by rhel73-1 for
> crmd.5269@rhel73-1.278b3c4b: No data available
> >> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> >> failed
> >> ## 6th time
> >> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) 
> >> of node rhel73-2
> >> May 12 05:58:14 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> >> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb:
> No data available
> >> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> >> failed
> >> ## 7th time
> >> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) 
> >> of node rhel73-2
> >> May 12 05:59:19 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> >> rhel73-2 by rhel73-1 for
> crmd.5269@rhel73-1.83421862: No data available
> >> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Transition aborted: 

[ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

2017-05-17 Thread
Hi,
I'm testing Pacemaker-1.1.17-rc1.
The number of failures reported in the "Too many failures (10) to fence" log 
message does not match the actual number of failures.

"Too many failures (10) to fence" is only logged after the 11th fencing failure.
Incidentally, stonith-max-attempts is not set, so it defaults to 10.

[root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
##Requesting fencing : 1st time
May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 2nd time
May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 3rd time
May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 4th time
May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:56:05 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 5th time
May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:57:10 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 6th time
May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:58:14 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 7th time
May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:59:19 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 8th time
May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 06:00:24 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 9th time
May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 06:01:28 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 10th time
May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 06:02:33 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 11th time
May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 06:03:37 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available
May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence 
rhel73-2, giving up
May 12 06:03:37 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed

Regards,
Kazunori INOUE

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org