Re: [ClusterLabs] crm_resource --wait

2017-10-20 Thread Ken Gaillot
I've narrowed down the cause.

When the "standby" transition completes, vm2 has more remaining
utilization capacity than vm1, so the cluster wants to run sv-fencer
there. That should be taken into account in the same transition, but it
isn't, so a second transition is needed to make it happen.

Still investigating a fix. A workaround is to assign some stickiness or
utilization to sv-fencer.
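
A minimal sketch of that workaround with pcs (the values are arbitrary
examples; either option on its own should be enough):

  # give sv-fencer a little stickiness
  pcs resource meta sv-fencer resource-stickiness=1

  # or give it a token utilization so the placement strategy accounts for it
  pcs resource utilization sv-fencer cpu=1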

On Wed, 2017-10-11 at 14:01 +1000, Leon Steffens wrote:
> I've attached two files:
> 314 = after standby step
> 315 = after resource update
> 
> On Wed, Oct 11, 2017 at 12:22 AM, Ken Gaillot wrote:
> > On Tue, 2017-10-10 at 15:19 +1000, Leon Steffens wrote:
> > > Hi Ken,
> > >
> > > I managed to reproduce this on a simplified version of the cluster,
> > > and on Pacemaker 1.1.15, 1.1.16, as well as 1.1.18-rc1
> > 
> > > The steps to create the cluster are:
> > >
> > > pcs property set stonith-enabled=false
> > > pcs property set placement-strategy=balanced
> > >
> > > pcs node utilization vm1 cpu=100
> > > pcs node utilization vm2 cpu=100
> > > pcs node utilization vm3 cpu=100
> > >
> > > pcs property set maintenance-mode=true
> > >
> > > pcs resource create sv-fencer ocf:pacemaker:Dummy
> > >
> > > pcs resource create sv ocf:pacemaker:Dummy clone notify=false
> > > pcs resource create std ocf:pacemaker:Dummy meta resource-stickiness=100
> > >
> > > pcs resource create partition1 ocf:pacemaker:Dummy meta resource-stickiness=100
> > > pcs resource create partition2 ocf:pacemaker:Dummy meta resource-stickiness=100
> > > pcs resource create partition3 ocf:pacemaker:Dummy meta resource-stickiness=100
> > >
> > > pcs resource utilization partition1 cpu=5
> > > pcs resource utilization partition2 cpu=5
> > > pcs resource utilization partition3 cpu=5
> > >
> > > pcs constraint colocation add std with sv-clone INFINITY
> > > pcs constraint colocation add partition1 with sv-clone INFINITY
> > > pcs constraint colocation add partition2 with sv-clone INFINITY
> > > pcs constraint colocation add partition3 with sv-clone INFINITY
> > >
> > > pcs property set maintenance-mode=false
> > >  
> > >
> > > I can then reproduce the issues in the following way:
> > >
> > > $ pcs resource
> > >  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
> > >  Clone Set: sv-clone [sv]
> > >      Started: [ vm1 vm2 vm3 ]
> > >  std    (ocf::pacemaker:Dummy): Started vm2
> > >  partition1     (ocf::pacemaker:Dummy): Started vm3
> > >  partition2     (ocf::pacemaker:Dummy): Started vm1
> > >  partition3     (ocf::pacemaker:Dummy): Started vm2
> > >
> > > $ pcs cluster standby vm3
> > >
> > > # Check that all resources have moved off vm3
> > > $ pcs resource
> > >  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
> > >  Clone Set: sv-clone [sv]
> > >      Started: [ vm1 vm2 ]
> > >      Stopped: [ vm3 ]
> > >  std    (ocf::pacemaker:Dummy): Started vm2
> > >  partition1     (ocf::pacemaker:Dummy): Started vm1
> > >  partition2     (ocf::pacemaker:Dummy): Started vm1
> > >  partition3     (ocf::pacemaker:Dummy): Started vm2
> > 
> > Thanks for the detailed information, this should help me get to the
> > bottom of it. From this description, it sounds like a new transition
> > isn't being triggered when it should.
> > 
> > Could you please attach the DC's pe-input file that is listed in the
> > logs after the standby step above? That would simplify analysis.
> > 
> > > # Wait for any outstanding actions to complete.
> > > $ crm_resource --wait --timeout 300
> > > Pending actions:
> > >         Action 22: sv-fencer_monitor_1      on vm2
> > >         Action 21: sv-fencer_start_0    on vm2
> > >         Action 20: sv-fencer_stop_0     on vm1
> > > Error performing operation: Timer expired
> > >
> > > # Check the resources again - sv-fencer is still on vm1
> > > $ pcs resource
> > >  sv-fencer      (ocf::pacemaker:Dummy): Started vm1
> > >  Clone Set: sv-clone [sv]
> > >      Started: [ vm1 vm2 ]
> > >      Stopped: [ vm3 ]
> > >  std    (ocf::pacemaker:Dummy): Started vm2
> > >  partition1     (ocf::pacemaker:Dummy): Started vm1
> > >  partition2     (ocf::pacemaker:Dummy): Started vm1
> > >  partition3     (ocf::pacemaker:Dummy): Started vm2
> > >
> > > # Perform a random update to the CIB.
> > > $ pcs resource update std op monitor interval=20 timeout=20
> > >
> > > # Check resource status again - sv-fencer has now moved to vm2 (the action crm_resource was waiting for)
> > > $ pcs resource
> > >  sv-fencer      (ocf::pacemaker:Dummy): Started vm2   <<<
> > >  Clone Set: sv-clone [sv]
> > >      Started: [ vm1 vm2 ]
> > >      Stopped: [ vm3 ]
> > >  std    (ocf::pacemaker:Dummy): Started vm2
> > >  partition1     (ocf::pacemaker:Dummy): Started vm1
> > >  partition2     (ocf::pacemaker:Dummy): Started vm1
> > >  partition3     (ocf::pacemaker:Dummy): Started vm2
> > >
> > > I do not get the problem if I:
> > > 1) remove the "std" resource; or
> > > 2) remove 

Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-10-20 Thread Ken Gaillot
On Fri, 2017-10-20 at 15:52 +0200, Ferenc Wágner wrote:
> Ken Gaillot  writes:
> 
> > On Fri, 2017-09-22 at 18:30 +0200, Ferenc Wágner wrote:
> > > Ken Gaillot  writes:
> > > 
> > > > Hmm, stop+reload is definitely a bug. Can you attach (or email it
> > > > to me privately, or file a bz with it attached) the above pe-input
> > > > file with any sensitive info removed?
> > > 
> > > I sent you the pe-input file privately.  It indeed shows the
> > > issue:
> > > 
> > > $ /usr/sbin/crm_simulate -x pe-input-1033.bz2 -RS
> > > [...]
> > > Executing cluster transition:
> > >  * Resource action: vm-alder  stop on vhbl05
> > >  * Resource action: vm-alder  reload on vhbl05
> > > [...]
> > > 
> > > Hope you can easily get to the bottom of this.
> > 
> > This turned out to have the same underlying cause as CLBZ#5309. I have
> > a fix pending review, which I expect to make it into the
> > soon-to-be-released 1.1.18.
> 
> Great!
> 
> > It is a regression introduced in 1.1.15 by commit 2558d76f. The logic
> > for reloads was consolidated in one place, but that happened to be
> > before restarts were scheduled, so it no longer had the right
> > information about whether a restart was needed. Now, it sets an
> > ordering flag that is used later to cancel the reload if the restart
> > becomes required. I've also added a regression test for it.
> 
> Restarts shouldn't even enter the picture here, so I don't get your
> explanation.  But I also don't know the code, so that doesn't mean a
> thing.  I'll test the next RC to be sure.

:-)

Reloads are done in place of restarts, when circumstances allow. So
reloads are always related to (potential) restarts.

The problem arose because not all of the relevant circumstances are
known at the time the reload action is created. We may figure out later
that a resource the reloading resource depends on must be restarted,
therefore the reloading resource must be fully restarted instead of
reloaded. E.g. a database resource might otherwise be able to reload,
but not if the filesystem it's using is going away.

Previously in those cases, we would end up scheduling both the reload
and the restart. Now, we schedule only the restart.
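
Once the fix is in, one way to sanity-check it (a sketch, reusing the
pe-input file name from earlier in this thread) is to replay the saved
transition and confirm only one of the two actions is scheduled:

  # with the fixed scheduler, vm-alder should be scheduled for either a
  # reload or a full stop/start, never both
  crm_simulate -x pe-input-1033.bz2 -RS
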
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 2.4.3 is available at corosync.org!

2017-10-20 Thread Digimer
On 2017-10-20 10:26 AM, Jan Friesse wrote:
> I am pleased to announce the latest maintenance release of Corosync
> 2.4.3 available immediately from our website at
> http://build.clusterlabs.org/corosync/releases/.
> 
> This release contains a lot of fixes. A new feature is support for
> heuristics in qdevice.
> 
> Complete changelog for 2.4.3:
> Adrian Vondendriesch (1):
>   doc: document watchdog_device parameter
> 
> Andrew Price (1):
>   Main: Call mlockall after fork
> 
> Bin Liu (7):
>   Totempg: remove duplicate memcpy in mcast_msg func
>   Qdevice: fix spell errors in qdevice
>   logconfig: Do not overwrite logger_subsys priority
>   totemconfig: Prefer nodelist over bindnetaddr
>   cpghum: Fix printf of size_t variable
>   Qnetd lms: Use UTILS_PRI_RING_ID printf format str
>   wd: Report error when close of wd fails
> 
> Christine Caulfield (6):
>   votequorum: Don't update expected_votes display if value is too high
>   votequorum: simplify reconfigure message handling
>   quorumtool: Add option to show all node addresses
>   main: Don't ask libqb to handle segv, it doesn't work
>   man: Document -a option to corosync-quorumtool
>   main: use syslog & printf directly for early log messages
> 
> Edwin Torok (1):
>   votequorum: make atb consistent on nodelist reload
> 
> Ferenc Wágner (7):
>   Fix typo: Destorying -> Destroying
>   init: Add doc URIs to the systemd service files
>   wd: fix typo
>   corosync.conf.5: Fix watchdog documentation
>   corosync.conf.5: add warning about slow watchdogs
>   wd: remove extra capitalization typo
>   corosync.conf.5: watchdog support is conditional
> 
> Hideo Yamauchi (1):
>   notifyd: Add the community name to an SNMP trap
> 
> Jan Friesse (11):
>   Logsys: Change logsys syslog_priority priority
>   totemrrp: Fix situation when all rings are faulty
>   main: Display reason why cluster cannot be formed
>   totem: Propagate totem initialization failure
>   totemcrypto: Refactor symmetric key importing
>   totemcrypto: Use different method to import key
>   main: Add option to set priority
>   main: Add support for libcgroup
>   totemcrypto: Fix compiler warning
>   cmap: Remove noop highest config version check
>   qdevice: Add support for heuristics
> 
> Jan Pokorný (2):
>   Spec: drop unneeded dependency
>   Spec: make internal dependencies arch-qualified
> 
> Jonathan Davies (1):
>   cmap: don't shutdown highest config_version node
> 
> Kazunori INOUE (1):
>   totemudp: Remove memb_join discarding
> 
> Keisuke MORI (1):
>   Spec: fix arch-qualified dependencies
> 
> Khem Raj (1):
>   Include fcntl.h for F_* and O_* defines
> 
> Masse Nicolas (1):
>   totemudp: Retry if bind fails
> 
> Richard B Winters (1):
>   Remove deprecated doxygen flags
> 
> Takeshi MIZUTA (3):
>   man: Fix typos in man page
>   man: Modify man-page according to command usage
>   Remove redundant header file inclusion
> 
> yuusuke (1):
>   upstart: Add softdog module loading example
> 
> 
> Upgrade is (as usual) highly recommended.
> 
> Thanks/congratulations to all the people who contributed to achieving
> this great milestone.

Thanks to all who helped make this release happen!

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould



[ClusterLabs] Corosync 2.4.3 is available at corosync.org!

2017-10-20 Thread Jan Friesse

I am pleased to announce the latest maintenance release of Corosync
2.4.3 available immediately from our website at
http://build.clusterlabs.org/corosync/releases/.

This release contains a lot of fixes. A new feature is support for
heuristics in qdevice.
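
For illustration only, a corosync.conf fragment showing roughly where the
new heuristics options live (the qnetd host and the exec_* command are
placeholders; see corosync-qdevice(8) for the authoritative option names):

  quorum {
      provider: corosync_votequorum
      device {
          model: net
          net {
              host: qnetd.example.com
          }
          heuristics {
              mode: sync
              # result of this check is reported to qnetd and factored
              # into which partition gets the qdevice vote
              exec_ping: /bin/ping -q -c 1 192.0.2.1
          }
      }
  }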


Complete changelog for 2.4.3:
Adrian Vondendriesch (1):
  doc: document watchdog_device parameter

Andrew Price (1):
  Main: Call mlockall after fork

Bin Liu (7):
  Totempg: remove duplicate memcpy in mcast_msg func
  Qdevice: fix spell errors in qdevice
  logconfig: Do not overwrite logger_subsys priority
  totemconfig: Prefer nodelist over bindnetaddr
  cpghum: Fix printf of size_t variable
  Qnetd lms: Use UTILS_PRI_RING_ID printf format str
  wd: Report error when close of wd fails

Christine Caulfield (6):
  votequorum: Don't update expected_votes display if value is too high
  votequorum: simplify reconfigure message handling
  quorumtool: Add option to show all node addresses
  main: Don't ask libqb to handle segv, it doesn't work
  man: Document -a option to corosync-quorumtool
  main: use syslog & printf directly for early log messages

Edwin Torok (1):
  votequorum: make atb consistent on nodelist reload

Ferenc Wágner (7):
  Fix typo: Destorying -> Destroying
  init: Add doc URIs to the systemd service files
  wd: fix typo
  corosync.conf.5: Fix watchdog documentation
  corosync.conf.5: add warning about slow watchdogs
  wd: remove extra capitalization typo
  corosync.conf.5: watchdog support is conditional

Hideo Yamauchi (1):
  notifyd: Add the community name to an SNMP trap

Jan Friesse (11):
  Logsys: Change logsys syslog_priority priority
  totemrrp: Fix situation when all rings are faulty
  main: Display reason why cluster cannot be formed
  totem: Propagate totem initialization failure
  totemcrypto: Refactor symmetric key importing
  totemcrypto: Use different method to import key
  main: Add option to set priority
  main: Add support for libcgroup
  totemcrypto: Fix compiler warning
  cmap: Remove noop highest config version check
  qdevice: Add support for heuristics

Jan Pokorný (2):
  Spec: drop unneeded dependency
  Spec: make internal dependencies arch-qualified

Jonathan Davies (1):
  cmap: don't shutdown highest config_version node

Kazunori INOUE (1):
  totemudp: Remove memb_join discarding

Keisuke MORI (1):
  Spec: fix arch-qualified dependencies

Khem Raj (1):
  Include fcntl.h for F_* and O_* defines

Masse Nicolas (1):
  totemudp: Retry if bind fails

Richard B Winters (1):
  Remove deprecated doxygen flags

Takeshi MIZUTA (3):
  man: Fix typos in man page
  man: Modify man-page according to command usage
  Remove redundant header file inclusion

yuusuke (1):
  upstart: Add softdog module loading example


Upgrade is (as usual) highly recommended.

Thanks/congratulations to all the people who contributed to achieving
this great milestone.




Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-10-20 Thread Ferenc Wágner
Ken Gaillot  writes:

> On Fri, 2017-09-22 at 18:30 +0200, Ferenc Wágner wrote:
>> Ken Gaillot  writes:
>> 
>>> Hmm, stop+reload is definitely a bug. Can you attach (or email it to
>>> me privately, or file a bz with it attached) the above pe-input file
>>> with any sensitive info removed?
>> 
>> I sent you the pe-input file privately.  It indeed shows the issue:
>> 
>> $ /usr/sbin/crm_simulate -x pe-input-1033.bz2 -RS
>> [...]
>> Executing cluster transition:
>>  * Resource action: vm-alder  stop on vhbl05
>>  * Resource action: vm-alder  reload on vhbl05
>> [...]
>> 
>> Hope you can easily get to the bottom of this.
>
> This turned out to have the same underlying cause as CLBZ#5309. I have
> a fix pending review, which I expect to make it into the soon-to-be-
> released 1.1.18.

Great!

> It is a regression introduced in 1.1.15 by commit 2558d76f. The logic
> for reloads was consolidated in one place, but that happened to be
> before restarts were scheduled, so it no longer had the right
> information about whether a restart was needed. Now, it sets an
> ordering flag that is used later to cancel the reload if the restart
> becomes required. I've also added a regression test for it.

Restarts shouldn't even enter the picture here, so I don't get your
explanation.  But I also don't know the code, so that doesn't mean a
thing.  I'll test the next RC to be sure.
-- 
Thanks,
Feri



[ClusterLabs] corosync continues even if node is removed from the cluster

2017-10-20 Thread Jonathan Davies

Hi ClusterLabs,

I have a query about safely removing a node from a corosync cluster.

When "corosync-cfgtool -R" is issued, it causes all nodes to reload 
their config from corosync.conf. If I have removed a node from the 
nodelist but corosync is still running on that node, it will receive the 
reload signal but will try to continue as if nothing had happened. This 
then causes various problems on all nodes.


A specific example:

I have a running cluster containing two nodes: 10.71.217.70 (nodeid=1) 
and 10.71.217.71 (nodeid=2). When I remove node 1 from the nodelist in 
corosync.conf on both nodes then issue "corosync-cfgtool -R" on 
10.71.217.71, I see this on 10.71.217.70:


  Quorum information
  ------------------
  Date:             Fri Oct 20 13:23:02 2017
  Quorum provider:  corosync_votequorum
  Nodes:            2
  Node ID:          1
  Ring ID:          124
  Quorate:          Yes

  Votequorum information
  ----------------------
  Expected votes:   2
  Highest expected: 2
  Total votes:      2
  Quorum:           2
  Flags:            Quorate AutoTieBreaker

  Membership information
  ----------------------
      Nodeid      Votes Name
           1          1 cluster1 (local)
           2          1 10.71.217.71

and this on 10.71.217.71:

  Quorum information
  ------------------
  Date:             Fri Oct 20 13:22:46 2017
  Quorum provider:  corosync_votequorum
  Nodes:            1
  Node ID:          2
  Ring ID:          132
  Quorate:          No

  Votequorum information
  ----------------------
  Expected votes:   2
  Highest expected: 2
  Total votes:      1
  Quorum:           2 Activity blocked
  Flags:

  Membership information
  ----------------------
      Nodeid      Votes Name
           2          1 10.71.217.71 (local)

Instead, I would expect corosync on node 1 to exit and node 2 to have 
"expected votes: 1, total votes: 1, quorate: yes".


I notice that there is already some logic in votequorum.c that detects 
this condition, and it produces the following log messages on node 1:

  debug   [VOTEQ ] No nodelist defined or our node is not in the nodelist
  crit    [VOTEQ ] configuration error: nodelist or quorum.expected_votes must be configured!
  crit    [VOTEQ ] will continue with current runtime data

What is the rationale for continuing despite the obvious inconsistency? 
Surely this is destined to cause problems...?


I find that I get my expected behaviour with the following patch:

diff --git a/exec/votequorum.c b/exec/votequorum.c
index 1a97c6d..4ff7ff2 100644
--- a/exec/votequorum.c
+++ b/exec/votequorum.c
@@ -1286,7 +1287,8 @@ static char *votequorum_readconfig(int runtime)
                 error = (char *)"configuration error: nodelist or quorum.expected_votes must be configured!";
             } else {
                 log_printf(LOGSYS_LEVEL_CRIT, "configuration error: nodelist or quorum.expected_votes must be configured!");
-                log_printf(LOGSYS_LEVEL_CRIT, "will continue with current runtime data");
+                log_printf(LOGSYS_LEVEL_CRIT, "exiting...");
+                exit(1);
             }
             goto out;
         }

Is there any reason why that would not be a good idea?
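
For reference, a sketch of a removal sequence that sidesteps this state
(assuming the departing node can be stopped before the reload):

  # on the node being removed (node 1), stop corosync first
  systemctl stop corosync

  # then drop it from the nodelist in corosync.conf on the remaining
  # nodes and push the reload
  corosync-cfgtool -R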

Thanks,
Jonathan
