Re: [ClusterLabs] (Live) Migration failure results in a stop operation

2018-02-19 Thread Digimer
On 2018-02-20 12:07 AM, Digimer wrote:
> Hi all,
> 
>   Is there a way to tell pacemaker that, if a migration operation fails,
> to just leave the service on the host node? The service being hosted is
> a VM and a migration failure that triggers a shut down and reboot is
> very disruptive. I'd rather just leave it alone (and let a human fix the
> underlying problem).
> 
> Thanks!
> 

I should mention: I tried setting 'on-fail' for the 'migrate_to' and
'migrate_from' operations:

pcs resource create srv01-c7 ocf:alteeve:server name="srv01-c7" \
meta allow-migrate="true" op monitor interval="60" \
op stop on-fail="block" op migrate_to on-fail="ignore" \
op migrate_from on-fail="ignore" \
meta allow-migrate="true" failure-timeout="75"

 [root@m3-a02n01 ~]# pcs config
Cluster Name: m3-anvil-02
Corosync Nodes:
 m3-a02n01.alteeve.com m3-a02n02.alteeve.com
Pacemaker Nodes:
 m3-a02n01.alteeve.com m3-a02n02.alteeve.com

Resources:
 Clone: hypervisor-clone
  Meta Attrs: clone-max=2 notify=false
  Resource: hypervisor (class=systemd type=libvirtd)
   Operations: monitor interval=60 (hypervisor-monitor-interval-60)
   start interval=0s timeout=100 (hypervisor-start-interval-0s)
   stop interval=0s timeout=100 (hypervisor-stop-interval-0s)
 Resource: srv01-c7 (class=ocf provider=alteeve type=server)
  Attributes: name=srv01-c7
  Meta Attrs: allow-migrate=true failure-timeout=75
  Operations: migrate_from interval=0s on-fail=ignore
(srv01-c7-migrate_from-interval-0s)
  migrate_to interval=0s on-fail=ignore
(srv01-c7-migrate_to-interval-0s)
  monitor interval=60 (srv01-c7-monitor-interval-60)
  start interval=0s timeout=30 (srv01-c7-start-interval-0s)
  stop interval=0s on-fail=block (srv01-c7-stop-interval-0s)

Stonith Devices:
 Resource: virsh_node1 (class=stonith type=fence_virsh)
  Attributes: delay=15 ipaddr=10.255.255.250 login=root passwd="secret"
pcmk_host_list=m3-a02n01.alteeve.com port=m3-a02n01
  Operations: monitor interval=60 (virsh_node1-monitor-interval-60)
 Resource: virsh_node2 (class=stonith type=fence_virsh)
  Attributes: ipaddr=10.255.255.250 login=root passwd="secret"
pcmk_host_list=m3-a02n02.alteeve.com port=m3-a02n02
  Operations: monitor interval=60 (virsh_node2-monitor-interval-60)
Fencing Levels:

Location Constraints:
  Resource: srv01-c7
Enabled on: m3-a02n02.alteeve.com (score:50)
(id:location-srv01-c7-m3-a02n02.alteeve.com-50)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: m3-anvil-02
 dc-version: 1.1.16-12.el7_4.7-94ff4df
 have-watchdog: false
 last-lrm-refresh: 1518584295

Quorum:
  Options:


When I tried to migrate (with the RA set to fail on purpose), I got:

 Node 1
Feb 20 07:06:40 m3-a02n01.alteeve.com crmd[1865]:   notice: Result of
migrate_to operation for srv01-c7 on m3-a02n01.alteeve.com: 1 (unknown
error)
Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3440]: 167;
ocf:alteeve:server invoked.
Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3442]: 1360;
Command line switch: [stop] -> [#!SET!#]


 Node 2
Feb 20 07:05:37 m3-a02n02.alteeve.com crmd[2394]:   notice: State
transition S_TRANSITION_ENGINE -> S_IDLE
Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]:   notice: State
transition S_IDLE -> S_POLICY_ENGINE
Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]:   notice:  *
Migrate    srv01-c7    ( m3-a02n01.alteeve.com ->
m3-a02n02.alteeve.com )
Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]:   notice:
Calculated transition 756, saving inputs in
/var/lib/pacemaker/pengine/pe-input-172.bz2
Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating
migrate_to operation srv01-c7_migrate_to_0 on m3-a02n01.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22
(srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs.
rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22
(srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs.
rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating
migrate_from operation srv01-c7_migrate_from_0 locally on
m3-a02n02.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3396]: 167;
ocf:alteeve:server invoked.
Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3398]: 1360;
Command line switch: [migrate_from] -> [#!SET!#]
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Result of
migrate_from operation for srv01-c7 on m3-a02n02.alteeve.com: 1 (unknown
error)
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 23
(srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed (target: 0 vs.
rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: 
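
For reference, the decision behind that recovery can be replayed offline from
the policy-engine input saved above (a rough sketch; it assumes crm_simulate
is installed and the pe-input file named in the log still exists):

# replay the saved transition and show the actions it scheduled
crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-172.bz2

# the same, with allocation scores included
crm_simulate --simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-input-172.bz2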

[ClusterLabs] (Live) Migration failure results in a stop operation

2018-02-19 Thread Digimer
Hi all,

  Is there a way to tell pacemaker that, if a migration operation fails,
to just leave the service on the host node? The service being hosted is
a VM and a migration failure that triggers a shut down and reboot is
very disruptive. I'd rather just leave it alone (and let a human fix the
underlying problem).

Thanks!

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Monitor being called repeatedly for Master/Slave resource despite monitor failure

2018-02-19 Thread Samarth Jain
Hi,


I have configured a wildfly resource in master/slave mode on a 6-VM cluster
with stonith disabled and no-quorum-policy set to ignore.

We are observing that on either of master or slave resource failure,
pacemaker keeps on calling stateful_monitor for wildfly repeatedly, despite
us returning appropriate failure return codes on monitor failure for both
master (rc=OCF_MASTER_FAILED) and slave (rc=OCF_NOT_RUNNING).

This continues till failure-timeout is reached after which the resource
gets demoted and stopped in case of master monitor failure and stopped in
case of slave monitor failure.
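
While this keeps happening, the climbing fail count is visible with, for
example (assuming the standard Pacemaker command-line tools):

# one-shot status view including resource fail counts
crm_mon -1 -f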

# pacemakerd --version
Pacemaker 1.1.16
Written by Andrew Beekhof

# corosync -v
Corosync Cluster Engine, version '2.4.2'
Copyright (c) 2006-2009 Red Hat, Inc.

Below is my configuration:

node 1: VM-0
node 2: VM-1
node 3: VM-2
node 4: VM-3
node 5: VM-4
node 6: VM-5
primitive stateful_wildfly ocf:pacemaker:wildfly \
op start timeout=200s interval=0 \
op promote timeout=300s interval=0 \
op monitor interval=90s role=Master timeout=90s \
op monitor interval=80s role=Slave timeout=100s \
meta resource-stickiness=100 migration-threshold=3
failure-timeout=240s
ms wildfly_MS stateful_wildfly \
location stateful_wildfly_rule_2 wildfly_MS \
rule -inf: #uname eq VM-2
location stateful_wildfly_rule_3 wildfly_MS \
rule -inf: #uname eq VM-3
location stateful_wildfly_rule_4 wildfly_MS \
rule -inf: #uname eq VM-4
location stateful_wildfly_rule_5 wildfly_MS \
rule -inf: #uname eq VM-5
property cib-bootstrap-options: \
stonith-enabled=false \
no-quorum-policy=ignore \
cluster-recheck-interval=30s \
start-failure-is-fatal=false \
stop-all-resources=false \
have-watchdog=false \
dc-version=1.1.16-94ff4df51a \
cluster-infrastructure=corosync \
cluster-name=hacluster-0

Could you please help us in understanding this behavior and how to fix this?


Thanks!
Samarth J


Re: [ClusterLabs] Monitor being called repeatedly for Master/Slave resource despite monitor returning failure

2018-02-19 Thread Ken Gaillot
On Mon, 2018-02-19 at 16:48 +0530, Pankaj wrote:
> Hi,
> 
> 
> I have configured a wildfly resource in master/slave mode on a 6 VM
> cluster with stonith disabled and no-quorum-policy set to ignore.

To some of us that sounds like "I'm driving a car with no brakes ..."
:-)

Without stonith or quorum, there's a high risk of split-brain. Any node
that gets cut off from the others will start all the resources.
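
If this is meant to be more than a throwaway test setup, turning both back on
is the usual first step. In crm shell syntax that would be roughly the
following (illustrative; stonith-enabled=true only makes sense once a working
fence device is configured, otherwise resources will not start):

crm configure property stonith-enabled=true
crm configure property no-quorum-policy=stop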

> We are observing that on either of master or slave resource failure,
> pacemaker keeps on calling stateful_monitor for wildfly repeatedly,
> despite us returning appropriate failure return codes on monitor
> failure for both master (failure rc=OCF_MASTER_FAILED) and slave
> (failure rc=OCF_NOT_RUNNING).

With your configuration, after the first monitor failure, it should try
to stop the resource, start it again, then monitor it.

One of the nodes at any time is elected the DC. This node will run the
policy engine to make decisions about what needs to be done. The logs
from that node will be most helpful.

Look for the time the failure occurred; once the cluster detects the
failure, there should be a bunch of lines from "pengine" ending in
"Calculated transition" -- these will show what actions were decided.

After that, there will be lines from "crmd" showing "Initiating" and
"Result of" those actions.

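A minimal way to pull those out on the DC (a sketch; the log path varies by
distro, and some setups log to /var/log/cluster/corosync.log instead):

# find the current DC
crm_mon -1 | grep "Current DC"

# then, on that node, around the time of the failure:
grep -E 'pengine.*(warning|error|Calculated transition)' /var/log/messages
grep -E 'crmd.*(Initiating|Result of)' /var/log/messages
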
> This continues till failure-timeout is reached after which the
> resource gets demoted and stopped in case of master monitor failure,
> and stopped in case of slave monitor failure.
> 
> Could you please help me understand:
> Why doesn't pacemaker demote or stop the resource immediately after the
> first failure, instead of repeatedly calling monitor?
> 
> # pacemakerd --version
> Pacemaker 1.1.16
> Written by Andrew Beekhof
> 
> # corosync -v
> Corosync Cluster Engine, version '2.4.2'
> Copyright (c) 2006-2009 Red Hat, Inc.
> 
> Below is my configuration:
> 
> node 1: VM-0
> node 2: VM-1
> node 3: VM-2
> node 4: VM-3
> node 5: VM-4
> node 6: VM-5
> primitive stateful_wildfly ocf:pacemaker:wildfly \
>         op start timeout=200s interval=0 \
>         op promote timeout=300s interval=0 \
>         op monitor interval=90s role=Master timeout=90s \
>         op monitor interval=80s role=Slave timeout=100s \
>         meta resource-stickiness=100 migration-threshold=3 failure-
> timeout=240s
> ms wildfly_MS stateful_wildfly \
> location stateful_wildfly_rule_2 wildfly_MS \
>         rule -inf: #uname eq VM-2
> location stateful_wildfly_rule_3 wildfly_MS \
>         rule -inf: #uname eq VM-3
> location stateful_wildfly_rule_4 wildfly_MS \
>         rule -inf: #uname eq VM-4
> location stateful_wildfly_rule_5 wildfly_MS \
>         rule -inf: #uname eq VM-5
> property cib-bootstrap-options: \
>         stonith-enabled=false \
>         no-quorum-policy=ignore \
>         cluster-recheck-interval=30s \
>         start-failure-is-fatal=false \
>         stop-all-resources=false \
>         have-watchdog=false \
>         dc-version=1.1.16-94ff4df51a \
>         cluster-infrastructure=corosync \
>         cluster-name=hacluster-0
> 
> Could you please help us in understanding this behavior and how to
> fix this?
> 
> Regards,
> Pankaj
-- 
Ken Gaillot 


Re: [ClusterLabs] Antw: Pacemaker 2.0.0-rc1 now available

2018-02-19 Thread Ken Gaillot
On Mon, 2018-02-19 at 10:23 +0100, Ulrich Windl wrote:
> >>> Ken Gaillot  wrote on 16.02.2018 at 22:06 in message
> <1518815166.31176.22.ca...@redhat.com>:
> [...]
> > It is recommended to run "cibadmin --upgrade" (or the equivalent in
> > your higher-level tool of choice) both before and after the
> > upgrade.
> 
> [...]
> Playing with it (older version), I found two possible improvements.
> Consider:
> 
> h01:~ # cibadmin --upgrade
> The supplied command is considered dangerous.  To prevent accidental
> destruction of the cluster, the --force flag is required in order to
> proceed.
> h01:~ # cibadmin --upgrade --force
> Call cib_upgrade failed (-211): Schema is already the latest
> available
> 
> First, cibadmin should check whether the CIB version is up-to-date
> already. If so, there is no need to insist on using --force, and
> secondly if the CIB is already up-to-date, there should not be a
> failure, but a success.

Good point, I'll change it so it prints the following message and exits
0 in such a case:

 Upgrade unnecessary: Schema is already the latest available

Avoiding the need for --force in such a case is a bigger project and
will have to go on the to-do list. (With the current design, we can't
know it's not needed until after we try to upgrade it.)
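
Until then, a quick way to see what the CIB currently validates against, and
what the installed build ships (a sketch; the schema directory may differ by
distribution):

# schema version recorded in the live CIB
cibadmin --query | grep -o 'validate-with="[^"]*"'

# schemas shipped with the installed pacemaker
ls /usr/share/pacemaker/pacemaker-*.rng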

> If a status is needed to detect an out-of-date CIB, a different
> option would be the better solution IMHO.
> 
> Regards,
> Ulrich
-- 
Ken Gaillot 


Re: [ClusterLabs] Antw: Pacemaker 2.0.0-rc1 now available

2018-02-19 Thread Jan Pokorný
On 19/02/18 10:39 +0100, Ulrich Windl wrote:
> >>> Ken Gaillot  wrote on 16.02.2018 at 22:06 in message
> <1518815166.31176.22.ca...@redhat.com>:
> [...]
>> * The master XML tag is deprecated (though still supported) in favor of
> 
> XML guys!
> 
> Everybody is using (and liking?) XML, but please also learn the
> correct names: There are no "tags" in XML (unless we talk about the
> syntax of XML, which is not the case here), only "elements" and
> "attributes".
> 
> "master" element

True, but perhaps there is no need to be so harsh about terms widely
understood as synonyms (from the pre-XML era; remember HTML tutorials?),
especially when the target audience for the XML level of configuration
keeps shifting towards real power users (those pursuing maximum control),
while the wider audience is served by more abstract means such as crm/pcs.

If anything, I would rather have expected someone to point out the
paradox that CIB schema 1.3+ actually does support 'tag' elements ;-)

>> using the standard clone tag with a new "promotable" meta-attribute set
> 
> "clone" element
> 
> ``"promotable" meta-attribute'' --> ``"nvpair" element with name attribute 
> "promotable"''
> 
>> to true. The "master-max" and "master-node-max" master meta-attributes
>> are deprecated in favor of new "promoted-max" and "promoted-node-max"
>> clone meta-attributes. Documentation now refers to these as promotable
>> clones rather than master/slave, stateful or multistate clones.
>> 
>> * The record-pending option now defaults to true, which means pending
>> actions will be shown in status displays.
>> 
>> * Three minor regressions introduced in 1.1.18, and one introduced in
>> 1.1.17, have been fixed.
> 
> I know what you are talking about, but you need quite some
> background to find out that in "master XML tag" the emphasis is not
> on XML (even if in capitals), but on master ;-)

I see; that muddles the message delivery, something I run into all the
time when proof-reading my own sentences (especially since I am not a
native speaker), and I employ various tricks for it, such as the basic
quoting "meta escape" or joining the words with dashes.

Happy Monday

-- 
Poki




Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

2018-02-19 Thread Dileep V Nair

Hello Ondrej,

I am still having issues with my DB2 HADR on Pacemaker. When I do a
db2_kill on Primary for testing, initially it does a restart of DB2 on the
same node. But if I let it run for some days and then try the same test, it
goes into fencing and then reboots the Primary Node.

I am not sure how exactly it should behave in case my DB2 crashes on
Primary.

Also, if I crash Node 1 (the node itself, not only DB2), it
promotes Node 2 to Primary, but once Pacemaker is started again on
Node 1, the DB on Node 1 is also promoted to Primary. Is that expected
behaviour?

   
Regards,

Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya
Bangalore, KA 560045
India

From:    Ondrej Famera
To:      Dileep V Nair
Cc:      Cluster Labs - All topics related to open-source clustering welcomed
Date:    02/12/2018 11:46 AM
Subject: Re: [ClusterLabs] Issues with DB2 HADR Resource Agent



On 02/01/2018 07:24 PM, Dileep V Nair wrote:
> Thanks Ondrej for the response. I have set the PEER_WINDOW to 1000, which
> I guess is a reasonable value. What I am noticing is that it does not wait
> for the PEER_WINDOW. Before that, the DB goes into a
> REMOTE_CATCHUP_PENDING state and Pacemaker gives an error saying a DB in
> STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted.
>
>
> Regards,
>
> *Dileep V Nair*

Hi Dileep,

sorry for the late response. The DB2 should not get into the
'REMOTE_CATCHUP' phase, or the DB2 resource agent will indeed not
promote. From my experience, it usually gets into that state when the DB2
on the standby was restarted during or after the PEER_WINDOW timeout.

When the primary DB2 fails, the standby should end up in a state that
matches the one on line 770 of the DB2 resource agent, and the promote
operation is then attempted.

  770  STANDBY/*PEER/DISCONNECTED|Standby/DisconnectedPeer)

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/db2#L770


The DB2 on standby can get restarted when the 'promote' operation times
out, so you can try increasing the 'promote' timeout to something higher
if this was the case.

So if you see that DB2 was restarted after Primary failed, increase the
promote timeout. If DB2 was not restarted then question is why DB2 has
decided to change the status in this way.
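
As a concrete (made-up) example, if pcs is in use, raising the promote
timeout on the DB2 primitive would look roughly like the line below; the
resource name and the value are placeholders, and with crmsh the same change
can be made by editing the primitive's "op promote" line:

# placeholder resource name; pick a timeout that covers a full HADR takeover
pcs resource update db2_hadr op promote timeout=300s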

Let me know if above helped.

--
Ondrej Faměra
@Red Hat





[ClusterLabs] Antw: Monitor being called repeatedly for Master/Slave resource despite monitor returning failure

2018-02-19 Thread Ulrich Windl
 

>>> Pankaj  wrote on 19.02.2018 at 12:18 in message
:
> Hi,
> 
> 
> I have configured a wildfly resource in master/slave mode on a 6 VM cluster
> with stonith disabled and no-quorum-policy set to ignore.
> 
> We are observing that on either of master or slave resource failure,
> pacemaker keeps on calling stateful_monitor for wildfly repeatedly, despite
> us returning appropriate failure return codes on monitor failure for both
> master (failure rc=OCF_MASTER_FAILED) and slave (failure
> rc=OCF_NOT_RUNNING).
> 
> This continues till failure-timeout is reached after which the resource
> gets demoted and stopped in case of master monitor failure, and stopped in
> case of slave monitor failure.
> 
> Could you please help me understand:
> Why doesn't pacemaker demote or stop the resource immediately after the
> first failure, instead of repeatedly calling monitor?
> 
> # pacemakerd --version
> Pacemaker 1.1.16
> Written by Andrew Beekhof
> 
> # corosync -v
> Corosync Cluster Engine, version '2.4.2'
> Copyright (c) 2006-2009 Red Hat, Inc.
> 
> Below is my configuration:
> 
> node 1: VM-0
> node 2: VM-1
> node 3: VM-2
> node 4: VM-3
> node 5: VM-4
> node 6: VM-5
> primitive stateful_wildfly ocf:pacemaker:wildfly \
> op start timeout=200s interval=0 \
> op promote timeout=300s interval=0 \
> op monitor interval=90s role=Master timeout=90s \
> op monitor interval=80s role=Slave timeout=100s \
> meta resource-stickiness=100 migration-threshold=3
> failure-timeout=240s
> ms wildfly_MS stateful_wildfly \
> location stateful_wildfly_rule_2 wildfly_MS \
> rule -inf: #uname eq VM-2
> location stateful_wildfly_rule_3 wildfly_MS \
> rule -inf: #uname eq VM-3
> location stateful_wildfly_rule_4 wildfly_MS \
> rule -inf: #uname eq VM-4
> location stateful_wildfly_rule_5 wildfly_MS \
> rule -inf: #uname eq VM-5
> property cib-bootstrap-options: \
> stonith-enabled=false \
> no-quorum-policy=ignore \
> cluster-recheck-interval=30s \
> start-failure-is-fatal=false \
> stop-all-resources=false \
> have-watchdog=false \
> dc-version=1.1.16-94ff4df51a \
> cluster-infrastructure=corosync \
> cluster-name=hacluster-0
> 
> Could you please help us in understanding this behavior and how to fix this?

What does the cluster log say?

> 
> Regards,
> Pankaj



[ClusterLabs] Monitor being called repeatedly for Master/Slave resource despite monitor returning failure

2018-02-19 Thread Pankaj
 Hi,


I have configured a wildfly resource in master/slave mode on a 6-VM cluster
with stonith disabled and no-quorum-policy set to ignore.

We are observing that on either of master or slave resource failure,
pacemaker keeps on calling stateful_monitor for wildfly repeatedly, despite
us returning appropriate failure return codes on monitor failure for both
master (failure rc=OCF_MASTER_FAILED) and slave (failure
rc=OCF_NOT_RUNNING).

This continues till failure-timeout is reached after which the resource
gets demoted and stopped in case of master monitor failure, and stopped in
case of slave monitor failure.

Could you please help me understand:
Why doesn't pacemaker demote or stop the resource immediately after the
first failure, instead of repeatedly calling monitor?

# pacemakerd --version
Pacemaker 1.1.16
Written by Andrew Beekhof

# corosync -v
Corosync Cluster Engine, version '2.4.2'
Copyright (c) 2006-2009 Red Hat, Inc.

Below is my configuration:

node 1: VM-0
node 2: VM-1
node 3: VM-2
node 4: VM-3
node 5: VM-4
node 6: VM-5
primitive stateful_wildfly ocf:pacemaker:wildfly \
op start timeout=200s interval=0 \
op promote timeout=300s interval=0 \
op monitor interval=90s role=Master timeout=90s \
op monitor interval=80s role=Slave timeout=100s \
meta resource-stickiness=100 migration-threshold=3
failure-timeout=240s
ms wildfly_MS stateful_wildfly \
location stateful_wildfly_rule_2 wildfly_MS \
rule -inf: #uname eq VM-2
location stateful_wildfly_rule_3 wildfly_MS \
rule -inf: #uname eq VM-3
location stateful_wildfly_rule_4 wildfly_MS \
rule -inf: #uname eq VM-4
location stateful_wildfly_rule_5 wildfly_MS \
rule -inf: #uname eq VM-5
property cib-bootstrap-options: \
stonith-enabled=false \
no-quorum-policy=ignore \
cluster-recheck-interval=30s \
start-failure-is-fatal=false \
stop-all-resources=false \
have-watchdog=false \
dc-version=1.1.16-94ff4df51a \
cluster-infrastructure=corosync \
cluster-name=hacluster-0

Could you please help us in understanding this behavior and how to fix this?

Regards,
Pankaj


[ClusterLabs] Antw: Pacemaker 2.0.0-rc1 now available

2018-02-19 Thread Ulrich Windl


>>> Ken Gaillot  wrote on 16.02.2018 at 22:06 in message
<1518815166.31176.22.ca...@redhat.com>:
[...]
> * The master XML tag is deprecated (though still supported) in favor of

XML guys!

Everybody is using (and liking?) XML, but please also learn the correct names: 
There are no "tags" in XML (unless we talk about the syntax of XML, which is 
not the case here), only "elements" and "attributes".

"master" element

> using the standard clone tag with a new "promotable" meta-attribute set

"clone" element

``"promotable" meta-attribute'' --> ``"nvpair" element with name attribute 
"promotable"''

> to true. The "master-max" and "master-node-max" master meta-attributes
> are deprecated in favor of new "promoted-max" and "promoted-node-max"
> clone meta-attributes. Documentation now refers to these as promotable
> clones rather than master/slave, stateful or multistate clones.
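
Putting the corrected terminology together, the new-style configuration would
presumably look something like this in CIB XML (a sketch; the ids and the
primitive are invented for illustration):

<clone id="my-promotable-clone">
  <meta_attributes id="my-promotable-clone-meta">
    <nvpair id="my-clone-promotable" name="promotable" value="true"/>
    <nvpair id="my-clone-promoted-max" name="promoted-max" value="1"/>
    <nvpair id="my-clone-promoted-node-max" name="promoted-node-max" value="1"/>
  </meta_attributes>
  <primitive id="my-stateful-rsc" class="ocf" provider="pacemaker" type="Stateful"/>
</clone>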
> 
> * The record-pending option now defaults to true, which means pending
> actions will be shown in status displays.
> 
> * Three minor regressions introduced in 1.1.18, and one introduced in
> 1.1.17, have been fixed.

I know what you are talking about, but you need quite some background to find 
out that in "master XML tag" the emphasis is not on XML (even if in capitals), 
but on master ;-)

Regards,
Ulrich




[ClusterLabs] Antw: Pacemaker 2.0.0-rc1 now available

2018-02-19 Thread Ulrich Windl


>>> Ken Gaillot  wrote on 16.02.2018 at 22:06 in message
<1518815166.31176.22.ca...@redhat.com>:
[...]
> It is recommended to run "cibadmin --upgrade" (or the equivalent in
> your higher-level tool of choice) both before and after the upgrade.

[...]
Playing with it (older version), I found two possible improvements. Consider:

h01:~ # cibadmin --upgrade
The supplied command is considered dangerous.  To prevent accidental 
destruction of the cluster, the --force flag is required in order to proceed.
h01:~ # cibadmin --upgrade --force
Call cib_upgrade failed (-211): Schema is already the latest available

First, cibadmin should check whether the CIB version is up-to-date already. If 
so, there is no need to insist on using --force, and secondly if the CIB is 
already up-to-date, there should not be a failure, but a success.

If a status is needed to detect an out-of-date CIB, a different option would 
be the better solution IMHO.

Regards,
Ulrich



Re: [ClusterLabs] Pacemaker 2.0.0-rc1 now available

2018-02-19 Thread Jan Friesse

Ken Gaillot wrote:

On Fri, 2018-02-16 at 15:06 -0600, Ken Gaillot wrote:

Source code for the first release candidate for Pacemaker version 2.0.0
is now available at:

   https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.0.0-rc1

The main goal of the change from Pacemaker 1 to 2 is to drop support
for deprecated legacy usage, in order to make the code base more
maintainable going into the future. As such, this release involves a
net drop of more than 20,000 lines of code!

Rolling (live) upgrades are possible only from Pacemaker 1.1.11 or
later, on top of corosync 2. Other setups can be upgraded with the
cluster stopped.

It is recommended to run "cibadmin --upgrade" (or the equivalent in
your higher-level tool of choice) both before and after the upgrade.

The final 2.0.0 release will automatically transform most of the
dropped older syntax to the newer form. However, this functionality is
not yet complete in rc1.

The most significant changes in this release include:

* Support has been dropped for heartbeat and corosync 1 (whether using
CMAN or plugin), and many legacy aliases for cluster options (including
default-resource-stickiness, which should be set as resource-stickiness
in rsc_defaults instead).

* The default location of the Pacemaker detail log is now
/var/log/pacemaker/pacemaker.log, and Pacemaker will no longer use
Corosync's logging preferences. Options are available in the configure
script to change the default log locations.


Thank you a lot!



* The master XML tag is deprecated (though still supported) in favor of
using the standard clone tag with a new "promotable" meta-attribute set
to true. The "master-max" and "master-node-max" master meta-attributes
are deprecated in favor of new "promoted-max" and "promoted-node-max"
clone meta-attributes. Documentation now refers to these as promotable
clones rather than master/slave, stateful or multistate clones.

* The record-pending option now defaults to true, which means pending
actions will be shown in status displays.

* Three minor regressions introduced in 1.1.18, and one introduced in
1.1.17, have been fixed.

More details are available in the change log:

   https://github.com/ClusterLabs/pacemaker/blob/2.0/ChangeLog

and in a special wiki page for the 2.0 release:

   https://wiki.clusterlabs.org/wiki/Pacemaker_2.0_Changes

Everyone is encouraged to download, compile and test the new release.
We do many regression tests and simulations, but we can't cover all
possible use cases, so your feedback is important and appreciated.

Many thanks to all contributors of source code to this release,
including


Whoops, hit send too soon :)

Andrew Beekhof, Bin Liu, Gao,Yan, Jan Pokorný, and Ken Gaillot


