Re: [ClusterLabs] Pacemaker cluster not working after switching from 1.0 to 1.1

2017-02-09 Thread Ken Gaillot
On 01/16/2017 01:16 PM, Rick Kint wrote:
> 
>> Date: Mon, 16 Jan 2017 09:15:44 -0600
>> From: Ken Gaillot 
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Pacemaker cluster not working after
>> switching from 1.0 to 1.1 (resend as plain text)
>>
>> A preliminary question -- what cluster layer are you running?
>>
>> Pacemaker 1.0 worked with heartbeat or corosync 1, while Ubuntu 14.04
>> ships with corosync 2 by default, IIRC. There were major incompatible
>> changes between corosync 1 and 2, so it's important to get that right
>> before looking at pacemaker.
>>
>> A general note, when making such a big jump in the pacemaker version,
>> I'd recommend running "cibadmin --upgrade" both before exporting 
>> the
>> configuration from 1.0, and again after deploying it on 1.1. This will
>> apply any transformations needed in the CIB syntax. Pacemaker will do
>> this on the fly, but doing it manually lets you see any issues early, as
>> well as being more efficient.
> 
> TL;DR
> - Thanks.
> - Cluster mostly works so I don't think it's a corosync issue.
> - Configuration XML is actually created with crm shell.
> - Is there a summary of changes from 1.0 to 1.1?
> 
> 
> Thanks for the quick reply.
> 
> 
> Corosync is v2.3.3. We've already been through the issues getting corosync 
> working. 
> 
> The cluster works in many ways:
> - Pacemaker sees both nodes.
> - Pacemaker starts all the resources.
> - Pacemaker promotes an instance of the stateful Encryptor resource to 
> Master/active.
> - If the node running the active Encryptor goes down, the standby Encryptor 
> is promoted and the DC changes.
> - Manual failover works (fiddling with the master-score attribute).
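> 
> For reference, the "fiddling" amounts to adjusting the promotion score node
> attribute by hand, roughly as below (resource, node, and attribute names are
> illustrative, not our exact ones):
> 
>   # lower the score on the current master, raise it on the standby
>   crm_attribute -N encryptor5 -n master-EncryptBase -l reboot -v 0
>   crm_attribute -N encryptor4 -n master-EncryptBase -l reboot -v 100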
> 
> The problem is that a failure in one of the dependencies doesn't cause 
> promotion anymore.
> 
> 
> 
> 
> Thanks for the cibadmin command, I missed that when reading the docs.
> 
> I omitted some detail. I didn't export the XML from the old cluster to the 
> new cluster. We create the configuration with the crm shell, not with XML. 
> The sequence of events is
> 
> 
> - install corosync, pacemaker, etc.
> - apply local config file changes.
> - start corosync and pacemaker on both nodes in cluster.
> - verify that cluster is formed (crm_mon shows both nodes online, but no 
> resources).
> - create cluster by running a script which passes a here document to the crm 
> shell (roughly as sketched just after this list).
> - verify that cluster is formed
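> 
> Roughly, the here document looks like this (the resource names and agent
> below are placeholders, not our real configuration):
> 
>   crm configure <<'EOF'
>   property stonith-enabled=false
>   primitive Encryptor ocf:local:encryptor \
>     op monitor interval=10s role=Master \
>     op monitor interval=11s role=Slave
>   ms EncryptClone Encryptor meta master-max=1 clone-max=2 notify=true
>   commit
>   EOF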
> 
> 
> The crm shell version is "1.2.5+hg1034-1ubuntu4". I've checked the XML 
> against the "Pacemaker Configuration Explained" doc and it looks OK to my 
> admittedly non-knowledgeable eye.
> 
> I tried the cibadmin command in hopes that this might tell me something, but 
> it made no changes. "crm_verify --live-check" doesn't complain either.
> I copied the XML from a Pacemaker 1.0.X system to a Pacemaker 1.1.X system 
> and ran "cibadmin --upgrade" on it. Nothing changed there either. 
> 
> 
> 
> Is there a quick summary of changes from 1.0 to 1.1 somewhere? The "Pacemaker 
> 1.1 Configuration Explained" doc has a section entitled "What is new in 1.0" 
> but nothing for 1.1. I wouldn't be surprised if there is something obvious 
> that I'm missing and it would help if I could limit my search space.

No, there's just the change log, which is quite detailed.

There was no defining change from 1.0 to 1.1. Originally, it was planned
that 1.1 would be a "development" branch with new features, and 1.0
would be a "production" branch with bugfixes only. It proved too much
work to maintain two separate branches, so the 1.0 line was ended, and
1.1 became the sole production branch.

> I've done quite a bit of experimentation: changed the syntax of the 
> colocation constraints, added ordering constraints, and fiddled with 
> timeouts. When I was doing the port to Ubuntu, I tested resource agent exit 
> status but I'll go back and check that again. Any other suggestions?
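> 
> One way to re-check the agent's exit codes is ocf-tester from the
> resource-agents package, along these lines (the agent path and parameter are
> placeholders for the real ones):
> 
>   ocf-tester -n EncryptorTest -o some_param=value \
>     /usr/lib/ocf/resource.d/local/encryptor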
> 
> 
> BTW, I've fixed some issues with the Pacemaker init script running on Ubuntu. 
> Should these go to Clusterlabs or the Debian/Ubuntu maintainer?

It depends on whether they're using the init script provided upstream,
or their own (which I suspect is more likely).

> CONFIGURATION
> 
> 
> Here's the configuration again, hopefully with indentation preserved this 
> time:
> 
> 
> 
> [The CIB XML was mangled by the list archive; only fragments of the cluster
> property set (e.g. the cib-bootstrap-options-stonith-enabled nvpair) survive
> here.]

[ClusterLabs] Pacemaker cluster not working after switching from 1.0 to 1.1 (resend as plain text)

2017-01-15 Thread Rick Kint
Sorry about the garbled email. Trying again with plain text.



A working cluster running on Pacemaker 1.0.12 on RHEL5 has been copied with 
minimal modifications to Pacemaker 1.1.10 on Ubuntu 14.04. The version string 
is "1.1.10+git20130802-1ubuntu2.3".

We run simple active/standby two-node clusters. 

There are four resources on each node:
- a stateful resource (Encryptor) representing a process in either active or 
standby mode.
-- this process does not maintain persistent data.
- a clone resource (CredProxy) representing a helper process.
- two clone resources (Ingress, Egress) representing network interfaces.

Colocation constraints require that all three clone resources must be in 
Started role in order for the stateful Encryptor resource to be in Master role.
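
For illustration, in crm shell syntax the constraints have this general shape
(the names here are placeholders, not the real ones):

  colocation encrypt-with-credproxy inf: EncryptClone:Master CredProxyClone
  colocation encrypt-with-ingress   inf: EncryptClone:Master IngressClone
  colocation encrypt-with-egress    inf: EncryptClone:Master EgressClone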

The full configuration is at the end of this message.

The Encryptor resource should fail over on these events:
- active node (i.e. node containing active Encryptor process) goes down
- active Encryptor process goes down and cannot be restarted
- auxiliary CredProxy process on active node goes down and cannot be restarted
- either interface on active node goes down

All of these events trigger failover on the old platform (Pacemaker 1.0 on 
RHEL5).

However, on the new platform (Pacemaker 1.1 on Ubuntu) neither interface 
failure nor auxiliary process failure triggers failover. Pacemaker goes into a 
loop where it starts and stops the active Encryptor resource and never promotes 
the standby Encryptor resource. Cleaning up the failed resource manually and 
issuing "crm_resource --cleanup" clears the jam and the standby Encryptor 
resource is promoted. So does taking the former active node offline completely.
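
The manual cleanup is just something along the lines of the following, with
the real name of the failed clone resource:

  crm_resource --cleanup --resource Ingress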

The pe-input-X.bz2 files show this sequence:

(EncryptBase:1 is active, EncryptBase:0 is standby)

T: pacemaker recognizes that Ingress has failed
transition: recover Ingress on active node

T+1: transition: recover Ingress on active node

T+2: transition: recover Ingress on active node

T+3: transitions: promote EncryptBase:0, demote EncryptBase:1, stop Ingress on 
active node (no-op)

T+4: EncryptBase:1 demoted (both clones are now in slave mode), Ingress stopped
transitions: promote  EncryptBase:0, stop EncryptBase:1

T+5: EncryptBase:1 stopped, EncryptBase:0 still in slave role
transitions: promote EncryptBase:0, start EncryptBase:1

T+6: EncryptBase:1 started (slave role)
transitions: promote EncryptBase:0, stop EncryptBase:1

The last two steps repeat. Although pengine has decided that EncryptBase:0 
should be promoted, Pacemaker keeps stopping and starting EncryptBase:1 (the 
one on the node with the failed interface) without ever promoting EncryptBase:0.
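
The saved pe-input files can be replayed to see the planned actions with
crm_simulate, e.g.:

  crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-15.bz2 -VV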

More precisely, crmd never issues the command that would cause promotion. For a 
normal promotion, I see a sequence like this:

2017-01-12T20:04:39.887154+00:00 encryptor4 pengine[2201]:   notice: 
LogActions: Promote EncryptBase:0  (Slave -> Master encryptor4)
2017-01-12T20:04:39.888018+00:00 encryptor4 pengine[2201]:   notice: 
process_pe_message: Calculated Transition 3: 
/var/lib/pacemaker/pengine/pe-input-3.bz2
2017-01-12T20:04:39.888428+00:00 encryptor4 crmd[2202]:   notice: 
te_rsc_command: Initiating action 9: promote EncryptBase_promote_0 on 
encryptor4 (local)
2017-01-12T20:04:39.903827+00:00 encryptor4 Encryptor_ResourceAgent: INFO: 
Promoting Encryptor.
2017-01-12T20:04:44.959804+00:00 encryptor4 crmd[2202]:   notice: 
process_lrm_event: LRM operation EncryptBase_promote_0 (call=42, rc=0, 
cib-update=43, confirmed=true) ok

in which crmd initiates an action for promotion and the RA logs a message 
indicating that it was called with the arg "promote".

In contrast, the looping sections look like this:

(EncryptBase:1 on encryptor5 is the active/Master instance, EncryptBase:0 on 
encryptor4 is the standby/Slave instance)

2017-01-12T20:12:36.548980+00:00 encryptor4 pengine[2201]:   notice: 
LogActions: Promote EncryptBase:0 (Slave -> Master encryptor4)
2017-01-12T20:12:36.549005+00:00 encryptor4 pengine[2201]:   notice: 
LogActions: Stop EncryptBase:1 (encryptor5)
2017-01-12T20:12:36.550306+00:00 encryptor4 pengine[2201]:   notice: 
process_pe_message: Calculated Transition 15: 
/var/lib/pacemaker/pengine/pe-input-15.bz2
2017-01-12T20:12:36.550958+00:00 encryptor4 crmd[2202]:   notice: 
te_rsc_command: Initiating action 14: stop EncryptBase_stop_0 on encryptor5
2017-01-12T20:12:38.649416+00:00 encryptor4 crmd[2202]:   notice: run_graph: 
Transition 15 (Complete=3, Pending=0, Fired=0, Skipped=4, Incomplete=1, 
Source=/var/lib/pacemaker/pengine/pe-input-15.bz2): Stopped


2017-01-12T20:12:38.655686+00:00 encryptor4 pengine[2201]:   notice: 
LogActions: Promote EncryptBase:0 (Slave -> Master encryptor4)
2017-01-12T20:12:38.655706+00:00 encryptor4 pengine[2201]:   notice: 
LogActions: Start EncryptBase:1 (encryptor5)
2017-01-12T20:12:38.656696+00:00 encryptor4 pengine[2201]:   notice: 
process_pe_message: Calculated Transition 16: 
/var/lib

Re: [ClusterLabs] Pacemaker cluster not working after switching from 1.0 to 1.1 (resend as plain text)

2017-01-16 Thread Ken Gaillot
A preliminary question -- what cluster layer are you running?

Pacemaker 1.0 worked with heartbeat or corosync 1, while Ubuntu 14.04
ships with corosync 2 by default, IIRC. There were major incompatible
changes between corosync 1 and 2, so it's important to get that right
before looking at pacemaker.
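
For example, to confirm what is actually running:

  corosync -v

which prints the corosync version string.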

A general note, when making such a big jump in the pacemaker version,
I'd recommend running "cibadmin --upgrade" both before exporting the
configuration from 1.0, and again after deploying it on 1.1. This will
apply any transformations needed in the CIB syntax. Pacemaker will do
this on the fly, but doing it manually lets you see any issues early, as
well as being more efficient.
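
For example, something along these lines (crm_verify added here as an extra
sanity check):

  # on the 1.0 cluster, before exporting the configuration
  cibadmin --upgrade
  cibadmin --query > cib-export.xml

  # on the 1.1 cluster, after loading the configuration
  cibadmin --upgrade
  crm_verify --live-check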

On 01/16/2017 12:24 AM, Rick Kint wrote:
> Sorry about the garbled email. Trying again with plain text.
> 
> 
> 
> A working cluster running on Pacemaker 1.0.12 on RHEL5 has been copied with 
> minimal modifications to Pacemaker 1.1.10 on Ubuntu 14.04. The version string 
> is "1.1.10+git20130802-1ubuntu2.3".
> 
> We run simple active/standby two-node clusters. 
> 
> There are four resources on each node:
> - a stateful resource (Encryptor) representing a process in either active or 
> standby mode.
> -- this process does not maintain persistent data.
> - a clone resource (CredProxy) representing a helper process.
> - two clone resources (Ingress, Egress) representing network interfaces.
> 
> Colocation constraints require that all three clone resources must be in 
> Started role in order for the stateful Encryptor resource to be in Master 
> role.
> 
> The full configuration is at the end of this message.
> 
> The Encryptor resource should fail over on these events:
> - active node (i.e. node containing active Encryptor process) goes down
> - active Encryptor process goes down and cannot be restarted
> - auxiliary CredProxy process on active node goes down and cannot be restarted
> - either interface on active node goes down
> 
> All of these events trigger failover on the old platform (Pacemaker 1.0 on 
> RHEL5).
> 
> However, on the new platform (Pacemaker 1.1 on Ubuntu) neither interface 
> failure nor auxiliary process failure triggers failover. Pacemaker goes into a 
> loop where it starts and stops the active Encryptor resource and never 
> promotes the standby Encryptor resource. Cleaning up the failed resource 
> manually and issuing "crm_resource --cleanup" clears the jam and the standby 
> Encryptor resource is promoted. So does taking the former active node offline 
> completely.
> 
> The pe-input-X.bz2 files show this sequence:
> 
> (EncryptBase:1 is active, EncryptBase:0 is standby)
> 
> T: pacemaker recognizes that Ingress has failed
> transition: recover Ingress on active node
> 
> T+1: transition: recover Ingress on active node
> 
> T+2: transition: recover Ingress on active node
> 
> T+3: transitions: promote EncryptBase:0, demote EncryptBase:1, stop Ingress 
> on active node (no-op)
> 
> T+4: EncryptBase:1 demoted (both clones are now in slave mode), Ingress 
> stopped
> transitions: promote  EncryptBase:0, stop EncryptBase:1
> 
> T+5: EncryptBase:1 stopped, EncryptBase:0 still in slave role
> transitions: promote EncryptBase:0, start EncryptBase:1
> 
> T+6: EncryptBase:1 started (slave role)
> transitions: promote EncryptBase:0, stop EncryptBase:1
> 
> The last two steps repeat. Although pengine has decided that EncryptBase:0 
> should be promoted, Pacemaker keeps stopping and starting EncryptBase:1 (the 
> one on the node with the failed interface) without ever promoting 
> EncryptBase:0.
> 
> More precisely, crmd never issues the command that would cause promotion. For 
> a normal promotion, I see a sequence like this:
> 
> 2017-01-12T20:04:39.887154+00:00 encryptor4 pengine[2201]:   notice: 
> LogActions: Promote EncryptBase:0  (Slave -> Master encryptor4)
> 2017-01-12T20:04:39.888018+00:00 encryptor4 pengine[2201]:   notice: 
> process_pe_message: Calculated Transition 3: 
> /var/lib/pacemaker/pengine/pe-input-3.bz2
> 2017-01-12T20:04:39.888428+00:00 encryptor4 crmd[2202]:   notice: 
> te_rsc_command: Initiating action 9: promote EncryptBase_promote_0 on 
> encryptor4 (local)
> 2017-01-12T20:04:39.903827+00:00 encryptor4 Encryptor_ResourceAgent: INFO: 
> Promoting Encryptor.
> 2017-01-12T20:04:44.959804+00:00 encryptor4 crmd[2202]:   notice: 
> process_lrm_event: LRM operation EncryptBase_promote_0 (call=42, rc=0, 
> cib-update=43, confirmed=true) ok
> 
> in which crmd initiates an action for promotion and the RA logs a message 
> indicating that it was called with the arg "promote".
> 
> In contrast, the looping sections look like this:
> 
> (EncryptBase:1 on encryptor5 is the active/Master instance, EncryptBase:0 on 
> encryptor4 is the standby/Slave instance)
> 
> 2017-01-12T20:12:36.548980+00:00 encryptor4 pengine[2201]:   notice: 
> LogActions: Promote EncryptBase:0 (Slave -> Master encryptor4)
> 2017-01-12T20:12:36.549005+00:00 encryptor4 pengine[2201]:   notice: 
> LogActions: StopEncryptBase