[ClusterLabs] Troubleshooting Faulty Networks / Heartbeat Rings

2016-10-26 Thread Martin Schlegel
Hello all

On one of our test clusters the network seems to be dropping messages at
different times of the day. We know it is not a network latency issue - we
could prove that via iperf, a network throughput and packet-loss testing
utility.
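
For reference, roughly the kind of test we ran - a sketch from memory, with
hostnames and the port as placeholders:

# on the receiving node:
iperf -s -u -p 5001
# on the sending node - 60s UDP stream with per-second reports incl. loss:
iperf -c <receiver> -u -p 5001 -b 1M -t 60 -i 1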

However, I wish there were more detailed logs than the retransmit log
messages we are seeing. Even with debug enabled in Corosync it was next to
impossible for me to get confirmation from the logs about what is causing
this and how it affects the heartbeat ring.

How can I track the heartbeat ring in action, with time stamps, to first
understand how it operates in detail and finally to tune its configuration
parameters and troubleshoot it adequately?
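
The most promising knobs I have found so far are in the logging section of
corosync.conf - a sketch of what I am experimenting with, based on my reading
of the corosync.conf(5) man page (the TOTEM subsystem filter is an assumption
on my part):

logging {
    timestamp: on                # prefix every log message with a time stamp
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    debug: on                    # very verbose - test clusters only
    logger_subsys {
        subsys: TOTEM            # extra detail for the totem / ring layer
        debug: on
    }
}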

It seems there is little documentation on this topic (besides the source
code). Could somebody please point me to some useful sources of information?


Regards,
Martin Schlegel



Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-07 Thread Martin Schlegel
Thanks for all the responses from Jan, Ulrich and Digimer!

We are already using bonded network interfaces, but we are also forced to go
across IP subnets. Certain routes between routers can go missing - and have
done so.

This has happened on one of our nodes' public network, which became
inaccessible to other local, public IP subnets. If this were to happen in
parallel on another node's private network, the entire cluster would be down,
just because - as Ulrich said, "It's a ring!" - both heartbeat rings would be
marked faulty. That is not an optimal result, because cluster communication
would in fact still be 100% possible between all nodes.

With an increasing number of nodes this risk becomes fairly big - just think
of providers of larger cloud infrastructures.

With the above scenario in mind - is there a better (tested and recommended)
way to configure this?
... or is knet the way to go in the future?


Regards,
Martin Schlegel


> Jan Friesse <jfrie...@redhat.com> wrote on 7 October 2016 at 11:28:
> 
> Martin Schlegel wrote:
> 
> > Thanks for the confirmation Jan, but this sounds a bit scary to me !
> > 
> > Spinning this experiment a bit further ...
> > 
> > Would this not also mean that with a passive rrp with 2 rings it only
> > takes 2 different nodes that are not able to communicate on different
> > networks at the same time to have all rings marked faulty on _every_
> > node ... therefore all cluster members losing quorum immediately even
> > though n-2 cluster members are technically able to send and receive
> > heartbeat messages through all 2 rings?
> 
> Not exactly, but this situation causes corosync to start behaving really
> badly, spending most of its time in a "creating new membership" loop.
> 
> Yes, RRP is simply bad. If you can, use bonding. Improving RRP by
> replacing it with knet is the biggest TODO for 3.x.
> 
> Regards,
>  Honza
> 
> > I really hope the answer is no and the cluster still somehow has a quorum in
> > this case.
> > 
> > Regards,
> > Martin Schlegel
> 
> >> Jan Friesse <jfrie...@redhat.com> wrote on 5 October 2016 at 09:01:
> >>
> >> Martin,
> >>
> >>> Hello all,
> >>>
> > > I am trying to understand the following 2 Corosync heartbeat ring
> > > failure scenarios I have been testing and hope somebody can explain
> > > why this makes any sense.
> >>>
> >>> Consider the following cluster:
> >>>
> >>> * 3x Nodes: A, B and C
> >>> * 2x NICs for each Node
> >>> * Corosync 2.3.5 configured with "rrp_mode: passive" and
> >>> udpu transport with ring id 0 and 1 on each node.
> >>> * On each node "corosync-cfgtool -s" shows:
> >>> [...] ring 0 active with no faults
> >>> [...] ring 1 active with no faults
> >>>
> >>> Consider the following scenarios:
> >>>
> >>> 1. On node A only block all communication on the first NIC configured with
> >>> ring id 0
> >>> 2. On node A only block all communication on all NICs configured with
> >>> ring id 0 and 1
> >>>
> >>> The result of the above scenarios is as follows:
> >>>
> >>> 1. Nodes A, B and C (!) display the following ring status:
> >>> [...] Marking ringid 0 interface  FAULTY
> >>> [...] ring 1 active with no faults
> >>> 2. Node A is shown as OFFLINE - B and C display the following ring status:
> >>> [...] ring 0 active with no faults
> >>> [...] ring 1 active with no faults
> >>>
> >>> Questions:
> >>> 1. Is this the expected outcome?
> >>
> >> Yes
> >>
> >>> 2. In experiment 1, B and C can still communicate with each other over
> >>> both NICs, so why are B and C not displaying a "no faults" status for
> >>> ring id 0 and 1, just like in experiment 2,
> >>
> >> Because this is how RRP works. RRP marks the whole ring as failed, so
> >> every node sees that ring as failed.
> >>
> >>> when node A is completely unreachable?
> >>
> >> Because it's a different scenario. In scenario 1 there is a 3-node
> >> membership where one of the nodes has one failed ring -> the whole ring
> >> is failed. In scenario 2 there is a 2-node membership where both rings
> >> work as expected. Node A is completely unreachable and it's not in the
> >> membership.

Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-06 Thread Martin Schlegel
Thanks for the confirmation Jan, but this sounds a bit scary to me!

Spinning this experiment a bit further ...

Would this not also mean that with a passive rrp with 2 rings it only takes 2
different nodes that are not able to communicate on different networks at the
same time to have all rings marked faulty on _every_ node ... therefore all
cluster members losing quorum immediately, even though n-2 cluster members
are technically able to send and receive heartbeat messages through all 2
rings?

I really hope the answer is no and the cluster still somehow has a quorum in
this case.


Regards,
Martin Schlegel

> Jan Friesse <jfrie...@redhat.com> wrote on 5 October 2016 at 09:01:
> 
> Martin,
> 
> > Hello all,
> > 
> > I am trying to understand the following 2 Corosync heartbeat ring failure
> > scenarios I have been testing and hope somebody can explain why this
> > makes any sense.
> > 
> > Consider the following cluster:
> > 
> >  * 3x Nodes: A, B and C
> >  * 2x NICs for each Node
> >  * Corosync 2.3.5 configured with "rrp_mode: passive" and
> >  udpu transport with ring id 0 and 1 on each node.
> >  * On each node "corosync-cfgtool -s" shows:
> >  [...] ring 0 active with no faults
> >  [...] ring 1 active with no faults
> > 
> > Consider the following scenarios:
> > 
> >  1. On node A only block all communication on the first NIC configured with
> > ring id 0
> >  2. On node A only block all communication on all NICs configured with
> > ring id 0 and 1
> > 
> > The result of the above scenarios is as follows:
> > 
> >  1. Nodes A, B and C (!) display the following ring status:
> >  [...] Marking ringid 0 interface  FAULTY
> >  [...] ring 1 active with no faults
> >  2. Node A is shown as OFFLINE - B and C display the following ring status:
> >  [...] ring 0 active with no faults
> >  [...] ring 1 active with no faults
> > 
> > Questions:
> >  1. Is this the expected outcome?
> 
> Yes
> 
> > 2. In experiment 1, B and C can still communicate with each other over
> > both NICs, so why are B and C not displaying a "no faults" status for
> > ring id 0 and 1, just like in experiment 2,
> 
> Because this is how RRP works. RRP marks the whole ring as failed, so every
> node sees that ring as failed.
> 
> > when node A is completely unreachable?
> 
> Because it's a different scenario. In scenario 1 there is a 3-node
> membership where one of the nodes has one failed ring -> the whole ring is
> failed. In scenario 2 there is a 2-node membership where both rings work
> as expected. Node A is completely unreachable and it's not in the
> membership.
> 
> Regards,
>  Honza
> 
> > Regards,
> > Martin Schlegel
> > 



[ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-04 Thread Martin Schlegel
Hello all,

I am trying to understand the following 2 Corosync heartbeat ring failure
scenarios I have been testing and hope somebody can explain why this makes
any sense.


Consider the following cluster:

* 3x Nodes: A, B and C
* 2x NICs for each Node
* Corosync 2.3.5 configured with "rrp_mode: passive" and 
  udpu transport with ring id 0 and 1 on each node.
* On each node "corosync-cfgtool -s" shows:
[...] ring 0 active with no faults
[...] ring 1 active with no faults


Consider the following scenarios:

1. On node A only, block all communication on the first NIC, configured with
ring id 0 (e.g. via firewall rules, as sketched below)
2. On node A only, block all communication on all NICs, configured with
ring id 0 and 1
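
A sketch of the firewall rules such blocking boils down to - illustrative
only, with eth0/eth1 standing in for the ring 0/1 NICs and 5405 for the
configured mcastport:

# scenario 1 - on node A, drop ring 0 traffic (udpu on port 5405 via eth0):
iptables -A INPUT  -i eth0 -p udp --dport 5405 -j DROP
iptables -A OUTPUT -o eth0 -p udp --dport 5405 -j DROP
# scenario 2 - additionally drop ring 1 traffic (eth1):
iptables -A INPUT  -i eth1 -p udp --dport 5405 -j DROP
iptables -A OUTPUT -o eth1 -p udp --dport 5405 -j DROP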


The result of the above scenarios is as follows:

1. Nodes A, B and C (!) display the following ring status:
[...] Marking ringid 0 interface  FAULTY
[...] ring 1 active with no faults
2. Node A is shown as OFFLINE - B and C display the following ring status:
[...] ring 0 active with no faults
[...] ring 1 active with no faults


Questions:
1. Is this the expected outcome?
2. In experiment 1, B and C can still communicate with each other over both
   NICs, so why are B and C not displaying a "no faults" status for ring id
   0 and 1, just like in experiment 2, when node A is completely unreachable?


Regards,
Martin Schlegel



Re: [ClusterLabs] Users Digest, Vol 18, Issue 42

2016-07-21 Thread Martin Schlegel
On 07/21/2016 08:49 AM, Ulrich Windl wrote:
 Ken Gaillot wrote on 19.07.2016 at 16:17 in message:
> > 
> > [...]
> 
> >> You're right -- if not told otherwise, Pacemaker will query the device
> >> for the target list. In this case, the output of "stonith_admin -l"
> 
> > In sles11 SP4 I see the following (surprising) output:
> > "stonith_admin -l" shows the usage message
> > "stonith_admin -l any" shows the configured devices, independently whether
> > the given name is part of the cluster or no. Even if that host does not
> > exist at all the same list is displayed:
> >  prm_stonith_sbd:0
> >  prm_stonith_sbd
> > 
> > Is that the way it's meant to be?
> 
> This seems to be the behavior you get when you didn't define a
> 'pcmk_host_list' and 'dynamic-list' isn't supported either.
> So the device will probably be used for fencing anything, and it will be
> left to the device to fail then.
> So the answer is not that wrong - it might work - we just can't tell unless
> you try...
> 
> > 
> 
> >> suggests it's not returning the desired information. I'm not familiar
> >> with the external agents, so I don't know why that would be. I
> >> mistakenly assumed it worked similarly to fence_ipmilan ...

Thanks everybody !

It seems to work now. For anybody interested, please see the syslog messages
produced further below.

Yesterday we added "pcmk_host_list=" and
pcmk_host_check=static-list to the primitive resource definitions as shown
below.

__


New STONITH resource definitions:

primitive p_ston_pg1 stonith:external/ipmi \
params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list
ipaddr=10.148.128.35 userid=root
passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG1-ipmipass"
passwd_method=file interface=lan priv=OPERATOR

primitive p_ston_pg2 [...]

primitive p_ston_pg3 [...]
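
To double-check which device now claims each target, the stonith_admin query
discussed earlier can be pointed at a real host - with the static lists above
it should report only the matching device, e.g.:

# expected to list p_ston_pg2 (and not the others) for target pg2:
stonith_admin -l pg2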

___


Syslog messages produced leading up to the fencing operation:

[...]

Jul 20 18:57:06 localhost pengine[5476]:  warning: Node pg2 will be fenced
because the node is no longer part of the cluster
Jul 20 18:57:06 localhost pengine[5476]:  warning: Node pg2 is unclean
Jul 20 18:57:06 localhost pengine[5476]:  warning: Action p_ston_pg1_stop_0 on
pg2 is unrunnable (offline)
Jul 20 18:57:06 localhost pengine[5476]:  warning: Scheduling Node pg2 for
STONITH
Jul 20 18:57:06 localhost pengine[5476]:   notice: Move
   p_ston_pg1#011(Started pg2 -> pg3)
Jul 20 18:57:06 localhost pengine[5476]:  warning: Calculated Transition 0:
/var/lib/pacemaker/pengine/pe-warn-69.bz2
Jul 20 18:57:06 localhost crmd[5477]:   notice: Executing poweroff fencing
operation (49) on pg2 (timeout=6)
Jul 20 18:57:06 localhost crmd[5477]:   notice: Initiating action 3: start
p_ston_pg1_start_0 on pg3
Jul 20 18:57:06 localhost stonith-ng[5473]:   notice: Client crmd.5477.1ea9e005
wants to fence (poweroff) 'pg2' with device '(any)'
Jul 20 18:57:06 localhost stonith-ng[5473]:   notice: Initiating remote
operation poweroff for pg2: e669ca92-8255-4036-a57b-447de0453162 (0)
Jul 20 18:57:06 localhost stonith-ng[5473]:   notice: p_ston_pg3 can not fence
(poweroff) pg2: static-list
Jul 20 18:57:06 localhost stonith-ng[5473]:   notice: p_ston_pg2 can fence
(poweroff) pg2: static-list
Jul 20 18:57:06 localhost stonith-ng[5473]:   notice: p_ston_pg3 can not fence
(poweroff) pg2: static-list
Jul 20 18:57:06 localhost stonith-ng[5473]:   notice: p_ston_pg2 can fence
(poweroff) pg2: static-list
Jul 20 18:57:07 localhost stonith-ng[5473]:   notice: Operation 'poweroff'
[29810] (call 2 from crmd.5477) for host 'pg2' with device 'p_ston_pg2'
returned: 0 (OK)
Jul 20 18:57:07 localhost stonith-ng[5473]:   notice: Operation poweroff of pg2
by pg1 for crmd.5477@pg1.e669ca92: OK
Jul 20 18:57:07 localhost crmd[5477]:   notice: Stonith operation
2/49:0:0:8b1582af-b779-4975-8f98-40c1ba4fa75e: OK (0)
Jul 20 18:57:07 localhost crmd[5477]:   notice: Peer pg2 was terminated
(poweroff) by pg1 for pg1: OK (ref=e669ca92-8255-4036-a57b-447de0453162) by
client crmd.5477

[...]




> 
> > Regards,
> > Ulrich
> > 


Re: [ClusterLabs] Antw: Re: Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"

2016-07-19 Thread Martin Schlegel

> Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote on 19 July 2016 at
> 08:41:
> 
> >>> Martin Schlegel <mar...@nuboreto.org> wrote on 19.07.2016 at 00:51 in
> message
> <301244266.332724.5ea3ddc5-55ea-43b0-9a1b-22ebb1dcafd2.open-xchange@email.1und1.
> e>:
> 
> > Thanks Jan !
> > 
> > If anybody else is hitting the error of a ring being bound to 127.0.0.1
> > instead of the configured IP, with corosync-cfgtool -s showing "[...]
> > interface 127.0.0.1 FAULTY [...]":
> > 
> > We noticed an issue occasionally occurring at boot time, that we believe to 
> > be a
> > bug in Ubuntu 14.04. It causes Corosync to start before all bindnetaddr IPs 
> > are
> > up and running.
> 
> Would it happen also if someone does a "rcnetwork restart" while the
> cluster is up? I think we had it once in SLES11 also, but I was never sure
> how it was triggered.

Yes, it likely would, as this does an ifdown on the network interfaces.
However, as we could confirm in syslog, our issues always originated at boot
time and only affected us later, when the other remaining ring failed as
well. We are adding monitoring to detect any ring being marked faulty, so
that we are warned and can try to recover the ring manually in case the
automatic recovery has given up.

I wish the health state of each ring on each node were available
cluster-wide, but I have not found a way. Instead I need to gather the output
of "corosync-cfgtool -s" from each node.


> 
> > What happens is that despite the $network dependency and correct order
> > for the corosync runlevel script, the corosync service might be started
> > after only the bond0 interface was fully started, but before our bond1
> > interface was assigned the IP address.
> > 
> > For now we have added some code to the Corosync runlevel script that
> > waits a certain amount of time for whatever bindnetaddr IPs have been
> > configured in /etc/corosync/corosync.conf .
> > 
> > Cheers,
> > Martin Schlegel
> 
> >> Jan Friesse <jfrie...@redhat.com> wrote on 16 June 2016 at 17:55:
> >>
> >> Martin Schlegel wrote:
> >> 
> >> > Hello everyone,
> >> > 
> >> > we run a 3 node Pacemaker (1.1.14) / Corosync (2.3.5) cluster for a
> >> > couple of months successfully and we have started seeing a faulty
> >> > ring with unexpected 127.0.0.1 binding that we cannot reset via
> >> > "corosync-cfgtool -r".
> >> This is a problem. Bind to 127.0.0.1 = ifdown happened = problem, and
> >> with RRP it means a BIG problem.
> >> 
> >> > We have had this once before and only restarting Corosync (and everything
> >> > else)
> >> > on the node showing the unexpected 127.0.0.1 binding made the problem go
> >> > away.
> >> > However, in production we obviously would like to avoid this if possible.
> >> 
> >> Just don't do ifdown. Never. If you are using NetworkManager (which does
> >> ifdown by default if a cable is disconnected), use something like the
> >> NetworkManager-config-server package (it's just a change of
> >> configuration, so you can adapt it to whatever distribution you are
> >> using).
> >> 
> >> Regards,
> >> Honza
> >> 
> >> > So from the following description - how can I troubleshoot this issue
> >> > and/or does anybody have a good idea what might be happening here?
> >> > We run 2x passive rrp rings across different IP-subnets via udpu and
> >> > we get the following output (all IPs obfuscated) - please notice the
> >> > unexpected interface binding 127.0.0.1 for host pg2.
> >> > 
> >> > If we reset via "corosync-cfgtool -r" on each node, heartbeat ring
> >> > id 1 briefly shows "no faults" but goes back to "FAULTY" seconds
> >> > later.
> >> > 
> >> > Regards,
> >> > Martin Schlegel
> >> > _
> >> > 
> >> > root@pg1:~# corosync-cfgtool -s
> >> > Printing ring status.
> >> > Local node ID 1
> >> > RING ID 0
> >> > id = A.B.C1.5
> >> > status = ring 0 active with no faults
> >> > RING ID 1
> >> > id = D.E.F1.170
> >> > status = Marking ringid 1 interface D.E.F1.170 FAULTY
> >> > [...]

Re: [ClusterLabs] Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"

2016-06-16 Thread Martin Schlegel
Hi Jan

Thanks for your super-quick response!

We do not use NetworkManager - it's all static on these Ubuntu 14.04 nodes
(/etc/network/interfaces).

I do not think we did an ifdown on the network interfaces manually. However,
the IP addresses are assigned to bond0 and bond1 - we use 4x physical network
interfaces, with 2x bonded into a public network (bond1) and 2x bonded into a
private network (bond0).

Could this have anything to do with it?

Regards,
Martin Schlegel

___

From /etc/network/interfaces, i.e.:

auto bond0
iface bond0 inet static
#pre-up /sbin/ethtool -s bond0 speed 1000 duplex full autoneg on
post-up ifenslave bond0 eth0 eth2
pre-down ifenslave -d bond0 eth0 eth2
bond-slaves none
bond-mode 4
bond-lacp-rate fast
bond-miimon 100
bond-downdelay 0
bond-updelay 0
bond-xmit_hash_policy 1
address  [...]

> Jan Friesse <jfrie...@redhat.com> wrote on 16 June 2016 at 17:55:
> 
> Martin Schlegel wrote:
> 
> > Hello everyone,
> > 
> > we run a 3 node Pacemaker (1.1.14) / Corosync (2.3.5) cluster for a
> > couple of months successfully and we have started seeing a faulty ring
> > with unexpected 127.0.0.1 binding that we cannot reset via
> > "corosync-cfgtool -r".
> 
> This is a problem. Bind to 127.0.0.1 = ifdown happened = problem, and with
> RRP it means a BIG problem.
> 
> > We have had this once before and only restarting Corosync (and everything
> > else)
> > on the node showing the unexpected 127.0.0.1 binding made the problem go
> > away.
> > However, in production we obviously would like to avoid this if possible.
> 
> Just don't do ifdown. Never. If you are using NetworkManager (which does
> ifdown by default if a cable is disconnected), use something like the
> NetworkManager-config-server package (it's just a change of configuration,
> so you can adapt it to whatever distribution you are using).
> 
> Regards,
>  Honza
> 
> > So from the following description - how can I troubleshoot this issue
> > and/or does anybody have a good idea what might be happening here?
> > 
> > We run 2x passive rrp rings across different IP-subnets via udpu and we
> > get the following output (all IPs obfuscated) - please notice the
> > unexpected interface binding 127.0.0.1 for host pg2.
> > 
> > If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1
> > briefly shows "no faults" but goes back to "FAULTY" seconds later.
> > 
> > Regards,
> > Martin Schlegel
> > _
> > 
> > root@pg1:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 1
> > RING ID 0
> >  id = A.B.C1.5
> >  status = ring 0 active with no faults
> > RING ID 1
> >  id = D.E.F1.170
> >  status = Marking ringid 1 interface D.E.F1.170 FAULTY
> > 
> > root@pg2:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 2
> > RING ID 0
> >  id = A.B.C2.88
> >  status = ring 0 active with no faults
> > RING ID 1
> >  id = 127.0.0.1
> >  status = Marking ringid 1 interface 127.0.0.1 FAULTY
> > 
> > root@pg3:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 3
> > RING ID 0
> >  id = A.B.C3.236
> >  status = ring 0 active with no faults
> > RING ID 1
> >  id = D.E.F3.112
> >  status = Marking ringid 1 interface D.E.F3.112 FAULTY
> > 
> > _
> > 
> > /etc/corosync/corosync.conf from pg1 - other nodes use different subnets
> > and IPs, but are otherwise identical:
> > ===
> > quorum {
> >  provider: corosync_votequorum
> >  expected_votes: 3
> > }
> > 
> > totem {
> >  version: 2
> > 
> >  crypto_cipher: none
> >  crypto_hash: none
> > 
> >  rrp_mode: passive
> >  interface {
> >  ringnumber: 0
> >  bindnetaddr: A.B.C1.0
> >  mcastport: 5405
> >  ttl: 1
> >  }
> >  interface {
> >  ringnumber: 1
> >  bindnetaddr: D.E.F1.64
> >  mcastport: 5405
> >  ttl: 1
> >  }
> >  transport: udpu
> > }
> > 
> > nodelist {
> >  node {
> >  ring0_addr: pg1
> >  ring1_addr: pg1p
> >  nodeid: 1
> >  }
> >  node {
> >  ring0_addr: pg2
> >  ring1_addr: pg2p
> >  nodeid: 2
> >  }
> >  node {
> >  ring0_addr: pg3
> >  ring1_addr: pg3p
> >  nodeid: 3
> >  }
> > }
> > 
> > logging {
> >  to_syslog: yes
> > }
> > 
> > ===
> > 



[ClusterLabs] Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"

2016-06-16 Thread Martin Schlegel
Hello everyone,

we have run a 3 node Pacemaker (1.1.14) / Corosync (2.3.5) cluster
successfully for a couple of months, and we have started seeing a faulty ring
with an unexpected 127.0.0.1 binding that we cannot reset via
"corosync-cfgtool -r".

We have had this once before and only restarting Corosync (and everything else)
on the node showing the unexpected 127.0.0.1 binding made the problem go away.
However, in production we obviously would like to avoid this if possible.

So from the following description - how can I troubleshoot this issue, and/or
does anybody have a good idea what might be happening here?

We run 2x passive rrp rings across different IP-subnets via udpu and we get the
following output (all IPs obfuscated) - please notice the unexpected interface
binding 127.0.0.1 for host pg2.

If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1
briefly shows "no faults" but goes back to "FAULTY" seconds later.

Regards,
Martin Schlegel
_

root@pg1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id  = A.B.C1.5
status  = ring 0 active with no faults
RING ID 1
id  = D.E.F1.170
status  = Marking ringid 1 interface D.E.F1.170 FAULTY

root@pg2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id  = A.B.C2.88
status  = ring 0 active with no faults
RING ID 1
id  = 127.0.0.1
status  = Marking ringid 1 interface 127.0.0.1 FAULTY

root@pg3:~# corosync-cfgtool -s
Printing ring status.
Local node ID 3
RING ID 0
id  = A.B.C3.236
status  = ring 0 active with no faults
RING ID 1
id  = D.E.F3.112
status  = Marking ringid 1 interface D.E.F3.112 FAULTY


_


/etc/corosync/corosync.conf from pg1 - other nodes use different subnets and
IPs, but are otherwise identical:
===
quorum {
provider: corosync_votequorum
expected_votes: 3
}

totem {
version: 2

crypto_cipher: none
crypto_hash: none

rrp_mode: passive
interface {
ringnumber: 0
bindnetaddr: A.B.C1.0
mcastport: 5405
ttl: 1
}
interface {
ringnumber: 1
bindnetaddr: D.E.F1.64
mcastport: 5405
ttl: 1
}
transport: udpu
}

nodelist {
node {
ring0_addr: pg1
ring1_addr: pg1p
nodeid: 1
}
node {
ring0_addr: pg2
ring1_addr: pg2p
nodeid: 2
}
node {
ring0_addr: pg3
ring1_addr: pg3p
nodeid: 3
}
}

logging {
to_syslog: yes
}

===



[ClusterLabs] Opt-in cluster shows resources stopped where no nodes should be considered

2016-03-04 Thread Martin Schlegel
Hello all

While our cluster seems to be working just fine I have noticed something in the
crm_mon output that I don't quite understand and that is throwing off my
monitoring a bit as stopped resources could mean something is wrong. I was
hoping somebody could help me to understand what it means. It seems this might
have something to do with the fact I am using remote nodes, but I cannot wrap my
head around it.

What I am seeing are 3 additional, unexpected lines in the crm_mon -1rR output
listing my "p_pgcPgbouncer_test" resources as stopped even though there should
not be any more nodes to be considered in my mind (opt-in cluster, see location
rules). At the same time this is not happening to my p_pgsqln resources as shown
at the top of the crm_mon output.

The important crm_mon -1rR output lines further below are marked with
arrows -> ... <---.


Some background on the policy:
We are running an asymmetric / opt-in cluster (property
symmetric-cluster=false).


The cluster's main purpose is to take care of a 3+-node replicating master /
slave database running strictly on nodes pg1, pg2 and pg3 per location rule
l_pgs_resources.

We also have 2 remote nodes, pgalog1 & pgalog2, defined to control database
connection pooler resources (p_pgcPgbouncer_test) to facilitate client
connection rerouting as per location rule l_pgc_resources.


crm_mon -1rR output:

Last updated: Fri Mar  4 09:56:02 2016  Last change: Fri Mar  4 09:55:47
2016 by root via cibadmin on pg1
Stack: corosync
Current DC: pg1 (1) (version 1.1.14-70404b0) - partition with quorum
5 nodes and 29 resources configured

Online: [ pg1 (1) pg2 (2) pg3 (3) ]
RemoteOnline: [ pgalog1 pgalog2 ]

Full list of resources:

 Master/Slave Set: ms_pgsqln [p_pgsqln]
 p_pgsqln   (ocf::heartbeat:pgsqln):Master pg3
 p_pgsqln   (ocf::heartbeat:pgsqln):Started pg1
 p_pgsqln   (ocf::heartbeat:pgsqln):Started pg2
-> NO additional lines here <---
 Masters: [ pg3 ]
 Stopped: [ pg1 pg2 ]
[...]
 pgalog1(ocf::pacemaker:remote):Started pg1
 pgalog2(ocf::pacemaker:remote):Started pg3
 Clone Set: cl_pgcPgbouncer [p_pgcPgbouncer_test]
 p_pgcPgbouncer_test(ocf::heartbeat:pgbouncer): Started pgalog1
 p_pgcPgbouncer_test(ocf::heartbeat:pgbouncer): Started pgalog2
->   p_pgcPgbouncer_test(ocf::heartbeat:pgbouncer): Stopped <---
->   p_pgcPgbouncer_test(ocf::heartbeat:pgbouncer): Stopped <---
->   p_pgcPgbouncer_test(ocf::heartbeat:pgbouncer): Stopped <---
 Started: [ pgalog1 pgalog2 ]



Here are the most important parts of the configuration as shown in "crm
configure show":

[...]
primitive pgalog1 ocf:pacemaker:remote \
params server=pgalog1 port=3121 \
meta target-role=Started
primitive pgalog2 ocf:pacemaker:remote \
params server=pgalog2 port=3121 \
meta target-role=Started
[...]
location l_pgc_resources { cl_pgcPgbouncer } resource-discovery=exclusive \
rule #uname eq pgalog1 \
rule #uname eq pgalog2

location l_pgs_resources { cl_pgsServices1 ms_pgsqln p_pgsBackupjob pgalog1
pgalog2 } resource-discovery=exclusive \
rule #uname eq pg1 \
rule #uname eq pg2 \
rule #uname eq pg3

[...]
property cib-bootstrap-options: \
    symmetric-cluster=false \
[...]


Regards,
Martin Schlegel

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] crm_mon change in behaviour PM 1.1.12 -> 1.1.14: crm_mon -XA filters #health.* node attributes

2016-03-03 Thread Martin Schlegel

Hello everybody,

This is my first post on this mailing list and I have only been using
Pacemaker since fall 2015 ... please be gentle :-) and I will do the same.

Our cluster is using multiple resource agents (via RAs like sysinfo,
healthcpu, etc.) that update various node attributes of the form #health.*,
and we rely on the mechanism enabled via the property
node-health-strategy=migrate-on-red to trigger resource migrations.

In Pacemaker version 1.1.12, crm_mon -A or -XA would still display these
#health.* attributes, but not since we have moved up to 1.1.14, and I am not
sure why this needed to be changed:

root@ys0-resiliency-test-1:~# crm node status-attr pg1 show \#health-cpu
scope=status  name=#health-cpu value=green

root@ys0-resiliency-test-1:~# crm_mon -XrRAf1 | grep -i '#health' ; echo $?
1

This seems to be due to this part of the crm_mon.c code:

/* Never display node attributes whose name starts with one of these prefixes */
#define FILTER_STR { "shutdown", "terminate", "standby", "fail-count", \
                     "last-failure", "probe_complete", "#", NULL }

I would like to know if anybody shares my opinions on this:

1. From an operations point of view it would be easier to get crm_mon to
include #health.* in the general output, or at least in the XML output via
crm_mon -XA, so that I can get a comprehensive status view in one shot.
2. Because the node attributes list can be extensive and clutters up the
output, it would make sense to allow a user-defined filter for node
attributes in general.

Regards,
Martin Schlegel
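
P.S. In the meantime we fall back to querying the attributes one at a time -
a sketch, assuming crm_attribute's documented -G (query), -t (section),
-N (node) and -n (name) options:

# read one #health attribute from the status section for node pg1:
crm_attribute -G -t status -N pg1 -n '#health-cpu'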
