Martin,

Hi Jan

Thanks for your super quick response!

We do not use NetworkManager - the configuration is all static on these Ubuntu 14.04 nodes
(/etc/network/interfaces).

Good


I do not think we did an ifdown on the network interface manually. However, the
IP addresses are assigned to bond0 and bond1 - we use four physical network
interfaces, with two bonded into a public network (bond1) and two bonded into a
private network (bond0).

Could this have anything to do with it?

I don't think so. The problem really only happens when corosync is configured with an IP address that disappears, so it has to rebind to 127.0.0.1. You would then see "The network interface is down" in the logs. Try to find that message to confirm it is really the problem I was referring to.
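
For example, on the affected node (pg2) something like the following should turn up the relevant messages (a sketch; it assumes corosync logs to syslog as in the configuration further below, and the log file path may differ):

  # look for the rebind-to-loopback message around the time ring 1 went FAULTY
  grep -i "The network interface is down" /var/log/syslog
  # any mention of the loopback address in the corosync/totem messages
  grep "127.0.0.1" /var/log/syslog | grep -i totem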

Regards,
  Honza

Regards,
Martin Schlegel

___________________

 From /etc/network/interfaces, e.g.:

auto bond0
iface bond0 inet static
#pre-up /sbin/ethtool -s bond0 speed 1000 duplex full autoneg on
post-up ifenslave bond0 eth0 eth2
pre-down ifenslave -d bond0 eth0 eth2
bond-slaves none
bond-mode 4
bond-lacp-rate fast
bond-miimon 100
bond-downdelay 0
bond-updelay 0
bond-xmit_hash_policy 1
address  [...]
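
As a quick check on whether one of the bonds (or one of its slaves) flapped, something like the following can be run on the affected node (a sketch; /proc/net/bonding is provided by the Linux bonding driver, and the interface names are the ones from the configuration above):

  cat /proc/net/bonding/bond0   # MII status and link failure counters for the bond and its slaves
  cat /proc/net/bonding/bond1
  ip -o addr show               # confirm both ring addresses are still assigned to the bonds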

Jan Friesse <jfrie...@redhat.com> wrote on 16 June 2016 at 17:55:

Martin Schlegel wrote:

Hello everyone,

we have been running a 3-node Pacemaker (1.1.14) / Corosync (2.3.5) cluster
successfully for a couple of months, and we have started seeing a faulty ring
with an unexpected 127.0.0.1 binding that we cannot reset via "corosync-cfgtool -r".

This is the problem. A bind to 127.0.0.1 means an ifdown happened, which is a
problem, and with RRP it is a BIG problem.

We have had this once before, and only restarting Corosync (and everything else)
on the node showing the unexpected 127.0.0.1 binding made the problem go away.
However, in production we obviously would like to avoid this if possible.

Just don't do ifdown. Never. If you are using NetworkManager (which does an
ifdown by default when the cable is disconnected), use something like the
NetworkManager-config-server package (it is just a configuration change, so you
can adapt it to whatever distribution you are using).
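
For reference, the change that package makes amounts to a NetworkManager drop-in roughly like the one below (a sketch; the exact file path and keys may vary between NetworkManager versions and distributions - the point is that NetworkManager no longer unconfigures a device just because the carrier went away):

  # e.g. /etc/NetworkManager/conf.d/00-server.conf (hypothetical local copy of the packaged defaults)
  [main]
  no-auto-default=*
  ignore-carrier=*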

Regards,
  Honza

So, given the following description - how can I troubleshoot this issue, and/or
does anybody have a good idea what might be happening here?

We run 2x passive rrp rings across different IP subnets via udpu and we get the
following output (all IPs obfuscated) - please notice the unexpected interface
binding 127.0.0.1 for host pg2.

If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1 briefly
shows "no faults" but goes back to "FAULTY" seconds later.

Regards,
Martin Schlegel
_____________________________________

root@pg1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
  id = A.B.C1.5
  status = ring 0 active with no faults
RING ID 1
  id = D.E.F1.170
  status = Marking ringid 1 interface D.E.F1.170 FAULTY

root@pg2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
  id = A.B.C2.88
  status = ring 0 active with no faults
RING ID 1
  id = 127.0.0.1
  status = Marking ringid 1 interface 127.0.0.1 FAULTY

root@pg3:~# corosync-cfgtool -s
Printing ring status.
Local node ID 3
RING ID 0
  id = A.B.C3.236
  status = ring 0 active with no faults
RING ID 1
  id = D.E.F3.112
  status = Marking ringid 1 interface D.E.F3.112 FAULTY

_____________________________________

/etc/corosync/corosync.conf from pg1 - the other nodes use different subnets and
IPs, but are otherwise identical:
===========================================
quorum {
  provider: corosync_votequorum
  expected_votes: 3
}

totem {
  version: 2

  crypto_cipher: none
  crypto_hash: none

  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: A.B.C1.0
    mcastport: 5405
    ttl: 1
  }
  interface {
    ringnumber: 1
    bindnetaddr: D.E.F1.64
    mcastport: 5405
    ttl: 1
  }
  transport: udpu
}

nodelist {
  node {
    ring0_addr: pg1
    ring1_addr: pg1p
    nodeid: 1
  }
  node {
    ring0_addr: pg2
    ring1_addr: pg2p
    nodeid: 2
  }
  node {
    ring0_addr: pg3
    ring1_addr: pg3p
    nodeid: 3
  }
}

logging {
  to_syslog: yes
}

===========================================
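
Two quick checks against this configuration that may help narrow things down (a sketch; on an older system "ss" can be replaced by "netstat -anup", and the expected addresses are the obfuscated ones shown above):

  # on pg2: which local addresses are the corosync udpu sockets actually bound to?
  ss -anup | grep corosync
  # do the ring addresses resolve to the intended subnets, and not to a loopback entry in /etc/hosts?
  getent hosts pg2 pg2p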

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



