Hi Jan,

Thanks for your super-quick response!
We do not use a network manager - it's all static on these Ubuntu 14.04 nodes (/etc/network/interfaces). I do not think we did an ifdown on a network interface manually. However, the IP addresses are assigned to bond0 and bond1 - we use 4 physical network interfaces, with 2 bonded into a public network (bond1) and 2 bonded into a private network (bond0). Could this have anything to do with it?

Regards,
Martin Schlegel

___________________

From /etc/network/interfaces, i.e.:

auto bond0
iface bond0 inet static
    #pre-up /sbin/ethtool -s bond0 speed 1000 duplex full autoneg on
    post-up ifenslave bond0 eth0 eth2
    pre-down ifenslave -d bond0 eth0 eth2
    bond-slaves none
    bond-mode 4
    bond-lacp-rate fast
    bond-miimon 100
    bond-downdelay 0
    bond-updelay 0
    bond-xmit_hash_policy 1
    address [...]

> Jan Friesse <jfrie...@redhat.com> wrote on 16 June 2016 at 17:55:
>
> Martin Schlegel wrote:
>
> > Hello everyone,
> >
> > we have been running a 3-node Pacemaker (1.1.14) / Corosync (2.3.5) cluster
> > successfully for a couple of months, and we have started seeing a faulty ring
> > with an unexpected 127.0.0.1 binding that we cannot reset via
> > "corosync-cfgtool -r".
>
> This is the problem. A bind to 127.0.0.1 means an ifdown happened, and
> with RRP that is a BIG problem.
>
> > We have had this once before, and only restarting Corosync (and everything
> > else) on the node showing the unexpected 127.0.0.1 binding made the problem
> > go away. However, in production we obviously would like to avoid this if
> > possible.
>
> Just don't do ifdown. Never. If you are using NetworkManager (which does
> an ifdown by default when a cable is disconnected), use something like the
> NetworkManager-config-server package (it's just a configuration change, so
> you can adapt it to whatever distribution you are using).
>
> Regards,
>   Honza
>
> > So, given the following description - how can I troubleshoot this issue,
> > and/or does anybody have a good idea what might be happening here?
> >
> > We run 2 passive RRP rings across different IP subnets via udpu, and we get
> > the following output (all IPs obfuscated) - please notice the unexpected
> > interface binding 127.0.0.1 for host pg2.
> >
> > If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1
> > briefly shows "no faults" but goes back to "FAULTY" seconds later.
> >
> > Regards,
> > Martin Schlegel
> > _____________________________________
> >
> > root@pg1:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 1
> > RING ID 0
> >         id      = A.B.C1.5
> >         status  = ring 0 active with no faults
> > RING ID 1
> >         id      = D.E.F1.170
> >         status  = Marking ringid 1 interface D.E.F1.170 FAULTY
> >
> > root@pg2:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 2
> > RING ID 0
> >         id      = A.B.C2.88
> >         status  = ring 0 active with no faults
> > RING ID 1
> >         id      = 127.0.0.1
> >         status  = Marking ringid 1 interface 127.0.0.1 FAULTY
> >
> > root@pg3:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 3
> > RING ID 0
> >         id      = A.B.C3.236
> >         status  = ring 0 active with no faults
> > RING ID 1
> >         id      = D.E.F3.112
> >         status  = Marking ringid 1 interface D.E.F3.112 FAULTY
> >
> > _____________________________________
> >
> > /etc/corosync/corosync.conf from pg1 - other nodes use different subnets and
> > IPs, but are otherwise identical:
> > ===========================================
> > quorum {
> >     provider: corosync_votequorum
> >     expected_votes: 3
> > }
> >
> > totem {
> >     version: 2
> >
> >     crypto_cipher: none
> >     crypto_hash: none
> >
> >     rrp_mode: passive
> >     interface {
> >         ringnumber: 0
> >         bindnetaddr: A.B.C1.0
> >         mcastport: 5405
> >         ttl: 1
> >     }
> >     interface {
> >         ringnumber: 1
> >         bindnetaddr: D.E.F1.64
> >         mcastport: 5405
> >         ttl: 1
> >     }
> >     transport: udpu
> > }
> >
> > nodelist {
> >     node {
> >         ring0_addr: pg1
> >         ring1_addr: pg1p
> >         nodeid: 1
> >     }
> >     node {
> >         ring0_addr: pg2
> >         ring1_addr: pg2p
> >         nodeid: 2
> >     }
> >     node {
> >         ring0_addr: pg3
> >         ring1_addr: pg3p
> >         nodeid: 3
> >     }
> > }
> >
> > logging {
> >     to_syslog: yes
> > }
> >
> > ===========================================
> >
> > _______________________________________________
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
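P.S. Until the root cause is found, one way to catch this condition early is to watch for the loopback binding in the "corosync-cfgtool -s" output itself. Below is a rough sketch (not an official corosync tool - just a hypothetical helper, assuming Python 3.9+) that parses that output and reports any ring whose interface id has fallen back to 127.0.0.1:

```python
import re

def faulty_loopback_rings(cfgtool_output):
    """Return the ring IDs from `corosync-cfgtool -s` output whose bound
    interface id is 127.0.0.1 (i.e. the ring fell back to loopback)."""
    faulty = []
    ring = None
    for raw in cfgtool_output.splitlines():
        line = raw.strip()
        m = re.match(r"RING ID (\d+)", line)
        if m:
            ring = m.group(1)           # remember which ring we are inside
        elif line.startswith("id") and ring is not None:
            if "127.0.0.1" in line:     # loopback binding detected
                faulty.append(ring)
    return faulty

# Example using the (obfuscated) output from host pg2 above:
SAMPLE = """Printing ring status.
Local node ID 2
RING ID 0
        id      = A.B.C2.88
        status  = ring 0 active with no faults
RING ID 1
        id      = 127.0.0.1
        status  = Marking ringid 1 interface 127.0.0.1 FAULTY"""

print(faulty_loopback_rings(SAMPLE))    # -> ['1']
```

On a live node you would feed it the captured stdout of "corosync-cfgtool -s" (e.g. via subprocess) from a cron job and alert if the list is non-empty - a node reporting 127.0.0.1 is the one that needs a Corosync restart.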