Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-09-02 Thread David Tolosa
Hello guys,
In the end we decided not to use the alpha version, because these will be
production servers.

But it finally works. On Ubuntu 18.04 I just rolled back from netplan
(systemd-networkd) to NetworkManager/ifupdown, and now corosync correctly
reports the redundant ring after reboots.
I didn't need to install the NetworkManager-config-server package.
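For readers hitting the same issue, a minimal ifupdown sketch for the two ring
interfaces (interface names and addresses taken from the netplan configuration
posted earlier in this thread; David did not post his final configuration, so
treat this purely as an illustration):

# /etc/network/interfaces (excerpt)
auto eno1
iface eno1 inet static
        address 192.168.0.1
        netmask 255.255.255.0

auto enp4s0f0
iface enp4s0f0 inet static
        address 192.168.1.1
        netmask 255.255.255.0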

Thank you so much!

2018-08-25 22:13 GMT+02:00 Ferenc Wágner :

> wf...@niif.hu (Ferenc Wágner) writes:
>
> > David Tolosa  writes:
> >
> >> I tried to install corosync 3.x and it works pretty well.
> >> But when I install pacemaker, it installs previous version of corosync
> as
> >> dependency and breaks all the setup.
> >> Any suggestions?
> >
> > Install the equivs package to create a dummy corosync package
> > representing your local corosync build.
> > https://manpages.debian.org/stretch/equivs/equivs-build.1.en.html
>
> Forget it, libcfg changed ABI, so you'll have to recompile Pacemaker
> after all.
> --
> Regards,
> Feri
>



-- 
*David Tolosa Martínez*
Customer Support & Infrastructure
UPCnet - Edifici Vèrtex
Plaça d'Eusebi Güell, 6, 08034 Barcelona
Tel: 934054555



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-25 Thread Ferenc Wágner
wf...@niif.hu (Ferenc Wágner) writes:

> David Tolosa  writes:
>
>> I tried to install corosync 3.x and it works pretty well.
>> But when I install pacemaker, it installs previous version of corosync as
>> dependency and breaks all the setup.
>> Any suggestions?
>
> Install the equivs package to create a dummy corosync package
> representing your local corosync build.
> https://manpages.debian.org/stretch/equivs/equivs-build.1.en.html

Forget it, libcfg changed ABI, so you'll have to recompile Pacemaker
after all.
-- 
Regards,
Feri


Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-25 Thread Ferenc Wágner
David Tolosa  writes:

> I tried to install corosync 3.x and it works pretty well.
> But when I install pacemaker, it installs previous version of corosync as
> dependency and breaks all the setup.
> Any suggestions?

Install the equivs package to create a dummy corosync package
representing your local corosync build.
https://manpages.debian.org/stretch/equivs/equivs-build.1.en.html
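For illustration, a minimal equivs control file along those lines (the version
string, maintainer and file name below are placeholders, not taken from this
thread):

Section: misc
Priority: optional
Standards-Version: 3.9.2
Package: corosync
Version: 2.99.3-0local1
Maintainer: Your Name <you@example.com>
Description: dummy package standing in for a locally built corosync 3

Build and install it with something like "equivs-build corosync-dummy.control"
followed by "dpkg -i" on the resulting .deb. (As the follow-up in this thread
notes, libcfg changed ABI, so Pacemaker has to be recompiled against the new
corosync anyway.)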
-- 
Regards,
Feri


Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-24 Thread David Tolosa
How can I follow the first two solutions?
Regards,

2018-08-24 8:21 GMT+02:00 Jan Friesse :

> I tried to install corosync 3.x and it works pretty well.
>>
>
> Cool
>
> But when I install pacemaker, it installs previous version of corosync as
>> dependency and breaks all the setup.
>> Any suggestions?
>>
>
> I can see at least the following "solutions":
> - make a proper Debian package
> - install corosync 3 to /usr/local
> - (ugly) install the packaged corosync and then overwrite it with corosync 3 built from source
>
> Regards,
>   Honza
>
>
>
>> 2018-08-23 9:32 GMT+02:00 Jan Friesse :
>>
>> David,
>>>
>>> BTW, where I can download Corosync 3.x?
>>>
 I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/coro
 sync/


>>> Yes, that's Alpha 4 of Corosync 3.
>>>
>>>
>>>
>>>
>>> 2018-08-23 9:11 GMT+02:00 David Tolosa :

 I'm currently using an Ubuntu 18.04 server configuration with netplan.

>
> Here you have my current YAML configuration:
>
> # This file describes the network interfaces available on your system
> # For more information, see netplan(5).
> network:
>   version: 2
>   renderer: networkd
>   ethernets:
>     eno1:
>       addresses: [192.168.0.1/24]
>     enp4s0f0:
>       addresses: [192.168.1.1/24]
>     enp5s0f0:
>       {}
>   vlans:
>     vlan.XXX:
>       id: XXX
>       link: enp5s0f0
>       addresses: [ 10.1.128.5/29 ]
>       gateway4: 10.1.128.1
>       nameservers:
>         addresses: [ 8.8.8.8, 8.8.4.4 ]
>         search: [ foo.com, bar.com ]
>     vlan.YYY:
>       id: YYY
>       link: enp5s0f0
>       addresses: [ 10.1.128.5/29 ]
>
>
> So, eno1 and enp4s0f0 are the two ethernet ports connected each other
> with crossover cables to node2.
> enp5s0f0 port is used to connect outside/services using vlans defined
> in
> the same file.
>
> In short, I'm using systemd-networkd default Ubuntu 18 server service
> for
>
>
 Ok, so systemd-networkd is really doing ifdown and somebody actually
>>> tries
>>> fix it and merge into upstream (sadly with not too much luck :( )
>>>
>>> https://github.com/systemd/systemd/pull/7403
>>>
>>>
>>> manage networks. Im not detecting any NetworkManager-config-server
>>>
 package in my repository neither.
>
>
 I'm not sure how it's called in Debian based distributions, but it's
>>> just
>>> one small file in /etc, so you can extract it from RPM.
>>>
>>> So the only solution that I have left, I suppose, is to test corosync 3.x
>>>
 and see if it works better handling RRP.
>
>
 You may also reconsider to try ether completely static network
>>> configuration or NetworkManager + NetworkManager-config-server.
>>>
>>>
>>> Corosync 3.x with knet will work for sure, but be prepared for quite a
>>> long compile path, because you first have to compile knet and then
>>> corosync. What may help you a bit is that we have a ubuntu 18.04 in our
>>> jenkins, so it should be possible corosync build log
>>> https://ci.kronosnet.org/view/corosync/job/corosync-build-al
>>> l-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lt
>>> s-x86-64/consoleText, knet build log https://ci.kronosnet.org/view/
>>> knet/job/knet-build-all-voting/lastBuild/knet-build-all-
>>> voting=ubuntu-18-04-lts-x86-64/consoleText).
>>>
>>> Also please consult http://people.redhat.com/ccaul
>>> fie/docs/KnetCorosync.pdf about changes in corosync configuration.
>>>
>>> Regards,
>>>Honza
>>>
>>>
>>> Thank you for your quick response!
>
> 2018-08-23 8:40 GMT+02:00 Jan Friesse :
>
> David,
>
>>
>> Hello,
>>
>> Im getting crazy about this problem, that I expect to resolve here,
>>> with
>>> your help guys:
>>>
>>> I have 2 nodes with Corosync redundant ring feature.
>>>
>>> Each node has 2 similarly connected/configured NIC's. Both nodes are
>>> connected each other by two crossover cables.
>>>
>>>
>>> I believe this is root of the problem. Are you using NetworkManager?
>> If
>> so, have you installed NetworkManager-config-server? If not, please
>> install
>> it and test again.
>>
>>
>> I configured both nodes with rrp mode passive. Everything is working
>>
>>> well
>>> at this point, but when I shutdown 1 node to test failover, and this
>>> node > returns to be online, corosync is marking the interface as
>>> FAULTY
>>> and rrp
>>>
>>>
>>> I believe it's because with crossover cables configuration when other
>> side is shutdown, NetworkManager detects it and does ifdown of the
>> interface. And corosync is unable to handle ifdown properly. Ifdown is
>> bad
>> with single ring, but it's just killer with RRP (127.0.0.1 poisons
>> every
>> node in the cluster).
>>
>> fails to 

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-24 Thread David Tolosa
I tried installing corosync 3.x and it works pretty well.
But when I install pacemaker, it pulls in the previous version of corosync as
a dependency and breaks the whole setup.
Any suggestions?

2018-08-23 9:32 GMT+02:00 Jan Friesse :

> David,
>
> BTW, where I can download Corosync 3.x?
>> I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/coro
>> sync/
>>
>
> Yes, that's Alpha 4 of Corosync 3.
>
>
>
>
>> 2018-08-23 9:11 GMT+02:00 David Tolosa :
>>
>> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>>>
>>> Here you have my current YAML configuration:
>>>
>>> # This file describes the network interfaces available on your system
>>> # For more information, see netplan(5).
>>> network:
>>>version: 2
>>>renderer: networkd
>>>ethernets:
>>>  eno1:
>>>addresses: [192.168.0.1/24]
>>>  enp4s0f0:
>>>addresses: [192.168.1.1/24]
>>>  enp5s0f0:
>>>{}
>>>vlans:
>>>  vlan.XXX:
>>>id: XXX
>>>link: enp5s0f0
>>>addresses: [ 10.1.128.5/29 ]
>>>gateway4: 10.1.128.1
>>>nameservers:
>>>  addresses: [ 8.8.8.8, 8.8.4.4 ]
>>>  search: [ foo.com, bar.com ]
>>>  vlan.YYY:
>>>id: YYY
>>>link: enp5s0f0
>>>addresses: [ 10.1.128.5/29 ]
>>>
>>>
>>> So, eno1 and enp4s0f0 are the two ethernet ports connected each other
>>> with crossover cables to node2.
>>> enp5s0f0 port is used to connect outside/services using vlans defined in
>>> the same file.
>>>
>>> In short, I'm using systemd-networkd default Ubuntu 18 server service for
>>>
>>
> Ok, so systemd-networkd is really doing ifdown and somebody actually tries
> fix it and merge into upstream (sadly with not too much luck :( )
>
> https://github.com/systemd/systemd/pull/7403
>
>
> manage networks. Im not detecting any NetworkManager-config-server
>>> package in my repository neither.
>>>
>>
> I'm not sure how it's called in Debian based distributions, but it's just
> one small file in /etc, so you can extract it from RPM.
>
> So the only solution that I have left, I suppose, is to test corosync 3.x
>>> and see if it works better handling RRP.
>>>
>>
> You may also reconsider to try ether completely static network
> configuration or NetworkManager + NetworkManager-config-server.
>
>
> Corosync 3.x with knet will work for sure, but be prepared for quite a
> long compile path, because you first have to compile knet and then
> corosync. What may help you a bit is that we have a ubuntu 18.04 in our
> jenkins, so it should be possible corosync build log
> https://ci.kronosnet.org/view/corosync/job/corosync-build-al
> l-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lt
> s-x86-64/consoleText, knet build log https://ci.kronosnet.org/view/
> knet/job/knet-build-all-voting/lastBuild/knet-build-all-
> voting=ubuntu-18-04-lts-x86-64/consoleText).
>
> Also please consult http://people.redhat.com/ccaul
> fie/docs/KnetCorosync.pdf about changes in corosync configuration.
>
> Regards,
>   Honza
>
>
>>> Thank you for your quick response!
>>>
>>> 2018-08-23 8:40 GMT+02:00 Jan Friesse :
>>>
>>> David,

 Hello,

> Im getting crazy about this problem, that I expect to resolve here,
> with
> your help guys:
>
> I have 2 nodes with Corosync redundant ring feature.
>
> Each node has 2 similarly connected/configured NIC's. Both nodes are
> connected each other by two crossover cables.
>
>
 I believe this is root of the problem. Are you using NetworkManager? If
 so, have you installed NetworkManager-config-server? If not, please
 install
 it and test again.


 I configured both nodes with rrp mode passive. Everything is working
> well
> at this point, but when I shutdown 1 node to test failover, and this
> node > returns to be online, corosync is marking the interface as
> FAULTY
> and rrp
>
>
 I believe it's because with crossover cables configuration when other
 side is shutdown, NetworkManager detects it and does ifdown of the
 interface. And corosync is unable to handle ifdown properly. Ifdown is
 bad
 with single ring, but it's just killer with RRP (127.0.0.1 poisons every
 node in the cluster).

 fails to recover the initial state:

>
> 1. Initial scenario:
>
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
>   id  = 192.168.0.1
>   status  = ring 0 active with no faults
> RING ID 1
>   id  = 192.168.1.1
>   status  = ring 1 active with no faults
>
>
> 2. When I shutdown the node 2, all continues with no faults. Sometimes
> the
> ring ID's are bonding with 127.0.0.1 and then bond back to their
> respective
> heartbeat IP.
>
>
 Again, result of ifdown.


 3. When node 2 is back online:
>
> # corosync-cfgtool -s

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-24 Thread Ken Gaillot
On Fri, 2018-08-24 at 08:21 +0200, Jan Friesse wrote:
> > I tried to install corosync 3.x and it works pretty well.
> 
> Cool
> 
> > But when I install pacemaker, it installs previous version of
> > corosync as
> > dependency and breaks all the setup.
> > Any suggestions?
> 
> I can see at least the following "solutions":
> - make a proper Debian package
> - install corosync 3 to /usr/local
> - (ugly) install the packaged corosync and then overwrite it with
> corosync 3 built from source
> 
> Regards,
>    Honza

If you're compiling corosync 3, you may want to consider compiling
pacemaker 2.0.0 as well (or even pacemaker master branch, which has
extra bug fixes and should be stable at the moment).

If you're not familiar with checkinstall, it's a simple way to build
.deb packages from any "make install", so you only have to compile on
one host.
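A rough sketch of that workflow for pacemaker (directory layout, version string
and checkinstall options below are illustrative, not taken from this thread):

cd pacemaker/
./autogen.sh && ./configure && make
sudo checkinstall --pkgname=pacemaker --pkgversion=2.0.0 --default make install
# copy the generated pacemaker_2.0.0-1_amd64.deb to the other node and dpkg -i it

The same pattern works for the knet and corosync builds mentioned elsewhere in
the thread.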

You could also get in touch with the Debian HA team
( https://wiki.debian.org/Debian-HA ) to see what their plans are for the new
versions and/or get tips on building.

> > 
> > 2018-08-23 9:32 GMT+02:00 Jan Friesse :
> > 
> > > David,
> > > 
> > > BTW, where I can download Corosync 3.x?
> > > > I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github
> > > > .io/coro
> > > > sync/
> > > > 
> > > 
> > > Yes, that's Alpha 4 of Corosync 3.
> > > 
> > > 
> > > 
> > > 
> > > > 2018-08-23 9:11 GMT+02:00 David Tolosa 
> > > > :
> > > > 
> > > > I'm currently using an Ubuntu 18.04 server configuration with
> > > > netplan.
> > > > > 
> > > > > Here you have my current YAML configuration:
> > > > > 
> > > > > # This file describes the network interfaces available on
> > > > > your system
> > > > > # For more information, see netplan(5).
> > > > > network:
> > > > > version: 2
> > > > > renderer: networkd
> > > > > ethernets:
> > > > >   eno1:
> > > > > addresses: [192.168.0.1/24]
> > > > >   enp4s0f0:
> > > > > addresses: [192.168.1.1/24]
> > > > >   enp5s0f0:
> > > > > {}
> > > > > vlans:
> > > > >   vlan.XXX:
> > > > > id: XXX
> > > > > link: enp5s0f0
> > > > > addresses: [ 10.1.128.5/29 ]
> > > > > gateway4: 10.1.128.1
> > > > > nameservers:
> > > > >   addresses: [ 8.8.8.8, 8.8.4.4 ]
> > > > >   search: [ foo.com, bar.com ]
> > > > >   vlan.YYY:
> > > > > id: YYY
> > > > > link: enp5s0f0
> > > > > addresses: [ 10.1.128.5/29 ]
> > > > > 
> > > > > 
> > > > > So, eno1 and enp4s0f0 are the two ethernet ports connected
> > > > > each other
> > > > > with crossover cables to node2.
> > > > > enp5s0f0 port is used to connect outside/services using vlans
> > > > > defined in
> > > > > the same file.
> > > > > 
> > > > > In short, I'm using systemd-networkd default Ubuntu 18 server
> > > > > service for
> > > > > 
> > > 
> > > Ok, so systemd-networkd is really doing ifdown and somebody
> > > actually tries
> > > fix it and merge into upstream (sadly with not too much luck :( )
> > > 
> > > https://github.com/systemd/systemd/pull/7403
> > > 
> > > 
> > > manage networks. Im not detecting any NetworkManager-config-
> > > server
> > > > > package in my repository neither.
> > > > > 
> > > 
> > > I'm not sure how it's called in Debian based distributions, but
> > > it's just
> > > one small file in /etc, so you can extract it from RPM.
> > > 
> > > So the only solution that I have left, I suppose, is to test
> > > corosync 3.x
> > > > > and see if it works better handling RRP.
> > > > > 
> > > 
> > > You may also reconsider to try ether completely static network
> > > configuration or NetworkManager + NetworkManager-config-server.
> > > 
> > > 
> > > Corosync 3.x with knet will work for sure, but be prepared for
> > > quite a
> > > long compile path, because you first have to compile knet and
> > > then
> > > corosync. What may help you a bit is that we have a ubuntu 18.04
> > > in our
> > > jenkins, so it should be possible corosync build log
> > > https://ci.kronosnet.org/view/corosync/job/corosync-build-al
> > > l-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lt
> > > s-x86-64/consoleText, knet build log https://ci.kronosnet.org/vie
> > > w/
> > > knet/job/knet-build-all-voting/lastBuild/knet-build-all-
> > > voting=ubuntu-18-04-lts-x86-64/consoleText).
> > > 
> > > Also please consult http://people.redhat.com/ccaul
> > > fie/docs/KnetCorosync.pdf about changes in corosync
> > > configuration.
> > > 
> > > Regards,
> > >    Honza
> > > 
> > > 
> > > > > Thank you for your quick response!
> > > > > 
> > > > > 2018-08-23 8:40 GMT+02:00 Jan Friesse :
> > > > > 
> > > > > David,
> > > > > > 
> > > > > > Hello,
> > > > > > 
> > > > > > > Im getting crazy about this problem, that I expect to
> > > > > > > resolve here,
> > > > > > > with
> > > > > > > your help guys:
> > > > > > > 
> > > > > > > I have 2 nodes with Corosync redundant ring feature.
> > > > > > > 
> > > > > > > Each node has 2 similarly connected/configured NIC's.
> > > > > > > Both nodes 

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-23 Thread David Tolosa
I'm currently using an Ubuntu 18.04 server configuration with netplan.

Here you have my current YAML configuration:

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
eno1:
  addresses: [192.168.0.1/24]
enp4s0f0:
  addresses: [192.168.1.1/24]
enp5s0f0:
  {}
  vlans:
vlan.XXX:
  id: XXX
  link: enp5s0f0
  addresses: [ 10.1.128.5/29 ]
  gateway4: 10.1.128.1
  nameservers:
addresses: [ 8.8.8.8, 8.8.4.4 ]
search: [ foo.com, bar.com ]
vlan.YYY:
  id: YYY
  link: enp5s0f0
  addresses: [ 10.1.128.5/29 ]


So, eno1 and enp4s0f0 are the two Ethernet ports connected to node2 with
crossover cables.
The enp5s0f0 port is used to reach outside services via the VLANs defined in
the same file.

In short, I'm using systemd-networkd, the default Ubuntu 18.04 server network
service, to manage the network. I'm not finding any NetworkManager-config-server
package in my repositories either.
So the only option left, I suppose, is to test corosync 3.x
and see whether it handles RRP better.

Thank you for your quick response!

2018-08-23 8:40 GMT+02:00 Jan Friesse :

> David,
>
> Hello,
>> Im getting crazy about this problem, that I expect to resolve here, with
>> your help guys:
>>
>> I have 2 nodes with Corosync redundant ring feature.
>>
>> Each node has 2 similarly connected/configured NIC's. Both nodes are
>> connected each other by two crossover cables.
>>
>
> I believe this is root of the problem. Are you using NetworkManager? If
> so, have you installed NetworkManager-config-server? If not, please install
> it and test again.
>
>
>> I configured both nodes with rrp mode passive. Everything is working well
>> at this point, but when I shutdown 1 node to test failover, and this node
>> > returns to be online, corosync is marking the interface as FAULTY and rrp
>>
>
> I believe it's because with crossover cables configuration when other side
> is shutdown, NetworkManager detects it and does ifdown of the interface.
> And corosync is unable to handle ifdown properly. Ifdown is bad with single
> ring, but it's just killer with RRP (127.0.0.1 poisons every node in the
> cluster).
>
> fails to recover the initial state:
>>
>> 1. Initial scenario:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>  id  = 192.168.0.1
>>  status  = ring 0 active with no faults
>> RING ID 1
>>  id  = 192.168.1.1
>>  status  = ring 1 active with no faults
>>
>>
>> 2. When I shutdown the node 2, all continues with no faults. Sometimes the
>> ring ID's are bonding with 127.0.0.1 and then bond back to their
>> respective
>> heartbeat IP.
>>
>
> Again, result of ifdown.
>
>
>> 3. When node 2 is back online:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>  id  = 192.168.0.1
>>  status  = ring 0 active with no faults
>> RING ID 1
>>  id  = 192.168.1.1
>>  status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>>
>> # service corosync status
>> ● corosync.service - Corosync Cluster Engine
>> Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
>> preset: enabled)
>> Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s
>> ago
>>   Docs: man:corosync
>> man:corosync.conf
>> man:corosync_overview
>>   Main PID: 1439 (corosync)
>>  Tasks: 2 (limit: 4915)
>> CGroup: /system.slice/corosync.service
>> └─1439 /usr/sbin/corosync -f
>>
>>
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
>> network interface [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>> [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
>> network interface [192.168.1.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>> [192.168.1.1] is now up.
>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
>> new membership (192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
>> 192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
>> new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
>> 192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
>> Marking ringid 1 interface 192.168.1.1 FAULTY
>> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1
>> interface
>> 192.168.1.1 FAULTY
>>
>>
>> If I execute corosync-cfgtool, clears the faulty error but after some
>> seconds return to be FAULTY.
>> 

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-23 Thread David Tolosa
BTW, where can I download Corosync 3.x?
I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/

2018-08-23 9:11 GMT+02:00 David Tolosa :

> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>
> Here you have my current YAML configuration:
>
> # This file describes the network interfaces available on your system
> # For more information, see netplan(5).
> network:
>   version: 2
>   renderer: networkd
>   ethernets:
> eno1:
>   addresses: [192.168.0.1/24]
> enp4s0f0:
>   addresses: [192.168.1.1/24]
> enp5s0f0:
>   {}
>   vlans:
> vlan.XXX:
>   id: XXX
>   link: enp5s0f0
>   addresses: [ 10.1.128.5/29 ]
>   gateway4: 10.1.128.1
>   nameservers:
> addresses: [ 8.8.8.8, 8.8.4.4 ]
> search: [ foo.com, bar.com ]
> vlan.YYY:
>   id: YYY
>   link: enp5s0f0
>   addresses: [ 10.1.128.5/29 ]
>
>
> So, eno1 and enp4s0f0 are the two ethernet ports connected each other
> with crossover cables to node2.
> enp5s0f0 port is used to connect outside/services using vlans defined in
> the same file.
>
> In short, I'm using systemd-networkd default Ubuntu 18 server service for
> manage networks. Im not detecting any NetworkManager-config-server
> package in my repository neither.
> So the only solution that I have left, I suppose, is to test corosync 3.x
> and see if it works better handling RRP.
>
> Thank you for your quick response!
>
> 2018-08-23 8:40 GMT+02:00 Jan Friesse :
>
>> David,
>>
>> Hello,
>>> Im getting crazy about this problem, that I expect to resolve here, with
>>> your help guys:
>>>
>>> I have 2 nodes with Corosync redundant ring feature.
>>>
>>> Each node has 2 similarly connected/configured NIC's. Both nodes are
>>> connected each other by two crossover cables.
>>>
>>
>> I believe this is root of the problem. Are you using NetworkManager? If
>> so, have you installed NetworkManager-config-server? If not, please install
>> it and test again.
>>
>>
>>> I configured both nodes with rrp mode passive. Everything is working well
>>> at this point, but when I shutdown 1 node to test failover, and this
>>> node > returns to be online, corosync is marking the interface as FAULTY
>>> and rrp
>>>
>>
>> I believe it's because with crossover cables configuration when other
>> side is shutdown, NetworkManager detects it and does ifdown of the
>> interface. And corosync is unable to handle ifdown properly. Ifdown is bad
>> with single ring, but it's just killer with RRP (127.0.0.1 poisons every
>> node in the cluster).
>>
>> fails to recover the initial state:
>>>
>>> 1. Initial scenario:
>>>
>>> # corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>>  id  = 192.168.0.1
>>>  status  = ring 0 active with no faults
>>> RING ID 1
>>>  id  = 192.168.1.1
>>>  status  = ring 1 active with no faults
>>>
>>>
>>> 2. When I shutdown the node 2, all continues with no faults. Sometimes
>>> the
>>> ring ID's are bonding with 127.0.0.1 and then bond back to their
>>> respective
>>> heartbeat IP.
>>>
>>
>> Again, result of ifdown.
>>
>>
>>> 3. When node 2 is back online:
>>>
>>> # corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>>  id  = 192.168.0.1
>>>  status  = ring 0 active with no faults
>>> RING ID 1
>>>  id  = 192.168.1.1
>>>  status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>>
>>>
>>> # service corosync status
>>> ● corosync.service - Corosync Cluster Engine
>>> Loaded: loaded (/lib/systemd/system/corosync.service; enabled;
>>> vendor
>>> preset: enabled)
>>> Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min
>>> 38s ago
>>>   Docs: man:corosync
>>> man:corosync.conf
>>> man:corosync_overview
>>>   Main PID: 1439 (corosync)
>>>  Tasks: 2 (limit: 4915)
>>> CGroup: /system.slice/corosync.service
>>> └─1439 /usr/sbin/corosync -f
>>>
>>>
>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
>>> The
>>> network interface [192.168.0.1] is now up.
>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>>> [192.168.0.1] is now up.
>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
>>> The
>>> network interface [192.168.1.1] is now up.
>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>>> [192.168.1.1] is now up.
>>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
>>> new membership (192.168.0.1:601760) was formed. Members
>>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
>>> 192.168.0.1:601760) was formed. Members
>>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
>>> new membership (192.168.0.1:601764) was formed. Members joined: 2
>>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
>>> 192.168.0.1:601764) was 

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-23 Thread Jan Friesse

David,


BTW, where I can download Corosync 3.x?
I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/


Yes, that's Alpha 4 of Corosync 3.




2018-08-23 9:11 GMT+02:00 David Tolosa :


I'm currently using an Ubuntu 18.04 server configuration with netplan.

Here you have my current YAML configuration:

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [192.168.0.1/24]
    enp4s0f0:
      addresses: [192.168.1.1/24]
    enp5s0f0:
      {}
  vlans:
    vlan.XXX:
      id: XXX
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]
      gateway4: 10.1.128.1
      nameservers:
        addresses: [ 8.8.8.8, 8.8.4.4 ]
        search: [ foo.com, bar.com ]
    vlan.YYY:
      id: YYY
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]


So, eno1 and enp4s0f0 are the two ethernet ports connected each other
with crossover cables to node2.
enp5s0f0 port is used to connect outside/services using vlans defined in
the same file.

In short, I'm using systemd-networkd default Ubuntu 18 server service for


OK, so systemd-networkd really is doing an ifdown, and somebody has actually
tried to fix it and merge it upstream (sadly without much luck so far :( )


https://github.com/systemd/systemd/pull/7403



manage networks. Im not detecting any NetworkManager-config-server
package in my repository neither.


I'm not sure what it's called in Debian-based distributions, but it's
just one small file in /etc, so you can extract it from the RPM.
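For reference, the file that package ships is tiny; on RHEL/Fedora it is
essentially the following (exact path and contents may vary between versions,
so check the actual package):

# /usr/lib/NetworkManager/conf.d/00-server.conf
[main]
no-auto-default=*
ignore-carrier=*

The ignore-carrier setting is the part relevant here: it stops NetworkManager
from tearing down the address configuration when the link loses carrier
(e.g. when the peer on a crossover cable is powered off).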



So the only solution that I have left, I suppose, is to test corosync 3.x
and see if it works better handling RRP.


You may also reconsider trying either a completely static network
configuration or NetworkManager + NetworkManager-config-server.



Corosync 3.x with knet will work for sure, but be prepared for quite a
long compile path, because you first have to compile knet and then
corosync. What may help you a bit is that we have an Ubuntu 18.04 machine in
our Jenkins, so it should be doable (corosync build log:
https://ci.kronosnet.org/view/corosync/job/corosync-build-all-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText,
knet build log:
https://ci.kronosnet.org/view/knet/job/knet-build-all-voting/lastBuild/knet-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText).


Also please consult 
http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf about changes in 
corosync configuration.
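For orientation, a condensed sketch of that compile path (the repository URLs
are the upstream GitHub projects; the dependency list is a guess, the
authoritative one is in the linked console logs):

sudo apt install build-essential git autoconf automake libtool pkg-config \
    libqb-dev zlib1g-dev libssl-dev
git clone https://github.com/kronosnet/kronosnet.git
cd kronosnet && ./autogen.sh && ./configure && make && sudo make install && cd ..
git clone https://github.com/corosync/corosync.git
cd corosync && ./autogen.sh && ./configure && make && sudo make install
# corosync's configure may need PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
# to find the freshly installed libknet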


Regards,
  Honza



Thank you for your quick response!

2018-08-23 8:40 GMT+02:00 Jan Friesse :


David,

Hello,

Im getting crazy about this problem, that I expect to resolve here, with
your help guys:

I have 2 nodes with Corosync redundant ring feature.

Each node has 2 similarly connected/configured NIC's. Both nodes are
connected each other by two crossover cables.



I believe this is root of the problem. Are you using NetworkManager? If
so, have you installed NetworkManager-config-server? If not, please install
it and test again.



I configured both nodes with rrp mode passive. Everything is working well
at this point, but when I shutdown 1 node to test failover, and this
node > returns to be online, corosync is marking the interface as FAULTY
and rrp



I believe it's because with crossover cables configuration when other
side is shutdown, NetworkManager detects it and does ifdown of the
interface. And corosync is unable to handle ifdown properly. Ifdown is bad
with single ring, but it's just killer with RRP (127.0.0.1 poisons every
node in the cluster).

fails to recover the initial state:


1. Initial scenario:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
  id  = 192.168.0.1
  status  = ring 0 active with no faults
RING ID 1
  id  = 192.168.1.1
  status  = ring 1 active with no faults


2. When I shutdown the node 2, all continues with no faults. Sometimes
the
ring ID's are bonding with 127.0.0.1 and then bond back to their
respective
heartbeat IP.



Again, result of ifdown.



3. When node 2 is back online:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
  id  = 192.168.0.1
  status  = ring 0 active with no faults
RING ID 1
  id  = 192.168.1.1
  status  = Marking ringid 1 interface 192.168.1.1 FAULTY


# service corosync status
● corosync.service - Corosync Cluster Engine
 Loaded: loaded (/lib/systemd/system/corosync.service; enabled;
vendor
preset: enabled)
 Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min
38s ago
   Docs: man:corosync
 man:corosync.conf
 man:corosync_overview
   Main PID: 1439 (corosync)
  Tasks: 2 (limit: 4915)
 CGroup: /system.slice/corosync.service
 └─1439 /usr/sbin/corosync -f


Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
The
network 

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-23 Thread Jan Friesse

David,


Hello,
Im getting crazy about this problem, that I expect to resolve here, with
your help guys:

I have 2 nodes with Corosync redundant ring feature.

Each node has 2 similarly connected/configured NIC's. Both nodes are
connected each other by two crossover cables.


I believe this is the root of the problem. Are you using NetworkManager? If
so, have you installed NetworkManager-config-server? If not, please
install it and test again.




I configured both nodes with rrp mode passive. Everything is working well
at this point, but when I shutdown 1 node to test failover, and this node
returns to be online, corosync is marking the interface as FAULTY and rrp


I believe it's because, with a crossover-cable configuration, when the other
side is shut down, NetworkManager detects it and does an ifdown of the
interface. And corosync is unable to handle ifdown properly. Ifdown is
bad with a single ring, but it's a killer with RRP (127.0.0.1 poisons
every node in the cluster).
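(Side note for readers: the FAULTY state itself can normally be cleared
cluster-wide with

corosync-cfgtool -r

which is presumably what the corosync-cfgtool invocation mentioned below refers
to; as described, the ring is simply marked FAULTY again a few seconds later
while the underlying ifdown problem persists.)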



fails to recover the initial state:

1. Initial scenario:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
 id  = 192.168.0.1
 status  = ring 0 active with no faults
RING ID 1
 id  = 192.168.1.1
 status  = ring 1 active with no faults


2. When I shutdown the node 2, all continues with no faults. Sometimes the
ring ID's are bonding with 127.0.0.1 and then bond back to their respective
heartbeat IP.


Again, result of ifdown.



3. When node 2 is back online:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
 id  = 192.168.0.1
 status  = ring 0 active with no faults
RING ID 1
 id  = 192.168.1.1
 status  = Marking ringid 1 interface 192.168.1.1 FAULTY


# service corosync status
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
preset: enabled)
Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
  Docs: man:corosync
man:corosync.conf
man:corosync_overview
  Main PID: 1439 (corosync)
 Tasks: 2 (limit: 4915)
CGroup: /system.slice/corosync.service
└─1439 /usr/sbin/corosync -f


Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.1.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.1.1] is now up.
Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
new membership (192.168.0.1:601760) was formed. Members
Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601760) was formed. Members
Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
new membership (192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
Marking ringid 1 interface 192.168.1.1 FAULTY
Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface
192.168.1.1 FAULTY


If I execute corosync-cfgtool, clears the faulty error but after some
seconds return to be FAULTY.
The only thing that it resolves the problem is to restart de service with
service corosync restart.

Here you have some of my configuration settings on node 1 (I probed already
to change rrp_mode):

*- corosync.conf*

totem {
    version: 2
    cluster_name: node
    token: 5000
    token_retransmits_before_loss_const: 10
    secauth: off
    threads: 0
    rrp_mode: passive
    nodeid: 1
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.0.0
        #mcastaddr: 226.94.1.1
        mcastport: 5405
        broadcast: yes
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.1.0
        #mcastaddr: 226.94.1.2
        mcastport: 5407
        broadcast: yes
    }
}

logging {
    fileline: off
    to_stderr: yes
    to_syslog: yes
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

quorum {
    provider: corosync_votequorum
    expected_votes: 2
}

nodelist {
    node {
        nodeid: 1
        ring0_addr: 192.168.0.1
        ring1_addr: 192.168.1.1
    }

    node {
        nodeid: 2
        ring0_addr: 192.168.0.2
 

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-22 Thread Andrei Borzenkov
22.08.2018 15:53, David Tolosa writes:
> Hello,
> Im getting crazy about this problem, that I expect to resolve here, with
> your help guys:
> 
> I have 2 nodes with Corosync redundant ring feature.
> 
> Each node has 2 similarly connected/configured NIC's. Both nodes are
> connected each other by two crossover cables.
> 
> I configured both nodes with rrp mode passive. Everything is working well
> at this point, but when I shutdown 1 node to test failover, and this node
> returns to be online, corosync is marking the interface as FAULTY and rrp
> fails to recover the initial state:
> 
> 1. Initial scenario:
> 
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id  = 192.168.0.1
> status  = ring 0 active with no faults
> RING ID 1
> id  = 192.168.1.1
> status  = ring 1 active with no faults
> 
> 
> 2. When I shutdown the node 2, all continues with no faults. Sometimes the
> ring ID's are bonding with 127.0.0.1 and then bond back to their respective
> heartbeat IP.
> 
> 3. When node 2 is back online:
> 
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id  = 192.168.0.1
> status  = ring 0 active with no faults
> RING ID 1
> id  = 192.168.1.1
> status  = Marking ringid 1 interface 192.168.1.1 FAULTY
> 
> 
> # service corosync status
> ● corosync.service - Corosync Cluster Engine
>Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
> preset: enabled)
>Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>  Docs: man:corosync
>man:corosync.conf
>man:corosync_overview
>  Main PID: 1439 (corosync)
> Tasks: 2 (limit: 4915)
>CGroup: /system.slice/corosync.service
>└─1439 /usr/sbin/corosync -f
> 
> 
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.1.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.1.1] is now up.
> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
> new membership (192.168.0.1:601760) was formed. Members
> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601760) was formed. Members
> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
> new membership (192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
> Marking ringid 1 interface 192.168.1.1 FAULTY
> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface
> 192.168.1.1 FAULTY
> 
> 
> If I execute corosync-cfgtool, clears the faulty error but after some
> seconds return to be FAULTY.
> The only thing that it resolves the problem is to restart de service with
> service corosync restart.
> 
> Here you have some of my configuration settings on node 1 (I probed already
> to change rrp_mode):
> 
> *- corosync.conf*
> 
> totem {
> version: 2
> cluster_name: node
> token: 5000
> token_retransmits_before_loss_const: 10
> secauth: off
> threads: 0
> rrp_mode: passive
> nodeid: 1
> interface {
> ringnumber: 0
> bindnetaddr: 192.168.0.0
> #mcastaddr: 226.94.1.1
> mcastport: 5405
> broadcast: yes
> }
> interface {
> ringnumber: 1
> bindnetaddr: 192.168.1.0
> #mcastaddr: 226.94.1.2
> mcastport: 5407
> broadcast: yes
> }
> }
> 
> logging {
> fileline: off
> to_stderr: yes
> to_syslog: yes
> to_logfile: yes
> logfile: /var/log/corosync/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
> 
> amf {
> mode: disabled
> }
> 
> quorum {
> provider: corosync_votequorum
> expected_votes: 2
> }
> 
> nodelist {
> node {
> nodeid: 1
> ring0_addr: 192.168.0.1
> ring1_addr: 192.168.1.1
> }
> 
> node {
> nodeid: 2
> ring0_addr: 192.168.0.2
> ring1_addr: 192.168.1.2
> }
> }
> 

My understanding so far was that nodelist is used with udpu transport
only. You may try without nodelist or with transport: udpu to see if it
makes a difference.
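A sketch of what that would look like in corosync.conf (values carried over
from the configuration quoted above; whether it actually helps is exactly what
this suggestion proposes to test):

totem {
        version: 2
        cluster_name: node
        rrp_mode: passive
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.0.0
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.1.0
                mcastport: 5407
        }
}
# keep the existing nodelist { } block; with udpu the ring0_addr/ring1_addr
# entries define the members, and broadcast/mcastaddr are not used.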

> aisexec {
> user: root
> group: root
> }

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-22 Thread Emmanuel Gelati
Sorry, a typo:

I think you are mixing interface with nodelist:
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_sample_corosync_configuration.html

2018-08-22 22:20 GMT+02:00 Emmanuel Gelati :

> I think you are missing interface with nodelist http://clusterlabs.org/
> pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_
> Scratch/_sample_corosync_configuration.html
>
> 2018-08-22 14:53 GMT+02:00 David Tolosa :
>
>> Hello,
>> Im getting crazy about this problem, that I expect to resolve here, with
>> your help guys:
>>
>> I have 2 nodes with Corosync redundant ring feature.
>>
>> Each node has 2 similarly connected/configured NIC's. Both nodes are
>> connected each other by two crossover cables.
>>
>> I configured both nodes with rrp mode passive. Everything is working well
>> at this point, but when I shutdown 1 node to test failover, and this node
>> returns to be online, corosync is marking the interface as FAULTY and rrp
>> fails to recover the initial state:
>>
>> 1. Initial scenario:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>> id  = 192.168.0.1
>> status  = ring 0 active with no faults
>> RING ID 1
>> id  = 192.168.1.1
>> status  = ring 1 active with no faults
>>
>>
>> 2. When I shutdown the node 2, all continues with no faults. Sometimes
>> the ring ID's are bonding with 127.0.0.1 and then bond back to their
>> respective heartbeat IP.
>>
>> 3. When node 2 is back online:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>> id  = 192.168.0.1
>> status  = ring 0 active with no faults
>> RING ID 1
>> id  = 192.168.1.1
>> status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>>
>> # service corosync status
>> ● corosync.service - Corosync Cluster Engine
>>Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
>> preset: enabled)
>>Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s
>> ago
>>  Docs: man:corosync
>>man:corosync.conf
>>man:corosync_overview
>>  Main PID: 1439 (corosync)
>> Tasks: 2 (limit: 4915)
>>CGroup: /system.slice/corosync.service
>>└─1439 /usr/sbin/corosync -f
>>
>>
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
>> The network interface [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>> [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
>> The network interface [192.168.1.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>> [192.168.1.1] is now up.
>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
>> new membership (192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
>> 192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
>> new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
>> 192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
>> Marking ringid 1 interface 192.168.1.1 FAULTY
>> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1
>> interface 192.168.1.1 FAULTY
>>
>>
>> If I execute corosync-cfgtool, clears the faulty error but after some
>> seconds return to be FAULTY.
>> The only thing that it resolves the problem is to restart de service with
>> service corosync restart.
>>
>> Here you have some of my configuration settings on node 1 (I probed
>> already to change rrp_mode):
>>
>> *- corosync.conf*
>>
>> totem {
>> version: 2
>> cluster_name: node
>> token: 5000
>> token_retransmits_before_loss_const: 10
>> secauth: off
>> threads: 0
>> rrp_mode: passive
>> nodeid: 1
>> interface {
>> ringnumber: 0
>> bindnetaddr: 192.168.0.0
>> #mcastaddr: 226.94.1.1
>> mcastport: 5405
>> broadcast: yes
>> }
>> interface {
>> ringnumber: 1
>> bindnetaddr: 192.168.1.0
>> #mcastaddr: 226.94.1.2
>> mcastport: 5407
>> broadcast: yes
>> }
>> }
>>
>> logging {
>> fileline: off
>> to_stderr: yes
>> to_syslog: yes
>> to_logfile: yes
>> logfile: /var/log/corosync/corosync.log
>> debug: off
>> timestamp: on
>> logger_subsys {
>> subsys: AMF
>> debug: off
>> }
>> }
>>
>> amf {
>> mode: disabled
>> }
>>
>> quorum {
>> provider: corosync_votequorum
>> expected_votes: 2

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-22 Thread Emmanuel Gelati
I think you are missing interface with nodelist
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_sample_corosync_configuration.html

2018-08-22 14:53 GMT+02:00 David Tolosa :

> Hello,
> Im getting crazy about this problem, that I expect to resolve here, with
> your help guys:
>
> I have 2 nodes with Corosync redundant ring feature.
>
> Each node has 2 similarly connected/configured NIC's. Both nodes are
> connected each other by two crossover cables.
>
> I configured both nodes with rrp mode passive. Everything is working well
> at this point, but when I shutdown 1 node to test failover, and this node
> returns to be online, corosync is marking the interface as FAULTY and rrp
> fails to recover the initial state:
>
> 1. Initial scenario:
>
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id  = 192.168.0.1
> status  = ring 0 active with no faults
> RING ID 1
> id  = 192.168.1.1
> status  = ring 1 active with no faults
>
>
> 2. When I shutdown the node 2, all continues with no faults. Sometimes the
> ring ID's are bonding with 127.0.0.1 and then bond back to their respective
> heartbeat IP.
>
> 3. When node 2 is back online:
>
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id  = 192.168.0.1
> status  = ring 0 active with no faults
> RING ID 1
> id  = 192.168.1.1
> status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>
>
> # service corosync status
> ● corosync.service - Corosync Cluster Engine
>Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
> preset: enabled)
>Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s
> ago
>  Docs: man:corosync
>man:corosync.conf
>man:corosync_overview
>  Main PID: 1439 (corosync)
> Tasks: 2 (limit: 4915)
>CGroup: /system.slice/corosync.service
>└─1439 /usr/sbin/corosync -f
>
>
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.1.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.1.1] is now up.
> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
> new membership (192.168.0.1:601760) was formed. Members
> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601760) was formed. Members
> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
> new membership (192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
> Marking ringid 1 interface 192.168.1.1 FAULTY
> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1
> interface 192.168.1.1 FAULTY
>
>
> If I execute corosync-cfgtool, clears the faulty error but after some
> seconds return to be FAULTY.
> The only thing that it resolves the problem is to restart de service with
> service corosync restart.
>
> Here you have some of my configuration settings on node 1 (I probed
> already to change rrp_mode):
>
> *- corosync.conf*
>
> totem {
> version: 2
> cluster_name: node
> token: 5000
> token_retransmits_before_loss_const: 10
> secauth: off
> threads: 0
> rrp_mode: passive
> nodeid: 1
> interface {
> ringnumber: 0
> bindnetaddr: 192.168.0.0
> #mcastaddr: 226.94.1.1
> mcastport: 5405
> broadcast: yes
> }
> interface {
> ringnumber: 1
> bindnetaddr: 192.168.1.0
> #mcastaddr: 226.94.1.2
> mcastport: 5407
> broadcast: yes
> }
> }
>
> logging {
> fileline: off
> to_stderr: yes
> to_syslog: yes
> to_logfile: yes
> logfile: /var/log/corosync/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
>
> amf {
> mode: disabled
> }
>
> quorum {
> provider: corosync_votequorum
> expected_votes: 2
> }
>
> nodelist {
> node {
> nodeid: 1
> ring0_addr: 192.168.0.1
> ring1_addr: 192.168.1.1
> }
>
> node {
> nodeid: 2
> ring0_addr: 192.168.0.2
> ring1_addr: 192.168.1.2
> }
> }
>
> aisexec {
> user: root
> group: root
> }
>
> service