Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-23 Thread Jan Friesse

David,


BTW, where can I download Corosync 3.x?
I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/


Yes, that's Alpha 4 of Corosync 3.




2018-08-23 9:11 GMT+02:00 David Tolosa :


I'm currently using an Ubuntu 18.04 server configuration with netplan.

Here you have my current YAML configuration:

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [192.168.0.1/24]
    enp4s0f0:
      addresses: [192.168.1.1/24]
    enp5s0f0:
      {}
  vlans:
    vlan.XXX:
      id: XXX
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]
      gateway4: 10.1.128.1
      nameservers:
        addresses: [ 8.8.8.8, 8.8.4.4 ]
        search: [ foo.com, bar.com ]
    vlan.YYY:
      id: YYY
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]


So, eno1 and enp4s0f0 are the two Ethernet ports connected directly to node2
with crossover cables.
The enp5s0f0 port is used to reach outside services via the VLANs defined in
the same file.

In short, I'm using systemd-networkd, the default Ubuntu 18.04 server network service, to


OK, so systemd-networkd really does an ifdown, and somebody has actually
tried to fix it and merge the fix upstream (sadly without much luck :( )


https://github.com/systemd/systemd/pull/7403



manage networks. I'm not detecting any NetworkManager-config-server
package in my repository either.


I'm not sure what it's called in Debian-based distributions, but it's
just one small file in /etc, so you can extract it from the RPM.
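
For reference, the package essentially ships a single drop-in telling NetworkManager
not to tear down connections on carrier loss. A minimal sketch of recreating it by
hand (file name and location are assumptions based on the RPM's layout):

# assumed equivalent of the NetworkManager-config-server drop-in
cat <<'EOF' > /etc/NetworkManager/conf.d/00-server.conf
[main]
# do not create default DHCP connections for unconfigured NICs
no-auto-default=*
# keep connections configured even when the link carrier goes away
ignore-carrier=*
EOF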



So the only solution I have left, I suppose, is to test Corosync 3.x
and see if it handles RRP better.


You may also reconsider trying either a completely static network
configuration or NetworkManager + NetworkManager-config-server.



Corosync 3.x with knet will work for sure, but be prepared for quite a
long compile path, because you first have to compile knet and then
corosync. What may help you a bit is that we have an Ubuntu 18.04 machine in our
Jenkins, so a working build is documented there (corosync build log:
https://ci.kronosnet.org/view/corosync/job/corosync-build-all-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText,
knet build log:
https://ci.kronosnet.org/view/knet/job/knet-build-all-voting/lastBuild/knet-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText).
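
A rough sketch of that compile path (repository URLs and install steps are
assumptions; both projects use autotools):

# knet first (assumed upstream repository)
git clone https://github.com/kronosnet/kronosnet.git
cd kronosnet && ./autogen.sh && ./configure && make && sudo make install && cd ..
sudo ldconfig

# then corosync 3, which links against the freshly installed libknet
git clone https://github.com/corosync/corosync.git
cd corosync && ./autogen.sh && ./configure && make && sudo make install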


Also please consult 
http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf about changes in 
corosync configuration.
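
For orientation only, a minimal sketch of what a two-node, two-link corosync.conf
could look like with knet (node1's addresses are taken from this thread, node2's
addresses and all other values are illustrative; the PDF above is the authoritative
reference):

cat <<'EOF' > /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: testcluster
    transport: knet
}

nodelist {
    node {
        name: node1
        nodeid: 1
        ring0_addr: 192.168.0.1
        ring1_addr: 192.168.1.1
    }
    node {
        name: node2
        nodeid: 2
        ring0_addr: 192.168.0.2
        ring1_addr: 192.168.1.2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}
EOF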


Regards,
  Honza



Thank you for your quick response!

2018-08-23 8:40 GMT+02:00 Jan Friesse :


David,

Hello,

I'm going crazy over this problem, which I hope to resolve here with
your help, guys:

I have 2 nodes using Corosync's redundant ring feature.

Each node has 2 similarly connected/configured NICs. The two nodes are
connected to each other by two crossover cables.



I believe this is the root of the problem. Are you using NetworkManager? If
so, have you installed NetworkManager-config-server? If not, please install
it and test again.



I configured both nodes with rrp_mode passive. Everything works well
at this point, but when I shut down 1 node to test failover, and this
node comes back online, corosync marks the interface as FAULTY
and rrp



I believe it's because, with the crossover-cable configuration, when the other
side is shut down NetworkManager detects it and does an ifdown of the
interface. And corosync is unable to handle ifdown properly. Ifdown is bad
with a single ring, but it's a real killer with RRP (127.0.0.1 poisons every
node in the cluster).

fails to recover the initial state:


1. Initial scenario:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
  id  = 192.168.0.1
  status  = ring 0 active with no faults
RING ID 1
  id  = 192.168.1.1
  status  = ring 1 active with no faults


2. When I shut down node 2, everything continues with no faults. Sometimes
the ring IDs bind to 127.0.0.1 and then bind back to their respective
heartbeat IPs.



Again, result of ifdown.



3. When node 2 is back online:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
  id  = 192.168.0.1
  status  = ring 0 active with no faults
RING ID 1
  id  = 192.168.1.1
  status  = Marking ringid 1 interface 192.168.1.1 FAULTY


# service corosync status
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 1439 (corosync)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/corosync.service
           └─1439 /usr/sbin/corosync -f


Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
The
network inte

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-23 Thread David Tolosa
I'm currently using an Ubuntu 18.04 server configuration with netplan.

Here you have my current YAML configuration:

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [192.168.0.1/24]
    enp4s0f0:
      addresses: [192.168.1.1/24]
    enp5s0f0:
      {}
  vlans:
    vlan.XXX:
      id: XXX
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]
      gateway4: 10.1.128.1
      nameservers:
        addresses: [ 8.8.8.8, 8.8.4.4 ]
        search: [ foo.com, bar.com ]
    vlan.YYY:
      id: YYY
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]


So, eno1 and enp4s0f0 are the two Ethernet ports connected directly to node2 with
crossover cables.
The enp5s0f0 port is used to reach outside services via the VLANs defined in
the same file.

In short, I'm using systemd-networkd, the default Ubuntu 18.04 server network
service, to manage networks. I'm not detecting any NetworkManager-config-server
package in my repository either.
So the only solution I have left, I suppose, is to test Corosync 3.x
and see if it handles RRP better.

Thank you for your quick response!

2018-08-23 8:40 GMT+02:00 Jan Friesse :

> David,
>
> Hello,
>> Im getting crazy about this problem, that I expect to resolve here, with
>> your help guys:
>>
>> I have 2 nodes with Corosync redundant ring feature.
>>
>> Each node has 2 similarly connected/configured NIC's. Both nodes are
>> connected each other by two crossover cables.
>>
>
> I believe this is the root of the problem. Are you using NetworkManager? If
> so, have you installed NetworkManager-config-server? If not, please install
> it and test again.
>
>
>> I configured both nodes with rrp mode passive. Everything is working well
>> at this point, but when I shutdown 1 node to test failover, and this node
>> > returns to be online, corosync is marking the interface as FAULTY and rrp
>>
>
> I believe it's because, with the crossover-cable configuration, when the other side
> is shut down NetworkManager detects it and does an ifdown of the interface.
> And corosync is unable to handle ifdown properly. Ifdown is bad with a single
> ring, but it's a real killer with RRP (127.0.0.1 poisons every node in the
> cluster).
>
> fails to recover the initial state:
>>
>> 1. Initial scenario:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>  id  = 192.168.0.1
>>  status  = ring 0 active with no faults
>> RING ID 1
>>  id  = 192.168.1.1
>>  status  = ring 1 active with no faults
>>
>>
>> 2. When I shutdown the node 2, all continues with no faults. Sometimes the
>> ring ID's are bonding with 127.0.0.1 and then bond back to their
>> respective
>> heartbeat IP.
>>
>
> Again, result of ifdown.
>
>
>> 3. When node 2 is back online:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>>  id  = 192.168.0.1
>>  status  = ring 0 active with no faults
>> RING ID 1
>>  id  = 192.168.1.1
>>  status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>>
>> # service corosync status
>> ● corosync.service - Corosync Cluster Engine
>> Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
>> preset: enabled)
>> Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s
>> ago
>>   Docs: man:corosync
>> man:corosync.conf
>> man:corosync_overview
>>   Main PID: 1439 (corosync)
>>  Tasks: 2 (limit: 4915)
>> CGroup: /system.slice/corosync.service
>> └─1439 /usr/sbin/corosync -f
>>
>>
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
>> network interface [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>> [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
>> network interface [192.168.1.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>> [192.168.1.1] is now up.
>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
>> new membership (192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
>> 192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
>> new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
>> 192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
>> Marking ringid 1 interface 192.168.1.1 FAULTY
>> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1
>> interface
>> 192.168.1.1 FAULTY
>>
>>
>> If I execute corosync-cfgtool, it clears the faulty error, but after a few
>> seconds it becomes FAULTY again.
>> 

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-23 Thread David Tolosa
BTW, where can I download Corosync 3.x?
I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/

2018-08-23 9:11 GMT+02:00 David Tolosa :

> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>
> Here you have my current YAML configuration:
>
> # This file describes the network interfaces available on your system
> # For more information, see netplan(5).
> network:
>   version: 2
>   renderer: networkd
>   ethernets:
>     eno1:
>       addresses: [192.168.0.1/24]
>     enp4s0f0:
>       addresses: [192.168.1.1/24]
>     enp5s0f0:
>       {}
>   vlans:
>     vlan.XXX:
>       id: XXX
>       link: enp5s0f0
>       addresses: [ 10.1.128.5/29 ]
>       gateway4: 10.1.128.1
>       nameservers:
>         addresses: [ 8.8.8.8, 8.8.4.4 ]
>         search: [ foo.com, bar.com ]
>     vlan.YYY:
>       id: YYY
>       link: enp5s0f0
>       addresses: [ 10.1.128.5/29 ]
>
>
> So, eno1 and enp4s0f0 are the two Ethernet ports connected directly to node2
> with crossover cables.
> The enp5s0f0 port is used to reach outside services via the VLANs defined in
> the same file.
>
> In short, I'm using systemd-networkd, the default Ubuntu 18.04 server network
> service, to manage networks. I'm not detecting any NetworkManager-config-server
> package in my repository either.
> So the only solution I have left, I suppose, is to test Corosync 3.x
> and see if it handles RRP better.
>
> Thank you for your quick response!
>
> 2018-08-23 8:40 GMT+02:00 Jan Friesse :
>
>> David,
>>
>> Hello,
>>> Im getting crazy about this problem, that I expect to resolve here, with
>>> your help guys:
>>>
>>> I have 2 nodes with Corosync redundant ring feature.
>>>
>>> Each node has 2 similarly connected/configured NIC's. Both nodes are
>>> connected each other by two crossover cables.
>>>
>>
>> I believe this is root of the problem. Are you using NetworkManager? If
>> so, have you installed NetworkManager-config-server? If not, please install
>> it and test again.
>>
>>
>>> I configured both nodes with rrp mode passive. Everything is working well
>>> at this point, but when I shutdown 1 node to test failover, and this
>>> node > returns to be online, corosync is marking the interface as FAULTY
>>> and rrp
>>>
>>
>> I believe it's because with crossover cables configuration when other
>> side is shutdown, NetworkManager detects it and does ifdown of the
>> interface. And corosync is unable to handle ifdown properly. Ifdown is bad
>> with single ring, but it's just killer with RRP (127.0.0.1 poisons every
>> node in the cluster).
>>
>> fails to recover the initial state:
>>>
>>> 1. Initial scenario:
>>>
>>> # corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>>  id  = 192.168.0.1
>>>  status  = ring 0 active with no faults
>>> RING ID 1
>>>  id  = 192.168.1.1
>>>  status  = ring 1 active with no faults
>>>
>>>
>>> 2. When I shutdown the node 2, all continues with no faults. Sometimes
>>> the
>>> ring ID's are bonding with 127.0.0.1 and then bond back to their
>>> respective
>>> heartbeat IP.
>>>
>>
>> Again, result of ifdown.
>>
>>
>>> 3. When node 2 is back online:
>>>
>>> # corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>>  id  = 192.168.0.1
>>>  status  = ring 0 active with no faults
>>> RING ID 1
>>>  id  = 192.168.1.1
>>>  status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>>
>>>
>>> # service corosync status
>>> ● corosync.service - Corosync Cluster Engine
>>> Loaded: loaded (/lib/systemd/system/corosync.service; enabled;
>>> vendor
>>> preset: enabled)
>>> Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min
>>> 38s ago
>>>   Docs: man:corosync
>>> man:corosync.conf
>>> man:corosync_overview
>>>   Main PID: 1439 (corosync)
>>>  Tasks: 2 (limit: 4915)
>>> CGroup: /system.slice/corosync.service
>>> └─1439 /usr/sbin/corosync -f
>>>
>>>
>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
>>> The
>>> network interface [192.168.0.1] is now up.
>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>>> [192.168.0.1] is now up.
>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
>>> The
>>> network interface [192.168.1.1] is now up.
>>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>>> [192.168.1.1] is now up.
>>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
>>> new membership (192.168.0.1:601760) was formed. Members
>>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
>>> 192.168.0.1:601760) was formed. Members
>>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
>>> new membership (192.168.0.1:601764) was formed. Members joined: 2
>>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
>>> 192.168.0.1:601764) was formed.

[ClusterLabs] Q: automatically remove expired location constraints

2018-08-23 Thread Ulrich Windl
Hi!

I have a non-trivial question: How can I remove expired manual migration 
requests, like the following?:
location cli-standby-rsc rsc rule -inf: #uname eq host and date lt "2013-06-12 13:47:26Z"

One problem is that the date value is not a constant, and it has to be compared
against the current date & time.

Regards,
Ulrich


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Re: Q: ordering for a monitoring op only?

2018-08-23 Thread Ulrich Windl
>>> Ryan Thomas wrote on 21.08.2018 at 17:38 in message:
> You could accomplish this by creating a custom RA which normally acts as a
> pass-through and calls the "real" RA.  However, it intercepts "monitor"
> actions, checks nfs, and if nfs is down it returns success; otherwise it
> passes the monitor action through to the real RA.  If nfs fails while the monitor
> action is in-flight, the custom RA can intercept the failure, check if
> nfs is down, and if so change the failure to a success.

Hi!

This sounds like an interesting approach, but I wonder how to avoid a 
monitoring timeout: I.e. what value to return when NFS is down? I'm missing a 
return value like 
CANNOT_CHECK_AT_THE_MOMENT_SO_PLEASE_ASSUME_RESOURCE_STILL_HAS_ITS_LAST_STATE 
;-)

Unless I can return such a value, the wrapper RA will have to wait (possibly 
causing a timeout). OK, the wrapper RA could cache its last return value and 
reuse that when NFS is down.
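
A minimal sketch of such a wrapper (every path, the cache file and the NFS mount
point are assumptions, and a real OCF agent would also need meta-data/validate
actions):

#!/bin/sh
# Hypothetical wrapper RA: pass every action through to the real agent, but let
# "monitor" fall back to the last known result while NFS is unreachable.
REAL_RA=/usr/lib/ocf/resource.d/vendor/real-agent   # assumed path to the real RA
CACHE=/run/real-agent.last-monitor-rc               # assumed cache location
NFS_DIR=/mnt/nfs                                    # assumed NFS mount point
ACTION="$1"

nfs_ok() {
    # stat on a hung NFS mount would block, so guard it with a short timeout
    timeout 5 stat -t "$NFS_DIR" >/dev/null 2>&1
}

if [ "$ACTION" = "monitor" ] && ! nfs_ok; then
    # NFS is down: report the cached monitor result instead of hanging
    exit "$(cat "$CACHE" 2>/dev/null || echo 0)"
fi

"$REAL_RA" "$ACTION"
rc=$?
if [ "$ACTION" = "monitor" ]; then
    echo "$rc" > "$CACHE"      # remember the last real monitor result
fi
exit $rc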

Regards,
Ulrich

> 
> On Mon, Aug 20, 2018 at 3:51 AM Ulrich Windl <
> ulrich.wi...@rz.uni-regensburg.de> wrote:
> 
>> Hi!
>>
>> I wonder whether it's possible to run a monitoring op only if some
>> specific resource is up.
>> Background: We have some resource that runs fine without NFS, but the
>> start, stop and monitor operations will just hang if NFS is down. In effect
>> the monitor operation will time out, the cluster will try to recover,
>> calling the stop operation, which in turn will time out, making things
>> worse (i.e.: causing a node fence).
>>
>> So my idea was to pause the monitoring operation while NFS is down (NFS
>> itself is controlled by the cluster and should recover "rather soon" TM).
>>
>> Is that possible?
>> And before you ask: no, I did not write the RA that has the problem; a
>> multi-million-dollar company wrote it. (Years before, I had written a monitor
>> for HP-UX's cluster that did not have this problem, even though the
>> configuration files were read from NFS. It's not magic: just periodically
>> copy them to shared memory, and read the config from shared memory.)
>>
>> Regards,
>> Ulrich
>>
>>




___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Re: Antw: Re: Spurious node loss in corosync cluster

2018-08-23 Thread Ulrich Windl
>>> Prasad Nagaraj wrote on 22.08.2018 at 02:59 in message:
> Thanks Ken and Ulrich. There is definitely high IO on the system, with
> IOWAITs sometimes of up to 90%.
> I have come across some previous posts saying that IOWAIT is also considered
> CPU load by Corosync. Is this true? Does having high IO lead corosync to

It's not Corosync, it's Linux: a process busy with I/O also adds to the (CPU)
load. One typical effect we see: if some stale NFS or CIFS share is still
being used, the "CPU load" goes up...

> complain, as in "Corosync main process was not scheduled for..." or "High
> CPU load detected..."?
> 
> I will surely monitor the system more.

As recommended, try sar's disk activity (the numbers below are from a fast disk
system, BTW):
00:00:01     DEV               tps   rd_sec/s   wr_sec/s  avgrq-sz  avgqu-sz   await  svctm  %util
08:40:01     dev253-13     1225.65   18393.28    1469.77     16.21      0.53    0.44   0.33  40.76
08:50:01     dev253-51     4972.41   38796.90     977.88      8.00      2.26    0.46   0.11  55.19
09:10:01     dev253-51     4709.03   36692.07     975.01      8.00      2.73    0.58   0.14  64.57
09:20:01     dev253-51     4445.17   34708.88     847.96      8.00      1.70    0.38   0.12  55.03
10:10:01     dev253-51     4246.66   32944.55    1023.61      8.00      3.12    0.73   0.18  77.83
11:00:01     dev253-51     5500.39   42984.68    1012.82      8.00      4.55    0.83   0.14  76.91
19:50:01     dev253-51    49618.88  396396.53     547.83      8.00    139.60    2.81   0.01  60.98

%util is the column to look at; you could also look at await, but note
that network-related I/O is not included there.
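
For anyone collecting the same numbers, a rough sketch (sysstat assumed installed;
the archive path varies by distribution):

# live sampling: per-device statistics every 10 seconds, 6 samples
sar -d -p 10 6

# or read back today's data collected by the sysstat cron/systemd job
sar -d -p -f /var/log/sa/sa$(date +%d)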

Regards,
Ulrich

> 
> Thanks for your help.
> Prasad
> 
> 
> 
> On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot  wrote:
> 
>> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
>> > > > > Prasad Nagaraj  schrieb am
>> > > > > 21.08.2018 um 11:42 in
>> >
>> > Nachricht
>> > :
>> > > Hi Ken - Thanks for you response.
>> > >
>> > > We do have seen messages in other cases like
>> > > corosync [MAIN  ] Corosync main process was not scheduled for
>> > > 17314.4746 ms
>> > > (threshold is 8000. ms). Consider token timeout increase.
>> > > corosync [TOTEM ] A processor failed, forming new configuration.
>> > >
>> > > Is this the indication of a failure due to CPU load issues and will
>> > > this
>> > > get resolved if I upgrade to Corosync 2.x series ?
>>
>> Yes, most definitely this is a CPU issue. It means corosync isn't
>> getting enough CPU cycles to handle the cluster token before the
>> timeout is reached.
>>
>> Upgrading may indeed help, as recent versions ensure that corosync runs
>> with real-time priority in the kernel, and thus are more likely to get
>> CPU time when something of lower priority is consuming all the CPU.
>>
>> But of course, there is some underlying problem that should be
>> identified and addressed. Figure out what's maxing out the CPU or I/O.
>> Ulrich's monitoring suggestion is a good start.
>>
>> > Hi!
>> >
>> > I'd strongly recommend starting monitoring on your nodes, at least
>> > until you know what's going on. The good old UNIX sa (sysstat
>> > package) could be a starting point. I'd monitor CPU idle
>> > specifically. Then go for 100% device utilization, then look for
>> > network bottlenecks...
>> >
>> > A new corosync release cannot fix those, most likely.
>> >
>> > Regards,
>> > Ulrich
>> >
>> > >
>> > > In any case, for the current scenario, we did not see any
>> > > scheduling
>> > > related messages.
>> > >
>> > > Thanks for your help.
>> > > Prasad
>> > >
>> > > On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot 
>> > > wrote:
>> > >
>> > > > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
>> > > > > Hi:
>> > > > >
>> > > > > One of these days, I saw a spurious node loss on my 3-node
>> > > > > corosync
>> > > > > cluster with following logged in the corosync.log of one of the
>> > > > > nodes.
>> > > > >
>> > > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
>> > > > > Transitional membership event on ring 32: memb=2, new=0, lost=1
>> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
>> > > > > vm02d780875f 67114156
>> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
>> > > > > vmfa2757171f 151000236
>> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
>> > > > > vm728316982d 201331884
>> > > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
>> > > > > Stable
>> > > > > membership event on ring 32: memb=2, new=0, lost=0
>> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
>> > > > > vm02d780875f 67114156
>> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
>> > > > > vmfa2757171f 151000236
>> > > > > Aug 18 12:40:25 corosync [pcmk  ] info:
>> > > > > ais_mark_unseen_peer_dead:
>> > > > > Node vm728316982d was not seen in the previous transiti

[ClusterLabs] Antw: Re: Antw: Re: Spurious node loss in corosync cluster

2018-08-23 Thread Ulrich Windl
>>> Prasad Nagaraj wrote on 22.08.2018 at 19:00 in message:
> Hi - My systems are single-core CPU VMs running on the Azure platform. I am
OK, so you don't have any control over overprovisioning of CPU power or over the VM
being migrated between nodes, I guess. Be aware that the CPU time you are
seeing is purely virtual, and there may be times when a "100% busy CPU" gets no
CPU cycles at all. An interesting experiment would be to compare the
CLOCK_MONOTONIC values against CLOCK_REALTIME (on a real host, so that
CLOCK_REALTIME is actually not virtual time) over some time. I wouldn't be
surprised if you see jumps.

I think clouds are no good for real-time demands.

Regards,
Ulrich
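
A rough sketch of that experiment in shell (using /proc/uptime as a cheap stand-in
for the monotonic clock, which is only an approximation):

# Sample the wall clock and the kernel uptime counter once a minute; jumps in
# the difference between the two columns hint at stolen or virtualized time.
while sleep 60; do
    printf '%s  uptime=%s  epoch=%s\n' \
        "$(date -u '+%F %T')" \
        "$(cut -d' ' -f1 /proc/uptime)" \
        "$(date +%s)"
done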

> running MySQL on the nodes, which does generate high IO load. And my bad, I
> meant to say 'High CPU load detected', logged by crmd and not corosync.
> Corosync logs messages like 'Corosync main process was not scheduled
> for...', which in turn sometimes makes the pacemaker monitor action
> fail. Is increasing the token timeout a solution for this, or are there
> other ways?
> 
> Thanks for the help
> Prasad
> 
> On Wed, 22 Aug 2018, 11:55 am Jan Friesse,  wrote:
> 
>> Prasad,
>>
>> > Thanks Ken and Ulrich. There is definitely high IO on the system with
>> > sometimes IOWAIT s of upto 90%
>> > I have come across some previous posts that IOWAIT is also considered as
>> > CPU load by Corosync. Is this true ? Does having high IO may lead
>> corosync
>> > complain as in " Corosync main process was not scheduled for..." or "Hi
>> > CPU load detected.." ?
>>
>> Yes it can.
>>
>> Corosync never logs "Hi CPU load detected...".
>>
>> >
>> > I will surely monitor the system more.
>>
>> Is that system VM or physical machine? Because " Corosync main process
>> was not scheduled for..." is usually happening on VMs where hosts are
>> highly overloaded.
>>
>> Honza
>>
>> >
>> > Thanks for your help.
>> > Prasad
>> >
>> >
>> >
>> > On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot 
>> wrote:
>> >
>> >> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
>> >> Prasad Nagaraj  schrieb am
>> >> 21.08.2018 um 11:42 in
>> >>>
>> >>> Nachricht
>> >>> :
>>  Hi Ken - Thanks for you response.
>> 
>>  We do have seen messages in other cases like
>>  corosync [MAIN  ] Corosync main process was not scheduled for
>>  17314.4746 ms
>>  (threshold is 8000. ms). Consider token timeout increase.
>>  corosync [TOTEM ] A processor failed, forming new configuration.
>> 
>>  Is this the indication of a failure due to CPU load issues and will
>>  this
>>  get resolved if I upgrade to Corosync 2.x series ?
>> >>
>> >> Yes, most definitely this is a CPU issue. It means corosync isn't
>> >> getting enough CPU cycles to handle the cluster token before the
>> >> timeout is reached.
>> >>
>> >> Upgrading may indeed help, as recent versions ensure that corosync runs
>> >> with real-time priority in the kernel, and thus are more likely to get
>> >> CPU time when something of lower priority is consuming all the CPU.
>> >>
>> >> But of course, there is some underlying problem that should be
>> >> identified and addressed. Figure out what's maxing out the CPU or I/O.
>> >> Ulrich's monitoring suggestion is a good start.
>> >>
>> >>> Hi!
>> >>>
>> >>> I'd strongly recommend starting monitoring on your nodes, at least
>> >>> until you know what's going on. The good old UNIX sa (sysstat
>> >>> package) could be a starting point. I'd monitor CPU idle
>> >>> specifically. Then go for 100% device utilization, then look for
>> >>> network bottlenecks...
>> >>>
>> >>> A new corosync release cannot fix those, most likely.
>> >>>
>> >>> Regards,
>> >>> Ulrich
>> >>>
>> 
>>  In any case, for the current scenario, we did not see any
>>  scheduling
>>  related messages.
>> 
>>  Thanks for your help.
>>  Prasad
>> 
>>  On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot 
>>  wrote:
>> 
>> > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
>> >> Hi:
>> >>
>> >> One of these days, I saw a spurious node loss on my 3-node
>> >> corosync
>> >> cluster with following logged in the corosync.log of one of the
>> >> nodes.
>> >>
>> >> Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
>> >> Transitional membership event on ring 32: memb=2, new=0, lost=1
>> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
>> >> vm02d780875f 67114156
>> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
>> >> vmfa2757171f 151000236
>> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
>> >> vm728316982d 201331884
>> >> Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
>> >> Stable
>> >> membership event on ring 32: memb=2, new=0, lost=0
>> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
>> >> vm02d780875f 67114156
>> >> Aug 18 12:40:25 corosyn
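
For reference, the token timeout mentioned in the advice quoted above is set in the
totem section of /etc/corosync/corosync.conf; a minimal sketch (the value is purely
illustrative, and corosync has to be restarted or reloaded to pick it up):

totem {
    version: 2
    # token timeout in milliseconds; the default is 1000. Raising it lets the
    # cluster ride out short scheduling stalls at the cost of slower failure
    # detection.
    token: 10000
}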

Re: [ClusterLabs] Q: automatically remove expired location constraints

2018-08-23 Thread Ken Gaillot
On Thu, 2018-08-23 at 12:27 +0200, Ulrich Windl wrote:
> Hi!
> 
> I have a non-trivial question: How can I remove expired manual
> migration requests, like the following?:
> location cli-standby-rsc rsc rule -inf: #uname eq host and date lt
> "2013-06-12 13:47:26Z"
> 
> One problem is that the date value is not a constant, and it has to
> be compared against the current date & time.
> 
> Regards,
> Ulrich

crm_resource --clear -r RSC will clear all cli-* constraints
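
For example (resource and constraint names assumed):

# remove the cli-* ban/move constraints created for resource "rsc"
crm_resource --clear --resource rsc

# verify that no cli-* location constraints are left
cibadmin --query --scope constraints | grep cli- || echo "no cli-* constraints left"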
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Q: (SLES11 SP4) lrm_rsc_op without last-run?

2018-08-23 Thread Ken Gaillot
On Thu, 2018-08-23 at 08:08 +0200, Ulrich Windl wrote:
> Hi!
> 
> Many years ago I wrote a parser that could format the CIB XML in a
> flexible way. Today I used it again to print some statistics for
> "exec-time". Thereby I discovered one operation that has a valid
> "exec-time", a valid "last-rc-change", but no "last-run".
> All other operations had "last-run". Can someone explain how this can
> happen? The operation in question is "monitor", so it should be run
> frequently (specified as "op monitor interval=600 timeout=30").

Recurring actions never get last-run, because the crmd doesn't initiate
each run of a recurring action.
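
As an aside, one quick way to pull those attributes out of the live CIB without a
custom parser, sketched here (xmllint from libxml2 assumed to be available):

# dump operation key, exec-time and last-rc-change for every recorded operation
cibadmin --query | xmllint --xpath \
  '//lrm_rsc_op/@*[name()="operation_key" or name()="exec-time" or name()="last-rc-change"]' -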

> I see no failed actions regarding the resource.
> 
> The original XML part for the operation looks like this:
>   operation_key="prm_cron-cleanup_monitor_60" operation="monitor"
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
> transition-key="92:3:0:6c6eff09-0d57-4844-9c3c-bc300c095bb6"
> transition-magic="0:0;92:3:0:6c6eff09-0d57-4844-9c3c-bc300c095bb6"
> on_node="h06" call-id="151" rc-code="0" op-status="0"
> interval="60" last-rc-change="1513367408" exec-time="13" queue-
> time="0" op-digest="2351b51e5316689a0eb89e8061445728"/>
> 
> The node is not completely up-to-date, and it's using pacemaker-
> 1.1.12-18.1...
> 
> Regards,
> Ulrich
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-23 Thread Jan Friesse

I tried to install corosync 3.x and it works pretty well.


Cool


But when I install pacemaker, it installs the previous version of corosync as a
dependency and breaks the whole setup.
Any suggestions?


I can see at least the following "solutions":
- make a proper Debian package
- install corosync 3 to /usr/local (see the sketch below)
- (ugly) install the packaged corosync, then overwrite it with corosync 3 built from source
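
A rough sketch of the /usr/local route (configure flags are assumptions; both knet
and corosync use autotools, and knet would be built the same way first):

# build corosync 3 under /usr/local so the distribution package stays untouched
./autogen.sh
./configure --prefix=/usr/local --sysconfdir=/etc --localstatedir=/var
make
sudo make install
sudo ldconfig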

Regards,
  Honza



2018-08-23 9:32 GMT+02:00 Jan Friesse :


David,

BTW, where can I download Corosync 3.x?

I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/



Yes, that's Alpha 4 of Corosync 3.





2018-08-23 9:11 GMT+02:00 David Tolosa :

I'm currently using an Ubuntu 18.04 server configuration with netplan.


Here you have my current YAML configuration:

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [192.168.0.1/24]
    enp4s0f0:
      addresses: [192.168.1.1/24]
    enp5s0f0:
      {}
  vlans:
    vlan.XXX:
      id: XXX
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]
      gateway4: 10.1.128.1
      nameservers:
        addresses: [ 8.8.8.8, 8.8.4.4 ]
        search: [ foo.com, bar.com ]
    vlan.YYY:
      id: YYY
      link: enp5s0f0
      addresses: [ 10.1.128.5/29 ]


So, eno1 and enp4s0f0 are the two Ethernet ports connected directly to node2
with crossover cables.
The enp5s0f0 port is used to reach outside services via the VLANs defined in
the same file.

In short, I'm using systemd-networkd, the default Ubuntu 18.04 server network service, to




OK, so systemd-networkd really does an ifdown, and somebody has actually tried
to fix it and merge the fix upstream (sadly without much luck :( )

https://github.com/systemd/systemd/pull/7403


manage networks. I'm not detecting any NetworkManager-config-server

package in my repository either.




I'm not sure what it's called in Debian-based distributions, but it's just
one small file in /etc, so you can extract it from the RPM.

So the only solution I have left, I suppose, is to test Corosync 3.x

and see if it handles RRP better.




You may also reconsider trying either a completely static network
configuration or NetworkManager + NetworkManager-config-server.


Corosync 3.x with knet will work for sure, but be prepared for quite a
long compile path, because you first have to compile knet and then
corosync. What may help you a bit is that we have an Ubuntu 18.04 machine in our
Jenkins, so a working build is documented there (corosync build log:
https://ci.kronosnet.org/view/corosync/job/corosync-build-all-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText,
knet build log:
https://ci.kronosnet.org/view/knet/job/knet-build-all-voting/lastBuild/knet-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText).

Also please consult http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf
about changes in corosync configuration.

Regards,
   Honza



Thank you for your quick response!

2018-08-23 8:40 GMT+02:00 Jan Friesse :

David,


Hello,


I'm going crazy over this problem, which I hope to resolve here with
your help, guys:

I have 2 nodes with Corosync redundant ring feature.

Each node has 2 similarly connected/configured NICs. The two nodes are
connected to each other by two crossover cables.



I believe this is the root of the problem. Are you using NetworkManager? If
so, have you installed NetworkManager-config-server? If not, please
install
it and test again.


I configured both nodes with rrp_mode passive. Everything is working well
at this point, but when I shut down 1 node to test failover, and this
node comes back online, corosync marks the interface as FAULTY
and rrp



I believe it's because, with the crossover-cable configuration, when the other
side is shut down NetworkManager detects it and does an ifdown of the
interface. And corosync is unable to handle ifdown properly. Ifdown is bad
with a single ring, but it's a real killer with RRP (127.0.0.1 poisons every
node in the cluster).

fails to recover the initial state:



1. Initial scenario:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
   id  = 192.168.0.1
   status  = ring 0 active with no faults
RING ID 1
   id  = 192.168.1.1
   status  = ring 1 active with no faults


2. When I shut down node 2, everything continues with no faults. Sometimes
the ring IDs bind to 127.0.0.1 and then bind back to their respective
heartbeat IPs.



Again, result of ifdown.


3. When node 2 is back online:


# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
   id  = 192.168.0.1
   status  = ring 0 active with no faults
RING ID 1
   id  = 192.168.1.1
   status  = Marking ringid 1 interface 192.168.1.1 FAULTY


# service corosync status
● corosync.service - Corosync Cluster Engine
  Loaded: loaded (/l