Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-22 Thread Ferenc Wágner
Jan Friesse  writes:

> Is that system VM or physical machine? Because " Corosync main process
> was not scheduled for..." is usually happening on VMs where hosts are
> highly overloaded.

Or when physical hosts use BMC watchdogs.  But Prasad didn't encounter
such logs in the setup at hand, as far as I understand.
-- 
Regards,
Feri


Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-22 Thread Prasad Nagaraj
Hi - My systems are single-core CPU VMs running on the Azure platform. I am
running MySQL on the nodes, which does generate high IO load. And my bad, I
meant to say 'High CPU load detected', which is logged by crmd and not corosync.
Corosync logs messages like 'Corosync main process was not scheduled
for...', which in turn sometimes makes the pacemaker monitor action
fail. Is increasing the token timeout a solution for this, or are there any
other ways?

Thanks for the help
Prasad

On Wed, 22 Aug 2018, 11:55 am Jan Friesse,  wrote:

> Prasad,
>
> > Thanks Ken and Ulrich. There is definitely high IO on the system with
> > sometimes IOWAIT s of upto 90%
> > I have come across some previous posts that IOWAIT is also considered as
> > CPU load by Corosync. Is this true ? Does having high IO may lead
> corosync
> > complain as in " Corosync main process was not scheduled for..." or "Hi
> > CPU load detected.." ?
>
> Yes it can.
>
> Corosync never logs "Hi CPU load detected...".
>
> >
> > I will surely monitor the system more.
>
> Is that system VM or physical machine? Because " Corosync main process
> was not scheduled for..." is usually happening on VMs where hosts are
> highly overloaded.
>
> Honza
>
> >
> > Thanks for your help.
> > Prasad
> >
> >
> >
> > On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot 
> wrote:
> >
> >> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
> >> Prasad Nagaraj  schrieb am
> >> 21.08.2018 um 11:42 in
> >>>
> >>> Nachricht
> >>> :
>  Hi Ken - Thanks for you response.
> 
>  We do have seen messages in other cases like
>  corosync [MAIN  ] Corosync main process was not scheduled for
>  17314.4746 ms
>  (threshold is 8000. ms). Consider token timeout increase.
>  corosync [TOTEM ] A processor failed, forming new configuration.
> 
>  Is this the indication of a failure due to CPU load issues and will
>  this
>  get resolved if I upgrade to Corosync 2.x series ?
> >>
> >> Yes, most definitely this is a CPU issue. It means corosync isn't
> >> getting enough CPU cycles to handle the cluster token before the
> >> timeout is reached.
> >>
> >> Upgrading may indeed help, as recent versions ensure that corosync runs
> >> with real-time priority in the kernel, and thus are more likely to get
> >> CPU time when something of lower priority is consuming all the CPU.
> >>
> >> But of course, there is some underlying problem that should be
> >> identified and addressed. Figure out what's maxing out the CPU or I/O.
> >> Ulrich's monitoring suggestion is a good start.
> >>
> >>> Hi!
> >>>
> >>> I'd strongly recommend starting monitoring on your nodes, at least
> >>> until you know what's going on. The good old UNIX sa (sysstat
> >>> package) could be a starting point. I'd monitor CPU idle
> >>> specifically. Then go for 100% device utilization, then look for
> >>> network bottlenecks...
> >>>
> >>> A new corosync release cannot fix those, most likely.
> >>>
> >>> Regards,
> >>> Ulrich
> >>>
> 
>  In any case, for the current scenario, we did not see any
>  scheduling
>  related messages.
> 
>  Thanks for your help.
>  Prasad
> 
>  On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot 
>  wrote:
> 
> > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> >> Hi:
> >>
> >> One of these days, I saw a spurious node loss on my 3-node
> >> corosync
> >> cluster with following logged in the corosync.log of one of the
> >> nodes.
> >>
> >> Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> >> Transitional membership event on ring 32: memb=2, new=0, lost=1
> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> >> vm02d780875f 67114156
> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> >> vmfa2757171f 151000236
> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
> >> vm728316982d 201331884
> >> Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> >> Stable
> >> membership event on ring 32: memb=2, new=0, lost=0
> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> >> vm02d780875f 67114156
> >> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> >> vmfa2757171f 151000236
> >> Aug 18 12:40:25 corosync [pcmk  ] info:
> >> ais_mark_unseen_peer_dead:
> >> Node vm728316982d was not seen in the previous transition
> >> Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
> >> 201331884/vm728316982d is now: lost
> >> Aug 18 12:40:25 corosync [pcmk  ] info:
> >> send_member_notification:
> >> Sending membership update 32 to 3 children
> >> Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
> >> the
> >> membership and a new membership was formed.
> >> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info:
> >> plugin_handle_membership: Membership 32: quorum retained
> >

[ClusterLabs] Redundant ring not recovering after node is back

2018-08-22 Thread David Tolosa
Hello,
I'm going crazy over this problem, which I hope to resolve here with
your help:

I have 2 nodes using the Corosync redundant ring feature.

Each node has 2 similarly connected/configured NICs. The nodes are
connected to each other by two crossover cables.

I configured both nodes with rrp_mode passive. Everything works well
at this point, but when I shut down one node to test failover and that node
comes back online, corosync marks the interface as FAULTY and RRP
fails to recover to the initial state:

1. Initial scenario:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id  = 192.168.0.1
status  = ring 0 active with no faults
RING ID 1
id  = 192.168.1.1
status  = ring 1 active with no faults


2. When I shut down node 2, everything continues with no faults. Sometimes the
ring IDs get bound to 127.0.0.1 and then bind back to their respective
heartbeat IPs.

3. When node 2 is back online:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id  = 192.168.0.1
status  = ring 0 active with no faults
RING ID 1
id  = 192.168.1.1
status  = Marking ringid 1 interface 192.168.1.1 FAULTY


# service corosync status
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
preset: enabled)
   Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
 Docs: man:corosync
   man:corosync.conf
   man:corosync_overview
 Main PID: 1439 (corosync)
Tasks: 2 (limit: 4915)
   CGroup: /system.slice/corosync.service
   └─1439 /usr/sbin/corosync -f


Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.1.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.1.1] is now up.
Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
new membership (192.168.0.1:601760) was formed. Members
Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601760) was formed. Members
Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
new membership (192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
Marking ringid 1 interface 192.168.1.1 FAULTY
Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface
192.168.1.1 FAULTY


If I run corosync-cfgtool to clear the fault, the FAULTY state goes away, but
after a few seconds the ring is marked FAULTY again.
The only thing that resolves the problem is restarting the service with
service corosync restart.
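
(For reference, the corosync-cfgtool options involved - assuming the stock
tool, see its man page - are roughly:

    corosync-cfgtool -s    # print the current ring status
    corosync-cfgtool -r    # reset redundant ring state after a fault
)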

Here are some of my configuration settings on node 1 (I have already tried
changing rrp_mode):

*- corosync.conf*

totem {
version: 2
cluster_name: node
token: 5000
token_retransmits_before_loss_const: 10
secauth: off
threads: 0
rrp_mode: passive
nodeid: 1
interface {
ringnumber: 0
bindnetaddr: 192.168.0.0
#mcastaddr: 226.94.1.1
mcastport: 5405
broadcast: yes
}
interface {
ringnumber: 1
bindnetaddr: 192.168.1.0
#mcastaddr: 226.94.1.2
mcastport: 5407
broadcast: yes
}
}

logging {
fileline: off
to_stderr: yes
to_syslog: yes
to_logfile: yes
logfile: /var/log/corosync/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

amf {
mode: disabled
}

quorum {
provider: corosync_votequorum
expected_votes: 2
}

nodelist {
node {
nodeid: 1
ring0_addr: 192.168.0.1
ring1_addr: 192.168.1.1
}

node {
nodeid: 2
ring0_addr: 192.168.0.2
ring1_addr: 192.168.1.2
}
}

aisexec {
user: root
group: root
}

service {
name: pacemaker
ver: 1
}



*- /etc/hosts*


127.0.0.1   localhost
10.4.172.5  node1.upc.edu node1
10.4.172.6  node2.upc.edu node2


Thank you for you help in advance!

-- 
*David Tolosa Martínez*
Customer Support & Infrastructure
UPCnet - Edifici Vèrtex
Plaça d'Eusebi Güell, 6, 08034 Barcelona
Tel: 934054555




Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-22 Thread Emmanuel Gelati
I think you are missing interface with nodelist
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_sample_corosync_configuration.html

2018-08-22 14:53 GMT+02:00 David Tolosa :

> Hello,
> Im getting crazy about this problem, that I expect to resolve here, with
> your help guys:
>
> I have 2 nodes with Corosync redundant ring feature.
>
> Each node has 2 similarly connected/configured NIC's. Both nodes are
> connected each other by two crossover cables.
>
> I configured both nodes with rrp mode passive. Everything is working well
> at this point, but when I shutdown 1 node to test failover, and this node
> returns to be online, corosync is marking the interface as FAULTY and rrp
> fails to recover the initial state:
>
> 1. Initial scenario:
>
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id  = 192.168.0.1
> status  = ring 0 active with no faults
> RING ID 1
> id  = 192.168.1.1
> status  = ring 1 active with no faults
>
>
> 2. When I shutdown the node 2, all continues with no faults. Sometimes the
> ring ID's are bonding with 127.0.0.1 and then bond back to their respective
> heartbeat IP.
>
> 3. When node 2 is back online:
>
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id  = 192.168.0.1
> status  = ring 0 active with no faults
> RING ID 1
> id  = 192.168.1.1
> status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>
>
> # service corosync status
> ● corosync.service - Corosync Cluster Engine
>Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
> preset: enabled)
>Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s
> ago
>  Docs: man:corosync
>man:corosync.conf
>man:corosync_overview
>  Main PID: 1439 (corosync)
> Tasks: 2 (limit: 4915)
>CGroup: /system.slice/corosync.service
>└─1439 /usr/sbin/corosync -f
>
>
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.1.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.1.1] is now up.
> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
> new membership (192.168.0.1:601760) was formed. Members
> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601760) was formed. Members
> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
> new membership (192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
> Marking ringid 1 interface 192.168.1.1 FAULTY
> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1
> interface 192.168.1.1 FAULTY
>
>
> If I execute corosync-cfgtool, clears the faulty error but after some
> seconds return to be FAULTY.
> The only thing that it resolves the problem is to restart de service with
> service corosync restart.
>
> Here you have some of my configuration settings on node 1 (I probed
> already to change rrp_mode):
>
> *- corosync.conf*
>
> totem {
> version: 2
> cluster_name: node
> token: 5000
> token_retransmits_before_loss_const: 10
> secauth: off
> threads: 0
> rrp_mode: passive
> nodeid: 1
> interface {
> ringnumber: 0
> bindnetaddr: 192.168.0.0
> #mcastaddr: 226.94.1.1
> mcastport: 5405
> broadcast: yes
> }
> interface {
> ringnumber: 1
> bindnetaddr: 192.168.1.0
> #mcastaddr: 226.94.1.2
> mcastport: 5407
> broadcast: yes
> }
> }
>
> logging {
> fileline: off
> to_stderr: yes
> to_syslog: yes
> to_logfile: yes
> logfile: /var/log/corosync/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
>
> amf {
> mode: disabled
> }
>
> quorum {
> provider: corosync_votequorum
> expected_votes: 2
> }
>
> nodelist {
> node {
> nodeid: 1
> ring0_addr: 192.168.0.1
> ring1_addr: 192.168.1.1
> }
>
> node {
> nodeid: 2
> ring0_addr: 192.168.0.2
> ring1_addr: 192.168.1.2
> }
> }
>
> aisexec {
> user: root
> group: root
> }
>
> service {

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-22 Thread Emmanuel Gelati
Sorry, a typo:

I think you are mixing interface with nodelist:
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_sample_corosync_configuration.html

2018-08-22 22:20 GMT+02:00 Emmanuel Gelati :

> I think you are missing interface with nodelist:
> http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_sample_corosync_configuration.html
>
> 2018-08-22 14:53 GMT+02:00 David Tolosa :
>
>> Hello,
>> Im getting crazy about this problem, that I expect to resolve here, with
>> your help guys:
>>
>> I have 2 nodes with Corosync redundant ring feature.
>>
>> Each node has 2 similarly connected/configured NIC's. Both nodes are
>> connected each other by two crossover cables.
>>
>> I configured both nodes with rrp mode passive. Everything is working well
>> at this point, but when I shutdown 1 node to test failover, and this node
>> returns to be online, corosync is marking the interface as FAULTY and rrp
>> fails to recover the initial state:
>>
>> 1. Initial scenario:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>> id  = 192.168.0.1
>> status  = ring 0 active with no faults
>> RING ID 1
>> id  = 192.168.1.1
>> status  = ring 1 active with no faults
>>
>>
>> 2. When I shutdown the node 2, all continues with no faults. Sometimes
>> the ring ID's are bonding with 127.0.0.1 and then bond back to their
>> respective heartbeat IP.
>>
>> 3. When node 2 is back online:
>>
>> # corosync-cfgtool -s
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>> id  = 192.168.0.1
>> status  = ring 0 active with no faults
>> RING ID 1
>> id  = 192.168.1.1
>> status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>>
>> # service corosync status
>> ● corosync.service - Corosync Cluster Engine
>>Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
>> preset: enabled)
>>Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s
>> ago
>>  Docs: man:corosync
>>man:corosync.conf
>>man:corosync_overview
>>  Main PID: 1439 (corosync)
>> Tasks: 2 (limit: 4915)
>>CGroup: /system.slice/corosync.service
>>└─1439 /usr/sbin/corosync -f
>>
>>
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
>> The network interface [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>> [192.168.0.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
>> The network interface [192.168.1.1] is now up.
>> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
>> [192.168.1.1] is now up.
>> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
>> new membership (192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
>> 192.168.0.1:601760) was formed. Members
>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
>> new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
>> 192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
>> Marking ringid 1 interface 192.168.1.1 FAULTY
>> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1
>> interface 192.168.1.1 FAULTY
>>
>>
>> If I execute corosync-cfgtool, clears the faulty error but after some
>> seconds return to be FAULTY.
>> The only thing that it resolves the problem is to restart de service with
>> service corosync restart.
>>
>> Here you have some of my configuration settings on node 1 (I probed
>> already to change rrp_mode):
>>
>> *- corosync.conf*
>>
>> totem {
>> version: 2
>> cluster_name: node
>> token: 5000
>> token_retransmits_before_loss_const: 10
>> secauth: off
>> threads: 0
>> rrp_mode: passive
>> nodeid: 1
>> interface {
>> ringnumber: 0
>> bindnetaddr: 192.168.0.0
>> #mcastaddr: 226.94.1.1
>> mcastport: 5405
>> broadcast: yes
>> }
>> interface {
>> ringnumber: 1
>> bindnetaddr: 192.168.1.0
>> #mcastaddr: 226.94.1.2
>> mcastport: 5407
>> broadcast: yes
>> }
>> }
>>
>> logging {
>> fileline: off
>> to_stderr: yes
>> to_syslog: yes
>> to_logfile: yes
>> logfile: /var/log/corosync/corosync.log
>> debug: off
>> timestamp: on
>> logger_subsys {
>> subsys: AMF
>> debug: off
>> }
>> }
>>
>> amf {
>> mode: disabled
>> }
>>
>> quorum {
>> provider: corosync_votequorum
>> expected_votes: 2
>>

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-22 Thread Andrei Borzenkov
22.08.2018 15:53, David Tolosa wrote:
> Hello,
> Im getting crazy about this problem, that I expect to resolve here, with
> your help guys:
> 
> I have 2 nodes with Corosync redundant ring feature.
> 
> Each node has 2 similarly connected/configured NIC's. Both nodes are
> connected each other by two crossover cables.
> 
> I configured both nodes with rrp mode passive. Everything is working well
> at this point, but when I shutdown 1 node to test failover, and this node
> returns to be online, corosync is marking the interface as FAULTY and rrp
> fails to recover the initial state:
> 
> 1. Initial scenario:
> 
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id  = 192.168.0.1
> status  = ring 0 active with no faults
> RING ID 1
> id  = 192.168.1.1
> status  = ring 1 active with no faults
> 
> 
> 2. When I shutdown the node 2, all continues with no faults. Sometimes the
> ring ID's are bonding with 127.0.0.1 and then bond back to their respective
> heartbeat IP.
> 
> 3. When node 2 is back online:
> 
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id  = 192.168.0.1
> status  = ring 0 active with no faults
> RING ID 1
> id  = 192.168.1.1
> status  = Marking ringid 1 interface 192.168.1.1 FAULTY
> 
> 
> # service corosync status
> ● corosync.service - Corosync Cluster Engine
>Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
> preset: enabled)
>Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>  Docs: man:corosync
>man:corosync.conf
>man:corosync_overview
>  Main PID: 1439 (corosync)
> Tasks: 2 (limit: 4915)
>CGroup: /system.slice/corosync.service
>└─1439 /usr/sbin/corosync -f
> 
> 
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.1.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.1.1] is now up.
> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
> new membership (192.168.0.1:601760) was formed. Members
> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601760) was formed. Members
> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
> new membership (192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
> Marking ringid 1 interface 192.168.1.1 FAULTY
> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface
> 192.168.1.1 FAULTY
> 
> 
> If I execute corosync-cfgtool, clears the faulty error but after some
> seconds return to be FAULTY.
> The only thing that it resolves the problem is to restart de service with
> service corosync restart.
> 
> Here you have some of my configuration settings on node 1 (I probed already
> to change rrp_mode):
> 
> *- corosync.conf*
> 
> totem {
> version: 2
> cluster_name: node
> token: 5000
> token_retransmits_before_loss_const: 10
> secauth: off
> threads: 0
> rrp_mode: passive
> nodeid: 1
> interface {
> ringnumber: 0
> bindnetaddr: 192.168.0.0
> #mcastaddr: 226.94.1.1
> mcastport: 5405
> broadcast: yes
> }
> interface {
> ringnumber: 1
> bindnetaddr: 192.168.1.0
> #mcastaddr: 226.94.1.2
> mcastport: 5407
> broadcast: yes
> }
> }
> 
> logging {
> fileline: off
> to_stderr: yes
> to_syslog: yes
> to_logfile: yes
> logfile: /var/log/corosync/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
> 
> amf {
> mode: disabled
> }
> 
> quorum {
> provider: corosync_votequorum
> expected_votes: 2
> }
> 
> nodelist {
> node {
> nodeid: 1
> ring0_addr: 192.168.0.1
> ring1_addr: 192.168.1.1
> }
> 
> node {
> nodeid: 2
> ring0_addr: 192.168.0.2
> ring1_addr: 192.168.1.2
> }
> }
> 

My understanding so far was that nodelist is used with udpu transport
only. You may try without nodelist or with transport: udpu to see if it
makes a difference.
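
Untested sketch, reusing the addresses already in your config - with udpu the
totem section would look roughly like:

totem {
        version: 2
        transport: udpu
        rrp_mode: passive
        token: 5000
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.0.0
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.1.0
                mcastport: 5407
        }
}

keeping your existing nodelist (ring0_addr/ring1_addr per node) as it is.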

> aisexec {
> user: root
> group: root
> }
>

[ClusterLabs] Q: (SLES11 SP4) lrm_rsc_op without last-run?

2018-08-22 Thread Ulrich Windl
Hi!

Many years ago I wrote a parser that could format the CIB XML in a flexible 
way. Today I used it again to print some statistics for "exec-time". Thereby I 
discovered one operation that has a valid "exec-time", a valid 
"last-rc-change", but no "last-run".
All other operations had "last-run". Can someone explain how this can happen? 
The operation in question is "monitor", so it should be run frequently 
(specified as "op monitor interval=600 timeout=30").

I see no failed actions regarding the resource.

The original XML part for the operation looks like this:
 

The node is not completely up-to-date, and it's using pacemaker-1.1.12-18.1...
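
(For anyone who wants to reproduce the check without a custom parser, a rough
equivalent - assuming cibadmin and xmllint are available; the XPath is only a
sketch - would be:

    cibadmin -Q | xmllint --xpath \
        '//lrm_rsc_op[@exec-time and not(@last-run)]' -

i.e. list the operations that carry exec-time but no last-run.)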

Regards,
Ulrich




Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-22 Thread Jan Friesse

Prasad,



Hi - My systems are single-core CPU VMs running on the Azure platform. I am


Ok, now it makes sense. I don't think you get many guarantees in a cloud
environment, so quite a large scheduling pause can simply happen.
Also, a single-core CPU is kind of "unsupported" today.



running MySQL on the nodes, which does generate high IO load. And my bad, I
meant to say 'High CPU load detected', which is logged by crmd and not corosync.
Corosync logs messages like 'Corosync main process was not scheduled
for...', which in turn sometimes makes the pacemaker monitor action
fail. Is increasing the token timeout a solution for this, or are there any
other ways?


Yes. Actually, increasing the token timeout to quite a large value seems to be
the only "solution". Please keep in mind that it's not free - when a real
problem happens, corosync does not detect it before the token timeout
expires, so with a large token timeout, detection takes longer.
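
For illustration only - the value is just an example, not a recommendation;
pick and test what fits your environment - the relevant corosync.conf knob
would look something like:

totem {
        # existing totem settings unchanged, only the timeout raised
        token: 30000
}

with the trade-off above: a real failure is then only detected once that
(larger) token timeout expires.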


Regards,
  Honza



Thanks for the help
Prasad

On Wed, 22 Aug 2018, 11:55 am Jan Friesse,  wrote:


Prasad,


Thanks Ken and Ulrich. There is definitely high IO on the system with
sometimes IOWAIT s of upto 90%
I have come across some previous posts that IOWAIT is also considered as
CPU load by Corosync. Is this true ? Does having high IO may lead

corosync

complain as in " Corosync main process was not scheduled for..." or "Hi
CPU load detected.." ?


Yes it can.

Corosync never logs "Hi CPU load detected...".



I will surely monitor the system more.


Is that system VM or physical machine? Because " Corosync main process
was not scheduled for..." is usually happening on VMs where hosts are
highly overloaded.

Honza



Thanks for your help.
Prasad



On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot 

wrote:



On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:

Prasad Nagaraj  schrieb am
21.08.2018 um 11:42 in


Nachricht
:

Hi Ken - Thanks for you response.

We do have seen messages in other cases like
corosync [MAIN  ] Corosync main process was not scheduled for
17314.4746 ms
(threshold is 8000. ms). Consider token timeout increase.
corosync [TOTEM ] A processor failed, forming new configuration.

Is this the indication of a failure due to CPU load issues and will
this
get resolved if I upgrade to Corosync 2.x series ?


Yes, most definitely this is a CPU issue. It means corosync isn't
getting enough CPU cycles to handle the cluster token before the
timeout is reached.

Upgrading may indeed help, as recent versions ensure that corosync runs
with real-time priority in the kernel, and thus are more likely to get
CPU time when something of lower priority is consuming all the CPU.

But of course, there is some underlying problem that should be
identified and addressed. Figure out what's maxing out the CPU or I/O.
Ulrich's monitoring suggestion is a good start.


Hi!

I'd strongly recommend starting monitoring on your nodes, at least
until you know what's going on. The good old UNIX sa (sysstat
package) could be a starting point. I'd monitor CPU idle
specifically. Then go for 100% device utilization, then look for
network bottlenecks...

A new corosync release cannot fix those, most likely.

Regards,
Ulrich



In any case, for the current scenario, we did not see any
scheduling
related messages.

Thanks for your help.
Prasad

On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot 
wrote:


On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:

Hi:

One of these days, I saw a spurious node loss on my 3-node
corosync
cluster with following logged in the corosync.log of one of the
nodes.

Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
Transitional membership event on ring 32: memb=2, new=0, lost=1
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
vm728316982d 201331884
Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
Stable
membership event on ring 32: memb=2, new=0, lost=0
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk  ] info:
ais_mark_unseen_peer_dead:
Node vm728316982d was not seen in the previous transition
Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
201331884/vm728316982d is now: lost
Aug 18 12:40:25 corosync [pcmk  ] info:
send_member_notification:
Sending membership update 32 to 3 children
Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
the
membership and a new membership was formed.
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info:
plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
crm_update_peer_state_iter:   plugin_handle_membership: Node
vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] vm

Re: [ClusterLabs] Redundant ring not recovering after node is back

2018-08-22 Thread Jan Friesse

David,


Hello,
Im getting crazy about this problem, that I expect to resolve here, with
your help guys:

I have 2 nodes with Corosync redundant ring feature.

Each node has 2 similarly connected/configured NIC's. Both nodes are
connected each other by two crossover cables.


I believe this is the root of the problem. Are you using NetworkManager? If
so, have you installed NetworkManager-config-server? If not, please
install it and test again.
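
Rough sketch, assuming an RPM-based distro (package name, paths and exact
contents may differ on yours):

    yum install NetworkManager-config-server
    systemctl restart NetworkManager

The package essentially just drops a snippet along these lines into
NetworkManager's conf.d directory:

    [main]
    no-auto-default=*
    ignore-carrier=*

so NetworkManager should no longer take the interface down when the crossover
link loses carrier.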




I configured both nodes with rrp mode passive. Everything is working well
at this point, but when I shutdown 1 node to test failover, and this node
returns to be online, corosync is marking the interface as FAULTY and rrp


I believe it's because, with a crossover-cable setup, when the other side is
shut down, NetworkManager detects the lost link and does an ifdown of the
interface. And corosync is unable to handle ifdown properly. Ifdown is
bad with a single ring, but it's a real killer with RRP (127.0.0.1 poisons
every node in the cluster).



fails to recover the initial state:

1. Initial scenario:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
 id  = 192.168.0.1
 status  = ring 0 active with no faults
RING ID 1
 id  = 192.168.1.1
 status  = ring 1 active with no faults


2. When I shutdown the node 2, all continues with no faults. Sometimes the
ring ID's are bonding with 127.0.0.1 and then bond back to their respective
heartbeat IP.


Again, result of ifdown.



3. When node 2 is back online:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
 id  = 192.168.0.1
 status  = ring 0 active with no faults
RING ID 1
 id  = 192.168.1.1
 status  = Marking ringid 1 interface 192.168.1.1 FAULTY


# service corosync status
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
preset: enabled)
Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
  Docs: man:corosync
man:corosync.conf
man:corosync_overview
  Main PID: 1439 (corosync)
 Tasks: 2 (limit: 4915)
CGroup: /system.slice/corosync.service
└─1439 /usr/sbin/corosync -f


Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
network interface [192.168.1.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.1.1] is now up.
Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
new membership (192.168.0.1:601760) was formed. Members
Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601760) was formed. Members
Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
new membership (192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
Marking ringid 1 interface 192.168.1.1 FAULTY
Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface
192.168.1.1 FAULTY


If I execute corosync-cfgtool, clears the faulty error but after some
seconds return to be FAULTY.
The only thing that it resolves the problem is to restart de service with
service corosync restart.

Here you have some of my configuration settings on node 1 (I probed already
to change rrp_mode):

*- corosync.conf*

totem {
 version: 2
 cluster_name: node
 token: 5000
 token_retransmits_before_loss_const: 10
 secauth: off
 threads: 0
 rrp_mode: passive
 nodeid: 1
 interface {
 ringnumber: 0
 bindnetaddr: 192.168.0.0
 #mcastaddr: 226.94.1.1
 mcastport: 5405
 broadcast: yes
 }
 interface {
 ringnumber: 1
 bindnetaddr: 192.168.1.0
 #mcastaddr: 226.94.1.2
 mcastport: 5407
 broadcast: yes
 }
}

logging {
 fileline: off
 to_stderr: yes
 to_syslog: yes
 to_logfile: yes
 logfile: /var/log/corosync/corosync.log
 debug: off
 timestamp: on
 logger_subsys {
 subsys: AMF
 debug: off
 }
}

amf {
 mode: disabled
}

quorum {
 provider: corosync_votequorum
 expected_votes: 2
}

nodelist {
 node {
 nodeid: 1
 ring0_addr: 192.168.0.1
 ring1_addr: 192.168.1.1
 }

 node {
 nodeid: 2
 ring0_addr: 192.168.0.2
 ring1_add