On 18/11/2019 21:31, Jean-Francois Malouin wrote:
Hi,
Maybe not directly a pacemaker question, but perhaps some of you have seen this
problem:
A 2-node pacemaker cluster running corosync-3.0.1 with dual communication rings
sometimes reports errors like this in the corosync log file:
[KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
[KNET ] pmtud: PMTUD link change for host: 2 link: 1 from 470 to 1366
[KNET ] pmtud: Global data MTU changed to: 1366
[CFG ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
[CFG ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
Those do not happen very frequently, once a week or so...
Those messages are caused by a config file reload (corosync-cfgtool -R)
being triggered by something. If they're happening once a week then
check your cron jobs.
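In case it helps track that down, something along these lines should show whether
a scheduled job or some other process is issuing the reload (the paths are just
examples and vary by distro):

    # look for anything that calls corosync-cfgtool from cron or a systemd timer
    grep -r "corosync-cfgtool" /etc/cron* /var/spool/cron 2>/dev/null
    systemctl list-timers --all
    # the reload should also be visible in the journal around the time of the messages
    journalctl -u corosync --since "7 days ago" | grep -iE "reload|netmtu"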
However the system log on the nodes reports those much more frequently, a few
times a day:
Nov 17 23:26:20 node1 corosync[2258]: [KNET ] link: host: 2 link: 1 is down
Nov 17 23:26:20 node1 corosync[2258]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 0)
Nov 17 23:26:26 node1 corosync[2258]: [KNET ] rx: host: 2 link: 1 is up
Nov 17 23:26:26 node1 corosync[2258]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Those don't look good. Having a link down for 6 seconds looks like a
serious network outage that needs looking into, especially if they are
that frequent, or it could be a bug. You don't say which version of
libknet you have installed, but make sure it's the latest one.
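For reference, the current state of each knet link and the installed libknet
version can be checked with something like the following (package names differ
between distributions, so treat these as examples):

    corosync-cfgtool -s              # per-link status as corosync sees it
    dpkg -l | grep -i libknet        # Debian/Ubuntu
    rpm -qa | grep -i libknet        # RHEL/CentOS/SUSE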
The fencing event in your other message happened because both links
were down at the same time, which is a worrying coincidence. Changing
the token timeout won't make any difference to the knet link events, but
if the knet links are down for long enough then that will trigger a
token timeout and a fence event.
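To be clear, the knet link up/down detection is driven by its own per-link
heartbeat settings rather than by the totem token, which is why changing the
token on its own doesn't change those messages. If you ever need to tune the
link detection, the knobs live in the interface stanzas; the values below are
purely illustrative, not a recommendation:

    interface {
        linknumber: 0
        knet_ping_interval: 500    # ms between link heartbeats (example value)
        knet_ping_timeout: 1000    # ms without a reply before the link is marked down (example value)
        knet_pong_count: 2         # replies needed before a down link is marked up again (example value)
    }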
Definitely look for something odd in your networking - the corosync.conf
file looks sane (though having knet_transport in the top-level totem
stanza is doing nothing), so it's not that.
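Put differently, the transport is effectively a per-link setting, so the
top-level line can simply be dropped; the interface stanzas you already have
are the ones that count, roughly:

    totem {
        ...
        # knet_transport: sctp   <- ignored here; the per-interface values below win
        interface {
            linknumber: 0
            knet_transport: udp
            knet_link_priority: 0
        }
        ...
    }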
It's hard to make a judgement with just that info, but look for dropped
packets on the interfaces, slow response to other network services or
very high load on one of the nodes. If you can't see anything on the
systems then enable debug logging and get back to us. If it is a bug we
want it fixed!
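As a starting point, the usual checks look something like this (the interface
names here are just placeholders for whatever the two rings actually use):

    # dropped packets / errors on the ring interfaces
    ip -s link show eth0
    ethtool -S eth0 | grep -iE "drop|err"
    # load on the node
    uptime
    # to enable debug logging, set "debug: on" in the logging{} section of
    # /etc/corosync/corosync.conf on both nodes, then reload:
    corosync-cfgtool -R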
Chrissie
Are those to be dismissed or are they indicative of a network misconfig/problem?
I tried setting 'knet_transport: udpu' in the totem section (the default value)
but it didn't seem to make a difference... Hard-coding netmtu to 1500 and
allowing a longer (10s) token timeout also didn't seem to affect the issue.
Corosync config follows:
/etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: bicha
    transport: knet
    link_mode: passive
    ip_version: ipv4
    token: 10000
    netmtu: 1500
    knet_transport: sctp
    crypto_model: openssl
    crypto_hash: sha256
    crypto_cipher: aes256
    keyfile: /etc/corosync/authkey
    interface {
        linknumber: 0
        knet_transport: udp
        knet_link_priority: 0
    }
    interface {
        linknumber: 1
        knet_transport: udp
        knet_link_priority: 1
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
    # expected_votes: 2
}

nodelist {
    node {
        ring0_addr: xxx.xxx.xxx.xxx
        ring1_addr: zzz.zzz.zzz.zzx
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: xxx.xxx.xxx.xxy
        ring1_addr: zzz.zzz.zzz.zzy
        name: node2
        nodeid: 2
    }
}

logging {
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/corosync/corosync.log
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/