On 18/11/2019 21:31, Jean-Francois Malouin wrote:
Hi,
Maybe not directly a pacemaker question, but perhaps some of you have seen this
problem:
A 2-node pacemaker cluster running corosync-3.0.1 with dual communication rings
sometimes reports errors like this in the corosync log file:
[KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
[KNET ] pmtud: PMTUD link change for host: 2 link: 1 from 470 to 1366
[KNET ] pmtud: Global data MTU changed to: 1366
[CFG ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
[CFG ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
Those do not happen very frequently, once a week or so...
Those messages are caused by a config file reload (corosync-cfgtool -R)
being triggered by something. If they're happening once a week then
check your cron jobs.
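In case it helps track that down, something along these lines should show whether
a scheduled job or some other process is issuing the reload (the paths are just
examples and vary by distro):

    # look for anything that calls corosync-cfgtool from cron or a systemd timer
    grep -r "corosync-cfgtool" /etc/cron* /var/spool/cron 2>/dev/null
    systemctl list-timers --all
    # the reload should also be visible in the journal around the time of the messages
    journalctl -u corosync --since "7 days ago" | grep -iE "reload|netmtu"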
However the system log on the nodes reports those much more frequently, a few
times a day:
Nov 17 23:26:20 node1 corosync[2258]: [KNET ] link: host: 2 link: 1 is down
Nov 17 23:26:20 node1 corosync[2258]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 0)
Nov 17 23:26:26 node1 corosync[2258]: [KNET ] rx: host: 2 link: 1 is up
Nov 17 23:26:26 node1 corosync[2258]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Those don't look good. Having a link down for 6 seconds looks like a
serious network outage that needs looking into, especially if they are
that frequent, or it could be a bug. You don't say which version of
libknet you have installed, but make sure it's the latest one.
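For reference, the current state of each knet link and the installed libknet
version can be checked with something like the following (package names differ
between distributions, so treat these as examples):

    corosync-cfgtool -s              # per-link status as corosync sees it
    dpkg -l | grep -i libknet        # Debian/Ubuntu
    rpm -qa | grep -i libknet        # RHEL/CentOS/SUSE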
The fencing event in your other message happened because both links
were down at the same time, which is a worrying coincidence. Changing
the token timeout won't make any difference to the knet link events, but
if the knet links are down for long enough then that will trigger a
token timeout and a fence event.
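To be clear, the knet link up/down detection is driven by its own per-link
heartbeat settings rather than by the totem token, which is why changing the
token on its own doesn't change those messages. If you ever need to tune the
link detection, the knobs live in the interface stanzas; the values below are
purely illustrative, not a recommendation:

    interface {
        linknumber: 0
        knet_ping_interval: 500    # ms between link heartbeats (example value)
        knet_ping_timeout: 1000    # ms without a reply before the link is marked down (example value)
        knet_pong_count: 2         # replies needed before a down link is marked up again (example value)
    }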
Definitely look for something odd in your networking - the corosync.conf
file looks sane (though having knet_transport in the top-level totem
stanza is doing nothing), so it's not that.
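Put differently, the transport is effectively a per-link setting, so the
top-level line can simply be dropped; the interface stanzas you already have
are the ones that count, roughly:

    totem {
        ...
        # knet_transport: sctp   <- ignored here; the per-interface values below win
        interface {
            linknumber: 0
            knet_transport: udp
            knet_link_priority: 0
        }
        ...
    }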
It's hard to make a judgement with just that info, but look for dropped
packets on the interfaces, slow response to other network services or
very high load on one of the nodes. If you can't see anything on the
systems then enable debug logging and get back to us. If it is a bug we
want it fixed!
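As a starting point, the usual checks look something like this (the interface
names here are just placeholders for whatever the two rings actually use):

    # dropped packets / errors on the ring interfaces
    ip -s link show eth0
    ethtool -S eth0 | grep -iE "drop|err"
    # load on the node
    uptime
    # to enable debug logging, set "debug: on" in the logging{} section of
    # /etc/corosync/corosync.conf on both nodes, then reload:
    corosync-cfgtool -R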
Chrissie
Are those to be dismissed or are they indicative of a network misconfig/problem?
I tried setting 'knet_transport: udpu' in the totem section (the default value)
but it didn't seem to make a difference... Hard-coding netmtu to 1500 and
allowing a longer (10s) token timeout also didn't seem to affect the issue.
Corosync config follows:
/etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: bicha
    transport: knet
    link_mode: passive
    ip_version: ipv4
    token: 10000
    netmtu: 1500
    knet_transport: sctp
    crypto_model: openssl
    crypto_hash: sha256
    crypto_cipher: aes256
    keyfile: /etc/corosync/authkey
    interface {
        linknumber: 0
        knet_transport: udp
        knet_link_priority: 0
    }
    interface {
        linknumber: 1
        knet_transport: udp
        knet_link_priority: 1
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
    # expected_votes: 2
}

nodelist {
    node {
        ring0_addr: xxx.xxx.xxx.xxx
        ring1_addr: zzz.zzz.zzz.zzx
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: xxx.xxx.xxx.xxy
        ring1_addr: zzz.zzz.zzz.zzy
        name: node2
        nodeid: 2
    }
}

logging {
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/corosync/corosync.log
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/