Hi,

* christine caulfield <ccaul...@redhat.com> [20191121 03:19]:
> On 18/11/2019 21:31, Jean-Francois Malouin wrote:
> > Hi,
> >
> > Maybe not directly a pacemaker question but maybe some of you have seen this
> > problem:
> >
> > A 2 node pacemaker cluster running corosync-3.0.1 with dual communication ring
> > sometimes reports errors like this in the corosync log file:
> >
> > [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
> > [KNET ] pmtud: PMTUD link change for host: 2 link: 1 from 470 to 1366
> > [KNET ] pmtud: Global data MTU changed to: 1366
> > [CFG ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
> > [CFG ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
> >
> > Those do not happen very frequently, once a week or so...
>
> Those messages are caused by a config file reload (corosync-cfgtool -R)
> being triggered by something. If they're happening once a week then check
> your cron jobs.

No cron job at work here, but maybe they originate from my own doing, after a
reload, as you suggest.

> > However the system log on the nodes reports those much more frequently, a few
> > times a day:
> >
> > Nov 17 23:26:20 node1 corosync[2258]: [KNET ] link: host: 2 link: 1 is down
> > Nov 17 23:26:20 node1 corosync[2258]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 0)
> > Nov 17 23:26:26 node1 corosync[2258]: [KNET ] rx: host: 2 link: 1 is up
> > Nov 17 23:26:26 node1 corosync[2258]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
>
> Those don't look good. Having a link down for 6 seconds looks like a serious
> network outage that needs looking into, especially if they are that
> frequent, or it could be a bug. You don't say which version of libknet you
> have installed but make sure it's the latest one.

libknet1 is 1.8-2, which is the latest one in the Debian buster distro.

> The fencing event in your other message was caused because both links were
> down at the same time, which is a worrying coincidence. Changing the token
> timeout won't make any difference to the knet link events, but if the knet
> links are down for long enough then that will trigger a token timeout and a
> fence event.
>
> Definitely look for something odd in your networking - the corosync.conf
> file looks sane (though having knet_transport in the top-level totem stanza
> is doing nothing), so it's not that.
>
> It's hard to make a judgement with just that info, but look for dropped
> packets on the interfaces, slow response to other network services or very
> high load on one of the nodes. If you can't see anything on the systems then
> enable debug logging and get back to us. If it is a bug we want it fixed!

Since that network outage no errors have crept into the corosync logs (I have
debug enabled). I suspect, as you mention, a hardware issue, at the NIC level,
or cabling. I do notice quite a few dropped packets from one of the links...
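
As an aside, here is a minimal Python sketch for timing link flaps like the
ones in the syslog excerpt quoted above. It pairs each "is down" line with the
next "is up" line for the same host and link and reports how long the link was
down. The names link_outages, EVENT_RE and SAMPLE, and the year argument, are
my own assumptions for illustration (syslog timestamps carry no year); the
sample lines are the ones quoted above, unwrapped. It only assumes the line
format shown in this thread, so treat it as a starting point rather than a
complete parser.

```python
import re
from datetime import datetime

# Matches KNET link state lines in the format quoted in this thread, e.g.
#   Nov 17 23:26:20 node1 corosync[2258]: [KNET ] link: host: 2 link: 1 is down
#   Nov 17 23:26:26 node1 corosync[2258]: [KNET ] rx: host: 2 link: 1 is up
# The "best link" lines are deliberately not matched.
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+corosync\[\d+\]:\s+"
    r"\[KNET\s*\]\s+(?:link|rx):\s+host:\s+(?P<host>\d+)\s+"
    r"link:\s+(?P<link>\d+)\s+is\s+(?P<state>up|down)"
)


def link_outages(lines, year=2019):
    """Yield (host, link, down_at, seconds_down) for each down/up pair.

    Assumptions: syslog timestamps have no year, so one is supplied by the
    caller; month names are parsed with %b, which expects an English/C locale;
    a log spanning New Year is not handled.
    """
    down_since = {}  # (host, link) -> datetime of the first unmatched "down"
    for line in lines:
        m = EVENT_RE.match(line)
        if m is None:
            continue
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        key = (int(m.group("host")), int(m.group("link")))
        if m.group("state") == "down":
            # Keep the earliest "down" if several arrive before an "up".
            down_since.setdefault(key, ts)
        elif key in down_since:
            start = down_since.pop(key)
            yield key[0], key[1], start, (ts - start).total_seconds()


# The four syslog lines quoted earlier in this thread, unwrapped.
SAMPLE = [
    "Nov 17 23:26:20 node1 corosync[2258]: [KNET ] link: host: 2 link: 1 is down",
    "Nov 17 23:26:20 node1 corosync[2258]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 0)",
    "Nov 17 23:26:26 node1 corosync[2258]: [KNET ] rx: host: 2 link: 1 is up",
    "Nov 17 23:26:26 node1 corosync[2258]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)",
]

for host, link, start, secs in link_outages(SAMPLE):
    print(f"host {host} link {link} down at {start} for {secs:.0f}s")
```

To run it over a real log, pass an open file object instead of SAMPLE, since
the function accepts any iterable of lines. Which file holds the corosync
syslog output depends on the local syslog setup.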

Thanks for the reply,
jf

>
> Chrissie
>
> > Are those to be dismissed or are they indicative of a network
> > misconfig/problem?
> > I tried setting 'knet_transport: udpu' in the totem section (the default value)
> > but it didn't seem to make a difference...Hard coding netmtu to 1500 and
> > allowing for longer (10s) token timeout also didn't seem to affect the issue.
> >
> > Corosync config follows:
> >
> > /etc/corosync/corosync.conf
> >
> > totem {
> >     version: 2
> >     cluster_name: bicha
> >     transport: knet
> >     link_mode: passive
> >     ip_version: ipv4
> >     token: 10000
> >     netmtu: 1500
> >     knet_transport: sctp
> >     crypto_model: openssl
> >     crypto_hash: sha256
> >     crypto_cipher: aes256
> >     keyfile: /etc/corosync/authkey
> >     interface {
> >         linknumber: 0
> >         knet_transport: udp
> >         knet_link_priority: 0
> >     }
> >     interface {
> >         linknumber: 1
> >         knet_transport: udp
> >         knet_link_priority: 1
> >     }
> > }
> > quorum {
> >     provider: corosync_votequorum
> >     two_node: 1
> >     # expected_votes: 2
> > }
> > nodelist {
> >     node {
> >         ring0_addr: xxx.xxx.xxx.xxx
> >         ring1_addr: zzz.zzz.zzz.zzx
> >         name: node1
> >         nodeid: 1
> >     }
> >     node {
> >         ring0_addr: xxx.xxx.xxx.xxy
> >         ring1_addr: zzz.zzz.zzz.zzy
> >         name: node2
> >         nodeid: 2
> >     }
> > }
> > logging {
> >     to_logfile: yes
> >     to_syslog: yes
> >     logfile: /var/log/corosync/corosync.log
> >     syslog_facility: daemon
> >     debug: off
> >     timestamp: on
> >     logger_subsys {
> >         subsys: QUORUM
> >         debug: off
> >     }
> > }

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/