[ClusterLabs] corosync 3.0.1 on Debian/Buster reports some MTU errors
Hi,

Maybe not directly a pacemaker question, but maybe some of you have seen this problem: a 2-node pacemaker cluster running corosync-3.0.1 with a dual communication ring sometimes reports errors like this in the corosync log file:

[KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
[KNET ] pmtud: PMTUD link change for host: 2 link: 1 from 470 to 1366
[KNET ] pmtud: Global data MTU changed to: 1366
[CFG  ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
[CFG  ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time

Those do not happen very frequently, once a week or so... However, the system log on the nodes reports these much more frequently, a few times a day:

Nov 17 23:26:20 node1 corosync[2258]:   [KNET ] link: host: 2 link: 1 is down
Nov 17 23:26:20 node1 corosync[2258]:   [KNET ] host: host: 2 (passive) best link: 0 (pri: 0)
Nov 17 23:26:26 node1 corosync[2258]:   [KNET ] rx: host: 2 link: 1 is up
Nov 17 23:26:26 node1 corosync[2258]:   [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)

Are those to be dismissed, or are they indicative of a network misconfiguration/problem? I tried setting 'knet_transport: udpu' in the totem section (the default value), but it didn't seem to make a difference. Hard-coding netmtu to 1500 and allowing a longer (10s) token timeout also didn't seem to affect the issue.
Corosync config follows:

/etc/corosync/corosync.conf

totem {
    version: 2
    cluster_name: bicha
    transport: knet
    link_mode: passive
    ip_version: ipv4
    token: 1
    netmtu: 1500
    knet_transport: sctp
    crypto_model: openssl
    crypto_hash: sha256
    crypto_cipher: aes256
    keyfile: /etc/corosync/authkey
    interface {
        linknumber: 0
        knet_transport: udp
        knet_link_priority: 0
    }
    interface {
        linknumber: 1
        knet_transport: udp
        knet_link_priority: 1
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
    #expected_votes: 2
}

nodelist {
    node {
        ring0_addr: xxx.xxx.xxx.xxx
        ring1_addr: zzz.zzz.zzz.zzx
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: xxx.xxx.xxx.xxy
        ring1_addr: zzz.zzz.zzz.zzy
        name: node2
        nodeid: 2
    }
}

logging {
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/corosync/corosync.log
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Antw: Re: Pacemaker 2.0.3-rc3 now available
On Mon, 18 Nov 2019 10:45:25 -0600 Ken Gaillot wrote:
> On Fri, 2019-11-15 at 14:35 +0100, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 14 Nov 2019 11:09:57 -0600
> > Ken Gaillot wrote:
> > > On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:
> > > > >>> Jehan-Guillaume de Rorthais wrote on 14.11.2019 at 15:17 in
> > > > message <20191114151719.6cbf4e38@firost>:
> > > > > On Wed, 13 Nov 2019 17:30:31 -0600
> > > > > Ken Gaillot wrote:
> > > > > ...
> > > > > > A longstanding pain point in the logs has been improved. Whenever
> > > > > > the scheduler processes resource history, it logs a warning for
> > > > > > any failures it finds, regardless of whether they are new or old,
> > > > > > which can confuse anyone reading the logs. Now, the log will
> > > > > > contain the time of the failure, so it's obvious whether you're
> > > > > > seeing the same event or not. The log will also contain the exit
> > > > > > reason if one was provided by the resource agent, for easier
> > > > > > troubleshooting.
> > > > >
> > > > > I've been hurt by this in the past and I was wondering what the
> > > > > point was of warning again and again in the logs about past
> > > > > failures during scheduling. What does this information bring to
> > > > > the administrator?
> > >
> > > The controller will log an event just once, when it happens.
> > >
> > > The scheduler, on the other hand, uses the entire recorded resource
> > > history to determine the current resource state. Old failures (that
> > > haven't been cleaned) must be taken into account.
> >
> > OK, I wasn't aware of this. If you have a few minutes, I would be
> > interested to know why the full history is needed, rather than just
> > finding the latest entry there. Or maybe there are comments in the
> > source code that already cover this question?
>
> The full *recorded* history consists of the most recent operation that
> affects the state (like start/stop/promote/demote), the most recent
> failed operation, and the most recent results of any recurring monitors.
>
> For example there may be a failed monitor, but whether the resource is
> considered failed or not would depend on whether there was a more recent
> successful stop or start. Even if the failed monitor has been superseded,
> it needs to stay in the history for display purposes until the user has
> cleaned it up.

OK, understood. Maybe that's why "FAILED" appears briefly in crm_mon during a resource move on a clean resource that has past failures? Maybe I should dig into this weird behavior and write up a bug report if I confirm it?

> > > Every run of the scheduler is completely independent, so it doesn't
> > > know about any earlier runs or what they logged. Think of it like
> > > Frosty the Snowman saying "Happy Birthday!" every time his hat is
> > > put on.
> >
> > I don't have this ref :)
>
> I figured not everybody would, but it was too fun to pass up :)
>
> The snowman comes to life every time his magic hat is put on, but to
> him each time feels like he's being born for the first time, so he says
> "Happy Birthday!"
>
> https://www.youtube.com/watch?v=1PbWTEYoN8o

heh :)

> > > As far as each run is concerned, it is the first time it's seen the
> > > history. This is what allows the DC role to move from node to node,
> > > and the scheduler to be run as a simulation using a saved CIB file.
> > >
> > > We could change the wording further if necessary. The previous
> > > version would log something like:
> > >
> > > warning: Processing failed monitor of my-rsc on node1: not running
> > >
> > > and this latest change will log it like:
> > >
> > > warning: Unexpected result (not running: No process state file
> > > found) was recorded for monitor of my-rsc on node1 at Nov 12
> > > 19:19:02 2019
> >
> > /result/state/ ?
>
> It's the result of a resource agent action, so it could be for example
> a timeout or a permissions issue.

OK.

> > > I wanted to be explicit about the message being about processing
> > > resource history that may or may not be the first time it's been
> > > processed and logged, but everything I came up with seemed too long
> > > for a log line. Another possibility might be something like:
> > >
> > > warning: Using my-rsc history to determine its current state on
> > > node1: Unexpected result (not running: No process state file found)
> > > was recorded for monitor at Nov 12 19:19:02 2019
> >
> > I like the first one better.
> >
> > However, it feels like implementation details exposed to the world,
> > doesn't it? How useful is this information to the end user? What can
> > the user do with this information? There's nothing to fix, and this
> > is not actually an error of the current running process.
> >
> > I still
Re: [ClusterLabs] Announcing ClusterLabs Summit 2020
On Mon, 2019-11-18 at 16:06 +, Diego Akechi wrote:
> Hi Everyone,
>
> Sorry for the late response here.
>
> From SUSE, we are still collecting the final list of attendees, but we
> already have 6 people confirmed; most probably we will have around 10
> people going.
>
> I would like to propose two sessions about some of our current work:
>
> 1. Cluster monitoring capabilities based on the ha_cluster_exporter,
>    Prometheus and Grafana
>
> 2. Cluster deployment automation based on Salt.

Great, looking forward to it!

> If there is not enough time, we can shrink them into one slot.

I'm planning on 1 hour per talk on average (about 40-45 minutes speaking plus 10-15 minutes Q&A and a few minutes between talks). If you'd prefer more or less, let me know, but you can plan on that.

> On 15/10/2019 23:42, Ken Gaillot wrote:
> > I'm happy to announce that we have a date and location for the next
> > ClusterLabs Summit: Wednesday, Feb. 5, and Thursday, Feb. 6, 2020,
> > in Brno, Czechia. This year's host is Red Hat.
> >
> > Details will be given on this wiki page as they become available:
> >
> > http://plan.alteeve.ca/index.php/HA_Cluster_Summit_2020
> >
> > We are still in the early stages of organizing, and need your input.
> >
> > Most importantly, we need a good idea of how many people will attend,
> > to ensure we have an appropriate conference room and amenities. The
> > wiki page has a section where you can say how many people from your
> > organization expect to attend. We don't need a firm commitment or an
> > immediate response, just let us know once you have a rough idea.
> >
> > We also invite you to propose a talk, whether it's a talk you want
> > to give or something you are interested in hearing more about. The
> > wiki page has a section for that, too.
> > Anything related to open-source clustering is welcome: new features
> > and plans for the cluster software projects, how-tos and case
> > histories for integrating specific services into a cluster, utilizing
> > specific stonith/networking/etc. technologies in a cluster, tips for
> > administering a cluster, and so forth.
> >
> > I'm excited about the chance for developers and users to meet in
> > person. Past summits have been helpful for shaping the direction of
> > the projects and strengthening the community. I look forward to
> > seeing many of you there!
>
> --
> Diego V. Akechi
> Engineering Manager HA Extension & SLES for SAP
> SUSE Software Solutions Germany GmbH
> Tel: +49-911-74053-373; Fax: +49-911-7417755; https://www.suse.com/
> Maxfeldstr. 5, D-90409 Nürnberg
> HRB 247165 (AG München)
> Managing Director: Felix Imendörffer

--
Ken Gaillot
Re: [ClusterLabs] Antw: Re: Pacemaker 2.0.3-rc3 now available
On Fri, 2019-11-15 at 14:35 +0100, Jehan-Guillaume de Rorthais wrote:
> On Thu, 14 Nov 2019 11:09:57 -0600
> Ken Gaillot wrote:
> > On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:
> > > >>> Jehan-Guillaume de Rorthais wrote on 14.11.2019 at 15:17 in
> > > message <20191114151719.6cbf4e38@firost>:
> > > > On Wed, 13 Nov 2019 17:30:31 -0600
> > > > Ken Gaillot wrote:
> > > > ...
> > > > > A longstanding pain point in the logs has been improved. Whenever
> > > > > the scheduler processes resource history, it logs a warning for
> > > > > any failures it finds, regardless of whether they are new or old,
> > > > > which can confuse anyone reading the logs. Now, the log will
> > > > > contain the time of the failure, so it's obvious whether you're
> > > > > seeing the same event or not. The log will also contain the exit
> > > > > reason if one was provided by the resource agent, for easier
> > > > > troubleshooting.
> > > >
> > > > I've been hurt by this in the past and I was wondering what the
> > > > point was of warning again and again in the logs about past
> > > > failures during scheduling. What does this information bring to
> > > > the administrator?
> >
> > The controller will log an event just once, when it happens.
> >
> > The scheduler, on the other hand, uses the entire recorded resource
> > history to determine the current resource state. Old failures (that
> > haven't been cleaned) must be taken into account.
>
> OK, I wasn't aware of this. If you have a few minutes, I would be
> interested to know why the full history is needed, rather than just
> finding the latest entry there. Or maybe there are comments in the
> source code that already cover this question?
The full *recorded* history consists of the most recent operation that affects the state (like start/stop/promote/demote), the most recent failed operation, and the most recent results of any recurring monitors.

For example, there may be a failed monitor, but whether the resource is considered failed or not would depend on whether there was a more recent successful stop or start. Even if the failed monitor has been superseded, it needs to stay in the history for display purposes until the user has cleaned it up.

> > Every run of the scheduler is completely independent, so it doesn't
> > know about any earlier runs or what they logged. Think of it like
> > Frosty the Snowman saying "Happy Birthday!" every time his hat is
> > put on.
>
> I don't have this ref :)

I figured not everybody would, but it was too fun to pass up :)

The snowman comes to life every time his magic hat is put on, but to him each time feels like he's being born for the first time, so he says "Happy Birthday!"

https://www.youtube.com/watch?v=1PbWTEYoN8o

> > As far as each run is concerned, it is the first time it's seen the
> > history. This is what allows the DC role to move from node to node,
> > and the scheduler to be run as a simulation using a saved CIB file.
> >
> > We could change the wording further if necessary. The previous
> > version would log something like:
> >
> > warning: Processing failed monitor of my-rsc on node1: not running
> >
> > and this latest change will log it like:
> >
> > warning: Unexpected result (not running: No process state file
> > found) was recorded for monitor of my-rsc on node1 at Nov 12
> > 19:19:02 2019
>
> /result/state/ ?

It's the result of a resource agent action, so it could be for example a timeout or a permissions issue.
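[Editor's note: the logic described above — deriving the current resource state from the full recorded history rather than just the latest entry — can be sketched roughly like this. This is a hypothetical Python simplification for illustration only; the names and data model are invented here and are not Pacemaker's actual code or CIB representation.]

```python
from dataclasses import dataclass

@dataclass
class HistoryEntry:
    op: str         # operation type: "start", "stop", "promote", "demote", "monitor"
    rc: int         # resource agent exit code; 0 means success (7 would be "not running")
    timestamp: int  # when the result was recorded

# Operations that change what state the resource is in.
STATE_AFFECTING = ("start", "stop", "promote", "demote")

def current_state(history):
    """Derive a resource's state from its full recorded history.

    The latest entry alone is not enough: a failed monitor only means
    "failed now" if nothing state-affecting succeeded after it, yet the
    failure entry must remain in the history for display until cleaned up.
    """
    state_ops = [e for e in history if e.op in STATE_AFFECTING]
    failures = [e for e in history if e.rc != 0]
    last_state = max(state_ops, key=lambda e: e.timestamp, default=None)
    last_fail = max(failures, key=lambda e: e.timestamp, default=None)
    if last_fail and (last_state is None or last_fail.timestamp >= last_state.timestamp):
        return "failed"
    if last_state and last_state.op in ("start", "promote") and last_state.rc == 0:
        return "started"
    return "stopped"

# A failed monitor superseded by a later successful start: the resource
# counts as started, but the old failure entry is still there for display.
history = [
    HistoryEntry("start", 0, 1),
    HistoryEntry("monitor", 7, 2),
    HistoryEntry("start", 0, 3),
]
print(current_state(history))  # started
```

Under this (simplified) model it is also clear why the scheduler must re-read the whole history on every run: each run is independent, so the superseded failure is re-encountered, and re-logged, every time.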
> > I wanted to be explicit about the message being about processing
> > resource history that may or may not be the first time it's been
> > processed and logged, but everything I came up with seemed too long
> > for a log line. Another possibility might be something like:
> >
> > warning: Using my-rsc history to determine its current state on
> > node1: Unexpected result (not running: No process state file found)
> > was recorded for monitor at Nov 12 19:19:02 2019
>
> I like the first one better.
>
> However, it feels like implementation details exposed to the world,
> doesn't it? How useful is this information to the end user? What can
> the user do with this information? There's nothing to fix, and this is
> not actually an error of the current running process.
>
> I still fail to understand why the scheduler doesn't process the
> history silently, whatever it finds there, then warn about something
> really important if the final result is not as expected...

From the scheduler's point of view, it's all relevant information that goes into the decision making. Even an old failure can cause new actions, for example if quorum was not held at the time but has now been reached, or if there is a failure-timeout that just expired. So any failure history is important to understanding whatever the scheduler says needs to be
Re: [ClusterLabs] Announcing ClusterLabs Summit 2020
Hi Everyone,

Sorry for the late response here.

From SUSE, we are still collecting the final list of attendees, but we already have 6 people confirmed; most probably we will have around 10 people going.

I would like to propose two sessions about some of our current work:

1. Cluster monitoring capabilities based on the ha_cluster_exporter,
   Prometheus and Grafana

2. Cluster deployment automation based on Salt.

If there is not enough time, we can shrink them into one slot.

On 15/10/2019 23:42, Ken Gaillot wrote:
> I'm happy to announce that we have a date and location for the next
> ClusterLabs Summit: Wednesday, Feb. 5, and Thursday, Feb. 6, 2020, in
> Brno, Czechia. This year's host is Red Hat.
>
> Details will be given on this wiki page as they become available:
>
> http://plan.alteeve.ca/index.php/HA_Cluster_Summit_2020
>
> We are still in the early stages of organizing, and need your input.
>
> Most importantly, we need a good idea of how many people will attend,
> to ensure we have an appropriate conference room and amenities. The
> wiki page has a section where you can say how many people from your
> organization expect to attend. We don't need a firm commitment or an
> immediate response, just let us know once you have a rough idea.
>
> We also invite you to propose a talk, whether it's a talk you want to
> give or something you are interested in hearing more about. The wiki
> page has a section for that, too. Anything related to open-source
> clustering is welcome: new features and plans for the cluster software
> projects, how-tos and case histories for integrating specific services
> into a cluster, utilizing specific stonith/networking/etc. technologies
> in a cluster, tips for administering a cluster, and so forth.
>
> I'm excited about the chance for developers and users to meet in
> person. Past summits have been helpful for shaping the direction of the
> projects and strengthening the community. I look forward to seeing many
> of you there!
--
Diego V. Akechi
Engineering Manager HA Extension & SLES for SAP
SUSE Software Solutions Germany GmbH
Tel: +49-911-74053-373; Fax: +49-911-7417755; https://www.suse.com/
Maxfeldstr. 5, D-90409 Nürnberg
HRB 247165 (AG München)
Managing Director: Felix Imendörffer
Re: [ClusterLabs] Announcing ClusterLabs Summit 2020
Great! I've added you to the list.

On Fri, 2019-11-15 at 09:50 +, John Colgrave wrote:
> We are planning for two people from the IBM MQ development team to
> attend.
>
> Regards,
>
> John Colgrave
>
> Disaster Recovery and High Availability Architect
> IBM MQ
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with
> number 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire
> PO6 3AU

--
Ken Gaillot
[ClusterLabs] Antw: Documentation driven predictability of pacemaker commands (Was: Pacemaker 2.0.3-rc3 now available)
>>> Jan Pokorný wrote on 15.11.2019 at 11:52 in message
<20191115105233.gc23...@redhat.com>:
...
> - on top of previous, what exactly do we gain from appending
>   --text-fancy? sadly, I observed no difference in a basic
>   use case

I'd expect fancy colors (if the terminal is capable of doing that) ;-)

...

Regards,
Ulrich