[ClusterLabs] corosync 3.0.1 on Debian/Buster reports some MTU errors

2019-11-18 Thread Jean-Francois Malouin
Hi,

This is maybe not directly a Pacemaker question, but some of you may have seen this
problem:

A 2-node Pacemaker cluster running corosync 3.0.1 with dual communication rings
sometimes reports errors like this in the corosync log file:

[KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 470 to 1366
[KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 470 to 1366
[KNET  ] pmtud: Global data MTU changed to: 1366
[CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time
[CFG   ] Modified entry 'totem.netmtu' in corosync.conf cannot be changed at run-time

Those do not happen very frequently, once a week or so...

However, the system log on the nodes reports these much more frequently, a few
times a day:

Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] link: host: 2 link: 1 is down
Nov 17 23:26:20 node1 corosync[2258]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] rx: host: 2 link: 1 is up
Nov 17 23:26:26 node1 corosync[2258]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)

Can these be dismissed, or are they indicative of a network misconfiguration or problem?
I tried setting 'knet_transport: udpu' in the totem section (the default value),
but it didn't seem to make a difference... Hard-coding netmtu to 1500 and
allowing a longer (10s) token timeout also didn't seem to affect the issue.
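
Would it make sense to tune the knet link-detection and PMTUD timers instead? My
understanding from corosync.conf(5) is that they can be set per link; this is a
sketch only, with purely illustrative values (not what I am running):

totem {
    # how often knet re-runs path MTU discovery, in seconds
    knet_pmtud_interval: 30
    interface {
        linknumber: 1
        knet_transport: udp
        knet_link_priority: 1
        # per-link heartbeat: ping every 500 ms and declare the link
        # down if no reply is seen within 2000 ms (illustrative values)
        knet_ping_interval: 500
        knet_ping_timeout: 2000
    }
}

I can also watch the link state and the negotiated MTU at runtime with
'corosync-cfgtool -s' (and, with corosync 3, the knet counters in the stats map
via corosync-cmapctl) to try to correlate the link-down events with real network
trouble.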


Corosync config follows:

/etc/corosync/corosync.conf

totem {
    version: 2
    cluster_name: bicha
    transport: knet
    link_mode: passive
    ip_version: ipv4
    token: 1
    netmtu: 1500
    knet_transport: sctp
    crypto_model: openssl
    crypto_hash: sha256
    crypto_cipher: aes256
    keyfile: /etc/corosync/authkey
    interface {
        linknumber: 0
        knet_transport: udp
        knet_link_priority: 0
    }
    interface {
        linknumber: 1
        knet_transport: udp
        knet_link_priority: 1
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
    #expected_votes: 2
}
nodelist {
    node {
        ring0_addr: xxx.xxx.xxx.xxx
        ring1_addr: zzz.zzz.zzz.zzx
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: xxx.xxx.xxx.xxy
        ring1_addr: zzz.zzz.zzz.zzy
        name: node2
        nodeid: 2
    }
}
logging {
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/corosync/corosync.log
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}


Re: [ClusterLabs] Antw: Re: Pacemaker 2.0.3-rc3 now available

2019-11-18 Thread Jehan-Guillaume de Rorthais
On Mon, 18 Nov 2019 10:45:25 -0600
Ken Gaillot  wrote:

> On Fri, 2019-11-15 at 14:35 +0100, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 14 Nov 2019 11:09:57 -0600
> > Ken Gaillot  wrote:
> >   
> > > On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:  
> > > > >>> Jehan-Guillaume de Rorthais wrote on 14.11.2019 at 15:17 in
> > > > >>> message <20191114151719.6cbf4e38@firost>:
> > > > > On Wed, 13 Nov 2019 17:30:31 -0600
> > > > > Ken Gaillot wrote:
> > > > > ...
> > > > > > A longstanding pain point in the logs has been improved. Whenever
> > > > > > the scheduler processes resource history, it logs a warning for any
> > > > > > failures it finds, regardless of whether they are new or old, which
> > > > > > can confuse anyone reading the logs. Now, the log will contain the
> > > > > > time of the failure, so it's obvious whether you're seeing the same
> > > > > > event or not. The log will also contain the exit reason if one was
> > > > > > provided by the resource agent, for easier troubleshooting.
> > > > > 
> > > > > I've been hurt by this in the past and I was wondering what the point
> > > > > is of warning again and again in the logs about past failures during
> > > > > scheduling. What does this information bring to the administrator?
> > > 
> > > The controller will log an event just once, when it happens.
> > > 
> > > The scheduler, on the other hand, uses the entire recorded resource
> > > history to determine the current resource state. Old failures (that
> > > haven't been cleaned) must be taken into account.  
> > 
> > OK, I wasn't aware of this. If you have a few minutes, I would be
> > interested to know why the full history is needed rather than just the
> > latest entry. Or maybe there are comments in the source code that
> > already cover this question?
> 
> The full *recorded* history consists of the most recent operation that
> affects the state (like start/stop/promote/demote), the most recent
> failed operation, and the most recent results of any recurring
> monitors.
> 
> For example there may be a failed monitor, but whether the resource is
> considered failed or not would depend on whether there was a more
> recent successful stop or start. Even if the failed monitor has been
> superseded, it needs to stay in the history for display purposes until
> the user has cleaned it up.

OK, understood.

Maybe that's why "FAILED" appears shortly in crm_mon during a resource move on
a clean resource, but with past failures? Maybe I should dig this weird
behavior and wrap up a bug report if I confirm this?

> > > Every run of the scheduler is completely independent, so it doesn't
> > > know about any earlier runs or what they logged. Think of it like
> > > Frosty the Snowman saying "Happy Birthday!" every time his hat is
> > > put on.
> > 
> > I don't have this ref :)  
> 
> I figured not everybody would, but it was too fun to pass up :)
> 
> The snowman comes to life every time his magic hat is put on, but to
> him each time feels like he's being born for the first time, so he says
> "Happy Birthday!"
> 
> https://www.youtube.com/watch?v=1PbWTEYoN8o

heh :)

> > > As far as each run is concerned, it is the first time it's seen the
> > > history. This is what allows the DC role to move from node to node, and
> > > the scheduler to be run as a simulation using a saved CIB file.
> > > 
> > > We could change the wording further if necessary. The previous version
> > > would log something like:
> > > 
> > > warning: Processing failed monitor of my-rsc on node1: not running
> > > 
> > > and this latest change will log it like:
> > > 
> > > warning: Unexpected result (not running: No process state file found)
> > > was recorded for monitor of my-rsc on node1 at Nov 12 19:19:02 2019
> > 
> > /result/state/ ?  
> 
> It's the result of a resource agent action, so it could be for example
> a timeout or a permissions issue.

ok

> > > I wanted to be explicit about the message being about processing
> > > resource history that may or may not be the first time it's been
> > > processed and logged, but everything I came up with seemed too long for
> > > a log line. Another possibility might be something like:
> > > 
> > > warning: Using my-rsc history to determine its current state on node1:
> > > Unexpected result (not running: No process state file found) was
> > > recorded for monitor at Nov 12 19:19:02 2019
> > 
> > I like the first one better.
> > 
> > However, it feels like implementation details exposed to the world,
> > doesn't it? How useful is this information to the end user? What can the
> > user do with it? There's nothing to fix, and this is not actually an
> > error of the currently running process.
> > 
> > I still 

Re: [ClusterLabs] Announcing ClusterLabs Summit 2020

2019-11-18 Thread Ken Gaillot
On Mon, 2019-11-18 at 16:06 +, Diego Akechi wrote:
> Hi Everyone,
> 
> Sorry for the late response here.
> 
> From SUSE, we are still collecting the final list of attendees. We
> already have 6 people confirmed, but most probably we will have around
> 10 people going.
> 
> I would like to propose two sessions about some of our current work:
> 
> 
> 1. Cluster monitoring capabilities based on the ha_cluster_exporter,
> Prometheus and Grafana
> 
> 2. Cluster deployment automation based on Salt.

Great, looking forward to it!

> If there is not enough time, we can shrink them into one slot.

I'm planning on 1 hour per talk on average (about 40-45 minutes of
speaking plus 10-15 minutes of Q&A and a few minutes between talks). If
you'd prefer more or less, let me know; otherwise you can plan on that.

> 
> On 15/10/2019 23:42, Ken Gaillot wrote:
> > I'm happy to announce that we have a date and location for the next
> > ClusterLabs Summit: Wednesday, Feb. 5, and Thursday, Feb. 6, 2020, in
> > Brno, Czechia. This year's host is Red Hat.
> > 
> > Details will be given on this wiki page as they become available:
> > 
> >   http://plan.alteeve.ca/index.php/HA_Cluster_Summit_2020
> > 
> > We are still in the early stages of organizing, and need your input.
> > 
> > Most importantly, we need a good idea of how many people will attend,
> > to ensure we have an appropriate conference room and amenities. The
> > wiki page has a section where you can say how many people from your
> > organization expect to attend. We don't need a firm commitment or an
> > immediate response, just let us know once you have a rough idea.
> > 
> > We also invite you to propose a talk, whether it's a talk you want to
> > give or something you are interested in hearing more about. The wiki
> > page has a section for that, too. Anything related to open-source
> > clustering is welcome: new features and plans for the cluster
> > software projects, how-to's and case histories for integrating
> > specific services into a cluster, utilizing specific
> > stonith/networking/etc. technologies in a cluster, tips for
> > administering a cluster, and so forth.
> > 
> > I'm excited about the chance for developers and users to meet in
> > person. Past summits have been helpful for shaping the direction of the
> > projects and strengthening the community. I look forward to seeing many
> > of you there!
> > 
> 
> -- 
> Diego V. Akechi 
> Engineering Manager HA Extension & SLES for SAP
> SUSE Software Solutions Germany GmbH
> Tel: +49-911-74053-373; Fax: +49-911-7417755;  https://www.suse.com/
> Maxfeldstr. 5, D-90409 Nürnberg
> HRB 247165 (AG München)
> Managing Director: Felix Imendörffer
-- 
Ken Gaillot 


Re: [ClusterLabs] Antw: Re: Pacemaker 2.0.3-rc3 now available

2019-11-18 Thread Ken Gaillot
On Fri, 2019-11-15 at 14:35 +0100, Jehan-Guillaume de Rorthais wrote:
> On Thu, 14 Nov 2019 11:09:57 -0600
> Ken Gaillot  wrote:
> 
> > On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:
> > > >>> Jehan-Guillaume de Rorthais wrote on 14.11.2019 at 15:17 in
> > > >>> message <20191114151719.6cbf4e38@firost>:
> > > > On Wed, 13 Nov 2019 17:30:31 -0600
> > > > Ken Gaillot wrote:
> > > > ...  
> > > > > A longstanding pain point in the logs has been improved. Whenever
> > > > > the scheduler processes resource history, it logs a warning for any
> > > > > failures it finds, regardless of whether they are new or old, which
> > > > > can confuse anyone reading the logs. Now, the log will contain the
> > > > > time of the failure, so it's obvious whether you're seeing the same
> > > > > event or not. The log will also contain the exit reason if one was
> > > > > provided by the resource agent, for easier troubleshooting.
> > > > 
> > > > I've been hurt by this in the past and I was wondering what the point
> > > > is of warning again and again in the logs about past failures during
> > > > scheduling. What does this information bring to the administrator?
> > 
> > The controller will log an event just once, when it happens.
> > 
> > The scheduler, on the other hand, uses the entire recorded resource
> > history to determine the current resource state. Old failures (that
> > haven't been cleaned) must be taken into account.
> 
> OK, I wasn't aware of this. If you have a few minutes, I would be
> interested to know why the full history is needed rather than just the
> latest entry. Or maybe there are comments in the source code that
> already cover this question?

The full *recorded* history consists of the most recent operation that
affects the state (like start/stop/promote/demote), the most recent
failed operation, and the most recent results of any recurring
monitors.

For example there may be a failed monitor, but whether the resource is
considered failed or not would depend on whether there was a more
recent successful stop or start. Even if the failed monitor has been
superseded, it needs to stay in the history for display purposes until
the user has cleaned it up.
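
To make that concrete, here is a trimmed, hypothetical sketch of what the
recorded history for one resource looks like in the CIB status section
(the resource, node, call-ids and timestamps are made up, and real
lrm_rsc_op entries carry more fields, such as transition keys and
operation digests):

<lrm_resource id="my-rsc" class="ocf" provider="heartbeat" type="Dummy">
  <!-- most recent state-affecting operation: a successful start -->
  <lrm_rsc_op id="my-rsc_last_0" operation="start" rc-code="0"
              on_node="node1" call-id="12" last-rc-change="1573590000"/>
  <!-- most recent failed operation: kept until it is cleaned up -->
  <lrm_rsc_op id="my-rsc_last_failure_0" operation="monitor" rc-code="7"
              exit-reason="No process state file found"
              on_node="node1" call-id="10" last-rc-change="1573586342"/>
  <!-- most recent result of the recurring 10-second monitor -->
  <lrm_rsc_op id="my-rsc_monitor_10000" operation="monitor" interval="10000"
              rc-code="0" on_node="node1" call-id="14"
              last-rc-change="1573590010"/>
</lrm_resource>

In this sketch the failed monitor (rc-code 7, "not running") is older than
the successful start, so the resource is considered running, but the
failure entry still has to be processed, and reported, on every scheduler
run until someone cleans it up or a failure-timeout expires.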

> > Every run of the scheduler is completely independent, so it doesn't
> > know about any earlier runs or what they logged. Think of it like
> > Frosty the Snowman saying "Happy Birthday!" every time his hat is
> > put on.
> 
> I don't have this ref :)

I figured not everybody would, but it was too fun to pass up :)

The snowman comes to life every time his magic hat is put on, but to
him each time feels like he's being born for the first time, so he says
"Happy Birthday!"

https://www.youtube.com/watch?v=1PbWTEYoN8o


> > As far as each run is concerned, it is the first time it's seen the
> > history. This is what allows the DC role to move from node to node, and
> > the scheduler to be run as a simulation using a saved CIB file.
> > 
> > We could change the wording further if necessary. The previous version
> > would log something like:
> > 
> > warning: Processing failed monitor of my-rsc on node1: not running
> > 
> > and this latest change will log it like:
> > 
> > warning: Unexpected result (not running: No process state file found)
> > was recorded for monitor of my-rsc on node1 at Nov 12 19:19:02 2019
> 
> /result/state/ ?

It's the result of a resource agent action, so it could be for example
a timeout or a permissions issue.

> > I wanted to be explicit about the message being about processing
> > resource history that may or may not be the first time it's been
> > processed and logged, but everything I came up with seemed too long for
> > a log line. Another possibility might be something like:
> > 
> > warning: Using my-rsc history to determine its current state on node1:
> > Unexpected result (not running: No process state file found) was
> > recorded for monitor at Nov 12 19:19:02 2019
> 
> I like the first one better.
> 
> However, it feels like implementation details exposed to the world,
> doesn't it? How useful is this information to the end user? What can the
> user do with it? There's nothing to fix, and this is not actually an
> error of the currently running process.
> 
> I still fail to understand why the scheduler doesn't process the history
> silently, whatever it finds there, and then warn about something really
> important only if the final result is not the expected one...

From the scheduler's point of view, it's all relevant information that
goes into the decision making. Even an old failure can cause new
actions, for example if quorum was not held at the time but has now
been reached, or if there is a failure-timeout that just expired. So
any failure history is important to understanding whatever the
scheduler says needs to be

Re: [ClusterLabs] Announcing ClusterLabs Summit 2020

2019-11-18 Thread Diego Akechi
Hi Everyone,

Sorry for the late response here.

From SUSE, we are still collecting the final list of attendees. We
already have 6 people confirmed, but most probably we will have around
10 people going.

I would like to propose two sessions about some of our current work:


1. Cluster monitoring capabilities based on the ha_cluster_exporter,
Prometheus and Grafana

2. Cluster deployment automation based on Salt.


If there is not enough time, we can shrink them into one slot.


On 15/10/2019 23:42, Ken Gaillot wrote:
> I'm happy to announce that we have a date and location for the next
> ClusterLabs Summit: Wednesday, Feb. 5, and Thursday, Feb. 6, 2020, in
> Brno, Czechia. This year's host is Red Hat.
> 
> Details will be given on this wiki page as they become available:
> 
>   http://plan.alteeve.ca/index.php/HA_Cluster_Summit_2020
> 
> We are still in the early stages of organizing, and need your input.
> 
> Most importantly, we need a good idea of how many people will attend,
> to ensure we have an appropriate conference room and amenities. The
> wiki page has a section where you can say how many people from your
> organization expect to attend. We don't need a firm commitment or an
> immediate response, just let us know once you have a rough idea.
> 
> We also invite you to propose a talk, whether it's a talk you want to
> give or something you are interested in hearing more about. The wiki
> page has a section for that, too. Anything related to open-source
> clustering is welcome: new features and plans for the cluster software 
> projects, how-to's and case histories for integrating specific services into 
> a cluster, utilizing specific stonith/networking/etc. technologies in a 
> cluster, tips for administering a cluster, and so forth.
> 
> I'm excited about the chance for developers and users to meet in
> person. Past summits have been helpful for shaping the direction of the
> projects and strengthening the community. I look forward to seeing many
> of you there!
> 

-- 
Diego V. Akechi 
Engineering Manager HA Extension & SLES for SAP
SUSE Software Solutions Germany GmbH
Tel: +49-911-74053-373; Fax: +49-911-7417755;  https://www.suse.com/
Maxfeldstr. 5, D-90409 Nürnberg
HRB 247165 (AG München)
Managing Director: Felix Imendörffer

Re: [ClusterLabs] Announcing ClusterLabs Summit 2020

2019-11-18 Thread Ken Gaillot
Great! I've added you to the list.

On Fri, 2019-11-15 at 09:50 +, John Colgrave wrote:
> We are planning for two people from the IBM MQ development team to
> attend. 
> 
> Regards,
> 
> John Colgrave
> 
> Disaster Recovery and High Availability Architect
> IBM MQ
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with
> number 741598. 
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire
> PO6 3AU
-- 
Ken Gaillot 



[ClusterLabs] Antw: Documentation driven predictability of pacemaker commands (Was: Pacemaker 2.0.3-rc3 now available)

2019-11-18 Thread Ulrich Windl
>>> Jan Pokorný wrote on 15.11.2019 at 11:52 in message
<20191115105233.gc23...@redhat.com>:

...
> - on top of the previous, what exactly do we gain from appending
>   --text-fancy?  Sadly, I observed no difference in a basic
>   use case

I'd expect fancy colors (if the terminal is capable of doing that) ;-)
...

Regards,
Ulrich


