Re: [ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?
On 22/03/19 15:44 -0400, Brian Reichert wrote:
> On Fri, Mar 22, 2019 at 11:07:55AM +0100, Jan Pokorný wrote:
>> On 21/03/19 12:21 -0400, Brian Reichert wrote:
>>> I've followed several tutorials about setting up a simple three-node
>>> cluster, with no resources (yet), under CentOS 7.
>>>
>>> I've discovered the cluster won't restart upon rebooting a node.
>>>
>>> The other two nodes, however, do claim the cluster is up, as shown
>>> with 'pcs status cluster'.
>>
>> Please excuse the lack of understanding, perhaps owing to the Friday
>> mental power phenomena, but:
>>
>> 1. what do you mean by "cluster restart"?
>>    a local instance of cluster services being started anew once
>>    the node at hand finishes booting?
>
> I mean that when I reboot node1, node1 reports the cluster is up,
> via 'pcs cluster status'.

OK, so my assumption fits in this case, in the context of node1.
And only now can I see the source of this overall confusion; glad
you've tracked that down.

>> 2. why should a single malfunctioning node (out of three) irrefutably
>>    result in the dismounting of an otherwise healthy cluster?
>>    (if that's indeed what you presume)
>
> I don't presume that.

That was my slightly obtuse attempt to unfold the seemingly incomplete
or downright confusing picture (subject to interpreting "however" in
the original report).  Sorry for that; luckily, you seem to have all
the answers by now.

As for systemd, you can also use "systemctl is-active <service>", and
within the pcs context, "pcs status" will also serve a "Daemon Status"
section at the bottom of its output, where corosync should be detailed
unambiguously.

-- 
Jan (Poki)
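A minimal sketch of the two checks mentioned above, assuming the stock
corosync.service unit name that appears elsewhere in this thread:

  # Ask systemd directly; prints active/inactive/failed and sets the exit
  # status to match, so it is also usable from scripts.
  systemctl is-active corosync.service

  # pcs summarizes the same state; look for the "Daemon Status" section at
  # the bottom of the output, where corosync, pacemaker, and pcsd are listed.
  pcs status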
Re: [ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?
Brian,

> On Fri, Mar 22, 2019 at 08:57:20AM +0100, Jan Friesse wrote:
>>> - If I manually set 'totem.token' to a higher value, am I responsible
>>>   for tracking the number of nodes in the cluster, to keep in
>>>   alignment with what Red Hat's page says?
>>
>> Nope. I've tried to explain what is really happening in the manpage
>> corosync.conf(5). totem.token and totem.token_coefficient are used in
>> the following formula:
>
> I do see this under token_coefficient, thanks.
>
>> Corosync uses runtime.config.token.
>
> Cool; thanks.  Bumping up totem.token to 2000 got me over this hump.
>
>>> - Under these conditions, when corosync exits, why does it do so
>>>   with a zero status?  It seems to me that if it exited at all,
>>
>> That's a good question. How reproducible is the issue? Corosync
>> shouldn't "exit" with a zero status.
>
> If I leave totem.token set to the default, 100% in my case.
>
> I stand corrected; yesterday, it was 100%.  Today, I cannot reproduce
> this at all, even with reverting to the defaults.

That's sad.

> Here is a snippet of output from yesterday's experiments; this is
> based on a typescript capture file, so I apologize for the ANSI
> screen codes.

Yep, np. Looks just fine.

> - by default, systemd doesn't report full log lines.
>
> - by default, CentOS's config of systemd doesn't persist journaled
>   logs, so I can't directly review yesterday's efforts.
>
> - and, it looks like I misinterpreted the 'exited' message; corosync
>   was enabled and running, but the 'Process:' line doesn't report on
>   the 'corosync' process itself, but on some systemd utility.
>
>   (Let me count the ways I'm coming to dislike systemd...)
>
> I was able to recover logs from /var/log/messages, but other than the
> 'Consider token timeout increase' message, it looks hunky-dory.
>
> With what I've since learned:
>
> - I cannot explain why I can't reproduce the symptoms, even with
>   reverting to the defaults.
>
> - And without being able to reproduce, I can't pursue why 'pcs status
>   cluster' was actually failing for me. :/
>
> So, I appreciate your attention to this message, and I guess I'm off
> to further explore all of this.
>
>   [root@node1 ~]# systemctl status corosync.service
>   ● corosync.service - Corosync Cluster Engine
>      Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
>      Active: active (running) since Thu 2019-03-21 14:26:56 UTC; 1min 35s ago
>        Docs: man:corosync
>              man:corosync.conf
>              man:corosync_overview
>     Process: 5474 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
>    Main PID: 5490 (corosync)
>      CGroup: /system.slice/corosync.service
>              └─5490 corosync

As you can see, the corosync service unit in CentOS 7 executes an init
script which execs corosync and then waits until a connection to the
local IPC can be established.  The IPC connection can be established
only once corosync is ready.  The init script's timeout for the IPC is
1 minute, and its return code is 1 if the connection cannot be
established; on success the init script returns 0.

So ExecStart (the init script) exited with 0/SUCCESS = corosync was
successfully started and is running as PID 5490.

Regards,
  Honza
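A rough sketch of the startup pattern Honza describes, for illustration
only; this is not the actual /usr/share/corosync/corosync init script,
just the shape of its wait-for-IPC behavior, using a corosync-cmapctl
query as a stand-in probe for "local IPC is answering":

  #!/bin/sh
  # Start the daemon; it forks into the background on its own.
  /usr/sbin/corosync || exit 1

  # Poll local IPC for up to a minute; a cmapctl query only succeeds
  # once corosync's IPC is ready to answer.
  timeout=60
  while [ "$timeout" -gt 0 ]; do
      if corosync-cmapctl >/dev/null 2>&1; then
          exit 0          # systemd records this as status=0/SUCCESS
      fi
      sleep 1
      timeout=$((timeout - 1))
  done
  exit 1                  # IPC never came up -> non-zero exit status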
Re: [ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?
On Fri, Mar 22, 2019 at 11:07:55AM +0100, Jan Pokorný wrote:
> On 21/03/19 12:21 -0400, Brian Reichert wrote:
> > I've followed several tutorials about setting up a simple three-node
> > cluster, with no resources (yet), under CentOS 7.
> >
> > I've discovered the cluster won't restart upon rebooting a node.
> >
> > The other two nodes, however, do claim the cluster is up, as shown
> > with 'pcs status cluster'.
>
> Please excuse the lack of understanding, perhaps owing to the Friday
> mental power phenomena, but:
>
> 1. what do you mean by "cluster restart"?
>    a local instance of cluster services being started anew once
>    the node at hand finishes booting?

I mean that when I reboot node1, node1 reports the cluster is up,
via 'pcs cluster status'.

> 2. why should a single malfunctioning node (out of three) irrefutably
>    result in the dismounting of an otherwise healthy cluster?
>    (if that's indeed what you presume)

I don't presume that.

> -- 
> Jan (Poki)

-- 
Brian Reichert
BSD admin/developer at large
Re: [ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?
On Fri, Mar 22, 2019 at 08:57:20AM +0100, Jan Friesse wrote:
> > - If I manually set 'totem.token' to a higher value, am I responsible
> >   for tracking the number of nodes in the cluster, to keep in
> >   alignment with what Red Hat's page says?
>
> Nope. I've tried to explain what is really happening in the manpage
> corosync.conf(5). totem.token and totem.token_coefficient are used in
> the following formula:

I do see this under token_coefficient, thanks.

> Corosync uses runtime.config.token.

Cool; thanks.  Bumping up totem.token to 2000 got me over this hump.

> > - Under these conditions, when corosync exits, why does it do so
> >   with a zero status?  It seems to me that if it exited at all,
>
> That's a good question. How reproducible is the issue? Corosync
> shouldn't "exit" with a zero status.

If I leave totem.token set to the default, 100% in my case.

I stand corrected; yesterday, it was 100%.  Today, I cannot reproduce
this at all, even with reverting to the defaults.

Here is a snippet of output from yesterday's experiments; this is
based on a typescript capture file, so I apologize for the ANSI
screen codes.

- by default, systemd doesn't report full log lines.

- by default, CentOS's config of systemd doesn't persist journaled
  logs, so I can't directly review yesterday's efforts.

- and, it looks like I misinterpreted the 'exited' message; corosync
  was enabled and running, but the 'Process:' line doesn't report on
  the 'corosync' process itself, but on some systemd utility.

  (Let me count the ways I'm coming to dislike systemd...)

I was able to recover logs from /var/log/messages, but other than the
'Consider token timeout increase' message, it looks hunky-dory.

With what I've since learned:

- I cannot explain why I can't reproduce the symptoms, even with
  reverting to the defaults.

- And without being able to reproduce, I can't pursue why 'pcs status
  cluster' was actually failing for me. :/

So, I appreciate your attention to this message, and I guess I'm off
to further explore all of this.

  [root@node1 ~]# systemctl status corosync.service
  ● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
     Active: active (running) since Thu 2019-03-21 14:26:56 UTC; 1min 35s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 5474 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
   Main PID: 5490 (corosync)
     CGroup: /system.slice/corosync.service
             └─5490 corosync

> Honza

-- 
Brian Reichert
BSD admin/developer at large
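Regarding the non-persistent journal mentioned in the bullets above, a
small sketch of the usual CentOS 7 remedy; these are standard systemd
knobs, but double-check /etc/systemd/journald.conf on your own systems:

  # journald switches from the volatile /run/log/journal to persistent
  # storage once this directory exists (equivalently, set
  # Storage=persistent in /etc/systemd/journald.conf).
  mkdir -p /var/log/journal
  systemctl restart systemd-journald

  # Full, unwrapped log lines for a single unit, e.g. corosync:
  journalctl -u corosync.service --no-pager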
Re: [ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?
On Fri, Mar 22, 2019 at 1:08 PM Jan Pokorný wrote:
>
> Also a Friday's idea:
> Perhaps we should crank up a "how to ask" manual for this list

Yet another one?

http://www.catb.org/~esr/faqs/smart-questions.html
Re: [ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?
On 21/03/19 12:21 -0400, Brian Reichert wrote:
> I've followed several tutorials about setting up a simple three-node
> cluster, with no resources (yet), under CentOS 7.
>
> I've discovered the cluster won't restart upon rebooting a node.
>
> The other two nodes, however, do claim the cluster is up, as shown
> with 'pcs status cluster'.

Please excuse the lack of understanding, perhaps owing to the Friday
mental power phenomena, but:

1. what do you mean by "cluster restart"?
   a local instance of cluster services being started anew once
   the node at hand finishes booting?

2. why should a single malfunctioning node (out of three) irrefutably
   result in the dismounting of an otherwise healthy cluster?
   (if that's indeed what you presume)

Do you, in fact, expect the failed node to be rebooted automatically?
Do you observe a non-running cluster on the previously failed node
once it comes up again?

Frankly, I'm lost at deciphering your situation, and hence at least
part of your concerns.

* * *

Also a Friday's idea:
Perhaps we should crank up a "how to ask" manual for this list
(to be linked in the header) to attempt a smoother, less
time-consuming (time being a precious commodity) Q -> A flow,
to the benefit of the community.

-- 
Jan (Poki)
Re: [ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?
Brian,

> I've followed several tutorials about setting up a simple three-node
> cluster, with no resources (yet), under CentOS 7.
>
> I've discovered the cluster won't restart upon rebooting a node.
>
> The other two nodes, however, do claim the cluster is up, as shown
> with 'pcs status cluster'.
>
> I tracked down that on the rebooted node, corosync exited with a '0'
> status.  Nothing outright seems to be what I would call an error
> message, but this was recorded:
>
>   [MAIN  ] Corosync main process was not scheduled for 2145.7053 ms
>   (threshold is 1320 ms). Consider token timeout increase.
>
> This seems related:
>
>   https://access.redhat.com/solutions/1217663
>   High Availability cluster node logs the message "Corosync main
>   process was not scheduled for X ms (threshold is Y ms). Consider
>   token timeout increase."
>
> I've confirmed that corosync is running with the maximum realtime
> scheduling priority:
>
>   [root@node1 ~]# ps -eo cmd,rtprio | grep -e [c]orosync -e RTPRIO
>   CMD                         RTPRIO
>   corosync                        99
>
> I am doing my testing in an admittedly underprovisioned VM
> environment.  I've used this same environment for CentOS 6 /
> heartbeat-based solutions, and they were nowhere near as sensitive
> to these timing issues.
>
> Manually running 'pcs cluster start' does indeed fire everything up
> without a hitch, and it remains running for days at a crack.
>
> The 'Consider token timeout increase' message has me looking at this:
>
>   https://access.redhat.com/solutions/221263
>
> Which makes this assertion:
>
>   RHEL 7 or 8
>   If no token value is specified in the corosync configuration, the
>   default is 1000 ms (1 second) for a 2-node cluster, increasing by
>   650 ms for each additional member.
>
> I have a three-node cluster, and the arithmetic for totem.token
> seems to hold:
>
>   [root@node3 ~]# corosync-cmapctl | grep totem.token
>   runtime.config.totem.token (u32) = 1650
>   runtime.config.totem.token_retransmit (u32) = 392
>   runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
>
> I'm confused on a number of issues:
>
> - The 'totem.token' value of 1650 doesn't seem to be related to the
>   threshold number in the diagnostic message the corosync service
>   logged:
>
>     threshold is 1320 ms
>
>   Can someone explain the relationship between these values?

Yes. The threshold is 80% of the used token timeout.

> - If I manually set 'totem.token' to a higher value, am I responsible
>   for tracking the number of nodes in the cluster, to keep in
>   alignment with what Red Hat's page says?

Nope. I've tried to explain what is really happening in the manpage
corosync.conf(5). totem.token and totem.token_coefficient are used in
the following formula:

  runtime.config.token = totem.token
                         + (number_of_nodes - 2) * totem.token_coefficient

Corosync uses runtime.config.token.

> - Under these conditions, when corosync exits, why does it do so
>   with a zero status?  It seems to me that if it exited at all,

That's a good question. How reproducible is the issue? Corosync
shouldn't "exit" with a zero status.

>   without someone controllably stopping the service, it warrants a
>   non-zero status.
>
> - Is there a recommended way to alter either the pacemaker/corosync
>   or systemd configuration of these services to harden against
>   resource issues?

Enlarging the timeout seems like the right way to go.

>   I don't know if corosync's startup can be deferred until the CPU
>   load settles, or if some automatic retry can be set up...

This seems more like an init system question.

Regards,
  Honza

> Details of my environment; I'm happy to provide others, if anyone has
> any specific questions:
>
>   [root@node1 ~]# cat /etc/centos-release
>   CentOS Linux release 7.6.1810 (Core)
>
>   [root@node1 ~]# rpm -qa | egrep 'pacemaker|corosync'
>   corosynclib-2.4.3-4.el7.x86_64
>   pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
>   corosync-2.4.3-4.el7.x86_64
>   pacemaker-cli-1.1.19-8.el7_6.4.x86_64
>   pacemaker-1.1.19-8.el7_6.4.x86_64
>   pacemaker-libs-1.1.19-8.el7_6.4.x86_64
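Plugging this thread's numbers into the formula above ties the two
values together; a quick shell-arithmetic sketch using the 1000 ms /
650 ms defaults quoted from the Red Hat article:

  nodes=3
  token=1000          # totem.token default
  coefficient=650     # totem.token_coefficient default

  runtime_token=$(( token + (nodes - 2) * coefficient ))
  echo "runtime.config.totem.token = ${runtime_token} ms"               # 1650 ms

  # The "not scheduled for ... ms" warning fires at 80% of the token timeout:
  echo "scheduling-pause threshold = $(( runtime_token * 80 / 100 )) ms"  # 1320 ms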
[ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?
I've followed several tutorials about setting up a simple three-node
cluster, with no resources (yet), under CentOS 7.

I've discovered the cluster won't restart upon rebooting a node.

The other two nodes, however, do claim the cluster is up, as shown
with 'pcs status cluster'.

I tracked down that on the rebooted node, corosync exited with a '0'
status.  Nothing outright seems to be what I would call an error
message, but this was recorded:

  [MAIN  ] Corosync main process was not scheduled for 2145.7053 ms
  (threshold is 1320 ms). Consider token timeout increase.

This seems related:

  https://access.redhat.com/solutions/1217663
  High Availability cluster node logs the message "Corosync main
  process was not scheduled for X ms (threshold is Y ms). Consider
  token timeout increase."

I've confirmed that corosync is running with the maximum realtime
scheduling priority:

  [root@node1 ~]# ps -eo cmd,rtprio | grep -e [c]orosync -e RTPRIO
  CMD                         RTPRIO
  corosync                        99

I am doing my testing in an admittedly underprovisioned VM
environment.  I've used this same environment for CentOS 6 /
heartbeat-based solutions, and they were nowhere near as sensitive
to these timing issues.

Manually running 'pcs cluster start' does indeed fire everything up
without a hitch, and it remains running for days at a crack.

The 'Consider token timeout increase' message has me looking at this:

  https://access.redhat.com/solutions/221263

Which makes this assertion:

  RHEL 7 or 8
  If no token value is specified in the corosync configuration, the
  default is 1000 ms (1 second) for a 2-node cluster, increasing by
  650 ms for each additional member.

I have a three-node cluster, and the arithmetic for totem.token seems
to hold:

  [root@node3 ~]# corosync-cmapctl | grep totem.token
  runtime.config.totem.token (u32) = 1650
  runtime.config.totem.token_retransmit (u32) = 392
  runtime.config.totem.token_retransmits_before_loss_const (u32) = 4

I'm confused on a number of issues:

- The 'totem.token' value of 1650 doesn't seem to be related to the
  threshold number in the diagnostic message the corosync service
  logged:

    threshold is 1320 ms

  Can someone explain the relationship between these values?

- If I manually set 'totem.token' to a higher value, am I responsible
  for tracking the number of nodes in the cluster, to keep in
  alignment with what Red Hat's page says?

- Under these conditions, when corosync exits, why does it do so
  with a zero status?  It seems to me that if it exited at all,
  without someone controllably stopping the service, it warrants a
  non-zero status.

- Is there a recommended way to alter either the pacemaker/corosync
  or systemd configuration of these services to harden against
  resource issues?  I don't know if corosync's startup can be
  deferred until the CPU load settles, or if some automatic retry
  can be set up...

Details of my environment; I'm happy to provide others, if anyone has
any specific questions:

  [root@node1 ~]# cat /etc/centos-release
  CentOS Linux release 7.6.1810 (Core)

  [root@node1 ~]# rpm -qa | egrep 'pacemaker|corosync'
  corosynclib-2.4.3-4.el7.x86_64
  pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
  corosync-2.4.3-4.el7.x86_64
  pacemaker-cli-1.1.19-8.el7_6.4.x86_64
  pacemaker-1.1.19-8.el7_6.4.x86_64
  pacemaker-libs-1.1.19-8.el7_6.4.x86_64

-- 
Brian Reichert
BSD admin/developer at large
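For anyone finding this thread later: a sketch of the change that, per
Brian's follow-up above, got him past the problem (bumping totem.token
to 2000 ms).  The cluster name is a placeholder, and everything else in
the pcs-generated corosync.conf should be left as it was:

  # /etc/corosync/corosync.conf (totem section only)
  totem {
      version: 2
      cluster_name: mycluster    # placeholder; keep whatever pcs generated
      token: 2000                # ms; replaces the 1000 ms base default
      # token_coefficient: 650   # per-additional-node increment (default)
  }

After editing on one node, something like 'pcs cluster sync' to push
the file to the other nodes, followed by 'pcs cluster stop --all' and
'pcs cluster start --all', should make corosync pick the value up;
'corosync-cmapctl | grep runtime.config.totem.token' then confirms the
timeout actually in use.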