Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-09 Thread Ken Gaillot
On Mon, 2019-09-09 at 14:21 +0200, wf...@niif.hu wrote:
> Andrei Borzenkov  writes:
> 
> > 04.09.2019 0:27, wf...@niif.hu writes:
> > 
> > > Jeevan Patnaik  writes:
> > > 
> > > > [16187] node1 corosync warning [MAIN  ] Corosync main process
> > > > was not scheduled for 2889.8477 ms (threshold is 800.0000 ms).
> > > > Consider token timeout increase.
> > > > [...]
> > > > 2. How to fix this? We don't have much load on the nodes, and
> > > > corosync is already running with RT priority.
> > > 
> > > Does your corosync daemon use a watchdog device?  (Check the
> > > startup logs.)  Watchdog interaction can be *slow*.
> > 
> > Can you elaborate? This is the first time I've seen that corosync has
> > anything to do with a watchdog. How exactly does corosync interact
> > with the watchdog? Where in the corosync configuration is the watchdog
> > device defined?
> 
> Inside the resources directive you can specify a watchdog_device, 

Side comment: corosync's built-in watchdog handling is an older
alternative to sbd, the watchdog manager that pacemaker uses. You'd use
one or the other.

If you're running pacemaker on top of corosync, you'd probably want sbd
since pacemaker can use it for more situations than just cluster
membership loss.
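
For the sbd route, a minimal watchdog-only sketch (no shared disk; the
sysconfig path and values below are the usual defaults rather than
anything from this thread, so adjust for your distribution):

    # /etc/sysconfig/sbd
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5

Pacemaker additionally needs its stonith-watchdog-timeout property set so
it knows nodes can self-fence via the watchdog.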

> which
> Corosync will "pet" from its main loop.  From corosync.conf(5):
> 
> > In a cluster with properly configured power fencing a watchdog
> > provides no additional value.  On the other hand, slow watchdog
> > communication may incur multi-second delays in the Corosync main loop,
> > potentially breaking down membership.  IPMI watchdogs are particularly
> > notorious in this regard: read about kipmid_max_busy_us in IPMI.txt in
> > the Linux kernel documentation.
-- 
Ken Gaillot 


Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-09 Thread wferi
Jeevan Patnaik  writes:

>  writes:
>
>> Jeevan Patnaik  writes:
>>
>>> [16187] node1 corosync warning [MAIN  ] Corosync main process was not
>>> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
>>> timeout increase.
>>> [...]
>>> 2. How to fix this? We don't have much load on the nodes, and corosync is
>>> already running with RT priority.
>>
>> Does your corosync daemon use a watchdog device?  (Check the startup
>> logs.)  Watchdog interaction can be *slow*.
>
> Watchdog is disabled in pacemaker.

Cool, but I'd recommend checking the corosync startup logs.
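
Something like this should show whether corosync armed a watchdog at
startup (a sketch; log locations vary by distribution, and the [WD    ]
subsystem tag assumes corosync was built with watchdog support):

    journalctl -u corosync | grep -iE 'watchdog|\[WD'
    grep -iE 'watchdog|\[WD' /var/log/cluster/corosync.log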
-- 
Regards,
Feri


Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-09 Thread wferi
Andrei Borzenkov  writes:

> 04.09.2019 0:27, wf...@niif.hu writes:
>
>> Jeevan Patnaik  writes:
>> 
>>> [16187] node1 corosync warning [MAIN  ] Corosync main process was not
>>> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
>>> timeout increase.
>>> [...]
>>> 2. How to fix this? We don't have much load on the nodes, and corosync is
>>> already running with RT priority.
>> 
>> Does your corosync daemon use a watchdog device?  (Check the startup
>> logs.)  Watchdog interaction can be *slow*.
>
> Can you elaborate? This is the first time I've seen that corosync has
> anything to do with a watchdog. How exactly does corosync interact with
> the watchdog? Where in the corosync configuration is the watchdog device defined?

Inside the resources directive you can specify a watchdog_device, which
Corosync will "pet" from its main loop.  From corosync.conf(5):

| In a cluster with properly configured power fencing a watchdog
| provides no additional value.  On the other hand, slow watchdog
| communication may incur multi-second delays in the Corosync main loop,
| potentially breaking down membership.  IPMI watchdogs are particularly
| notorious in this regard: read about kipmid_max_busy_us in IPMI.txt in
| the Linux kernel documentation.
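
For illustration, it would look roughly like this (a sketch based on the
man page; the device path is an example, and corosync must be built with
watchdog support for the directive to take effect):

    resources {
        watchdog_device: /dev/watchdog
    }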
-- 
Regards,
Feri

Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-04 Thread Jeevan Patnaik
Hi,

On Wed, Sep 4, 2019 at 9:30 AM Andrei Borzenkov  wrote:

> 04.09.2019 0:27, wf...@niif.hu writes:
> > Jeevan Patnaik  writes:
> >
> >> [16187] node1 corosync warning [MAIN  ] Corosync main process was not
> >> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
> >> timeout increase.
> >> [...]
> >> 2. How to fix this? We don't have much load on the nodes, and corosync
> >> is already running with RT priority.
> >
> > Does your corosync daemon use a watchdog device?  (Check the startup
> > logs.)  Watchdog interaction can be *slow*.
> >

Watchdog is disabled in pacemaker.

> Can you elaborate? This is the first time I've seen that corosync has
> anything to do with a watchdog. How exactly does corosync interact with
> the watchdog? Where in the corosync configuration is the watchdog device defined?



Regards,
Jeevan.

Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-04 Thread Jeevan Patnaik
Hi Honza,

On Tue, Sep 3, 2019 at 7:20 PM Jan Friesse  wrote:

> Jeevan,
>
> Jeevan Patnaik wrote:
> > Hi Honza,
> >
> >   Thanks for the response.
> >
> > If you increase token timeout even higher
> > (let's say 12sec) is it still appearing or not?
> > - I will try this.
> >
> >   If you try to run it without RT priority, does it help?
> > - Can RT priority affect the process scheduling negatively?
>
> Actually we've had a report that it can, because it blocks the kernel
> thread which is responsible for sending/receiving packets. I was not able
> to reproduce this behavior myself, and it seemed to be kernel specific,
> but the resolution was that behavior without RT was better.
>
Thanks. I will check this. Also, in theory, can blocking the kernel thread
responsible for sending/receiving packets affect scheduling of the corosync
process (with RT priority)?

>
> >
> > I don't see any irregular IO activity during the times when we got these
> > errors. Also, swap usage and swap IO is not much at all, it's only in KBs.
> > We have vm.swappiness set to 1. So, I don't think swap is causing any issue.
> >
> > However, I see slight network activity during the issue times (my
> > understanding is that network activity should not affect the CPU jobs as
> > long as CPU load is normal and there is no blocking IO).
>
> It shouldn't.
>
> >
> > I am thinking of debugging in the following way, unless there is an
> > option to restart corosync in debug mode:
>
> You can turn on debug messages (debug: on in the logging section of
> corosync.conf).

Yes, I found this later. Will try debugging. Hoping it will help in
identifying where the problem is.

> >
> > -> Run strace in the background on the corosync process and redirect
> > the log to an output file
> > -> Add a frequent cron job to rotate the output log (delete old ones),
> > unless there is a flag file to keep the old log
> > -> Add another frequent cron job to check the corosync log for the
> > specific token timeout error and create the above mentioned flag file
> > so the strace output is not deleted.
> >
> > Don't know if the above process is safe to run on a production server
> > without creating much impact on the system resources. Need to check.
> >
>
> Yep. Hopefully you find something.
>
> Regards,
>Honza
>
> >
> > On Mon, Sep 2, 2019 at 5:50 PM Jan Friesse  wrote:
> >
> >> Jeevan,
> >>
> >> Jeevan Patnaik wrote:
> >>> Hi,
> >>>
> >>> Also, both are physical machines.
> >>>
> >>> On Fri, Aug 30, 2019 at 7:23 PM Jeevan Patnaik 
> >> wrote:
> >>>
>  Hi,
> 
>  We see the following messages almost every day in our 2-node cluster
>  and resources get migrated when it happens:
> 
>  [16187] node1 corosync warning [MAIN  ] Corosync main process was not
>  scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
>  timeout increase.
>  [16187] node1 corosync notice  [TOTEM ] c.
>  [16187] node1 corosync notice  [TOTEM ] A new membership
>  (192.168.0.1:1268) was formed. Members joined: 2 left: 2
>  [16187] node1 corosync notice  [TOTEM ] Failed to receive the leave
>  message. failed: 2
> 
> 
>  After setting the token timeout to 6000ms, at least the "Failed to
>  receive the leave message" doesn't appear anymore. But we still see
>  corosync timeout errors:
>  [16395] node1 corosync warning [MAIN  ] Corosync main process was not
>  scheduled for 6660.9043 ms (threshold is 4800.0000 ms). Consider token
>  timeout increase.
> 
>  1. Why is the set timeout not in effect? It's 4800ms instead of 6000ms.
> >>
> >> It is in effect. The threshold for the pause detector is set to
> >> 0.8 * token timeout.
> >>
>  2. How to fix this? We don't have much load on the nodes, and corosync
>  is already running with RT priority.
> >>
> >> There must be something wrong. If you increase the token timeout even
> >> higher (let's say 12 sec) is it still appearing or not? If so, isn't the
> >> machine swapping (for example) or waiting for IO? If you try to run it
> >> without RT priority, does it help?
> >>
> >> Regards,
> >> Honza
> >>
> >>
> 
>  The following are the details of the OS and packages:
> 
>  Kernel: 3.10.0-957.el7.x86_64
>  OS: Oracle Linux Server 7.6
> 
>  corosync-2.4.3-4.el7.x86_64
>  corosynclib-2.4.3-4.el7.x86_64
> 
>  Thanks in advance.
> 
>  --
>  Regards,
>  Jeevan.
> >
> > Regards,
> > Jeevan.
> >
>
>

Regards,
Jeevan

Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-03 Thread Andrei Borzenkov
04.09.2019 0:27, wf...@niif.hu writes:
> Jeevan Patnaik  writes:
> 
>> [16187] node1 corosync warning [MAIN  ] Corosync main process was not
>> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
>> timeout increase.
>> [...]
>> 2. How to fix this? We don't have much load on the nodes, and corosync is
>> already running with RT priority.
> 
> Does your corosync daemon use a watchdog device?  (Check the startup
> logs.)  Watchdog interaction can be *slow*.
> 

Can you elaborate? This is the first time I've seen that corosync has
anything to do with a watchdog. How exactly does corosync interact with the
watchdog? Where in the corosync configuration is the watchdog device defined?

Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-03 Thread wferi
Jeevan Patnaik  writes:

> [16187] node1 corosync warning [MAIN  ] Corosync main process was not
> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
> timeout increase.
> [...]
> 2. How to fix this? We don't have much load on the nodes, and corosync is
> already running with RT priority.

Does your corosync daemon use a watchdog device?  (Check the startup
logs.)  Watchdog interaction can be *slow*.
-- 
Feri


Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-03 Thread Jan Friesse

Jeevan,

Jeevan Patnaik wrote:
> Hi Honza,
>
> Thanks for the response.
>
> If you increase token timeout even higher
> (let's say 12sec) is it still appearing or not?
> - I will try this.
>
> If you try to run it without RT priority, does it help?
> - Can RT priority affect the process scheduling negatively?

Actually we've had a report that it can, because it blocks the kernel thread
which is responsible for sending/receiving packets. I was not able to
reproduce this behavior myself, and it seemed to be kernel specific, but
the resolution was that behavior without RT was better.




> I don't see any irregular IO activity during the times when we got these
> errors. Also, swap usage and swap IO is not much at all, it's only in KBs.
> We have vm.swappiness set to 1. So, I don't think swap is causing any issue.
>
> However, I see slight network activity during the issue times (my
> understanding is that network activity should not affect the CPU jobs as
> long as CPU load is normal and there is no blocking IO).

It shouldn't.



> I am thinking of debugging in the following way, unless there is an
> option to restart corosync in debug mode:

You can turn on debug messages (debug: on in the logging section of
corosync.conf).
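
For reference, roughly like this (a sketch; expect considerably more log
volume with debug enabled, and the logfile path is an example):

    logging {
        debug: on
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
    }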




> -> Run strace in the background on the corosync process and redirect
> the log to an output file
> -> Add a frequent cron job to rotate the output log (delete old ones),
> unless there is a flag file to keep the old log
> -> Add another frequent cron job to check the corosync log for the
> specific token timeout error and create the above mentioned flag file
> so the strace output is not deleted.
>
> Don't know if the above process is safe to run on a production server
> without creating much impact on the system resources. Need to check.



Yep. Hopefully you find something.

Regards,
  Honza



> On Mon, Sep 2, 2019 at 5:50 PM Jan Friesse  wrote:
>
>> Jeevan,
>>
>> Jeevan Patnaik wrote:
>>> Hi,
>>>
>>> Also, both are physical machines.
>>>
>>> On Fri, Aug 30, 2019 at 7:23 PM Jeevan Patnaik  wrote:
>>>
>>>> Hi,
>>>>
>>>> We see the following messages almost every day in our 2-node cluster
>>>> and resources get migrated when it happens:
>>>>
>>>> [16187] node1 corosync warning [MAIN  ] Corosync main process was not
>>>> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
>>>> timeout increase.
>>>> [16187] node1 corosync notice  [TOTEM ] c.
>>>> [16187] node1 corosync notice  [TOTEM ] A new membership
>>>> (192.168.0.1:1268) was formed. Members joined: 2 left: 2
>>>> [16187] node1 corosync notice  [TOTEM ] Failed to receive the leave
>>>> message. failed: 2
>>>>
>>>> After setting the token timeout to 6000ms, at least the "Failed to
>>>> receive the leave message" doesn't appear anymore. But we still see
>>>> corosync timeout errors:
>>>> [16395] node1 corosync warning [MAIN  ] Corosync main process was not
>>>> scheduled for 6660.9043 ms (threshold is 4800.0000 ms). Consider token
>>>> timeout increase.
>>>>
>>>> 1. Why is the set timeout not in effect? It's 4800ms instead of 6000ms.
>>
>> It is in effect. The threshold for the pause detector is set to
>> 0.8 * token timeout.
>>
>>>> 2. How to fix this? We don't have much load on the nodes, and corosync
>>>> is already running with RT priority.
>>
>> There must be something wrong. If you increase the token timeout even
>> higher (let's say 12 sec) is it still appearing or not? If so, isn't the
>> machine swapping (for example) or waiting for IO? If you try to run it
>> without RT priority, does it help?
>>
>> Regards,
>>   Honza
>>
>>>> The following are the details of the OS and packages:
>>>>
>>>> Kernel: 3.10.0-957.el7.x86_64
>>>> OS: Oracle Linux Server 7.6
>>>>
>>>> corosync-2.4.3-4.el7.x86_64
>>>> corosynclib-2.4.3-4.el7.x86_64
>>>>
>>>> Thanks in advance.
>>>>
>>>> --
>>>> Regards,
>>>> Jeevan.
>
> Regards,
> Jeevan.



Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-03 Thread Jeevan Patnaik
Hi Honza,

Thanks for the response.

If you increase token timeout even higher
(let's say 12sec) is it still appearing or not?
- I will try this.

If you try to run it without RT priority, does it help?
- Can RT priority affect the process scheduling negatively?

I don't see any irregular IO activity during the times when we got these
errors. Also, swap usage and swap IO is not much at all, it's only in KBs.
We have vm.swappiness set to 1. So, I don't think swap is causing any issue.

However, I see slight network activity during the issue times (my
understanding is that network activity should not affect the CPU jobs as
long as CPU load is normal and there is no blocking IO).

I am thinking of debugging in the following way, unless there is an option
to restart corosync in debug mode:

-> Run strace in the background on the corosync process and redirect
the log to an output file
-> Add a frequent cron job to rotate the output log (delete old ones),
unless there is a flag file to keep the old log
-> Add another frequent cron job to check the corosync log for the
specific token timeout error and create the above mentioned flag file so
the strace output is not deleted.

Don't know if the above process is safe to run on a production server
without creating much impact on the system resources. Need to check.
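
A minimal sketch of that idea (the paths, logfile location, and rotation
threshold are assumptions, and strace slows the traced process, so measure
the impact before relying on it in production):

    #!/bin/sh
    # capture-corosync-strace.sh -- hypothetical cron helper
    LOG=/var/log/cluster/corosync.log   # adjust to your corosync logfile
    OUT=/var/tmp/corosync-strace
    FLAG=$OUT/keep                      # flag file: stop deleting traces
    mkdir -p "$OUT"

    # Attach strace to the running corosync process, once
    PID=$(pidof corosync) || exit 1
    if ! pgrep -f "strace.*-p $PID" >/dev/null; then
        strace -tt -f -o "$OUT/trace.$PID" -p "$PID" &
    fi

    # Keep the traces once the warning has appeared; otherwise rotate
    if grep -q 'Corosync main process was not scheduled' "$LOG"; then
        touch "$FLAG"
    fi
    [ -e "$FLAG" ] || find "$OUT" -name 'trace.*' -mmin +60 -delete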


On Mon, Sep 2, 2019 at 5:50 PM Jan Friesse  wrote:

> Jeevan,
>
> Jeevan Patnaik wrote:
> > Hi,
> >
> > Also, both are physical machines.
> >
> > On Fri, Aug 30, 2019 at 7:23 PM Jeevan Patnaik 
> wrote:
> >
> >> Hi,
> >>
> >> We see the following messages almost every day in our 2-node cluster and
> >> resources get migrated when it happens:
> >>
> >> [16187] node1 corosync warning [MAIN  ] Corosync main process was not
> >> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
> >> timeout increase.
> >> [16187] node1 corosync notice  [TOTEM ] c.
> >> [16187] node1 corosync notice  [TOTEM ] A new membership
> >> (192.168.0.1:1268) was formed. Members joined: 2 left: 2
> >> [16187] node1 corosync notice  [TOTEM ] Failed to receive the leave
> >> message. failed: 2
> >>
> >>
> >> After setting the token timeout to 6000ms, at least the "Failed to
> >> receive the leave message" doesn't appear anymore. But we still see
> >> corosync timeout errors:
> >> [16395] node1 corosync warning [MAIN  ] Corosync main process was not
> >> scheduled for 6660.9043 ms (threshold is 4800.0000 ms). Consider token
> >> timeout increase.
> >>
> >> 1. Why is the set timeout not in effect? It's 4800ms instead of 6000ms.
>
> It is in effect. The threshold for the pause detector is set to
> 0.8 * token timeout.
>
> >> 2. How to fix this? We don't have much load on the nodes, and corosync
> >> is already running with RT priority.
>
> There must be something wrong. If you increase the token timeout even
> higher (let's say 12 sec) is it still appearing or not? If so, isn't the
> machine swapping (for example) or waiting for IO? If you try to run it
> without RT priority, does it help?
>
> Regards,
>Honza
>
>
> >>
> >> The following are the details of the OS and packages:
> >>
> >> Kernel: 3.10.0-957.el7.x86_64
> >> OS: Oracle Linux Server 7.6
> >>
> >> corosync-2.4.3-4.el7.x86_64
> >> corosynclib-2.4.3-4.el7.x86_64
> >>
> >> Thanks in advance.
> >>
> >> --
> >> Regards,
> >> Jeevan.
> >>
> >
> >
> >
> >
> >
>
>

Regards,
Jeevan.

Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-02 Thread Jan Friesse

Jeevan,

Jeevan Patnaik wrote:
> Hi,
>
> Also, both are physical machines.
>
> On Fri, Aug 30, 2019 at 7:23 PM Jeevan Patnaik  wrote:
>
>> Hi,
>>
>> We see the following messages almost every day in our 2-node cluster
>> and resources get migrated when it happens:
>>
>> [16187] node1 corosync warning [MAIN  ] Corosync main process was not
>> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
>> timeout increase.
>> [16187] node1 corosync notice  [TOTEM ] c.
>> [16187] node1 corosync notice  [TOTEM ] A new membership
>> (192.168.0.1:1268) was formed. Members joined: 2 left: 2
>> [16187] node1 corosync notice  [TOTEM ] Failed to receive the leave
>> message. failed: 2
>>
>> After setting the token timeout to 6000ms, at least the "Failed to
>> receive the leave message" doesn't appear anymore. But we still see
>> corosync timeout errors:
>> [16395] node1 corosync warning [MAIN  ] Corosync main process was not
>> scheduled for 6660.9043 ms (threshold is 4800.0000 ms). Consider token
>> timeout increase.
>>
>> 1. Why is the set timeout not in effect? It's 4800ms instead of 6000ms.

It is in effect. The threshold for the pause detector is set to 0.8 * token
timeout.
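
Concretely, with the totem section below, 0.8 * 6000 ms = 4800 ms, which is
exactly the threshold in your log (a sketch; other totem options left as
whatever the cluster already uses):

    totem {
        # pause-detector threshold = 0.8 * token = 4800 ms
        token: 6000
    }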


>> 2. How to fix this? We don't have much load on the nodes, and corosync
>> is already running with RT priority.

There must be something wrong. If you increase the token timeout even higher
(let's say 12 sec) is it still appearing or not? If so, isn't the machine
swapping (for example) or waiting for IO? If you try to run it without
RT priority, does it help?


Regards,
  Honza




>> The following are the details of the OS and packages:
>>
>> Kernel: 3.10.0-957.el7.x86_64
>> OS: Oracle Linux Server 7.6
>>
>> corosync-2.4.3-4.el7.x86_64
>> corosynclib-2.4.3-4.el7.x86_64
>>
>> Thanks in advance.
>>
>> --
>> Regards,
>> Jeevan.












Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-08-30 Thread Jeevan Patnaik
Hi,

Also, both are physical machines.

On Fri, Aug 30, 2019 at 7:23 PM Jeevan Patnaik  wrote:

> Hi,
>
> We see the following messages almost every day in our 2-node cluster and
> resources get migrated when it happens:
>
> [16187] node1 corosync warning [MAIN  ] Corosync main process was not
> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
> timeout increase.
> [16187] node1 corosync notice  [TOTEM ] c.
> [16187] node1 corosync notice  [TOTEM ] A new membership (192.168.0.1:1268)
> was formed. Members joined: 2 left: 2
> [16187] node1 corosync notice  [TOTEM ] Failed to receive the leave
> message. failed: 2
>
>
> After setting the token timeout to 6000ms, at least the "Failed to
> receive the leave message" doesn't appear anymore. But we still see
> corosync timeout errors:
> [16395] node1 corosync warning [MAIN  ] Corosync main process was not
> scheduled for 6660.9043 ms (threshold is 4800.0000 ms). Consider token
> timeout increase.
>
> 1. Why is the set timeout not in effect? It's 4800ms instead of 6000ms.
> 2. How to fix this? We don't have much load on the nodes, and corosync is
> already running with RT priority.
>
> The following are the details of the OS and packages:
>
> Kernel: 3.10.0-957.el7.x86_64
> OS: Oracle Linux Server 7.6
>
> corosync-2.4.3-4.el7.x86_64
> corosynclib-2.4.3-4.el7.x86_64
>
> Thanks in advance.
>
> --
> Regards,
> Jeevan.
>


-- 
Jeevan Patnaik Behara
Sr. Engineer - IT Servers & Services, Netcracker Technology
HITEC City, Madhapur, Hyderabad


[ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-08-30 Thread Jeevan Patnaik
Hi,

We see the following messages almost every day in our 2-node cluster and
resources get migrated when it happens:

[16187] node1 corosync warning [MAIN  ] Corosync main process was not
scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
timeout increase.
[16187] node1 corosync notice  [TOTEM ] c.
[16187] node1 corosync notice  [TOTEM ] A new membership
(192.168.0.1:1268) was formed. Members joined: 2 left: 2
[16187] node1 corosync notice  [TOTEM ] Failed to receive the leave
message. failed: 2


After setting the token timeout to 6000ms, at least the "Failed to receive
the leave message" doesn't appear anymore. But we still see corosync timeout
errors:
[16395] node1 corosync warning [MAIN  ] Corosync main process was not
scheduled for 6660.9043 ms (threshold is 4800.0000 ms). Consider token
timeout increase.

1. Why is the set timeout not in effect? It's 4800ms instead of 6000ms.
2. How to fix this? We don't have much load on the nodes, and corosync is
already running with RT priority.

The following are the details of the OS and packages:

Kernel: 3.10.0-957.el7.x86_64
OS: Oracle Linux Server 7.6

corosync-2.4.3-4.el7.x86_64
corosynclib-2.4.3-4.el7.x86_64

Thanks in advance.

-- 
Regards,
Jeevan.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/