Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?

2019-02-20 Thread Eric Robinson




> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Wednesday, February 20, 2019 8:51 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When
> Just One Fails?
> 
> 20.02.2019 21:51, Eric Robinson wrote:
> >
> > The following should show OK in a fixed font like Consolas, but the
> following setup is supposed to be possible, and is even referenced in the
> ClusterLabs documentation.
> >
> >
> >
> >
> >
> > +--------------+
> > |   mysql001   +--+
> > +--------------+  |
> > +--------------+  |
> > |   mysql002   +--+
> > +--------------+  |
> > +--------------+  |   +-------------+   +------------+   +----------+
> > |   mysql003   +--+-->+ floating ip +-->+ filesystem +-->+ blockdev |
> > +--------------+  |   +-------------+   +------------+   +----------+
> > +--------------+  |
> > |   mysql004   +--+
> > +--------------+  |
> > +--------------+  |
> > |   mysql005   +--+
> > +--------------+
> >
> >
> >
> > In the layout above, the MySQL instances are dependent on the same
> > underlying service stack, but they are not dependent on each other.
> > Therefore, as I understand it, the failure of one MySQL instance should not
> > cause the failure of other MySQL instances if on-fail=ignore or on-fail=stop
> > is set. At least, that's the way it seems to me, but based on the thread, I
> > guess it does not behave that way.
> >
> 
> It works this way for the monitor operation if you set on-fail=block:
> the failed resource is left "as is". The only case where it does not work
> seems to be the stop operation; even with an explicit on-fail=block it still
> attempts to initiate follow-up actions. I still consider this a bug.
> 
> If this is not a bug, this needs clear explanation in documentation.
> 
> But please understand that, assuming on-fail=block works, you effectively
> reduce your cluster to a controlled start of resources during boot. As we have

Or failover, correct?

> seen, stopping of the IP resource is blocked, meaning pacemaker also cannot
> perform resource-level recovery at all. And for the mysql resources you
> explicitly ignore any result of monitoring, or any failure to stop them.
> And not having stonith also prevents pacemaker from handling node failure.
> What is left is at most a restart of resources on another node during graceful
> shutdown.
> 
> That begs the question: what do you need such a "cluster" for at all?

Mainly to manage the other relevant resources: drbd, the filesystem, and the
floating IP. I'm content to forgo resource-level recovery for the MySQL
services, monitor their health from outside the cluster, and remediate them
manually if necessary. I don't see another option if I want to avoid the sort
of deadlock situation we talked about earlier.
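
For what it's worth, a minimal sketch of what that could look like in the CIB,
with illustrative resource names and timeouts (p_mysql_001, p_vip_clust01):
the MySQL primitive keeps its start/stop operations but has no recurring
monitor, so the cluster never reacts to a MySQL failure on its own, while the
drbd/filesystem/VIP stack keeps its normal monitoring.

<primitive id="p_mysql_001" class="ocf" provider="heartbeat" type="mysql">
  <operations>
    <!-- no recurring monitor: health is watched from outside the cluster -->
    <op id="p_mysql_001-start" name="start" interval="0s" timeout="120s"/>
    <!-- block rather than escalate if a stop ever fails -->
    <op id="p_mysql_001-stop" name="stop" interval="0s" timeout="120s" on-fail="block"/>
  </operations>
</primitive>
<!-- colocation/ordering with the floating IP stays as before, e.g.: -->
<rsc_colocation id="col_mysql_001_with_vip" rsc="p_mysql_001"
                with-rsc="p_vip_clust01" score="INFINITY"/>
<rsc_order id="ord_vip_before_mysql_001" first="p_vip_clust01"
           then="p_mysql_001" kind="Mandatory"/>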



Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?

2019-02-20 Thread Andrei Borzenkov
20.02.2019 21:51, Eric Robinson wrote:
> 
> The following should show OK in a fixed font like Consolas, but the following 
> setup is supposed to be possible, and is even referenced in the ClusterLabs 
> documentation.
> 
> 
> 
> 
> 
> +--------------+
> |   mysql001   +--+
> +--------------+  |
> +--------------+  |
> |   mysql002   +--+
> +--------------+  |
> +--------------+  |   +-------------+   +------------+   +----------+
> |   mysql003   +--+-->+ floating ip +-->+ filesystem +-->+ blockdev |
> +--------------+  |   +-------------+   +------------+   +----------+
> +--------------+  |
> |   mysql004   +--+
> +--------------+  |
> +--------------+  |
> |   mysql005   +--+
> +--------------+
> 
> 
> 
> In the layout above, the MySQL instances are dependent on the same underlying
> service stack, but they are not dependent on each other. Therefore, as I
> understand it, the failure of one MySQL instance should not cause the failure
> of other MySQL instances if on-fail=ignore or on-fail=stop is set. At least,
> that's the way it seems to me, but based on the thread, I guess it does not
> behave that way.
> 

It works this way for the monitor operation if you set on-fail=block:
the failed resource is left "as is". The only case where it does not work
seems to be the stop operation; even with an explicit on-fail=block it still
attempts to initiate follow-up actions. I still consider this a bug.

If this is not a bug, this needs clear explanation in documentation.
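
For reference, a rough sketch of how that monitor policy is usually set; the
resource name and interval below are only illustrative, and the exact
pcs/crmsh syntax should be checked against the installed version:

# pcs form (resource name and interval are only examples)
pcs resource update p_mysql_002 op monitor interval=30s on-fail=block

# crmsh form: edit the primitive and set the same option on its monitor op
crm configure edit p_mysql_002
#   op monitor interval=30s on-fail=block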

But please understand that, assuming on-fail=block works, you effectively
reduce your cluster to a controlled start of resources during boot. As we
have seen, stopping of the IP resource is blocked, meaning pacemaker also
cannot perform resource-level recovery at all. And for the mysql resources
you explicitly ignore any result of monitoring, or any failure to stop them.
And not having stonith also prevents pacemaker from handling node failure.
What is left is at most a restart of resources on another node during
graceful shutdown.

That begs the question: what do you need such a "cluster" for at all?


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 20/02/19 21:16 +0100, Klaus Wenninger wrote:
> On 02/20/2019 08:51 PM, Jan Pokorný wrote:
>> On 20/02/19 17:37 +, Edwin Török wrote:
>>> strace for the situation described below (corosync 95%, 1 vCPU):
>>> https://clbin.com/hZL5z
>> I might have missed that earlier or this may be just some sort
>> of insignificant/misleading clue:
>> 
>>> strace: Process 4923 attached with 2 threads
>>> strace: [ Process PID=4923 runs in x32 mode. ]
> 
> A lot of reports can be found that strace seems to get this wrong
> somehow.

Haven't ever seen that (incl. today's attempts), but could be:
https://lkml.org/lkml/2016/4/7/823

It's no fun when debugging tools generate more questions than
they are meant to answer.

>> but do you indeed compile corosync using x32 ABI?  (moreover
>> while pristine x86_64 libqb is used, i.e., using 64bit pointers?)
>> 
>> Perhaps it would be good to state any such major divergences, if so,
>> unless I missed that.

-- 
Jan (Poki)




Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 20/02/19 21:25 +0100, Klaus Wenninger wrote:
> Hmm, maybe the thing that should be scheduled is running at
> SCHED_RR as well, but with just a lower priority. So it wouldn't
> profit from the sched_yield and it wouldn't get any of
> the 5% either.

Actually, it would possibly make the situation even worse in
that case, as explained in sched_yield(2):

> since doing so will result in unnecessary context
> switches, which will degrade system performance

(not sure into which bucket this context-switched time would
get accounted, if at all, but the wall-clock time is ticking
in the interim...)

I am curious whether a well-tuned SCHED_DEADLINE, as mentioned, might
be a more comprehensive solution here, also to automatically
flip still-alive-without-progress buggy scenarios into
a purposefully exaggerated and hence possibly actionable
condition (like with token loss -> fencing).
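
For illustration only, a minimal sketch of how a process could opt into
SCHED_DEADLINE via the sched_setattr(2) syscall; the runtime/deadline/period
values are made-up placeholders, not a tuning recommendation for corosync:

/* deadline_demo.c -- switch the calling thread to SCHED_DEADLINE.
 * glibc has no wrapper, so use syscall(2) directly. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

struct sched_attr {              /* layout per sched_setattr(2) */
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;      /* ns of CPU guaranteed per period */
    uint64_t sched_deadline;     /* ns */
    uint64_t sched_period;       /* ns */
};

int main(void)
{
    struct sched_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.sched_policy   = SCHED_DEADLINE;
    attr.sched_runtime  =  10 * 1000 * 1000;   /* 10 ms  */
    attr.sched_deadline =  50 * 1000 * 1000;   /* 50 ms  */
    attr.sched_period   = 100 * 1000 * 1000;   /* 100 ms */

    if (syscall(SYS_sched_setattr, 0, &attr, 0) < 0) {
        perror("sched_setattr");   /* needs CAP_SYS_NICE / root */
        return 1;
    }
    pause();                       /* the kernel now enforces the CPU budget */
    return 0;
}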

-- 
Jan (Poki)




Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Klaus Wenninger
On 02/20/2019 06:37 PM, Edwin Török wrote:
> On 20/02/2019 13:08, Jan Friesse wrote:
>> Edwin Török wrote:
>>> On 20/02/2019 07:57, Jan Friesse wrote:
 Edwin,
>
> On 19/02/2019 17:02, Klaus Wenninger wrote:
>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>> On 19/02/2019 16:26, Edwin Török wrote:
 On 18/02/2019 18:27, Edwin Török wrote:
> Did a test today with CentOS 7.6 with upstream kernel and with
> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
> patched [1] SBD) and was not able to reproduce the issue yet.
 I was able to finally reproduce this using only upstream components
 (although it seems to be easier to reproduce if we use our patched
 SBD,
 I was able to reproduce this by using only upstream packages
 unpatched
 by us):
>> Just out of curiosity: What did you patch in SBD?
>> Sorry if I missed the answer in the previous communication.
> It is mostly this PR, which calls getquorate quite often (a more
> efficient impl. would be to use the quorum notification API like
> dlm/pacemaker do, although see concerns in
> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>
> https://github.com/ClusterLabs/sbd/pull/27
>
> We have also added our own servant for watching the health of our
> control plane, but that is not relevant to this bug (it reproduces with
> that watcher turned off too).
>
>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>> that showed something interesting:
>>> https://clbin.com/d76Ha
>>>
>>> It is looping on:
>>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>> (non-critical): Resource temporarily unavailable (11)
>> Hmm ... something like tx-queue of the device full, or no buffers
>> available anymore and kernel-thread doing the cleanup isn't
>> scheduled ...
> Yes that is very plausible. Perhaps it'd be nicer if corosync went back
> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
 But this is exactly what happens. Corosync will call sendmsg to all
 active udpu members and returns back to main loop -> epoll_wait.

> (although this seems different from the original bug where it got stuck
> in epoll_wait)
 I'm pretty sure it is.

 Anyway, let's try "sched_yield" idea. Could you please try included
 patch and see if it makes any difference (only for udpu)?
>>> Thanks for the patch, unfortunately corosync still spins 106% even with
>>> yield:
>>> https://clbin.com/CF64x
>> Yep, it was kind of expected, but at least worth a try. How does strace
>> look when this happens?
>>
> strace for the situation described below (corosync 95%, 1 vCPU):
> https://clbin.com/hZL5z
>
>> Also Klaus had an idea to try remove sbd from the picture and try
>> different RR process to find out what happens. And I think it's again
>> worth try.
>>
>> Could you please try install/enable/start
>> https://github.com/jfriesse/spausedd (packages built by copr are
>> https://copr.fedorainfracloud.org/coprs/honzaf/spausedd/),
>> disable/remove sbd and run your test?
>>
> Tried this with 4 vCPUs but it didn't detect any pauses (kind of
> expected, plenty of other CPUs to have spausedd scheduled on even if
> corosync hogs one).
> Then tried with 1 vCPU and it didn't detect any pauses here either. The
> 95% max realtime runtime protection kicks in and limits corosync to 95%
> CPU since now global 95% = this CPU's 95% since there is only one.
> Interestingly corosync stays running at 95% CPU usage now, unlike in SMP
> scenarios where the 95% limit was enough to avoid the situation.

Hmm, maybe the thing that should be scheduled is running at
SCHED_RR as well, but with just a lower priority. So it wouldn't
profit from the sched_yield and it wouldn't get any of
the 5% either.

> There is a visible keyboard lag over SSH, but spausedd does get scheduled:

But it shows at least that the sched_yield is doing something ...
although the RR might be triggered even without the explicit
sched_yield in cases like this.

>
> 4923 root  rt   0  213344 111036  84732 R 93.3  2.8  19:07.50
> corosync
>
>  9 root  20   0   0  0  0 S  0.3  0.0   0:02.02
> ksoftirqd/0
>
>   4256 root  rt   0   88804  35612  34368 R  0.3  0.9   0:04.36
> spausedd
>
> No SBD running, and corosync-blackbox does not work at all now.
>
> ifconfig eth0; sleep 1; ifconfig eth0
> eth0: flags=4163  mtu 1500
> inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
> inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64
> scopeid 0x0
> inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20
> ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
> RX packets 4929031  bytes 4644924223 (4.3 GiB)
> RX errors 0  dropped 0  overruns 0  

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Klaus Wenninger
On 02/20/2019 08:51 PM, Jan Pokorný wrote:
> On 20/02/19 17:37 +, Edwin Török wrote:
>> strace for the situation described below (corosync 95%, 1 vCPU):
>> https://clbin.com/hZL5z
> I might have missed that earlier or this may be just some sort
> of insignificant/misleading clue:
>
>> strace: Process 4923 attached with 2 threads
>> strace: [ Process PID=4923 runs in x32 mode. ]

A lot of reports can be found that strace seems to get this wrong
somehow.

> but do you indeed compile corosync using x32 ABI?  (moreover
> while pristine x86_64 libqb is used, i.e., using 64bit pointers?)
>
> Perhaps it would be good to state any such major divergences, if so,
> unless I missed that.
>
>
>


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 20/02/19 17:37 +, Edwin Török wrote:
> strace for the situation described below (corosync 95%, 1 vCPU):
> https://clbin.com/hZL5z

I might have missed that earlier or this may be just some sort
of insignificant/misleading clue:

> strace: Process 4923 attached with 2 threads
> strace: [ Process PID=4923 runs in x32 mode. ]

but do you indeed compile corosync using x32 ABI?  (moreover
while pristine x86_64 libqb is used, i.e., using 64bit pointers?)

Perhaps it would be good to state any such major divergences, if so,
unless I missed that.

-- 
Jan (Poki)




Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?

2019-02-20 Thread Eric Robinson








> -Original Message-
> From: Users  On Behalf Of Ulrich Windl
> Sent: Tuesday, February 19, 2019 11:35 PM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When
> Just One Fails?
>
> >>> Eric Robinson <eric.robin...@psmnv.com> wrote on 19.02.2019 at 21:06 in
> message
> <mn2pr03mb4845be22fada30b472174b79fa...@mn2pr03mb4845.namprd03.prod.outlook.com>
>
> >>  -Original Message-
> >> From: Users <users-boun...@clusterlabs.org> On Behalf Of Ken Gaillot
> >> Sent: Tuesday, February 19, 2019 10:31 AM
> >> To: Cluster Labs - All topics related to open-source clustering
> >> welcomed <users@clusterlabs.org>
> >> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> >> One Fails?
> >>
> >> On Tue, 2019-02-19 at 17:40 +, Eric Robinson wrote:
> >> > > -Original Message-
> >> > > From: Users <users-boun...@clusterlabs.org> On Behalf Of Andrei
> >> > > Borzenkov
> >> > > Sent: Sunday, February 17, 2019 11:56 AM
> >> > > To: users@clusterlabs.org
> >> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When
> >> > > Just One Fails?
> >> > >
> >> > > 17.02.2019 0:44, Eric Robinson wrote:
> >> > > > Thanks for the feedback, Andrei.
> >> > > >
> >> > > > I only want cluster failover to occur if the filesystem or drbd
> >> > > > resources fail,
> >> > >
> >> > > or if the cluster messaging layer detects a complete node failure.
> >> > > Is there a
> >> > > way to tell Pacemaker not to trigger a cluster failover if any of
> >> > > the p_mysql resources fail?
> >> > > >
> >> > >
> >> > > Let's look at this differently. If all these applications depend
> >> > > on each other, you should not be able to stop individual resource
> >> > > in the first place - you need to group them or define dependency
> >> > > so that stopping any resource would stop everything.
> >> > >
> >> > > If these applications are independent, they should not share
> >> > > resources.
> >> > > Each MySQL application should have own IP and own FS and own
> >> > > block device for this FS so that they can be moved between
> >> > > cluster nodes independently.
> >> > >
> >> > > Anything else will lead to troubles as you already observed.
> >> >
> >> > FYI, the MySQL services do not depend on each other. All of them
> >> > depend on the floating IP, which depends on the filesystem, which
> >> > depends on DRBD, but they do not depend on each other. Ideally, the
> >> > failure of p_mysql_002 should not cause failure of other mysql
> >> > resources, but now I understand why it happened. Pacemaker wanted
> >> > to start it on the other node, so it needed to move the floating
> >> > IP, filesystem, and DRBD primary, which had the cascade effect of
> >> > stopping the other MySQL resources.
> >> >
> >> > I think I also understand why the p_vip_clust01 resource blocked.
> >> >
> >> > FWIW, we've been using Linux HA since 2006, originally Heartbeat,
> >> > but then Corosync+Pacemaker. The past 12 years have been relatively
> >> > problem free. This symptom is new for us, only within the past year.
> >> > Our cluster nodes have many separate instances of MySQL running, so
> >> > it is not practical to have that many filesystems, IPs, etc. We are
> >> > content with the way things are, except for this new troubling
> >> > behavior.
> >> >
> >> > If I understand the thread correctly, on-fail=stop will not work
> >> > because the cluster will still try to stop the resources that are
> >> > implied dependencies.
> >> >
> >> > Bottom line is, how do we configure the cluster in such a way that
> >> > there are no cascading consequences when a MySQL resource fails?
> >> > Basically, if a MySQL resource fails, it fails. We'll deal with
> >> > that on an ad-hoc basis. I don't want the whole cluster to barf.
> >> > What about on-fail=ignore? Earlier, you suggested symmetrical=false
> >> > might also do the trick, but you said it comes with its own can of worms.
> >> > What are the downsides of on-fail=ignore or symmetrical=false?
> >> >
> >> > --Eric
> >>
> >> Even adding on-fail=ignore to the recurring monitors may not do what
> >> you want, because I suspect that even an ignored failure will make
> >> the node less
> >> preferable for all the other resources. But it's worth testing.
> >>
> >> Otherwise, your best option is to remove all the recurring monitors
> >> from the
> >> mysql resources, and rely on external monitoring (e.g. nagios,
> >> icinga, monit,
> >> ...) to detect problems.
> >
> > This is probably a dumb question, but can we remove just the monitor
> > operation but leave the resource configured in the cluster? If a 

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-20 Thread Andrei Borzenkov
18.02.2019 18:53, Ken Gaillot wrote:
> On Sun, 2019-02-17 at 20:33 +0300, Andrei Borzenkov wrote:
>> 17.02.2019 0:33, Andrei Borzenkov wrote:
>>> 17.02.2019 0:03, Eric Robinson wrote:
 Here are the relevant corosync logs.

 It appears that the stop action for resource p_mysql_002 failed,
 and that caused a cascading series of service changes. However, I
 don't understand why, since no other resources are dependent on
 p_mysql_002.

>>>
>>> You have mandatory colocation constraints for each SQL resource with
>>> the VIP. It means that to move an SQL resource to another node, pacemaker
>>> also must move the VIP to another node, which in turn means it needs to
>>> move all other dependent resources as well.
>>> ...
 Feb 16 14:06:39 [3912] 001db01a    pengine:  warning: check_migration_threshold: Forcing p_mysql_002 away from 001db01a after 100 failures (max=100)
>>>
>>> ...
 Feb 16 14:06:39 [3912] 001db01a    pengine:   notice: LogAction: * Stop   p_vip_clust01   (   001db01a )   blocked
>>>
>>> ...
 Feb 16 14:06:39 [3912] 001db01a    pengine:   notice: LogAction: * Stop   p_mysql_001   (   001db01a )   due to colocation with p_vip_clust01
>>
>> There is apparently more to it. Note that the p_vip_clust01 operation is
>> "blocked". That is because a mandatory order constraint is symmetrical by
>> default, so to move the VIP pacemaker first needs to stop it on the current
>> node; but before it can stop the VIP it needs to (be able to) stop
>> p_mysql_002; and it cannot do that because, by default, when "stop" fails
>> without stonith the resource is blocked and no further actions are possible
>> - i.e. the resource can no longer be stopped (or even be tried to be stopped).
> 
> Correct, failed stop actions are special -- an on-fail policy of "stop"
> or "restart" requires a stop, so obviously they can't be applied to
> failed stops. As you mentioned, without fencing, on-fail defaults to
> "block" for stops, which should freeze the resource as it is.
> 
>> I still consider this rather questionable behavior. I tried to reproduce
>> it and I see the same.
>>
>> 1. After this happens, resource p_mysql_002 has target=Stopped in the CIB.
>> Why, oh why, does pacemaker try to "force away" a resource that is not
>> going to be started on another node anyway?
> 
> Without having the policy engine inputs, I can't be sure, but I suspect
> p_mysql_002 is not being forced away, but its failure causes that node
> to be less preferred for the resources it depends on.
> 
>> 2. pacemaker knows that it cannot stop (and hence move) p_vip_clust01,
>> yet it will still happily stop all resources that depend on it in
>> preparation to move them, and leave them at that because it cannot
>> move
> 
> I think this is the point at which the behavior is undesirable, because
> it would be relevant whether the move was related to the blocked
> failure or not. Feel free to open a bug report and attach the relevant
> policy engine input (or a crm_report).
> 

https://bugs.clusterlabs.org/show_bug.cgi?id=5379

>> them. Resources are neither restarted on the current node, nor moved to
>> another node. At this point I'd expect pacemaker to be smart enough
>> not to even initiate actions that are known to be unsuccessful.
>>
>> The best we can do at this point is set symmetrical=false, which allows
>> the move to actually happen, but it still means downtime for resources
>> that are moved, and it has its own can of worms in the normal case.
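
(For reference, a minimal sketch of what such an asymmetric ordering looks
like in CIB XML; the resource names are the ones from this thread, everything
else is illustrative:)

<!-- order the VIP before p_mysql_002, but do not require the reverse
     (stop) ordering when the VIP has to move -->
<rsc_order id="ord_vip_then_mysql_002" first="p_vip_clust01"
           then="p_mysql_002" kind="Mandatory" symmetrical="false"/>
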
> --
> Ken Gaillot 
> 


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Edwin Török
On 20/02/2019 13:08, Jan Friesse wrote:
> Edwin Török wrote:
>> On 20/02/2019 07:57, Jan Friesse wrote:
>>> Edwin,


 On 19/02/2019 17:02, Klaus Wenninger wrote:
> On 02/19/2019 05:41 PM, Edwin Török wrote:
>> On 19/02/2019 16:26, Edwin Török wrote:
>>> On 18/02/2019 18:27, Edwin Török wrote:
 Did a test today with CentOS 7.6 with upstream kernel and with
 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
 patched [1] SBD) and was not able to reproduce the issue yet.
>>> I was able to finally reproduce this using only upstream components
>>> (although it seems to be easier to reproduce if we use our patched
>>> SBD,
>>> I was able to reproduce this by using only upstream packages
>>> unpatched
>>> by us):
>
> Just out of curiosity: What did you patch in SBD?
> Sorry if I missed the answer in the previous communication.

 It is mostly this PR, which calls getquorate quite often (a more
 efficient impl. would be to use the quorum notification API like
 dlm/pacemaker do, although see concerns in
 https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):

 https://github.com/ClusterLabs/sbd/pull/27
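
(For context, a rough sketch of the kind of polling that PR does -- not SBD's
actual code -- written from memory of the corosync 2.x libquorum API; the
exact function signatures should be double-checked against <corosync/quorum.h>
before relying on this:)

/* quorum_poll.c -- naive "ask for quorum state every second" loop.
 * Illustrative only; error handling trimmed. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <corosync/corotypes.h>
#include <corosync/quorum.h>

int main(void)
{
    quorum_handle_t handle;
    uint32_t quorum_type = 0;
    quorum_callbacks_t callbacks = { 0 };   /* no change notifications */

    if (quorum_initialize(&handle, &callbacks, &quorum_type) != CS_OK) {
        fprintf(stderr, "quorum_initialize failed\n");
        return 1;
    }
    for (;;) {
        int quorate = 0;
        if (quorum_getquorate(handle, &quorate) != CS_OK)
            break;
        printf("quorate: %d\n", quorate);
        sleep(1);      /* polling; the notification API would be cheaper */
    }
    quorum_finalize(handle);
    return 0;
}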

 We have also added our own servant for watching the health of our
 control plane, but that is not relevant to this bug (it reproduces with
 that watcher turned off too).

>
>> I was also able to get a corosync blackbox from one of the stuck VMs
>> that showed something interesting:
>> https://clbin.com/d76Ha
>>
>> It is looping on:
>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>> (non-critical): Resource temporarily unavailable (11)
>
> Hmm ... something like tx-queue of the device full, or no buffers
> available anymore and kernel-thread doing the cleanup isn't
> scheduled ...

 Yes that is very plausible. Perhaps it'd be nicer if corosync went back
 to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
>>>
>>> But this is exactly what happens. Corosync will call sendmsg to all
>>> active udpu members and returns back to main loop -> epoll_wait.
>>>
 (although this seems different from the original bug where it got stuck
 in epoll_wait)
>>>
>>> I'm pretty sure it is.
>>>
>>> Anyway, let's try "sched_yield" idea. Could you please try included
>>> patch and see if it makes any difference (only for udpu)?
>>
>> Thanks for the patch, unfortunately corosync still spins 106% even with
>> yield:
>> https://clbin.com/CF64x
> 
> Yep, it was kind of expected, but at least worth a try. How does strace
> look when this happens?
> 

strace for the situation described below (corosync 95%, 1 vCPU):
https://clbin.com/hZL5z

> Also Klaus had an idea to try remove sbd from the picture and try
> different RR process to find out what happens. And I think it's again
> worth try.
> 
> Could you please try install/enable/start
> https://github.com/jfriesse/spausedd (packages built by copr are
> https://copr.fedorainfracloud.org/coprs/honzaf/spausedd/),
> disable/remove sbd and run your test?
> 

Tried this with 4 vCPUs but it didn't detect any pauses (kind of
expected, plenty of other CPUs to have spausedd scheduled on even if
corosync hogs one).
Then tried with 1 vCPU and it didn't detect any pauses here either. The
95% max realtime runtime protection kicks in and limits corosync to 95%
CPU since now global 95% = this CPU's 95% since there is only one.
Interestingly corosync stays running at 95% CPU usage now, unlike in SMP
scenarios where the 95% limit was enough to avoid the situation.
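
(Side note for anyone reproducing this: the 95% cap is the kernel's realtime
throttling, visible and tunable via the standard sysctl knobs below; nothing
here is specific to this setup.)

# RT tasks may use at most sched_rt_runtime_us out of every sched_rt_period_us
cat /proc/sys/kernel/sched_rt_runtime_us   # 950000 (0.95 s) by default
cat /proc/sys/kernel/sched_rt_period_us    # 1000000 (1 s) by default

# confirm corosync's scheduling policy and priority
chrt -p "$(pidof corosync)"
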
There is a visible keyboard lag over SSH, but spausedd does get scheduled:

4923 root  rt   0  213344 111036  84732 R 93.3  2.8  19:07.50 corosync
   9 root  20   0       0      0      0 S  0.3  0.0   0:02.02 ksoftirqd/0
4256 root  rt   0   88804  35612  34368 R  0.3  0.9   0:04.36 spausedd

No SBD running, and corosync-blackbox does not work at all now.

ifconfig eth0; sleep 1; ifconfig eth0
eth0: flags=4163  mtu 1500
inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64
scopeid 0x0
inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20
ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
RX packets 4929031  bytes 4644924223 (4.3 GiB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 15445220  bytes 22485760076 (20.9 GiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163  mtu 1500
inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64
scopeid 0x0
inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20
ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
RX packets 4934042  bytes 

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Ken Gaillot
On Wed, 2019-02-20 at 14:03 +, Edwin Török wrote:
> 
> On 20/02/2019 12:44, Jan Pokorný wrote:
> > On 19/02/19 16:41 +, Edwin Török wrote:
> > > Also noticed this: [ 5390.361861] crmd[12620]: segfault at 0 ip
> > > 7f221c5e03b1 sp 7ffcf9cf9d88 error 4 in
> > > libc-2.17.so[7f221c554000+1c2000] [ 5390.361918] Code: b8 00 00
> > > 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00 c3 0f 1f 80 00 00 00 00
> > > 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77 19  0f 6f 0f
> > > 66 0f 74 c1 66 0f d7 d0 85 d2 75 7a 48 89 f8 48 83 e0
> > 
> > By any chance, is this an unmodified pacemaker package as
> > obtainable from some public repo together with debug symbols?
> 
> I haven't modified pacemaker, here are the versions:
> 
> rpm -q pacemaker
> pacemaker-1.1.19-8.el7.x86_64
> rpm -q glibc
> glibc-2.17-260.el7_6.3.x86_64
> 
> 0x7f221c5e03b1 - 0x7f221c554000 = 0x8c3b1
> addr2line -fie /lib64/libc.so.6 0x8c3b1
> __GI_strlen
> :?
> 
> Feb 19 16:22:04 host-10 crmd[12620]:  notice: Additional logging
> available in /var/log/cluster/corosync.log
> Feb 19 16:22:05 host-10 crmd[12620]:  notice: Connecting to cluster
> infrastructure: corosync
> Feb 19 16:29:50 host-10 crmd[12620]:   error: Could not join the CPG
> group 'crmd': 6
> Feb 19 16:29:50 host-10 kernel: crmd[12620]: segfault at 0 ip
> 7f221c5e03b1 sp 7ffcf9cf9d88 error 4 in
> libc-2.17.so[7f221c554000+1c2000]
> Feb 19 16:38:28 host-10 pacemakerd[12614]:   error: Managed process
> 12620 (crmd) dumped core
> Feb 19 16:38:28 host-10 pacemakerd[12614]:   error: The crmd process
> (12620) terminated with signal 11 (core=1)
> 
> I found a core file in /var/lib/pacemaker/cores
> (gdb) bt
> #0  0x7f221c5e03b1 in __strlen_sse2 () from /lib64/libc.so.6
> #1  0x7f221c5e00be in strdup () from /lib64/libc.so.6
> #2  0x7f221f1a05cd in election_init (name=name@entry=0x0,
> uname=0x0, period_ms=period_ms@entry=6, cb=cb@entry=0x55ea42cb2790)
> at election.c:78

The current code asserts that uname is non-NULL so this won't happen,
but of course that still is a crash.
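
(The crash mechanism itself is easy to demonstrate in isolation; the snippet
below is purely illustrative and not taken from the pacemaker sources.)

/* strdup(NULL) is undefined behaviour: glibc's strdup() calls strlen()
 * on its argument, so a NULL uname ends up as a read at address 0. */
#include <string.h>

char *copy_name(const char *uname)
{
    if (uname == NULL)          /* guard (or assert) before duplicating */
        return NULL;
    return strdup(uname);
}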

> #3  0x55ea42cb3d4c in do_ha_control (action=4, cause=<optimized out>,
> cur_state=<optimized out>, current_input=<optimized out>,
> msg_data=0x55ea4464fec0)
> at control.c:139
> #4  0x55ea42cb0524 in s_crmd_fsa_actions
> (fsa_data=fsa_data@entry=0x55ea4464fec0) at fsa.c:305
> #5  0x55ea42cb216a in s_crmd_fsa (cause=cause@entry=C_STARTUP) at
> fsa.c:237
> #6  0x55ea42cad707 in crmd_init () at main.c:173
> #7  0x55ea42cad510 in main (argc=1, argv=0x7ffcf9cfa078) at
> main.c:122
> 
> g
> 
> Best regards,
> --Edwin
> 


Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Edwin Török



On 20/02/2019 12:44, Jan Pokorný wrote:
> On 19/02/19 16:41 +, Edwin Török wrote:
>> Also noticed this: [ 5390.361861] crmd[12620]: segfault at 0 ip
>> 7f221c5e03b1 sp 7ffcf9cf9d88 error 4 in
>> libc-2.17.so[7f221c554000+1c2000] [ 5390.361918] Code: b8 00 00
>> 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00 c3 0f 1f 80 00 00 00 00
>> 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77 19  0f 6f 0f
>> 66 0f 74 c1 66 0f d7 d0 85 d2 75 7a 48 89 f8 48 83 e0
> 
> By any chance, is this an unmodified pacemaker package as
> obtainable from some public repo together with debug symbols?

I haven't modified pacemaker, here are the versions:

rpm -q pacemaker
pacemaker-1.1.19-8.el7.x86_64
rpm -q glibc
glibc-2.17-260.el7_6.3.x86_64

0x7f221c5e03b1 - 0x7f221c554000 = 0x8c3b1
addr2line -fie /lib64/libc.so.6 0x8c3b1
__GI_strlen
:?

Feb 19 16:22:04 host-10 crmd[12620]:  notice: Additional logging
available in /var/log/cluster/corosync.log
Feb 19 16:22:05 host-10 crmd[12620]:  notice: Connecting to cluster
infrastructure: corosync
Feb 19 16:29:50 host-10 crmd[12620]:   error: Could not join the CPG
group 'crmd': 6
Feb 19 16:29:50 host-10 kernel: crmd[12620]: segfault at 0 ip
7f221c5e03b1 sp 7ffcf9cf9d88 error 4 in
libc-2.17.so[7f221c554000+1c2000]
Feb 19 16:38:28 host-10 pacemakerd[12614]:   error: Managed process
12620 (crmd) dumped core
Feb 19 16:38:28 host-10 pacemakerd[12614]:   error: The crmd process
(12620) terminated with signal 11 (core=1)

I found a core file in /var/lib/pacemaker/cores
(gdb) bt
#0  0x7f221c5e03b1 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x7f221c5e00be in strdup () from /lib64/libc.so.6
#2  0x7f221f1a05cd in election_init (name=name@entry=0x0,
uname=0x0, period_ms=period_ms@entry=6, cb=cb@entry=0x55ea42cb2790)
    at election.c:78
#3  0x55ea42cb3d4c in do_ha_control (action=4, cause=<optimized out>,
cur_state=<optimized out>, current_input=<optimized out>,
msg_data=0x55ea4464fec0)
    at control.c:139
#4  0x55ea42cb0524 in s_crmd_fsa_actions
(fsa_data=fsa_data@entry=0x55ea4464fec0) at fsa.c:305
#5  0x55ea42cb216a in s_crmd_fsa (cause=cause@entry=C_STARTUP) at
fsa.c:237
#6  0x55ea42cad707 in crmd_init () at main.c:173
#7  0x55ea42cad510 in main (argc=1, argv=0x7ffcf9cfa078) at main.c:122

g

Best regards,
--Edwin



[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Ulrich Windl
>>> Edwin Török  wrote on 20.02.2019 at 12:30 in
message <0a49f593-1543-76e4-a8ab-06a48c596...@citrix.com>:
> On 20/02/2019 07:57, Jan Friesse wrote:
>> Edwin,
>>>
>>>
>>> On 19/02/2019 17:02, Klaus Wenninger wrote:
 On 02/19/2019 05:41 PM, Edwin Török wrote:
> On 19/02/2019 16:26, Edwin Török wrote:
>> On 18/02/2019 18:27, Edwin Török wrote:
>>> Did a test today with CentOS 7.6 with upstream kernel and with
>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
>>> patched [1] SBD) and was not able to reproduce the issue yet.
>> I was able to finally reproduce this using only upstream components
>> (although it seems to be easier to reproduce if we use our patched
>> SBD,
>> I was able to reproduce this by using only upstream packages unpatched
>> by us):

 Just out of curiosity: What did you patch in SBD?
 Sorry if I missed the answer in the previous communication.
>>>
>>> It is mostly this PR, which calls getquorate quite often (a more
>>> efficient impl. would be to use the quorum notification API like
>>> dlm/pacemaker do, although see concerns in
>>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):

>>> https://github.com/ClusterLabs/sbd/pull/27 
>>>
>>> We have also added our own servant for watching the health of our
>>> control plane, but that is not relevant to this bug (it reproduces with
>>> that watcher turned off too).
>>>

> I was also able to get a corosync blackbox from one of the stuck VMs
> that showed something interesting:
> https://clbin.com/d76Ha 
>
> It is looping on:
> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
> (non-critical): Resource temporarily unavailable (11)

 Hmm ... something like tx-queue of the device full, or no buffers
 available anymore and kernel-thread doing the cleanup isn't
 scheduled ...
>>>
>>> Yes that is very plausible. Perhaps it'd be nicer if corosync went back
>>> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
>> 
>> But this is exactly what happens. Corosync will call sendmsg to all
>> active udpu members and returns back to main loop -> epoll_wait.
>> 
>>> (although this seems different from the original bug where it got stuck
>>> in epoll_wait)
>> 
>> I'm pretty sure it is.
>> 
>> Anyway, let's try "sched_yield" idea. Could you please try included
>> patch and see if it makes any difference (only for udpu)?
> 
> Thanks for the patch, unfortunately corosync still spins 106% even with
> yield:
> https://clbin.com/CF64x 
> 
> On another host corosync failed to start up completely (Denied
> connection not ready), and:
> https://clbin.com/Z35Gl 
> (I don't think this is related to the patch, it was doing that before
> when I looked at it this morning, kernel 4.20.0 this time)

I wonder: is it possible to run "iftop" and "top" (with a suitably short update
interval, showing all threads and CPUs) while waiting for the problem to occur?
If I understand it correctly, all the other terminals should freeze, so you'll
have plenty of time for snapshotting the output ;-) I expect that your network
load will be close to 100% on the interface, or that the CPU handling traffic
is busy running corosync.

> 
> Best regards,
> --Edwin
> 
>> 
>> Regards,
>>   Honza
>> 
>>>
 Does the kernel log anything in that situation?
>>>
>>> Other than the crmd segfault no.
>>>  From previous observations on xenserver the softirqs were all stuck on
>>> the CPU that corosync hogged 100% (I'll check this on upstream, but I'm
>>> fairly sure it'll be the same). softirqs do not run at realtime priority
>>> (if we increase the priority of ksoftirqd to realtime then it all gets
>>> unstuck), but seem to be essential for whatever corosync is stuck
>>> waiting on, in this case likely the sending/receiving of network packets.
>>>
>>> I'm trying to narrow down the kernel between 4.19.16 and 4.20.10 to see
>>> why this was only reproducible on 4.19 so far.
>>>
>>> Best regards,
>>> --Edwin
>>>
>>>
>>>

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Friesse

Edwin Török wrote:

On 20/02/2019 07:57, Jan Friesse wrote:

Edwin,



On 19/02/2019 17:02, Klaus Wenninger wrote:

On 02/19/2019 05:41 PM, Edwin Török wrote:

On 19/02/2019 16:26, Edwin Török wrote:

On 18/02/2019 18:27, Edwin Török wrote:

Did a test today with CentOS 7.6 with upstream kernel and with
4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
patched [1] SBD) and was not able to reproduce the issue yet.

I was able to finally reproduce this using only upstream components
(although it seems to be easier to reproduce if we use our patched
SBD,
I was able to reproduce this by using only upstream packages unpatched
by us):


Just out of curiosity: What did you patch in SBD?
Sorry if I missed the answer in the previous communication.


It is mostly this PR, which calls getquorate quite often (a more
efficient impl. would be to use the quorum notification API like
dlm/pacemaker do, although see concerns in
https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
https://github.com/ClusterLabs/sbd/pull/27

We have also added our own servant for watching the health of our
control plane, but that is not relevant to this bug (it reproduces with
that watcher turned off too).




I was also able to get a corosync blackbox from one of the stuck VMs
that showed something interesting:
https://clbin.com/d76Ha

It is looping on:
debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
(non-critical): Resource temporarily unavailable (11)


Hmm ... something like tx-queue of the device full, or no buffers
available anymore and kernel-thread doing the cleanup isn't
scheduled ...


Yes that is very plausible. Perhaps it'd be nicer if corosync went back
to the epoll_wait loop when it gets too many EAGAINs from sendmsg.


But this is exactly what happens. Corosync will call sendmsg to all
active udpu members and returns back to main loop -> epoll_wait.
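
(Purely as an illustration of that pattern, not corosync's actual code: a
non-blocking sender that treats EAGAIN as "try again after the next
epoll_wait" might look like this.)

/* Illustrative only: send one datagram on a non-blocking socket and,
 * on EAGAIN/EWOULDBLOCK, report "not sent" so the caller can go back
 * to its epoll_wait() loop instead of spinning on sendmsg(). */
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>

/* returns true if the datagram was handed to the kernel (or dropped on a
 * non-retryable error), false if the caller should retry after the next
 * readiness notification */
static bool try_send(int fd, const struct msghdr *msg)
{
    ssize_t rc = sendmsg(fd, msg, MSG_DONTWAIT | MSG_NOSIGNAL);
    if (rc >= 0)
        return true;                     /* sent */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return false;                    /* tx queue full: back to epoll_wait */
    return true;                         /* other errors: treat as non-critical */
}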


(although this seems different from the original bug where it got stuck
in epoll_wait)


I'm pretty sure it is.

Anyway, let's try "sched_yield" idea. Could you please try included
patch and see if it makes any difference (only for udpu)?


Thanks for the patch, unfortunately corosync still spins 106% even with
yield:
https://clbin.com/CF64x


Yep, it was kind of expected, but at least worth a try. How does strace 
look when this happens?


Also Klaus had an idea to try remove sbd from the picture and try 
different RR process to find out what happens. And I think it's again 
worth try.


Could you please try install/enable/start 
https://github.com/jfriesse/spausedd (packages built by copr are 
https://copr.fedorainfracloud.org/coprs/honzaf/spausedd/), 
disable/remove sbd and run your test?




On another host corosync failed to start up completely (Denied
connection not ready), and:
https://clbin.com/Z35Gl
(I don't think this is related to the patch, it was doing that before
when I looked at it this morning, kernel 4.20.0 this time)


This one looks kind of normal and I'm pretty sure it's unrelated (I've
seen it already; sadly, I was never able to find a "reliable" reproducer).


Regards,
  Honza



Best regards,
--Edwin



Regards,
   Honza




Does the kernel log anything in that situation?


Other than the crmd segfault no.
  From previous observations on xenserver the softirqs were all stuck on
the CPU that corosync hogged 100% (I'll check this on upstream, but I'm
fairly sure it'll be the same). softirqs do not run at realtime priority
(if we increase the priority of ksoftirqd to realtime then it all gets
unstuck), but seem to be essential for whatever corosync is stuck
waiting on, in this case likely the sending/receiving of network packets.

I'm trying to narrow down the kernel between 4.19.16 and 4.20.10 to see
why this was only reproducible on 4.19 so far.

Best regards,
--Edwin





Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 19/02/19 16:41 +, Edwin Török wrote:
> Also noticed this:
> [ 5390.361861] crmd[12620]: segfault at 0 ip 7f221c5e03b1 sp
> 7ffcf9cf9d88 error 4 in libc-2.17.so[7f221c554000+1c2000]
> [ 5390.361918] Code: b8 00 00 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00
> c3 0f 1f 80 00 00 00 00 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77
> 19  0f 6f 0f 66 0f 74 c1 66 0f d7 d0 85 d2 75 7a 48 89 f8 48 83 e0

By any chance, is this an unmodified pacemaker package as obtainable
from some public repo together with debug symbols?

-- 
Jan (Poki)




Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Edwin Török
On 20/02/2019 07:57, Jan Friesse wrote:
> Edwin,
>>
>>
>> On 19/02/2019 17:02, Klaus Wenninger wrote:
>>> On 02/19/2019 05:41 PM, Edwin Török wrote:
 On 19/02/2019 16:26, Edwin Török wrote:
> On 18/02/2019 18:27, Edwin Török wrote:
>> Did a test today with CentOS 7.6 with upstream kernel and with
>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
>> patched [1] SBD) and was not able to reproduce the issue yet.
> I was able to finally reproduce this using only upstream components
> (although it seems to be easier to reproduce if we use our patched
> SBD,
> I was able to reproduce this by using only upstream packages unpatched
> by us):
>>>
>>> Just out of curiosity: What did you patch in SBD?
>>> Sorry if I missed the answer in the previous communication.
>>
>> It is mostly this PR, which calls getquorate quite often (a more
>> efficient impl. would be to use the quorum notification API like
>> dlm/pacemaker do, although see concerns in
>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>> https://github.com/ClusterLabs/sbd/pull/27
>>
>> We have also added our own servant for watching the health of our
>> control plane, but that is not relevant to this bug (it reproduces with
>> that watcher turned off too).
>>
>>>
 I was also able to get a corosync blackbox from one of the stuck VMs
 that showed something interesting:
 https://clbin.com/d76Ha

 It is looping on:
 debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
 (non-critical): Resource temporarily unavailable (11)
>>>
>>> Hmm ... something like tx-queue of the device full, or no buffers
>>> available anymore and kernel-thread doing the cleanup isn't
>>> scheduled ...
>>
>> Yes that is very plausible. Perhaps it'd be nicer if corosync went back
>> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
> 
> But this is exactly what happens. Corosync will call sendmsg to all
> active udpu members and returns back to main loop -> epoll_wait.
> 
>> (although this seems different from the original bug where it got stuck
>> in epoll_wait)
> 
> I'm pretty sure it is.
> 
> Anyway, let's try "sched_yield" idea. Could you please try included
> patch and see if it makes any difference (only for udpu)?

Thanks for the patch, unfortunately corosync still spins 106% even with
yield:
https://clbin.com/CF64x

On another host corosync failed to start up completely (Denied
connection not ready), and:
https://clbin.com/Z35Gl
(I don't think this is related to the patch, it was doing that before
when I looked at it this morning, kernel 4.20.0 this time)

Best regards,
--Edwin

> 
> Regards,
>   Honza
> 
>>
>>> Does the kernel log anything in that situation?
>>
>> Other than the crmd segfault no.
>>  From previous observations on xenserver the softirqs were all stuck on
>> the CPU that corosync hogged 100% (I'll check this on upstream, but I'm
>> fairly sure it'll be the same). softirqs do not run at realtime priority
>> (if we increase the priority of ksoftirqd to realtime then it all gets
>> unstuck), but seem to be essential for whatever corosync is stuck
>> waiting on, in this case likely the sending/receiving of network packets.
>>
>> I'm trying to narrow down the kernel between 4.19.16 and 4.20.10 to see
>> why this was only reproducible on 4.19 so far.
>>
>> Best regards,
>> --Edwin
>>
>>
>>


Re: [ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Klaus Wenninger
On 02/20/2019 08:07 AM, Ulrich Windl wrote:
 Klaus Wenninger  wrote on 19.02.2019 at 18:02 in
> message <7b626ca1-4f59-6257-bfb5-ef5d0d823...@redhat.com>:
> [...]
>>> It is looping on:
>>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>> (non-critical): Resource temporarily unavailable (11)
> I wonder whether this is the reason for looping or the consequence of 
> loop-sending. To me it looks like a good idea to try sched_yield() in this 
> situation. Maybe then the other tasks have a chance to empty the send queue.

Doesn't that just trigger RR? So if the other threads aren't SCHED_RR at the
same prio, would it help?

>
>> Hmm ... something like tx-queue of the device full, or no buffers
>> available anymore and kernel-thread doing the cleanup isn't
>> scheduled ...
>> Does the kernel log anything in that situation?
>>
>>> Also noticed this:
>>> [ 5390.361861] crmd[12620]: segfault at 0 ip 7f221c5e03b1 sp
>>> 7ffcf9cf9d88 error 4 in libc-2.17.so[7f221c554000+1c2000]
>>> [ 5390.361918] Code: b8 00 00 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00
>>> c3 0f 1f 80 00 00 00 00 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77
>>> 19  0f 6f 0f 66 0f 74 c1 66 0f d7 d0 85 d2 75 7a 48 89 f8 48 83 e0
> Maybe time to enable core dumps...
>
> [...]
>
> Regards,
> Ulrich Windl
>
>

