On Thu, 24 Aug 2023 at 22:18, Richard Purdie <
richard.pur...@linuxfoundation.org> wrote:

> On Thu, 2023-08-24 at 15:04 +0100, Richard Purdie via
> lists.openembedded.org wrote:
> > On Wed, 2023-08-23 at 22:16 +0100, Richard Purdie via
> > lists.openembedded.org wrote:
> > > On Tue, 2023-08-22 at 23:01 +0100, Richard Purdie via
> > > lists.openembedded.org wrote:
> > > > so the commands are stopping mid flow for unknown reasons or the ssh
> > > > connection fails. I can't tell if this coincides with an rcu stall or
> > > > not. Both logs do have rcu stalls in.
> > > >
> > > > After these failures the system does continue to otherwise work
> > > > normally and subsequent tests pass.
> > > >
> > > > I wonder if the slow emulation might be causing the networking to
> > > > glitch and break the ssh connection.
> > > >
> > > > I'm at a bit of a loss on where to go from here.
> > >
> > > I thought I'd update the thread with new information.
> > >
> > > I went back to the start with this and looked again at what is going
> > > on. Interestingly, I found one of the autobuilder workers would
> > > consistently fail the qemuppc-alt configuration for core-image-sato-
> > > sdk. I paused the worker and experimented.
> > >
> > > I saw two different failures (included below). One shows systemd-udevd
> > > timing out on its watchdog after 3 minutes and resetting, including
> > > taking out an ssh session running the cpio configure command. There was
> > > no RCU stall reported.
> > >
> > > The second failure shows systemd-logind as well as systemd-udevd with
> > > the 3 minute time out, the kernel complaining about missed IRQs, an RCU
> > > stall and lots of breakage following including cut ssh commands.
> > >
> > > I could not get the cpio build test to complete.
> > >
> > > Interestingly, I came back to the same image/worker later this evening
> > > and now it all works fine. The difference is earlier there was a world
> > > build running on the worker, which continued to wind down even after I
> > > paused the worker. By the evening, that background load was no longer
> > > present and the ppc image works in isolation. This tells us the issue
> > > is system load dependent and only occurs on loaded systems.
> > >
> > > I suspect I need to replicate the load and retry locally, see if I can
> > > reliably reproduce the hang. The watchdog won't be present on sysvinit
> > > systems which also show the issues but I'd guess there is still some
> > > other starvation/timeout occurring.
> >
> > I've now seen the failure on the autobuilder:
> >
> > * with linux-yocto 6.1.38
> > * with linux-yocto 6.1.46
> > * with qemu 8.0.4
> > * with qemu 8.0.3
> > * with qemu 8.0.0
> >
> > I was a little suspicious of:
> >
> > "hw/ppc: Fix clock update drift"
> >
> > https://gitlab.com/qemu-project/qemu/-/commit/73d6ac24c81f1aeae554d469616c9181511e6523
> >
> > but we've tested with and without that.
> >
> > qemu has just released 8.1.0 so perhaps we should try that next.
>
> qemu 8.1.0 brings with it a new set of problems but I've reproduced the
> hang with 8.1.0 so it does not solve that.
>
> I'm really struggling to understand which change brought in these
> issues for qemuppc.
>
> Cheers,
>
> Richard
>

Hello Richard,

I didn't understand the issues, but I recently came across some of the
keywords you used here (rcu, NOHZ warnings, ratelimit...) in a Linux RT
thread I just read: https://www.spinics.net/lists/linux-rt-users/msg27085.html

I hope you find it helpful for your investigation, but if you were
already aware of it, my bad.

Cheers.

