On Thu, 24 Aug 2023 at 22:18, Richard Purdie <richard.pur...@linuxfoundation.org> wrote:
> On Thu, 2023-08-24 at 15:04 +0100, Richard Purdie via
> lists.openembedded.org wrote:
> > On Wed, 2023-08-23 at 22:16 +0100, Richard Purdie via
> > lists.openembedded.org wrote:
> > > On Tue, 2023-08-22 at 23:01 +0100, Richard Purdie via
> > > lists.openembedded.org wrote:
> > > > so the commands are stopping mid flow for unknown reasons or the ssh
> > > > connection fails. I can't tell if this coincides with an rcu stall or
> > > > not. Both logs do have rcu stalls in.
> > > >
> > > > After these failures the system does continue to otherwise work
> > > > normally and subsequent tests pass.
> > > >
> > > > I wonder if the slow emulation might be causing the networking to
> > > > glitch and break the ssh connection.
> > > >
> > > > I'm at a bit of a loss on where from here.
> > >
> > > I thought I'd update the thread with new information.
> > >
> > > I went back to the start with this and looked again at what is going
> > > on. Interestingly, I found one of the autobuilder workers would
> > > consistently fail the qemuppc-alt configuration for
> > > core-image-sato-sdk. I paused the worker and experimented.
> > >
> > > I saw two different failures (included below). One shows systemd-udevd
> > > timing out on its watchdog after 3 minutes and resetting, including
> > > taking out an ssh session running the cpio configure command. There was
> > > no RCU stall reported.
> > >
> > > The second failure shows systemd-logind as well as systemd-udevd with
> > > the 3 minute time out, the kernel complaining about missed IRQs, an RCU
> > > stall and lots of breakage following including cut ssh commands.
> > >
> > > I could not get the cpio build test to complete.
> > >
> > > Interestingly, I came back to the same image/worker later this evening
> > > and now it all works fine. The difference is earlier there was a world
> > > build running on the worker, which continued to wind down even after I
> > > paused the worker. By the evening, that background load was no longer
> > > present and the ppc image works in isolation. This tells us the issue
> > > is system load dependent and only occurs on loaded systems.
> > >
> > > I suspect I need to replicate the load and retry locally, see if I can
> > > reliably reproduce the hang. The watchdog won't be present on sysvinit
> > > systems which also show the issues but I'd guess there is still some
> > > other starvation/timeout occurring.
> >
> > I've now seen the failure on the autobuilder:
> >
> > * with linux-yocto 6.1.38
> > * with linux-yocto 6.1.46
> > * with qemu 8.0.4
> > * with qemu 8.0.3
> > * with qemu 8.0.0
> >
> > I was a little suspicious of:
> >
> > "hw/ppc: Fix clock update drift"
> > https://gitlab.com/qemu-project/qemu/-/commit/73d6ac24c81f1aeae554d469616c9181511e6523
> >
> > but we've tested with and without that.
> >
> > qemu has just released 8.1.0 so perhaps we should try that next.
>
> qemu 8.1.0 brings with it a new set of problems but I've reproduced the
> hang with 8.1.0 so it does not solve that.
>
> I'm really struggling to understand which change brought in these
> issues for qemuppc.
>
> Cheers,
>
> Richard

Hello Richard,

I didn't understand the issues, but I recently came across some of the
keywords you used here (rcu, NOHZ warnings, ratelimit...) in a Linux rt
thread I just read:

https://www.spinics.net/lists/linux-rt-users/msg27085.html

I hope it helps your investigation. If you were already aware of it,
my bad.

Cheers.