Hi,

On Fri, Aug 25, 2023 at 07:34:25AM +0100, Richard Purdie wrote:
> On Fri, 2023-08-25 at 09:27 +0300, Mikko Rapeli wrote:
> > Hi,
> >
> > On Thu, Aug 24, 2023 at 09:18:03PM +0100, Richard Purdie wrote:
> > > On Thu, 2023-08-24 at 15:04 +0100, Richard Purdie via
> > > lists.openembedded.org wrote:
> > > > On Wed, 2023-08-23 at 22:16 +0100, Richard Purdie via
> > > > lists.openembedded.org wrote:
> > > > > On Tue, 2023-08-22 at 23:01 +0100, Richard Purdie via
> > > > > lists.openembedded.org wrote:
> > > > > > so the commands are stopping mid-flow for unknown reasons, or the
> > > > > > ssh connection fails. I can't tell whether this coincides with an
> > > > > > rcu stall or not. Both logs do have rcu stalls in them.
> > > > > >
> > > > > > After these failures the system does continue to otherwise work
> > > > > > normally, and subsequent tests pass.
> > > > > >
> > > > > > I wonder if the slow emulation might be causing the networking to
> > > > > > glitch and break the ssh connection.
> > > > > >
> > > > > > I'm at a bit of a loss on where to go from here.
> > > > >
> > > > > I thought I'd update the thread with new information.
> > > > >
> > > > > I went back to the start with this and looked again at what is
> > > > > going on. Interestingly, I found one of the autobuilder workers
> > > > > would consistently fail the qemuppc-alt configuration for
> > > > > core-image-sato-sdk. I paused the worker and experimented.
> > > > >
> > > > > I saw two different failures (included below). One shows
> > > > > systemd-udevd timing out on its watchdog after 3 minutes and
> > > > > resetting, including taking out an ssh session running the cpio
> > > > > configure command. There was no RCU stall reported.
> > > > >
> > > > > The second failure shows systemd-logind as well as systemd-udevd
> > > > > with the 3 minute timeout, the kernel complaining about missed
> > > > > IRQs, an RCU stall and lots of breakage following, including cut
> > > > > ssh commands.
> > > > >
> > > > > I could not get the cpio build test to complete.
> > > > >
> > > > > Interestingly, I came back to the same image/worker later this
> > > > > evening and now it all works fine. The difference is that earlier
> > > > > there was a world build running on the worker, which continued to
> > > > > wind down even after I paused the worker. By the evening, that
> > > > > background load was no longer present and the ppc image works in
> > > > > isolation. This tells us the issue is system-load dependent and
> > > > > only occurs on loaded systems.
> > > > >
> > > > > I suspect I need to replicate the load and retry locally, to see
> > > > > if I can reliably reproduce the hang. The watchdog won't be
> > > > > present on sysvinit systems, which also show the issues, but I'd
> > > > > guess there is still some other starvation/timeout occurring.
> > > >
> > > > I've now seen the failure on the autobuilder:
> > > >
> > > > * with linux-yocto 6.1.38
> > > > * with linux-yocto 6.1.46
> > > > * with qemu 8.0.4
> > > > * with qemu 8.0.3
> > > > * with qemu 8.0.0
> > > >
> > > > I was a little suspicious of:
> > > >
> > > > "hw/ppc: Fix clock update drift"
> > > > https://gitlab.com/qemu-project/qemu/-/commit/73d6ac24c81f1aeae554d469616c9181511e6523
> > > >
> > > > but we've tested with and without that.
> > > >
> > > > qemu has just released 8.1.0 so perhaps we should try that next.
> > >
> > > qemu 8.1.0 brings with it a new set of problems, but I've reproduced
> > > the hang with 8.1.0 so it does not solve that.
> > >
> > > I'm really struggling to understand which change brought in these
> > > issues for qemuppc.
> >
> > Are these issues visible on the mickledore branch? Maybe mickledore
> > with the kernel 6.1 stable update or the qemu 7.2 to 8.y.x update
> > could be tested too. At least then the kernel or qemu could be blamed
> > for the issues.
>
> Not that I know of.
>
> I have now also reproduced the failure with glibc 2.37 instead of 2.38,
> including the fortify sources change, and with the 6.1.34 kernel, so
> there is something else causing this.
>
> I've wondered if we need to try going back to qemu 7.2. It may also be
> worth ruling out binutils.
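Before anything gets downgraded, one cheap experiment on a loaded worker
might be to rule the timeouts themselves in or out. The 3 minute figure in
your logs matches systemd-udevd's default WatchdogSec (the upstream unit
sets WatchdogSec=3min, if I read it right), so raising it together with the
kernel's RCU stall window would show whether the guest is merely starved
under host load rather than genuinely wedged. A rough sketch, untested on
qemuppc; the drop-in path and the 10min/300s values are just guesses to
play with:

  # Inside the guest, as root: give systemd-udevd more headroom before
  # its service watchdog fires and the service gets reset.
  mkdir -p /etc/systemd/system/systemd-udevd.service.d
  cat > /etc/systemd/system/systemd-udevd.service.d/10-watchdog.conf <<'EOF'
  [Service]
  WatchdogSec=10min
  EOF
  systemctl daemon-reload
  systemctl restart systemd-udevd

  # On the guest kernel command line (e.g. via runqemu bootparams="..."
  # or QB_KERNEL_CMDLINE_EXTRA): widen the RCU stall detector, in seconds.
  #   rcupdate.rcu_cpu_stall_timeout=300

If the failures simply move out to the longer timeouts, that points at
starvation of the emulated machine under host load rather than at a udev or
kernel bug.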
Yes, I'd have no objection to a qemu downgrade if that helps with stability
and the release deadline. I trust you don't do this lightly, and I would
much prefer it to you burning out while hunting fixes for the various
issues. In product environments I've done this a lot: changes get reverted
if they cause too much instability and fixes don't arrive from the
responsible developers within some limited time.

> It shouldn't be systemd as the sysvinit images show the issue too.

FWIW, the poky master branch with the 6.4 kernel is working well on our
arm64 boards and CI results are stable, same with qemu-arm64 on qemu 7.2
and the 8.0.x versions.

Cheers,

-Mikko
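P.S. For replicating the load dependence locally, something along these
lines is roughly what I'd try. stress-ng here is a hypothetical stand-in
for the world build that was winding down on the worker, and the flags and
duration are guesses to tune:

  # Load the host roughly like a lingering world build, then run the
  # failing image test against it. Assumes a poky build environment
  # where MACHINE is passed through to bitbake (the default setup).
  stress-ng --cpu "$(nproc)" --io 4 --vm 2 --timeout 2h &
  LOAD_PID=$!

  MACHINE=qemuppc bitbake core-image-sato-sdk -c testimage

  kill "$LOAD_PID"

If the hang only appears with the background load present, that at least
gives a reliable local reproducer to bisect against.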