On Wed, 2023-10-11 at 00:49 -0400, Paul Gortmaker via
lists.openembedded.org wrote:
> [Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 
> summary)] On 10/10/2023 (Tue 21:53) Richard Purdie via lists.openembedded.org 
> wrote:
> 
> > On Tue, 2023-10-10 at 21:37 +0100, Richard Purdie via
> > lists.openembedded.org wrote:
> > > I've been asked how to reproduce this setup more easily outside of
> > > OEQA. I hacked this up:
> > > 
> > > https://www.rpsys.net/wp/rp/simpleqemu.tgz
> > > 
> > > extract it, hack the paths in commands.py to point at a qemux86-64
> > > build of core-image-minimal and then "./commands" should boot a qemu
> > > image using the system's qemu binary as it would during the QA tests,
> > > setting up the two serial ports as it would during tests.
> > > 
> > > Obviously I've just brutally hacked this out of the QA code so there
> > > are plenty of sharp edges. It should allow the serial setup to be
> > > replicated and experimented with more easily though without bitbake
> > > involved. It could be cut down a lot more too.
> > 
> > and set runqemuparams = 'nographic' in commands.py if you can't cope
> > with graphics :).
> 
> So I think we are onto something here.  Let me rewind a little to
> provide some context for those who didn't follow the IRC discussion.
> 
> The OEQA code runs qemu with the "-serial tcp:<IP>:<port>" option and
> connects to that socket directly from python.  In theory, one should
> also be able to use netcat/socat (or even telnet) to connect to the qemu
> TCP port for manual command line testing without the autobuilder/OEQA
> python.
> 
> But I couldn't get that to work without seeing "broken" echo behaviour
> and similar odd tty/terminal characters being displayed.  As that was also
> observed on v6.1 and mickledore, I didn't want to go off on a wild goose
> chase after something that probably has nothing to do with the issue at
> hand here.
> 
> Hence what Richard created above.  The basic premise is that, similar
> to the AB output, we are looking for log files ending in ".2" (i.e.
> from ttyS1) that are smaller than normal, indicating that a getty
> wasn't able to get the login banner out successfully.  Now the results.
> 
> After 300 runs on mickledore with v6.1, all the .2 files were the same
> size - 477 bytes.
> 
> After 155 runs on today's master with v6.5 kernel, I got two runs where
> instead of 477 bytes, they were 345 bytes.
> 
> dot-two$ls -l1|wc -l
> 155
> dot-two$ls -l1| grep -v 477
> total 616
> -rw-r--r-- 1 pgortmak users 345 Oct 10 17:54 qemu_boot_log.20231010175351.2
> -rw-r--r-- 1 pgortmak users 345 Oct 10 18:43 qemu_boot_log.20231010184307.2
> 
> The difference?  The "short" files ended with "Starting crond: OK" and
> were missing the login banner:
>    --------
>    Poky (Yocto Project Reference Distro) 
> 4.2+snapshot-0fb1ff0ffb527a6364cef7159c0d9dcb879e8dd9 qemux86-64 ttyS1
> 
>    qemux86-64 login:
>    --------
> 
> Similarly, for the ttyS0 logs, if we strip off the printk.time prefixes,
> thus allowing us to diff the corresponding "outlier" case against any
> one of the 153 "normal" cases, we see that this line is missing:
> 
> Last login: Tue Oct 10 22:40:31 +0000 2023 on /dev/ttyS1.
> 
> ...which makes sense, since the login never appeared on ttyS1 for OEQA
> to interact with and login to.
> 
> (interesting that otherwise, ttyS1 always "wins" the race to update
> wtmp/utmp over ttyS0)
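[The prefix stripping is a one-liner; a hypothetical version, assuming the standard "[ seconds.micros] " printk.time format, shown on a single sample line:]

```shell
# Hypothetical: strip printk.time prefixes ("[   12.345678] ") so two boot
# logs can be diffed; demonstrated here on one sample line rather than a
# real log file.
printf '[    3.141592] Starting crond: OK\n' \
    | sed -E 's/^\[ *[0-9]+\.[0-9]+\] //'
```

[Run over both full logs, the cleaned copies can then be handed straight to diff.]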
> 
> Anyway, that works out to a dismal reproduction rate of approximately
> 1.5% by my math.  Ugh.  But at least it isn't coupled to a full build!
> 
> Finally, using an identical build on master but with PREF VERSION of
> v6.4 kernel, I ran 300+ instances and got zero instances of short .2
> ttyS1 log files that were missing the getty output.
> 
> Note that these runs were using the default qemu v4.x that ships with
> ubuntu-20.04 ; I've not (yet?) run similar tests with our own more
> up-to-date qemu build from the sysroot.  Maybe I don't need to?
> 
> Possible conclusions:
> 
> The reproducibility seems pretty low, which seems to match what I've
> heard with respect to the AB results.  It also seems confined to the
> v6.5.x kernel, based on these limited results - matching the AB results.
> 
> So maybe we have a reproducer that is independent of OEQA and the
> backing of a full build to get there?  I'd like to think so, but I'm
> going to run more v6.5+master tests overnight to try and get better
> stats - a +/- of just 1 changes my current reproduction rate by a large
> percentage!
> 
> 
> Next steps:
> 
> Each run takes approximately 20s, and with a 1.5% hit rate, a minimum
> of 150 clean runs is needed before one can say "OK, *maybe* this
> kernel currently under test is OK".  That translates to roughly an
> hour to get a data point you have some faith in as being "good".
> [Still a heck of a lot better than doing 150 *builds* though!]
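[For what it's worth, the statistics behind that: after N clean runs, the chance of a genuinely bad kernel slipping through as "good" is (1 - p)^N, so 150 clean runs at the ~1.5% rate still leaves roughly a 10% chance of a false "good":]

```shell
# Chance that 150 runs all pass on a kernel that actually fails 1.5% of
# the time: (1 - 0.015)^150 ~= 0.104, i.e. about a 10% false-"good" risk.
awk 'BEGIN { p = 0.015; n = 150; printf "%.3f\n", (1 - p)^n }'
```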
> 
> On the other hand, just one instance of missing login on ttyS1 is
> concretely "bad" and hence the test run can be stopped as soon as one
> instance of that is seen.
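[That stop-on-first-bad check is easy to script; here is a hypothetical version of the size test, using the 477-byte "good" size from the runs above and synthetic files for illustration:]

```shell
# Hypothetical detection of a "bad" run: any .2 log smaller than the
# known-good 477 bytes.  Synthetic files stand in for real qemu_boot_log
# output here.
cd "$(mktemp -d)"
head -c 477 /dev/zero > qemu_boot_log.20231010175351.2   # normal run
head -c 345 /dev/zero > qemu_boot_log.20231010184307.2   # short/"bad" run
find . -name 'qemu_boot_log.*.2' -size -477c             # prints the bad one
```

[In a real loop you would run ./commands.py each iteration and break as soon as that find prints anything.]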
> 
> This is the "good is really maybe perhaps good, but bad is for sure
> bad" type of bisect that I love to hate.
> 
> Since the data points are "expensive", I probably won't just attempt a
> mindless brute force bisect.  I'd probably do something more like...
> 
> --ensure modules (or the lack thereof) don't impact the results, so I
>   don't have to rebundle rootfs images.
> 
> --maybe see if I can build with host gcc vs. using devshell/sysroot gcc
>   and still see the issue.  Although really, kernel build time optimization
>   won't be the bottleneck in this particular case, so not worthwhile?
> 
> --check vanilla v6.5 (i.e. v6.5.0 -- no linux-stable content) so I know
>   whether to dig around in v6.5.0 --> v6.5.5, or dig around in the
>   much much larger v6.4 --> v6.5 additions.  I expect it to be the
>   latter, i.e. digging through mainline, but who knows.
> 
> --assuming mainline introduced it, instead of a brute force bisect, I'd
>   look for merges by Linus from gregKH pull requests for the tty and
>   serial subsystems in the merge window (v6.4 --> v6.5-rc1) and test
>   both sides of that/those merge(s).
> 
> --if things haven't completely fallen apart by that point, I'd be left
>   with a relatively small set of subsystem specific commits from a single
>   pull request which then could be bisected.
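[That merge hunting could start with something like the following (hypothetical; assumes a linux.git clone and relies on tty/serial being mentioned in Linus's merge messages, which the "Merge tag 'tty-X.Y-rc1'" convention provides). The real command is in the comment; below it, the same tag/merge shape is demonstrated on a throwaway repo:]

```shell
# Hypothetical: in a linux.git clone, list the mainline merge commits in
# the v6.4..v6.5-rc1 window whose messages mention tty or serial:
#
#   git log --oneline --merges --first-parent v6.4..v6.5-rc1 \
#       --grep='tty' --grep='serial'
#
# (--first-parent keeps the walk on Linus's own merges; multiple --grep
# patterns are OR'd.)  Demonstrated below on a throwaway repo with the
# same tag/merge shape.
cd "$(mktemp -d)" && git init -q .
git config user.email dev@example.com && git config user.name dev
echo base > f && git add f && git commit -qm base && git tag v6.4
git checkout -qb tty-work
echo fix >> f && git commit -qam 'serial: core: hypothetical fix'
git checkout -q -
git merge -q --no-ff -m "Merge tag 'tty-6.5-rc1' of gregkh/tty" tty-work
git tag v6.5-rc1
git log --oneline --merges --first-parent v6.4..v6.5-rc1 \
    --grep='tty' --grep='serial'
```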
> 
> Sometimes in the past, when I've got down to a small set of commits
> like that, the culprit has almost jumped out at me just from scanning
> the shortlogs.
> 
> Anyway, that is where my thoughts are at.  But of course we don't yet
> know if this "test" is even accurately reflecting the OEQA/AB issue!
> So nobody should get their hopes up too early.
> 
> As a side note, and something that was briefly discussed on IRC, it
> seems to make sense to someday informally "support" the idea of
> developers using a chunk of the OEQA like this in stand-alone mode,
> since we always seem to be chasing issues that only show up in AB
> output, with people struggling to reproduce them elsewhere.

You've been doing some great work on this, thanks!

It is worth noting that for the purposes of experimentation, cutting
the 1000s timeout in the script to something more reasonable like 60s
is fine. If the thing fails, it hangs and you know about it quickly.

I did have some experiments of my own which I thought I'd share.

I hacked the script to append sys.argv[1] to the log filename, then
experimented with a command like:

$ for i in `seq 1 88`; do ./commands.py $i & done

which launches 88 qemus in parallel. When you do this and watch the log
file sizes, it is clear that several of them "hang", for me at different
points:
$ ls *.2 -la | cut -d ' ' -f 5 | sort | uniq -c
      1 134
      1 249
      1 251
      2 254
      1 255
     51 273

I think a few also crash due to races, since that total clearly isn't 88,
but I'm not worrying too much about that.

I also came up with the idea of running "cat /proc/tty/driver/serial"
before and after the "echo helloA > /dev/ttyS1" command. This was
interesting as it resulted in:

root@qemux86-64:~# cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:418 rx:43 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:000002F8 irq:3 tx:249 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3
root@qemux86-64:~# echo helloA > /dev/ttyS1
root@qemux86-64:~# echo helloB > /dev/ttyS0
helloB
root@qemux86-64:~# cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:803 rx:121 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:000002F8 irq:3 tx:281 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3

The .2 logfile in this specific case was 249 bytes:

-rw-rw-r--  1 richard richard      249 Oct 13 12:50 
qemu_boot_log.2023101312503368.2

which corresponds to the "tx:249". The last line was:

"""
Poky (Yocto Project Reference Distro) 
4.2+snapshot-5b7479692b49154f4c299496974e7c81892ae336 qemux86-64 /dev/ttyS1
"""
i.e. the login prompt wasn't shown.

After the echo of "helloA\n" (7 chars, which becomes 8 bytes on the wire
once the tty layer's ONLCR processing turns \n into \r\n) we see the tx
count increase to 281, so the login banner *and* the hello are both sent:
the expected 273 + 8 == 281.

So what this seems to prove is that the kernel is sitting on the data;
our qemu processing side of this likely isn't at fault, as the kernel
simply hadn't transmitted it yet.
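For anyone scripting that before/after comparison, a hypothetical parse of
the ttyS1 tx counter, run here against a saved copy of the /proc output
quoted above:

```shell
# Hypothetical: pull the tx counter for line "1:" (ttyS1) out of saved
# /proc/tty/driver/serial output; the sample is pasted from the log above.
sample='serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:418 rx:43 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:000002F8 irq:3 tx:249 rx:0 RTS|CTS|DTR|DSR|CD'
printf '%s\n' "$sample" \
    | awk '/^1:/ { for (i = 1; i <= NF; i++) if ($i ~ /^tx:/) print substr($i, 4) }'
```

Capturing that number before and after the echo gives the 249 -> 281 delta
without eyeballing the /proc output.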

Cheers,

Richard





