So, here is an update for those not following on IRC:

[Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 
summary)] On 11/10/2023 (Wed 00:50) Paul Gortmaker wrote:

[snip details of getting RP's chunk of AB python working as a 1% reproducer]

> Next steps:
> 
> Each run takes approximately 20s, and with a 1.5% success factor, a
> minimum of 150 clean runs is needed before one can say "OK, *maybe* this
> kernel currently under test is OK.  That translates to over an hour to
> get a data point you have some faith in as being "good".  [Still a heck
> of a lot better than doing 150 *builds* though!]
> 
> On the other hand, just one instance of missing login on ttyS1 is
> concretely "bad" and hence the test run can be stopped as soon as one
> instance of that is seen.
> 
> This "good is really maybe perhaps good, but bad is for sure bad" type
> of bisects that I love to hate.
> 
> Since the data points are "expensive", I probably won't just attempt a
> mindless brute force bisect.  I'd probably do something more like...
> 
> --ensure modules (or lack of) don't impact the results, so I don't have
>   to rebundle rootfs images.

This turned out to be true - didn't need modules.

> --maybe see if I can build with host gcc vs. using devshell/sysroot gcc
>   and still see the issue.  Although really, kernel build time optimization
>   won't be the bottleneck in this particular case, so not worthwhile?

I didn't bother with the above, since build time was not significant here.

> --check vanilla v6.5 (i.e. v6.5.0 -- no linux-stable content) so I know
>   whether to dig around in v6.5.0 --> v6.5.5, or dig around in the
>   much much larger v6.4 --> v6.5 additions.  I expect it to be the
>   latter, i.e. digging through mainline, but who knows.

As expected, it followed v6.5.0 and linux-stable isn't to blame here.

> --assuming mainline introduced it, instead of a brute force bisect, I'd
>   look for merges by Linus from gregKH pull requests for the tty and
>   serial subsystems in the merge window (v6.4 ---> v6.5-rc1) and test
>   both sides of that/those merge(s).

This also largely went according to plan. I started with the tty merge:

v6.5-rc1~74:
868a9fd94807 Merge tag 'tty-6.5-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

and did burrow down into the 60-odd commits within.  One complication
was that there were interim mainline merges into the dev stream:

$ git log --oneline --merges v6.5-rc1~74^2  ^v6.5-rc1~74^1
2076b2a4a6b7 Merge 6.4-rc5 into tty-next
0e4daea31d83 Merge 6.4-rc3 into tty-next

..which usually indicate a dependency on work from another subsystem.
Fortunately the test pass/fail boundary didn't land on one of those.

> --if things haven't completely fallen apart by that point, I'd be left
>   with a relatively small set of subsystem specific commits from a single
>   pull request which then could be bisected.

We did somehow manage to get here with just 60 commits to consider.

> 
> Sometimes in the past, when I've got down to that small set of commits,
> the culprit almost jumps out at you when scanning over the shortlogs.

While I didn't immediately spot the commit via the shortlog, extensive
test iterations did give us confidence in pinpointing a single commit:

84a9582fd203 serial: core: Start managing serial controllers to enable runtime 
PM       *******         <---- FAIL (33/1436)
ae62c49c0cef serial: 8250: omap: convert to modern PM ops                       
                        <---- PASS (0/1593)

> Anyway, that is where my thoughts are at.  But of course we don't even
> yet know if this "test" is even accurately reflecting the OEQA/AB issue!
> So nobody should get their hopes up too early.

I think now, we can say with a solid degree of confidence that the PM
commit gets to own the responsibility of causing the ttyS1 getty not
flushing the login banner and prompt.  Hence why I did over 1000 runs on
both the PASS and FAIL commit boundary.

Unfortunately, there is a lot in that commit ; also a lot in the
discussion and the twelve iterations before acceptance that I've not
even looked at yet:

https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118...@gmail.com/T/#t

My "next steps" at this point would be to try and attempt to break down
that relatively large commit into multiple "sub-commits" in order to use
the AB test harness portion to better get an idea of which change
component is the root cause of the ttyS1 getty stall/seizure.

I won't get to that until Saturday at the earliest - going AFK after
sending this.  However, Mikko already opened a dialog with the
linux-serial folks a few days ago:

https://lore.kernel.org/all/ZSPQY6UYg21Z0PnN@nuoska/

Mikko -- if you want to take our pseudo bisect information (below) and
continue that thread; sharing what we've learned since, and also loop in
Tony (author of v6.4-rc3-31-g84a9582fd203) and others on the CC there,
that might be a good next step to do now, while I am AFK.

A quick technical note on testing - all my kernels have been built in
the same devshell instance, so same path, same compiler, same ".config",
same "test harness" from RP, same qemu, same underlying machine.  Both
Richcard and I have also found that loading on the host machine running
the qemu binary (ubu 20.04 default in my case) influences the rate of
incidence - an otherwise idle machine is less likely to show the issue
vs a machine that has other things going on -- indicating a qemu
scheduling component influences the incident rate.

Paul.
--

Note asterik "*****" marking suspect commit.

v6.5.7                                                                          
                        <---- FAIL (2/188)
v6.5.5                                                                          
                        <---- FAIL (13/1039)
v6.5.0                                                                          
                        <---- FAIL (6/422)
v6.5-rc1                                                                        
                        <---- FAIL (2/230)

v6.5-rc1~74:
868a9fd94807 Merge tag 'tty-6.5-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

$git log --oneline v6.5-rc1~74^2  ^v6.5-rc1~74^1
e534756c9412 tty_audit: make data of tty_audit_log() const                      
                        <---- FAIL (3/266)
e64ed44bce43 tty_audit: make tty pointers in exposed functions const
3e540a7bad85 tty_audit: make icanon a bool
94f94810efc4 tty_audit: invert the condition in tty_audit_log()
e1488513abee tty_audit: use kzalloc() in tty_audit_buf_alloc()
777e456852f7 tty_audit: use TASK_COMM_LEN for task comm
a82d62f70854 Revert "8250: add support for ASIX devices with a FIFO bug"
27a826837ec9 serial: atmel: don't enable IRQs prematurely
930cbf92db01 tty: serial: Add Nuvoton ma35d1 serial driver support
e0edfdc15863 tty: serial: fsl_lpuart: add earlycon for imx8ulp platform
639949a7031e tty: serial: imx: fix rs485 rx after tx
e8cc334847db selftests: tty: add selftest for tty timestamp updates
360c11e2258c tty: tty_io: update timestamps on all device nodes
4903fde8047a tty: fix hang on tty device with no_room set
cef09673c35b serial: core: fix -EPROBE_DEFER handling in init
20a41a62618d serial: 8250_omap: Use force_suspend and resume for system suspend
f3710f5e9e1a tty: serial: samsung_tty: Use abs() to simplify some code
832e231cff47 tty: serial: samsung_tty: Fix a memory leak in 
s3c24xx_serial_getclk() when iterating clk
a9c09546e903 tty: serial: samsung_tty: Fix a memory leak in 
s3c24xx_serial_getclk() in case of error
d9f59caf94a9 serial: 8250: Apply FSL workarounds also without 
SERIAL_8250_CONSOLE
1eea99c04555 powerpc/legacy_serial: Handle SERIAL_8250_FSL=n build failures
52861a3be8d6 serial: core: don't kfree device managed data
b6c7ff2693dd serial: 8250_mtk: Simplify clock sequencing and runtime PM
643662d12bc2 serial: st-asc: fix typo in property name
66eff0ef528b powerpc/legacy_serial: Warn about 8250 devices operated without 
active FSL workarounds
2076b2a4a6b7 Merge 6.4-rc5 into tty-next                                        
                        <---- FAIL (2/145)
49c80922ff81 serial: core: Fix error handling for serial_core_ctrl_device_add() 
                        <---- FAIL (6/992)
539914240a01 serial: core: Fix probing serial_base_bus devices                  
                        <---- FAIL (3/145)
d0a396083e91 serial: core: Don't drop port_mutex in serial_core_remove_one_port
84a9582fd203 serial: core: Start managing serial controllers to enable runtime 
PM       *******         <---- FAIL (33/1436)
ae62c49c0cef serial: 8250: omap: convert to modern PM ops                       
                        <---- PASS (0/1593)
c53aab207622 serial: 8250: omap: Move uart_write() inside PM section
d2d4bd217ccd serial: 8250-fsl: Expand description of the MPC83xx UART's 
misbehaviour
1b997aef4f54 serial: Indicate fintek option may also be required for RS232 
support
20ec397d694b tty: serial: fsl_lpuart: Check the return value of 
dmaengine_tx_status                     <---- PASS (0/316)
0d07703be74f serial: Switch i2c drivers back to use .probe()
d0b309a5d3f4 serial: 8250: synchronize and annotate UART_IER access
25614735a647 serial: 8250: lock port for UART_IER access in omap8250_irq()
8b45503776b6 serial: 8250: lock port for omap8250_restore_regs()
87660fb4041f serial: 8250: lock port for rx_dma() callback
51e45fba14bf serial: core: lock port for start_rx() in uart_resume_port()
ca73a892c5be serial: 8250: lock port for stop_rx() in omap8250_irq()
abcb0cf1f5b2 serial: core: lock port for stop_rx() in uart_suspend_port()
b1207d86169d serial: 8250: lock port in startup() callbacks
0e4daea31d83 Merge 6.4-rc3 into tty-next
d5b3d02d0b10 serial: Make uart_remove_one_port() return void
6bd6cd29c924 serial: stm32: Ignore return value of uart_remove_one_port() in 
.remove()
72fc578b220c serial: 8250_rt288x: Remove unnecessary UART_REG_UNMAPPED
33e3b0eb7550 serial: 8250_rt288x: Name non-standard divisor latch reg
b334214ea08d serial: 8250: RT288x/Au1xxx code away from core
30c61f53fdf2 serial: 8250: Add dl_read/write, bugs and mapsize into 
plat_serial8250_port
98658ae8f392 serial: 8250: Document uart_8250_port's ->dl_read/write()
b245aa0cc583 serial: 8250: Change dl_read/write to handle value as u32
9d86719f8769 serial: 8250: Allow using ports higher than 
SERIAL_8250_RUNTIME_UARTS
db86bb6ed497 serial: 8250: omap: Shut down on remove for console uart
fef4f600319e serial: 8250: omap: Fix life cycle issues for interrupt handlers
398cecc24846 serial: 8250: omap: Fix imprecise external abort for omap_8250_pm()
b9ab22c2bc86 serial: 8250: omap: Fix freeing of resources on failed register
6d8c1fca0a6e serial: pl011: set UART011_CR_RXE in pl011_set_termios after port 
shutdown
cf9aa72d2f91 tty: serial: fsl_lpuart: optimize the timer based EOP logic

Baseline of tty queue for v6.5 was:
$git show --oneline --stat cf9aa72d2f91~
ac9a78681b92 (tag: v6.4-rc1) Linux 6.4-rc1
 Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Have also 1000+ runs on Yocto's v6.4.16 based kernel with no fails.
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#189070): 
https://lists.openembedded.org/g/openembedded-core/message/189070
Mute This Topic: https://lists.openembedded.org/mt/101824562/21656
Group Owner: openembedded-core+ow...@lists.openembedded.org
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to