Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)

2023-10-19 Thread Richard Purdie
On Wed, 2023-10-18 at 08:28 +0300, Tony Lindgren wrote:
> * Richard Purdie  [231017 22:15]:
> > On Tue, 2023-10-17 at 09:56 +0300, Tony Lindgren wrote:
> > > * Richard Purdie  [231016 08:10]:
> > > > The port sometimes doesn't come up properly at boot.
> > > > 
> > > > To be clear, the "\n\n" from the qemu side into the port doesn't seem
> > > > to help. The "echo helloB > /dev/ttyS1" inside the image does seem to
> > > > wake it up. 
> > > 
> > > So if I understand correctly, this issue still happens with kernel patched
> > > with commit 81a61051e0ce ("serial: core: Fix checks for tx runtime PM
> > > state"), and the issue now only happens sometimes.
> > 
> > The issue has always been intermittent and it appeared to happen less
> > frequently with 81a61051e0ce added but it was hard to know if I was
> > imagining that.
> 
> Oh OK.
> 
> > > I wonder if the following additional change might help?
> > 
> > I've added it into testing and have not reproduced the failure with it
> > applied yet, locally or on our autobuilder. We need to sort some
> > release pieces which have been delayed by these issues and we're going
> > with a workaround for that. Once that is built I can get back to
> > testing this change more extensively, see if we can still provoke the
> > issue or not. It may take a day or two of testing before we know with
> > any certainty if the issue is resolved or not.
> 
> Thanks for the update, let's wait a few days and see then.

Our release build failed with our workaround, probably as this piece is
missing. 

I've not seen the failure occur in any build with it applied so far. As
such I think we'll move over to this patch as it seems to be better.
Whether it is fixed or not I'm still not 100% sure but it is looking
more likely. Thanks for the patch/fix!

Cheers,

Richard

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#189454): 
https://lists.openembedded.org/g/openembedded-core/message/189454
Mute This Topic: https://lists.openembedded.org/mt/101849211/21656
Group Owner: openembedded-core+ow...@lists.openembedded.org
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)

2023-10-17 Thread Richard Purdie
On Tue, 2023-10-17 at 09:56 +0300, Tony Lindgren wrote:
> * Richard Purdie  [231016 08:10]:
> > The port sometimes doesn't come up properly at boot.
> > 
> > To be clear, the "\n\n" from the qemu side into the port doesn't seem
> > to help. The "echo helloB > /dev/ttyS1" inside the image does seem to
> > wake it up. 
> 
> So if I understand correctly, this issue still happens with kernel patched
> with commit 81a61051e0ce ("serial: core: Fix checks for tx runtime PM
> state"), and the issue now only happens sometimes.

The issue has always been intermittent and it appeared to happen less
frequently with 81a61051e0ce added but it was hard to know if I was
imagining that.

> I wonder if the following additional change might help?

I've added it into testing and have not reproduced the failure with it
applied yet, locally or on our autobuilder. We need to sort some
release pieces which have been delayed by these issues and we're going
with a workaround for that. Once that is built I can get back to
testing this change more extensively, see if we can still provoke the
issue or not. It may take a day or two of testing before we know with
any certainty if the issue is resolved or not.

Cheers,

Richard



-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#189363): 
https://lists.openembedded.org/g/openembedded-core/message/189363
Mute This Topic: https://lists.openembedded.org/mt/101849211/21656
Group Owner: openembedded-core+ow...@lists.openembedded.org
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)

2023-10-16 Thread Richard Purdie
On Mon, 2023-10-16 at 10:23 +0300, Tony Lindgren wrote:
> * Mikko Rapeli  [231016 07:16]:
> > Hi,
> > 
> > On Mon, Oct 16, 2023 at 09:35:01AM +0300, Tony Lindgren wrote:
> > > * Richard Purdie  [231015 21:30]:
> > > > On Sun, 2023-10-15 at 17:31 +0200, Greg Kroah-Hartman wrote:
> > > > > Can you try the patch below?  I just sent it to Linus and it's from 
> > > > > Tony
> > > > > to resolve some other pm issues with the serial port code.
> > > > 
> > > > Thanks for the pointer to this. I've put it through some testing and
> > > > had one failure so far so I suspect this isn't enough unfortunately.
> > > > 
> > > > FWIW I was looping the testing on the complete removal of the
> > > > conditions and didn't see any failures with that.
> > > 
> > > Care to clarify what's the failing test now?

Failure is where the data never all makes it through ttyS1 and the
login prompt doesn't appear. In our CI this appears like this

https://autobuilder.yoctoproject.org/typhoon/#/builders/145/builds/711/steps/12/logs/stdio

Click the magnify glass to make searchable, then search for "Target
didn't reach login banner in 1000 seconds". You'll then see the echo
helloB to /dev/ttyS1, then the "Extra log data read:" containing the
getty for ttyS1 which woke up.

> > > 
> > > Is the issue still the second port not always coming up after boot or
> > > something else?
> > 
> > Yes, data from the ttyS1 getty is not coming through from kernel and qemu to
> > the test framework looking for login prompt after qemu machine boot.
> > Workarounds like sending "\n\n" from the test framework through qemu to 
> > ttyS1
> > or "echo helloB > /dev/ttyS1" via working ttyS0 don't seem to help and wake
> > it up.
> 
> OK so for trying to reproduce this with qemu, is this with the default uarts
> or with some -device pci-serial-2x type options?

The port sometimes doesn't come up properly at boot.

To be clear, the "\n\n" from the qemu side into the port doesn't seem
to help. The "echo helloB > /dev/ttyS1" inside the image does seem to
wake it up. 

The qemu command we're using is:

qemu-system-x86_64 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:02 
-serial tcp:127.0.0.1:50421 
-serial tcp:127.0.0.1:46457  
-netdev tap,id=net0,ifname=tap0,script=no,downscript=no 
-object rng-random,filename=/dev/urandom,id=rng0 -device 
virtio-rng-pci,rng=rng0 
-drive 
file=./core-image-minimal-qemux86-64.rootfs.ext4.19673,if=virtio,format=raw 
-usb -device usb-tablet -usb -device usb-kbd   
-cpu IvyBridge -machine q35,i8042=off -smp 4 -enable-kvm -m 256 
-pidfile /media/build/poky/aa/pidfile_19670  -S -qmp 
unix:./.rbmorp7r,server,wait -qmp unix:./.kf6y7yqg,server,nowait -nographic  
-kernel /media/build/poky/build/tmp/deploy/images/qemux86-64/bzImage 
-append 'root=/dev/vda rw  
ip=192.168.7.2::192.168.7.1:255.255.255.0::eth0:off:8.8.8.8 net.ifnames=0 
console=ttyS0 console=ttyS1 oprofile.timer=1 tsc=reliable no_timer_check 
rcupdate.rcu_expedited=1 swiotlb=0  printk.time=1'

This is with qemu 8.1.0. The image we're testing with usually doesn't
have an ssh server so we're using the serial ports for control/testing
and not the networking even if we configure it. 

We use the serial ports over the tcp connections to handle the multiple
ports and have python code for that. I did extract that code into a
more standalone form, https://www.rpsys.net/wp/rp/simpleqemu.tgz where
"./commands.py" will then run a boot and wait for the login prompt.
You'd need to set a path in commands.py to point at an images directory
Extra log data read:of an qemux86-64 core-image-minimal OE image build.
Setting runqemuparams = 'nographic' gets rid of the graphics need.

If you don't have a build, I shared a prebuilt image and hacked config
which you could point it at:
https://www.rpsys.net/wp/rp/simpleqemu-images.tgz

I did have to remove some of the cpu qemu options to make it work with
the older qemu versions often found on distros which don't support them
(the q35 machine). It will dump log files into the current directory
and there will be a log for each serial port.

I hacked the script to append sys.argv[1] to the log filename, then
experimented with a command like:

for i in `seq 1 88`; do ./commands.py $i & done

which launches 88 qemus in parallel. Sometimes you see them "hang"
through the size of the ".2" logfile:

$ ls *.2 -la | cut -d ' ' -f 5 | sort | uniq -c
  1 134
  1 249
  1 251
  2 254
  1 255
 51 273

273 is the correct size, the smaller ones are truncated. 

Sadly in some cases this test appears to work when the issue is still
present with the issue only showing up intermittently in our CI. It
does seem timing dependent.

Let me know if can provide any other info.

Cheers,

Richard


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#189267): 
https://lists.openembedded.org/g/openembedded-core/message/189267
Mute This Topic: 

Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)

2023-10-16 Thread Mikko Rapeli
Hi,

On Mon, Oct 16, 2023 at 10:23:52AM +0300, Tony Lindgren wrote:
> * Mikko Rapeli  [231016 07:16]:
> > On Mon, Oct 16, 2023 at 09:35:01AM +0300, Tony Lindgren wrote:
> > > * Richard Purdie  [231015 21:30]:
> > > > On Sun, 2023-10-15 at 17:31 +0200, Greg Kroah-Hartman wrote:
> > > > > Can you try the patch below?  I just sent it to Linus and it's from 
> > > > > Tony
> > > > > to resolve some other pm issues with the serial port code.
> > > > 
> > > > Thanks for the pointer to this. I've put it through some testing and
> > > > had one failure so far so I suspect this isn't enough unfortunately.
> > > >
> > > > FWIW I was looping the testing on the complete removal of the
> > > > conditions and didn't see any failures with that.
> > > 
> > > Care to clarify what's the failing test now?
> > > 
> > > Is the issue still the second port not always coming up after boot or
> > > something else?
> > 
> > Yes, data from the ttyS1 getty is not coming through from kernel and qemu to
> > the test framework looking for login prompt after qemu machine boot.
> > Workarounds like sending "\n\n" from the test framework through qemu to 
> > ttyS1
> > or "echo helloB > /dev/ttyS1" via working ttyS0 don't seem to help and wake
> > it up.
> 
> OK so for trying to reproduce this with qemu, is this with the default uarts
> or with some -device pci-serial-2x type options?

Here is how yocto test frameworks starts qemu:

qemu-system-x86_64 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:35:07 
-netdev 
user,id=net0,hostfwd=tcp:127.0.0.1:2225-:22,hostfwd=tcp:127.0.0.1:2325-:23,tftp=/home/builder/src/base/build_test/tmp/deploy/images/qemux86-64
 -object rng-random,filename=/dev/urandom,id=rng0 -device 
virtio-rng-pci,rng=rng0 -drive 
file=/home/builder/src/base/build_test/tmp/deploy/images/qemux86-64/core-image-ptest-openssh-qemux86-64.rootfs.ext4,if=virtio,format=raw
 -usb -device usb-tablet -usb -device usb-kbd   -cpu IvyBridge -machine 
q35,i8042=off -smp 4 -enable-kvm -m 1024 -serial tcp:127.0.0.1:46313 -serial 
tcp:127.0.0.1:53891  -pidfile /home/builder/src/base/build_test/pidfile_1926373 
 -S -qmp unix:./.lzkynxe1,server,wait -qmp unix:./.qcpdyz4j,server,nowait 
-snapshot -nographic  -kernel 
/home/builder/src/base/build_test/tmp/deploy/images/qemux86-64/bzImage -append 
'root=/dev/vda rw  ip=dhcp console=ttyS0 console=ttyS1 oprofile.timer=1 
tsc=reliable no_timer_check rcupdate.rcu_expedited=1 swiotlb=1  printk.time=1'

Example boot log when this ttyS1 failure happens is here:

https://pastebin.com/raw/jRRa2CwW

Cheers,

-Mikko

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#189265): 
https://lists.openembedded.org/g/openembedded-core/message/189265
Mute This Topic: https://lists.openembedded.org/mt/101849211/21656
Group Owner: openembedded-core+ow...@lists.openembedded.org
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)

2023-10-16 Thread Mikko Rapeli
Hi,

On Mon, Oct 16, 2023 at 09:35:01AM +0300, Tony Lindgren wrote:
> * Richard Purdie  [231015 21:30]:
> > On Sun, 2023-10-15 at 17:31 +0200, Greg Kroah-Hartman wrote:
> > > Can you try the patch below?  I just sent it to Linus and it's from Tony
> > > to resolve some other pm issues with the serial port code.
> > 
> > Thanks for the pointer to this. I've put it through some testing and
> > had one failure so far so I suspect this isn't enough unfortunately.
> >
> > FWIW I was looping the testing on the complete removal of the
> > conditions and didn't see any failures with that.
> 
> Care to clarify what's the failing test now?
> 
> Is the issue still the second port not always coming up after boot or
> something else?

Yes, data from the ttyS1 getty is not coming through from kernel and qemu to
the test framework looking for login prompt after qemu machine boot.
Workarounds like sending "\n\n" from the test framework through qemu to ttyS1
or "echo helloB > /dev/ttyS1" via working ttyS0 don't seem to help and wake
it up.

Cheers,

-Mikko

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#189261): 
https://lists.openembedded.org/g/openembedded-core/message/189261
Mute This Topic: https://lists.openembedded.org/mt/101849211/21656
Group Owner: openembedded-core+ow...@lists.openembedded.org
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)

2023-10-15 Thread Mikko Rapeli
Hi,

Thanks a lot Paul! Updating linux-serial that bisection shows

84a9582fd203 serial: core: Start managing serial controllers to enable runtime 
PM

caused this regression in ttyS1 functionality with qemu.

Lots of more details below.

Cheers,

-Mikko

On Fri, Oct 13, 2023 at 11:15:35AM -0400, Paul Gortmaker wrote:
> So, here is an update for those not following on IRC:
> 
> [Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 
> summary)] On 11/10/2023 (Wed 00:50) Paul Gortmaker wrote:
> 
> [snip details of getting RP's chunk of AB python working as a 1% reproducer]
> 
> > Next steps:
> > 
> > Each run takes approximately 20s, and with a 1.5% success factor, a
> > minimum of 150 clean runs is needed before one can say "OK, *maybe* this
> > kernel currently under test is OK.  That translates to over an hour to
> > get a data point you have some faith in as being "good".  [Still a heck
> > of a lot better than doing 150 *builds* though!]
> > 
> > On the other hand, just one instance of missing login on ttyS1 is
> > concretely "bad" and hence the test run can be stopped as soon as one
> > instance of that is seen.
> > 
> > This "good is really maybe perhaps good, but bad is for sure bad" type
> > of bisects that I love to hate.
> > 
> > Since the data points are "expensive", I probably won't just attempt a
> > mindless brute force bisect.  I'd probably do something more like...
> > 
> > --ensure modules (or lack of) don't impact the results, so I don't have
> >   to rebundle rootfs images.
> 
> This turned out to be true - didn't need modules.
> 
> > --maybe see if I can build with host gcc vs. using devshell/sysroot gcc
> >   and still see the issue.  Although really, kernel build time optimization
> >   won't be the bottleneck in this particular case, so not worthwhile?
> 
> I didn't bother with the above, since build time was not significant here.
> 
> > --check vanilla v6.5 (i.e. v6.5.0 -- no linux-stable content) so I know
> >   whether to dig around in v6.5.0 --> v6.5.5, or dig around in the
> >   much much larger v6.4 --> v6.5 additions.  I expect it to be the
> >   latter, i.e. digging through mainline, but who knows.
> 
> As expected, it followed v6.5.0 and linux-stable isn't to blame here.
> 
> > --assuming mainline introduced it, instead of a brute force bisect, I'd
> >   look for merges by Linus from gregKH pull requests for the tty and
> >   serial subsystems in the merge window (v6.4 ---> v6.5-rc1) and test
> >   both sides of that/those merge(s).
> 
> This also largely went according to plan. I started with the tty merge:
> 
> v6.5-rc1~74:
> 868a9fd94807 Merge tag 'tty-6.5-rc1' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
> 
> and did burrow down into the 60-odd commits within.  One complication
> was that there were interim mainline merges into the dev stream:
> 
> $ git log --oneline --merges v6.5-rc1~74^2  ^v6.5-rc1~74^1
> 2076b2a4a6b7 Merge 6.4-rc5 into tty-next
> 0e4daea31d83 Merge 6.4-rc3 into tty-next
> 
> ..which usually indicate a dependency on work from another subsystem.
> Fortunately the test pass/fail boundary didn't land on one of those.
> 
> > --if things haven't completely fallen apart by that point, I'd be left
> >   with a relatively small set of subsystem specific commits from a single
> >   pull request which then could be bisected.
> 
> We did somehow manage to get here with just 60 commits to consider.
> 
> > 
> > Sometimes in the past, when I've got down to that small set of commits,
> > the culprit almost jumps out at you when scanning over the shortlogs.
> 
> While I didn't immediately spot the commit via the shortlog, extensive
> test iterations did give us confidence in pinpointing a single commit:
> 
> 84a9582fd203 serial: core: Start managing serial controllers to enable 
> runtime PM   *** < FAIL (33/1436)
> ae62c49c0cef serial: 8250: omap: convert to modern PM ops 
>   < PASS (0/1593)
> 
> > Anyway, that is where my thoughts are at.  But of course we don't even
> > yet know if this "test" is even accurately reflecting the OEQA/AB issue!
> > So nobody should get their hopes up too early.
> 
> I think now, we can say with a solid degree of confidence that the PM
> commit gets to own the responsibility of causing the ttyS1 getty not
> flushing the login banner and prompt.  Hence why I did over 1000 runs on
> both the PASS and FAIL commit boundary.
> 
> Unfortunately, there is a lot in that commit ; also a lot in the
> discussion and the twelve iterations before acceptance that I've not
> even looked at yet:
> 
> https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118...@gmail.com/T/#t
> 
> My "next steps" at this point would be to try and attempt to break down
> that relatively large commit into multiple "sub-commits" in order to use
> the AB test harness portion to better get an idea of which change
> component is the root cause of the ttyS1 

Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)

2023-10-15 Thread Richard Purdie
On Sun, 2023-10-15 at 17:31 +0200, Greg Kroah-Hartman wrote:
> On Sun, Oct 15, 2023 at 01:29:48PM +0100, Richard Purdie wrote:
> > On Sat, 2023-10-14 at 12:13 +0100, Richard Purdie via
> > lists.openembedded.org wrote:
> > > On Sat, 2023-10-14 at 10:41 +0100, Richard Purdie wrote:
> > > > Brief summary:
> > > > 
> > > > We're seeing an issue on x86_64 with 6.5.X where data appears to be
> > > > left in the transmission buffer and not sent to the port on the second
> > > > serial port (ttyS1) until we trigger it with intervention.
> > > > 
> > > > Paul Gortmaker did some painful bisection over a few days down to:
> > > > 
> > > > serial: core: Start managing serial controllers to enable runtime PM
> > > > https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118...@gmail.com/T/#t
> > > 
> > > Having poked around a bit and knowing nothing about any of this, should
> > > this bit of new code added in the above commit to __uart_start() in
> > > serial_core.c:
> > > 
> > >   /*
> > >* Start TX if enabled, and kick runtime PM. If the device is not
> > >* enabled, serial_port_runtime_resume() calls start_tx()
> > > again
> > >* after enabling the device.
> > >*/
> > >   if (pm_runtime_active(_dev->dev))
> > >   port->ops->start_tx(port);
> > > 
> > > 
> > > actually be something like:
> > > 
> > > 
> > >   if (pm_runtime_active(_dev->dev) || 
> > > !pm_runtime_enabled(_dev->dev))
> > >   port->ops->start_tx(port);
> > > 
> > > 
> > > since there are uarts that don't enable runtime PM?
> > > 
> > > I notice that 16550A I'm using doesn't set UART_CAP_RPM and since we
> > > have data left in the xmit buffer (I managed to confirm that), it is as
> > > if during init, there is a race between the serial probing and the
> > > getty putting data in the buffer? If it weren't statrted, that would
> > > explain things...
> > 
> > The above change didn't work but what does appear to be making a
> > difference is making this code call start_tx unconditionally which is
> > what it did prior to the patch. That does cause a "wake" when there
> > might not be any data but the code handles that gracefully.
> > 
> > I therefore suspect this is the place the issue is, the question is
> > what the right conditions for calling start_tx are?
> > 
> > I'll keep going with testing of that as the intermittent nature does
> > make this hard to know if any change helps or not.
> 
> Can you try the patch below?  I just sent it to Linus and it's from Tony
> to resolve some other pm issues with the serial port code.

Thanks for the pointer to this. I've put it through some testing and
had one failure so far so I suspect this isn't enough unfortunately.

FWIW I was looping the testing on the complete removal of the
conditions and didn't see any failures with that.

Cheers,

Richard

> 
> thanks,
> 
> greg k-h
> 
> From 81a61051e0ce5fd7e09225c0d5985da08c7954a7 Mon Sep 17 00:00:00 2001
> From: Tony Lindgren 
> Date: Thu, 5 Oct 2023 10:56:42 +0300
> Subject: [PATCH 3/4] serial: core: Fix checks for tx runtime PM state
> 
> Maximilian reported that surface_serial_hub serdev tx does not work during
> system suspend. During system suspend, runtime PM gets disabled in
> __device_suspend_late(), and tx is unable to wake-up the serial core port
> device that we use to check if tx is safe to start. Johan summarized the
> regression noting that serdev tx no longer always works as earlier when the
> serdev device is runtime PM active.
> 
> The serdev device and the serial core controller devices are siblings of
> the serial port hardware device. The runtime PM usage count from serdev
> device does not propagate to the serial core device siblings, it only
> propagates to the parent.
> 
> In addition to the tx issue for suspend, testing for the serial core port
> device can cause an unnecessary delay in enabling tx while waiting for the
> serial core port device to wake-up. The serial core port device wake-up is
> only needed to flush pending tx when the serial port hardware device was
> in runtime PM suspended state.
> 
> To fix the regression, we need to check the runtime PM state of the parent
> serial port hardware device for tx instead of the serial core port device.
> 
> As the serial port device drivers may or may not implement runtime PM, we
> need to also add a check for pm_runtime_enabled().
> 
> Reported-by: Maximilian Luz 
> Cc: stable 
> Fixes: 84a9582fd203 ("serial: core: Start managing serial controllers to 
> enable runtime PM")
> Signed-off-by: Tony Lindgren 
> Tested-by: Maximilian Luz 
> Reviewed-by: Andy Shevchenko 
> Link: https://lore.kernel.org/r/20231005075644.25936-1-t...@atomide.com
> Signed-off-by: Greg Kroah-Hartman 
> ---
>  drivers/tty/serial/serial_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/tty/serial/serial_core.c 
> b/drivers/tty/serial/serial_core.c
> index ca26a8aef2cb..d5ba6e90bd95 100644
> --- a/drivers/tty/serial/serial_core.c
> 

Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)

2023-10-15 Thread Richard Purdie
On Sat, 2023-10-14 at 12:13 +0100, Richard Purdie via
lists.openembedded.org wrote:
> On Sat, 2023-10-14 at 10:41 +0100, Richard Purdie wrote:
> > Brief summary:
> > 
> > We're seeing an issue on x86_64 with 6.5.X where data appears to be
> > left in the transmission buffer and not sent to the port on the second
> > serial port (ttyS1) until we trigger it with intervention.
> > 
> > Paul Gortmaker did some painful bisection over a few days down to:
> > 
> > serial: core: Start managing serial controllers to enable runtime PM
> > https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118...@gmail.com/T/#t
> 
> Having poked around a bit and knowing nothing about any of this, should
> this bit of new code added in the above commit to __uart_start() in
> serial_core.c:
> 
>   /*
>* Start TX if enabled, and kick runtime PM. If the device is not
>* enabled, serial_port_runtime_resume() calls start_tx()
> again
>* after enabling the device.
>*/
>   if (pm_runtime_active(_dev->dev))
>   port->ops->start_tx(port);
> 
> 
> actually be something like:
> 
> 
>   if (pm_runtime_active(_dev->dev) || 
> !pm_runtime_enabled(_dev->dev))
>   port->ops->start_tx(port);
> 
> 
> since there are uarts that don't enable runtime PM?
> 
> I notice that 16550A I'm using doesn't set UART_CAP_RPM and since we
> have data left in the xmit buffer (I managed to confirm that), it is as
> if during init, there is a race between the serial probing and the
> getty putting data in the buffer? If it weren't statrted, that would
> explain things...

The above change didn't work but what does appear to be making a
difference is making this code call start_tx unconditionally which is
what it did prior to the patch. That does cause a "wake" when there
might not be any data but the code handles that gracefully.

I therefore suspect this is the place the issue is, the question is
what the right conditions for calling start_tx are?

I'll keep going with testing of that as the intermittent nature does
make this hard to know if any change helps or not.

Cheers,

Richard

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#189143): 
https://lists.openembedded.org/g/openembedded-core/message/189143
Mute This Topic: https://lists.openembedded.org/mt/101849211/21656
Group Owner: openembedded-core+ow...@lists.openembedded.org
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)

2023-10-14 Thread Richard Purdie
On Sat, 2023-10-14 at 10:41 +0100, Richard Purdie wrote:
> Brief summary:
> 
> We're seeing an issue on x86_64 with 6.5.X where data appears to be
> left in the transmission buffer and not sent to the port on the second
> serial port (ttyS1) until we trigger it with intervention.
> 
> Paul Gortmaker did some painful bisection over a few days down to:
> 
> serial: core: Start managing serial controllers to enable runtime PM
> https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118...@gmail.com/T/#t

Having poked around a bit and knowing nothing about any of this, should
this bit of new code added in the above commit to __uart_start() in
serial_core.c:

/*
 * Start TX if enabled, and kick runtime PM. If the device is not
 * enabled, serial_port_runtime_resume() calls start_tx()
again
 * after enabling the device.
 */
if (pm_runtime_active(_dev->dev))
port->ops->start_tx(port);


actually be something like:


if (pm_runtime_active(_dev->dev) || 
!pm_runtime_enabled(_dev->dev))
port->ops->start_tx(port);


since there are uarts that don't enable runtime PM?

I notice that 16550A I'm using doesn't set UART_CAP_RPM and since we
have data left in the xmit buffer (I managed to confirm that), it is as
if during init, there is a race between the serial probing and the
getty putting data in the buffer? If it weren't statrted, that would
explain things...

Keep in mind I know nothing about pm_runtime or serial/uarts so this is
guesswork on my part.

Cheers,

Richard


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#189113): 
https://lists.openembedded.org/g/openembedded-core/message/189113
Mute This Topic: https://lists.openembedded.org/mt/101849211/21656
Group Owner: openembedded-core+ow...@lists.openembedded.org
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)

2023-10-14 Thread Richard Purdie
On Mon, 2023-10-09 at 13:05 +0300, Mikko Rapeli wrote:
> Yocto Linux distro maintainer Richard Purdie is seeing a regression or 
> behavior
> change after updating kernel from 6.4 to 6.5. Yocto runs a lot of automated 
> tests with qemu
> where a python test framework configures and spawns qemu (version 8.1) with 
> two
> serial ports and boots a newly built kernel and rootfs there. The python test 
> framework
> tries to detect getty login prompt from ttyS1 while boot time logging
> from qemu, kernel, userspace etc happens on serial console ttyS0. This has 
> worked
> reliably for years and across multiple kernel versions. Now with 6.5 kernel 
> there are
> suddenly frequent failures where ttyS1 doesn't show the login prompt in the 
> python code
> which listens on the qemu socket from target system ttyS1. The target system 
> ttyS0 is working
> and boot time logs and getty prompt are visible there. The ttyS1 logs from 
> test framework show
> that only two characters CR and LF make it through. This same happens on 
> x86_64 host
> running qemu machine for x86_64 and arm64 targets.
> 
> The kernel boot logs from target system with 6.5.5 don't show any errors and 
> getty
> processes are started correctly and they don't see any errors either. Data 
> from
> target system userspace is being written, but seems to sometimes hang inside 
> kernel
> or qemu serial port buffers, and only on the ttyS1 while ttyS0 works reliably 
> all
> the time.
> 
> Example target logs with 6.5.5 via ttyS0:
> 
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
> or https://pastebin.com/raw/jRRa2CwW
> 
> ttyS1 logs from the same run:
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
> 
> Kernel config:
> https://pastebin.com/raw/Lv9Hfeuh
> 
> Do you have any ideas or hints how to debug or possibly fix this?
> 
> tty layer has seen quite a few major changes between 6.4.14 and 6.5.5 and 
> we're not able
> to pinpoint or bisect the issue, since it so annoyingly rare.
> 
> Some more details from Richard below.

Brief summary:

We're seeing an issue on x86_64 with 6.5.X where data appears to be
left in the transmission buffer and not sent to the port on the second
serial port (ttyS1) until we trigger it with intervention.

Paul Gortmaker did some painful bisection over a few days down to:

serial: core: Start managing serial controllers to enable runtime PM
https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118...@gmail.com/T/#t

More details:

We run very simple plain images under qemu with two serial ports
configured. These are x86_64 images with two 16550A ports with the
standard x86 port addresses which appear as ttyS0 and ttyS1, nothing
special. We run a console and getty on ttyS0 and a getty on ttyS1.
After boot, we wait for a getty to appear on ttyS1, login to it and run
commands.

With 6.5.5 (and latest 6.5.X head as of yesterday), we see an issue say
1% of the time where the getty never appears on ttyS1 even after our
timeout of 1000s.

When this happens we've added code to login to the ttyS0 getty and run
debug commands. We've been able to confirm the getty is running and the
init system doesn't matter (happens with sysvinit and systemd). The
most interesting debug I've seen is this:

root@qemux86-64:~# cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:03F8 irq:4 tx:418 rx:43 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:02F8 irq:3 tx:249 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:03E8 irq:4
3: uart:unknown port:02E8 irq:3
root@qemux86-64:~# echo helloA > /dev/ttyS1
root@qemux86-64:~# echo helloB > /dev/ttyS0
helloB
root@qemux86-64:~# cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:03F8 irq:4 tx:803 rx:121 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:02F8 irq:3 tx:281 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:03E8 irq:4
3: uart:unknown port:02E8 irq:3

This is being run after the getty didn't appear for 60s on ttyS1 so
we've logged into ttyS0 and run these commands. We've seen that if it
doesn't appear after 60s, it won't appear after 1000s either.

The tx:249 is interesting as it should be tx:273, 273 being the number
of bytes our successful serial getty prompt has. Once we echo something
to the port (8 bytes), tx: jumps to 281, so it suddenly found our
missing login prompt. This is confirmed with the data appearing on the
port after the echo.

I did try disabling the autosuspend code in the commit above but it
made no difference.

Reproducing this is a pain. I thought I had a local setup doing it
fairly frequently, then spent an hour this morning without it happening
once.

I'm hoping someone might recognise what is going on here? I'm happy to
try and add debug or find further information if that would help narrow
it down.

Cheers,

Richard


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply