[Bug 2056461] Re: autopkgtest-virt-qemu on noble images sometimes hangs doing copydown

Bug Watch Updater Fri, 14 Jun 2024 08:35:56 -0700

Launchpad has imported 21 comments from the remote bug at
https://bugzilla.kernel.org/show_bug.cgi?id=218916.


If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2024-05-31T08:56:20+00:00 linux wrote:

Reporting here to track a regression apparently introduced when 9p
switched to netfslib in commit 80105ed2fd2715fb09a8fdb0655a8bdc86c120db
"9p: Use netfslib read/write_iter" that is being reported on mailing
list and in distros:

"Re: [PATCH v5 40/40] 9p: Use netfslib read/write_iter"
https://lore.kernel.org/all/Zj0ErxVBE3DYT2Ea@gpd/

"linux: regression in the 9p protocol in 6.8 breaks autopkgtest qemu jobs 
(affecting debci)"
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1072004

"autopkgtest-virt-qemu on noble images sometimes hangs doing copydown"
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461

I've been digging into the bug and collecting data 9p developers may
better understand. Host is Debian 12 Bookworm amd64 building the example
Debian package "hello" and running autopkgtest in a QEMU virtual machine
with Debian unstable (sid) and local build of latest mainline kernel
(currently 4a4be1ad3a6). The bug was first reported against v6.8.

I was hoping to create a 'simple' reproducer without the complexity of
autopkgtest but first need to understand what is going wrong.

After 31 iterations I believe I may finally have found a clue. It
appears (in the guest OS of course) v9fs_file_read_iter() doesn't
consume the complete content of the host file being shared via 9p.

In the reproducer the host autopkgtest Python scripts control qemu-
system-x86 and send  commands to the guest OS for it to execute. These
include dynamically writing and then executing a python script named
"eofcat" [0] that is used as a "funnel" for stdin [1].

"eofcat" spins on a non-blocking read of its fd 0 until a specific file
exists or the number of bytes read from fd 0 is >= to a limit value read
from the file name passed in as argv[1].

By modifying the guest's mount -t 9p command [2] options to include
"debug=0x08" and using "| ts %.T |" in logging pipelines to capture
microsecond accurate timestamps I was able to identify the guest kernel
log entries when autopkgtest 'hangs' (spins).

What I see is the host file "stdin" being redirected to "eofcat" fd 0
and guest kernel shows a series of v9fs_file_read_iter(count=1000000,
...) calls with "offset" increasing each time: 0, 1000000, 2000000,
3000000, 4000000, 5000000, 5416960. At that point it starts spinning -
repeatedly retrying a v9fs_file_read_iter(count=1000000, offset=5416960)
and a single read of "stdin_eof" (to get the number of bytes to expect -
the "limit" variable) - and that has 8253440 which is also the size of
"stdin" file on the host.

In the guest a stat or ls or tail of the shared "stdin" file will break
the deadlock and operations continue. No 9p cache options are in use.

In following comments I'll attach run logs of the host autopkgtest and
the guest kernel dmesg with microsecond precision timestamps with mount
option debug=0x0f ( P9_DEBUG_VFS | P9_DEBUG_9P | DEBUG_CURRENT |
P9_DEBUG_ERROR ) with the exact timestamp where the host hangs, which
can then be found in the guest kernel log.

[0] https://sources.debian.org/src/autopkgtest/5.35/virt/autopkgtest-
virt-qemu/?hl=320#L314

[1] https://sources.debian.org/src/autopkgtest/5.35/virt/autopkgtest-
virt-qemu/?hl=492#L487

[2] https://sources.debian.org/src/autopkgtest/5.35/virt/autopkgtest-
virt-qemu/#L297

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/16

------------------------------------------------------------------------
On 2024-05-31T09:24:18+00:00 linux wrote:

Interesting finding: with debug=0x0f the bug is not triggered - I have
to reduce the debug flags to trigger it. Currently working up from
P9_DEBUG_VFS:

With run #34 and debug=0x08 (P9_DEBUG_VFS) I captured logs (attached).
Hang occurs in autopkgtest at timestamp 10:08:37.390685 (UTC+1). In the
guest kernel log that maps to timestamp 09:08:37.408129.

The first read on fd 0 is:

09:08:37.481590 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 0

followed by subsequent reads:

09:08:37.517808 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 1000000
...
09:08:37.565864 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 2000000
...
09:08:37.609522 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 3000000
...
09:08:37.653503 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 4000000
...
09:08:37.693647 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 5000000
...
09:08:37.716325 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 5275648

At which point it starts 'spinning' waiting for more data.

On the host we see:

~/tmp/9pfs$ ls -latr autopkgtest-qemu.ngbyoymu/shared/job.mon5m14s/
total 8072
-rw-r--r-- 1 root root       0 2024-05-31 10:08 stdout
-rw-r--r-- 1 root root       0 2024-05-31 10:08 stderr
drwxrwxrwt 3 root root    4096 2024-05-31 10:08 ..
-rwxr-xr-x 1 root root 8253440 2024-05-31 10:08 stdin
-rw-r--r-- 1 root root       7 2024-05-31 10:08 stdin_eof
drwxr-xr-x 2 root root    4096 2024-05-31 10:08 .

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/17

------------------------------------------------------------------------
On 2024-05-31T09:25:31+00:00 linux wrote:

Created attachment 306383
#34 autopkgtest log

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/18

------------------------------------------------------------------------
On 2024-05-31T09:26:13+00:00 linux wrote:

Created attachment 306384
#34 kernel log, debug=0x08

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/19

------------------------------------------------------------------------
On 2024-05-31T10:08:07+00:00 linux wrote:

Run #36 with debug=0x0c ( P9_DEBUG_VFS | P9_DEBUG_9P). Hang occurs in
autopkgtest at timestamp 10:56:44.002011 (UTC+1). In the guest kernel
log that maps to timestamp 09:56:44.666943 where the first
v9fs_file_read_iter(count=1000000, offset=0) occurs.

What is interesting is the reads fetch slightly more data before
spinning at:

09:56:56.732407  v9fs_file_read_iter (885): fid 6 count 1000000 offset
7331840

Do we have a timing/race condition here?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/20

------------------------------------------------------------------------
On 2024-05-31T10:08:42+00:00 linux wrote:

Created attachment 306386
#36 autopkgtest log

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/21

------------------------------------------------------------------------
On 2024-05-31T10:09:24+00:00 linux wrote:

Created attachment 306387
#36 kernel log, debug=0x0c

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/22

------------------------------------------------------------------------
On 2024-05-31T10:26:37+00:00 linux wrote:

Considering the (changed) source-code of
fs/9p/vfs_file.c::v9fs_file_read_iter() in commit
80105ed2fd2715fb09a8fdb0655a8bdc86c120db.

Prior to the commit there was a code path specifically for O_NONBLOCK
(as well as for p9L_DIRECT). Now there is only a code-path for
P9L_DIRECT.

kernel.36.log shows that within this function the p9L_DIRECT path is
taken since there is no debug message from the final debug message:

       if (fid->mode & P9L_DIRECT)
               return netfs_unbuffered_read_iter(iocb, to);

       p9_debug(P9_DEBUG_VFS, "(cached)\n");
       return netfs_file_read_iter(iocb, to);
}

I'm not familiar enough with the netfs layer to understand if O_NONBLOCK
is being handled (correctly) there - if special handling is indeed
required?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/23

------------------------------------------------------------------------
On 2024-06-03T16:53:10+00:00 linux wrote:

Update: 6.10-rc2 still fails the same way.

I've been writing a (minimal!) reproducer script that uses the same
mechanisms as autopkgtest to copy files across but without any of the
autopkgtest infrastructure - just host, qemu guest, and the shared 9P
directory. Even with 2500 iterations of the tar-based copy of the
"hello" package source and build product so far it has failed to
reproduce. I intend to add lots of unique individual file copies since
autopkgtest does a lot fo that prior to hitting the tar-based transfer
hang.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/24

------------------------------------------------------------------------
On 2024-06-05T14:18:13+00:00 dhowells wrote:

Can you try turning on some netfs tracepoints?

echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_read/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_write/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_write_iter/enable

and then capturing the result?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/25

------------------------------------------------------------------------
On 2024-06-07T09:12:39+00:00 linux wrote:

Created attachment 306436
Contains run #120 autopkgtest and kernel log files

v6.10-rc2 netfs debug logging as requested in run #120. Attachment
contains the autopkgtest and virtual machine kernel log files,
timestamped to aid synchronisation between them (autopkgtest is UTC+1
though)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/26

------------------------------------------------------------------------
On 2024-06-07T10:09:48+00:00 linux wrote:

Created attachment 306437
Test case framework

Although I've not get managed to reproduce the bug with the "minimal"
(ha!) reproducer script and tests it is in a state now that others
should be able to easily execute it, and add more tests, and otherwise
try to replicate the bug.

Once extracted read ./bug218916-reproducer.bash or get it to explain how
to use it with:

./bug218917-reproducer.bash

So far it has not reproduced the bug, but it is a framweork and one can
easily add additional test scripts (that are source-d) in ./bug218916.d/

Currently it has test_autopkgtest that replicates the sequence of
copy{down,up} operations autopkgtest does but without the autopkgtest
infrastructure, and test_tar that does the single copydown of a
directory (which causes the transfer to be piped via "tar").

It doesn't execute quite as fast native autopkgtest because it has to
use a "read -t 0.5" timeout for extracing replies from the controlling
ssh bash coproc. That was implemented to avoid repeatedly suffering the
delay of creating an ssh session for every transfer operation.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/27

------------------------------------------------------------------------
On 2024-06-07T10:11:23+00:00 linux wrote:

Edit: disastrous typos in my last comment! the command name should, of
course, be:

./bug218916-reproducer.bash

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/28

------------------------------------------------------------------------
On 2024-06-07T13:56:46+00:00 linux wrote:

Created attachment 306439
Contains run #121 autopkgtest,  kernel, trace log files

Realised I missed the trace log from run #120! This tar archive for run
#121 contains it

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/29

------------------------------------------------------------------------
On 2024-06-07T14:29:17+00:00 linux wrote:

For run #121 the reading of "stdin" by "eofcat" is at dmesg timestamp
13:46:26 with v9fs_read_iter (760) at offset 0 and is at 35.861430
netfs_read s=0 in the trace.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/30

------------------------------------------------------------------------
On 2024-06-08T09:21:41+00:00 linux wrote:

Testing different scenarios in autopkgtest scripts, and especially
"eofcat" Python that reads from the 9P mount in the guest, I
experimented with setting a larger time.sleep(0.2) instead of 0.05, and
separately, removing the O_NONBLOCK flag from its stdin FD. Neither
showed any difference - the hang occurs.

I'm about to test with enabling internal.h's CONFIG_NETFS_DEBUG=y to
capture more clues from the actual IO dispatch functions.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/31

------------------------------------------------------------------------
On 2024-06-08T10:52:48+00:00 linux wrote:

Created attachment 306443
Contains run #132 netfs_debug autopkgtest,  kernel, trace log files

dmesg.132.log now contains the additional netfs_debug messages. the read
of "stdin" file by "eofcat" starts at 10:23:43.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/32

------------------------------------------------------------------------
On 2024-06-08T16:41:48+00:00 linux wrote:

Created attachment 306444
Fix option 1

At last I've found the problem (which is, as I thought a week ago, due
to 'stat'), but quite subtle.

With extra debugging enabled on the mount (debug=0x0f rather than
debug=0x08) it works, so with debug=0x08, after adding additional debug
messages in v9fs_stat2inode_dotl(), I can see that in the failing case
there are 2 calls to that function immediately before the
v9fs_file_read_iter() starts. In those calls the netfs->remote_i_size is
updated from stat->st_size  and I see st_size has different (increasing)
values but smaller than the actual file size.

So, effectively, all the data hasn't arrived before the read and once
the read starts the size is no longer updated.

When debugging the additional cumulative delays of writing extreme
amounts of debug info to the kernel log gives enough time for the data
to arrive so that when v9fs_stat2inode_dotl() is called the final size
is correct.

This possible fix updates stat if a read returns 0 bytes.

Tested with 6.8.12 and 6.10-rc2 repeatedly (10+ iterations so far) and
no failures with debug logging completely disabled.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/33

------------------------------------------------------------------------
On 2024-06-08T16:59:14+00:00 linux wrote:

Extract from the fix log with additional debugging enabled:

9pnet: -- v9fs_file_read_iter (889): fid 6 count 1000000 offset 5580800
[eofcat] ==> netfs_unbuffered_read_iter_locked()
[eofcat] ==> netfs_begin_read(R=111 552800-646a3f)
[eofcat] submit 552800 + 0 >= 552800
9pnet: -- v9fs_file_read_iter (889):  iocb->ki_flags 0 
netfs_unbuffered_read_iter()=0
9pnet: -- v9fs_file_read_iter (889): no data; refreshing inode
9pnet: -- v9fs_stat2inode_dotl (889): P9_STATS_BASIC=2047 
V9FS_STAT2INODE_KEEP_ISIZE=0 P9_STATS_SIZE=512
9pnet: -- v9fs_stat2inode_dotl (889): P9_STATS_BASIC: remote_i_size=5580800 
stat->st_size=8253440

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/34

------------------------------------------------------------------------
On 2024-06-08T19:55:25+00:00 linux wrote:

Created attachment 306445
Fix v2 (deals with .U as well as .L )

Revised patch that should handle .U as well as .L protocol.

10 iterations each of autopkgtests with v6.8.12 and v6.10-rc2 all
successful.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/35

------------------------------------------------------------------------
On 2024-06-14T00:55:06+00:00 kip wrote:

Thanks TJ. I tried applying your patch and can confirm I am still seeing
the same problem with autopkgtest on Ubuntu Noble:


...
qemu-system-x86_64: terminating on signal 15 from pid 5349 (/usr/bin/python3)
autopkgtest [17:52:50]: ERROR: testbed failure: sent `auxverb_debug_fail', got 
`timeout', expected `ok...'

Reply at:
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461/comments/38


** Changed in: linux
       Status: Unknown => Confirmed

** Changed in: linux
   Importance: Unknown => Medium

** Bug watch added: Debian Bug tracker #1072004
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1072004

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2056461

Title:
  autopkgtest-virt-qemu on noble images sometimes hangs doing copydown

To manage notifications about this bug go to:
https://bugs.launchpad.net/linux/+bug/2056461/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 2056461] Re: autopkgtest-virt-qemu on noble images sometimes hangs doing copydown

Reply via email to