Performing verification for Focal.

I started a fresh Focal VM, and installed qemu-utils. I then ran
reproducer.sh from the testcase section.

The kernel is 5.4.0-124-generic from -updates.

Within 30 seconds of starting the reproducer, the testcase script hung,
and the following was in dmesg:

Aug 23 04:49:26 focal-nbd kernel: block nbd15: NBD_DISCONNECT
Aug 23 04:49:26 focal-nbd kernel: block nbd15: Send disconnect failed -32
Aug 23 04:49:26 focal-nbd sudo[1804]: pam_unix(sudo:session): session closed 
for user root
Aug 23 04:49:26 focal-nbd sudo[1807]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; 
USER=root ; COMMAND=/usr/bin/qemu-nbd --connect=/dev/nbd15 --cache=writeback 
--format=qcow2 foo.img
Aug 23 04:49:26 focal-nbd sudo[1807]: pam_unix(sudo:session): session opened 
for user root by ubuntu(uid=0)
Aug 23 04:49:26 focal-nbd kernel: ldm_validate_partition_table(): Disk read 
failed.
Aug 23 04:49:26 focal-nbd kernel: Dev nbd15: unable to read RDB block 0
Aug 23 04:49:26 focal-nbd kernel:  nbd15: unable to read partition table
Aug 23 04:49:56 focal-nbd kernel: block nbd15: Possible stuck request 
000000004d5cc344: control (read@523988992,36864B). Runtime 30 seconds
Aug 23 04:50:26 focal-nbd systemd-udevd[419]: nbd15: Worker [1198] processing 
SEQNUM=5582 is taking a long time
Aug 23 04:50:27 focal-nbd kernel: block nbd15: Possible stuck request 
000000004d5cc344: control (read@523988992,36864B). Runtime 60 seconds
Aug 23 04:50:58 focal-nbd kernel: block nbd15: Possible stuck request 
000000004d5cc344: control (read@523988992,36864B). Runtime 90 seconds
Aug 23 04:51:29 focal-nbd kernel: block nbd15: Possible stuck request 
000000004d5cc344: control (read@523988992,36864B). Runtime 120 seconds
Aug 23 04:51:59 focal-nbd kernel: block nbd15: Possible stuck request 
000000004d5cc344: control (read@523988992,36864B). Runtime 150 seconds
Aug 23 04:52:26 focal-nbd systemd-udevd[419]: nbd15: Worker [1198] processing 
SEQNUM=5582 killed
Aug 23 04:52:30 focal-nbd kernel: block nbd15: Possible stuck request 
000000004d5cc344: control (read@523988992,36864B). Runtime 180 seconds
Aug 23 04:53:27 focal-nbd kernel: INFO: task qemu-nbd:1815 blocked for more 
than 120 seconds.
Aug 23 04:53:27 focal-nbd kernel:       Not tainted 5.4.0-124-generic 
#140-Ubuntu
Aug 23 04:53:27 focal-nbd kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 23 04:53:27 focal-nbd kernel: qemu-nbd        D    0  1815      1 0x00000000
Aug 23 04:53:27 focal-nbd kernel: Call Trace:
Aug 23 04:53:27 focal-nbd kernel:  __schedule+0x2e3/0x740
Aug 23 04:53:27 focal-nbd kernel:  ? __kfifo_to_user_r+0xa0/0xa0
Aug 23 04:53:27 focal-nbd kernel:  schedule+0x42/0xb0
Aug 23 04:53:27 focal-nbd kernel:  blk_mq_freeze_queue_wait+0x4b/0xb0
Aug 23 04:53:27 focal-nbd kernel:  ? __wake_up_pollfree+0x40/0x40
Aug 23 04:53:27 focal-nbd kernel:  blk_mq_freeze_queue+0x1b/0x20
Aug 23 04:53:27 focal-nbd kernel:  nbd_add_socket+0x5e/0x1d0 [nbd]
Aug 23 04:53:27 focal-nbd kernel:  nbd_ioctl+0x2f7/0x410 [nbd]
Aug 23 04:53:27 focal-nbd kernel:  blkdev_ioctl+0x383/0xa30
Aug 23 04:53:27 focal-nbd kernel:  block_ioctl+0x3d/0x50
Aug 23 04:53:27 focal-nbd kernel:  do_vfs_ioctl+0x407/0x670
Aug 23 04:53:27 focal-nbd kernel:  ? putname+0x4a/0x50
Aug 23 04:53:27 focal-nbd kernel:  ksys_ioctl+0x67/0x90
Aug 23 04:53:27 focal-nbd kernel:  __x64_sys_ioctl+0x1a/0x20
Aug 23 04:53:27 focal-nbd kernel:  do_syscall_64+0x57/0x190
Aug 23 04:53:27 focal-nbd kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 23 04:53:27 focal-nbd kernel: RIP: 0033:0x7fd12b7573ab
Aug 23 04:53:27 focal-nbd kernel: Code: Bad RIP value.
Aug 23 04:53:27 focal-nbd kernel: RSP: 002b:00007fd129fa2a18 EFLAGS: 00000246 
ORIG_RAX: 0000000000000010
Aug 23 04:53:27 focal-nbd kernel: RAX: ffffffffffffffda RBX: 0000000000000001 
RCX: 00007fd12b7573ab
Aug 23 04:53:27 focal-nbd kernel: RDX: 000000000000000b RSI: 000000000000ab00 
RDI: 000000000000000d
Aug 23 04:53:27 focal-nbd kernel: RBP: 00007fd129fa2aa8 R08: 0000000000000000 
R09: 0000000000000001
Aug 23 04:53:27 focal-nbd kernel: R10: 0000000000000000 R11: 0000000000000246 
R12: 00007fd129fa2ab0
Aug 23 04:53:27 focal-nbd kernel: R13: 000000000000000d R14: 000000001f400000 
R15: 00007fd120000b80

I then rebooted, enabled -proposed and installed kernel
5.4.0-125-generic.

I left the reproducer.sh script running for a bit over an hour, and it
was still running perfectly fine when I got back to it. Requests are
still moving smoothly, and no longer getting stuck.

The 5.4.0-125-generic kernel in -proposed fixes the issue. Happy to mark
it verified.

** Tags added: verification-done-focal

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1896350

Title:
  nbd: requests can become stuck when disconnecting from server with
  qemu-nbd

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Won't Fix
Status in linux source package in Focal:
  Fix Committed
Status in linux source package in Impish:
  Won't Fix
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Kinetic:
  Fix Released

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/1896350

  [Impact]

  After 2516ab1("nbd: only clear the queue on device teardown"), present
  in 4.12-rc1 onward, the ioctl NBD_CLEAR_SOCK can no longer clear
  requests currently being processed. This change was made to fix a race
  between using the NBD_CLEAR_SOCK ioctl to clear requests, and teardown
  of the device clearing requests. This worked for the most part, as
  several years ago systemd was not set up to watch nbd devices for
  changes in their state.

  But after:

  commit f82abfcda58168d9f667e2094d438763531d3fa6
  From: Tony Asleson <tasle...@redhat.com>
  Date: Fri, 8 Feb 2019 15:47:10 -0600
  Subject: rules: watch metadata changes on nbd devices
  Link: 
https://github.com/systemd/systemd/commit/f82abfcda58168d9f667e2094d438763531d3fa6

  in systemd v242-rc1, nbd* devices were added to a udev rule to watch
  those devices for changes with the inotify subsystem. From man udev:

  > watch
  >   Watch the device node with inotify; when the node is closed after being 
  >   opened for writing, a change uevent is synthesized.
  >
  > nowatch
  >   Disable the watching of a device node with inotify.

  This changed the behaviour of device teardown, since systemd now keeps
  tabs on the device with inotify, outstanding requests cannot be
  cleared as nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER',
  and requests get stuck, never to complete, because a disconnect has
  occurred, and never to timeout, as their timers keep being reset.

  Symptoms of this issue is that the nbd subsystem gets stuck with
  messages like:

  block nbd15: NBD_DISCONNECT
  block nbd15: Send disconnect failed -32
  ...
  block nbd15: Possible stuck request 000000007fcf62ba: control 
(read@523915264,24576B). Runtime 30 seconds
  ...
  block nbd15: Possible stuck request 000000007fcf62ba: control 
(read@523915264,24576B). Runtime 150 seconds
  ...
  INFO: task qemu-nbd:1267 blocked for more than 120 seconds.
        Not tainted 5.15.0-23-generic #23-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:qemu-nbd        state:D stack:    0 pid: 1267 ppid:     1 
flags:0x00000002
  Call Trace:
   <TASK>
   __schedule+0x23d/0x590
   ? call_rcu+0xe/0x10
   schedule+0x4e/0xb0
   blk_mq_freeze_queue_wait+0x69/0xa0
   ? wait_woken+0x70/0x70
   blk_mq_freeze_queue+0x1b/0x30
   nbd_add_socket+0x76/0x1f0 [nbd]
   __nbd_ioctl+0x18b/0x340 [nbd]
   ? security_capable+0x3d/0x60
   nbd_ioctl+0x81/0xb0 [nbd]
   blkdev_ioctl+0x12e/0x270
   ? __fget_files+0x86/0xc0
   block_ioctl+0x46/0x50
   __x64_sys_ioctl+0x91/0xc0
   do_syscall_64+0x5c/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>

  Additionally, in syslog you will also see systemd-udevd get stuck:

  systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is
  taking a long time

  $ ps aux
  ...
  419    1194 root     D     0.1 systemd-udevd   -

  We can workaround the issue by adding a higher priority udev rule to
  not watch nbd* devices.

  $ cat << EOF >> /etc/udev/rules.d/97-nbd-device.rules
  # Disable inotify watching of change events for NBD devices
  ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
  EOF

  $ sudo udevadm control --reload-rules
  $ sudo udevadm trigger

  [Fix]

  The fix relies on infrastructure provided by the flag
  NBD_CMD_INFLIGHT, which was introduced in 5.16, and added to in 5.19.
  We need to backport all commits related to NBD_CMD_INFLIGHT to our
  kernels for the fix to be effective.

  For Focal, Impish and Jammy:

  commit 4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
  Author: Yu Kuai <yuku...@huawei.com>
  Date:   Thu Sep 16 17:33:44 2021 +0800
  Subject: nbd: don't handle response without a corresponding request message
  Link: 
https://github.com/torvalds/linux/commit/4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d

  commit 07175cb1baf4c51051b1fbd391097e349f9a02a9
  Author: Yu Kuai <yuku...@huawei.com>
  Date:   Thu Sep 16 17:33:45 2021 +0800
  Subject: nbd: make sure request completion won't concurrent
  Link: 
https://github.com/torvalds/linux/commit/07175cb1baf4c51051b1fbd391097e349f9a02a9

  commit 2895f1831e911ca87d4efdf43e35eb72a0c7e66e
  Author: Yu Kuai <yuku...@huawei.com>
  Date:   Sat May 21 15:37:46 2022 +0800
  Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
  Link: 
https://github.com/torvalds/linux/commit/2895f1831e911ca87d4efdf43e35eb72a0c7e66e

  commit 09dadb5985023e27d4740ebd17e6fea4640110e5
  Author: Yu Kuai <yuku...@huawei.com>
  Date:   Sat May 21 15:37:47 2022 +0800
  Subject: nbd: fix io hung while disconnecting device
  Link: 
https://github.com/torvalds/linux/commit/09dadb5985023e27d4740ebd17e6fea4640110e5

  For Focal only (dependency commits):

  commit 7b11eab041dacfeaaa6d27d9183b247a995bc16d
  Author: Keith Busch <kbu...@kernel.org>
  Date:   Fri May 29 07:51:59 2020 -0700
  Subject: blk-mq: blk-mq: provide forced completion method
  Link: 
https://github.com/torvalds/linux/commit/7b11eab041dacfeaaa6d27d9183b247a995bc16d

  commit 15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
  Author: Christoph Hellwig <h...@lst.de>
  Date:   Thu Jun 11 08:44:47 2020 +0200
  Subject: blk-mq: move failure injection out of blk_mq_complete_request
  Link: 
https://github.com/torvalds/linux/commit/15f73f5b3e5958f2d169fe13c420eeeeae07bbf2

  I want to talk about the backport of "blk-mq: move failure injection
  out of blk_mq_complete_request" for Focal, since it changes a number
  of drivers using the blk_mq_complete_request() call, and can be
  considered a large regression risk. Now, blk_should_fake_timeout()
  relies on CONFIG_FAIL_IO_TIMEOUT to be enabled, as well as
  QUEUE_FLAG_FAIL_IO being present in the blk subsystem.
  CONFIG_FAIL_IO_TIMEOUT is not enabled on Ubuntu kernels, and thus
  blk_should_fake_timeout() just returns false, and is more or less a
  nop on our kernels. Because of this, if
  (likely(!blk_should_fake_timeout(req->q))), is really just if(true),
  and I did think about simply backporting that to the nbd patch "nbd:
  don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed" but
  after talking with Jay, we decided backporting the entire patch was
  the best way to proceed if any of our users wish to use
  CONFIG_FAIL_IO_TIMEOUT on a custom kernel build with the nbd
  subsystem.

  Note: Bionic is also affected, but the backport list to the 4.15
  kernel is unreasonable, and due to systemd in Bionic not having the
  udev rule set to watch nbd* devices by default, Bionic is not directly
  affected by the issue. Users would only see the problem if they add
  nbd* to the watch list in /usr/lib/udev/rules.d/60-block.rules, or if
  they had a privileged 20.04 or later LXC container running using the
  hosts nbd kernel module. Because of this Bionic will be Won't Fix.

  [Testcase]

  The issue can be easily reproduced with:

  $ sudo apt install qemu-utils

  $ cat << EOF >> reproducer.sh
  #!/bin/bash

  sudo modprobe nbd

  while :
  do
          qemu-img create -f qcow2 foo.img 500M
          sudo qemu-nbd --disconnect /dev/nbd15 || true
          sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 
foo.img
          sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15
          sudo qemu-nbd --disconnect /dev/nbd15
  done
  EOF

  $ chmod +x reproducer.sh
  $ yes | ./reproducer.sh

  On the Ubuntu kernels, you will see the nbd subsystem hang within 30
  seconds or so, and the kernel log will be filled with stuck request
  messages and hung task timeouts after the 120 second mark.

  Test kernels are available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test

  If you install these test kernels, and re-run the reproducer, your
  system will be perfectly stable for many hours with no issues.

  I have torture tested the backport to the Focal kernel for more than 4
  hours with no issues observed.

  [Where problems can occur]

  Due to needing to backport the infrastructure for NBD_CMD_INFLIGHT,
  this change to the NBD subsystem is quite large, and changes how
  requests are processed and accounted for, depending if they are still
  outstanding or not.

  For Impish and Jammy, these risks are limited to the NBD subsystem
  itself.

  On Focal, the risk of regression is slightly larger due to the
  backport of "blk-mq: move failure injection out of
  blk_mq_complete_request", but is somewhat mitigated by
  blk_should_fake_timeout() being a NOP due to CONFIG_FAIL_IO_TIMEOUT
  being disabled on the Focal kernel.

  If a regression were to occur, users may see NBD requests fail, occur
  in the incorrect order, or cleared at incorrect times. There is no way
  to disable NBD_CMD_INFLIGHT during runtime, and users would need to
  revert to an older kernel while a fix is made.

  [Other info]

  My bug report to upstream can be found in the following mailing list thread:
  https://lkml.org/lkml/2022/4/22/61

  You may be wondering why we cannot simply just backport "nbd: fix io
  hung while disconnecting device" and resolve the issue with a 1 line
  change. I actually built test kernels for this scenario to see if it
  was possible, and they are available here:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test-single

  While it did sort of fix the issue, that is, connect and disconnect
  from nbd devices lasted longer than the 30 seconds the kernels lasted
  previously, after about 10 minutes of torture testing, I started
  experiencing race conditions of requests completing multiple times,
  leading to the following use after frees:

  Jammy 5.15 test kernel:
  https://paste.ubuntu.com/p/FSM6DrgjTy/

  Impish 5.13 test kernel:
  https://paste.ubuntu.com/p/86tNfsM7Vs/

  Focal 5.4 test kernel:
  https://paste.ubuntu.com/p/zzVzWx23sb/

  This was similar to a mailing list thread I found previously:
  https://groups.google.com/g/syzkaller-bugs/c/jhvr3Yv_QH8/m/DgGG4xYFEQAJ
  https://lore.kernel.org/all/000000000000d0df7f058f625...@google.com/T/

  Keith Busch identified this as a double completion of the same
  request, which is resolved through the NBD_CMD_INFLIGHT
  infrastructure.

  Hence, we cannot just pull in the single line change, we need
  NBD_CMD_INFLIGHT infrastructure for a complete fix.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896350/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to