On Thu, Apr 03, 2025 at 06:05:57PM -0600, Uday Shankar wrote:
> There are currently two ways in which ublk server exit is detected by
> ublk_drv:
> 
> 1. uring_cmd cancellation. If there are any outstanding uring_cmds which
>    have not been completed to the ublk server when it exits, io_uring
>    calls the uring_cmd callback with a special cancellation flag as the
>    issuing task is exiting.
> 2. I/O timeout. This is needed in addition to the above to handle the
>    "saturated queue" case, when all I/Os for a given queue are in the
>    ublk server, and therefore there are no outstanding uring_cmds to
>    cancel when the ublk server exits.
> 
> There are a couple of issues with this approach:
> 
> - It is complex and inelegant to have two methods to detect the same
>   condition
> - The second method detects ublk server exit only after a long delay
>   (~30s, the default timeout assigned by the block layer). This delays
>   the nosrv behavior from kicking in and potential subsequent recovery
>   of the device.
> 
> The second issue is brought to light with the new test_generic_04. It
> fails before this fix:
> 
> selftests: ublk: test_generic_04.sh
> dev id is 0
> dd: error writing '/dev/ublkb0': Input/output error
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 30.0611 s, 0.0 kB/s
> DEAD
> dd took 31 seconds to exit (>= 5s tolerance)!
> generic_04 : [FAIL]
> 
> Fix this by instead detecting and handling ublk server exit in the
> character file release callback. This has several advantages:
> 
> - This one place can handle both saturated and unsaturated queues. Thus,
>   it replaces both preexisting methods of detecting ublk server exit.
> - It runs quickly on ublk server exit - there is no 30s delay.
> - It starts the process of removing task references in ublk_drv. This is
>   needed if we want to relax restrictions in the driver like letting
>   only one thread serve each queue
> 
> There is also the disadvantage that the character file release callback
> can also be triggered by intentional close of the file, which is a
> significant behavior change. Preexisting ublk servers (libublksrv) are
> dependent on the ability to open/close the file multiple times. To
> address this, only transition to a nosrv state if the file is released
> while the ublk device is live. This allows for programs to open/close
> the file multiple times during setup. It is still a behavior change if a
> ublk server decides to close/reopen the file while the device is LIVE
> (i.e. while it is responsible for serving I/O), but that would be highly
> unusual. This behavior is in line with what is done by FUSE, which is
> very similar to ublk in that a userspace daemon is providing services
> traditionally provided by the kernel.
> 
> With this change in, the new test (and all other selftests, and all
> ublksrv tests) pass:
> 
> selftests: ublk: test_generic_04.sh
> dev id is 0
> dd: error writing '/dev/ublkb0': Input/output error
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 0.0376731 s, 0.0 kB/s
> DEAD
> generic_04 : [PASS]
> 
> Signed-off-by: Uday Shankar <ushan...@purestorage.com>
> ---
> Changes in v3:
> - Quiesce queue earlier to avoid concurrent cancellation and "normal"
>   completion of io_uring cmds (Ming Lei)
> - Fix del_gendisk hang, found by test_stress_02
> - Remove unnecessary parameters in fault_inject target (Ming Lei)
> - Fix delay implementation to have separate per-I/O delay instead of
>   blocking the whole thread (Ming Lei)
> - Add delay_us to docs
> - Link to v2: 
> https://lore.kernel.org/r/20250402-ublk_timeout-v2-1-249bc5523...@purestorage.com
> 
> Changes in v2:
> - Leave null ublk selftests target untouched, instead create new
>   fault_inject target for injecting per-I/O delay (Ming Lei)
> - Allow multiple open/close of ublk character device with some
>   restrictions
> - Drop patches which made it in separately at 
> https://lore.kernel.org/r/20250401-ublk_selftests-v1-1-98129c9bc...@purestorage.com
> - Consolidate more nosrv logic in ublk character device release, and
>   associated code cleanup
> - Link to v1: 
> https://lore.kernel.org/r/20250325-ublk_timeout-v1-0-262f0121a...@purestorage.com
> ---
>  drivers/block/ublk_drv.c                        | 228 
> +++++++++---------------
>  tools/testing/selftests/ublk/Makefile           |   4 +-
>  tools/testing/selftests/ublk/fault_inject.c     |  72 ++++++++
>  tools/testing/selftests/ublk/kublk.c            |   6 +-
>  tools/testing/selftests/ublk/kublk.h            |   4 +
>  tools/testing/selftests/ublk/test_generic_04.sh |  43 +++++
>  6 files changed, 215 insertions(+), 142 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 
> 2fd05c1bd30b03343cb6f357f8c08dd92ff47af9..73baa9d22ccafb00723defa755a0b3aab7238934
>  100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -162,7 +162,6 @@ struct ublk_queue {
>  
>       bool force_abort;
>       bool timeout;
> -     bool canceling;
>       bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */
>       unsigned short nr_io_ready;     /* how many ios setup */
>       spinlock_t              cancel_lock;
> @@ -199,8 +198,6 @@ struct ublk_device {
>       struct completion       completion;
>       unsigned int            nr_queues_ready;
>       unsigned int            nr_privileged_daemon;
> -
> -     struct work_struct      nosrv_work;
>  };
>  
>  /* header of ublk_params */
> @@ -209,8 +206,9 @@ struct ublk_params_header {
>       __u32   types;
>  };
>  
> -static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue 
> *ubq);
> -
> +static void ublk_stop_dev_unlocked(struct ublk_device *ub);
> +static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> +static void __ublk_quiesce_dev(struct ublk_device *ub);
>  static inline struct request *__ublk_check_and_get_req(struct ublk_device 
> *ub,
>               struct ublk_queue *ubq, int tag, size_t offset);
>  static inline unsigned int ublk_req_build_flags(struct request *req);
> @@ -1314,8 +1312,6 @@ static void ublk_queue_cmd_list(struct ublk_queue *ubq, 
> struct rq_list *l)
>  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
>  {
>       struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> -     unsigned int nr_inflight = 0;
> -     int i;
>  
>       if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
>               if (!ubq->timeout) {
> @@ -1326,26 +1322,6 @@ static enum blk_eh_timer_return ublk_timeout(struct 
> request *rq)
>               return BLK_EH_DONE;
>       }
>  
> -     if (!ubq_daemon_is_dying(ubq))
> -             return BLK_EH_RESET_TIMER;
> -
> -     for (i = 0; i < ubq->q_depth; i++) {
> -             struct ublk_io *io = &ubq->ios[i];
> -
> -             if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
> -                     nr_inflight++;
> -     }
> -
> -     /* cancelable uring_cmd can't help us if all commands are in-flight */
> -     if (nr_inflight == ubq->q_depth) {
> -             struct ublk_device *ub = ubq->dev;
> -
> -             if (ublk_abort_requests(ub, ubq)) {
> -                     schedule_work(&ub->nosrv_work);
> -             }
> -             return BLK_EH_DONE;
> -     }
> -
>       return BLK_EH_RESET_TIMER;
>  }
>  
> @@ -1356,19 +1332,16 @@ static blk_status_t ublk_prep_req(struct ublk_queue 
> *ubq, struct request *rq)
>       if (unlikely(ubq->fail_io))
>               return BLK_STS_TARGET;
>  
> -     /* With recovery feature enabled, force_abort is set in
> -      * ublk_stop_dev() before calling del_gendisk(). We have to
> -      * abort all requeued and new rqs here to let del_gendisk()
> -      * move on. Besides, we cannot not call io_uring_cmd_complete_in_task()
> -      * to avoid UAF on io_uring ctx.
> +     /*
> +      * force_abort is set in ublk_stop_dev() before calling
> +      * del_gendisk(). We have to abort all requeued and new rqs here
> +      * to let del_gendisk() move on. Besides, we cannot not call
> +      * io_uring_cmd_complete_in_task() to avoid UAF on io_uring ctx.
>        *
>        * Note: force_abort is guaranteed to be seen because it is set
>        * before request queue is unqiuesced.
>        */
> -     if (ublk_nosrv_should_queue_io(ubq) && unlikely(ubq->force_abort))
> -             return BLK_STS_IOERR;
> -
> -     if (unlikely(ubq->canceling))
> +     if (unlikely(ubq->force_abort))
>               return BLK_STS_IOERR;
>  
>       /* fill iod to slot in io cmd buffer */
> @@ -1391,16 +1364,6 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx 
> *hctx,
>       if (res != BLK_STS_OK)
>               return res;
>  
> -     /*
> -      * ->canceling has to be handled after ->force_abort and ->fail_io
> -      * is dealt with, otherwise this request may not be failed in case
> -      * of recovery, and cause hang when deleting disk
> -      */
> -     if (unlikely(ubq->canceling)) {
> -             __ublk_abort_rq(ubq, rq);
> -             return BLK_STS_OK;
> -     }
> -

If `ubq->canceling` is removed, how to handle recovery? The request has to
be requeued in case of ublk_nosrv_dev_should_queue_io().

And that should be the reason why 'make test T=generic/004' hangs forever
after applying this patch.

>       ublk_queue_cmd(ubq, rq);
>       return BLK_STS_OK;
>  }
> @@ -1461,8 +1424,71 @@ static int ublk_ch_open(struct inode *inode, struct 
> file *filp)
>  static int ublk_ch_release(struct inode *inode, struct file *filp)
>  {
>       struct ublk_device *ub = filp->private_data;
> +     int i;
> +
> +     mutex_lock(&ub->mutex);
> +     /*
> +      * If the device is not live, we will not transition to a nosrv
> +      * state. This protects against:
> +      * - accidental poking of the ublk character device
> +      * - some ublk servers which may open/close the ublk character
> +      *   device during startup
> +      */
> +     if (ub->dev_info.state != UBLK_S_DEV_LIVE)
> +             goto out;
> +
> +     /*
> +      * Since we are releasing the ublk character file descriptor, we
> +      * know that there cannot be any concurrent file-related
> +      * activity (e.g. uring_cmds or reads/writes). However, I/O
> +      * might still be getting dispatched. Quiesce that too so that
> +      * we don't need to worry about anything concurrent.
> +      *
> +      * We may have already quiesced the queue if we canceled any
> +      * uring_cmds, so only quiesce if necessary (quiesce is not
> +      * idempotent, it has an internal counter which we need to
> +      * manage carefully).
> +      */
> +     if (!blk_queue_quiesced(ub->ub_disk->queue))
> +             blk_mq_quiesce_queue(ub->ub_disk->queue);
> +
> +     /*
> +      * Handle any requests outstanding to the ublk server
> +      */
> +     for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> +             ublk_abort_queue(ub, ublk_get_queue(ub, i));
>  
> +     /*
> +      * Transition the device to the nosrv state. What exactly this
> +      * means depends on the recovery flags
> +      */
> +     if (ublk_nosrv_should_stop_dev(ub)) {
> +             /*
> +              * Allow any pending/future I/O to pass through quickly
> +              * with an error. This is needed because del_gendisk
> +              * waits for all pending I/O to complete
> +              */
> +             for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> +                     ublk_get_queue(ub, i)->force_abort = true;
> +             blk_mq_unquiesce_queue(ub->ub_disk->queue);
> +
> +             ublk_stop_dev_unlocked(ub);
> +     } else {
> +             if (ublk_nosrv_dev_should_queue_io(ub)) {
> +                     __ublk_quiesce_dev(ub);

Here only inflight IOs are drained, new IO still comes after queue is
unquiesced, then uring_cmd UAF is triggered.

...

> --- /dev/null
> +++ b/tools/testing/selftests/ublk/test_generic_04.sh
> @@ -0,0 +1,43 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +
> +. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
> +
> +TID="generic_04"
> +ERR_CODE=0
> +
> +_prep_test "fault_inject" "fast cleanup when all I/Os of one hctx are in 
> server"
> +
> +# configure ublk server to sleep 2s before completing each I/O
> +dev_id=$(_add_ublk_dev -t fault_inject -q 2 -d 1 --delay_us 2000000)
> +_check_add_dev $TID $?
> +
> +echo "dev id is ${dev_id}"
> +
> +STARTTIME=${SECONDS}
> +
> +dd if=/dev/urandom of=/dev/ublkb${dev_id} oflag=direct bs=4k count=1 &

stdout/stderr need to be discarded.

Also I'd suggest to make the selftest part as one standalone patch.


thanks,
Ming


Reply via email to