Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()

Stefan Hajnoczi Mon, 13 Dec 2021 01:40:22 -0800

On Thu, Dec 09, 2021 at 07:51:02PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 09.12.2021 19:32, Stefan Hajnoczi wrote:
> > On Thu, Dec 09, 2021 at 04:45:13PM +0100, Hanna Reitz wrote:
> > > On 09.12.21 15:23, Stefan Hajnoczi wrote:
> > > > The BlockBackend root child can change during bdrv_drained_begin() when
> > > > aio_poll() is invoked. In fact the BlockDriverState can reach refcnt 0
> > > > and blk_drain() is left with a dangling BDS pointer.
> > > > 
> > > > One example is scsi_device_purge_requests(), which calls blk_drain() to
> > > > wait for in-flight requests to cancel. If the backup blockjob is active,
> > > > then the BlockBackend root child is a temporary filter BDS owned by the
> > > > blockjob. The blockjob can complete during bdrv_drained_begin() and the
> > > > last reference to the BDS is released when the temporary filter node is
> > > > removed. This results in a use-after-free when blk_drain() calls
> > > > bdrv_drained_end(bs) on the dangling pointer.
> > > > 
> > > > The general problem is that a function and its callers must not assume
> > > > that bs is still valid across aio_poll(). Explicitly hold a reference to
> > > > bs in blk_drain() to avoid the dangling pointer.
> > > > 
> > > > Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
> > > > ---
> > > > I found that BDS nodes are sometimes deleted with bs->quiesce_counter >
> > > > 0 (at least when running "make check"), so it is currently not possible
> > > > to put the bdrv_ref/unref() calls in bdrv_do_drained_begin() and
> > > > bdrv_do_drained_end() because they will be unbalanced. That would have
> > > > been a more general solution than only fixing blk_drain().
> > > 
> > > Deleting nodes that have a `quiesce_counter > 0` doesn’t seem wrong to me 
> > > –
> > > deleting only depends on strong references, and so I’d expect that 
> > > anything
> > > that increases the quiesce_counter also has a strong reference to the node
> > > if the former wants the latter to stay around.
> > > 
> > > I suppose we could make it so that both the quiesce_counter and the refcnt
> > > need to be 0 before a BDS is deleted (and then deletion can happen both 
> > > from
> > > bdrv_unref() and drained_end), but I don’t know whether that’s really
> > > necessary.  I’d rather leave it to the caller to ensure they keep a strong
> > > reference throughout the drain.
> > > 
> > > The question is, how often do we have a situation like this, where we 
> > > take a
> > > weak reference for draining, because we assume there’s a strong reference
> > > backing us up (namely the one through blk->root), but that strong 
> > > reference
> > > then can go away due to draining...
> > > 
> > > > Any suggestions for a better fix?
> > > 
> > > The fix makes sense to me.
> > 
> > Okay. My concern was that this is a whole class of bugs and my patch
> > only fixes blk_drain(). I have audited the code some more in the
> > meantime.
> > 
> > bdrv_insert_node() may be unsafe in the case where bs is a temporary
> > filter node that is unref'd during bdrv_drained_begin():
> > 
> >    BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *options,
> >                                       int flags, Error **errp)
> >    {
> >        ERRP_GUARD();
> >        int ret;
> >        BlockDriverState *new_node_bs = NULL;
> >        const char *drvname, *node_name;
> >        BlockDriver *drv;
> >        drvname = qdict_get_try_str(options, "driver");
> >        if (!drvname) {
> >            error_setg(errp, "driver is not specified");
> >            goto fail;
> >        }
> >        drv = bdrv_find_format(drvname);
> >        if (!drv) {
> >            error_setg(errp, "Unknown driver: '%s'", drvname);
> >            goto fail;
> >        }
> >        node_name = qdict_get_try_str(options, "node-name");
> >        new_node_bs = bdrv_new_open_driver_opts(drv, node_name, options, 
> > flags,
> >                                                errp);
> >        options = NULL; /* bdrv_new_open_driver() eats options */
> >        if (!new_node_bs) {
> >            error_prepend(errp, "Could not create node: ");
> >            goto fail;
> >        }
> >        bdrv_drained_begin(bs);
> >        ^^^^^^^^^^^^^^^^^^^^^^^ <--- bs can be dangling pointer
> >        ret = bdrv_replace_node(bs, new_node_bs, errp);
> >        bdrv_drained_end(bs);
> > 
> > The fix isn't as simple as blk_drain() because we don't want to insert
> > the new node before the now-deleted node. I think the correct way to
> > insert a node is against BdrvChild, not BlockDriverState. That way we
> > can be sure the new node will be inserted into a graph that is reachable
> > via BdrvChild (e.g. BlockBackend) instead of a detached BDS.
> > 
> > bdrv_set_aio_context_ignore() and blk_io_limits_disable() need to ref bs
> > like blk_drain() in this patch.
> > 
> > There are some other bdrv_drained_begin() calls that I'm assuming are
> > safe because they are during creation/deletion so I think we have strong
> > references there or nothing else knows about our BDS yet.
> > 
> > Do you agree with extending this patch series to cover the functions I
> > mentioned above?
> 
> I'm not sure.
> 
> First, we can't support "any graph change" during some graph changing 
> operation.
> 
> Actually, when we do some specific graph change operation, we should forbid 
> any other graph change operations, they should wait. Possibly, by adding 
> strong references everywhere, we can avoid crashes. But what about the logic? 
> If we do several graph changing operations simultaneously, the result is 
> absolutely unpredictable, it's not what user wants.
> 
> The problem is known. For example it may lead to 030 iotest failure. Probably 
> now it can't, after changes by Hanna.. But I think we'll not be safe, until 
> we have a kind of global mutex for graph changing operations. For example, 
> here is my old attempt: 
> https://lists.nongnu.org/archive/html/qemu-devel/2020-11/msg05290.html .
> 
> So, probably, until we have a good solution for this, better do only 
> necessary small fixes like your original patch..
> 
> 
> Second, actually bs may disappear on first yield point, which will happen 
> earlier than bdrv_drained_being() - in bdrv_new_open_driver_opts(). So, if 
> fix something, we'd better declare that caller of bdrv_insert_node() is 
> responsible for bs not disappear during function call. And check callers.


bdrv_insert_node() is tricky. I will leave it alone and focus on
bdrv_set_aio_context_ignore() and blk_io_limits_disable() because I
think they have exactly the same problem as blk_drain().

Stefan

signature.asc
Description: PGP signature

Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()

Reply via email to