(Restoring the cc list from the previous discussion.  Please retain the
cc list of the people who participated in the previous postings.)

On Fri, Sep 07, 2012 at 03:12:53PM -0700, Kent Overstreet wrote:
> But this is tricky and not a generic solution. This patch solves it for
> all users by inverting the previously described technique. We allocate a
> rescuer workqueue for each bio_set, and then, in the allocation code, if
> there are bios on current->bio_list that we would be blocking, we punt
> them to the rescuer workqueue to be submitted.

It would be great if this explanation could be expanded a bit.  Why does
it make the deadlock condition go away?  What are the restrictions -
e.g. using other mempools for additional per-bio data structures
wouldn't work, right?
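
For the changelog, a concrete example might also help.  Something like
the following (hypothetical, simplified stacking driver; names made up)
is the case being solved, right?

        /* runs beneath generic_make_request(); both bios come from @bs */
        static void split_and_submit(struct bio *bio, struct bio_set *bs)
        {
                struct bio *a, *b;

                a = bio_alloc_bioset(GFP_NOIO, 0, bs);
                generic_make_request(a); /* only queued on current->bio_list */

                /*
                 * If this exhausts bs->bio_pool's reserve, we block here
                 * waiting for a bio from @bs to be freed - but @a won't
                 * be submitted (and thus freed) until we return.
                 */
                b = bio_alloc_bioset(GFP_NOIO, 0, bs);
                generic_make_request(b);
        }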

> +static void punt_bios_to_rescuer(struct bio_set *bs)
> +{
> +     struct bio_list punt, nopunt;
> +     struct bio *bio;
> +
> +     bio_list_init(&punt);
> +     bio_list_init(&nopunt);
> +
> +     while ((bio = bio_list_pop(current->bio_list)))
> +             bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +
> +     *current->bio_list = nopunt;

Why this is necessary needs an explanation, and it's done in a rather
unusual way.  I suppose the weirdness comes from a bio_list API
restriction?
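
If it comes from bio_list being singly linked, i.e.

        /* include/linux/bio.h */
        struct bio_list {
                struct bio *head;
                struct bio *tail;
        };

with no way to unlink an entry from the middle of the list, then a short
comment noting that the loop has to drain and rebuild current->bio_list
wholesale would make this a lot easier to follow.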

> +     spin_lock(&bs->rescue_lock);
> +     bio_list_merge(&bs->rescue_list, &punt);
> +     spin_unlock(&bs->rescue_lock);
> +
> +     queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +}
> +
>  /**
>   * bio_alloc_bioset - allocate a bio for I/O
>   * @gfp_mask:   the GFP_ mask given to the slab allocator
> @@ -317,6 +354,7 @@ EXPORT_SYMBOL(bio_reset);
>   */
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> +     gfp_t saved_gfp = gfp_mask;
>       unsigned front_pad;
>       unsigned inline_vecs;
>       unsigned long idx = BIO_POOL_NONE;
> @@ -334,13 +372,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>               front_pad = 0;
>               inline_vecs = nr_iovecs;
>       } else {
> +             /*
> +              * generic_make_request() converts recursion to iteration; this
> +              * means if we're running beneath it, any bios we allocate and
> +              * submit will not be submitted (and thus freed) until after we
> +              * return.
> +              *
> +              * This exposes us to a potential deadlock if we allocate
> +              * multiple bios from the same bio_set while running
> +              * underneath generic_make_request(). If we were to allocate
> +              * multiple bios (say a stacking block driver that was splitting
> +              * bios), we would deadlock if we exhausted the mempool's
> +              * reserve.
> +              *
> +              * We solve this, and guarantee forward progress, with a rescuer
> +              * workqueue per bio_set. If we go to allocate and there are
> +              * bios on current->bio_list, we first try the allocation
> +              * without __GFP_WAIT; if that fails, we punt those bios we
> +              * would be blocking to the rescuer workqueue before we retry
> +              * with the original gfp_flags.
> +              */
> +
> +             if (current->bio_list && !bio_list_empty(current->bio_list))
> +                     gfp_mask &= ~__GFP_WAIT;
> +retry:
>               p = mempool_alloc(bs->bio_pool, gfp_mask);
>               front_pad = bs->front_pad;
>               inline_vecs = BIO_INLINE_VECS;
>       }

Wouldn't the following be better?

        p = mempool_alloc(bs->bio_pool, gfp_mask);
        if (unlikely(!p) && gfp_mask != saved_gfp) {
                punt_bios_to_rescuer(bs);
                p = mempool_alloc(bs->bio_pool, saved_gfp);
        }
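
That keeps the allocation path linear, drops the retry: label, and makes
it obvious that punting only happens after the opportunistic allocation
has already failed.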

I really hope the usage restriction for bioset (don't mix it with other
mempools in the same allocation path) gets clearly documented somewhere
appropriate.
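
e.g. something along these lines in the bio_alloc_bioset() comment (the
wording is just a suggestion):

        /*
         * Note: the rescuer guarantees forward progress only for bio
         * allocations from @bs itself; allocating additional per-bio
         * data from other mempools in the same path can still deadlock.
         */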

Thanks.

-- 
tejun