Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
Kent,

On Tue, Sep 11, 2012 at 12:31 PM, Kent Overstreet wrote:
> On Tue, Sep 11, 2012 at 11:58:05AM -0700, Muthu Kumar wrote:
>> On Tue, Sep 11, 2012 at 11:45 AM, Tejun Heo wrote:
>> > Hello,
>> >
>> > On Tue, Sep 11, 2012 at 11:36:28AM -0700, Muthu Kumar wrote:
>> >> Does this preserve the CPU from which the bio was submitted
>> >> originally? Not familiar with cmwq; maybe Tejun can clarify.
>> >>
>> >> Tejun - the question is, do we honor the rq_affinity with the above
>> >> rescue worker implementation?
>> >
>> > The work item would run from the same CPU but there isn't any
>> > mechanism to keep track of the issuing CPU if there are multiple bios
>> > to be rescued. Isn't rq_affinity an optimization hint? If so, I
>> > don't think it matters here.
>>
>> Thanks... Just worried about the performance impact.
>>
>> Kent - anything to validate that performance is not impacted would
>> be really good. Otherwise, the patch looks great.
>
> Well - there'll only be any performance impact at all when we're memory
> constrained enough that GFP_NOWAIT allocations fail, which for these
> size allocations definitely isn't normal.

Agreed. If we are in this situation, we are already running in degraded
mode, so performance is not in question.

Regards,
Muthu
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
On Tue, Sep 11, 2012 at 11:58:05AM -0700, Muthu Kumar wrote:
> On Tue, Sep 11, 2012 at 11:45 AM, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Sep 11, 2012 at 11:36:28AM -0700, Muthu Kumar wrote:
> >> Does this preserve the CPU from which the bio was submitted
> >> originally? Not familiar with cmwq; maybe Tejun can clarify.
> >>
> >> Tejun - the question is, do we honor the rq_affinity with the above
> >> rescue worker implementation?
> >
> > The work item would run from the same CPU but there isn't any
> > mechanism to keep track of the issuing CPU if there are multiple bios
> > to be rescued. Isn't rq_affinity an optimization hint? If so, I
> > don't think it matters here.
>
> Thanks... Just worried about the performance impact.
>
> Kent - anything to validate that performance is not impacted would
> be really good. Otherwise, the patch looks great.

Well - there'll only be any performance impact at all when we're memory
constrained enough that GFP_NOWAIT allocations fail, which for these
size allocations definitely isn't normal.

I did test it with forcing everything to use the rescuer, and I also
benchmarked Vivek's version - in any sane configuration, the impact of
punting everything to the workqueue is not very noticeable (the AHCI
interrupt handler uses more CPU).

> Feel free to add:
>
> Reviewed-by: Muthukumar Ratty

Thanks!
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
On Tue, Sep 11, 2012 at 11:45 AM, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 11, 2012 at 11:36:28AM -0700, Muthu Kumar wrote:
>> Does this preserve the CPU from which the bio was submitted
>> originally? Not familiar with cmwq; maybe Tejun can clarify.
>>
>> Tejun - the question is, do we honor the rq_affinity with the above
>> rescue worker implementation?
>
> The work item would run from the same CPU but there isn't any
> mechanism to keep track of the issuing CPU if there are multiple bios
> to be rescued. Isn't rq_affinity an optimization hint? If so, I
> don't think it matters here.

Thanks... Just worried about the performance impact.

Kent - anything to validate that performance is not impacted would be
really good. Otherwise, the patch looks great.

Feel free to add:

Reviewed-by: Muthukumar Ratty

Regards,
Muthu

> Thanks.
>
> --
> tejun
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
Hello,

On Tue, Sep 11, 2012 at 11:36:28AM -0700, Muthu Kumar wrote:
> Does this preserve the CPU from which the bio was submitted
> originally? Not familiar with cmwq; maybe Tejun can clarify.
>
> Tejun - the question is, do we honor the rq_affinity with the above
> rescue worker implementation?

The work item would run from the same CPU but there isn't any
mechanism to keep track of the issuing CPU if there are multiple bios
to be rescued. Isn't rq_affinity an optimization hint? If so, I
don't think it matters here.

Thanks.

--
tejun
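Tejun's answer follows from the shape of the rescuer itself: all punted
bios from a bio_set land on a single shared list and are resubmitted by a
single work item, so there is nowhere to record which CPU issued each
individual bio. As a rough sketch - inferred from the bs->rescue_* uses in
the patch, since the actual struct bio_set hunk is never quoted in this
thread, so treat the exact layout as an assumption - the rescuer state
looks something like:

	/* Inferred from the patch's uses of bs->rescue_*; this is not the
	 * literal include/linux/bio.h change, which this thread doesn't
	 * quote. */
	struct bio_set {
		/* ... existing mempools, front_pad, etc. ... */

		spinlock_t		rescue_lock;
		struct bio_list		rescue_list;	/* punted bios from every CPU */
		struct work_struct	rescue_work;	/* one work item per bio_set */
		struct workqueue_struct	*rescue_workqueue;
	};

One list and one work item per bio_set means any per-bio CPU affinity is
necessarily lost once a bio is punted.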
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
Kent,

On Mon, Sep 10, 2012 at 2:56 PM, Kent Overstreet wrote:
..
..
> +static void bio_alloc_rescue(struct work_struct *work)
> +{
> +	struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
> +	struct bio *bio;
> +
> +	while (1) {
> +		spin_lock(&bs->rescue_lock);
> +		bio = bio_list_pop(&bs->rescue_list);
> +		spin_unlock(&bs->rescue_lock);
> +
> +		if (!bio)
> +			break;
> +
> +		generic_make_request(bio);
> +	}
> +}
> +
> +static void punt_bios_to_rescuer(struct bio_set *bs)
> +{
> +	struct bio_list punt, nopunt;
> +	struct bio *bio;
> +
> +	/*
> +	 * In order to guarantee forward progress we must punt only bios that
> +	 * were allocated from this bio_set; otherwise, if there was a bio on
> +	 * there for a stacking driver higher up in the stack, processing it
> +	 * could require allocating bios from this bio_set, and doing that from
> +	 * our own rescuer would be bad.
> +	 *
> +	 * Since bio lists are singly linked, pop them all instead of trying to
> +	 * remove from the middle of the list:
> +	 */
> +
> +	bio_list_init(&punt);
> +	bio_list_init(&nopunt);
> +
> +	while ((bio = bio_list_pop(current->bio_list)))
> +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +
> +	*current->bio_list = nopunt;
> +
> +	spin_lock(&bs->rescue_lock);
> +	bio_list_merge(&bs->rescue_list, &punt);
> +	spin_unlock(&bs->rescue_lock);
> +
> +	queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +}

Does this preserve the CPU from which the bio was submitted originally?
Not familiar with cmwq; maybe Tejun can clarify.

Tejun - the question is, do we honor the rq_affinity with the above
rescue worker implementation?

Regards,
Muthu
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
Hello, again.

On Mon, Sep 10, 2012 at 03:09:10PM -0700, Tejun Heo wrote:
> I'm still a bit scared but think this is correct.
>
> Acked-by: Tejun Heo
>
> One last thing is that we may want to add @name on bioset creation so
> that we can name the workqueue properly but that's for another patch.

Yet another thing that I noticed in a different discussion.

  http://thread.gmane.org/gmane.linux.network.drbd.devel/2130

Before this, I think bios didn't get reordered while they're traveling
down the stacked bio drivers. After this, I don't think that's true
anymore. Currently, IIRC, we don't have any ordering functionality at
the bio interface, so I don't think this breaks anything, but this can
lead to stupidly subtle bugs if the upper layer is somehow making
assumptions about ordering.

It's something which at least should be noted, I think. Whether we
want to update the code so that ordering is maintained, I don't know.
I hope not. It's already crazy complex. :(

Thanks.

--
tejun
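To make the reordering concern concrete, here is a minimal userspace model
of the partition step that punt_bios_to_rescuer() performs. The types and
helpers below are stripped-down stand-ins invented for this sketch, with
pool 1 playing the role of "allocated from this bio_set":

	#include <stdio.h>

	/* Stripped-down stand-ins for struct bio and struct bio_list. */
	struct bio {
		int		id;
		int		pool;	/* which bio_set this bio came from */
		struct bio	*next;
	};

	struct bio_list {
		struct bio *head, *tail;
	};

	static void bio_list_init(struct bio_list *bl)
	{
		bl->head = bl->tail = NULL;
	}

	static void bio_list_add(struct bio_list *bl, struct bio *bio)
	{
		bio->next = NULL;
		if (bl->tail)
			bl->tail->next = bio;
		else
			bl->head = bio;
		bl->tail = bio;
	}

	static struct bio *bio_list_pop(struct bio_list *bl)
	{
		struct bio *bio = bl->head;

		if (bio) {
			bl->head = bio->next;
			if (!bl->head)
				bl->tail = NULL;
		}
		return bio;
	}

	int main(void)
	{
		struct bio bios[4] = { { 0, 1 }, { 1, 2 }, { 2, 1 }, { 3, 2 } };
		struct bio_list list, punt, nopunt;
		struct bio *bio;
		int i;

		bio_list_init(&list);
		bio_list_init(&punt);
		bio_list_init(&nopunt);
		for (i = 0; i < 4; i++)
			bio_list_add(&list, &bios[i]);

		/* The partition step from punt_bios_to_rescuer(). */
		while ((bio = bio_list_pop(&list)))
			bio_list_add(bio->pool == 1 ? &punt : &nopunt, bio);

		printf("punted to rescuer: ");
		while ((bio = bio_list_pop(&punt)))
			printf("%d ", bio->id);
		printf("\nstill on current->bio_list: ");
		while ((bio = bio_list_pop(&nopunt)))
			printf("%d ", bio->id);
		printf("\n");
		return 0;
	}

Bios 0 and 2 end up resubmitted by the rescuer thread while 1 and 3 stay
with the original submitter, so the overall 0,1,2,3 submission order is no
longer guaranteed - which is exactly the reordering being pointed out
above.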
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
Hello, Kent.

On Mon, Sep 10, 2012 at 02:56:33PM -0700, Kent Overstreet wrote:
> commit df7e63cbffa3065fcc4ba2b9a93418d7c7312243
> Author: Kent Overstreet
> Date: Mon Sep 10 14:33:46 2012 -0700
>
>     block: Avoid deadlocks with bio allocation by stacking drivers
>
>     Previously, if we ever try to allocate more than once from the same bio
>     set while running under generic_make_request() (i.e. a stacking block
>     driver), we risk deadlock.
>
>     This is because of the code in generic_make_request() that converts
>     recursion to iteration; any bios we submit won't actually be submitted
>     (so they can complete and eventually be freed) until after we return -
>     this means if we allocate a second bio, we're blocking the first one
>     from ever being freed.
>
>     Thus if enough threads call into a stacking block driver at the same
>     time with bios that need multiple splits, and the bio_set's reserve gets
>     used up, we deadlock.
>
>     This can be worked around in the driver code - we could check if we're
>     running under generic_make_request(), then mask out __GFP_WAIT when we
>     go to allocate a bio, and if the allocation fails punt to workqueue and
>     retry the allocation.
>
>     But this is tricky and not a generic solution. This patch solves it for
>     all users by inverting the previously described technique. We allocate a
>     rescuer workqueue for each bio_set, and then in the allocation code if
>     there are bios on current->bio_list we would be blocking, we punt them
>     to the rescuer workqueue to be submitted.
>
>     This guarantees forward progress for bio allocations under
>     generic_make_request() provided each bio is submitted before allocating
>     the next, and provided the bios are freed after they complete.
>
>     Note that this doesn't do anything for allocation from other mempools.
>     Instead of allocating per bio data structures from a mempool, code
>     should use bio_set's front_pad.
>
>     Tested it by forcing the rescue codepath to be taken (by disabling the
>     first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
>     of arbitrary bio splitting) and verified that the rescuer was being
>     invoked.
>
>     Signed-off-by: Kent Overstreet
>     CC: Jens Axboe

I'm still a bit scared but think this is correct.

Acked-by: Tejun Heo

One last thing is that we may want to add @name on bioset creation so
that we can name the workqueue properly, but that's for another patch.

Thanks.

--
tejun
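The deadlock the commit message describes reduces to a pool of size one
being used twice before the first allocation is given back. Below is a
hedged userspace model (build with cc -pthread); the semaphore and the
one-second timeout are stand-ins invented for this sketch - in the kernel,
the second mempool_alloc() with __GFP_WAIT would simply block forever:

	#include <semaphore.h>
	#include <stdio.h>
	#include <time.h>

	/* Toy model: a mempool whose reserve holds exactly one bio, used
	 * twice before the first bio is submitted. */
	static sem_t pool_reserve;

	static int pool_alloc(int timeout_secs)
	{
		struct timespec ts;

		clock_gettime(CLOCK_REALTIME, &ts);
		ts.tv_sec += timeout_secs;
		return sem_timedwait(&pool_reserve, &ts); /* 0 == got an entry */
	}

	int main(void)
	{
		sem_init(&pool_reserve, 0, 1); /* reserve of exactly one bio */

		/* First split: the bio is parked on current->bio_list, so it
		 * cannot complete - and free its reserve entry - until the
		 * driver's make_request function returns. */
		if (pool_alloc(1) == 0)
			printf("first bio allocated, parked on current->bio_list\n");

		/* Second split from the same bio_set: the only reserve entry
		 * is still held by the parked bio, so the allocation cannot
		 * make progress. This is the deadlock the patch breaks by
		 * punting the parked bios to the rescuer first. */
		if (pool_alloc(1) != 0)
			printf("second allocation stuck: deadlock\n");

		return 0;
	}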
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
On Mon, Sep 10, 2012 at 02:37:10PM -0700, Tejun Heo wrote:
> On Mon, Sep 10, 2012 at 02:33:49PM -0700, Kent Overstreet wrote:
> > "Simpler" isn't really an objective thing though. To me the goto version
> > is more obvious/idiomatic.
> >
> > Eh. I'll do it your way, but consider this a formal objection :p
>
> Thanks. :)

Here's the current version. Good enough for an acked-by?

commit df7e63cbffa3065fcc4ba2b9a93418d7c7312243
Author: Kent Overstreet
Date: Mon Sep 10 14:33:46 2012 -0700

    block: Avoid deadlocks with bio allocation by stacking drivers

    Previously, if we ever try to allocate more than once from the same bio
    set while running under generic_make_request() (i.e. a stacking block
    driver), we risk deadlock.

    This is because of the code in generic_make_request() that converts
    recursion to iteration; any bios we submit won't actually be submitted
    (so they can complete and eventually be freed) until after we return -
    this means if we allocate a second bio, we're blocking the first one
    from ever being freed.

    Thus if enough threads call into a stacking block driver at the same
    time with bios that need multiple splits, and the bio_set's reserve gets
    used up, we deadlock.

    This can be worked around in the driver code - we could check if we're
    running under generic_make_request(), then mask out __GFP_WAIT when we
    go to allocate a bio, and if the allocation fails punt to workqueue and
    retry the allocation.

    But this is tricky and not a generic solution. This patch solves it for
    all users by inverting the previously described technique. We allocate a
    rescuer workqueue for each bio_set, and then in the allocation code if
    there are bios on current->bio_list we would be blocking, we punt them
    to the rescuer workqueue to be submitted.

    This guarantees forward progress for bio allocations under
    generic_make_request() provided each bio is submitted before allocating
    the next, and provided the bios are freed after they complete.

    Note that this doesn't do anything for allocation from other mempools.
    Instead of allocating per bio data structures from a mempool, code
    should use bio_set's front_pad.

    Tested it by forcing the rescue codepath to be taken (by disabling the
    first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
    of arbitrary bio splitting) and verified that the rescuer was being
    invoked.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe

diff --git a/fs/bio.c b/fs/bio.c
index 13e9567..4783e31 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -295,6 +295,54 @@ void bio_reset(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_reset);
 
+static void bio_alloc_rescue(struct work_struct *work)
+{
+	struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
+	struct bio *bio;
+
+	while (1) {
+		spin_lock(&bs->rescue_lock);
+		bio = bio_list_pop(&bs->rescue_list);
+		spin_unlock(&bs->rescue_lock);
+
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+}
+
+static void punt_bios_to_rescuer(struct bio_set *bs)
+{
+	struct bio_list punt, nopunt;
+	struct bio *bio;
+
+	/*
+	 * In order to guarantee forward progress we must punt only bios that
+	 * were allocated from this bio_set; otherwise, if there was a bio on
+	 * there for a stacking driver higher up in the stack, processing it
+	 * could require allocating bios from this bio_set, and doing that from
+	 * our own rescuer would be bad.
+	 *
+	 * Since bio lists are singly linked, pop them all instead of trying to
+	 * remove from the middle of the list:
+	 */
+
+	bio_list_init(&punt);
+	bio_list_init(&nopunt);
+
+	while ((bio = bio_list_pop(current->bio_list)))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+
+	*current->bio_list = nopunt;
+
+	spin_lock(&bs->rescue_lock);
+	bio_list_merge(&bs->rescue_list, &punt);
+	spin_unlock(&bs->rescue_lock);
+
+	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+}
+
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask: the GFP_ mask given to the slab allocator
@@ -312,11 +360,27 @@ EXPORT_SYMBOL(bio_reset);
  * previously allocated bio for IO before attempting to allocate a new one.
  * Failure to do so can cause deadlocks under memory pressure.
  *
+ * Note that when running under generic_make_request() (i.e. any block
+ * driver), bios are not submitted until after you return - see the code in
+ * generic_make_request() that converts recursion into iteration, to prevent
+ * stack overflows.
+ *
+ * This would normally mean allocating multiple bios under
+ * generic_make_request() would be susceptible to deadlocks, but we have
+ * deadlock avoidance code that resubmits any blocked bios from a rescuer
+ * thread.
+
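An aside on the front_pad advice in the commit message above: per-bio
driver state goes in pad space allocated directly in front of the bio,
rather than in a second mempool that would reintroduce the same deadlock
hazard. A kernel-style sketch, not compilable standalone; struct my_io,
my_bio_set, and the pool size of 4 are invented for illustration, while
bioset_create(), bio_alloc_bioset(), and container_of() are the real
interfaces of this era:

	/* Hypothetical driver state embedded via front_pad. */
	struct my_io {
		void		*driver_state;	/* per-bio driver data */
		struct bio	bio;		/* must be last: bio follows the pad */
	};

	static struct bio_set *my_bio_set;

	static int my_init(void)
	{
		/* Reserve offsetof(struct my_io, bio) bytes in front of each bio. */
		my_bio_set = bioset_create(4, offsetof(struct my_io, bio));
		return my_bio_set ? 0 : -ENOMEM;
	}

	static struct my_io *my_io_alloc(gfp_t gfp, int nr_vecs)
	{
		struct bio *bio = bio_alloc_bioset(gfp, nr_vecs, my_bio_set);

		/* The pad sits directly in front of the returned bio. */
		return bio ? container_of(bio, struct my_io, bio) : NULL;
	}

Because the pad comes out of the bio_set's own mempool allocation, the
rescuer's forward-progress guarantee covers it for free.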
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
On Mon, Sep 10, 2012 at 02:33:49PM -0700, Kent Overstreet wrote:
> "Simpler" isn't really an objective thing though. To me the goto version
> is more obvious/idiomatic.
>
> Eh. I'll do it your way, but consider this a formal objection :p

Thanks. :)

--
tejun
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
On Mon, Sep 10, 2012 at 01:40:10PM -0700, Tejun Heo wrote:
> Hello, Kent.
>
> On Mon, Sep 10, 2012 at 01:24:35PM -0700, Kent Overstreet wrote:
> > And at that point, why duplicate that line of code? It doesn't matter that
> > much, but IMO a goto retry better labels what's actually going on (it's
> > something that's not uncommon in the kernel and if I see a retry label
> > in a function I pretty immediately have an idea of what's going on).
> >
> > So we could do
> >
> > retry:
> > 	p = mempool_alloc(bs->bio_pool, gfp_mask);
> > 	if (!p && gfp_mask != saved_gfp) {
> > 		punt_bios_to_rescuer(bs);
> > 		gfp_mask = saved_gfp;
> > 		goto retry;
> > 	}
>
> Yes, we do retry loops if that makes the code simpler. Doing that to
> save one extra alloc call, I don't think so.

"Simpler" isn't really an objective thing though. To me the goto version
is more obvious/idiomatic.

Eh. I'll do it your way, but consider this a formal objection :p
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
Hello, Kent.

On Mon, Sep 10, 2012 at 01:24:35PM -0700, Kent Overstreet wrote:
> And at that point, why duplicate that line of code? It doesn't matter that
> much, but IMO a goto retry better labels what's actually going on (it's
> something that's not uncommon in the kernel and if I see a retry label
> in a function I pretty immediately have an idea of what's going on).
>
> So we could do
>
> retry:
> 	p = mempool_alloc(bs->bio_pool, gfp_mask);
> 	if (!p && gfp_mask != saved_gfp) {
> 		punt_bios_to_rescuer(bs);
> 		gfp_mask = saved_gfp;
> 		goto retry;
> 	}

Yes, we do retry loops if that makes the code simpler. Doing that to
save one extra alloc call, I don't think so.

Thanks.

--
tejun
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
On Mon, Sep 10, 2012 at 10:22:10AM -0700, Tejun Heo wrote:
> Hello, Kent.
>
> On Sun, Sep 09, 2012 at 05:28:10PM -0700, Kent Overstreet wrote:
> > > > +	while ((bio = bio_list_pop(current->bio_list)))
> > > > +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> > > > +
> > > > +	*current->bio_list = nopunt;
> > >
> > > Why this is necessary needs explanation and it's done in a rather
> > > unusual way. I suppose the weirdness is from a bio_list API
> > > restriction?
> >
> > It's because bio_lists are singly linked, so deleting an entry from the
> > middle of the list would be a real pain - just much cleaner/simpler to
> > do it this way.
>
> Yeah, I wonder how beneficial that singly linked list is. Eh well...

Well, this is the first time I can think of that it's come up, and IMO
this is no less clean a way of writing it... just a bit unusual in C; it
feels more functional to me than imperative.

> > > Wouldn't the following be better?
> > >
> > >	p = mempool_alloc(bs->bi_pool, gfp_mask);
> > >	if (unlikely(!p) && gfp_mask != saved_gfp) {
> > >		punt_bios_to_rescuer(bs);
> > >		p = mempool_alloc(bs->bi_pool, saved_gfp);
> > >	}
> >
> > That'd require duplicating the error handling in two different places -
> > once for the initial allocation, once for the bvec allocation. And I
> > really hate writing code that does
> >
> >	alloc_something()
> >	if (fail) {
> >		alloc_something_again()
> >	}
> >
> > it just screams ugly to me.
>
> I don't know. That at least represents what's going on and goto'ing
> back and forth is hardly pretty. Sometimes the code gets much uglier
> / unwieldy and we have to live with gotos. Here, that doesn't seem to
> be the case.

I think this is really more personal preference than anything, but:

Setting gfp_mask = saved_gfp after calling punt_bios_to_rescuer() is
really the correct thing to do, and makes the code clearer IMO: once
we've run punt_bios_to_rescuer() we don't need to mask out __GFP_WAIT
(not until the next time a bio is submitted, really).

This matters a bit for the bvl allocation too; if we called
punt_bios_to_rescuer() for the bio allocation, there's no point doing it
again.

So to be rigorously correct, your way would have to be

	p = mempool_alloc(bs->bio_pool, gfp_mask);
	if (!p && gfp_mask != saved_gfp) {
		punt_bios_to_rescuer(bs);
		gfp_mask = saved_gfp;
		p = mempool_alloc(bs->bio_pool, gfp_mask);
	}

And at that point, why duplicate that line of code? It doesn't matter
that much, but IMO a goto retry better labels what's actually going on
(it's something that's not uncommon in the kernel, and if I see a retry
label in a function I pretty immediately have an idea of what's going
on). So we could do

retry:
	p = mempool_alloc(bs->bio_pool, gfp_mask);
	if (!p && gfp_mask != saved_gfp) {
		punt_bios_to_rescuer(bs);
		gfp_mask = saved_gfp;
		goto retry;
	}

(Side note: not that it really matters here, but gcc will inline the
bvec_alloc_bs() call if it's not duplicated; I've never seen it
consolidate duplicated code and /then/ inline based off that.)

This does have the advantage that we're not freeing and reallocating
the bio like Vivek pointed out, but I'm not a huge fan of having the
punting/retry logic in the main code path. I don't care that much
though. I'd prefer not to have the actual allocations duplicated, but
it's starting to feel like bikeshedding to me.

> > +static void punt_bios_to_rescuer(struct bio_set *bs)
> > +{
> > +	struct bio_list punt, nopunt;
> > +	struct bio *bio;
> > +
> > +	/*
> > +	 * Don't want to punt all bios on current->bio_list; if there was a bio
> > +	 * on there for a stacking driver higher up in the stack, processing it
> > +	 * could require allocating bios from this bio_set, and we don't want to
> > +	 * do that from our own rescuer.
>
> Hmmm... isn't it more like we "must" process only the bios which are
> from this bio_set to have any kind of forward-progress guarantee? The
> above sounds like it's just something undesirable.

Yeah, that'd be better, I'll change it.
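For what it's worth, the two control-flow shapes being debated can be put
side by side in a self-contained userspace model; try_alloc() and the stub
punt_bios_to_rescuer() below are invented stand-ins for mempool_alloc()
and the real rescuer, with the no-wait attempt forced to fail so both
paths exercise the retry:

	#include <stdio.h>
	#include <stdlib.h>

	static void *try_alloc(int allow_wait)
	{
		/* Pretend the nowait attempt fails and the waiting one works. */
		return allow_wait ? malloc(64) : NULL;
	}

	static void punt_bios_to_rescuer(void)
	{
		puts("punting blocked bios to the rescuer");
	}

	/* Shape 1: the goto retry - a single allocation site. */
	static void *alloc_goto(void)
	{
		int allow_wait = 0;	/* stands in for gfp_mask != saved_gfp */
		void *p;
	retry:
		p = try_alloc(allow_wait);
		if (!p && !allow_wait) {
			punt_bios_to_rescuer();
			allow_wait = 1;
			goto retry;
		}
		return p;
	}

	/* Shape 2: the inline retry - the allocation call is duplicated. */
	static void *alloc_inline(void)
	{
		void *p = try_alloc(0);

		if (!p) {
			punt_bios_to_rescuer();
			p = try_alloc(1);
		}
		return p;
	}

	int main(void)
	{
		free(alloc_goto());
		free(alloc_inline());
		return 0;
	}

Both shapes compute the same thing; the disagreement above is purely
about whether the single allocation site (goto) or the duplicated call
(inline retry) reads better.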
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
Hello, Kent.

On Sun, Sep 09, 2012 at 05:28:10PM -0700, Kent Overstreet wrote:
> > > +	while ((bio = bio_list_pop(current->bio_list)))
> > > +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> > > +
> > > +	*current->bio_list = nopunt;
> >
> > Why this is necessary needs explanation and it's done in a rather
> > unusual way. I suppose the weirdness is from a bio_list API
> > restriction?
>
> It's because bio_lists are singly linked, so deleting an entry from the
> middle of the list would be a real pain - just much cleaner/simpler to
> do it this way.

Yeah, I wonder how beneficial that singly linked list is. Eh well...

> > Wouldn't the following be better?
> >
> >	p = mempool_alloc(bs->bi_pool, gfp_mask);
> >	if (unlikely(!p) && gfp_mask != saved_gfp) {
> >		punt_bios_to_rescuer(bs);
> >		p = mempool_alloc(bs->bi_pool, saved_gfp);
> >	}
>
> That'd require duplicating the error handling in two different places -
> once for the initial allocation, once for the bvec allocation. And I
> really hate writing code that does
>
>	alloc_something()
>	if (fail) {
>		alloc_something_again()
>	}
>
> it just screams ugly to me.

I don't know. That at least represents what's going on, and goto'ing
back and forth is hardly pretty. Sometimes the code gets much uglier /
unwieldy and we have to live with gotos. Here, that doesn't seem to
be the case.

> +static void punt_bios_to_rescuer(struct bio_set *bs)
> +{
> +	struct bio_list punt, nopunt;
> +	struct bio *bio;
> +
> +	/*
> +	 * Don't want to punt all bios on current->bio_list; if there was a bio
> +	 * on there for a stacking driver higher up in the stack, processing it
> +	 * could require allocating bios from this bio_set, and we don't want to
> +	 * do that from our own rescuer.

Hmmm... isn't it more like we "must" process only the bios which are
from this bio_set to have any kind of forward-progress guarantee? The
above sounds like it's just something undesirable.

Thanks.

--
tejun
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
On Sun, Sep 09, 2012 at 05:28:10PM -0700, Kent Overstreet wrote:

[..]
> > > +retry:
> > > 	p = mempool_alloc(bs->bio_pool, gfp_mask);
> > > 	front_pad = bs->front_pad;
> > > 	inline_vecs = BIO_INLINE_VECS;
> > > }
> >
> > Wouldn't the following be better?
> >
> >	p = mempool_alloc(bs->bi_pool, gfp_mask);
> >	if (unlikely(!p) && gfp_mask != saved_gfp) {
> >		punt_bios_to_rescuer(bs);
> >		p = mempool_alloc(bs->bi_pool, saved_gfp);
> >	}
>
> That'd require duplicating the error handling in two different places -
> once for the initial allocation, once for the bvec allocation. And I
> really hate writing code that does
>
>	alloc_something()
>	if (fail) {
>		alloc_something_again()
>	}
>
> it just screams ugly to me.

Personally, I kind of like Tejun's suggestion. Yes, it means that you
will have to do it twice (once for the bio and once for the bvecs). But
at the same time, if the bvec allocation fails, then you don't have to
free the already-allocated bio and retry the bio allocation again. The
current code is doing that, and I am not sure why we should free the
allocated bio and retry again in case of bvec allocation failure.

Secondly, the current code is doing the following.

	if (unlikely(!bvl))
		goto err_free;

	err_free:
		mempool_free(p, bs->bio_pool);
	err:
		retry_allocation.

Not sure, but err_free kind of gives the impression of "free up whatever
is allocated and return NULL". Instead, we end up retrying the
allocation. Implementing Tejun's idea at least keeps the error and retry
code separate.

I am not too particular about it. Just a thought. Though I do want to
know why we free up the already-allocated bio and retry in case of bvec
allocation failure.

Thanks
Vivek
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
On Sat, Sep 08, 2012 at 12:36:41PM -0700, Tejun Heo wrote:
> (Restoring cc list from the previous discussion. Please retain the cc
> list of the people who discussed in the previous postings.)
>
> On Fri, Sep 07, 2012 at 03:12:53PM -0700, Kent Overstreet wrote:
> > But this is tricky and not a generic solution. This patch solves it for
> > all users by inverting the previously described technique. We allocate a
> > rescuer workqueue for each bio_set, and then in the allocation code if
> > there are bios on current->bio_list we would be blocking, we punt them
> > to the rescuer workqueue to be submitted.
>
> It would be great if this explanation could be expanded a bit. Why does
> it make the deadlock condition go away? What are the restrictions -
> e.g. using other mempools for additional per-bio data structures
> wouldn't work, right?

Ok, I'll do that. New patch below.

> > +static void punt_bios_to_rescuer(struct bio_set *bs)
> > +{
> > +	struct bio_list punt, nopunt;
> > +	struct bio *bio;
> > +
> > +	bio_list_init(&punt);
> > +	bio_list_init(&nopunt);
> > +
> > +	while ((bio = bio_list_pop(current->bio_list)))
> > +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> > +
> > +	*current->bio_list = nopunt;
>
> Why this is necessary needs explanation and it's done in a rather
> unusual way. I suppose the weirdness is from a bio_list API
> restriction?

It's because bio_lists are singly linked, so deleting an entry from the
middle of the list would be a real pain - it's just much cleaner/simpler
to do it this way.

> > +	spin_lock(&bs->rescue_lock);
> > +	bio_list_merge(&bs->rescue_list, &punt);
> > +	spin_unlock(&bs->rescue_lock);
> > +
> > +	queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +}
> > +
> >  /**
> >   * bio_alloc_bioset - allocate a bio for I/O
> >   * @gfp_mask:   the GFP_ mask given to the slab allocator
> > @@ -317,6 +354,7 @@ EXPORT_SYMBOL(bio_reset);
> >   */
> >  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >  {
> > +	gfp_t saved_gfp = gfp_mask;
> >  	unsigned front_pad;
> >  	unsigned inline_vecs;
> >  	unsigned long idx = BIO_POOL_NONE;
> > @@ -334,13 +372,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >  		front_pad = 0;
> >  		inline_vecs = nr_iovecs;
> >  	} else {
> > +		/*
> > +		 * generic_make_request() converts recursion to iteration; this
> > +		 * means if we're running beneath it, any bios we allocate and
> > +		 * submit will not be submitted (and thus freed) until after we
> > +		 * return.
> > +		 *
> > +		 * This exposes us to a potential deadlock if we allocate
> > +		 * multiple bios from the same bio_set while running
> > +		 * underneath generic_make_request(). If we were to allocate
> > +		 * multiple bios (say a stacking block driver that was splitting
> > +		 * bios), we would deadlock if we exhausted the mempool's
> > +		 * reserve.
> > +		 *
> > +		 * We solve this, and guarantee forward progress, with a rescuer
> > +		 * workqueue per bio_set. If we go to allocate and there are
> > +		 * bios on current->bio_list, we first try the allocation
> > +		 * without __GFP_WAIT; if that fails, we punt those bios we
> > +		 * would be blocking to the rescuer workqueue before we retry
> > +		 * with the original gfp_flags.
> > +		 */
> > +
> > +		if (current->bio_list && !bio_list_empty(current->bio_list))
> > +			gfp_mask &= ~__GFP_WAIT;
> > +retry:
> >  		p = mempool_alloc(bs->bio_pool, gfp_mask);
> >  		front_pad = bs->front_pad;
> >  		inline_vecs = BIO_INLINE_VECS;
> >  	}
>
> Wouldn't the following be better?
>
>	p = mempool_alloc(bs->bi_pool, gfp_mask);
>	if (unlikely(!p) && gfp_mask != saved_gfp) {
>		punt_bios_to_rescuer(bs);
>		p = mempool_alloc(bs->bi_pool, saved_gfp);
>	}

That'd require duplicating the error handling in two different places -
once for the initial allocation, once for the bvec allocation.

And I really hate writing code that does

	alloc_something()
	if (fail) {
		alloc_something_again()
	}

it just screams ugly to me.

> I really hope the usage restriction (don't mix with mempool) for
> bioset is clearly documented somewhere appropriate.

Good point, I'm adding it to bio_alloc_bioset's documentation.

commit 4edb21e0b749fc098c72edcb4f9abdeca6fc62cd
Author: Kent Overstreet <koverstr...@google.com>
Date:   Sun Sep 9 17:23:29 2012 -0700

    block: Avoid deadlocks with bio allocation by stacking drivers

    Previously, if we ever try to allocate more than once from the same
    bio set while running under generic_make_request() (i.e. a stacking
    block driver), we risk deadlock. This is because of the code in
    generic_make_request() that converts recursion to iteration; any bios
    we submit won't actually be submitted (so they can complete and
    eventually be freed) until we return.
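For reference, the control flow Kent is defending, condensed into one
place. This is a sketch only, written as if it sat alongside the
patch's code in fs/bio.c: bvec allocation and error paths are elided,
the helper name alloc_with_rescue is made up for illustration, and it
uses the __GFP_WAIT API of that era, as the patch does:

	#include <linux/bio.h>
	#include <linux/mempool.h>
	#include <linux/sched.h>

	static void punt_bios_to_rescuer(struct bio_set *bs); /* from the patch */

	static void *alloc_with_rescue(struct bio_set *bs, gfp_t gfp_mask)
	{
		gfp_t saved_gfp = gfp_mask;
		void *p;

		/* Only drop __GFP_WAIT when bios are actually pending under
		 * generic_make_request(); otherwise blocking is safe. */
		if (current->bio_list && !bio_list_empty(current->bio_list))
			gfp_mask &= ~__GFP_WAIT;
	retry:
		p = mempool_alloc(bs->bio_pool, gfp_mask);
		if (!p && gfp_mask != saved_gfp) {
			/* The nowait attempt failed: unstick the pending bios,
			 * then retry with the caller's original mask. The
			 * error handling after this point is shared by both
			 * attempts, which is Kent's objection to the
			 * alloc/if-fail/alloc-again form. */
			punt_bios_to_rescuer(bs);
			gfp_mask = saved_gfp;
			goto retry;
		}
		return p;
	}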
Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
(Restoring cc list from the previous discussion. Please retain the cc
list of the people who discussed in the previous postings.)

On Fri, Sep 07, 2012 at 03:12:53PM -0700, Kent Overstreet wrote:
> But this is tricky and not a generic solution. This patch solves it for
> all users by inverting the previously described technique. We allocate a
> rescuer workqueue for each bio_set, and then in the allocation code if
> there are bios on current->bio_list we would be blocking, we punt them
> to the rescuer workqueue to be submitted.

It would be great if this explanation could be expanded a bit. Why does
it make the deadlock condition go away? What are the restrictions -
e.g. using other mempools for additional per-bio data structures
wouldn't work, right?

> +static void punt_bios_to_rescuer(struct bio_set *bs)
> +{
> +	struct bio_list punt, nopunt;
> +	struct bio *bio;
> +
> +	bio_list_init(&punt);
> +	bio_list_init(&nopunt);
> +
> +	while ((bio = bio_list_pop(current->bio_list)))
> +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +
> +	*current->bio_list = nopunt;

Why this is necessary needs explanation and it's done in a rather
unusual way. I suppose the weirdness is from a bio_list API
restriction?

> +	spin_lock(&bs->rescue_lock);
> +	bio_list_merge(&bs->rescue_list, &punt);
> +	spin_unlock(&bs->rescue_lock);
> +
> +	queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +}
> +
>  /**
>   * bio_alloc_bioset - allocate a bio for I/O
>   * @gfp_mask:   the GFP_ mask given to the slab allocator
> @@ -317,6 +354,7 @@ EXPORT_SYMBOL(bio_reset);
>   */
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> +	gfp_t saved_gfp = gfp_mask;
>  	unsigned front_pad;
>  	unsigned inline_vecs;
>  	unsigned long idx = BIO_POOL_NONE;
> @@ -334,13 +372,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		front_pad = 0;
>  		inline_vecs = nr_iovecs;
>  	} else {
> +		/*
> +		 * generic_make_request() converts recursion to iteration; this
> +		 * means if we're running beneath it, any bios we allocate and
> +		 * submit will not be submitted (and thus freed) until after we
> +		 * return.
> +		 *
> +		 * This exposes us to a potential deadlock if we allocate
> +		 * multiple bios from the same bio_set while running
> +		 * underneath generic_make_request(). If we were to allocate
> +		 * multiple bios (say a stacking block driver that was splitting
> +		 * bios), we would deadlock if we exhausted the mempool's
> +		 * reserve.
> +		 *
> +		 * We solve this, and guarantee forward progress, with a rescuer
> +		 * workqueue per bio_set. If we go to allocate and there are
> +		 * bios on current->bio_list, we first try the allocation
> +		 * without __GFP_WAIT; if that fails, we punt those bios we
> +		 * would be blocking to the rescuer workqueue before we retry
> +		 * with the original gfp_flags.
> +		 */
> +
> +		if (current->bio_list && !bio_list_empty(current->bio_list))
> +			gfp_mask &= ~__GFP_WAIT;
> +retry:
>  		p = mempool_alloc(bs->bio_pool, gfp_mask);
>  		front_pad = bs->front_pad;
>  		inline_vecs = BIO_INLINE_VECS;
>  	}

Wouldn't the following be better?

	p = mempool_alloc(bs->bi_pool, gfp_mask);
	if (unlikely(!p) && gfp_mask != saved_gfp) {
		punt_bios_to_rescuer(bs);
		p = mempool_alloc(bs->bi_pool, saved_gfp);
	}

I really hope the usage restriction (don't mix with mempool) for
bioset is clearly documented somewhere appropriate.

Thanks.
-- 
tejun
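The restriction Tejun is asking to have documented can be illustrated
with a hypothetical stacking-driver fragment (all names below are made
up). The rescuer only guarantees forward progress for allocations made
through the bioset itself; an independent per-bio mempool allocation
can still deadlock under generic_make_request(), because the bios whose
completion would refill that second pool are parked on
current->bio_list while we sleep:

	/* Safe: backed by the bioset's rescuer workqueue. */
	struct bio *clone = bio_alloc_bioset(GFP_NOIO, 0, my_bioset);

	/* Not rescuer-backed: if my_ctx_pool's reserve was exhausted by
	 * bios still sitting on current->bio_list, this can wait forever. */
	struct my_ctx *ctx = mempool_alloc(my_ctx_pool, GFP_NOIO);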