Re: [PATCH v5 5/9] block: Change the runtime power management approach (1/2)
On Wed, Aug 08, 2018 at 05:28:43PM +, Bart Van Assche wrote:
> On Wed, 2018-08-08 at 14:43 +0800, jianchao.wang wrote:
> > On 08/08/2018 02:11 PM, jianchao.wang wrote:
> > > Hi Bart
> > >
> > > On 08/08/2018 06:51 AM, Bart Van Assche wrote:
> > > > @@ -391,6 +393,9 @@ static struct request *blk_mq_get_request(struct request_queue *q,
> > > >  		}
> > > >  	}
> > > >  	data->hctx->queued++;
> > > > +
> > > > +	blk_pm_add_request(q, rq);
> > > > +
> > > >  	return rq;
> > > >  }
> > >
> > > The request_queue is in pm_only mode when suspended, so who can reach here to do the resume?
> >
> > I mean, in the original blk-legacy runtime PM implementation, any new I/O could trigger the resume. After your patch set, only a PM request can pass blk_queue_enter() while the queue is suspended and in pm-only mode. But if there is no resume, where does the PM request come from?
> >
> > blk_pm_add_request() should be added to blk_queue_enter(). It would look like the following:
> >   1. When normal I/O reaches blk_queue_enter() and the queue is suspended, it invokes blk_pm_add_request() to trigger the resume, then waits there for the pm-only mode to be cleared.
> >   2. The runtime PM core does the resume work and finally clears the pm-only mode.
> >   3. The task blocked in blk_queue_enter() is woken up and proceeds.
>
> Hello Jianchao,
>
> Some but not all blk_queue_enter() calls are related to request allocation so

The only one I remember is scsi_ioctl_reset(), in which scsi_autopm_get_host() is called before allocating this request; that means it is enough to bring the host up for handling RESET. Also, this request shouldn't have entered the block request queue. Or are there other cases in which request allocation isn't related to blk_queue_enter()?

thanks,
Ming
[PATCH] block: bvec_nr_vecs() returns value for wrong slab
In commit ed996a52c868 ("block: simplify and cleanup bvec pool handling"), the value of the slab index is incremented by one in bvec_alloc() after the allocation is done, to indicate that an index value of 0 does not need to be later freed.

bvec_nr_vecs() was not updated accordingly, and thus returns the wrong value. Decrement idx before performing the lookup.

Fixes: ed996a52c868 ("block: simplify and cleanup bvec pool handling")
Signed-off-by: Greg Edwards
---
 block/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 047c5dca6d90..ff94640bc734 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -156,7 +156,7 @@ static void bio_put_slab(struct bio_set *bs)
 
 unsigned int bvec_nr_vecs(unsigned short idx)
 {
-	return bvec_slabs[idx].nr_vecs;
+	return bvec_slabs[--idx].nr_vecs;
 }
 
 void bvec_free(mempool_t *pool, struct bio_vec *bv, unsigned int idx)
-- 
2.17.1
Re: [PATCH v5 6/9] block: Change the runtime power management approach (2/2)
On Wed, 2018-08-08 at 16:50 +0800, Ming Lei wrote:
> On Tue, Aug 07, 2018 at 03:51:30PM -0700, Bart Van Assche wrote:
> > Instead of allowing requests that are not power management requests
> > to enter the queue in runtime suspended status (RPM_SUSPENDED), make
> > the blk_get_request() caller block. This change fixes a starvation
>
> I don't see the related change which blocks blk_get_request() in
> this patchset.

I was referring to setting pm-only mode for suspended devices. Since blk_queue_enter() is called before a request is allocated, that is sufficient to make request allocation block.

> > diff --git a/block/blk-pm.c b/block/blk-pm.c
> > index bf8532da952d..d6b65cef9764 100644
> > --- a/block/blk-pm.c
> > +++ b/block/blk-pm.c
> > @@ -86,14 +86,39 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> >  	if (!q->dev)
> >  		return ret;
> >  
> > +	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
> > +
> > +	blk_set_pm_only(q);
> > +	/*
> > +	 * This function only gets called if the most recent
> > +	 * pm_request_resume() call occurred at least autosuspend_delay_ms
> > +	 * ago. Since blk_queue_enter() is called by the request allocation
> > +	 * code before pm_request_resume(), if q_usage_counter indicates that
> > +	 * no requests are in flight it is safe to suspend the device.
> > +	 */
> > +	ret = -EBUSY;
> > +	if (!percpu_ref_is_in_use(&q->q_usage_counter)) {
> > +		/*
> > +		 * Switch to preempt-only mode before calling
> > +		 * synchronize_rcu() such that later blk_queue_enter() calls
> > +		 * see the preempt-only state. See also
> > +		 * http://lwn.net/Articles/573497/.
> > +		 */
> > +		synchronize_rcu();
> > +		if (!percpu_ref_is_in_use(&q->q_usage_counter))
> > +			ret = 0;
> > +	}
> > +
>
> In blk_queue_enter(), V5 starts to allow all RQF_PREEMPT requests to
> enter the queue even though pm_only is set. That means any scsi_execute()
> may issue a new command to the host just after the above
> percpu_ref_is_in_use() returns 0, while the suspend is in progress.
>
> This case is illegal given RQF_PM is the only kind of request which can
> be issued to the hardware during suspend.

Right, that's something that needs to be addressed.

Bart.
Re: [PATCH v5 5/9] block: Change the runtime power management approach (1/2)
On Wed, 2018-08-08 at 14:43 +0800, jianchao.wang wrote:
> On 08/08/2018 02:11 PM, jianchao.wang wrote:
> > Hi Bart
> >
> > On 08/08/2018 06:51 AM, Bart Van Assche wrote:
> > > @@ -391,6 +393,9 @@ static struct request *blk_mq_get_request(struct request_queue *q,
> > >  		}
> > >  	}
> > >  	data->hctx->queued++;
> > > +
> > > +	blk_pm_add_request(q, rq);
> > > +
> > >  	return rq;
> > >  }
> >
> > The request_queue is in pm_only mode when suspended, so who can reach here to do the resume?
>
> I mean, in the original blk-legacy runtime PM implementation, any new I/O could trigger the resume. After your patch set, only a PM request can pass blk_queue_enter() while the queue is suspended and in pm-only mode. But if there is no resume, where does the PM request come from?
>
> blk_pm_add_request() should be added to blk_queue_enter(). It would look like the following:
>   1. When normal I/O reaches blk_queue_enter() and the queue is suspended, it invokes blk_pm_add_request() to trigger the resume, then waits there for the pm-only mode to be cleared.
>   2. The runtime PM core does the resume work and finally clears the pm-only mode.
>   3. The task blocked in blk_queue_enter() is woken up and proceeds.

Hello Jianchao,

Some but not all blk_queue_enter() calls are related to request allocation, so I'm not sure that that call should be added into blk_queue_enter(). The reason my tests passed is probably the scsi_autopm_get_device() calls in the sd and sr drivers. However, not all request submission code is protected by these calls. I will have a closer look at how to preserve the behavior that queueing a new request that is not protected by scsi_autopm_get_*() restores full-power mode.

Bart.
Re: [PATCH v5 1/9] block: Change the preempt-only flag into a counter
On Wed, 2018-08-08 at 16:21 +0800, Ming Lei wrote:
> On Tue, Aug 07, 2018 at 03:51:25PM -0700, Bart Van Assche wrote:
> > The RQF_PREEMPT flag is used for three purposes:
> > - In the SCSI core, for making sure that power management requests
> >   are executed if a device is in the "quiesced" state.
> > - For domain validation by SCSI drivers that use the parallel port.
> > - In the IDE driver, for IDE preempt requests.
>
> I think the above description may not be accurate. BLK_MQ_REQ_PREEMPT is
> always set inside scsi_execute(), which means any scsi_execute() caller
> can use the RQF_PREEMPT flag; it is of course not limited to the above
> three cases.

Hello Ming,

What I described is the cases in which we really need the RQF_PREEMPT flag, rather than the cases in which the RQF_PREEMPT flag is set today. I think we should split the RQF_PREEMPT flag into three flags, one flag per case mentioned above.

Bart.
Re: [PATCH v5 4/9] percpu-refcount: Introduce percpu_ref_is_in_use()
Hello,

On Tue, Aug 07, 2018 at 03:51:28PM -0700, Bart Van Assche wrote:
> Introduce a function that allows to determine whether a per-cpu refcount
> is in use. This function will be used in a later patch to determine
> whether or not any block layer requests are being executed.

I thought about it a bit and am having some difficulty convincing myself this is necessary. Switching a percpu_ref to atomic mode isn't expensive - it's one spinlock cycle, an RCU wait and one sweep of the percpu counters. The most expensive part - the percpu sweep - needs to be there even with the optimization, and the wait doesn't really matter: all it'll do is slightly delay the timer-based PM operation, and it can be overlapped with the propagation of set_pm_only() anyway.

So, how about just doing the simple thing? Switch it to atomic mode, check the counter, and switch back to percpu mode afterwards. If we see any issues with that, we can try to optimize it later, but that seems unlikely to me.

Thanks.

-- 
tejun
Re: [PATCH v4 2/2] block: Ensure that a request queue is dissociated from the cgroup controller
On 8/8/18 9:04 AM, Bart Van Assche wrote:
> On Wed, 2018-08-08 at 08:41 -0600, Jens Axboe wrote:
>> On 7/30/18 3:10 PM, Bart Van Assche wrote:
>>> +#ifdef CONFIG_BLK_CGROUP
>>> +	{
>>> +		struct blkcg_gq *blkg;
>>> +
>>> +		rcu_read_lock();
>>> +		blkg = blkg_lookup(&blkcg_root, q);
>>> +		rcu_read_unlock();
>>> +
>>> +		WARN(blkg,
>>> +		     "request queue %p is being released but it has not yet been removed from the blkcg controller\n",
>>> +		     q);
>>> +	}
>>> +#endif
>>
>> This last hunk should go in the cgroup code.
>
> Hello Jens,
>
> How about leaving out the #ifdef CONFIG_BLK_CGROUP / #endif? Would that be
> sufficient? I just noticed that blkg_lookup() returns NULL anyway if cgroup
> support is disabled in the kernel config.

That's even better.

-- 
Jens Axboe
Re: [PATCH v4 2/2] block: Ensure that a request queue is dissociated from the cgroup controller
On Wed, 2018-08-08 at 08:41 -0600, Jens Axboe wrote:
> On 7/30/18 3:10 PM, Bart Van Assche wrote:
> > +#ifdef CONFIG_BLK_CGROUP
> > +	{
> > +		struct blkcg_gq *blkg;
> > +
> > +		rcu_read_lock();
> > +		blkg = blkg_lookup(&blkcg_root, q);
> > +		rcu_read_unlock();
> > +
> > +		WARN(blkg,
> > +		     "request queue %p is being released but it has not yet been removed from the blkcg controller\n",
> > +		     q);
> > +	}
> > +#endif
>
> This last hunk should go in the cgroup code.

Hello Jens,

How about leaving out the #ifdef CONFIG_BLK_CGROUP / #endif? Would that be sufficient? I just noticed that blkg_lookup() returns NULL anyway if cgroup support is disabled in the kernel config.

Thanks,

Bart.
Re: [PATCH v4 2/2] block: Ensure that a request queue is dissociated from the cgroup controller
On 7/30/18 3:10 PM, Bart Van Assche wrote:
> Several block drivers call alloc_disk() followed by put_disk() if
> something fails before device_add_disk() is called without calling
> blk_cleanup_queue(). Make sure that also for this scenario a request
> queue is dissociated from the cgroup controller. This patch avoids
> that loading the parport_pc, paride and pf drivers triggers the
> following kernel crash:
>
> BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
> Read of size 4 at addr 0008 by task modprobe/744
> Call Trace:
>  dump_stack+0x9a/0xeb
>  kasan_report+0x139/0x350
>  pi_init+0x42e/0x580 [paride]
>  pf_init+0x2bb/0x1000 [pf]
>  do_one_initcall+0x8e/0x405
>  do_init_module+0xd9/0x2f2
>  load_module+0x3ab4/0x4700
>  SYSC_finit_module+0x176/0x1a0
>  do_syscall_64+0xee/0x2b0
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> Reported-by: Alexandru Moise <00moses.alexande...@gmail.com>
> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller") # v4.17
> Signed-off-by: Bart Van Assche
> Tested-by: Alexandru Moise <00moses.alexande...@gmail.com>
> Cc: Tejun Heo
> Cc: Christoph Hellwig
> Cc: Ming Lei
> Cc: Johannes Thumshirn
> Cc: Alexandru Moise <00moses.alexande...@gmail.com>
> Cc: Joseph Qi
> Cc: 
> ---
>  block/blk-sysfs.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index ca1984ecbdeb..26275d9babcb 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -802,6 +802,31 @@ static void __blk_release_queue(struct work_struct *work)
>  		blk_stat_remove_callback(q, q->poll_cb);
>  	blk_stat_free_callback(q->poll_cb);
>  
> +	if (!blk_queue_dead(q)) {
> +		/*
> +		 * Last reference was dropped without having called
> +		 * blk_cleanup_queue().
> +		 */
> +		WARN_ONCE(blk_queue_init_done(q),
> +			  "request queue %p has been registered but blk_cleanup_queue() has not been called for that queue\n",
> +			  q);
> +		blk_exit_queue(q);
> +	}
> +
> +#ifdef CONFIG_BLK_CGROUP
> +	{
> +		struct blkcg_gq *blkg;
> +
> +		rcu_read_lock();
> +		blkg = blkg_lookup(&blkcg_root, q);
> +		rcu_read_unlock();
> +
> +		WARN(blkg,
> +		     "request queue %p is being released but it has not yet been removed from the blkcg controller\n",
> +		     q);
> +	}
> +#endif

This last hunk should go in the cgroup code.

-- 
Jens Axboe
Re: [PATCH v5 6/9] block: Change the runtime power management approach (2/2)
On Tue, Aug 07, 2018 at 03:51:30PM -0700, Bart Van Assche wrote:
> Instead of allowing requests that are not power management requests
> to enter the queue in runtime suspended status (RPM_SUSPENDED), make
> the blk_get_request() caller block. This change fixes a starvation

I don't see the related change which blocks blk_get_request() in this patchset.

BTW, blk_pm_add_request() won't block since it uses the async version of runtime resume.

> issue: it is now guaranteed that power management requests will be
> executed no matter how many blk_get_request() callers are waiting.
> Instead of maintaining the q->nr_pending counter, rely on
> q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
> request finishes instead of only if the queue depth drops to zero.
>
> Signed-off-by: Bart Van Assche
> Cc: Christoph Hellwig
> Cc: Ming Lei
> Cc: Jianchao Wang
> Cc: Hannes Reinecke
> Cc: Johannes Thumshirn
> Cc: Alan Stern
> ---
>  block/blk-core.c       | 37 -
>  block/blk-mq-debugfs.c |  1 -
>  block/blk-pm.c         | 40 +++-
>  block/blk-pm.h         |  6 ++
>  include/linux/blkdev.h |  1 -
>  5 files changed, 45 insertions(+), 40 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 179a13be0fca..a2ef253edfbd 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2737,30 +2737,6 @@ void blk_account_io_done(struct request *req, u64 now)
>  	}
>  }
>  
> -#ifdef CONFIG_PM
> -/*
> - * Don't process normal requests when queue is suspended
> - * or in the process of suspending/resuming
> - */
> -static bool blk_pm_allow_request(struct request *rq)
> -{
> -	switch (rq->q->rpm_status) {
> -	case RPM_RESUMING:
> -	case RPM_SUSPENDING:
> -		return rq->rq_flags & RQF_PM;
> -	case RPM_SUSPENDED:
> -		return false;
> -	default:
> -		return true;
> -	}
> -}
> -#else
> -static bool blk_pm_allow_request(struct request *rq)
> -{
> -	return true;
> -}
> -#endif
> -
>  void blk_account_io_start(struct request *rq, bool new_io)
>  {
>  	struct hd_struct *part;
> @@ -2806,11 +2782,14 @@ static struct request *elv_next_request(struct request_queue *q)
>  
>  	while (1) {
>  		list_for_each_entry(rq, &q->queue_head, queuelist) {
> -			if (blk_pm_allow_request(rq))
> -				return rq;
> -
> -			if (rq->rq_flags & RQF_SOFTBARRIER)
> -				break;
> +#ifdef CONFIG_PM
> +			/*
> +			 * If a request gets queued in state RPM_SUSPENDED
> +			 * then that's a kernel bug.
> +			 */
> +			WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
> +#endif
> +			return rq;
>  		}
>  
>  		/*
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index a5ea86835fcb..7d74d53dc098 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -332,7 +332,6 @@ static const char *const rqf_name[] = {
>  	RQF_NAME(ELVPRIV),
>  	RQF_NAME(IO_STAT),
>  	RQF_NAME(ALLOCED),
> -	RQF_NAME(PM),
>  	RQF_NAME(HASHED),
>  	RQF_NAME(STATS),
>  	RQF_NAME(SPECIAL_PAYLOAD),
> diff --git a/block/blk-pm.c b/block/blk-pm.c
> index bf8532da952d..d6b65cef9764 100644
> --- a/block/blk-pm.c
> +++ b/block/blk-pm.c
> @@ -86,14 +86,39 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>  	if (!q->dev)
>  		return ret;
>  
> +	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
> +
> +	blk_set_pm_only(q);
> +	/*
> +	 * This function only gets called if the most recent
> +	 * pm_request_resume() call occurred at least autosuspend_delay_ms
> +	 * ago. Since blk_queue_enter() is called by the request allocation
> +	 * code before pm_request_resume(), if q_usage_counter indicates that
> +	 * no requests are in flight it is safe to suspend the device.
> +	 */
> +	ret = -EBUSY;
> +	if (!percpu_ref_is_in_use(&q->q_usage_counter)) {
> +		/*
> +		 * Switch to preempt-only mode before calling
> +		 * synchronize_rcu() such that later blk_queue_enter() calls
> +		 * see the preempt-only state. See also
> +		 * http://lwn.net/Articles/573497/.
> +		 */
> +		synchronize_rcu();
> +		if (!percpu_ref_is_in_use(&q->q_usage_counter))
> +			ret = 0;
> +	}
> +

In blk_queue_enter(), V5 starts to allow all RQF_PREEMPT requests to enter the queue even though pm_only is set. That means any scsi_execute() may issue a new command to the host just after the above percpu_ref_is_in_use() returns 0, while the suspend is in progress.

This case is illegal given RQF_PM is the only kind of request which can be issued to the hardware during suspend.

Thanks,
Ming
Re: [PATCH v5 1/9] block: Change the preempt-only flag into a counter
On Tue, Aug 07, 2018 at 03:51:25PM -0700, Bart Van Assche wrote:
> The RQF_PREEMPT flag is used for three purposes:
> - In the SCSI core, for making sure that power management requests
>   are executed if a device is in the "quiesced" state.
> - For domain validation by SCSI drivers that use the parallel port.
> - In the IDE driver, for IDE preempt requests.

I think the above description may not be accurate. BLK_MQ_REQ_PREEMPT is always set inside scsi_execute(), which means any scsi_execute() caller can use the RQF_PREEMPT flag; it is of course not limited to the above three cases.

Thanks,
Ming
Re: [PATCH v4 2/2] block: Ensure that a request queue is dissociated from the cgroup controller
From my limited insight into this:

Looks good,
Reviewed-by: Johannes Thumshirn

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH v4 1/2] block: Introduce blk_exit_queue()
Looks good,
Reviewed-by: Johannes Thumshirn

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH v5 5/9] block: Change the runtime power management approach (1/2)
On 08/08/2018 02:11 PM, jianchao.wang wrote:
> Hi Bart
>
> On 08/08/2018 06:51 AM, Bart Van Assche wrote:
>> @@ -391,6 +393,9 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>>  		}
>>  	}
>>  	data->hctx->queued++;
>> +
>> +	blk_pm_add_request(q, rq);
>> +
>>  	return rq;
>>  }
>
> The request_queue is in pm_only mode when suspended, so who can reach here to do the resume?

I mean, in the original blk-legacy runtime PM implementation, any new I/O could trigger the resume. After your patch set, only a PM request can pass blk_queue_enter() while the queue is suspended and in pm-only mode. But if there is no resume, where does the PM request come from?

blk_pm_add_request() should be added to blk_queue_enter(). It would look like the following:
  1. When normal I/O reaches blk_queue_enter() and the queue is suspended, it invokes blk_pm_add_request() to trigger the resume, then waits there for the pm-only mode to be cleared.
  2. The runtime PM core does the resume work and finally clears the pm-only mode.
  3. The task blocked in blk_queue_enter() is woken up and proceeds.

Thanks
Jianchao
Re: [PATCH v5 5/9] block: Change the runtime power management approach (1/2)
Hi Bart

On 08/08/2018 06:51 AM, Bart Van Assche wrote:
> @@ -391,6 +393,9 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>  		}
>  	}
>  	data->hctx->queued++;
> +
> +	blk_pm_add_request(q, rq);
> +
>  	return rq;
>  }

The request_queue is in pm_only mode when suspended, so who can reach here to do the resume?

Thanks
Jianchao