Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()

2017-03-02 Thread Ming Lei
On Fri, Mar 3, 2017 at 10:20 AM, Ming Lei  wrote:
> On Thu, Mar 2, 2017 at 3:52 PM, Shaohua Li  wrote:
>> On Thu, Mar 02, 2017 at 10:34:25AM +0800, Ming Lei wrote:
>>> Hi Shaohua,
>>>
>>> On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li  wrote:
>>> > On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
>>> >> Use this helper, instead of direct access to .bi_vcnt.
>>> >
>>> > What we really need to do for the behind IO is:
>>> > - allocate memory and copy bio data to the memory
>>> > - let behind bio do IO against the memory
>>> >
>>> > The behind bio doesn't need to have exactly the same bio_vec setting. If
>>> > we
>>> > just track the new memory, we don't need to use bio_segments_all() or
>>> > access
>>> > bio_vec either.
>>>
>>> But we need to figure out how many vecs (each vec stores one page) need to
>>> be allocated for the cloned/behind bio, and that is the only value of
>>> bio_segments_all() here. Or do you have an idea to avoid that?
>>
>> As I said, the behind bio doesn't need to have exactly the same bio_vec
>> setting. We just allocate memory and copy the original bio data to the
>> memory, then do IO against the new memory. The behind bio
>> segments == (bio->bi_iter.bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT
>
> The equation isn't always correct, especially when a bvec covers just
> part of a page; that is quite common with mkfs, where one bvec often
> holds a 512-byte buffer.

Thinking about it further, your idea could be workable and cleaner, but the
change may be a bit big; it looks like we need to switch the write-behind
handling to the following approach:

1) replace bio_clone_bioset_partial() with bio_allocate(nr_vecs), where
'nr_vecs' is computed from your equation;

2) allocate 'nr_vecs' pages once and share them among all bios created in 1)

3) for each created bio, add each page into the bio via bio_add_page()

4) only for the 1st created bio, call bio_copy_data() to copy data from the
master bio (see the sketch below).
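
A rough sketch of those four steps, assuming a hypothetical
alloc_behind_master_bio() helper (the name, the single-bio simplification,
and the error handling are illustrative only, not the final code):

static struct bio *alloc_behind_master_bio(struct bio *master, gfp_t gfp)
{
        int i, nr_vecs = DIV_ROUND_UP(master->bi_iter.bi_size, PAGE_SIZE);
        struct bio *behind = bio_alloc(gfp, nr_vecs);           /* step 1) */

        if (!behind)
                return NULL;

        for (i = 0; i < nr_vecs; i++) {                         /* steps 2) + 3) */
                struct page *page = alloc_page(gfp);

                if (!page || !bio_add_page(behind, page, PAGE_SIZE, 0)) {
                        if (page)
                                __free_page(page);
                        bio_free_pages(behind);
                        bio_put(behind);
                        return NULL;
                }
        }

        /* trim to the master's size, then copy the data once; further
         * behind bios would reuse the same pages via get_page() */
        behind->bi_iter.bi_size = master->bi_iter.bi_size;
        bio_copy_data(behind, master);                          /* step 4) */
        return behind;
}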

Let me know if you are OK with the above implementation.


Thanks,
Ming Lei


[PATCH] block: use put_io_context_active() to disassociate bio from a task

2017-03-02 Thread Hou Tao
bio_associate_current() invokes get_io_context_active() to tell the
CFQ scheduler that the current io_context is still issuing IOs,
by increasing active_ref. When the bio is done, we also need
to invoke put_io_context_active() to decrease active_ref; otherwise
the associated io_context will stay active forever.

Signed-off-by: Hou Tao 
---
 block/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 5eec5e0..d8ed36f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2072,7 +2072,7 @@ EXPORT_SYMBOL_GPL(bio_associate_current);
 void bio_disassociate_task(struct bio *bio)
 {
if (bio->bi_ioc) {
-   put_io_context(bio->bi_ioc);
+   put_io_context_active(bio->bi_ioc);
bio->bi_ioc = NULL;
}
if (bio->bi_css) {
-- 
2.5.0



Re: [PATCH v2 06/13] md: raid1: don't use bio's vec table to manage resync pages

2017-03-02 Thread Ming Lei
On Fri, Mar 3, 2017 at 1:48 AM, Shaohua Li  wrote:
> On Thu, Mar 02, 2017 at 10:25:10AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:37 AM, Shaohua Li  wrote:
>> > On Tue, Feb 28, 2017 at 11:41:36PM +0800, Ming Lei wrote:
>> >> Now we allocate one page array for managing resync pages, instead
>> >> of using the bio's vec table to do that; the old way is very hacky
>> >> and won't work any more once multipage bvec is enabled.
>> >>
>> >> The introduced cost is that we need to allocate (128 + 16) * raid_disks
>> >> bytes per r1_bio, and that is fine because the number of inflight r1_bios
>> >> for resync shouldn't be large, as pointed out by Shaohua.
>> >>
>> >> Also the bio_reset() in raid1_sync_request() is removed because
>> >> all bios are now freshly allocated and no longer need to be reset.
>> >>
>> >> This patch can be thought of as a cleanup too.
>> >>
>> >> Suggested-by: Shaohua Li 
>> >> Signed-off-by: Ming Lei 
>> >> ---
>> >>  drivers/md/raid1.c | 83 
>> >> ++
>> >>  1 file changed, 53 insertions(+), 30 deletions(-)
>> >>
>> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> >> index c442b4657e2f..900144f39630 100644
>> >> --- a/drivers/md/raid1.c
>> >> +++ b/drivers/md/raid1.c
>> >> @@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, 
>> >> sector_t sector_nr);
>> >>  #define raid1_log(md, fmt, args...)  \
>> >>   do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, 
>> >> ##args); } while (0)
>> >>
>> >> +static inline struct resync_pages *get_resync_pages(struct bio *bio)
>> >> +{
>> >> + return bio->bi_private;
>> >> +}
>> >> +
>> >> +static inline struct r1bio *get_resync_r1bio(struct bio *bio)
>> >> +{
>> >> + return get_resync_pages(bio)->raid_bio;
>> >> +}
>> >
>> > This is a weird relation between bio, r1bio and the resync_pages. I'd like
>> > the pages to be
>>
>> It is only a bit weird inside allocating and freeing r1bio; once all
>> are allocated, you
>> can see everything is clean and simple:
>>
>> - r1bio includes lots of bios,
>> - and one bio is attached to one resync_pages via .bi_private
>
> I don't know how complex it is to let r1bio point to the pages, but that's the
> natural way. r1bio owns the pages, not the other way around, so we should let
> r1bio point to the pages. The bio.bi_private still points to r1bio.

Actually it is the bio which owns the pages for doing its own I/O; the only
thing relating the pages to the r10bio is that bios may share them, and using
the page-refcount trick makes that relation quite implicit.

The only reason to allocate all resync_pages together is efficiency: it
avoids allocating one resync_pages instance at a time for each bio.

We have to make .bi_private point to a per-bio resync_pages, otherwise we
can't fetch pages into a bio at all: the index into each bio's pre-allocated
pages has to be stored somewhere, and it has to be per bio.
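
A minimal sketch of that wiring, using the helpers from this patchset
(allocation and error handling elided):

        /* in r1buf_pool_alloc(): one resync_pages per resync bio */
        rp->idx = 0;                    /* per-bio index into rp->pages[] */
        rp->raid_bio = r1_bio;          /* resync_pages -> r1bio */
        bio->bi_private = rp;           /* bio -> its resync_pages */

        /* later, anywhere in the I/O path: */
        struct r1bio *r1b = get_resync_r1bio(bio);  /* ->bi_private->raid_bio */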

>
>> > embedded in r1bio. Maybe a pointer from r1bio to the pages. That's cleaner
>> > and more
>> > straightforward.
>>
>> In theory, the cleanest way is to allocate one resync_pages for each resync
>> bio,
>> but that isn't efficient. That is why this patch allocates all
>> resync_pages together
>> in r1buf_pool_alloc(), and splits them among the bios.
>>
>> BTW, the only trick is that the whole page array is stored in
>> .bi_private
>> of the 1st bio, so we can avoid adding one pointer to r1bio.
>
> You will need to add a pointer in resync_pages which points to r1bio

Yeah, that is just what this patchset is doing.

>
>> >
>> > I think patches 6, 7 and 8 should be in the same patch. Otherwise bisect
>> > will be broken.
>> >
>>
>> No, it won't. Both patch 7 and patch 8 just replace reading from the bvec
>> table
>> with reading from the pre-allocated page array.  Both should be fine, but the
>> latter is cleaner and simpler to do.
>
> Ok, sounds good, though I doubt it's really worth it.

A small patch with one purpose is always easier to review: do one thing,
and do it well,
:-)


Thanks,
Ming Lei


Re: [PATCH v2 13/13] md: raid10: avoid direct access to bvec table in handle_reshape_read_error

2017-03-02 Thread Ming Lei
On Thu, Mar 2, 2017 at 3:47 PM, Shaohua Li  wrote:
> On Thu, Mar 02, 2017 at 10:37:49AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:46 AM, Shaohua Li  wrote:
>> > On Tue, Feb 28, 2017 at 11:41:43PM +0800, Ming Lei wrote:
>> >> The cost is 128 bytes (8*16) of stack space in kernel thread context, and
>> >> we just use the bio helper to retrieve pages from the bio.
>> >>
>> >> Signed-off-by: Ming Lei 
>> >> ---
>> >>  drivers/md/raid10.c | 12 ++--
>> >>  1 file changed, 10 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> >> index 0b97631e3905..6ffb64ab45f8 100644
>> >> --- a/drivers/md/raid10.c
>> >> +++ b/drivers/md/raid10.c
>> >> @@ -4670,7 +4670,15 @@ static int handle_reshape_read_error(struct mddev 
>> >> *mddev,
>> >>   struct r10bio *r10b = &on_stack.r10_bio;
>> >>   int slot = 0;
>> >>   int idx = 0;
>> >> - struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
>> >> + struct bio_vec *bvl;
>> >> + struct page *pages[RESYNC_PAGES];
>> >> +
>> >> + /*
>> >> +  * This bio is allocated in reshape_request(), and size
>> >> +  * is still RESYNC_PAGES
>> >> +  */
>> >> + bio_for_each_segment_all(bvl, r10_bio->master_bio, idx)
>> >> + pages[idx] = bvl->bv_page;
>> >
>> > The reshape bio is doing IO against the memory we allocated for r10_bio, 
>> > I'm
>> > wondering why we can't get the pages from r10_bio. In this way, we don't 
>> > need
>> > access the bio_vec any more.
>>
>> Reshape read is special, and the bio (r10_bio->master_bio) isn't allocated
>> from .r10buf_pool; please see reshape_request().
>
> Why does this matter? The bio is still doing IO to the memory allocated in
> r10_bio: bi_private -> r10_bio -> the pages.

Good question, :-)

>
> My guess why you think the reshape read is special is that bio->bi_private
> doesn't point to resync_pages. And since you let resync_pages point to
> r10_bio and not vice versa, we can't find r10_bio, correct? I think this is
> another reason why r10_bio should embed resync_pages (I mean a pointer to
> resync_pages). With the way I suggested, the reshape read isn't special at

We can still get the resync_pages for r10_bio->master_bio via
r10_bio->devs[0].bio in handle_reshape_read_error(); I will do that in v3.
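
Roughly like this (a sketch of the v3 idea, assuming raid10 gains the same
get_resync_pages() helper that this series adds to raid1):

        /* in handle_reshape_read_error(): the reshape master bio carries no
         * resync_pages of its own, but the first copy's bio does */
        struct resync_pages *rp = get_resync_pages(r10_bio->devs[0].bio);
        struct page **pages = rp->pages;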

> all. If we allocate memory for r10_bio/r1_bio, we shouldn't need to access
> any bio
> bvec.

Right.



Thanks,
Ming Lei


Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()

2017-03-02 Thread Ming Lei
On Thu, Mar 2, 2017 at 3:52 PM, Shaohua Li  wrote:
> On Thu, Mar 02, 2017 at 10:34:25AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li  wrote:
>> > On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
>> >> Use this helper, instead of direct access to .bi_vcnt.
>> >
>> > What we really need to do for the behind IO is:
>> > - allocate memory and copy bio data to the memory
>> > - let behind bio do IO against the memory
>> >
>> > The behind bio doesn't need to have exactly the same bio_vec setting. If we
>> > just track the new memory, we don't need to use bio_segments_all() or
>> > access
>> > bio_vec either.
>>
>> But we need to figure out how many vecs (each vec stores one page) need to
>> be allocated for the cloned/behind bio, and that is the only value of
>> bio_segments_all() here. Or do you have an idea to avoid that?
>
> As I said, the behind bio doesn't need to have exactly the same bio_vec
> setting. We just allocate memory and copy the original bio data to the
> memory, then do IO against the new memory. The behind bio
> segments == (bio->bi_iter.bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT

The equation isn't always correct, especially when a bvec covers just
part of a page; that is quite common with mkfs, where one bvec often
holds a 512-byte buffer.
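
A concrete illustration of the mismatch (numbers assume 4K pages):

        /*
         * A mkfs-style bio carrying eight 512-byte buffers:
         *
         *      bio->bi_vcnt         = 8         eight partial-page bvecs
         *      bio->bi_iter.bi_size = 8 * 512   = 4096 bytes
         *
         * (4096 + PAGE_SIZE - 1) >> PAGE_SHIFT = 1 page for the copy,
         * while bio_segments_all() reports 8 segments.
         */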


Thanks,
Ming Lei


Re: [PATCH v2 04/13] md: prepare for managing resync I/O pages in clean way

2017-03-02 Thread Ming Lei
On Fri, Mar 3, 2017 at 1:55 AM, Shaohua Li  wrote:
> On Thu, Mar 02, 2017 at 10:09:14AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:30 AM, Shaohua Li  wrote:
>> > On Tue, Feb 28, 2017 at 11:41:34PM +0800, Ming Lei wrote:
>> >> Now resync I/O uses the bio's bvec table to manage pages;
>> >> this is very hacky, and may not work any more
>> >> once multipage bvec is introduced.
>> >>
>> >> So introduce helpers and a new data structure for
>> >> managing resync I/O pages more cleanly.
>> >>
>> >> Signed-off-by: Ming Lei 
>> >> ---
>> >>  drivers/md/md.h | 54 
>> >> ++
>> >>  1 file changed, 54 insertions(+)
>> >>
>> >> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> >> index 1d63239a1be4..b5a638d85cb4 100644
>> >> --- a/drivers/md/md.h
>> >> +++ b/drivers/md/md.h
>> >> @@ -720,4 +720,58 @@ static inline void mddev_check_writesame(struct 
>> >> mddev *mddev, struct bio *bio)
>> >>  #define RESYNC_BLOCK_SIZE (64*1024)
>> >>  #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
>> >>
>> >> +/* for managing resync I/O pages */
>> >> +struct resync_pages {
>> >> + unsigned        idx;    /* for get/put page from the pool */
>> >> + void            *raid_bio;
>> >> + struct page *pages[RESYNC_PAGES];
>> >> +};
>> >
>> > I'd like this to be embedded into r1bio directly. Not sure if we really 
>> > need a
>> > structure.
>>
>> There are two reasons we can't put this into r1bio:
>> - r1bio is used in both normal and resync I/O; it isn't fair to allocate more
>> for normal I/O, and that is why this patch doesn't touch r1bio or
>> r10bio
>>
>> - the count of 'struct resync_pages' instances depends on raid_disks (raid1)
>> or copies (raid10), neither of which can be decided at compile time.
>
> Yes, I said in other emails that it should be a pointer in r1bio pointing
> to the pages.

OK, got it, :-)

Actually I did that in my local tree, and finally I figured out that it isn't
necessary to introduce this pointer, whose only use is for freeing;
we can get the
pointer from .bi_private of the 1st bio. And this can be commented
clearly.
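
Something like the following sketch of the freeing path (the bios[] access
and the disk count are illustrative):

        /* the whole resync_pages array hangs off the 1st bio's ->bi_private */
        struct resync_pages *rps = get_resync_pages(r1_bio->bios[0]);
        int i;

        for (i = 0; i < conf->raid_disks; i++)
                resync_free_pages(&rps[i]);   /* drop each bio's page refs */
        kfree(rps);                           /* the array from r1buf_pool_alloc() */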

>
>> >
>> >> +}
>> >> +
>> >> +static inline bool resync_page_available(struct resync_pages *rp)
>> >> +{
>> >> + return rp->idx < RESYNC_PAGES;
>> >> +}
>> >
>> > Then we don't need this obscure API.
>>
>> That is fine.
>>
>>
>> Thanks,
>> Ming Lei
>
>> >
>> >> +
>> >> +static inline int resync_alloc_pages(struct resync_pages *rp,
>> >> +  gfp_t gfp_flags)
>> >> +{
>> >> + int i;
>> >> +
>> >> + for (i = 0; i < RESYNC_PAGES; i++) {
>> >> + rp->pages[i] = alloc_page(gfp_flags);
>> >> + if (!rp->pages[i])
>> >> + goto out_free;
>> >> + }
>> >> +
>> >> + return 0;
>> >> +
>> >> + out_free:
>> >> + while (--i >= 0)
>> >> + __free_page(rp->pages[i]);
>> >> + return -ENOMEM;
>> >> +}
>> >> +
>> >> +static inline void resync_free_pages(struct resync_pages *rp)
>> >> +{
>> >> + int i;
>> >> +
>> >> + for (i = 0; i < RESYNC_PAGES; i++)
>> >> + __free_page(rp->pages[i]);
>> >
>> > Since we will use get_page, shouldn't this be put_page?
>>
>> You are right, will fix in v3.
>>
>> >
>> >> +}
>> >> +
>> >> +static inline void resync_get_all_pages(struct resync_pages *rp)
>> >> +{
>> >> + int i;
>> >> +
>> >> + for (i = 0; i < RESYNC_PAGES; i++)
>> >> + get_page(rp->pages[i]);
>> >> +}
>> >> +
>> >> +static inline struct page *resync_fetch_page(struct resync_pages *rp)
>> >> +{
>> >> + if (WARN_ON_ONCE(rp->idx >= RESYNC_PAGES))
>> >> + return NULL;
>> >> + return rp->pages[rp->idx++];
>> >
>> > I'd like the caller explicitly specify the index instead of a global 
>> > variable
>> > to track it, which will make the code more understandable and less error 
>> > prone.
>>
>> That is fine, but the index has to be per bio, and finally the index
>> has to be stored
>> in 'struct resync_pages', so every user has to call it in the following way:
>>
>>   resync_fetch_page(rp, rp->idx);
>>
>> so there looks to be no benefit in passing the index explicitly.
>
> I suppose the callers always iterate through page 0 to RESYNC_PAGES - 1. Then
> the callers can use an index by themselves. That will clearly show which page
> the callers are using. The resync_fetch_page() wrapper is a black box for
> which we always need to refer to the definition.

OK, understood.
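
For reference, a minimal sketch of that explicit-index style (the loop body
is elided):

        int i;

        for (i = 0; i < RESYNC_PAGES; i++) {
                struct page *page = rp->pages[i]; /* index visible at call site */

                /* ... add 'page' to the bio, fill or read it, etc. ... */
        }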


Thanks,
Ming Lei


Re: [PATCH v2 05/13] md: raid1: simplify r1buf_pool_free()

2017-03-02 Thread Ming Lei
On Fri, Mar 3, 2017 at 1:49 AM, Shaohua Li  wrote:
> On Thu, Mar 02, 2017 at 10:11:54AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:31 AM, Shaohua Li  wrote:
>> > On Tue, Feb 28, 2017 at 11:41:35PM +0800, Ming Lei wrote:
>> >> This patch takes a reference on each page of each bio for resync,
>> >> so r1buf_pool_free() gets simplified a lot.
>> >>
>> >> The same policy has been taken in raid10's buf pool allocation/free
>> >> too.
>> >
>> > We are going to delete the code, so this simplification isn't really required.
>>
>> Could you explain it a bit? Or I can put this patch into this patchset if you
>> can provide one?
>
> I mean you will replace the code soon, with the resync_pages stuff. So I
> thought this isn't really required.

OK, one benefit of this patch is to make the following one easier to review,
but it's not a big deal; I may merge this into the following patch.


Thanks,
Ming Lei


Re: [PATCH v2 13/13] md: raid10: avoid direct access to bvec table in handle_reshape_read_error

2017-03-02 Thread Shaohua Li
On Thu, Mar 02, 2017 at 10:37:49AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:46 AM, Shaohua Li  wrote:
> > On Tue, Feb 28, 2017 at 11:41:43PM +0800, Ming Lei wrote:
> >> The cost is 128 bytes (8*16) of stack space in kernel thread context, and
> >> we just use the bio helper to retrieve pages from the bio.
> >>
> >> Signed-off-by: Ming Lei 
> >> ---
> >>  drivers/md/raid10.c | 12 ++--
> >>  1 file changed, 10 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >> index 0b97631e3905..6ffb64ab45f8 100644
> >> --- a/drivers/md/raid10.c
> >> +++ b/drivers/md/raid10.c
> >> @@ -4670,7 +4670,15 @@ static int handle_reshape_read_error(struct mddev 
> >> *mddev,
> >>   struct r10bio *r10b = &on_stack.r10_bio;
> >>   int slot = 0;
> >>   int idx = 0;
> >> - struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
> >> + struct bio_vec *bvl;
> >> + struct page *pages[RESYNC_PAGES];
> >> +
> >> + /*
> >> +  * This bio is allocated in reshape_request(), and size
> >> +  * is still RESYNC_PAGES
> >> +  */
> >> + bio_for_each_segment_all(bvl, r10_bio->master_bio, idx)
> >> + pages[idx] = bvl->bv_page;
> >
> > The reshape bio is doing IO against the memory we allocated for r10_bio, I'm
> > wondering why we can't get the pages from r10_bio. In this way, we don't 
> > need
> > access the bio_vec any more.
> 
> Reshape read is special, and the bio (r10_bio->master_bio) isn't allocated
> from .r10buf_pool; please see reshape_request().

Why does this matter? The bio is still doing IO to the memory allocated in
r10_bio: bi_private -> r10_bio -> the pages.

My guess why you think the reshape read is special is that bio->bi_private
doesn't point to resync_pages. And since you let resync_pages point to
r10_bio and not vice versa, we can't find r10_bio, correct? I think this is
another reason why r10_bio should embed resync_pages (I mean a pointer to
resync_pages). With the way I suggested, the reshape read isn't special at
all. If we allocate memory for r10_bio/r1_bio, we shouldn't need to access
any bio bvec.

Thanks,
Shaohua


Re: [PATCH v2] blkcg: allocate struct blkcg_gq outside request queue spinlock

2017-03-02 Thread Tahsin Erdogan
On Thu, Mar 2, 2017 at 11:32 AM, Tejun Heo  wrote:
> Hello, Tahsin.
>
> On Wed, Mar 01, 2017 at 03:43:19PM -0800, Tahsin Erdogan wrote:
>> @@ -258,18 +258,22 @@ static struct blkcg_gq *blkg_create(struct blkcg 
>> *blkcg,
>>  struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>> - struct request_queue *q)
>> + struct request_queue *q, bool wait_ok)
>
> I'm okay with this direction but it probably would be better if the
> parameter is gfp_mask and we branch on __GFP_WAIT in the function.

Agreed, gfp mask is better as a parameter.

>
>>  {
>>   struct blkcg_gq *blkg;
>>
>> @@ -300,7 +304,30 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>>   parent = blkcg_parent(parent);
>>   }
>>
>> - blkg = blkg_create(pos, q, NULL);
>> + if (wait_ok) {
>> + struct blkcg_gq *new_blkg;
>> +
>> + spin_unlock_irq(q->queue_lock);
>> + rcu_read_unlock();
>> +
>> + new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
>> +
>> + rcu_read_lock();
>> + spin_lock_irq(q->queue_lock);
>> +
>> + if (unlikely(!new_blkg))
>> + return ERR_PTR(-ENOMEM);
>> +
>> + if (unlikely(blk_queue_bypass(q))) {
>> + blkg_free(new_blkg);
>> + return ERR_PTR(blk_queue_dying(q) ?
>> + -ENODEV : -EBUSY);
>> + }
>> +
>> + blkg = blkg_create(pos, q, new_blkg);
>> + } else
>> + blkg = blkg_create(pos, q, NULL);
>
> So, while I'm okay with the approach, now we're creating a hybrid
> approach where we have both pre-allocation and allocation mode
> altering mechanisms.  If we're going to take this route, I think the
> right thing to do is passing down @gfp_mask all the way down to
> blkg_create().
>
>> @@ -789,6 +816,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
>> blkcg_policy *pol,
>>  {
>>   struct gendisk *disk;
>>   struct blkcg_gq *blkg;
>> + struct request_queue *q;
>>   struct module *owner;
>>   unsigned int major, minor;
>>   int key_len, part, ret;
>> @@ -812,18 +840,27 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
>> blkcg_policy *pol,
>>   return -ENODEV;
>>   }
>>
>> + q = disk->queue;
>> +
>>   rcu_read_lock();
>> - spin_lock_irq(disk->queue->queue_lock);
>> + spin_lock_irq(q->queue_lock);
>>
>> - if (blkcg_policy_enabled(disk->queue, pol))
>> - blkg = blkg_lookup_create(blkcg, disk->queue);
>> - else
>> + if (blkcg_policy_enabled(q, pol)) {
>> + blkg = blkg_lookup_create(blkcg, q, true /* wait_ok */);
>> +
>> + /*
>> +  * blkg_lookup_create() may have dropped and reacquired the
>> +  * queue lock. Check policy enabled state again.
>> +  */
>> + if (!IS_ERR(blkg) && unlikely(!blkcg_policy_enabled(q, pol)))
>> + blkg = ERR_PTR(-EOPNOTSUPP);
>
> And let blkg_create() verify these conditions after releasing and
> regrabbing the lock.
>
> This also means that the init path can simply pass in GFP_KERNEL.

I tried that approach, but I encountered two issues that complicate things:

1) Pushing down the blk_queue_bypass(q) check into blkg_create() doesn't
quite work, because when blkcg_init_queue() calls blkg_create(), the
queue is still in bypass mode.

2) Pushing down blkcg_policy_enabled() doesn't work well either,
because blkcg_init_queue() doesn't have a policy to pass down. We
could let it pass a NULL parameter, but that would make blkg_create()
uglier.


Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-03-02 Thread Jens Axboe
On 03/02/2017 01:53 PM, Omar Sandoval wrote:
> On Thu, Mar 02, 2017 at 09:13:30AM -0700, Jens Axboe wrote:
>> I was worried about that. How about the below? We need to grab the lock
>> at some point for legacy scheduling, but the ordering should be correct.
> 
> Makes sense, now the locking is consistent with the other place we call
> ioc_exit_icq(). One nit below, and you can add
> 
> Reviewed-by: Omar Sandoval 
> 
>> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
>> index b12f9c87b4c3..6fd633b5d567 100644
>> --- a/block/blk-ioc.c
>> +++ b/block/blk-ioc.c
>> @@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
>>  icq->flags |= ICQ_EXITED;
>>  }
>>  
>> -/* Release an icq.  Called with both ioc and q locked. */
>> +/* Release an icq.  Called with ioc locked. */
> 
> For ioc_exit_icq(), we have the more explicit comment
> 
> /*
>  * Exit an icq. Called with both ioc and q locked for sq, only ioc locked for
>  * mq.
>  */
> 
> Could you document that here, too?

Done, I've synced the two comments now. Thanks for the review!

-- 
Jens Axboe



Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-03-02 Thread Paolo Valente

> On 2 Mar 2017, at 19:25, Jens Axboe  wrote:
> 
> On 03/02/2017 11:07 AM, Paolo Valente wrote:
>> 
>>> On 2 Mar 2017, at 17:13, Jens Axboe  wrote:
>>> 
>>> On 03/02/2017 09:07 AM, Paolo Valente wrote:
 
> On 2 Mar 2017, at 16:13, Jens Axboe  wrote:
> 
> On 03/02/2017 08:00 AM, Jens Axboe wrote:
>> On 03/02/2017 03:28 AM, Paolo Valente wrote:
>>> 
 On 15 Feb 2017, at 19:04, Jens Axboe  wrote:
 
 On 02/15/2017 10:58 AM, Jens Axboe wrote:
> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>> 
>>> On 10 Feb 2017, at 19:32, Omar Sandoval  wrote:
>>> 
>>> From: Omar Sandoval 
>>> 
>>> None of the other blk-mq elevator hooks are called with this lock 
>>> held.
>>> Additionally, it can lead to circular locking dependencies between
>>> queue_lock and the private scheduler lock.
>>> 
>> 
>> Hi Omar,
>> I'm sorry, but it seems that a new potential deadlock has shown up.
>> See lockdep splat below.
>> 
>> I've tried to think about solutions other than turning back to
>> deferring the body of exit_icq, but to no avail.
> 
> Looks like an interaction between bfqd->lock and q->queue_lock. Since
> the
> core has no notion of your bfqd->lock, the natural dependency here
> would be to nest bfqd->lock inside q->queue_lock. Is that possible for
> you?
> 
> Looking at the code a bit, maybe it'd just be simpler to get rid of
> holding the queue lock for that spot. For the mq scheduler, we really
> don't want places where we invoke with that held already. Does the 
> below
> work for you?
 
 Would need to remove one more lockdep assert. And only test this for
 the mq parts, we'd need to spread a bit of love on the classic
 scheduling icq exit path for this to work on that side.
 
>>> 
>>> 
>>> Jens,
>>> here is the reply I anticipated in my previous email: after rebasing
>>> against master, I'm getting again the deadlock that this patch of
>>> yours solved (together with some changes in bfq-mq).  I thought you 
>>> added a
>>> sort of equivalent commit (now) to the mainline branch.  Am I wrong?
>> 
>> The patch I posted was never pulled to completion, it wasn't clear
>> to me if it fixed your issue or not. Maybe I missed a reply on
>> that?
>> 
>> Let me take another stab at it today, I'll send you a version to test
>> on top of my for-linus branch.
> 
>>> I took a look at my last posting, and I think it was only missing a
> lock grab in cfq, since we don't hold that for icq exit anymore.  Paolo,
> it'd be great if you could verify that this works. Not for bfq-mq (we
> already know it does), but for CFQ.
 
 No luck, sorry :(
 
 It looks like we just have to avoid taking q->queue_lock in cfq_exit_icq.
>>> 
>>> I was worried about that. How about the below? We need to grab the lock
>>> at some point for legacy scheduling, but the ordering should be correct.
>>> 
>>> 
>> 
>> It seems to be: no more crashes or lockdep complains after several tests, 
>> boots and shutdowns :)
> 
> Great! Can I add your Tested-by?

Sure!

Thanks,
Paolo

> 
> -- 
> Jens Axboe



Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-03-02 Thread Omar Sandoval
On Thu, Mar 02, 2017 at 09:13:30AM -0700, Jens Axboe wrote:
> I was worried about that. How about the below? We need to grab the lock
> at some point for legacy scheduling, but the ordering should be correct.

Makes sense, now the locking is consistent with the other place we call
ioc_exit_icq(). One nit below, and you can add

Reviewed-by: Omar Sandoval 

> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index b12f9c87b4c3..6fd633b5d567 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
>   icq->flags |= ICQ_EXITED;
>  }
>  
> -/* Release an icq.  Called with both ioc and q locked. */
> +/* Release an icq.  Called with ioc locked. */

For ioc_exit_icq(), we have the more explicit comment

/*
 * Exit an icq. Called with both ioc and q locked for sq, only ioc locked for
 * mq.
 */

Could you document that here, too?

>  static void ioc_destroy_icq(struct io_cq *icq)
>  {
>   struct io_context *ioc = icq->ioc;
> @@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
>   struct elevator_type *et = q->elevator->type;
>  
>   lockdep_assert_held(&ioc->lock);
> - lockdep_assert_held(q->queue_lock);
>  
>   radix_tree_delete(&ioc->icq_tree, icq->q->id);
>   hlist_del_init(&icq->ioc_node);
> @@ -222,24 +221,40 @@ void exit_io_context(struct task_struct *task)
>   put_io_context_active(ioc);
>  }


Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-03-02 Thread Jens Axboe
On 03/02/2017 11:07 AM, Paolo Valente wrote:
> 
>> On 2 Mar 2017, at 17:13, Jens Axboe  wrote:
>>
>> On 03/02/2017 09:07 AM, Paolo Valente wrote:
>>>
 On 2 Mar 2017, at 16:13, Jens Axboe  wrote:

 On 03/02/2017 08:00 AM, Jens Axboe wrote:
> On 03/02/2017 03:28 AM, Paolo Valente wrote:
>>
>>> On 15 Feb 2017, at 19:04, Jens Axboe  wrote:
>>>
>>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
 On 02/15/2017 10:24 AM, Paolo Valente wrote:
>
>> On 10 Feb 2017, at 19:32, Omar Sandoval  wrote:
>>
>> From: Omar Sandoval 
>>
>> None of the other blk-mq elevator hooks are called with this lock 
>> held.
>> Additionally, it can lead to circular locking dependencies between
>> queue_lock and the private scheduler lock.
>>
>
> Hi Omar,
> I'm sorry, but it seems that a new potential deadlock has shown up.
> See lockdep splat below.
>
> I've tried to think about solutions other than turning back to
> deferring the body of exit_icq, but to no avail.

 Looks like an interaction between bfqd->lock and q->queue_lock. Since
 the
 core has no notion of your bfqd->lock, the natural dependency here
 would be to nest bfqd->lock inside q->queue_lock. Is that possible for
 you?

 Looking at the code a bit, maybe it'd just be simpler to get rid of
 holding the queue lock for that spot. For the mq scheduler, we really
 don't want places where we invoke with that held already. Does the 
 below
 work for you?
>>>
>>> Would need to remove one more lockdep assert. And only test this for
>>> the mq parts, we'd need to spread a bit of love on the classic
>>> scheduling icq exit path for this to work on that side.
>>>
>>
>>
>> Jens,
>> here is the reply I anticipated in my previous email: after rebasing
>> against master, I'm getting again the deadlock that this patch of
>> yours solved (together with some changes in bfq-mq).  I thought you 
>> added a
>> sort of equivalent commit (now) to the mainline branch.  Am I wrong?
>
> The patch I posted was never pulled to completion, it wasn't clear
> to me if it fixed your issue or not. Maybe I missed a reply on
> that?
>
> Let me take another stab at it today, I'll send you a version to test
> on top of my for-linus branch.

 I took a look at my last posting, and I think it was only missing a
 lock grab in cfq, since we don't hold that for icq exit anymore.  Paolo,
 it'd be great if you could verify that this works. Not for bfq-mq (we
 already know it does), but for CFQ.
>>>
>>> No luck, sorry :(
>>>
>>> It looks like we just have to avoid taking q->queue_lock in cfq_exit_icq.
>>
>> I was worried about that. How about the below? We need to grab the lock
>> at some point for legacy scheduling, but the ordering should be correct.
>>
>>
> 
> It seems to be: no more crashes or lockdep complains after several tests, 
> boots and shutdowns :)

Great! Can I add your Tested-by?

-- 
Jens Axboe



Re: [PATCH v2] blkcg: allocate struct blkcg_gq outside request queue spinlock

2017-03-02 Thread Tejun Heo
Hello, Tahsin.

On Wed, Mar 01, 2017 at 03:43:19PM -0800, Tahsin Erdogan wrote:
> @@ -258,18 +258,22 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>  struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> - struct request_queue *q)
> + struct request_queue *q, bool wait_ok)

I'm okay with this direction but it probably would be better if the
parameter is gfp_mask and we branch on __GFP_WAIT in the function.

>  {
>   struct blkcg_gq *blkg;
>  
> @@ -300,7 +304,30 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>   parent = blkcg_parent(parent);
>   }
>  
> - blkg = blkg_create(pos, q, NULL);
> + if (wait_ok) {
> + struct blkcg_gq *new_blkg;
> +
> + spin_unlock_irq(q->queue_lock);
> + rcu_read_unlock();
> +
> + new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
> +
> + rcu_read_lock();
> + spin_lock_irq(q->queue_lock);
> +
> + if (unlikely(!new_blkg))
> + return ERR_PTR(-ENOMEM);
> +
> + if (unlikely(blk_queue_bypass(q))) {
> + blkg_free(new_blkg);
> + return ERR_PTR(blk_queue_dying(q) ?
> + -ENODEV : -EBUSY);
> + }
> +
> + blkg = blkg_create(pos, q, new_blkg);
> + } else
> + blkg = blkg_create(pos, q, NULL);

So, while I'm okay with the approach, now we're creating a hybrid
approach where we have both pre-allocation and allocation mode
altering mechanisms.  If we're going to take this route, I think the
right thing to do is passing down @gfp_mask all the way down to
blkg_create().

> @@ -789,6 +816,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
> blkcg_policy *pol,
>  {
>   struct gendisk *disk;
>   struct blkcg_gq *blkg;
> + struct request_queue *q;
>   struct module *owner;
>   unsigned int major, minor;
>   int key_len, part, ret;
> @@ -812,18 +840,27 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
> blkcg_policy *pol,
>   return -ENODEV;
>   }
>  
> + q = disk->queue;
> +
>   rcu_read_lock();
> - spin_lock_irq(disk->queue->queue_lock);
> + spin_lock_irq(q->queue_lock);
>  
> - if (blkcg_policy_enabled(disk->queue, pol))
> - blkg = blkg_lookup_create(blkcg, disk->queue);
> - else
> + if (blkcg_policy_enabled(q, pol)) {
> + blkg = blkg_lookup_create(blkcg, q, true /* wait_ok */);
> +
> + /*
> +  * blkg_lookup_create() may have dropped and reacquired the
> +  * queue lock. Check policy enabled state again.
> +  */
> + if (!IS_ERR(blkg) && unlikely(!blkcg_policy_enabled(q, pol)))
> + blkg = ERR_PTR(-EOPNOTSUPP);

And let blkg_create() verify these conditions after releasing and
regrabbing the lock.

This also means that the init path can simply pass in GFP_KERNEL.
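
That direction would look roughly like this (signatures illustrative, not
the final API):

        struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
                                            struct request_queue *q, gfp_t gfp);
        static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
                                            struct request_queue *q, gfp_t gfp);

        /*
         * blkg_conf_prep() and the init path pass GFP_KERNEL; atomic callers
         * pass GFP_NOWAIT.  When the mask allows sleeping, blkg_create()
         * drops and regrabs the queue lock around blkg_alloc() and re-checks
         * blk_queue_bypass()/blkcg_policy_enabled() afterwards.
         */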

Thanks.

-- 
tejun


[PATCH] block: Initialize bd_bdi on inode initialization

2017-03-02 Thread Jan Kara
So far we initialized bd_bdi only in bdget(). That is fine for normal
bdev inodes; however, for the special case of the root inode of
blockdev_superblock that function is never called, and thus bd_bdi is
left uninitialized. As a result bdev_evict_inode() may oops doing
bdi_put(root->bd_bdi) on that inode, as can be seen when doing:

mount -t bdev none /mnt

Fix the problem by initializing bd_bdi when first allocating the inode
and then reinitializing bd_bdi in bdev_evict_inode().

Thanks to syzkaller team for finding the problem.

Reported-by: Dmitry Vyukov 
Fixes: b1d2dc5659b41741f5a29b2ade76ffb4e5bb13d8
Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 77c30f15a02c..2eca00ec4370 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -870,6 +870,7 @@ static void init_once(void *foo)
 #ifdef CONFIG_SYSFS
	INIT_LIST_HEAD(&bdev->bd_holder_disks);
 #endif
+   bdev->bd_bdi = &noop_backing_dev_info;
	inode_init_once(&ei->vfs_inode);
	/* Initialize mutex for freeze. */
	mutex_init(&bdev->bd_fsfreeze_mutex);
@@ -884,8 +885,10 @@ static void bdev_evict_inode(struct inode *inode)
	spin_lock(&bdev_lock);
	list_del_init(&bdev->bd_list);
	spin_unlock(&bdev_lock);
-   if (bdev->bd_bdi != &noop_backing_dev_info)
+   if (bdev->bd_bdi != &noop_backing_dev_info) {
		bdi_put(bdev->bd_bdi);
+   bdev->bd_bdi = &noop_backing_dev_info;
+   }
 }
 
 static const struct super_operations bdev_sops = {
@@ -988,7 +991,6 @@ struct block_device *bdget(dev_t dev)
bdev->bd_contains = NULL;
bdev->bd_super = NULL;
bdev->bd_inode = inode;
-   bdev->bd_bdi = &noop_backing_dev_info;
bdev->bd_block_size = i_blocksize(inode);
bdev->bd_part_count = 0;
bdev->bd_invalidated = 0;
-- 
2.10.2



Re: [PATCH v2 06/13] md: raid1: don't use bio's vec table to manage resync pages

2017-03-02 Thread Shaohua Li
On Thu, Mar 02, 2017 at 10:25:10AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:37 AM, Shaohua Li  wrote:
> > On Tue, Feb 28, 2017 at 11:41:36PM +0800, Ming Lei wrote:
> >> Now we allocate one page array for managing resync pages, instead
> >> of using the bio's vec table to do that; the old way is very hacky
> >> and won't work any more once multipage bvec is enabled.
> >>
> >> The introduced cost is that we need to allocate (128 + 16) * raid_disks
> >> bytes per r1_bio, and that is fine because the number of inflight r1_bios
> >> for resync shouldn't be large, as pointed out by Shaohua.
> >>
> >> Also the bio_reset() in raid1_sync_request() is removed because
> >> all bios are now freshly allocated and no longer need to be reset.
> >>
> >> This patch can be thought of as a cleanup too.
> >>
> >> Suggested-by: Shaohua Li 
> >> Signed-off-by: Ming Lei 
> >> ---
> >>  drivers/md/raid1.c | 83 
> >> ++
> >>  1 file changed, 53 insertions(+), 30 deletions(-)
> >>
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index c442b4657e2f..900144f39630 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, sector_t 
> >> sector_nr);
> >>  #define raid1_log(md, fmt, args...)  \
> >>   do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, 
> >> ##args); } while (0)
> >>
> >> +static inline struct resync_pages *get_resync_pages(struct bio *bio)
> >> +{
> >> + return bio->bi_private;
> >> +}
> >> +
> >> +static inline struct r1bio *get_resync_r1bio(struct bio *bio)
> >> +{
> >> + return get_resync_pages(bio)->raid_bio;
> >> +}
> >
> > This is a weird relation between bio, r1bio and the resync_pages. I'd like
> > the pages to be
> 
> It is only a bit weird inside allocating and freeing r1bio; once all
> are allocated, you
> can see everything is clean and simple:
>
> - r1bio includes lots of bios,
> - and one bio is attached to one resync_pages via .bi_private

I don't know how complex it is to let r1bio point to the pages, but that's the
natural way. r1bio owns the pages, not the other way around, so we should let
r1bio point to the pages. The bio.bi_private still points to r1bio.
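
I.e., a sketch of that layout (the field name is illustrative):

        struct r1bio {
                /* ... existing fields ... */
                struct resync_pages *rps;   /* r1bio -> the pages it owns */
        };

        /* each bio's ->bi_private keeps pointing at the r1bio, as before */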

> > embedded in r1bio. Maybe a pointer from r1bio to the pages. That's cleaner
> > and more
> > straightforward.
> 
> In theory, the cleanest way is to allocate one resync_pages for each resync
> bio,
> but that isn't efficient. That is why this patch allocates all
> resync_pages together
> in r1buf_pool_alloc(), and splits them among the bios.
> 
> BTW, the only trick is that the whole page array is stored in .bi_private
> of the 1st bio, so we can avoid adding one pointer to r1bio.

You will need to add a pointer in resync_pages which points to r1bio
 
> >
> > I think patches 6, 7 and 8 should be in the same patch. Otherwise bisect
> > will be broken.
> >
> 
> No, it won't. Both patch 7 and patch 8 just replace reading from the bvec
> table
> with reading from the pre-allocated page array.  Both should be fine, but the
> latter is cleaner and simpler to do.

Ok, sounds good, though I doubt it's really worth it.

Thanks,
Shaohua


Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-03-02 Thread Paolo Valente

> On 2 Mar 2017, at 17:13, Jens Axboe  wrote:
> 
> On 03/02/2017 09:07 AM, Paolo Valente wrote:
>> 
>>> On 2 Mar 2017, at 16:13, Jens Axboe  wrote:
>>> 
>>> On 03/02/2017 08:00 AM, Jens Axboe wrote:
 On 03/02/2017 03:28 AM, Paolo Valente wrote:
> 
>> On 15 Feb 2017, at 19:04, Jens Axboe  wrote:
>> 
>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
 
> On 10 Feb 2017, at 19:32, Omar Sandoval  wrote:
> 
> From: Omar Sandoval 
> 
> None of the other blk-mq elevator hooks are called with this lock 
> held.
> Additionally, it can lead to circular locking dependencies between
> queue_lock and the private scheduler lock.
> 
 
 Hi Omar,
 I'm sorry, but it seems that a new potential deadlock has shown up.
 See lockdep splat below.
 
 I've tried to think about solutions other than turning back to
 deferring the body of exit_icq, but to no avail.
>>> 
>>> Looks like an interaction between bfqd->lock and q->queue_lock. Since the
>>> core has no notion of your bfqd->lock, the natural dependency here
>>> would be to nest bfqd->lock inside q->queue_lock. Is that possible for
>>> you?
>>> 
>>> Looking at the code a bit, maybe it'd just be simpler to get rid of
>>> holding the queue lock for that spot. For the mq scheduler, we really
>>> don't want places where we invoke with that held already. Does the below
>>> work for you?
>> 
>> Would need to remove one more lockdep assert. And only test this for
>> the mq parts, we'd need to spread a bit of love on the classic
>> scheduling icq exit path for this to work on that side.
>> 
> 
> 
> Jens,
> here is the reply I anticipated in my previous email: after rebasing
> against master, I'm getting again the deadlock that this patch of
> yours solved (together with some changes in bfq-mq).  I thought you added 
> a
> sort of equivalent commit (now) to the mainline branch.  Am I wrong?
 
 The patch I posted was never pulled to completion, it wasn't clear
 to me if it fixed your issue or not. Maybe I missed a reply on
 that?
 
 Let me take another stab at it today, I'll send you a version to test
 on top of my for-linus branch.
>>> 
>>> I took a look at my last posting, and I think it was only missing a
>>> lock grab in cfq, since we don't hold that for icq exit anymore.  Paolo,
>>> it'd be great if you could verify that this works. Not for bfq-mq (we
>>> already know it does), but for CFQ.
>> 
>> No luck, sorry :(
>> 
>> It looks like we just have to avoid taking q->queue_lock in cfq_exit_icq.
> 
> I was worried about that. How about the below? We need to grab the lock
> at some point for legacy scheduling, but the ordering should be correct.
> 
> 

It seems to be: no more crashes or lockdep complains after several tests, boots 
and shutdowns :)

Thanks,
Paolo


> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index b12f9c87b4c3..6fd633b5d567 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
>   icq->flags |= ICQ_EXITED;
> }
> 
> -/* Release an icq.  Called with both ioc and q locked. */
> +/* Release an icq.  Called with ioc locked. */
> static void ioc_destroy_icq(struct io_cq *icq)
> {
>   struct io_context *ioc = icq->ioc;
> @@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
>   struct elevator_type *et = q->elevator->type;
> 
>   lockdep_assert_held(&ioc->lock);
> - lockdep_assert_held(q->queue_lock);
> 
>   radix_tree_delete(&ioc->icq_tree, icq->q->id);
>   hlist_del_init(&icq->ioc_node);
> @@ -222,24 +221,40 @@ void exit_io_context(struct task_struct *task)
>   put_io_context_active(ioc);
> }
> 
> +static void __ioc_clear_queue(struct list_head *icq_list)
> +{
> + unsigned long flags;
> +
> + while (!list_empty(icq_list)) {
> + struct io_cq *icq = list_entry(icq_list->next,
> +struct io_cq, q_node);
> + struct io_context *ioc = icq->ioc;
> +
> + spin_lock_irqsave(&ioc->lock, flags);
> + ioc_destroy_icq(icq);
> + spin_unlock_irqrestore(&ioc->lock, flags);
> + }
> +}
> +
> /**
>  * ioc_clear_queue - break any ioc association with the specified queue
>  * @q: request_queue being cleared
>  *
> - * Walk @q->icq_list and exit all io_cq's.  Must be called with @q locked.
> + * Walk @q->icq_list and exit all io_cq's.
>  */
> void ioc_clear_queue(struct request_queue *q)
> {
> - lockdep_assert_held(q->queue_lock);
> + LIST_HEAD(icq_list);
> 
> - while (!list_empty(&q->icq_list)) {

Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-03-02 Thread Jens Axboe
On 03/02/2017 09:07 AM, Paolo Valente wrote:
> 
>> On 2 Mar 2017, at 16:13, Jens Axboe  wrote:
>>
>> On 03/02/2017 08:00 AM, Jens Axboe wrote:
>>> On 03/02/2017 03:28 AM, Paolo Valente wrote:

> On 15 Feb 2017, at 19:04, Jens Axboe  wrote:
>
> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>>>
 On 10 Feb 2017, at 19:32, Omar Sandoval  wrote:

 From: Omar Sandoval 

 None of the other blk-mq elevator hooks are called with this lock held.
 Additionally, it can lead to circular locking dependencies between
 queue_lock and the private scheduler lock.

>>>
>>> Hi Omar,
>>> I'm sorry, but it seems that a new potential deadlock has shown up.
>>> See lockdep splat below.
>>>
>>> I've tried to think about solutions other than turning back to
>>> deferring the body of exit_icq, but to no avail.
>>
>> Looks like an interaction between bfqd->lock and q->queue_lock. Since the
>> core has no notion of your bfqd->lock, the natural dependency here
>> would be to nest bfqd->lock inside q->queue_lock. Is that possible for
>> you?
>>
>> Looking at the code a bit, maybe it'd just be simpler to get rid of
>> holding the queue lock for that spot. For the mq scheduler, we really
>> don't want places where we invoke with that held already. Does the below
>> work for you?
>
> Would need to remove one more lockdep assert. And only test this for
> the mq parts, we'd need to spread a bit of love on the classic
> scheduling icq exit path for this to work on that side.
>


 Jens,
 here is the reply I anticipated in my previous email: after rebasing
 against master, I'm getting again the deadlock that this patch of
 yours solved (together with some changes in bfq-mq).  I thought you added a
 sort of equivalent commit (now) to the mainline branch.  Am I wrong?
>>>
>>> The patch I posted was never pulled to completion, it wasn't clear
>>> to me if it fixed your issue or not. Maybe I missed a reply on
>>> that?
>>>
>>> Let me take another stab at it today, I'll send you a version to test
>>> on top of my for-linus branch.
>>
>> I took a look at my last posting, and I think it was only missing a
>> lock grab in cfq, since we don't hold that for icq exit anymore.  Paolo,
>> it'd be great if you could verify that this works. Not for bfq-mq (we
>> already know it does), but for CFQ.
> 
> No luck, sorry :(
> 
> It looks like we just have to avoid taking q->queue_lock in cfq_exit_icq.

I was worried about that. How about the below? We need to grab the lock
at some point for legacy scheduling, but the ordering should be correct.


diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index b12f9c87b4c3..6fd633b5d567 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
icq->flags |= ICQ_EXITED;
 }
 
-/* Release an icq.  Called with both ioc and q locked. */
+/* Release an icq.  Called with ioc locked. */
 static void ioc_destroy_icq(struct io_cq *icq)
 {
struct io_context *ioc = icq->ioc;
@@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
struct elevator_type *et = q->elevator->type;
 
	lockdep_assert_held(&ioc->lock);
-   lockdep_assert_held(q->queue_lock);
 
	radix_tree_delete(&ioc->icq_tree, icq->q->id);
	hlist_del_init(&icq->ioc_node);
@@ -222,24 +221,40 @@ void exit_io_context(struct task_struct *task)
put_io_context_active(ioc);
 }
 
+static void __ioc_clear_queue(struct list_head *icq_list)
+{
+   unsigned long flags;
+
+   while (!list_empty(icq_list)) {
+   struct io_cq *icq = list_entry(icq_list->next,
+  struct io_cq, q_node);
+   struct io_context *ioc = icq->ioc;
+
+   spin_lock_irqsave(&ioc->lock, flags);
+   ioc_destroy_icq(icq);
+   spin_unlock_irqrestore(&ioc->lock, flags);
+   }
+}
+
 /**
  * ioc_clear_queue - break any ioc association with the specified queue
  * @q: request_queue being cleared
  *
- * Walk @q->icq_list and exit all io_cq's.  Must be called with @q locked.
+ * Walk @q->icq_list and exit all io_cq's.
  */
 void ioc_clear_queue(struct request_queue *q)
 {
-   lockdep_assert_held(q->queue_lock);
+   LIST_HEAD(icq_list);
 
-   while (!list_empty(&q->icq_list)) {
-   struct io_cq *icq = list_entry(q->icq_list.next,
-  struct io_cq, q_node);
-   struct io_context *ioc = icq->ioc;
+   spin_lock_irq(q->queue_lock);
+   list_splice_init(&q->icq_list, &icq_list);
 
-   spin_lock(&ioc->lock);
-   ioc_destroy_icq(icq);
-   

Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-03-02 Thread Paolo Valente

> On 2 Mar 2017, at 16:13, Jens Axboe  wrote:
> 
> On 03/02/2017 08:00 AM, Jens Axboe wrote:
>> On 03/02/2017 03:28 AM, Paolo Valente wrote:
>>> 
 On 15 Feb 2017, at 19:04, Jens Axboe  wrote:
 
 On 02/15/2017 10:58 AM, Jens Axboe wrote:
> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>> 
>>> On 10 Feb 2017, at 19:32, Omar Sandoval  wrote:
>>> 
>>> From: Omar Sandoval 
>>> 
>>> None of the other blk-mq elevator hooks are called with this lock held.
>>> Additionally, it can lead to circular locking dependencies between
>>> queue_lock and the private scheduler lock.
>>> 
>> 
>> Hi Omar,
>> I'm sorry, but it seems that a new potential deadlock has shown up.
>> See lockdep splat below.
>> 
>> I've tried to think about solutions other than turning back to
>> deferring the body of exit_icq, but to no avail.
> 
> Looks like an interaction between bfqd->lock and q->queue_lock. Since the
> core has no notion of your bfqd->lock, the natural dependency here
> would be to nest bfqd->lock inside q->queue_lock. Is that possible for
> you?
> 
> Looking at the code a bit, maybe it'd just be simpler to get rid of
> holding the queue lock for that spot. For the mq scheduler, we really
> don't want places where we invoke with that held already. Does the below
> work for you?
 
 Would need to remove one more lockdep assert. And only test this for
 the mq parts, we'd need to spread a bit of love on the classic
 scheduling icq exit path for this to work on that side.
 
>>> 
>>> 
>>> Jens,
>>> here is the reply I anticipated in my previous email: after rebasing
>>> against master, I'm getting again the deadlock that this patch of
>>> yours solved (together with some changes in bfq-mq).  I thought you added a
>>> sort of equivalent commit (now) to the mainline branch.  Am I wrong?
>> 
>> The patch I posted was never pulled to completion, it wasn't clear
>> to me if it fixed your issue or not. Maybe I missed a reply on
>> that?
>> 
>> Let me take another stab at it today, I'll send you a version to test
>> on top of my for-linus branch.
> 
> I took a look at my last posting, and I think it was only missing a
> lock grab in cfq, since we don't hold that for icq exit anymore.  Paolo,
> it'd be great if you could verify that this works. Not for bfq-mq (we
> already know it does), but for CFQ.

No luck, sorry :(

It looks like we just have to avoid taking q->queue_lock in cfq_exit_icq.

[9.329501] =
[9.336132] [ INFO: possible recursive locking detected ]
[9.343801] 4.10.0-bfq-mq+ #56 Not tainted
[9.353359] -
[9.354101] ata_id/160 is trying to acquire lock:
[9.354750]  (&(&q->__queue_lock)->rlock){..-...}, at: [] 
cfq_exit_icq+0x2a/0x80
[9.355986] 
[9.355986] but task is already holding lock:
[9.364221]  (&(&q->__queue_lock)->rlock){..-...}, at: [] 
put_io_context_active+0x86/0xe0
[9.382980] 
[9.382980] other info that might help us debug this:
[9.386460]  Possible unsafe locking scenario:
[9.386460] 
[9.397070]CPU0
[9.397468]
[9.397811]   lock(&(&q->__queue_lock)->rlock);
[9.398440]   lock(&(&q->__queue_lock)->rlock);
[9.406265] 
[9.406265]  *** DEADLOCK ***
[9.406265] 
[9.409589]  May be due to missing lock nesting notation
[9.409589] 
[9.413040] 2 locks held by ata_id/160:
[9.413576]  #0:  (&(&ioc->lock)->rlock/1){..}, at: [] 
put_io_context_active+0x38/0xe0
[9.418069]  #1:  (&(&q->__queue_lock)->rlock){..-...}, at: 
[] put_io_context_active+0x86/0xe0
[9.441118] 
[9.441118] stack backtrace:
[9.445113] CPU: 0 PID: 160 Comm: ata_id Not tainted 4.10.0-bfq-mq+ #56
[9.446210] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
VirtualBox 12/01/2006
[9.457223] Call Trace:
[9.457636]  dump_stack+0x85/0xc2
[9.458183]  __lock_acquire+0x1834/0x1890
[9.458839]  ? __lock_acquire+0x4af/0x1890
[9.464206]  lock_acquire+0x11b/0x220
[9.471703]  ? cfq_exit_icq+0x2a/0x80
[9.472225]  _raw_spin_lock+0x3d/0x80
[9.472784]  ? cfq_exit_icq+0x2a/0x80
[9.476336]  cfq_exit_icq+0x2a/0x80
[9.486750]  ioc_exit_icq+0x38/0x50
[9.487245]  put_io_context_active+0x92/0xe0
[9.488049]  exit_io_context+0x48/0x50
[9.488423]  do_exit+0x7a8/0xc80
[9.488797]  do_group_exit+0x50/0xd0
[9.496655]  SyS_exit_group+0x14/0x20
[9.497170]  entry_SYSCALL_64_fastpath+0x23/0xc6
[9.497808] RIP: 0033:0x7f224d1c1b98
[9.498302] RSP: 002b:7ffd342666f8 EFLAGS: 0246 ORIG_RAX: 
00e7
[9.499360] RAX: ffda RBX: 0003 RCX: 7f224d1c1b98
[9.511706] RDX:  RSI: 003c 

Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()

2017-03-02 Thread Shaohua Li
On Thu, Mar 02, 2017 at 10:34:25AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li  wrote:
> > On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
> >> Use this helper, instead of direct access to .bi_vcnt.
> >
> > What we really need to do for the behind IO is:
> > - allocate memory and copy bio data to the memory
> > - let behind bio do IO against the memory
> >
> > The behind bio doesn't need to have exactly the same bio_vec setting. If we
> > just track the new memory, we don't need to use bio_segments_all() or access
> > bio_vec either.
> 
> But we need to figure out how many vecs (each vec stores one page) need to be
> allocated for the cloned/behind bio, and that is the only value of
> bio_segments_all() here. Or do you have an idea to avoid that?

As I said, the behind bio doesn't need to have exactly the same bio_vec
setting. We just allocate memory and copy the original bio data to the memory,
then do IO against the new memory. The behind bio
segments == (bio->bi_iter.bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT

Thanks,
Shaohua


Re: [PATCH v2 04/13] md: prepare for managing resync I/O pages in clean way

2017-03-02 Thread Shaohua Li
On Thu, Mar 02, 2017 at 10:09:14AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:30 AM, Shaohua Li  wrote:
> > On Tue, Feb 28, 2017 at 11:41:34PM +0800, Ming Lei wrote:
> >> Now resync I/O uses the bio's bvec table to manage pages;
> >> this is very hacky, and may not work any more
> >> once multipage bvec is introduced.
> >>
> >> So introduce helpers and a new data structure for
> >> managing resync I/O pages more cleanly.
> >>
> >> Signed-off-by: Ming Lei 
> >> ---
> >>  drivers/md/md.h | 54 
> >> ++
> >>  1 file changed, 54 insertions(+)
> >>
> >> diff --git a/drivers/md/md.h b/drivers/md/md.h
> >> index 1d63239a1be4..b5a638d85cb4 100644
> >> --- a/drivers/md/md.h
> >> +++ b/drivers/md/md.h
> >> @@ -720,4 +720,58 @@ static inline void mddev_check_writesame(struct mddev 
> >> *mddev, struct bio *bio)
> >>  #define RESYNC_BLOCK_SIZE (64*1024)
> >>  #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
> >>
> >> +/* for managing resync I/O pages */
> >> +struct resync_pages {
> >> + unsigned        idx;    /* for get/put page from the pool */
> >> + void            *raid_bio;
> >> + struct page *pages[RESYNC_PAGES];
> >> +};
> >
> > I'd like this to be embedded into r1bio directly. Not sure if we really 
> > need a
> > structure.
> 
> There are two reasons we can't put this into r1bio:
> - r1bio is used in both normal and resync I/O; it isn't fair to allocate more
> for normal I/O, and that is why this patch doesn't touch r1bio or
> r10bio
>
> - the count of 'struct resync_pages' instances depends on raid_disks (raid1)
> or copies (raid10), neither of which can be decided at compile time.

Yes, I said in other emails that it should be a pointer in r1bio pointing to
the pages.
 
> >
> >> +}
> >> +
> >> +static inline bool resync_page_available(struct resync_pages *rp)
> >> +{
> >> + return rp->idx < RESYNC_PAGES;
> >> +}
> >
> > Then we don't need this obscure API.
> 
> That is fine.
> 
> 
> Thanks,
> Ming Lei
 
> >
> >> +
> >> +static inline int resync_alloc_pages(struct resync_pages *rp,
> >> +  gfp_t gfp_flags)
> >> +{
> >> + int i;
> >> +
> >> + for (i = 0; i < RESYNC_PAGES; i++) {
> >> + rp->pages[i] = alloc_page(gfp_flags);
> >> + if (!rp->pages[i])
> >> + goto out_free;
> >> + }
> >> +
> >> + return 0;
> >> +
> >> + out_free:
> >> + while (--i >= 0)
> >> + __free_page(rp->pages[i]);
> >> + return -ENOMEM;
> >> +}
> >> +
> >> +static inline void resync_free_pages(struct resync_pages *rp)
> >> +{
> >> + int i;
> >> +
> >> + for (i = 0; i < RESYNC_PAGES; i++)
> >> + __free_page(rp->pages[i]);
> >
> > Since we will use get_page, shouldn't this be put_page?
> 
> You are right, will fix in v3.
> 
> >
> >> +}
> >> +
> >> +static inline void resync_get_all_pages(struct resync_pages *rp)
> >> +{
> >> + int i;
> >> +
> >> + for (i = 0; i < RESYNC_PAGES; i++)
> >> + get_page(rp->pages[i]);
> >> +}
> >> +
> >> +static inline struct page *resync_fetch_page(struct resync_pages *rp)
> >> +{
> >> + if (WARN_ON_ONCE(rp->idx >= RESYNC_PAGES))
> >> + return NULL;
> >> + return rp->pages[rp->idx++];
> >
> > I'd like the caller explicitly specify the index instead of a global 
> > variable
> > to track it, which will make the code more understandable and less error 
> > prone.
> 
> That is fine, but the index has to be per bio, and finally the index
> has to be stored
> in 'struct resync_pages', so every user has to call it in the following way:
> 
>   resync_fetch_page(rp, rp->idx);
> 
> so there looks to be no benefit in passing the index explicitly.

I suppose the callers always iterate through pages 0 to RESYNC_PAGES - 1,
so the callers can keep an index by themselves. That will clearly show
which page the callers are using. The resync_fetch_page() wrapper is a
black box; we always need to refer to its definition.
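
To illustrate, a minimal sketch of the explicit-index style suggested
here; the two-argument resync_fetch_page(rp, idx) plus the 'bio' and
'size' variables are assumptions, not what the posted series implements:

	int idx;

	/* the caller owns the index, so it is obvious which page of
	 * the resync window each iteration is filling */
	for (idx = 0; idx < RESYNC_PAGES; idx++) {
		struct page *page = resync_fetch_page(rp, idx);
		int len = min_t(int, size, PAGE_SIZE);

		bio_add_page(bio, page, len, 0);
		size -= len;
	}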

Thanks,
Shaohua


Re: [PATCH v2 05/13] md: raid1: simplify r1buf_pool_free()

2017-03-02 Thread Shaohua Li
On Thu, Mar 02, 2017 at 10:11:54AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:31 AM, Shaohua Li  wrote:
> > On Tue, Feb 28, 2017 at 11:41:35PM +0800, Ming Lei wrote:
> >> This patch takes a reference on each page of each bio for resync,
> >> so r1buf_pool_free() gets simplified a lot.
> >>
> >> The same policy has been applied to raid10's buf pool
> >> allocation/free too.
> >
> > We are going to delete the code, so this simplification isn't really required.
> 
> Could you explain it a bit? Or I can put such a patch into this
> patchset if you can provide one.

I mean you will replace the code soon with the resync_pages stuff, so I
thought this isn't really required.

Thanks,
Shaohua


Re: [WIP BRANCH] cgroups support in bfq-mq WIP branch

2017-03-02 Thread Jens Axboe
On 03/02/2017 03:15 AM, Paolo Valente wrote:
> 
>> On 25 Feb 2017, at 19:52, Jens Axboe  wrote:
>>
>> On 02/25/2017 10:44 AM, Paolo Valente wrote:
>>> Hi,
>>> I've just completed cgroups support, and I'd like to highlight the
>>> main blk-mq issue that I have found along the way.  I have pushed the
>>> commit that completes the support for cgroups to the usual WIP branch
>>> [1].  Before moving to this issue, I have a preliminary question about
>>> the scheduler name, since I'm about to start preparing the patch
>>> series for submission.  So far, I have used bfq-mq as a temporary
>>> name.  Are we fine with it, or should I change it, for example, to
>>> just bfq?  Jens?
>>
>> Just call it 'bfq', that doesn't conflict with anything that's
>> in the kernel already.
>>
> 
> ok
> 
>>> I've found a sort of circular dependency in blk-mq, related to
>>> scheduler initialization.  To describe both the issue and how I've
>>> addressed it, I'm pasting the message of the new commit.
>>
>> Rebase your patches on top of Linus' current master; some of them
>> will need to change and some can be dropped.
>>
> 
> Done, but the last deadlock issue shows up again :( To help you get
> context, I'm going to reply to the email in which you sent the patch
> that solved it.

OK, I got that sent to you. When you have tested it, just add it as
a prep patch in your series. If it works for you, then let me know
and I'll add your Tested-by: to that patch and post it for more
thorough review.

-- 
Jens Axboe



Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-03-02 Thread Jens Axboe
On 03/02/2017 08:00 AM, Jens Axboe wrote:
> On 03/02/2017 03:28 AM, Paolo Valente wrote:
>>
>>> On 15 Feb 2017, at 19:04, Jens Axboe  wrote:
>>>
>>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
 On 02/15/2017 10:24 AM, Paolo Valente wrote:
>
>> On 10 Feb 2017, at 19:32, Omar Sandoval  wrote:
>>
>> From: Omar Sandoval 
>>
>> None of the other blk-mq elevator hooks are called with this lock held.
>> Additionally, it can lead to circular locking dependencies between
>> queue_lock and the private scheduler lock.
>>
>
> Hi Omar,
> I'm sorry but it seems that a new potential deadlock has shown up.
> See lockdep splat below.
>
> I've tried to think about solutions other than turning back to
> deferring the body of exit_icq, but to no avail.

 Looks like an interaction between bfqd->lock and q->queue_lock. Since the
 core has no notion of your bfqd->lock, the natural dependency here
 would be to nest bfqd->lock inside q->queue_lock. Is that possible for
 you?

 Looking at the code a bit, maybe it'd just be simpler to get rid of
 holding the queue lock for that spot. For the mq scheduler, we really
 don't want places where we invoke with that held already. Does the below
 work for you?
>>>
>>> Would need to remove one more lockdep assert. And only test this for
>>> the mq parts, we'd need to spread a bit of love on the classic
>>> scheduling icq exit path for this to work on that side.
>>>
>>
>>
>> Jens,
>> here is the reply I anticipated in my previous email: after rebasing
>> against master, I'm getting again the deadlock that this patch of
>> yours solved (together with some changes in bfq-mq).  I thought you added a
>> sort of equivalent commit (now) to the mainline branch.  Am I wrong?
> 
> The patch I posted was never pulled to completion, it wasn't clear
> to me if it fixed your issue or not. Maybe I missed a reply on
> that?
> 
> Let me take another stab at it today, I'll send you a version to test
> on top of my for-linus branch.

I took a look at my last posting, and I think it was only missing a
lock grab in cfq, since we don't hold that for icq exit anymore.  Paolo,
it'd be great if you could verify that this works. Not for bfq-mq (we
already know it does), but for CFQ. Run with lockdep enabled and see if
it spits out anything. We'll hit this exit path very quickly, so the
test doesn't need to be anything exhaustive.
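
For reference, one minimal way to run that check, assuming a stock debug
config and a cfq-capable disk (the device name is illustrative):

	# lockdep options needed for the splat to show up
	CONFIG_PROVE_LOCKING=y
	CONFIG_DEBUG_LOCK_ALLOC=y

then exercise the icq exit paths (an elevator switch plus a task that
does some I/O and exits):

	$ echo cfq | sudo tee /sys/block/sda/queue/scheduler
	$ sudo dd if=/dev/sda of=/dev/null bs=1M count=16 iflag=direct
	$ dmesg | grep -i "circular locking"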


diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index b12f9c87b4c3..546ff8f81ede 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
icq->flags |= ICQ_EXITED;
 }
 
-/* Release an icq.  Called with both ioc and q locked. */
+/* Release an icq.  Called with ioc locked. */
 static void ioc_destroy_icq(struct io_cq *icq)
 {
struct io_context *ioc = icq->ioc;
@@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
struct elevator_type *et = q->elevator->type;
 
lockdep_assert_held(&ioc->lock);
-   lockdep_assert_held(q->queue_lock);
 
radix_tree_delete(&ioc->icq_tree, icq->q->id);
hlist_del_init(&icq->ioc_node);
@@ -222,25 +221,34 @@ void exit_io_context(struct task_struct *task)
put_io_context_active(ioc);
 }
 
+static void __ioc_clear_queue(struct list_head *icq_list)
+{
+   while (!list_empty(icq_list)) {
+   struct io_cq *icq = list_entry(icq_list->next,
+  struct io_cq, q_node);
+   struct io_context *ioc = icq->ioc;
+
+   spin_lock_irq(&ioc->lock);
+   ioc_destroy_icq(icq);
+   spin_unlock_irq(&ioc->lock);
+   }
+}
+
 /**
  * ioc_clear_queue - break any ioc association with the specified queue
  * @q: request_queue being cleared
  *
- * Walk @q->icq_list and exit all io_cq's.  Must be called with @q locked.
+ * Walk @q->icq_list and exit all io_cq's.
  */
 void ioc_clear_queue(struct request_queue *q)
 {
-   lockdep_assert_held(q->queue_lock);
+   LIST_HEAD(icq_list);
 
-   while (!list_empty(&q->icq_list)) {
-   struct io_cq *icq = list_entry(q->icq_list.next,
-  struct io_cq, q_node);
-   struct io_context *ioc = icq->ioc;
+   spin_lock_irq(q->queue_lock);
+   list_splice_init(&q->icq_list, &icq_list);
+   spin_unlock_irq(q->queue_lock);
 
-   spin_lock(&ioc->lock);
-   ioc_destroy_icq(icq);
-   spin_unlock(&ioc->lock);
-   }
+   __ioc_clear_queue(&icq_list);
 }
 
 int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 002af836aa87..c44b321335f3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -815,9 +815,7 @@ static void blk_release_queue(struct kobject *kobj)

Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-03-02 Thread Jens Axboe
On 03/02/2017 03:28 AM, Paolo Valente wrote:
> 
>> On 15 Feb 2017, at 19:04, Jens Axboe  wrote:
>>
>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>>> On 02/15/2017 10:24 AM, Paolo Valente wrote:

> On 10 Feb 2017, at 19:32, Omar Sandoval  wrote:
>
> From: Omar Sandoval 
>
> None of the other blk-mq elevator hooks are called with this lock held.
> Additionally, it can lead to circular locking dependencies between
> queue_lock and the private scheduler lock.
>

 Hi Omar,
 I'm sorry but it seems that a new potential deadlock has shown up.
 See lockdep splat below.

 I've tried to think about solutions other than turning back to
 deferring the body of exit_icq, but to no avail.
>>>
>>> Looks like an interaction between bfqd->lock and q->queue_lock. Since the
>>> core has no notion of your bfqd->lock, the natural dependency here
>>> would be to nest bfqd->lock inside q->queue_lock. Is that possible for
>>> you?
>>>
>>> Looking at the code a bit, maybe it'd just be simpler to get rid of
>>> holding the queue lock for that spot. For the mq scheduler, we really
>>> don't want places where we invoke with that held already. Does the below
>>> work for you?
>>
>> Would need to remove one more lockdep assert. And only test this for
>> the mq parts, we'd need to spread a bit of love on the classic
>> scheduling icq exit path for this to work on that side.
>>
> 
> 
> Jens,
> here is the reply I anticipated in my previous email: after rebasing
> against master, I'm getting again the deadlock that this patch of
> yours solved (together with some changes in bfq-mq).  I thought you added a
> sort of equivalent commit (now) to the mainline branch.  Am I wrong?

The patch I posted was never pulled to completion, it wasn't clear
to me if it fixed your issue or not. Maybe I missed a reply on
that?

Let me take another stab at it today, I'll send you a version to test
on top of my for-linus branch.

-- 
Jens Axboe



Re: [PATCH] cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode

2017-03-02 Thread Jan Kara
On Wed 01-03-17 10:07:44, Hou Tao wrote:
> When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
> as the delay of cfq_group's vdisktime if there have been other cfq_groups
> already.
> 
> When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
> from jiffies to nanoseconds") could result in a large iops delay and
> lead to an abnormal io schedule delay for the added cfq_group. To fix
> it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
> when iops mode is enabled.
> 
> Cc:  # 4.8+
> Signed-off-by: Hou Tao 

OK, I agree my commit broke the logic in this case. Thanks for the fix.
Please also add the tag:

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4

I somewhat disagree with the fix though. See below:

> +static inline u64 cfq_get_cfqg_vdisktime_delay(struct cfq_data *cfqd)
> +{
> + if (!iops_mode(cfqd))
> + return CFQ_IDLE_DELAY;
> + else
> + return nsecs_to_jiffies64(CFQ_IDLE_DELAY);
> +}
> +

So using nsecs_to_jiffies64(CFQ_IDLE_DELAY) when in iops mode just does not
make any sense. AFAIU the code in cfq_group_notify_queue_add(), we just want
to add the cfqg as the last one in the tree. So returning 1 from
cfq_get_cfqg_vdisktime_delay() in iops mode should be fine as well.
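
To put a number on it: after the nanosecond conversion, CFQ_IDLE_DELAY
is on the order of 2*10^8 (NSEC_PER_SEC / 5), while iops-mode vdisktime
advances per request, hence the huge delay. A sketch of the simpler
variant suggested above (untested, just to make it concrete):

	static inline u64 cfq_get_cfqg_vdisktime_delay(struct cfq_data *cfqd)
	{
		/* iops-mode vdisktime advances by request counts, so any
		 * positive increment already places the new cfqg last */
		return iops_mode(cfqd) ? 1 : CFQ_IDLE_DELAY;
	}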

Frankly, vdisktime is in fixed-point precision shifted by
CFQ_SERVICE_SHIFT, so using CFQ_IDLE_DELAY does not make much sense in
any case, and just adding 1 to the maximum vdisktime should be fine in
all cases. But that would require more testing to make sure I did not
miss anything subtle.

Honza

>  static void
>  cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>  {
> @@ -1380,7 +1388,8 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>   n = rb_last(&st->rb);
>   if (n) {
>   __cfqg = rb_entry_cfqg(n);
> - cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
> + cfqg->vdisktime = __cfqg->vdisktime +
> + cfq_get_cfqg_vdisktime_delay(cfqd);
>   } else
>   cfqg->vdisktime = st->min_vdisktime;
>   cfq_group_service_tree_add(st, cfqg);
> -- 
> 2.5.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 3/8] nowait aio: return if direct write will trigger writeback

2017-03-02 Thread Matthew Wilcox
On Thu, Mar 02, 2017 at 11:38:45AM +0100, Jan Kara wrote:
> On Wed 01-03-17 07:38:57, Christoph Hellwig wrote:
> > On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> > > But what's going to kick these pages out of cache?  Shouldn't we rather
> > > find the pages, kick them out if clean, start writeback if not, and *then*
> > > return -EAGAIN?
> > 
> > As pointed out in the last round of these patches I think we really
> > need to pass a flags argument to filemap_write_and_wait_range to
> > communicate the non-blocking nature and only return -EAGAIN if we'd
> > block.  As a bonus that can indeed start to kick the pages out.
> 
> Aren't flags to filemap_write_and_wait_range() an unnecessary complication?
> Realistically, most users wanting performance from AIO DIO so badly that
> they bother with this API won't have any pages to write / evict. If they do
> by some bad accident, they can fall back to standard "blocking" AIO DIO.
> So I don't see much value in teaching filemap_write_and_wait_range() about
> a non-blocking mode...

That lets me execute a DoS against a user using this API.  All I have
to do is open the file they're using read-only and read a byte from it.
Page goes into page-cache, and they'll only get -EAGAIN from calling
this syscall until the page ages out.
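
For instance, something as small as this (path hypothetical) is enough
to pin one of the victim's pages in the cache:

	$ dd if=/path/to/victim-file of=/dev/null bs=1 count=1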

Also, I don't understand why this is a flag.  Isn't the point of AIO to
be non-blocking?  Why isn't this just a change to how we do AIO?
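
For reference, a minimal sketch of the check being debated; IOCB_NOWAIT
is the flag this series proposes, while filemap_range_has_page() is
assumed here as the page-cache probe and 'mapping', 'pos' and 'count'
are placeholders:

	/* sketch: fail fast instead of blocking when any page in the
	 * range is cached; kicking clean pages out / starting writeback,
	 * as suggested above, would slot in here as well */
	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (filemap_range_has_page(mapping, pos, pos + count - 1))
			return -EAGAIN;
	}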


Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-03-02 Thread Paolo Valente

> On 15 Feb 2017, at 19:04, Jens Axboe  wrote:
> 
> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>>> 
 On 10 Feb 2017, at 19:32, Omar Sandoval  wrote:
 
 From: Omar Sandoval 
 
 None of the other blk-mq elevator hooks are called with this lock held.
 Additionally, it can lead to circular locking dependencies between
 queue_lock and the private scheduler lock.
 
>>> 
>>> Hi Omar,
>>> I'm sorry but it seems that a new potential deadlock has shown up.
>>> See lockdep splat below.
>>> 
>>> I've tried to think about solutions other than turning back to
>>> deferring the body of exit_icq, but to no avail.
>> 
>> Looks like an interaction between bfqd->lock and q->queue_lock. Since the
>> core has no notion of your bfqd->lock, the natural dependency here
>> would be to nest bfqd->lock inside q->queue_lock. Is that possible for
>> you?
>> 
>> Looking at the code a bit, maybe it'd just be simpler to get rid of
>> holding the queue lock for that spot. For the mq scheduler, we really
>> don't want places where we invoke with that held already. Does the below
>> work for you?
> 
> Would need to remove one more lockdep assert. And only test this for
> the mq parts, we'd need to spread a bit of love on the classic
> scheduling icq exit path for this to work on that side.
> 


Jens,
here is the reply I anticipated in my previous email: after rebasing
against master, I'm getting again the deadlock that this patch of
yours solved (together with some changes in bfq-mq).  I thought you added a
sort of equivalent commit (now) to the mainline branch.  Am I wrong?

Thanks,
Paolo

> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index b12f9c87b4c3..546ff8f81ede 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
>   icq->flags |= ICQ_EXITED;
> }
> 
> -/* Release an icq.  Called with both ioc and q locked. */
> +/* Release an icq.  Called with ioc locked. */
> static void ioc_destroy_icq(struct io_cq *icq)
> {
>   struct io_context *ioc = icq->ioc;
> @@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
>   struct elevator_type *et = q->elevator->type;
> 
>   lockdep_assert_held(&ioc->lock);
> - lockdep_assert_held(q->queue_lock);
> 
>   radix_tree_delete(&ioc->icq_tree, icq->q->id);
>   hlist_del_init(&icq->ioc_node);
> @@ -222,25 +221,34 @@ void exit_io_context(struct task_struct *task)
>   put_io_context_active(ioc);
> }
> 
> +static void __ioc_clear_queue(struct list_head *icq_list)
> +{
> + while (!list_empty(icq_list)) {
> + struct io_cq *icq = list_entry(icq_list->next,
> +struct io_cq, q_node);
> + struct io_context *ioc = icq->ioc;
> +
> + spin_lock_irq(&ioc->lock);
> + ioc_destroy_icq(icq);
> + spin_unlock_irq(&ioc->lock);
> + }
> +}
> +
> /**
>  * ioc_clear_queue - break any ioc association with the specified queue
>  * @q: request_queue being cleared
>  *
> - * Walk @q->icq_list and exit all io_cq's.  Must be called with @q locked.
> + * Walk @q->icq_list and exit all io_cq's.
>  */
> void ioc_clear_queue(struct request_queue *q)
> {
> - lockdep_assert_held(q->queue_lock);
> + LIST_HEAD(icq_list);
> 
> - while (!list_empty(&q->icq_list)) {
> - struct io_cq *icq = list_entry(q->icq_list.next,
> -struct io_cq, q_node);
> - struct io_context *ioc = icq->ioc;
> + spin_lock_irq(q->queue_lock);
> + list_splice_init(&q->icq_list, &icq_list);
> + spin_unlock_irq(q->queue_lock);
> 
> - spin_lock(&ioc->lock);
> - ioc_destroy_icq(icq);
> - spin_unlock(&ioc->lock);
> - }
> + __ioc_clear_queue(&icq_list);
> }
> 
> int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 070d81bae1d5..1944aa1cb899 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -815,9 +815,7 @@ static void blk_release_queue(struct kobject *kobj)
>   blkcg_exit_queue(q);
> 
>   if (q->elevator) {
> - spin_lock_irq(q->queue_lock);
>   ioc_clear_queue(q);
> - spin_unlock_irq(q->queue_lock);
>   elevator_exit(q->elevator);
>   }
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index a25bdd90b270..aaa1e9836512 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -985,9 +985,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
>   if (old_registered)
>   elv_unregister_queue(q);
> 
> - spin_lock_irq(q->queue_lock);
>   ioc_clear_queue(q);
> - spin_unlock_irq(q->queue_lock);
>   }
> 
>   /* allocate, init