Re: [PATCH 11/19] bcache: Subtract dirty sectors of thin flash from cache_sectors in calculating writeback rate

2017-07-12 Thread Coly Li
On 2017/7/13 12:12 PM, Eric Wheeler wrote:
> On Tue, 11 Jul 2017, tang.jun...@zte.com.cn wrote:
> 
>>> Based on the above implementation, non-dirty space from flash only
>>> bcache device will mislead writeback rate calculation too. So I suggest
>>> to subtract all buckets size from all flash only bcache devices. Then it
>>> might be something like,
>>
>> what is non-dirty space from flash only bcache device?
>> Where is non-dirty space from flash only bcache device?
> 
> Hi Tang, Coly:
>
> Was there more discussion on this thread, or is the patch good to go?
> 
> Please send your ack if you're happy with it so I can queue it up.

I discussed this with Tang offline; the patch is correct. But the patch
commit log should be improved. I am helping to work on it now; it should
be done quite soon.

Coly

>>
>>
>> From: Coly Li 
>> To: Tang Junhui ,
>> Cc: bca...@lists.ewheeler.net, linux-block@vger.kernel.org, 
>> linux-bca...@vger.kernel.org,
>> h...@infradead.org, ax...@kernel.dk, sta...@vger.kernel.org
>> Date: 2017/07/11 02:11
>> Subject: Re: [PATCH 11/19] bcache: Subtract dirty sectors of thin flash 
>> from cache_sectors in calculating writeback rate
>> Sender: linux-bcache-ow...@vger.kernel.org
>>
>>
>>
>> On 2017/7/1 4:43 AM, bca...@lists.ewheeler.net wrote:
>>> From: Tang Junhui 
>>>
>>> Dirty sectors of a thin flash device cannot be used to cache data for the
>>> backing device, so we should subtract them when calculating the writeback rate.
>>>
>>
>> I see you want to get rid of the noise from flash only cache devices in the
>> writeback rate calculation. It makes sense, because a flash only cache
>> device won't have writeback happen at all.
>>
>>
>>> Signed-off-by: Tang Junhui 
>>> Cc: sta...@vger.kernel.org
>>> ---
>>>  drivers/md/bcache/writeback.c |  2 +-
>>>  drivers/md/bcache/writeback.h | 19 +++
>>>  2 files changed, 20 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>>> index 4ac8b13..25289e4 100644
>>> --- a/drivers/md/bcache/writeback.c
>>> +++ b/drivers/md/bcache/writeback.c
>>> @@ -21,7 +21,7 @@
>>>  static void __update_writeback_rate(struct cached_dev *dc)
>>>  {
>>>   struct cache_set *c = dc->disk.c;
>>> - uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size;
>>> + uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
>>> +   bcache_flash_devs_sectors_dirty(c);
>>
>> See flash_dev_run(), the flash volume is created per struct
>> bcache_device of a cache set. That means all data allocated for the
>> flash volume will be from a flash only bcache device. Regular dirty data
>> won't be mixed with flash volume dirty data on the same struct
>> bcache_device.
>>
>> Based on the above implementation, non-dirty space from flash only
>> bcache devices will mislead the writeback rate calculation too. So I suggest
>> subtracting the total bucket size of all flash only bcache devices. Then it
>> might be something like,
>>
>> uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
>>                          bcache_flash_devs_nbuckets(c);
>>
>>
>>
>> Just FYI. Thanks.
>>
>> Coly
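
Coly's bcache_flash_devs_nbuckets() is only named in prose above and does not exist in the tree. A rough sketch of the idea he describes -- subtract the full provisioned size of every flash-only volume rather than only its dirty sectors -- might look like the helper below, modeled on the bcache_flash_devs_sectors_dirty() helper in the patch. The helper name and the use of the uuid entry's sectors field as the volume size are assumptions for illustration only (the result is in sectors, so it can be subtracted from cache_sectors directly):

/*
 * Hypothetical helper following Coly's suggestion: subtract the full
 * provisioned size of every flash-only volume from the cache set's
 * capacity.  Mirrors the locking and iteration pattern of
 * bcache_flash_devs_sectors_dirty(); u->sectors as "size of the flash
 * only volume" is assumed here, this is not code from the posted patch.
 */
static inline uint64_t bcache_flash_devs_sectors(struct cache_set *c)
{
	uint64_t i, ret = 0;

	mutex_lock(&bch_register_lock);

	for (i = 0; i < c->nr_uuids; i++) {
		struct uuid_entry *u = &c->uuids[i];

		/* skip unused slots and ordinary backing devices */
		if (!c->devices[i] || !UUID_FLASH_ONLY(u))
			continue;
		ret += u->sectors;
	}

	mutex_unlock(&bch_register_lock);

	return ret;
}
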
>>
>>>   uint64_t cache_dirty_target =
>>>    div_u64(cache_sectors * dc->writeback_percent, 100);
>>>  
>>> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
>>> index c2ab4b4..24ff589 100644
>>> --- a/drivers/md/bcache/writeback.h
>>> +++ b/drivers/md/bcache/writeback.h
>>> @@ -14,6 +14,25 @@ static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d)
>>>   return ret;
>>>  }
>>>  
>>> +static inline uint64_t  bcache_flash_devs_sectors_dirty(struct cache_set *c)
>>> +{
>>> + uint64_t i, ret = 0;
>>> +
>>> + mutex_lock(&bch_register_lock);
>>> +
>>> + for (i = 0; i < c->nr_uuids; i++) {
>>> +  struct bcache_device *d = c->devices[i];
>>> +
>>> +  if (!d || !UUID_FLASH_ONLY(&c->uuids[i]))
>>> +   continue;
>>> +  ret += bcache_dev_sectors_dirty(d);
>>> + }
>>> +
>>> + mutex_unlock(&bch_register_lock);
>>> +
>>> + return ret;
>>> +}
>>> +
>>>  static inline unsigned offset_to_stripe(struct bcache_device *d,
>>>      uint64_t offset)
>>>  {
>>>

Re: [PATCH 12/19] bcache: update bucket_in_use periodically

2017-07-12 Thread Eric Wheeler
On Tue, 11 Jul 2017, Coly Li wrote:

> On 2017/7/11 1:39 PM, tang.jun...@zte.com.cn wrote:
> > Compared to bucket depletion, which results in a hang,
> > it is worth spending a little time to update bucket_in_use.
> > If you have any better solution, please show it to us;
> > we should solve this as soon as possible, not wait for it forever.
> > 
> 
> I also tested this patch on a cache device with 4x3.8TB size; iterating
> all buckets takes around 40-50ms. If the iteration needs to hold the
> bucket_lock of the cache set, it is very likely to introduce a huge I/O
> latency spike every 30 seconds.
> 
> For database people, this is not good news.


Hi Tang,
   
I'm waiting to queue this patch pending your response to Coly.  

Please send a v2 when you're ready.

Thanks!

--
Eric Wheeler

> 
> Coly
> 
> 
> > 
> > 
> > 
> > From: Coly Li 
> > To: linux-block@vger.kernel.org, Tang Junhui
> > ,
> > Cc: bca...@lists.ewheeler.net, linux-bca...@vger.kernel.org,
> > h...@infradead.org, ax...@kernel.dk
> > Date: 2017/07/11 13:06
> > Subject: Re: [PATCH 12/19] bcache: update bucket_in_use periodically
> > Sender: linux-bcache-ow...@vger.kernel.org
> > 
> > 
> > 
> > On 2017/7/1 4:43 AM, bca...@lists.ewheeler.net wrote:
> >> From: Tang Junhui 
> >>
> >> bucket_in_use is updated in the gc thread, which is triggered by invalidating
> >> or writing sectors_to_gc dirty data. That interval is too long, so when we
> >> compare it with the threshold the value is often stale, which leads
> >> to inaccurate judgment and often results in bucket depletion.
> >>
> >> Signed-off-by: Tang Junhui 
> >> ---
> >>  drivers/md/bcache/btree.c | 29 +++--
> >>  1 file changed, 27 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> >> index 866dcf7..77aa20b 100644
> >> --- a/drivers/md/bcache/btree.c
> >> +++ b/drivers/md/bcache/btree.c
> >> @@ -90,6 +90,8 @@
> >>  #define MAX_NEED_GC  64
> >>  #define MAX_SAVE_PRIO  72
> >>  
> >> +#define GC_THREAD_TIMEOUT_MS (30 * 1000)
> >> +
> >>  #define PTR_DIRTY_BIT  (((uint64_t) 1 << 36))
> >>  
> >>  #define PTR_HASH(c, k)  \
> >> @@ -1760,6 +1762,23 @@ static void bch_btree_gc(struct cache_set *c)
> >>   bch_moving_gc(c);
> >>  }
> >>  
> >> +void bch_update_bucket_in_use(struct cache_set *c)
> >> +{
> >> + struct cache *ca;
> >> + struct bucket *b;
> >> + unsigned i;
> >> + size_t available = 0;
> >> +
> >> + for_each_cache(ca, c, i) {
> >> +  for_each_bucket(b, ca) {
> >> +   if (!GC_MARK(b) || GC_MARK(b) == GC_MARK_RECLAIMABLE)
> >> +    available++;
> >> +  }
> >> + }
> >> +
> > 
> > bucket_lock of cache set should be held before accessing buckets.
> > 
> > 
> >> + c->gc_stats.in_use = (c->nbuckets - available) * 100 / c->nbuckets;
> >> +}
> >> +
> >>  static bool gc_should_run(struct cache_set *c)
> >>  {
> >>   struct cache *ca;
> >> @@ -1778,10 +1797,16 @@ static bool gc_should_run(struct cache_set *c)
> >>  static int bch_gc_thread(void *arg)
> >>  {
> >>   struct cache_set *c = arg;
> >> + long  ret;
> >> + unsigned long timeout = msecs_to_jiffies(GC_THREAD_TIMEOUT_MS);
> >>  
> >>   while (1) {
> >> -  wait_event_interruptible(c->gc_wait,
> >> -   kthread_should_stop() || gc_should_run(c));
> >> +  ret = wait_event_interruptible_timeout(c->gc_wait,
> >> +   kthread_should_stop() || gc_should_run(c), timeout);
> >> +  if (!ret) {
> >> +   bch_update_bucket_in_use(c);
> >> +   continue;
> > 
> > A continue here will ignore status returned from kthread_should_stop(),
> > which might not be expected behavior.
> > 
> > 
> >> +  }
> >>  
> >>if (kthread_should_stop())
> >> break;
> >>
> > 
> > Iterating all buckets from the cache set requires bucket_lock to be
> > held. 
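
For readability, here is a rough sketch of the gc thread loop with Coly's two review points folded in -- take the cache set's bucket_lock while iterating buckets, and re-check kthread_should_stop() before continuing. This only illustrates the review comments; it is not a posted v2 of the patch.

static int bch_gc_thread(void *arg)
{
	struct cache_set *c = arg;
	unsigned long timeout = msecs_to_jiffies(GC_THREAD_TIMEOUT_MS);
	long ret;

	while (1) {
		ret = wait_event_interruptible_timeout(c->gc_wait,
				kthread_should_stop() || gc_should_run(c),
				timeout);
		if (kthread_should_stop())
			break;

		if (!ret) {
			/* timed out: refresh the statistic under bucket_lock */
			mutex_lock(&c->bucket_lock);
			bch_update_bucket_in_use(c);
			mutex_unlock(&c->bucket_lock);
			continue;
		}

		set_gc_sectors(c);
		bch_btree_gc(c);
	}

	return 0;
}
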

Re: Re: [PATCH 11/19] bcache: Subtract dirty sectors of thin flash from cache_sectors in calculating writeback rate

2017-07-12 Thread Eric Wheeler
On Tue, 11 Jul 2017, tang.jun...@zte.com.cn wrote:

> > Based on the above implementation, non-dirty space from flash only
> > bcache device will mislead writeback rate calculation too. So I suggest
> > to subtract all buckets size from all flash only bcache devices. Then it
> > might be something like,
> 
> what is non-dirty space from flash only bcache device?
> Where is non-dirty space from flash only bcache device?

Hi Tang, Coly:
   
Was there more discussion on this thread, or is the patch good to go?

Please send your ack if you're happy with it so I can queue it up.

--
Eric Wheeler

> 
> 
> 
> From:         Coly Li 
> To:           Tang Junhui ,
> Cc:           bca...@lists.ewheeler.net, linux-block@vger.kernel.org, 
> linux-bca...@vger.kernel.org,
> h...@infradead.org, ax...@kernel.dk, sta...@vger.kernel.org
> Date:         2017/07/11 02:11
> Subject:      Re: [PATCH 11/19] bcache: Subtract dirty sectors of thin flash 
> from cache_sectors in calculating writeback rate
> Sender:       linux-bcache-ow...@vger.kernel.org
> 
> 
> 
> On 2017/7/1 4:43 AM, bca...@lists.ewheeler.net wrote:
> > From: Tang Junhui 
> >
> > Dirty sectors of a thin flash device cannot be used to cache data for the
> > backing device, so we should subtract them when calculating the writeback rate.
> >
> 
> I see you want to get rid of the noise from flash only cache devices in the
> writeback rate calculation. It makes sense, because a flash only cache
> device won't have writeback happen at all.
> 
> 
> > Signed-off-by: Tang Junhui 
> > Cc: sta...@vger.kernel.org
> > ---
> >  drivers/md/bcache/writeback.c |  2 +-
> >  drivers/md/bcache/writeback.h | 19 +++
> >  2 files changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> > index 4ac8b13..25289e4 100644
> > --- a/drivers/md/bcache/writeback.c
> > +++ b/drivers/md/bcache/writeback.c
> > @@ -21,7 +21,7 @@
> >  static void __update_writeback_rate(struct cached_dev *dc)
> >  {
> >                   struct cache_set *c = dc->disk.c;
> > -                 uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size;
> > +                 uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
> > +                                  bcache_flash_devs_sectors_dirty(c);
> 
> See flash_dev_run(), the flash volume is created per struct
> bcache_device of a cache set. That means all data allocated for the
> flash volume will be from a flash only bcache device. Regular dirty data
> won't be mixed with flash volume dirty data on the same struct
> bcache_device.
> 
> Based on the above implementation, non-dirty space from flash only
> bcache devices will mislead the writeback rate calculation too. So I suggest
> subtracting the total bucket size of all flash only bcache devices. Then it
> might be something like,
> 
> uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
>                          bcache_flash_devs_nbuckets(c);
> 
> 
> 
> Just FYI. Thanks.
> 
> Coly
> 
> >                   uint64_t cache_dirty_target =
> >                                    div_u64(cache_sectors * dc->writeback_percent, 100);
> >  
> > diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> > index c2ab4b4..24ff589 100644
> > --- a/drivers/md/bcache/writeback.h
> > +++ b/drivers/md/bcache/writeback.h
> > @@ -14,6 +14,25 @@ static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d)
> >                   return ret;
> >  }
> >  
> > +static inline uint64_t  bcache_flash_devs_sectors_dirty(struct cache_set *c)
> > +{
> > +                 uint64_t i, ret = 0;
> > +
> > +                 mutex_lock(&bch_register_lock);
> > +
> > +                 for (i = 0; i < c->nr_uuids; i++) {
> > +                                  struct bcache_device *d = c->devices[i];
> > +
> > +                                  if (!d || !UUID_FLASH_ONLY(&c->uuids[i]))
> > +                                                   continue;
> > +                                  ret += bcache_dev_sectors_dirty(d);
> > +                 }
> > +
> > +                 mutex_unlock(&bch_register_lock);
> > +
> > +                 return ret;
> > +}
> > +
> >  static inline unsigned offset_to_stripe(struct bcache_device *d,
> >                                          uint64_t offset)
> >  {
> >

Re: [PATCH 10/19] bcache: initialize stripe_sectors_dirty correctly for thin flash device

2017-07-12 Thread Eric Wheeler
On Sun, 2 Jul 2017, Coly Li wrote:

> On 2017/7/1 4:42 AM, bca...@lists.ewheeler.net wrote:
> > From: Tang Junhui 
> > 
> > Thin flash devices do not initialize stripe_sectors_dirty correctly; this
> > patch fixes the issue.
> 
> Hi Junhui,
> 
> Could you please explain why stripe_sectors_dirty is not correctly
> initialized and what its negative effect is?

Hi Tang,
   
I'm waiting to queue this patch pending your response to Coly.  Can you
update the message and send a v2?

--
Eric Wheeler



> 
> Thanks.
> 
> Coly
> 
> > 
> > Signed-off-by: Tang Junhui 
> > Cc: sta...@vger.kernel.org
> > ---
> >  drivers/md/bcache/super.c | 3 ++-
> >  drivers/md/bcache/writeback.c | 8 
> >  drivers/md/bcache/writeback.h | 2 +-
> >  3 files changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> > index 1f84791..e06641e 100644
> > --- a/drivers/md/bcache/super.c
> > +++ b/drivers/md/bcache/super.c
> > @@ -1030,7 +1030,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, 
> > struct cache_set *c)
> > }
> >  
> > if (BDEV_STATE(&dc->sb) == BDEV_STATE_DIRTY) {
> > -   bch_sectors_dirty_init(dc);
> > +   bch_sectors_dirty_init(&dc->disk);
> > atomic_set(>has_dirty, 1);
> > atomic_inc(>count);
> > bch_writeback_queue(dc);
> > @@ -1232,6 +1232,7 @@ static int flash_dev_run(struct cache_set *c, struct 
> > uuid_entry *u)
> > goto err;
> >  
> > bcache_device_attach(d, c, u - c->uuids);
> > +   bch_sectors_dirty_init(d);
> > bch_flash_dev_request_init(d);
> > add_disk(d->disk);
> >  
> > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> > index 3d463f0..4ac8b13 100644
> > --- a/drivers/md/bcache/writeback.c
> > +++ b/drivers/md/bcache/writeback.c
> > @@ -482,17 +482,17 @@ static int sectors_dirty_init_fn(struct btree_op 
> > *_op, struct btree *b,
> > return MAP_CONTINUE;
> >  }
> >  
> > -void bch_sectors_dirty_init(struct cached_dev *dc)
> > +void bch_sectors_dirty_init(struct bcache_device *d)
> >  {
> > struct sectors_dirty_init op;
> >  
> > bch_btree_op_init(&op.op, -1);
> > -   op.inode = dc->disk.id;
> > +   op.inode = d->id;
> >  
> > -   bch_btree_map_keys(&op.op, dc->disk.c, &KEY(op.inode, 0, 0),
> > +   bch_btree_map_keys(&op.op, d->c, &KEY(op.inode, 0, 0),
> >    sectors_dirty_init_fn, 0);
> >  
> > -   dc->disk.sectors_dirty_last = bcache_dev_sectors_dirty(&dc->disk);
> > +   d->sectors_dirty_last = bcache_dev_sectors_dirty(d);
> >  }
> >  
> >  void bch_cached_dev_writeback_init(struct cached_dev *dc)
> > diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> > index ea2f92e..c2ab4b4 100644
> > --- a/drivers/md/bcache/writeback.h
> > +++ b/drivers/md/bcache/writeback.h
> > @@ -109,7 +109,7 @@ static inline void bch_writeback_add(struct cached_dev *dc)
> >  
> >  void bcache_dev_sectors_dirty_add(struct cache_set *, unsigned, uint64_t, int);
> >  
> > -void bch_sectors_dirty_init(struct cached_dev *dc);
> > +void bch_sectors_dirty_init(struct bcache_device *);
> >  void bch_cached_dev_writeback_init(struct cached_dev *);
> >  int bch_cached_dev_writeback_start(struct cached_dev *);
> 
> 

Re: [PATCH 04/19] bcache: fix wrong cache_misses statistics

2017-07-12 Thread Eric Wheeler
On Sun, 2 Jul 2017, Coly Li wrote:

> On 2017/7/1 4:42 AM, bca...@lists.ewheeler.net wrote:
> > From: Tang Junhui 
> > 
> > Some missed I/Os are not counted in cache_misses; this patch fixes the
> > issue.
> 
> Could you please explain more about:
> - which kind of missed I/O is not counted
> - where cache_missed is located
> 
> This will help the patch to be more understandable.

Hi Tang,

I'm waiting to queue this patch pending your response to Coly.  Can you
update the message and send a v2?

--
Eric Wheeler



> 
> > 
> > Signed-off-by: tang.junhui 
> > Reviewed-by: Eric Wheeler 
> > Cc: sta...@vger.kernel.org
> 
> [snip]
> 
> > @@ -758,7 +760,7 @@ static void cached_dev_read_done_bh(struct closure *cl)
> > struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
> >  
> > bch_mark_cache_accounting(s->iop.c, s->d,
> > - !s->cache_miss, s->iop.bypass);
> > + !s->cache_missed, s->iop.bypass);
> > trace_bcache_read(s->orig_bio, !s->cache_miss, s->iop.bypass);
> 
> 
> Should the above line be changed to,
>   trace_bcache_read(s->orig_bio, !s->cache_missed, s->iop.bypass);
> as well ?
> 
> 
> [snip]
> 
> Thanks.
> 
> -- 
> Coly Li
> 

Re: [PATCH] [PATCH v2] bcache: fix calling ida_simple_remove() with incorrect minor

2017-07-12 Thread Eric Wheeler
Tang,

Please resend.  This patch seems to be malformed.

--
Eric Wheeler

On Thu, 6 Jul 2017, tang.jun...@zte.com.cn wrote:

> From: Tang Junhui 
> 
> bcache called ida_simple_remove() with a minor that had already been
> multiplied by BCACHE_MINORS, which caused the wrong id to be released and leaked ids.
> 
> In addition, when adding partition support to bcache, the name assignment
> was not updated, resulting in numbers jumping (bcache0, bcache16,
> bcache32...). This has been fixed implicitly by the rework.
> 
> Signed-off-by: tang.junhui 
> Reviewed-by: Coly Li 
> Reviewed-by: Eric Wheeler 
> Cc: sta...@vger.kernel.org # 4.10
> Cc: Stefan Bader 
> Fixes: b8c0d91 (bcache: partition support: add 16 minors per bcacheN device)
> BugLink: https://bugs.launchpad.net/bugs/1667078
> ---
>  drivers/md/bcache/bcache.h | 12 
>  drivers/md/bcache/super.c  | 10 --
>  2 files changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index c3ea03c..54f9075 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -212,6 +212,10 @@ BITMASK(GC_MARK,         struct bucket, gc_mark, 0, 2);
>  #define MAX_GC_SECTORS_USED        (~(~0ULL << GC_SECTORS_USED_SIZE))
>  BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, GC_SECTORS_USED_SIZE);
>  BITMASK(GC_MOVE, struct bucket, gc_mark, 15, 1);
> +#define BCACHE_MINORS_BITS                4 /* bcache partition support */
> +#define BCACHE_MINORS                     (1 << BCACHE_MINORS_BITS)
> +#define BCACHE_MINOR_TO_IDA                                0
> +#define IDA_MINOR_TO_BCACHE                                1
>  
>  #include "journal.h"
>  #include "stats.h"
> @@ -751,6 +755,14 @@ static inline bool ptr_available(struct cache_set *c, 
> const struct bkey *k,
>          return (PTR_DEV(k, i) < MAX_CACHES_PER_SET) && PTR_CACHE(c, k, i);
>  }
>  
> +static inline int convert_minor(int minor, int style)
> +{
> +        if(style == BCACHE_MINOR_TO_IDA)
> +                return (minor) >> BCACHE_MINORS_BITS;
> +        else
> +                return  (minor) << BCACHE_MINORS_BITS;
> +}
> +
>  /* Btree key macros */
>  
>  /*
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 3a19cbc..6485afe 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -57,8 +57,7 @@ static DEFINE_IDA(bcache_minor);
>  static wait_queue_head_t unregister_wait;
>  struct workqueue_struct *bcache_wq;
>  
> -#define BTREE_MAX_PAGES                (256 * 1024 / PAGE_SIZE)
> -#define BCACHE_MINORS                16 /* partition support */
> +#define BTREE_MAX_PAGES                          (256 * 1024 / PAGE_SIZE)
>  
>  /* Superblock */
>  
> @@ -734,7 +733,7 @@ static void bcache_device_free(struct bcache_device *d)
>          if (d->disk && d->disk->queue)
>                  blk_cleanup_queue(d->disk->queue);
>          if (d->disk) {
> -                ida_simple_remove(&bcache_minor, d->disk->first_minor);
> +                ida_simple_remove(&bcache_minor, convert_minor(d->disk->first_minor, BCACHE_MINOR_TO_IDA));
>                  put_disk(d->disk);
>          }
>  
> @@ -780,11 +779,10 @@ static int bcache_device_init(struct bcache_device *d, 
> unsigned block_size,
>          if (!d->full_dirty_stripes)
>                  return -ENOMEM;
>  
> -        minor = ida_simple_get(&bcache_minor, 0, MINORMASK + 1, GFP_KERNEL);
> +        minor = ida_simple_get(&bcache_minor, 0, convert_minor(MINORMASK, BCACHE_MINOR_TO_IDA) + 1, GFP_KERNEL);
>          if (minor < 0)
>                  return minor;
>  
> -        minor *= BCACHE_MINORS;
>  
>          if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
>              !(d->disk = alloc_disk(BCACHE_MINORS))) {
> @@ -796,7 +794,7 @@ static int bcache_device_init(struct bcache_device *d, 
> unsigned block_size,
>          snprintf(d->disk->disk_name, DISK_NAME_LEN, "bcache%i", minor);
>  
>          d->disk->major                = bcache_major;
> -        d->disk->first_minor        = minor;
> +        d->disk->first_minor        = convert_minor(minor, IDA_MINOR_TO_BCACHE);
>          d->disk->fops                = &bcache_ops;
>          d->disk->private_data        = d;
>  
> --
> 2.8.1.windows.1
> 
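
The core of the bug is easier to see in isolation. A stripped-down sketch of the before/after behaviour, with BCACHE_MINORS_BITS = 4 as in the patch above -- purely illustrative, not code from the tree:

/*
 * Illustration only: each ida slot covers BCACHE_MINORS (1 << 4) device
 * minors, so the value stored in disk->first_minor is the slot shifted
 * left by BCACHE_MINORS_BITS.
 */
static int bcache_minor_example(void)
{
	int slot, first_minor;

	slot = ida_simple_get(&bcache_minor, 0, MINORMASK + 1, GFP_KERNEL);
	if (slot < 0)
		return slot;

	first_minor = slot << BCACHE_MINORS_BITS;	/* slot 1 -> minor 16 */

	/*
	 * The buggy teardown released the scaled value:
	 *     ida_simple_remove(&bcache_minor, first_minor);
	 * which frees slot 16 instead of slot 1, so slot 1 leaks forever.
	 * The fix converts the minor back to its ida slot first:
	 */
	ida_simple_remove(&bcache_minor, first_minor >> BCACHE_MINORS_BITS);

	return 0;
}
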

Re: [PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()

2017-07-12 Thread NeilBrown
On Thu, Jul 13 2017, Ming Lei wrote:

> On Thu, Jul 13, 2017 at 10:01:33AM +1000, NeilBrown wrote:
>> On Wed, Jul 12 2017, Ming Lei wrote:
>> 
>> > We will support multipage bvec soon, so initialize the bvec
>> > table using the standard way instead of writing the
>> > table directly. Otherwise it won't work any more once
>> > multipage bvec is enabled.
>> >
>> > Acked-by: Guoqing Jiang 
>> > Signed-off-by: Ming Lei 
>> > ---
>> >  drivers/md/md.c | 21 +
>> >  drivers/md/md.h |  3 +++
>> >  drivers/md/raid1.c  | 16 ++--
>> >  drivers/md/raid10.c |  4 ++--
>> >  4 files changed, 28 insertions(+), 16 deletions(-)
>> >
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > index 8cdca0296749..cc8dcd928dde 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -9130,6 +9130,27 @@ void md_reload_sb(struct mddev *mddev, int nr)
>> >  }
>> >  EXPORT_SYMBOL(md_reload_sb);
>> >  
>> > +/* generally called after bio_reset() for resetting bvec */
>> > +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
>> > + int size)
>> > +{
>> > +  int idx = 0;
>> > +
>> > +  /* initialize bvec table again */
>> > +  do {
>> > +  struct page *page = resync_fetch_page(rp, idx);
>> > +  int len = min_t(int, size, PAGE_SIZE);
>> > +
>> > +  /*
>> > +   * won't fail because the vec table is big
>> > +   * enough to hold all these pages
>> > +   */
>> > +  bio_add_page(bio, page, len, 0);
>> > +  size -= len;
>> > +  } while (idx++ < RESYNC_PAGES && size > 0);
>> > +}
>> > +EXPORT_SYMBOL(md_bio_reset_resync_pages);
>> 
>> I really don't think this is a good idea.
>> This code is specific to raid1/raid10.  It is not generic md code.  So
>> it doesn't belong here.
>
> We already added 'struct resync_pages' in drivers/md/md.h, so I think
> it is reasonable to add this function into drivers/md/md.c

Alternative perspective: it was unreasonable to add "resync_pages" to
md.h.

>
>> 
>> If you want to remove code duplication, then work on moving all raid1
>> functionality into raid10.c, then discard raid1.c
>
> This patch is for avoiding new code duplication, not for removing current
> duplication.
>
>> 
>> Or at the very least, have a separate "raid1-10.c" file for the common
>> code.
>
> You suggested it last time, but it looked like overkill. But if
> someone wants to refactor raid1 and raid10, I think it can be a good start;
> it still doesn't belong in this patch though.

You are trying to create common code for raid1 and raid10.  This does
not belong in md.c.
If you really want to have a single copy of common code, then it is exactly
the role of this patch to create a place to put it.
I'm not saying you should put all common code in raid1-10.c.  Just the
function that you have identified.

NeilBrown

>
> Thanks,
> Ming
>
>> 
>> NeilBrown
>> 
>> > +
>> >  #ifndef MODULE
>> >  
>> >  /*
>> > diff --git a/drivers/md/md.h b/drivers/md/md.h
>> > index 2c780aa8d07f..efb32ce7a2f1 100644
>> > --- a/drivers/md/md.h
>> > +++ b/drivers/md/md.h
>> > @@ -782,4 +782,7 @@ static inline struct page *resync_fetch_page(struct 
>> > resync_pages *rp,
>> >return NULL;
>> >return rp->pages[idx];
>> >  }
>> > +
>> > +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
>> > + int size);
>> >  #endif /* _MD_MD_H */
>> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> > index 7901ddc3362f..5dc3fda2fdf7 100644
>> > --- a/drivers/md/raid1.c
>> > +++ b/drivers/md/raid1.c
>> > @@ -2085,10 +2085,7 @@ static void process_checks(struct r1bio *r1_bio)
>> >/* Fix variable parts of all bios */
>> >vcnt = (r1_bio->sectors + PAGE_SIZE / 512 - 1) >> (PAGE_SHIFT - 9);
>> >for (i = 0; i < conf->raid_disks * 2; i++) {
>> > -  int j;
>> > -  int size;
>> >blk_status_t status;
>> > -  struct bio_vec *bi;
>> >struct bio *b = r1_bio->bios[i];
>> >struct resync_pages *rp = get_resync_pages(b);
>> >if (b->bi_end_io != end_sync_read)
>> > @@ -2097,8 +2094,6 @@ static void process_checks(struct r1bio *r1_bio)
>> >status = b->bi_status;
>> >bio_reset(b);
>> >b->bi_status = status;
>> > -  b->bi_vcnt = vcnt;
>> > -  b->bi_iter.bi_size = r1_bio->sectors << 9;
>> >b->bi_iter.bi_sector = r1_bio->sector +
>> >conf->mirrors[i].rdev->data_offset;
>> >b->bi_bdev = conf->mirrors[i].rdev->bdev;
>> > @@ -2106,15 +2101,8 @@ static void process_checks(struct r1bio *r1_bio)
>> >rp->raid_bio = r1_bio;
>> >b->bi_private = rp;
>> >  
>> > -  size = b->bi_iter.bi_size;
>> > -  bio_for_each_segment_all(bi, b, j) {
>> > -  bi->bv_offset = 0;
>> > -  if (size > PAGE_SIZE)
>> > -  

Re: [PATCH 1/2] md: remove 'idx' from 'struct resync_pages'

2017-07-12 Thread Ming Lei
On Thu, Jul 13, 2017 at 09:58:41AM +1000, NeilBrown wrote:
> On Wed, Jul 12 2017, Ming Lei wrote:
> 
> > bio_add_page() won't fail for a resync bio, and the page index for each
> > bio is the same, so remove it.
> >
> > More importantly, the 'idx' of 'struct resync_pages' is initialized in the
> > mempool allocator function; this is wrong since mempool is only
> > responsible for allocation, and we can't use it for initialization.
> >
> > Suggested-by: NeilBrown 
> > Reported-by: NeilBrown 
> > Fixes: f0250618361d(md: raid10: don't use bio's vec table to manage resync 
> > pages)
> > Fixes: 98d30c5812c3(md: raid1: don't use bio's vec table to manage resync 
> > pages)
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/md/md.h | 1 -
> >  drivers/md/raid1.c  | 6 +++---
> >  drivers/md/raid10.c | 6 +++---
> >  3 files changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/md/md.h b/drivers/md/md.h
> > index 991f0fe2dcc6..2c780aa8d07f 100644
> > --- a/drivers/md/md.h
> > +++ b/drivers/md/md.h
> > @@ -736,7 +736,6 @@ static inline void mddev_check_write_zeroes(struct 
> > mddev *mddev, struct bio *bio
> >  
> >  /* for managing resync I/O pages */
> >  struct resync_pages {
> > -   unsignedidx;/* for get/put page from the pool */
> > void*raid_bio;
> > struct page *pages[RESYNC_PAGES];
> >  };
> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > index 3febfc8391fb..7901ddc3362f 100644
> > --- a/drivers/md/raid1.c
> > +++ b/drivers/md/raid1.c
> > @@ -170,7 +170,6 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void 
> > *data)
> > resync_get_all_pages(rp);
> > }
> >  
> > -   rp->idx = 0;
> > rp->raid_bio = r1_bio;
> > bio->bi_private = rp;
> > }
> > @@ -2619,6 +2618,7 @@ static sector_t raid1_sync_request(struct mddev 
> > *mddev, sector_t sector_nr,
> > int good_sectors = RESYNC_SECTORS;
> > int min_bad = 0; /* number of sectors that are bad in all devices */
> > int idx = sector_to_idx(sector_nr);
> > +   int page_idx = 0;
> >  
> > if (!conf->r1buf_pool)
> > if (init_resync(conf))
> > @@ -2846,7 +2846,7 @@ static sector_t raid1_sync_request(struct mddev 
> > *mddev, sector_t sector_nr,
> > bio = r1_bio->bios[i];
> > rp = get_resync_pages(bio);
> > if (bio->bi_end_io) {
> > -   page = resync_fetch_page(rp, rp->idx++);
> > +   page = resync_fetch_page(rp, page_idx);
> >  
> > /*
> >  * won't fail because the vec table is big
> > @@ -2858,7 +2858,7 @@ static sector_t raid1_sync_request(struct mddev 
> > *mddev, sector_t sector_nr,
> > nr_sectors += len>>9;
> > sector_nr += len>>9;
> > sync_blocks -= (len>>9);
> > -   } while (get_resync_pages(r1_bio->bios[disk]->bi_private)->idx < 
> > RESYNC_PAGES);
> > +   } while (page_idx++ < RESYNC_PAGES);
> 
> I think you want ++page_idx < RESYNC_PAGES, otherwise there will be
> one pass through the loop where page_idx == RESYNC_PAGES

Good catch, thanks!

thanks, 
Ming
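
NeilBrown's point about the loop condition is the classic post-increment off-by-one: with `page_idx++ < RESYNC_PAGES` the body still runs once after page_idx has reached RESYNC_PAGES, so the last pass would fetch a page index one past the end of the table. A tiny standalone illustration (user-space C, RESYNC_PAGES shrunk to 4 for readability):

#include <stdio.h>

#define RESYNC_PAGES 4

int main(void)
{
	int page_idx, post = 0, pre = 0;

	/* post-increment: body runs for idx 0..RESYNC_PAGES -> 5 iterations */
	page_idx = 0;
	do {
		post++;
	} while (page_idx++ < RESYNC_PAGES);

	/* pre-increment: body runs for idx 0..RESYNC_PAGES-1 -> 4 iterations */
	page_idx = 0;
	do {
		pre++;
	} while (++page_idx < RESYNC_PAGES);

	printf("post-increment: %d iterations, pre-increment: %d iterations\n",
	       post, pre);
	return 0;
}
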


Re: [PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()

2017-07-12 Thread Ming Lei
On Wed, Jul 12, 2017 at 02:16:08AM -0700, Christoph Hellwig wrote:
> On Wed, Jul 12, 2017 at 04:29:12PM +0800, Ming Lei wrote:
> > We will support multipage bvec soon, so initialize the bvec
> > table using the standard way instead of writing the
> > table directly. Otherwise it won't work any more once
> > multipage bvec is enabled.
> 
> It seems to me like these callsites also should use bio_init
> instead of bio_reset to get a clean slate (and eventually
> phase out bio_reset), which should probably go into
> the helper as well.  E.g. instead of pretending to preserve
> things we should simply build a new bio here.

At least for resetting the bvec table, bio_reset() is enough, so
I think changing to bio_init() belongs in another topic, :-)

Thanks,
Ming


Re: [PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()

2017-07-12 Thread Ming Lei
On Thu, Jul 13, 2017 at 10:01:33AM +1000, NeilBrown wrote:
> On Wed, Jul 12 2017, Ming Lei wrote:
> 
> > We will support multipage bvec soon, so initialize the bvec
> > table using the standard way instead of writing the
> > table directly. Otherwise it won't work any more once
> > multipage bvec is enabled.
> >
> > Acked-by: Guoqing Jiang 
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/md/md.c | 21 +
> >  drivers/md/md.h |  3 +++
> >  drivers/md/raid1.c  | 16 ++--
> >  drivers/md/raid10.c |  4 ++--
> >  4 files changed, 28 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 8cdca0296749..cc8dcd928dde 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -9130,6 +9130,27 @@ void md_reload_sb(struct mddev *mddev, int nr)
> >  }
> >  EXPORT_SYMBOL(md_reload_sb);
> >  
> > +/* generally called after bio_reset() for resetting bvec */
> > +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
> > +  int size)
> > +{
> > +   int idx = 0;
> > +
> > +   /* initialize bvec table again */
> > +   do {
> > +   struct page *page = resync_fetch_page(rp, idx);
> > +   int len = min_t(int, size, PAGE_SIZE);
> > +
> > +   /*
> > +* won't fail because the vec table is big
> > +* enough to hold all these pages
> > +*/
> > +   bio_add_page(bio, page, len, 0);
> > +   size -= len;
> > +   } while (idx++ < RESYNC_PAGES && size > 0);
> > +}
> > +EXPORT_SYMBOL(md_bio_reset_resync_pages);
> 
> I really don't think this is a good idea.
> This code is specific to raid1/raid10.  It is not generic md code.  So
> it doesn't belong here.

We already added 'struct resync_pages' in drivers/md/md.h, so I think
it is reasonable to add this function into drivers/md/md.c

> 
> If you want to remove code duplication, then work on moving all raid1
> functionality into raid10.c, then discard raid1.c

This patch is for avoiding new code duplication, not for removing current
duplication.

> 
> Or at the very least, have a separate "raid1-10.c" file for the common
> code.

You suggested it last time, but it looked like overkill. But if
someone wants to refactor raid1 and raid10, I think it can be a good start;
it still doesn't belong in this patch though.

Thanks,
Ming

> 
> NeilBrown
> 
> > +
> >  #ifndef MODULE
> >  
> >  /*
> > diff --git a/drivers/md/md.h b/drivers/md/md.h
> > index 2c780aa8d07f..efb32ce7a2f1 100644
> > --- a/drivers/md/md.h
> > +++ b/drivers/md/md.h
> > @@ -782,4 +782,7 @@ static inline struct page *resync_fetch_page(struct 
> > resync_pages *rp,
> > return NULL;
> > return rp->pages[idx];
> >  }
> > +
> > +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
> > +  int size);
> >  #endif /* _MD_MD_H */
> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > index 7901ddc3362f..5dc3fda2fdf7 100644
> > --- a/drivers/md/raid1.c
> > +++ b/drivers/md/raid1.c
> > @@ -2085,10 +2085,7 @@ static void process_checks(struct r1bio *r1_bio)
> > /* Fix variable parts of all bios */
> > vcnt = (r1_bio->sectors + PAGE_SIZE / 512 - 1) >> (PAGE_SHIFT - 9);
> > for (i = 0; i < conf->raid_disks * 2; i++) {
> > -   int j;
> > -   int size;
> > blk_status_t status;
> > -   struct bio_vec *bi;
> > struct bio *b = r1_bio->bios[i];
> > struct resync_pages *rp = get_resync_pages(b);
> > if (b->bi_end_io != end_sync_read)
> > @@ -2097,8 +2094,6 @@ static void process_checks(struct r1bio *r1_bio)
> > status = b->bi_status;
> > bio_reset(b);
> > b->bi_status = status;
> > -   b->bi_vcnt = vcnt;
> > -   b->bi_iter.bi_size = r1_bio->sectors << 9;
> > b->bi_iter.bi_sector = r1_bio->sector +
> > conf->mirrors[i].rdev->data_offset;
> > b->bi_bdev = conf->mirrors[i].rdev->bdev;
> > @@ -2106,15 +2101,8 @@ static void process_checks(struct r1bio *r1_bio)
> > rp->raid_bio = r1_bio;
> > b->bi_private = rp;
> >  
> > -   size = b->bi_iter.bi_size;
> > -   bio_for_each_segment_all(bi, b, j) {
> > -   bi->bv_offset = 0;
> > -   if (size > PAGE_SIZE)
> > -   bi->bv_len = PAGE_SIZE;
> > -   else
> > -   bi->bv_len = size;
> > -   size -= PAGE_SIZE;
> > -   }
> > +   /* initialize bvec table again */
> > +   md_bio_reset_resync_pages(b, rp, r1_bio->sectors << 9);
> > }
> > for (primary = 0; primary < conf->raid_disks * 2; primary++)
> > if (r1_bio->bios[primary]->bi_end_io == end_sync_read &&
> > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> > index e594ca610f27..cb8e803cd1c2 

Re: [PATCH v3 05/14] md: raid1: don't use bio's vec table to manage resync pages

2017-07-12 Thread Ming Lei
On Wed, Jul 12, 2017 at 09:30:50AM -0700, Shaohua Li wrote:
> On Wed, Jul 12, 2017 at 09:40:10AM +0800, Ming Lei wrote:
> > On Tue, Jul 11, 2017 at 7:14 AM, NeilBrown  wrote:
> > > On Mon, Jul 10 2017, Shaohua Li wrote:
> > >
> > >> On Mon, Jul 10, 2017 at 03:25:41PM +0800, Ming Lei wrote:
> > >>> On Mon, Jul 10, 2017 at 02:38:19PM +1000, NeilBrown wrote:
> > >>> > On Mon, Jul 10 2017, Ming Lei wrote:
> > >>> >
> > >>> > > On Mon, Jul 10, 2017 at 11:35:12AM +0800, Ming Lei wrote:
> > >>> > >> On Mon, Jul 10, 2017 at 7:09 AM, NeilBrown  wrote:
> > >>> > ...
> > >>> > >> >> +
> > >>> > >> >> + rp->idx = 0;
> > >>> > >> >
> > >>> > >> > This is the only place the ->idx is initialized, in 
> > >>> > >> > r1buf_pool_alloc().
> > >>> > >> > The mempool alloc function is suppose to allocate memory, not 
> > >>> > >> > initialize
> > >>> > >> > it.
> > >>> > >> >
> > >>> > >> > If the mempool_alloc() call cannot allocate memory it will use 
> > >>> > >> > memory
> > >>> > >> > from the pool.  If this memory has already been used, then it 
> > >>> > >> > will no
> > >>> > >> > longer have the initialized value.
> > >>> > >> >
> > >>> > >> > In short: you need to initialise memory *after* calling
> > >>> > >> > mempool_alloc(), unless you ensure it is reset to the init 
> > >>> > >> > values before
> > >>> > >> > calling mempool_free().
> > >>> > >> >
> > >>> > >> > https://bugzilla.kernel.org/show_bug.cgi?id=196307
> > >>> > >>
> > >>> > >> OK, thanks for posting it out.
> > >>> > >>
> > >>> > >> Another fix might be to reinitialize the variable(rp->idx = 0) in
> > >>> > >> r1buf_pool_free().
> > >>> > >> Or just set it as zero every time when it is used.
> > >>> > >>
> > >>> > >> But I don't understand why mempool_free() calls pool->free() at
> > >>> > >> the end of this function, which may cause pool->free() to run on a
> > >>> > >> newly allocated buf; seems like a bug in mempool?
> > >>> > >
> > >>> > > Looks I missed the 'return' in mempool_free(), so it is fine.
> > >>> > >
> > >>> > > How about the following fix?
> > >>> >
> > >>> > It looks like it would probably work, but it is rather unusual to
> > >>> > initialise something just before freeing it.
> > >>> >
> > >>> > Couldn't you just move the initialization to shortly after the
> > >>> > mempool_alloc() call?  There looks like a good place that already
> > >>> > loops over all the bios.
> > >>>
> > >>> OK, follows the revised patch according to your suggestion.
> > >
> > > Thanks.
> > >
> > > That isn't as tidy as I hoped.  So I went deeper into the code to try to
> > > understand why...
> > >
> > > I think that maybe we should just discard the ->idx field completely.
> > > It is only used in this code:
> > >
> > > do {
> > > struct page *page;
> > > int len = PAGE_SIZE;
> > > if (sector_nr + (len>>9) > max_sector)
> > > len = (max_sector - sector_nr) << 9;
> > > if (len == 0)
> > > break;
> > > for (bio= biolist ; bio ; bio=bio->bi_next) {
> > > struct resync_pages *rp = get_resync_pages(bio);
> > > page = resync_fetch_page(rp, rp->idx++);
> > > /*
> > >  * won't fail because the vec table is big enough
> > >  * to hold all these pages
> > >  */
> > > bio_add_page(bio, page, len, 0);
> > > }
> > > nr_sectors += len>>9;
> > > sector_nr += len>>9;
> > > } while (get_resync_pages(biolist)->idx < RESYNC_PAGES);
> > >
> > > and all of the different 'rp' always have the same value for 'idx'.
> > > This code is more complex than it needs to be.  This is because it used
> > > to be possible for bio_add_page() to fail.  That cannot happen any more.
> > > So we can make the code something like:
> > >
> > >   for (idx = 0; idx < RESYNC_PAGES; idx++) {
> > >  struct page *page;
> > >  int len = PAGE_SIZE;
> > >  if (sector_nr + (len >> 9) > max_sector)
> > >  len = (max_sector - sector_nr) << 9
> > >  if (len == 0)
> > >  break;
> > >  for (bio = biolist; bio; bio = bio->bi_next) {
> > > struct resync_pages *rp = get_resync_pages(bio);
> > > page = resync_fetch_page(rp, idx);
> > > bio_add_page(bio, page, len, 0);
> > >  }
> > >  nr_sectors += len >> 9;
> > >  sector_nr += len >> 9;
> > >   }
> > >
> > > Or did I miss something?
> > 
> > I think this approach is much cleaner.
> 
> I thought I suggested not using the 'idx' in your previous post, but you said
> there was a reason (not because of bio_add_page) not to do it. Has that changed?
> I can't remember the details; I need to dig through the mail archives.

I found it:

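The mempool rule NeilBrown describes in this thread is general: the allocator callback only has to produce memory, and an element handed back out of the pool's reserve still carries whatever state it had when it was freed. A minimal sketch of the safe pattern -- initialize after mempool_alloc(), not in the alloc callback -- with a made-up element type for illustration:

#include <linux/mempool.h>
#include <linux/slab.h>

struct resync_elem {
	int idx;
	void *raid_bio;
};

/* mempool_alloc_t callback: only allocates, must not be relied on for init */
static void *elem_alloc(gfp_t gfp_mask, void *pool_data)
{
	return kmalloc(sizeof(struct resync_elem), gfp_mask);
}

static void elem_free(void *element, void *pool_data)
{
	kfree(element);
}

/* pool = mempool_create(16, elem_alloc, elem_free, NULL); */

static struct resync_elem *get_elem(mempool_t *pool, void *raid_bio)
{
	struct resync_elem *e = mempool_alloc(pool, GFP_NOIO);

	if (!e)
		return NULL;

	/*
	 * Initialize *after* allocation: if this element came from the
	 * pool's reserve it still holds the values it had at mempool_free()
	 * time, so initialization done in the alloc callback cannot be
	 * trusted.
	 */
	e->idx = 0;
	e->raid_bio = raid_bio;
	return e;
}
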

Re: [PATCH] bcache: only recovery I/O error for writethrough mode

2017-07-12 Thread Eric Wheeler
On Wed, 12 Jul 2017, Coly Li wrote:

> On 2017/7/12 10:01 AM, tang.jun...@zte.com.cn wrote:
> >>I meant "it is very necessary for data base applications which always
> >>use *writeback* mode and not switch to other mode during all their
> >>online time."  ^_^
> > 
> > I know it is necessary, but not enough. Who can promise they will not
> > switch modes while online? This patch is logically imperfect.
> 
> Yes, I agree with you. Since Eric mentioned the dirty data map, an improved
> fix comes to mind:
> 
> When cache device disconnected from system,
> 0) If in non mode, do nothing.

Does non mode guarantee that nothing is dirty?  I'm not convinced of that.  
I think you can set non mode with dirty blocks. (Correct me if I'm wrong 
here.)

> 1) If in writeback/writethough/writearound mode, and dirty map is clean,
>- switch to non mode
>- continue to handle I/O without cache device

Sure, that makes sense.  

> 2) If in writeback mode, and dirty map is not clean,

You would want to do a dirty map lookup for each IO.  How about this:

2) If in _any_ mode, and dirty map is dirty for *that specific block*:

  If WRITE request completely overlaps the dirty segment then clear 
the dirty flag and pass through to the backing dev.
  otherwise:
- return -EIO immediately for WRITE request
- return -EIO immediately for READ request (*)

If the WRITE request completely overlaps the dirty segment as indicated 
from the in-memory metadata, then clear its dirty flag and write to the 
backing device.  Whatever was dirty isn't important anymore as it was 
overwritten.

Unless there is a good reason to diverge, we would want this recovery 
logic would be the same for failed IOs from an existing cachedev (eg, with 
badblocks), and for cachedevs that are altogether missing.


> 3) If not in writeback mode, and dirty map is not clean. It means the
> cache mode is switched from writeback mode with dirty data lost, then
>- returns -EIO immediately for WRITE request
>- returns -EIO immediately for READ request (*)

For #2,3, do a dirty map lookup for every IO: if the block is clean, pass 
it to the backing device.  Only -EIO if the request cannot be recovered 
(block is dirty) and invoke pr_crit_once() to notify the user.  We want 
all IO requests to succeed to the extent possible.

I think #3 is the same case as #2.  The logic is the same whether its is 
now or ever was in writeback mode, regardless of the current mode.
 
> (*) NOTE:
> A sysfs entry "recovery_io_error" can be add here, which is disabled as
> default. If it is enabled, if a READ request does not hit dirty map,
> bcache will provide it from backing device.

Resilience first!  This should default on.  

Sometimes systems run for months with bad sectors, and this case is no 
different.  Let the bcache users' IOs succeed if possible but notify them 
with pr_crit_once().

A reboot might lose data.  They could be lucky with important data in the 
page cache; notifying them without killing the device because it is dirty 
might give them a chance to do a hot backup before rebooting (or 
stop/starting bcache).

Since the dirty map is still in memory, that information is 
useful for recovery.  After a reboot the dirty map is lost---and with it 
the data about what is consistent and what is not. 

For example, if LVM snapshots sit atop of the bcache volume, then you 
could `dd` them off.  If you hit an error, you know that copy is at least 
partially inconsistent and can try an older snapshot until one is found 
which is old enough to be 100% consistent.  Without the dirty map, you 
would only be guessing at which volume is actually consistent.

Let users set recovery_io_error=0 for those who really want to fail early.

--
Eric Wheeler
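
Sketched in pseudo-kernel C, the per-request policy Eric describes above -- look up the in-memory dirty map for every request once the cache device is gone, pass clean blocks through, clear and pass fully-overlapping writes, otherwise fail according to a recovery_io_error knob -- might look roughly like this. The helper names are invented for illustration; this is not code from bcache.

/*
 * Illustration of the proposed recovery policy after the cache device has
 * disappeared with dirty data.  bcache_region_dirty(),
 * bcache_write_covers_dirty(), bcache_clear_dirty() and
 * submit_to_backing_dev() are hypothetical helpers over the in-memory
 * dirty map; recovery_io_error mirrors the suggested sysfs knob
 * (default on).
 */
static blk_status_t degraded_cache_submit(struct bcache_device *d,
					  struct bio *bio,
					  bool recovery_io_error)
{
	sector_t start = bio->bi_iter.bi_sector;
	unsigned int sectors = bio_sectors(bio);

	if (!bcache_region_dirty(d, start, sectors)) {
		/* Block was never dirty in the cache: serve it from the backing dev. */
		if (!recovery_io_error)
			return BLK_STS_IOERR;
		return submit_to_backing_dev(d, bio);
	}

	if (bio_data_dir(bio) == WRITE &&
	    bcache_write_covers_dirty(d, start, sectors)) {
		/* Whole dirty range is overwritten: the stale data no longer matters. */
		bcache_clear_dirty(d, start, sectors);
		return submit_to_backing_dev(d, bio);
	}

	/* The only copy of this data was on the lost cache device: fail loudly. */
	pr_crit_once("bcache: dirty data lost with cache device, failing I/O\n");
	return BLK_STS_IOERR;
}
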

Re: [PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()

2017-07-12 Thread NeilBrown
On Wed, Jul 12 2017, Ming Lei wrote:

> We will support multipage bvec soon, so initialize the bvec
> table using the standard way instead of writing the
> table directly. Otherwise it won't work any more once
> multipage bvec is enabled.
>
> Acked-by: Guoqing Jiang 
> Signed-off-by: Ming Lei 
> ---
>  drivers/md/md.c | 21 +
>  drivers/md/md.h |  3 +++
>  drivers/md/raid1.c  | 16 ++--
>  drivers/md/raid10.c |  4 ++--
>  4 files changed, 28 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 8cdca0296749..cc8dcd928dde 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9130,6 +9130,27 @@ void md_reload_sb(struct mddev *mddev, int nr)
>  }
>  EXPORT_SYMBOL(md_reload_sb);
>  
> +/* generally called after bio_reset() for resetting bvec */
> +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
> +int size)
> +{
> + int idx = 0;
> +
> + /* initialize bvec table again */
> + do {
> + struct page *page = resync_fetch_page(rp, idx);
> + int len = min_t(int, size, PAGE_SIZE);
> +
> + /*
> +  * won't fail because the vec table is big
> +  * enough to hold all these pages
> +  */
> + bio_add_page(bio, page, len, 0);
> + size -= len;
> + } while (idx++ < RESYNC_PAGES && size > 0);
> +}
> +EXPORT_SYMBOL(md_bio_reset_resync_pages);

I really don't think this is a good idea.
This code is specific to raid1/raid10.  It is not generic md code.  So
it doesn't belong here.

If you want to remove code duplication, then work on moving all raid1
functionality into raid10.c, then discard raid1.c

Or at the very least, have a separate "raid1-10.c" file for the common
code.

NeilBrown

> +
>  #ifndef MODULE
>  
>  /*
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 2c780aa8d07f..efb32ce7a2f1 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -782,4 +782,7 @@ static inline struct page *resync_fetch_page(struct 
> resync_pages *rp,
>   return NULL;
>   return rp->pages[idx];
>  }
> +
> +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
> +int size);
>  #endif /* _MD_MD_H */
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7901ddc3362f..5dc3fda2fdf7 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2085,10 +2085,7 @@ static void process_checks(struct r1bio *r1_bio)
>   /* Fix variable parts of all bios */
>   vcnt = (r1_bio->sectors + PAGE_SIZE / 512 - 1) >> (PAGE_SHIFT - 9);
>   for (i = 0; i < conf->raid_disks * 2; i++) {
> - int j;
> - int size;
>   blk_status_t status;
> - struct bio_vec *bi;
>   struct bio *b = r1_bio->bios[i];
>   struct resync_pages *rp = get_resync_pages(b);
>   if (b->bi_end_io != end_sync_read)
> @@ -2097,8 +2094,6 @@ static void process_checks(struct r1bio *r1_bio)
>   status = b->bi_status;
>   bio_reset(b);
>   b->bi_status = status;
> - b->bi_vcnt = vcnt;
> - b->bi_iter.bi_size = r1_bio->sectors << 9;
>   b->bi_iter.bi_sector = r1_bio->sector +
>   conf->mirrors[i].rdev->data_offset;
>   b->bi_bdev = conf->mirrors[i].rdev->bdev;
> @@ -2106,15 +2101,8 @@ static void process_checks(struct r1bio *r1_bio)
>   rp->raid_bio = r1_bio;
>   b->bi_private = rp;
>  
> - size = b->bi_iter.bi_size;
> - bio_for_each_segment_all(bi, b, j) {
> - bi->bv_offset = 0;
> - if (size > PAGE_SIZE)
> - bi->bv_len = PAGE_SIZE;
> - else
> - bi->bv_len = size;
> - size -= PAGE_SIZE;
> - }
> + /* initialize bvec table again */
> + md_bio_reset_resync_pages(b, rp, r1_bio->sectors << 9);
>   }
>   for (primary = 0; primary < conf->raid_disks * 2; primary++)
>   if (r1_bio->bios[primary]->bi_end_io == end_sync_read &&
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index e594ca610f27..cb8e803cd1c2 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2086,8 +2086,8 @@ static void sync_request_write(struct mddev *mddev, 
> struct r10bio *r10_bio)
>   rp = get_resync_pages(tbio);
>   bio_reset(tbio);
>  
> - tbio->bi_vcnt = vcnt;
> - tbio->bi_iter.bi_size = fbio->bi_iter.bi_size;
> + md_bio_reset_resync_pages(tbio, rp, fbio->bi_iter.bi_size);
> +
>   rp->raid_bio = r10_bio;
>   tbio->bi_private = rp;
>   tbio->bi_iter.bi_sector = r10_bio->devs[i].addr;
> -- 
> 2.9.4



Re: [PATCH 1/2] md: remove 'idx' from 'struct resync_pages'

2017-07-12 Thread NeilBrown
On Wed, Jul 12 2017, Ming Lei wrote:

> bio_add_page() won't fail for a resync bio, and the page index for each
> bio is the same, so remove it.
>
> More importantly, the 'idx' of 'struct resync_pages' is initialized in the
> mempool allocator function; this is wrong since mempool is only
> responsible for allocation, and we can't use it for initialization.
>
> Suggested-by: NeilBrown 
> Reported-by: NeilBrown 
> Fixes: f0250618361d(md: raid10: don't use bio's vec table to manage resync 
> pages)
> Fixes: 98d30c5812c3(md: raid1: don't use bio's vec table to manage resync 
> pages)
> Signed-off-by: Ming Lei 
> ---
>  drivers/md/md.h | 1 -
>  drivers/md/raid1.c  | 6 +++---
>  drivers/md/raid10.c | 6 +++---
>  3 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 991f0fe2dcc6..2c780aa8d07f 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -736,7 +736,6 @@ static inline void mddev_check_write_zeroes(struct mddev 
> *mddev, struct bio *bio
>  
>  /* for managing resync I/O pages */
>  struct resync_pages {
> - unsignedidx;/* for get/put page from the pool */
>   void*raid_bio;
>   struct page *pages[RESYNC_PAGES];
>  };
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 3febfc8391fb..7901ddc3362f 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -170,7 +170,6 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void 
> *data)
>   resync_get_all_pages(rp);
>   }
>  
> - rp->idx = 0;
>   rp->raid_bio = r1_bio;
>   bio->bi_private = rp;
>   }
> @@ -2619,6 +2618,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, 
> sector_t sector_nr,
>   int good_sectors = RESYNC_SECTORS;
>   int min_bad = 0; /* number of sectors that are bad in all devices */
>   int idx = sector_to_idx(sector_nr);
> + int page_idx = 0;
>  
>   if (!conf->r1buf_pool)
>   if (init_resync(conf))
> @@ -2846,7 +2846,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, 
> sector_t sector_nr,
>   bio = r1_bio->bios[i];
>   rp = get_resync_pages(bio);
>   if (bio->bi_end_io) {
> - page = resync_fetch_page(rp, rp->idx++);
> + page = resync_fetch_page(rp, page_idx);
>  
>   /*
>* won't fail because the vec table is big
> @@ -2858,7 +2858,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, 
> sector_t sector_nr,
>   nr_sectors += len>>9;
>   sector_nr += len>>9;
>   sync_blocks -= (len>>9);
> - } while (get_resync_pages(r1_bio->bios[disk]->bi_private)->idx < 
> RESYNC_PAGES);
> + } while (page_idx++ < RESYNC_PAGES);

I think you want ++page_idx < RESYNC_PAGES, otherwise there will be
one pass through the loop where page_idx == RESYNC_PAGES

>  
>   r1_bio->sectors = nr_sectors;
>  
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 5026e7ad51d3..e594ca610f27 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -221,7 +221,6 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void 
> *data)
>   resync_get_all_pages(rp);
>   }
>  
> - rp->idx = 0;
>   rp->raid_bio = r10_bio;
>   bio->bi_private = rp;
>   if (rbio) {
> @@ -2853,6 +2852,7 @@ static sector_t raid10_sync_request(struct mddev 
> *mddev, sector_t sector_nr,
>   sector_t sectors_skipped = 0;
>   int chunks_skipped = 0;
>   sector_t chunk_mask = conf->geo.chunk_mask;
> + int page_idx = 0;
>  
>   if (!conf->r10buf_pool)
>   if (init_resync(conf))
> @@ -3355,7 +3355,7 @@ static sector_t raid10_sync_request(struct mddev 
> *mddev, sector_t sector_nr,
>   break;
>   for (bio= biolist ; bio ; bio=bio->bi_next) {
>   struct resync_pages *rp = get_resync_pages(bio);
> - page = resync_fetch_page(rp, rp->idx++);
> + page = resync_fetch_page(rp, page_idx);
>   /*
>* won't fail because the vec table is big enough
>* to hold all these pages
> @@ -3364,7 +3364,7 @@ static sector_t raid10_sync_request(struct mddev 
> *mddev, sector_t sector_nr,
>   }
>   nr_sectors += len>>9;
>   sector_nr += len>>9;
> - } while (get_resync_pages(biolist)->idx < RESYNC_PAGES);
> + } while (page_idx++ < RESYNC_PAGES);

Same problem here.

Otherwise, the patch looks good.

NeilBrown

>   r10_bio->sectors = nr_sectors;
>  
>   while (biolist) {
> -- 
> 2.9.4




Re: [PATCH] elevator: allow specifying elevator by config boot param for blk-mq devices

2017-07-12 Thread Jens Axboe
On 07/12/2017 12:57 PM, Alex Ivanov wrote:
> It now makes sense to use the elevator boot argument when blk-mq is in use,
> since there are now a number of schedulers for it (deadline, kyber, bfq, none).

No, that boot option was a mistake, let's not propagate that to mq
scheduling as well.

-- 
Jens Axboe



Re: [PATCH V5 00/11] blktrace: output cgroup info

2017-07-12 Thread Jens Axboe
On 07/12/2017 12:49 PM, Shaohua Li wrote:
> From: Shaohua Li 
> 
> Hi,
> 
> Currently blktrace isn't cgroup aware. blktrace prints out the task name of the
> current context, but the task of the current context isn't always in the cgroup
> where the BIO comes from, so we can't use the task name to find the I/O cgroup.
> For example, writeback BIOs always come from the flusher thread, but the BIOs
> are for different blk cgroups. Requests can be requeued and dispatched from
> completely different tasks. MD/DM are other examples. This brings challenges if
> we want to use blktrace for performance tuning with cgroups enabled.
> 
> This patchset tries to fix the gap. We print out cgroup fhandle info in blktrace.
> Userspace can use the open_by_handle_at() syscall to find the cgroup by fhandle,
> or use the name_to_handle_at() syscall to find the fhandle for a cgroup and
> use a BPF program to filter blktrace for a specific cgroup.
> 
> The first 6 patches add export operation handlers for kernfs, so userspace can
> use open_by_handle_at()/name_to_handle_at() on a kernfs file. Later patches make
> blktrace output cgroup info.

Added for 4.14, thanks Shaohua.

-- 
Jens Axboe
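
For context, the userspace side the cover letter refers to is the ordinary file-handle syscall pair. A minimal user-space sketch of grabbing an fhandle for a cgroup directory with name_to_handle_at(); the cgroup path is only an example, and matching the handle bytes against what blktrace emits is left out:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *cgpath = "/sys/fs/cgroup/blkio/mygroup";   /* example path */
	struct file_handle *fh;
	int mount_id;

	/* file_handle is variable sized; MAX_HANDLE_SZ bounds the payload */
	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return 1;
	fh->handle_bytes = MAX_HANDLE_SZ;

	if (name_to_handle_at(AT_FDCWD, cgpath, fh, &mount_id, 0) < 0) {
		perror("name_to_handle_at");
		return 1;
	}

	printf("handle_type=%d handle_bytes=%u\n",
	       fh->handle_type, fh->handle_bytes);
	for (unsigned int i = 0; i < fh->handle_bytes; i++)
		printf("%02x", fh->f_handle[i]);
	printf("\n");

	free(fh);
	return 0;
}
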



[PATCH] elevator: allow specifying elevator by config boot param for blk-mq devices

2017-07-12 Thread Alex Ivanov
It now makes sense to use the elevator boot argument when blk-mq is in use,
since there are now a number of schedulers for it (deadline, kyber, bfq, none).

Signed-off-by: Alex Ivanov 
---
 block/elevator.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 4bb2f0c93fa6..ef6b27f13dfd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -208,12 +208,12 @@ int elevator_init(struct request_queue *q, char *name)
}
 
/*
-* Use the default elevator specified by config boot param for
-* non-mq devices, or by config option. Don't try to load modules
+* Use the default elevator specified by config boot param
+* or by config option. Don't try to load modules
 * as we could be running off async and request_module() isn't
 * allowed from async.
 */
-   if (!e && !q->mq_ops && *chosen_elevator) {
+   if (!e && *chosen_elevator) {
e = elevator_get(chosen_elevator, false);
if (!e)
printk(KERN_ERR "I/O scheduler %s not found\n",
-- 
2.13.2



[PATCH V5 04/11] kernfs: don't set dentry->d_fsdata

2017-07-12 Thread Shaohua Li
From: Shaohua Li 

When working on adding exportfs operations to kernfs, I found it's hard
to initialize dentry->d_fsdata in the exportfs operations. It looks like there
is no way to do it without a race condition. Looking at the kernfs code
closely, there is no point in setting dentry->d_fsdata: inode->i_private
already points to the kernfs_node, and we can get the inode from a dentry. So
this patch just deletes the d_fsdata usage.

Acked-by: Tejun Heo 
Acked-by: Greg Kroah-Hartman 
Signed-off-by: Shaohua Li 
---
 fs/kernfs/dir.c | 25 +
 fs/kernfs/file.c|  6 +++---
 fs/kernfs/inode.c   |  6 +++---
 fs/kernfs/kernfs-internal.h |  7 +++
 fs/kernfs/mount.c   |  8 ++--
 fs/kernfs/symlink.c |  6 +++---
 6 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 7be37c8..b61a7ef 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -566,7 +566,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
if (d_really_is_negative(dentry))
goto out_bad_unlocked;
 
-   kn = dentry->d_fsdata;
+   kn = kernfs_dentry_node(dentry);
mutex_lock(&kernfs_mutex);
 
/* The kernfs node has been deactivated */
@@ -574,7 +574,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
goto out_bad;
 
/* The kernfs node has been moved? */
-   if (dentry->d_parent->d_fsdata != kn->parent)
+   if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
goto out_bad;
 
/* The kernfs node has been renamed */
@@ -594,14 +594,8 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
return 0;
 }
 
-static void kernfs_dop_release(struct dentry *dentry)
-{
-   kernfs_put(dentry->d_fsdata);
-}
-
 const struct dentry_operations kernfs_dops = {
.d_revalidate   = kernfs_dop_revalidate,
-   .d_release  = kernfs_dop_release,
 };
 
 /**
@@ -617,8 +611,9 @@ const struct dentry_operations kernfs_dops = {
  */
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
 {
-   if (dentry->d_sb->s_op == &kernfs_sops)
-   return dentry->d_fsdata;
+   if (dentry->d_sb->s_op == &kernfs_sops &&
+   !d_really_is_negative(dentry))
+   return kernfs_dentry_node(dentry);
return NULL;
 }
 
@@ -1056,7 +1051,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
unsigned int flags)
 {
struct dentry *ret;
-   struct kernfs_node *parent = dentry->d_parent->d_fsdata;
+   struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
struct inode *inode;
const void *ns = NULL;
@@ -1073,8 +1068,6 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
ret = NULL;
goto out_unlock;
}
-   kernfs_get(kn);
-   dentry->d_fsdata = kn;
 
/* attach dentry and inode */
inode = kernfs_get_inode(dir->i_sb, kn);
@@ -,7 +1104,7 @@ static int kernfs_iop_mkdir(struct inode *dir, struct 
dentry *dentry,
 
 static int kernfs_iop_rmdir(struct inode *dir, struct dentry *dentry)
 {
-   struct kernfs_node *kn  = dentry->d_fsdata;
+   struct kernfs_node *kn  = kernfs_dentry_node(dentry);
struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
int ret;
 
@@ -1131,7 +1124,7 @@ static int kernfs_iop_rename(struct inode *old_dir, 
struct dentry *old_dentry,
 struct inode *new_dir, struct dentry *new_dentry,
 unsigned int flags)
 {
-   struct kernfs_node *kn  = old_dentry->d_fsdata;
+   struct kernfs_node *kn = kernfs_dentry_node(old_dentry);
struct kernfs_node *new_parent = new_dir->i_private;
struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
int ret;
@@ -1644,7 +1637,7 @@ static struct kernfs_node *kernfs_dir_next_pos(const void 
*ns,
 static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 {
struct dentry *dentry = file->f_path.dentry;
-   struct kernfs_node *parent = dentry->d_fsdata;
+   struct kernfs_node *parent = kernfs_dentry_node(dentry);
struct kernfs_node *pos = file->private_data;
const void *ns = NULL;
 
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index ac2dfe0..7f90d4d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -616,7 +616,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 
 static int kernfs_fop_open(struct inode *inode, struct file *file)
 {
-   struct kernfs_node *kn = file->f_path.dentry->d_fsdata;
+   struct kernfs_node *kn = inode->i_private;
struct kernfs_root *root = kernfs_root(kn);
const struct kernfs_ops *ops;
struct kernfs_open_file *of;
@@ -768,7 +768,7 @@ 

[PATCH V5 01/11] kernfs: use idr instead of ida to manage inode number

2017-07-12 Thread Shaohua Li
From: Shaohua Li 

kernfs uses an ida to manage inode numbers. The problem is that we can't
map an inode number back to its kernfs_node with an ida. Switch to an idr;
the next patch will add an API to get a kernfs_node from an inode number.
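
The reason an idr helps is that it stores a pointer per id, so the id can be
resolved back to the node. A rough sketch of the kind of lookup the next
patch can then build (the real helper also has to take a reference and filter
stale nodes, which this sketch skips):

	/* sketch only -- not the actual helper added later in the series */
	static struct kernfs_node *lookup_node_by_ino(struct kernfs_root *root,
						      unsigned int ino)
	{
		struct kernfs_node *kn;

		rcu_read_lock();
		kn = idr_find(&root->ino_idr, ino);
		rcu_read_unlock();
		return kn;	/* no reference held in this sketch */
	}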

Acked-by: Tejun Heo 
Acked-by: Greg Kroah-Hartman 
Signed-off-by: Shaohua Li 
---
 fs/kernfs/dir.c| 17 -
 include/linux/kernfs.h |  2 +-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index db5900aaa..8ad7a17 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -21,6 +21,7 @@
 DEFINE_MUTEX(kernfs_mutex);
 static DEFINE_SPINLOCK(kernfs_rename_lock);/* kn->parent and ->name */
 static char kernfs_pr_cont_buf[PATH_MAX];  /* protected by rename_lock */
+static DEFINE_SPINLOCK(kernfs_idr_lock);   /* root->ino_idr */
 
 #define rb_to_kn(X) rb_entry((X), struct kernfs_node, rb)
 
@@ -533,7 +534,9 @@ void kernfs_put(struct kernfs_node *kn)
simple_xattrs_free(&kn->iattr->xattrs);
}
kfree(kn->iattr);
-   ida_simple_remove(&root->ino_ida, kn->ino);
+   spin_lock(&kernfs_idr_lock);
+   idr_remove(&root->ino_idr, kn->ino);
+   spin_unlock(&kernfs_idr_lock);
kmem_cache_free(kernfs_node_cache, kn);
 
kn = parent;
@@ -542,7 +545,7 @@ void kernfs_put(struct kernfs_node *kn)
goto repeat;
} else {
/* just released the root kn, free @root too */
-   ida_destroy(&root->ino_ida);
+   idr_destroy(&root->ino_idr);
kfree(root);
}
 }
@@ -630,7 +633,11 @@ static struct kernfs_node *__kernfs_new_node(struct 
kernfs_root *root,
if (!kn)
goto err_out1;
 
-   ret = ida_simple_get(&root->ino_ida, 1, 0, GFP_KERNEL);
+   idr_preload(GFP_KERNEL);
+   spin_lock(&kernfs_idr_lock);
+   ret = idr_alloc(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
+   spin_unlock(&kernfs_idr_lock);
+   idr_preload_end();
if (ret < 0)
goto err_out2;
kn->ino = ret;
@@ -875,13 +882,13 @@ struct kernfs_root *kernfs_create_root(struct 
kernfs_syscall_ops *scops,
if (!root)
return ERR_PTR(-ENOMEM);
 
-   ida_init(&root->ino_ida);
+   idr_init(&root->ino_idr);
INIT_LIST_HEAD(&root->supers);
 
kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
   KERNFS_DIR);
if (!kn) {
-   ida_destroy(&root->ino_ida);
+   idr_destroy(&root->ino_idr);
kfree(root);
return ERR_PTR(-ENOMEM);
}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index a9b11b8..5f5d602 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -163,7 +163,7 @@ struct kernfs_root {
unsigned int flags;  /* KERNFS_ROOT_* flags */
 
/* private fields, do not use outside kernfs proper */
-   struct ida  ino_ida;
+   struct idr  ino_idr;
struct kernfs_syscall_ops *syscall_ops;
 
/* list of kernfs_super_info of this root, protected by kernfs_mutex */
-- 
2.9.3



[PATCH V5 10/11] blktrace: add an option to allow displaying cgroup path

2017-07-12 Thread Shaohua Li
From: Shaohua Li 

By default we output the cgroup id in blktrace. This adds an option to
display the cgroup path instead. Since getting the cgroup path is a relatively
heavy operation, we don't enable it by default.

With the option enabled, blktrace will output something like this:
dd-1353  [007] d..2   293.015252:   8,0   /test/level  D   R 24 + 8 [dd]
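
A hedged example of turning the new output on, assuming the usual ftrace
tracer-option interface and a debugfs mount at /sys/kernel/debug (the option
files only appear while the blk tracer is active):

    # echo blk > /sys/kernel/debug/tracing/current_tracer
    # echo 1 > /sys/kernel/debug/tracing/options/blk_cgroup
    # echo 1 > /sys/kernel/debug/tracing/options/blk_cgname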

Signed-off-by: Shaohua Li 
---
 fs/kernfs/mount.c   | 19 +++
 include/linux/cgroup.h  |  6 ++
 include/linux/kernfs.h  |  2 ++
 kernel/cgroup/cgroup.c  | 12 
 kernel/trace/blktrace.c | 14 +-
 5 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index fa32358..7c452f4 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -65,6 +65,25 @@ const struct super_operations kernfs_sops = {
.show_path  = kernfs_sop_show_path,
 };
 
+/*
+ * Similar to kernfs_fh_get_inode, this one gets kernfs node from inode
+ * number and generation
+ */
+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+   const union kernfs_node_id *id)
+{
+   struct kernfs_node *kn;
+
+   kn = kernfs_find_and_get_node_by_ino(root, id->ino);
+   if (!kn)
+   return NULL;
+   if (kn->id.generation != id->generation) {
+   kernfs_put(kn);
+   return NULL;
+   }
+   return kn;
+}
+
 static struct inode *kernfs_fh_get_inode(struct super_block *sb,
u64 ino, u32 generation)
 {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 52ef9a6..6144fe9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -613,6 +613,9 @@ static inline union kernfs_node_id 
*cgroup_get_kernfs_id(struct cgroup *cgrp)
 {
return &cgrp->kn->id;
 }
+
+void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
+   char *buf, size_t buflen);
 #else /* !CONFIG_CGROUPS */
 
 struct cgroup_subsys_state;
@@ -645,6 +648,9 @@ static inline bool task_under_cgroup_hierarchy(struct 
task_struct *task,
 {
return true;
 }
+
+static inline void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
+   char *buf, size_t buflen) {}
 #endif /* !CONFIG_CGROUPS */
 
 /*
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index d149361..ab25c8b 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -358,6 +358,8 @@ struct super_block *kernfs_pin_sb(struct kernfs_root *root, 
const void *ns);
 
 void kernfs_init(void);
 
+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+   const union kernfs_node_id *id);
 #else  /* CONFIG_KERNFS */
 
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8f88aac..c7fae93 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4705,6 +4705,18 @@ static int __init cgroup_wq_init(void)
 }
 core_initcall(cgroup_wq_init);
 
+void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
+   char *buf, size_t buflen)
+{
+   struct kernfs_node *kn;
+
+   kn = kernfs_get_node_by_id(cgrp_dfl_root.kf_root, id);
+   if (!kn)
+   return;
+   kernfs_path(kn, buf, buflen);
+   kernfs_put(kn);
+}
+
 /*
  * proc_cgroup_show()
  *  - Print task's cgroup paths into seq_file, one line for each hierarchy
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index f393d7a..e90974e 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -48,12 +48,14 @@ static __cacheline_aligned_in_smp 
DEFINE_SPINLOCK(running_trace_lock);
 /* Select an alternative, minimalistic output than the original one */
 #define TRACE_BLK_OPT_CLASSIC  0x1
 #define TRACE_BLK_OPT_CGROUP   0x2
+#define TRACE_BLK_OPT_CGNAME   0x4
 
 static struct tracer_opt blk_tracer_opts[] = {
/* Default disable the minimalistic output */
{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
 #ifdef CONFIG_BLK_CGROUP
{ TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
+   { TRACER_OPT(blk_cgname, TRACE_BLK_OPT_CGNAME) },
 #endif
{ }
 };
@@ -1213,7 +1215,17 @@ static void blk_log_action(struct trace_iterator *iter, 
const char *act,
if (has_cg) {
const union kernfs_node_id *id = cgid_start(iter->ent);
 
-   trace_seq_printf(&iter->seq, "%3d,%-3d %x,%-x %2s %3s ",
+   if (blk_tracer_flags.val & TRACE_BLK_OPT_CGNAME) {
+   char blkcg_name_buf[NAME_MAX + 1] = "<...>";
+
+   cgroup_path_from_kernfs_id(id, blkcg_name_buf,
+   sizeof(blkcg_name_buf));
+   trace_seq_printf(&iter->seq, "%3d,%-3d %s %2s %3s ",
+MAJOR(t->device), MINOR(t->device),
+blkcg_name_buf, act, rwbs);
+   } else
+   trace_seq_printf(&iter->seq,
+ 

[PATCH V5 02/11] kernfs: implement i_generation

2017-07-12 Thread Shaohua Li
From: Shaohua Li 

Set i_generation for kernfs inodes. This is required to implement
exportfs operations. The generation is 32-bit, so it's possible the
generation wraps around and we find stale files. To reduce the possibility,
we don't reuse inode numbers immediately. When the inode number allocation
wraps, we increase the generation number. In this way generation/inode
number form a 64-bit number which is unlikely to be duplicated. This
does make the idr tree more sparse and waste some memory. Since idr
manages 32-bit keys, idr uses a 6-level radix tree, each level covering 6
bits of the key. In a 100k-inode kernfs, the worst case will have around
300k radix tree nodes. Each node is 576 bytes, so the tree will use about
~150M of memory. That doesn't sound too bad; if this really becomes a
problem, we should find a better data structure.

Acked-by: Tejun Heo 
Acked-by: Greg Kroah-Hartman 
Signed-off-by: Shaohua Li 
---
 fs/kernfs/dir.c| 10 +-
 fs/kernfs/inode.c  |  1 +
 include/linux/kernfs.h |  2 ++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8ad7a17..33f711f 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -623,6 +623,8 @@ static struct kernfs_node *__kernfs_new_node(struct 
kernfs_root *root,
 unsigned flags)
 {
struct kernfs_node *kn;
+   u32 gen;
+   int cursor;
int ret;
 
name = kstrdup_const(name, GFP_KERNEL);
@@ -635,12 +637,17 @@ static struct kernfs_node *__kernfs_new_node(struct 
kernfs_root *root,
 
idr_preload(GFP_KERNEL);
spin_lock(&kernfs_idr_lock);
-   ret = idr_alloc(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
+   cursor = idr_get_cursor(&root->ino_idr);
+   ret = idr_alloc_cyclic(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
+   if (ret >= 0 && ret < cursor)
+   root->next_generation++;
+   gen = root->next_generation;
spin_unlock(&kernfs_idr_lock);
idr_preload_end();
if (ret < 0)
goto err_out2;
kn->ino = ret;
+   kn->generation = gen;
 
atomic_set(&kn->count, 1);
atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
@@ -884,6 +891,7 @@ struct kernfs_root *kernfs_create_root(struct 
kernfs_syscall_ops *scops,
 
idr_init(&root->ino_idr);
INIT_LIST_HEAD(&root->supers);
+   root->next_generation = 1;
 
kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
   KERNFS_DIR);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index fb4b4a7..79cdae4 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -220,6 +220,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, 
struct inode *inode)
inode->i_private = kn;
inode->i_mapping->a_ops = &kernfs_aops;
inode->i_op = &kernfs_iops;
+   inode->i_generation = kn->generation;
 
set_default_inode_attr(inode, kn->mode);
kernfs_refresh_inode(kn, inode);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5f5d602..8c00d28 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -135,6 +135,7 @@ struct kernfs_node {
umode_t mode;
unsigned int ino;
struct kernfs_iattrs *iattr;
+   u32 generation;
 };
 
 /*
@@ -164,6 +165,7 @@ struct kernfs_root {
 
/* private fields, do not use outside kernfs proper */
struct idr  ino_idr;
+   u32 next_generation;
struct kernfs_syscall_ops *syscall_ops;
 
/* list of kernfs_super_info of this root, protected by kernfs_mutex */
-- 
2.9.3



[PATCH V5 06/11] kernfs: add exportfs operations

2017-07-12 Thread Shaohua Li
From: Shaohua Li 

Now we have the facilities to implement exportfs operations. The idea is
that cgroup can export the fhandle info to userspace, and userspace then uses
the fhandle to find the cgroup name. Another example is that userspace can get
an fhandle for a cgroup and BPF can use the fhandle to filter info for that
cgroup.
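
As a rough illustration of the userspace side (not from the patch; the cgroup
path and the minimal error handling are assumptions for the example), getting
an fhandle for a cgroup directory could look like:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		struct file_handle *fh;
		int mount_id;

		/* MAX_HANDLE_SZ is the upper bound glibc exposes for handle_bytes */
		fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
		fh->handle_bytes = MAX_HANDLE_SZ;

		/* "/sys/fs/cgroup/unified/test" is just an assumed example path */
		if (name_to_handle_at(AT_FDCWD, "/sys/fs/cgroup/unified/test",
				      fh, &mount_id, 0)) {
			perror("name_to_handle_at");
			return 1;
		}
		printf("handle: %u bytes, type %d\n",
		       fh->handle_bytes, fh->handle_type);
		free(fh);
		return 0;
	}

Resolving a handle back with open_by_handle_at() would additionally need
CAP_DAC_READ_SEARCH.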

Acked-by: Greg Kroah-Hartman 
Signed-off-by: Shaohua Li 
---
 fs/kernfs/mount.c  | 56 ++
 include/linux/kernfs.h | 12 +++
 kernel/cgroup/cgroup.c |  3 ++-
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index acd5426..fa32358 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kernfs-internal.h"
 
@@ -64,6 +65,59 @@ const struct super_operations kernfs_sops = {
.show_path  = kernfs_sop_show_path,
 };
 
+static struct inode *kernfs_fh_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
+{
+   struct kernfs_super_info *info = kernfs_info(sb);
+   struct inode *inode;
+   struct kernfs_node *kn;
+
+   if (ino == 0)
+   return ERR_PTR(-ESTALE);
+
+   kn = kernfs_find_and_get_node_by_ino(info->root, ino);
+   if (!kn)
+   return ERR_PTR(-ESTALE);
+   inode = kernfs_get_inode(sb, kn);
+   kernfs_put(kn);
+   if (IS_ERR(inode))
+   return ERR_CAST(inode);
+
+   if (generation && inode->i_generation != generation) {
+   /* we didn't find the right inode.. */
+   iput(inode);
+   return ERR_PTR(-ESTALE);
+   }
+   return inode;
+}
+
+static struct dentry *kernfs_fh_to_dentry(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   kernfs_fh_get_inode);
+}
+
+static struct dentry *kernfs_fh_to_parent(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   kernfs_fh_get_inode);
+}
+
+static struct dentry *kernfs_get_parent_dentry(struct dentry *child)
+{
+   struct kernfs_node *kn = kernfs_dentry_node(child);
+
+   return d_obtain_alias(kernfs_get_inode(child->d_sb, kn->parent));
+}
+
+static const struct export_operations kernfs_export_ops = {
+   .fh_to_dentry   = kernfs_fh_to_dentry,
+   .fh_to_parent   = kernfs_fh_to_parent,
+   .get_parent = kernfs_get_parent_dentry,
+};
+
 /**
  * kernfs_root_from_sb - determine kernfs_root associated with a super_block
  * @sb: the super_block in question
@@ -159,6 +213,8 @@ static int kernfs_fill_super(struct super_block *sb, 
unsigned long magic)
sb->s_magic = magic;
sb->s_op = &kernfs_sops;
sb->s_xattr = kernfs_xattr_handlers;
+   if (info->root->flags & KERNFS_ROOT_SUPPORT_EXPORTOP)
+   sb->s_export_op = &kernfs_export_ops;
sb->s_time_gran = 1;
 
/* get root inode, initialize and unlock it */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 06a0c59..d149361 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -69,6 +69,12 @@ enum kernfs_root_flag {
 * following flag enables that behavior.
 */
KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK   = 0x0002,
+
+   /*
+* The filesystem supports exportfs operation, so userspace can use
+* fhandle to access nodes of the fs.
+*/
+   KERNFS_ROOT_SUPPORT_EXPORTOP= 0x0004,
 };
 
 /* type-specific structures for kernfs_node union members */
@@ -98,6 +104,12 @@ struct kernfs_elem_attr {
 /* represent a kernfs node */
 union kernfs_node_id {
struct {
+   /*
+* blktrace will export this struct as a simplified 'struct
+* fid' (which is a big data structure), so userspace can use
+* it to find kernfs node. The layout must match the first two
+* fields of 'struct fid' exactly.
+*/
u32 ino;
u32 generation;
};
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index cc53111..8f88aac 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1737,7 +1737,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 
ss_mask, int ref_flags)
&cgroup_kf_syscall_ops : &cgroup1_kf_syscall_ops;
 
root->kf_root = kernfs_create_root(kf_sops,
-  KERNFS_ROOT_CREATE_DEACTIVATED,
+  KERNFS_ROOT_CREATE_DEACTIVATED |
+  KERNFS_ROOT_SUPPORT_EXPORTOP,
   root_cgrp);
if (IS_ERR(root->kf_root)) {

[PATCH V5 09/11] block: always attach cgroup info into bio

2017-07-12 Thread Shaohua Li
From: Shaohua Li 

blkcg_bio_issue_check() already gets the blkcg for a BIO.
bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
operation. There is no reason not to attach the cgroup info to the bio in
blkcg_bio_issue_check(). This also makes blktrace output the correct cgroup
info.

Acked-by: Tejun Heo 
Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c   | 7 +--
 include/linux/blk-cgroup.h | 3 +++
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a7285bf..a6ebd2b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2104,14 +2104,9 @@ static inline void throtl_update_latency_buckets(struct 
throtl_data *td)
 static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio)
 {
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
-   int ret;
-
-   ret = bio_associate_current(bio);
-   if (ret == 0 || ret == -EBUSY)
+   if (bio->bi_css)
bio->bi_cg_private = tg;
blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio));
-#else
-   bio_associate_current(bio);
 #endif
 }
 
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 7104bea..9d92153 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -691,6 +691,9 @@ static inline bool blkcg_bio_issue_check(struct 
request_queue *q,
rcu_read_lock();
blkcg = bio_blkcg(bio);
 
+   /* associate blkcg if bio hasn't attached one */
+   bio_associate_blkcg(bio, &blkcg->css);
+
blkg = blkg_lookup(blkcg, q);
if (unlikely(!blkg)) {
spin_lock_irq(q->queue_lock);
-- 
2.9.3



[PATCH V5 05/11] kernfs: introduce kernfs_node_id

2017-07-12 Thread Shaohua Li
From: Shaohua Li 

An inode number and generation can identify a kernfs node. We are going to
export this identification via exportfs operations, so put ino and
generation into a separate structure. That makes it convenient for later
patches to use the identification.
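
For orientation, the identification introduced here ends up with roughly the
following shape (the hunk that adds it is cut off further down; the u64 alias
member is an assumption about the full definition):

	union kernfs_node_id {
		struct {
			u32	ino;
			u32	generation;
		};
		u64	id;	/* assumed 64-bit alias of the pair above */
	};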

Acked-by: Greg Kroah-Hartman 
Signed-off-by: Shaohua Li 
---
 fs/kernfs/dir.c  | 10 +-
 fs/kernfs/file.c |  4 ++--
 fs/kernfs/inode.c|  4 ++--
 include/linux/cgroup.h   |  2 +-
 include/linux/kernfs.h   | 12 ++--
 include/trace/events/writeback.h |  2 +-
 6 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index b61a7ef..89d1dc1 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -539,7 +539,7 @@ void kernfs_put(struct kernfs_node *kn)
}
kfree(kn->iattr);
spin_lock(&kernfs_idr_lock);
-   idr_remove(&root->ino_idr, kn->ino);
+   idr_remove(&root->ino_idr, kn->id.ino);
spin_unlock(&kernfs_idr_lock);
kmem_cache_free(kernfs_node_cache, kn);
 
@@ -645,8 +645,8 @@ static struct kernfs_node *__kernfs_new_node(struct 
kernfs_root *root,
idr_preload_end();
if (ret < 0)
goto err_out2;
-   kn->ino = ret;
-   kn->generation = gen;
+   kn->id.ino = ret;
+   kn->id.generation = gen;
 
/*
 * set ino first. This barrier is paired with atomic_inc_not_zero in
@@ -721,7 +721,7 @@ struct kernfs_node *kernfs_find_and_get_node_by_ino(struct 
kernfs_root *root,
 * before 'count'. So if 'count' is uptodate, 'ino' should be uptodate,
 * hence we can use 'ino' to filter stale node.
 */
-   if (kn->ino != ino)
+   if (kn->id.ino != ino)
goto out;
rcu_read_unlock();
 
@@ -1654,7 +1654,7 @@ static int kernfs_fop_readdir(struct file *file, struct 
dir_context *ctx)
const char *name = pos->name;
unsigned int type = dt_type(pos);
int len = strlen(name);
-   ino_t ino = pos->ino;
+   ino_t ino = pos->id.ino;
 
ctx->pos = pos->hash;
file->private_data = pos;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 7f90d4d..7441925 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -895,7 +895,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 * have the matching @file available.  Look up the inodes
 * and generate the events manually.
 */
-   inode = ilookup(info->sb, kn->ino);
+   inode = ilookup(info->sb, kn->id.ino);
if (!inode)
continue;
 
@@ -903,7 +903,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
if (parent) {
struct inode *p_inode;
 
-   p_inode = ilookup(info->sb, parent->ino);
+   p_inode = ilookup(info->sb, parent->id.ino);
if (p_inode) {
fsnotify(p_inode, FS_MODIFY | FS_EVENT_ON_CHILD,
 inode, FSNOTIFY_EVENT_INODE, kn->name, 
0);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 4c8b510..a343039 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -220,7 +220,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, 
struct inode *inode)
inode->i_private = kn;
inode->i_mapping->a_ops = &kernfs_aops;
inode->i_op = &kernfs_iops;
-   inode->i_generation = kn->generation;
+   inode->i_generation = kn->id.generation;
 
set_default_inode_attr(inode, kn->mode);
kernfs_refresh_inode(kn, inode);
@@ -266,7 +266,7 @@ struct inode *kernfs_get_inode(struct super_block *sb, 
struct kernfs_node *kn)
 {
struct inode *inode;
 
-   inode = iget_locked(sb, kn->ino);
+   inode = iget_locked(sb, kn->id.ino);
if (inode && (inode->i_state & I_NEW))
kernfs_init_inode(kn, inode);
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 710a005..30c6877 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -543,7 +543,7 @@ static inline bool cgroup_is_populated(struct cgroup *cgrp)
 /* returns ino associated with a cgroup */
 static inline ino_t cgroup_ino(struct cgroup *cgrp)
 {
-   return cgrp->kn->ino;
+   return cgrp->kn->id.ino;
 }
 
 /* cft/css accessors for cftype->write() operation */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 8c00d28..06a0c59 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -95,6 +95,15 @@ struct kernfs_elem_attr {
struct kernfs_node  *notify_next;   /* for kernfs_notify() */
 };
 
+/* represent a kernfs node */
+union kernfs_node_id {
+   struct {
+   u32 ino;
+   u32 generation;
+   };
+   u64   

[PATCH V5 11/11] block: use standard blktrace API to output cgroup info for debug notes

2017-07-12 Thread Shaohua Li
From: Shaohua Li 

Currently cfq/bfq/blk-throttle output cgroup info in trace in their own
way. Now we have standard blktrace API for this, so convert them to use
it.

Note, this changes the behavior a little bit. cgroup info isn't output
by default; we only do this with the 'blk_cgroup' option enabled. cgroup
info isn't output as a string by default either; we only do that with the
'blk_cgname' option enabled. Also, cgroup info is output at a different
position in the note string. I think these behavior changes aren't a big
issue (actually we make the trace data shorter, which is good), since the
blktrace note is solely for debugging.

Signed-off-by: Shaohua Li 
---
 block/bfq-iosched.h  | 13 -
 block/blk-throttle.c |  6 ++
 block/cfq-iosched.c  | 15 ++-
 include/linux/blktrace_api.h | 13 +
 kernel/trace/blktrace.c  | 12 ++--
 5 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 8fd83b8..bf69d13 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -917,13 +917,16 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct 
bfq_queue *bfqq);
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 
 #define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do {\
-   blk_add_trace_msg((bfqd)->queue, "bfq%d%c %s " fmt, (bfqq)->pid,\
-   bfq_bfqq_sync((bfqq)) ? 'S' : 'A',  \
-   bfqq_group(bfqq)->blkg_path, ##args);   \
+   blk_add_cgroup_trace_msg((bfqd)->queue, \
+   bfqg_to_blkg(bfqq_group(bfqq))->blkcg,  \
+   "bfq%d%c " fmt, (bfqq)->pid,\
+   bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args); \
 } while (0)
 
-#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) \
-   blk_add_trace_msg((bfqd)->queue, "%s " fmt, (bfqg)->blkg_path, ##args)
+#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do {\
+   blk_add_cgroup_trace_msg((bfqd)->queue, \
+   bfqg_to_blkg(bfqg)->blkcg, fmt, ##args);\
+} while (0)
 
 #else /* CONFIG_BFQ_GROUP_IOSCHED */
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a6ebd2b..6a4c4c4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -373,10 +373,8 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, 
int rw)
if (likely(!blk_trace_note_message_enabled(__td->queue)))   \
break;  \
if ((__tg)) {   \
-   char __pbuf[128];   \
-   \
-   blkg_path(tg_to_blkg(__tg), __pbuf, sizeof(__pbuf));\
-   blk_add_trace_msg(__td->queue, "throtl %s " fmt, __pbuf, 
##args); \
+   blk_add_cgroup_trace_msg(__td->queue,   \
+   tg_to_blkg(__tg)->blkcg, "throtl " fmt, ##args);\
} else {\
blk_add_trace_msg(__td->queue, "throtl " fmt, ##args);  \
}   \
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3d5c289..0fb78fb 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -656,20 +656,17 @@ static inline void cfqg_put(struct cfq_group *cfqg)
 }
 
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...) do {\
-   char __pbuf[128];   \
-   \
-   blkg_path(cfqg_to_blkg((cfqq)->cfqg), __pbuf, sizeof(__pbuf));  \
-   blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c %s " fmt, (cfqq)->pid, \
+   blk_add_cgroup_trace_msg((cfqd)->queue, \
+   cfqg_to_blkg((cfqq)->cfqg)->blkcg,  \
+   "cfq%d%c%c " fmt, (cfqq)->pid,  \
cfq_cfqq_sync((cfqq)) ? 'S' : 'A',  \
cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
- __pbuf, ##args);  \
+ ##args);  \
 } while (0)
 
 #define cfq_log_cfqg(cfqd, cfqg, fmt, args...) do {\
-   char __pbuf[128];   \
-   \
-   blkg_path(cfqg_to_blkg(cfqg), __pbuf, sizeof(__pbuf));  \
-   blk_add_trace_msg((cfqd)->queue, "%s " fmt, __pbuf, ##args);\
+   blk_add_cgroup_trace_msg((cfqd)->queue, \
+   

[PATCH V5 00/11] blktrace: output cgroup info

2017-07-12 Thread Shaohua Li
From: Shaohua Li 

Hi,

Currently blktrace isn't cgroup aware. blktrace prints out the task name of the
current context, but the task of the current context isn't always in the cgroup
where the BIO comes from, so we can't use the task name to find the IO cgroup.
For example, writeback BIOs always come from the flusher thread, but the BIOs
are for different blk cgroups. Requests can be requeued and dispatched from
completely different tasks. MD/DM are other examples. This makes it hard to use
blktrace for performance tuning with cgroups enabled.

This patchset tries to close the gap. We print out cgroup fhandle info in
blktrace. Userspace can use the open_by_handle_at() syscall to find the cgroup
by fhandle, or it can use the name_to_handle_at() syscall to find the fhandle
for a cgroup and use a BPF program to filter blktrace for a specific cgroup.

The first 6 patches add exportfs operation handlers for kernfs, so userspace can
use open_by_handle_at/name_to_handle_at on a kernfs file. The later patches make
blktrace output cgroup info.

Thanks,
Shaohua

V4 -> V5:
- Add ack and refresh to latest -next

V3 -> V4:
- Address some issues pointed out by Tejun
- Fix build errors with latest -next tree

V2 -> V3:
- Uses 32 bits inode number
- Refresh to latest -next tree

V1 -> V2:
- Fix a bug in cgroup association
- Fix build errors reported by 0day
- Address some issues pointed out by Tejun

Shaohua Li (11):
  kernfs: use idr instead of ida to manage inode number
  kernfs: implement i_generation
  kernfs: add an API to get kernfs node from inode number
  kernfs: don't set dentry->d_fsdata
  kernfs: introduce kernfs_node_id
  kernfs: add exportfs operations
  cgroup: export fhandle info for a cgroup
  blktrace: export cgroup info in trace
  block: always attach cgroup info into bio
  blktrace: add an option to allow displaying cgroup path
  block: use standard blktrace API to output cgroup info for debug notes

 block/bfq-iosched.h   |  13 +-
 block/blk-throttle.c  |  13 +-
 block/cfq-iosched.c   |  15 +--
 fs/kernfs/dir.c   | 111 
 fs/kernfs/file.c  |  10 +-
 fs/kernfs/inode.c |   9 +-
 fs/kernfs/kernfs-internal.h   |   9 ++
 fs/kernfs/mount.c |  94 --
 fs/kernfs/symlink.c   |   6 +-
 include/linux/blk-cgroup.h|   3 +
 include/linux/blktrace_api.h  |  13 +-
 include/linux/cgroup.h|  16 ++-
 include/linux/kernfs.h|  28 -
 include/trace/events/writeback.h  |   2 +-
 include/uapi/linux/blktrace_api.h |   3 +
 kernel/cgroup/cgroup.c|  15 ++-
 kernel/trace/blktrace.c   | 259 ++
 17 files changed, 467 insertions(+), 152 deletions(-)

-- 
2.9.3



[PATCH 1/2] block: delete bio_uninit in blockdev

2017-07-12 Thread Shaohua Li
From: Shaohua Li 

This is to partially revert commit 9ae3b3f52c62 (block: provide
bio_uninit() free freeing integrity/task associations). With commit
b222dd2 (block: call bio_uninit in bio_endio) and 7c20f11 (bio-integrity:
stop abusing bi_end_io), integrity/cgroup info is freed in bio_endio. We
don't need to call bio_uninit in other places.

Cc: Christoph Hellwig 
Signed-off-by: Shaohua Li 
---
 block/bio.c | 8 +---
 fs/block_dev.c  | 4 +---
 include/linux/bio.h | 1 -
 3 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 9a63597..ce078fb 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -240,11 +240,10 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, 
unsigned long *idx,
return bvl;
 }
 
-void bio_uninit(struct bio *bio)
+static void bio_uninit(struct bio *bio)
 {
bio_disassociate_task(bio);
 }
-EXPORT_SYMBOL(bio_uninit);
 
 static void bio_free(struct bio *bio)
 {
@@ -269,11 +268,6 @@ static void bio_free(struct bio *bio)
}
 }
 
-/*
- * Users of this function have their own bio allocation. Subsequently,
- * they must remember to pair any call to bio_init() with bio_uninit()
- * when IO has completed, or when the bio is released.
- */
 void bio_init(struct bio *bio, struct bio_vec *table,
  unsigned short max_vecs)
 {
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9941dc8..673536e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -264,9 +264,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
iov_iter *iter,
kfree(vecs);
 
if (unlikely(bio.bi_status))
-   ret = blk_status_to_errno(bio.bi_status);
-
-   bio_uninit(&bio);
+   return blk_status_to_errno(bio.bi_status);
 
return ret;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7b1cf4b..8a56a50 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -444,7 +444,6 @@ extern void bio_advance(struct bio *, unsigned);
 
 extern void bio_init(struct bio *bio, struct bio_vec *table,
 unsigned short max_vecs);
-extern void bio_uninit(struct bio *);
 extern void bio_reset(struct bio *);
 void bio_chain(struct bio *, struct bio *);
 
-- 
2.9.3



[PATCH 2/2] block: delete bio_uninit

2017-07-12 Thread Shaohua Li
From: Shaohua Li 

bio_uninit only calls bio_disassociate_task now. It's pointless to
keep the wrapper.

Cc: Christoph Hellwig 
Signed-off-by: Shaohua Li 
---
 block/bio.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index ce078fb..8773d1b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -240,17 +240,12 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, 
unsigned long *idx,
return bvl;
 }
 
-static void bio_uninit(struct bio *bio)
-{
-   bio_disassociate_task(bio);
-}
-
 static void bio_free(struct bio *bio)
 {
struct bio_set *bs = bio->bi_pool;
void *p;
 
-   bio_uninit(bio);
+   bio_disassociate_task(bio);
 
if (bs) {
bvec_free(bs->bvec_pool, bio->bi_io_vec, BVEC_POOL_IDX(bio));
@@ -294,7 +289,7 @@ void bio_reset(struct bio *bio)
 {
unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);
 
-   bio_uninit(bio);
+   bio_disassociate_task(bio);
 
memset(bio, 0, BIO_RESET_BYTES);
bio->bi_flags = flags;
@@ -1828,7 +1823,7 @@ void bio_endio(struct bio *bio)
 
blk_throtl_bio_endio(bio);
/* release cgroup info */
-   bio_uninit(bio);
+   bio_disassociate_task(bio);
if (bio->bi_end_io)
bio->bi_end_io(bio);
 }
-- 
2.9.3



Re: [PATCH] blk-mq-debugfs: add mapping show for hw queue to cpu

2017-07-12 Thread weiping zhang
On Wed, Jul 12, 2017 at 10:57:37AM -0600, Jens Axboe wrote:
> On 07/12/2017 10:54 AM, weiping zhang wrote:
> > A mapping show as following:
> > 
> > hctxcpus
> > hctx0 0   1
> > hctx1 2
> > hctx2 3
> > hctx3 4   5
> 
> We already have that information in the /sys/block//mq/X/cpu_list
> 
> where X is the hardware queue number. Why do we need it in debugfs as
> well, presented differently?

this attribute gives a more obvious view of the mapping, with just one "cat"
command. /sys/block//mq/X/cpu_list doesn't easily give the mapping for all
hctxs with one command.
also /sys/kernel/debug/block/xxx/hctxN/cpuX can export that info, but
neither of these two methods is as obvious.


Re: linux-next scsi-mq hang in suspend-resume

2017-07-12 Thread Jens Axboe
On 07/12/2017 08:51 AM, Tomi Sarvela wrote:
> Hello there,
> 
> I've been running Intel GFX CI testing for linux DRM-Tip i915 driver, 
> and couple of weeks ago we took linux-next for a ride to see what kind 
> of integration problems there might pop up when pulling 4.13-rc1. 
> Latest results can be seen at
> 
> https://intel-gfx-ci.01.org/CI/next-issues.html
> https://intel-gfx-ci.01.org/CI/next-all.html
> 
> The purple blocks are hangs, starting from 20170628 (20170627 was 
> untestable due to locking changes which were reverted). Traces were 
> pointing to ext4 but bisecting between good 20170626 and bad 20170628 
> pointed to:
> 
> commit 5c279bd9e40624f4ab6e688671026d6005b066fa
> Date:   Fri Jun 16 10:27:55 2017 +0200
> 
> scsi: default to scsi-mq
> 
> Reproduction is 100% or close to it when running two i-g-t tests as a 
> testlist. I'm assuming that it creates the correct amount or pattern 
> of actions to the device. The testlist consists of the following 
> lines:
> 
> igt@gem_exec_gttfill@basic
> igt@gem_exec_suspend@basic-s3
> 
> Kernel option scsi_mod.use_blk_mq=0 hides the issue on testhosts. 
> Configuration option was copied over on testhosts and 20170712 was re-
> tested, that's why today looks so much greener.
> 
> More information including traces and reproduction instructions at
> https://bugzilla.kernel.org/show_bug.cgi?id=196223
> 
> I can run patchsets through the farm, if needed. In addition, daily 
> linux-next tags are automatically tested and results published.

Christoph, any ideas? Smells like something in SCSI, my notebook
with nvme/blk-mq suspend/resumes just fine.

-- 
Jens Axboe



Re: [PATCH v3 05/14] md: raid1: don't use bio's vec table to manage resync pages

2017-07-12 Thread Shaohua Li
On Wed, Jul 12, 2017 at 09:40:10AM +0800, Ming Lei wrote:
> On Tue, Jul 11, 2017 at 7:14 AM, NeilBrown  wrote:
> > On Mon, Jul 10 2017, Shaohua Li wrote:
> >
> >> On Mon, Jul 10, 2017 at 03:25:41PM +0800, Ming Lei wrote:
> >>> On Mon, Jul 10, 2017 at 02:38:19PM +1000, NeilBrown wrote:
> >>> > On Mon, Jul 10 2017, Ming Lei wrote:
> >>> >
> >>> > > On Mon, Jul 10, 2017 at 11:35:12AM +0800, Ming Lei wrote:
> >>> > >> On Mon, Jul 10, 2017 at 7:09 AM, NeilBrown  wrote:
> >>> > ...
> >>> > >> >> +
> >>> > >> >> + rp->idx = 0;
> >>> > >> >
> >>> > >> > This is the only place the ->idx is initialized, in 
> >>> > >> > r1buf_pool_alloc().
> >>> > >> > The mempool alloc function is suppose to allocate memory, not 
> >>> > >> > initialize
> >>> > >> > it.
> >>> > >> >
> >>> > >> > If the mempool_alloc() call cannot allocate memory it will use 
> >>> > >> > memory
> >>> > >> > from the pool.  If this memory has already been used, then it will 
> >>> > >> > no
> >>> > >> > longer have the initialized value.
> >>> > >> >
> >>> > >> > In short: you need to initialise memory *after* calling
> >>> > >> > mempool_alloc(), unless you ensure it is reset to the init values 
> >>> > >> > before
> >>> > >> > calling mempool_free().
> >>> > >> >
> >>> > >> > https://bugzilla.kernel.org/show_bug.cgi?id=196307
> >>> > >>
> >>> > >> OK, thanks for posting it out.
> >>> > >>
> >>> > >> Another fix might be to reinitialize the variable(rp->idx = 0) in
> >>> > >> r1buf_pool_free().
> >>> > >> Or just set it as zero every time when it is used.
> >>> > >>
> >>> > >> But I don't understand why mempool_free() calls pool->free() at the 
> >>> > >> end of
> >>> > >> this function, which may cause to run pool->free() on a new 
> >>> > >> allocated buf,
> >>> > >> seems a bug in mempool?
> >>> > >
> >>> > > Looks I missed the 'return' in mempool_free(), so it is fine.
> >>> > >
> >>> > > How about the following fix?
> >>> >
> >>> > It looks like it would probably work, but it is rather unusual to
> >>> > initialise something just before freeing it.
> >>> >
> >>> > Couldn't you just move the initialization to shortly after the
> >>> > mempool_alloc() call.  There looks like a good place that already loops
> >>> > over all the bios
> >>>
> >>> OK, follows the revised patch according to your suggestion.
> >
> > Thanks.
> >
> > That isn't as tidy as I hoped.  So I went deeper into the code to try to
> > understand why...
> >
> > I think that maybe we should just discard the ->idx field completely.
> > It is only used in this code:
> >
> > do {
> > struct page *page;
> > int len = PAGE_SIZE;
> > if (sector_nr + (len>>9) > max_sector)
> > len = (max_sector - sector_nr) << 9;
> > if (len == 0)
> > break;
> > for (bio= biolist ; bio ; bio=bio->bi_next) {
> > struct resync_pages *rp = get_resync_pages(bio);
> > page = resync_fetch_page(rp, rp->idx++);
> > /*
> >  * won't fail because the vec table is big enough
> >  * to hold all these pages
> >  */
> > bio_add_page(bio, page, len, 0);
> > }
> > nr_sectors += len>>9;
> > sector_nr += len>>9;
> > } while (get_resync_pages(biolist)->idx < RESYNC_PAGES);
> >
> > and all of the different 'rp' always have the same value for 'idx'.
> > This code is more complex than it needs to be.  This is because it used
> > to be possible for bio_add_page() to fail.  That cannot happen any more.
> > So we can make the code something like:
> >
> >   for (idx = 0; idx < RESYNC_PAGES; idx++) {
> >  struct page *page;
> >  int len = PAGE_SIZE;
> >  if (sector_nr + (len >> 9) > max_sector)
> >  len = (max_sector - sector_nr) << 9
> >  if (len == 0)
> >  break;
> >  for (bio = biolist; bio; bio = bio->bi_next) {
> > struct resync_pages *rp = get_resync_pages(bio);
> > page = resync_fetch_page(rp, idx);
> > bio_add_page(bio, page, len, 0);
> >  }
> >  nr_sectors += len >> 9;
> >  sector_nr += len >> 9;
> >   }
> >
> > Or did I miss something?
> 
> I think this approach is much clean.

I thought I suggested not using the 'idx' in your previous post, but you said
there was a reason (not because of bio_add_page) not to do it. Has that changed?
I can't remember the details; I need to dig through the mail archives.

Thanks,
Shaohua


Re: [PATCH V2] block: call bio_uninit in bio_endio

2017-07-12 Thread Shaohua Li
On Wed, Jul 12, 2017 at 09:25:17AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 10, 2017 at 11:40:17AM -0700, Shaohua Li wrote:
> > bio_free isn't a good place to free cgroup info. There are a
> > lot of cases bio is allocated in special way (for example, in stack) and
> > never gets called by bio_put hence bio_free, we are leaking memory. This
> > patch moves the free to bio endio, which should be called anyway. The
> > bio_uninit call in bio_free is kept, in case the bio never gets called
> > bio endio.
> > 
> > This assumes ->bi_end_io() doesn't access cgroup info, which seems true
> > in my audit.
> > 
> > This along with Christoph's integrity patch should fix the memory leak
> > issue.
> 
> Can we please go ahead and:
> 
>  (a) remove the now unneeded call in __blkdev_direct_IO_simple
>  (b) kill bio_uninit - it's just a wrapper around bio_disassociate_task
>  (c) check if we can remove the call in bio_free, which I think we can
>  as bi_ioc and bi_css are only assigned from generic_make_request

Sure, will do these soon.


Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold

2017-07-12 Thread Bart Van Assche
On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > What happens with fluid congestion boundaries, with shared tags?
> 
> The approach in this patch should work, but the threshold may not
> be accurate in this way, one simple method is to use the average
> tag weight in EWMA, like this:
> 
>   sbitmap_weight() / hctx->tags->active_queues

Hello Ming,

That approach would result in a severe performance degradation. "active_queues"
namely represents the number of queues against which I/O ever has been queued.
If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs are
responding and if the queue depth would also be 64 then the approach you
proposed will reduce the effective queue depth per LUN from 64 to 1.

Bart.

Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()

2017-07-12 Thread Bart Van Assche
On Wed, 2017-07-12 at 11:15 +0800, Ming Lei wrote:
> On Tue, Jul 11, 2017 at 07:57:53PM +, Bart Van Assche wrote:
> > On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> > > Now SCSI won't stop queue, and not necessary to use
> > > blk_mq_start_hw_queues(), so switch to blk_mq_run_hw_queues()
> > > instead.
> > > 
> > > Cc: "James E.J. Bottomley" 
> > > Cc: "Martin K. Petersen" 
> > > Cc: linux-s...@vger.kernel.org
> > > Signed-off-by: Ming Lei 
> > > ---
> > >  drivers/scsi/scsi_lib.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > index f6097b89d5d3..91d890356b78 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -333,7 +333,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> > >  static void scsi_kick_queue(struct request_queue *q)
> > >  {
> > >   if (q->mq_ops)
> > > - blk_mq_start_hw_queues(q);
> > > + blk_mq_run_hw_queues(q, false);
> > >   else
> > >   blk_run_queue(q);
> > >  }
> > 
> > Hello Ming,
> > 
> > Now that we have separate flags to represent the "stopped" and "quiesced"
> > states, wouldn't it be better not to modify scsi_kick_queue() but instead to
> > stop a SCSI hardware queue again if scsi_queue_rq() returns 
> > BLK_STS_RESOURCE?
> > See also commits 36e3cf273977 ("scsi: Avoid that SCSI queues get stuck") and
> > commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues").
> 
> As you can see in the following patches, all stop/start queue APIs will
> be removed, and the 'stopped' state will become a internal state.
> 
> It doesn't make sense to clear 'stopped' for scsi anymore, since the
> queue won't be stopped actually. So we need this change.

Hello Ming,

As Jens already noticed, this approach won't work properly for concurrent I/O
to multiple LUNs associated with the same SCSI host. This approach won't work
properly for dm-mpath either. Sorry but I'm not convinced that it's possible
to replace the stop/start queue API for all block drivers by an algorithm
that is based on estimating the queue depth.

Bart.

linux-next scsi-mq hang in suspend-resume

2017-07-12 Thread Tomi Sarvela
Hello there,

I've been running Intel GFX CI testing for linux DRM-Tip i915 driver, 
and couple of weeks ago we took linux-next for a ride to see what kind 
of integration problems there might pop up when pulling 4.13-rc1. 
Latest results can be seen at

https://intel-gfx-ci.01.org/CI/next-issues.html
https://intel-gfx-ci.01.org/CI/next-all.html

The purple blocks are hangs, starting from 20170628 (20170627 was 
untestable due to locking changes which were reverted). Traces were 
pointing to ext4 but bisecting between good 20170626 and bad 20170628 
pointed to:

commit 5c279bd9e40624f4ab6e688671026d6005b066fa
Date:   Fri Jun 16 10:27:55 2017 +0200

scsi: default to scsi-mq

Reproduction is 100% or close to it when running two i-g-t tests as a 
testlist. I'm assuming that it creates the correct amount or pattern 
of actions to the device. The testlist consists of the following 
lines:

igt@gem_exec_gttfill@basic
igt@gem_exec_suspend@basic-s3

Kernel option scsi_mod.use_blk_mq=0 hides the issue on testhosts. 
Configuration option was copied over on testhosts and 20170712 was re-
tested, that's why today looks so much greener.

More information including traces and reproduction instructions at
https://bugzilla.kernel.org/show_bug.cgi?id=196223

I can run patchsets through the farm, if needed. In addition, daily 
linux-next tags are automatically tested and results published.

Best regards,

Tomi Sarvela
-- 
Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo


Re: [PATCH] block, bfq: dispatch request to prevent queue stalling after the request completion

2017-07-12 Thread Paolo Valente

> On 12 Jul 2017, at 16:22, Jens Axboe  wrote:
> 
> On 07/12/2017 03:41 AM, Paolo Valente wrote:
>> 
>>> On 11 Jul 2017, at 15:58, Hou Tao  wrote:
>>> 
>>> There are mq devices (eg., virtio-blk, nbd and loopback) which don't
>>> invoke blk_mq_run_hw_queues() after the completion of a request.
>>> If bfq is enabled on these devices and the slice_idle attribute or
>>> strict_guarantees attribute is set as zero,
>> 
>> I guess you meant slice_idle > 0 or strict_guarantees equal to 1 here.
>> 
>>> it is possible that
>>> after a request completion the remaining requests of busy bfq queue
>>> will stalled in the bfq schedule until a new request arrives.
>>> 
>>> To fix the scheduler latency problem, we need to check whether or not
>>> all issued requests have completed and dispatch more requests to driver
>>> if there is no request in driver.
>>> 
>>> Signed-off-by: Hou Tao 
>>> 
>> 
>> Thanks for this fix.  My only (possible) concern is whether there
>> would be some more coherent and possibly more efficient solution to
>> this problem, outside BFQ.  For example, would it make sense to call
>> blk_mq_run_hw_queues(), after a request completion, from the offended
>> devices too?  This would make the behavior of these devices coherent
>> with that of the other devices.  Unfortunately I have no sound answer
>> to such a question.  Apart from this concern:
> 
> No, that's wrong. If a scheduler decides to stop dispatching
> requests before the list has been drained, it is the schedulers
> responsibility to restart dispatch. Some drivers need to restart
> the queue for other reasons, don't confuse the two.
> 

Thanks for clearing up this point.  As I already wrote, the patch is
then ok for me.

Thanks,
Paolo

> -- 
> Jens Axboe



Re: [PATCH] block, bfq: dispatch request to prevent queue stalling after the request completion

2017-07-12 Thread Jens Axboe
On 07/12/2017 03:41 AM, Paolo Valente wrote:
> 
>> On 11 Jul 2017, at 15:58, Hou Tao  wrote:
>>
>> There are mq devices (eg., virtio-blk, nbd and loopback) which don't
>> invoke blk_mq_run_hw_queues() after the completion of a request.
>> If bfq is enabled on these devices and the slice_idle attribute or
>> strict_guarantees attribute is set as zero,
> 
> I guess you meant slice_idle > 0 or strict_guarantees equal to 1 here.
> 
>> it is possible that
>> after a request completion the remaining requests of busy bfq queue
>> will stalled in the bfq schedule until a new request arrives.
>>
>> To fix the scheduler latency problem, we need to check whether or not
>> all issued requests have completed and dispatch more requests to driver
>> if there is no request in driver.
>>
>> Signed-off-by: Hou Tao 
>>
> 
> Thanks for this fix.  My only (possible) concern is whether there
> would be some more coherent and possibly more efficient solution to
> this problem, outside BFQ.  For example, would it make sense to call
> blk_mq_run_hw_queues(), after a request completion, from the offended
> devices too?  This would make the behavior of these devices coherent
> with that of the other devices.  Unfortunately I have no sound answer
> to such a question.  Apart from this concern:

No, that's wrong. If a scheduler decides to stop dispatching
requests before the list has been drained, it is the schedulers
responsibility to restart dispatch. Some drivers need to restart
the queue for other reasons, don't confuse the two.

-- 
Jens Axboe



Re: Lots of new warnings with gcc-7.1.1

2017-07-12 Thread Arnd Bergmann
On Wed, Jul 12, 2017 at 3:10 PM, Greg Kroah-Hartman
 wrote:
> On Tue, Jul 11, 2017 at 03:35:15PM -0700, Linus Torvalds wrote:
>> [ Very random list of maintainers and mailing lists, at least
>> partially by number of warnings generated by gcc-7.1.1 that is then
>> correlated with the get_maintainers script ]
>>
>> So I upgraded one of my boxes to F26, which upgraded the compiler to 
>> gcc-7.1.1
>>
>> Which in turn means that my nice clean allmodconfig compile is now an
>> unholy mess of annoying new warnings.
>
> I asked Arnd about this the other day on IRC as I've hit this as well on
> the stable releases, and it's really annoying.  He mentioned that he had
> lots of these warnings fixed, but didn't push most of the changes out
> yet.

To clarify: most of the patches I wrote ended up getting merged, but
there were a couple that I did not submit a second time after they
got dropped, but I gave up on trying to fix the new -Wformat warnings
and simply disabled them, hoping someone else would do it before me,
or that the gcc developers would find a way to reduce the false-positive
ones before the release.

>  Arnd, any repo with them in it that we could look at?

I have a private tree on my workstation that has lots of random
crap, and I rebase it all the time but normally don't publish it.

I have uploaded today's snapshot to

git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git randconfig-4.13-next

The way I work with this is helpful to catch build regressions as soon
as they happen, but not so good in finding things that I have either
submitted a patch for before and don't remember if it should be
resubmitted, or stuff that I decided I didn't want to deal with at some
point.

I was already planning to start over from scratch one of these days,
and cherry-pick+resubmit the patches that are actually required
for randconfig builds.

>> Anyway, it would be lovely if some of the more affected developers
>> would take a look at gcc-7.1.1 warnings. Right now I get about three
>> *thousand* lines of warnings from a "make allmodconfig" build, which
>> makes them a bit overwhelming.
>
> I only have 310 when building the 4.12.0 release with 7.1.1, I wonder if
> Fedora turned more warnings on in their compiler release, I'm running
> Arch here:
> $ gcc --version
> gcc (GCC) 7.1.1 20170621

This is what I get in today's linux-next:

$ grep error: 4.13-next-allmod-warning | sed -e 's:^.*\[-W:-W:' | sort
| uniq -c | cut -f 1 -d\] | sort -n
  1 -Werror=parentheses
  2 -Werror=tautological-compare
  2 -Werror=unused-result
 34 -Werror=format-overflow=
 41 -Werror=int-in-bool-context
233 -Werror=format-truncation=

I'll resubmit the patches for -Wparentheses, -Wtautological-compare,
-Wunused-result and -Wint-in-bool-context that I had sent earlier,
plus a new patch to move -Wformat-truncation into W=1.

  Arnd


Re: Lots of new warnings with gcc-7.1.1

2017-07-12 Thread Arnd Bergmann
On Wed, Jul 12, 2017 at 5:41 AM, Linus Torvalds
 wrote:

>
> We also have about a bazillion
>
> warning: ‘*’ in boolean context, suggest ‘&&’ instead
>
> warnings in drivers/ata/libata-core.c, all due to a single macro that
> uses a pattern that gcc-7.1.1 doesn't like. The warning looks a bit
> debatable, but I suspect the macro could easily be changed too.
>
> Tejun, would you hate just moving the "multiply by 1000" part _into_
> that EZ() macro? Something like the attached (UNTESTED!) patch?

Tejun applied an almost identical patch of mine a while ago, but it seems to
have gotten lost in the meantime in some rebase:

https://patchwork.kernel.org/patch/9721397/
https://patchwork.kernel.org/patch/9721399/

I guess I should have resubmitted the second patch with the suggested
improvement.

 Arnd


Re: Lots of new warnings with gcc-7.1.1

2017-07-12 Thread Mauro Carvalho Chehab
On Tue, 11 Jul 2017 15:35:15 -0700
Linus Torvalds  wrote:

> [ Very random list of maintainers and mailing lists, at least
> partially by number of warnings generated by gcc-7.1.1 that is then
> correlated with the get_maintainers script ]

Under drivers/media, I fixed a bunch of gcc 7.1 warnings before the
merge window. While most were just noise, some actually pointed to
human errors.

Now, gcc-7.1.1 produces only 6 warnings with W=1 on x86_64 (allyesconfig), 
either due to unused-but-set-variable or unused-const-variable. I guess
both warning options are disabled by default. Anyway, I have patches
to fix them already. I'll send them to you later.

The atomisp staging driver is a completely different beast, which would
itself produce a huge number of warnings. I ended up adding some
logic to the drivers/staging/media/atomisp/ Makefiles to disable them:

ccflags-y += $(call cc-disable-warning, missing-declarations)
ccflags-y += $(call cc-disable-warning, missing-prototypes)
ccflags-y += $(call cc-disable-warning, unused-but-set-variable)
ccflags-y += $(call cc-disable-warning, unused-const-variable)
ccflags-y += $(call cc-disable-warning, suggest-attribute=format)
ccflags-y += $(call cc-disable-warning, implicit-fallthrough)

(there's actually one patch pending related to atomisp, that I'll also
be sending you soon - meant to avoid warnings if compiled with an older
gcc version)

Thanks,
Mauro


Re: [PATCH] block, bfq: dispatch request to prevent queue stalling after the request completion

2017-07-12 Thread Hou Tao


On 2017/7/12 17:41, Paolo Valente wrote:
> 
>> On 11 Jul 2017, at 15:58, Hou Tao  wrote:
>>
>> There are mq devices (eg., virtio-blk, nbd and loopback) which don't
>> invoke blk_mq_run_hw_queues() after the completion of a request.
>> If bfq is enabled on these devices and the slice_idle attribute or
>> strict_guarantees attribute is set as zero,
> 
> I guess you meant slice_idle > 0 or strict_guarantees equal to 1 here.
Sorry for the typo, I mean slice_idle = 0 or strict_guarantees = 1.
Setting slice_idle as 0 alone could also trigger the latency problem.

>> it is possible that
>> after a request completion the remaining requests of busy bfq queue
>> will stalled in the bfq schedule until a new request arrives.
>>
>> To fix the scheduler latency problem, we need to check whether or not
>> all issued requests have completed and dispatch more requests to driver
>> if there is no request in driver.
>>
>> Signed-off-by: Hou Tao 
>>
> 
> Thanks for this fix.  My only (possible) concern is whether there
> would be some more coherent and possibly more efficient solution to
> this problem, outside BFQ.  For example, would it make sense to call
> blk_mq_run_hw_queues(), after a request completion, from the offended
> devices too?  This would make the behavior of these devices coherent
> with that of the other devices.  Unfortunately I have no sound answer
> to such a question.  Apart from this concern:
> 
> Reviewed-by: Paolo Valente 
Thanks for your review.

The inconsistencies between the different mq drivers also confuse me.
I don't know why scsi-mq dispatches the remaining requests after each
request completion while virtio-blk does not.

I'm not sure whether fixing this in the MQ core is a better idea, but I have a
proposal. If the BLK_MQ_S_SCHED_RESTART flag has been set, MQ would invoke
blk_mq_run_hw_queues() after the request completion. The BLK_MQ_S_SCHED_RESTART
flag is set by blk_mq_sched_dispatch_requests() when it finds residual requests
in its dispatch list. We could let .dispatch_request() return a flag that sets
BLK_MQ_S_SCHED_RESTART if the underlying scheduler needs it, so that after the
request completion the MQ core will pull the remaining requests from BFQ.

Regards,
Tao

> 
>> The problem can be reproduced by running the following script
>> on a virtio-blk device with nr_hw_queues as 1:
>>
>> #!/bin/sh
>>
>> dev=vdb
>> # mount point for dev
>> mp=/tmp/mnt
>> cd $mp
>>
>> job=strict.job
>> cat <<EOF > $job
>> [global]
>> direct=1
>> bs=4k
>> size=256M
>> rw=write
>> ioengine=libaio
>> iodepth=128
>> runtime=5
>> time_based
>>
>> [1]
>> filename=1.data
>>
>> [2]
>> new_group
>> filename=2.data
>> EOF
>>
>> echo bfq > /sys/block/$dev/queue/scheduler
>> echo 1 > /sys/block/$dev/queue/iosched/strict_guarantees
>> fio $job
>> ---
>> block/bfq-iosched.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 12bbc6b..94cd682 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -4293,6 +4293,9 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
>>  bfq_bfqq_expire(bfqd, bfqq, false,
>>  BFQQE_NO_MORE_REQUESTS);
>>  }
>> +
>> +if (!bfqd->rq_in_driver)
>> +bfq_schedule_dispatch(bfqd);
>> }
>>
>> static void bfq_put_rq_priv_body(struct bfq_queue *bfqq)
>> -- 
>> 2.5.0
>>
> 
> 
> .
> 



Re: [PATCH] block, bfq: dispatch request to prevent queue stalling after the request completion

2017-07-12 Thread Paolo Valente

> On 11 Jul 2017, at 15:58, Hou Tao  wrote:
> 
> There are mq devices (e.g., virtio-blk, nbd and loopback) which don't
> invoke blk_mq_run_hw_queues() after the completion of a request.
> If bfq is enabled on these devices and the slice_idle attribute or
> strict_guarantees attribute is set as zero,

I guess you meant slice_idle > 0 or strict_guarantees equal to 1 here.

> it is possible that
> after a request completion the remaining requests of a busy bfq queue
> will be stalled in the bfq scheduler until a new request arrives.
> 
> To fix the scheduler latency problem, we need to check whether or not
> all issued requests have completed and dispatch more requests to the
> driver if there is no request in the driver.
> 
> Signed-off-by: Hou Tao 
> 

Thanks for this fix.  My only (possible) concern is whether there
would be some more coherent and possibly more efficient solution to
this problem, outside BFQ.  For example, would it make sense to call
blk_mq_run_hw_queues(), after a request completion, from the offended
devices too?  This would make the behavior of these devices coherent
with that of the other devices.  Unfortunately I have no sound answer
to such a question.  Apart from this concern:

Reviewed-by: Paolo Valente 
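
For reference, the driver-side alternative mentioned above would amount to
something like the following in virtio-blk's completion handler. This is
only an illustrative sketch, assuming a kick from virtblk_done() is
acceptable; it is not a tested or proposed patch.

static void virtblk_done(struct virtqueue *vq)
{
	struct virtio_blk *vblk = vq->vdev->priv;
	bool req_done = false;

	/* ... complete all finished requests, setting req_done ... */

	if (req_done)
		/* asynchronously re-run the hw queues so the scheduler can
		 * hand out any requests it is still holding back */
		blk_mq_run_hw_queues(vblk->disk->queue, true);
}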

> The problem can be reproduced by running the following script
> on a virtio-blk device with nr_hw_queues as 1:
> 
> #!/bin/sh
> 
> dev=vdb
> # mount point for dev
> mp=/tmp/mnt
> cd $mp
> 
> job=strict.job
> cat > $job <<EOF
> [global]
> direct=1
> bs=4k
> size=256M
> rw=write
> ioengine=libaio
> iodepth=128
> runtime=5
> time_based
> 
> [1]
> filename=1.data
> 
> [2]
> new_group
> filename=2.data
> EOF
> 
> echo bfq > /sys/block/$dev/queue/scheduler
> echo 1 > /sys/block/$dev/queue/iosched/strict_guarantees
> fio $job
> ---
> block/bfq-iosched.c | 3 +++
> 1 file changed, 3 insertions(+)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 12bbc6b..94cd682 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4293,6 +4293,9 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
>   bfq_bfqq_expire(bfqd, bfqq, false,
>   BFQQE_NO_MORE_REQUESTS);
>   }
> +
> + if (!bfqd->rq_in_driver)
> + bfq_schedule_dispatch(bfqd);
> }
> 
> static void bfq_put_rq_priv_body(struct bfq_queue *bfqq)
> -- 
> 2.5.0
> 



Re: [PATCH 09/19 v2] bcache: update bio->bi_opf bypass/writeback REQ_ flag hints

2017-07-12 Thread Coly Li
On 2017/7/11 11:48 AM, Coly Li wrote:
> On 2017/7/6 11:24 PM, Christoph Hellwig wrote:
>> On Thu, Jul 06, 2017 at 03:35:48PM +0800, Coly Li wrote:
>>> Then gfs2 breaks the above rule ?  in gfs2_metapath_ra() and
>>> gfs2_dir_readahead(), only REQ_META is used in submit_bh(). It seems an
>>> extra REQ_PRIO should be there.
>>
>> Or maybe not.  E.g. XFS absolutely avoids using REQ_PRIO for any of
>> the buffer writeback as it does not have priority.  You'll need to
>> ask gfs2 folks if they want this I/O to have priority or not.
>>
> 
> I see. Just sent an RFC patch to the gfs2 development list, and am
> waiting for a response.
> 
> Thanks for the hint.

Hi Eric,

Based on Christoph's comments, I think the original version is correct.
gfs2 is the only file system that uses REQ_RAHEAD for metadata readahead;
if the gfs2 developers agree to add REQ_PRIO when issuing their metadata
bios, we can cache that metadata in bcache with a follow-up patch. For now
we can just let this patch go ahead.
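
What that gfs2-side change would look like, roughly. This is an
illustrative sketch only, assuming the readahead paths keep their current
submit_bh() form; the actual RFC patch may differ.

	/* tag gfs2 metadata readahead so bcache treats it as priority I/O */
	rabh->b_end_io = end_buffer_read_sync;
	submit_bh(REQ_OP_READ, REQ_RAHEAD | REQ_META | REQ_PRIO, rabh);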

Thanks for patiently answering my question, and thanks to Christoph for
clarifying the concepts of REQ_META and REQ_PRIO.

Please add Reviewed-by: Coly Li  to the original version of your
patch.


Coly Li


Re: [PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()

2017-07-12 Thread Christoph Hellwig
On Wed, Jul 12, 2017 at 04:29:12PM +0800, Ming Lei wrote:
> We will support multipage bvec soon, so initialize the bvec
> table using the standard way instead of writing the
> table directly. Otherwise it won't work any more once
> multipage bvec is enabled.

It seems to me like these call sites should also use bio_init
instead of bio_reset to get a clean slate (and eventually
phase out bio_reset), which should probably go into
the helper as well.  E.g. instead of pretending to preserve
things we should simply build a new bio here.
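
A hedged sketch of that direction for the raid1 path (not a tested patch;
it assumes the resync bios carry an inline vec table of RESYNC_PAGES
entries, so bi_inline_vecs can be handed back to bio_init()):

	bio_init(b, b->bi_inline_vecs, RESYNC_PAGES);
	b->bi_status = status;
	b->bi_iter.bi_sector = r1_bio->sector +
		conf->mirrors[i].rdev->data_offset;
	b->bi_bdev = conf->mirrors[i].rdev->bdev;
	b->bi_end_io = end_sync_read;
	rp->raid_bio = r1_bio;
	b->bi_private = rp;
	/* rebuild the bvec table from the preallocated resync pages */
	md_bio_reset_resync_pages(b, rp, r1_bio->sectors << 9);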


Re: [PATCH] block, bfq: fix typos in comments about B-WF2Q+ algorithm

2017-07-12 Thread Paolo Valente

> On 12 Jul 2017, at 09:25, Hou Tao  wrote:
> 
> The start time of an eligible entity should be less than or equal to
> the current virtual time, and an entity in the idle tree has a finish
> time greater than the current virtual time.
> 

Thanks for checking and fixing these typos, especially the one about
eligibility.

> Signed-off-by: Hou Tao 

Reviewed-by: Paolo Valente 

> ---
> block/bfq-iosched.h | 2 +-
> block/bfq-wf2q.c| 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
> index 5c3bf98..32feddab 100644
> --- a/block/bfq-iosched.h
> +++ b/block/bfq-iosched.h
> @@ -52,7 +52,7 @@ struct bfq_entity;
> struct bfq_service_tree {
>   /* tree for active entities (i.e., those backlogged) */
>   struct rb_root active;
> - /* tree for idle entities (i.e., not backlogged, with V <= F_i)*/
> + /* tree for idle entities (i.e., not backlogged, with V < F_i)*/
>   struct rb_root idle;
> 
>   /* idle entity with minimum F_i */
> diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
> index 8726ede..d03d7b6 100644
> --- a/block/bfq-wf2q.c
> +++ b/block/bfq-wf2q.c
> @@ -1268,7 +1268,7 @@ static void bfq_update_vtime(struct bfq_service_tree *st, u64 new_value)
>  *
>  * This function searches the first schedulable entity, starting from the
>  * root of the tree and going on the left every time on this side there is
> - * a subtree with at least one eligible (start >= vtime) entity. The path on
> + * a subtree with at least one eligible (start <= vtime) entity. The path on
>  * the right is followed only if a) the left subtree contains no eligible
>  * entities and b) no eligible entity has been found yet.
>  */
> -- 
> 2.5.0
> 



[PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()

2017-07-12 Thread Ming Lei
We will support multipage bvec soon, so initialize the bvec
table using the standard way instead of writing the
table directly. Otherwise it won't work any more once
multipage bvec is enabled.

Acked-by: Guoqing Jiang 
Signed-off-by: Ming Lei 
---
 drivers/md/md.c | 21 +
 drivers/md/md.h |  3 +++
 drivers/md/raid1.c  | 16 ++--
 drivers/md/raid10.c |  4 ++--
 4 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8cdca0296749..cc8dcd928dde 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9130,6 +9130,27 @@ void md_reload_sb(struct mddev *mddev, int nr)
 }
 EXPORT_SYMBOL(md_reload_sb);
 
+/* generally called after bio_reset() for resetting bvec */
+void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
+  int size)
+{
+   int idx = 0;
+
+   /* initialize bvec table again */
+   do {
+   struct page *page = resync_fetch_page(rp, idx);
+   int len = min_t(int, size, PAGE_SIZE);
+
+   /*
+* won't fail because the vec table is big
+* enough to hold all these pages
+*/
+   bio_add_page(bio, page, len, 0);
+   size -= len;
+   } while (idx++ < RESYNC_PAGES && size > 0);
+}
+EXPORT_SYMBOL(md_bio_reset_resync_pages);
+
 #ifndef MODULE
 
 /*
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2c780aa8d07f..efb32ce7a2f1 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -782,4 +782,7 @@ static inline struct page *resync_fetch_page(struct resync_pages *rp,
return NULL;
return rp->pages[idx];
 }
+
+void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
+  int size);
 #endif /* _MD_MD_H */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7901ddc3362f..5dc3fda2fdf7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2085,10 +2085,7 @@ static void process_checks(struct r1bio *r1_bio)
/* Fix variable parts of all bios */
vcnt = (r1_bio->sectors + PAGE_SIZE / 512 - 1) >> (PAGE_SHIFT - 9);
for (i = 0; i < conf->raid_disks * 2; i++) {
-   int j;
-   int size;
blk_status_t status;
-   struct bio_vec *bi;
struct bio *b = r1_bio->bios[i];
struct resync_pages *rp = get_resync_pages(b);
if (b->bi_end_io != end_sync_read)
@@ -2097,8 +2094,6 @@ static void process_checks(struct r1bio *r1_bio)
status = b->bi_status;
bio_reset(b);
b->bi_status = status;
-   b->bi_vcnt = vcnt;
-   b->bi_iter.bi_size = r1_bio->sectors << 9;
b->bi_iter.bi_sector = r1_bio->sector +
conf->mirrors[i].rdev->data_offset;
b->bi_bdev = conf->mirrors[i].rdev->bdev;
@@ -2106,15 +2101,8 @@ static void process_checks(struct r1bio *r1_bio)
rp->raid_bio = r1_bio;
b->bi_private = rp;
 
-   size = b->bi_iter.bi_size;
-   bio_for_each_segment_all(bi, b, j) {
-   bi->bv_offset = 0;
-   if (size > PAGE_SIZE)
-   bi->bv_len = PAGE_SIZE;
-   else
-   bi->bv_len = size;
-   size -= PAGE_SIZE;
-   }
+   /* initialize bvec table again */
+   md_bio_reset_resync_pages(b, rp, r1_bio->sectors << 9);
}
for (primary = 0; primary < conf->raid_disks * 2; primary++)
if (r1_bio->bios[primary]->bi_end_io == end_sync_read &&
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e594ca610f27..cb8e803cd1c2 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2086,8 +2086,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
rp = get_resync_pages(tbio);
bio_reset(tbio);
 
-   tbio->bi_vcnt = vcnt;
-   tbio->bi_iter.bi_size = fbio->bi_iter.bi_size;
+   md_bio_reset_resync_pages(tbio, rp, fbio->bi_iter.bi_size);
+
rp->raid_bio = r10_bio;
tbio->bi_private = rp;
tbio->bi_iter.bi_sector = r10_bio->devs[i].addr;
-- 
2.9.4



[PATCH 1/2] md: remove 'idx' from 'struct resync_pages'

2017-07-12 Thread Ming Lei
bio_add_page() won't fail for resync bios, and the page index is the same
for every bio, so remove 'idx'.

More importantly, the 'idx' field of 'struct resync_pages' is initialized
in the mempool allocator function. That is wrong: mempool is only
responsible for allocation, so we can't rely on it for initialization.
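
A short illustration of why allocator-side initialization is unreliable,
using the raid1 pool from the patch below as the example (sketch only):

	/* mempool_alloc() may hand back a recycled element without running
	 * the pool's alloc callback again, so a field set only in that
	 * callback (like the old rp->idx) keeps whatever value its previous
	 * user left behind */
	r1_bio = mempool_alloc(conf->r1buf_pool, GFP_NOIO);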

Suggested-by: NeilBrown 
Reported-by: NeilBrown 
Fixes: f0250618361d ("md: raid10: don't use bio's vec table to manage resync pages")
Fixes: 98d30c5812c3 ("md: raid1: don't use bio's vec table to manage resync pages")
Signed-off-by: Ming Lei 
---
 drivers/md/md.h | 1 -
 drivers/md/raid1.c  | 6 +++---
 drivers/md/raid10.c | 6 +++---
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 991f0fe2dcc6..2c780aa8d07f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -736,7 +736,6 @@ static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio *bio
 
 /* for managing resync I/O pages */
 struct resync_pages {
-   unsignedidx;/* for get/put page from the pool */
void*raid_bio;
struct page *pages[RESYNC_PAGES];
 };
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3febfc8391fb..7901ddc3362f 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -170,7 +170,6 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
resync_get_all_pages(rp);
}
 
-   rp->idx = 0;
rp->raid_bio = r1_bio;
bio->bi_private = rp;
}
@@ -2619,6 +2618,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
int good_sectors = RESYNC_SECTORS;
int min_bad = 0; /* number of sectors that are bad in all devices */
int idx = sector_to_idx(sector_nr);
+   int page_idx = 0;
 
if (!conf->r1buf_pool)
if (init_resync(conf))
@@ -2846,7 +2846,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
bio = r1_bio->bios[i];
rp = get_resync_pages(bio);
if (bio->bi_end_io) {
-   page = resync_fetch_page(rp, rp->idx++);
+   page = resync_fetch_page(rp, page_idx);
 
/*
 * won't fail because the vec table is big
@@ -2858,7 +2858,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
nr_sectors += len>>9;
sector_nr += len>>9;
sync_blocks -= (len>>9);
-   } while (get_resync_pages(r1_bio->bios[disk]->bi_private)->idx < RESYNC_PAGES);
+   } while (page_idx++ < RESYNC_PAGES);
 
r1_bio->sectors = nr_sectors;
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 5026e7ad51d3..e594ca610f27 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -221,7 +221,6 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
resync_get_all_pages(rp);
}
 
-   rp->idx = 0;
rp->raid_bio = r10_bio;
bio->bi_private = rp;
if (rbio) {
@@ -2853,6 +2852,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
sector_t sectors_skipped = 0;
int chunks_skipped = 0;
sector_t chunk_mask = conf->geo.chunk_mask;
+   int page_idx = 0;
 
if (!conf->r10buf_pool)
if (init_resync(conf))
@@ -3355,7 +3355,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
break;
for (bio= biolist ; bio ; bio=bio->bi_next) {
struct resync_pages *rp = get_resync_pages(bio);
-   page = resync_fetch_page(rp, rp->idx++);
+   page = resync_fetch_page(rp, page_idx);
/*
 * won't fail because the vec table is big enough
 * to hold all these pages
@@ -3364,7 +3364,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
}
nr_sectors += len>>9;
sector_nr += len>>9;
-   } while (get_resync_pages(biolist)->idx < RESYNC_PAGES);
+   } while (page_idx++ < RESYNC_PAGES);
r10_bio->sectors = nr_sectors;
 
while (biolist) {
-- 
2.9.4



Re: [PATCH V2] block: call bio_uninit in bio_endio

2017-07-12 Thread Christoph Hellwig
On Mon, Jul 10, 2017 at 11:40:17AM -0700, Shaohua Li wrote:
> bio_free isn't a good place to free cgroup info. There are a lot of
> cases where a bio is allocated in a special way (for example, on the
> stack) and bio_put, hence bio_free, is never called on it, so we leak
> memory. This patch moves the free to bio_endio, which should be called
> anyway. The bio_uninit call in bio_free is kept, in case bio_endio is
> never called on the bio.
> 
> This assumes ->bi_end_io() doesn't access cgroup info, which seems true
> in my audit.
> 
> This along with Christoph's integrity patch should fix the memory leak
> issue.

Can we please go ahead and:

 (a) remove the now unneeded call in __blkdev_direct_IO_simple
 (b) kill bio_uninit - it's just a wrapper around bio_disassociate_task
 (c) check if we can remove the call in bio_free, which I think we can
 as bi_ioc and bi_css are only assigned from generic_make_request
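
A minimal sketch of where (a)-(c) would lead, assuming bio_uninit() really
reduces to bio_disassociate_task() as stated above (illustrative only, not
the applied patch):

void bio_endio(struct bio *bio)
{
	/* ... existing chaining / remap handling ... */

	/* drop the bi_ioc / bi_css references before completing the bio */
	bio_disassociate_task(bio);

	if (bio->bi_end_io)
		bio->bi_end_io(bio);
}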


[PATCH] block, bfq: fix typos in comments about B-WF2Q+ algorithm

2017-07-12 Thread Hou Tao
The start time of an eligible entity should be less than or equal to
the current virtual time, and an entity in the idle tree has a finish
time greater than the current virtual time.
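
Restated with S_i and F_i as the virtual start and finish times of entity
i and V as the scheduler's current virtual time, the corrected comments
say:

	eligible(i)          <=>  S_i <= V
	i in the idle tree    =>  i is not backlogged and F_i > V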

Signed-off-by: Hou Tao 
---
 block/bfq-iosched.h | 2 +-
 block/bfq-wf2q.c| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 5c3bf98..32feddab 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -52,7 +52,7 @@ struct bfq_entity;
 struct bfq_service_tree {
/* tree for active entities (i.e., those backlogged) */
struct rb_root active;
-   /* tree for idle entities (i.e., not backlogged, with V <= F_i)*/
+   /* tree for idle entities (i.e., not backlogged, with V < F_i)*/
struct rb_root idle;
 
/* idle entity with minimum F_i */
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 8726ede..d03d7b6 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1268,7 +1268,7 @@ static void bfq_update_vtime(struct bfq_service_tree *st, u64 new_value)
  *
  * This function searches the first schedulable entity, starting from the
  * root of the tree and going on the left every time on this side there is
- * a subtree with at least one eligible (start >= vtime) entity. The path on
+ * a subtree with at least one eligible (start <= vtime) entity. The path on
  * the right is followed only if a) the left subtree contains no eligible
  * entities and b) no eligible entity has been found yet.
  */
-- 
2.5.0