date:20180226

Re: [PATCH][RESEND] lightnvm: Avoid validation of default op value

2018-02-26 Thread Matias Bjørling


On 02/26/2018 09:22 PM, Heiner Litz wrote:

Fixes: 38401d231de65 ("lightnvm: set target over-provision on create ioctl")

Signed-off-by: Heiner Litz 
---
  drivers/lightnvm/core.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index dcc9e621e651..74b53fc1b813 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -304,11 +304,9 @@ static int __nvm_config_extended(struct nvm_dev *dev,
}
  
  	/* op not set falls into target's default */

-   if (e->op == 0x)
+   if (e->op == 0x) {
e->op = NVM_TARGET_DEFAULT_OP;
-
-   if (e->op < NVM_TARGET_MIN_OP ||
-   e->op > NVM_TARGET_MAX_OP) {
+   } else if (e->op < NVM_TARGET_MIN_OP || e->op > NVM_TARGET_MAX_OP) {
pr_err("nvm: invalid over provisioning value\n");
return -EINVAL;
}



Thanks Heiner. Applied.

Re: [PATCH] bcache: fix kernel crashes when run fio in a RAID backend bcache device

2018-02-26 Thread Coly Li

Hi linux-raid folks,

I also forward this email to linux-raid mailing list, hope you may find
something suspicious in raid5 code.

Junhui hits a BUG() when bcache runs on top of md raid5, where bio
reference count <= 0. There is no other layer on top of bcache, so the
most suspicious part is md raid5.

Junhui's fix solves the problem, but I'd like to see whether there is
potential bug from raid5 code, which reset memory of bio then free it in
search_free() of bcache.

If raid5 is built on top of bcache, it is easy to say the issue is from
drivers/md/raid5.c:raid5_end_write_request(), there is a bio_reset(bi)
at tail of the function. This function assumes the reference count of
bio is 1, and reset object memory.

But after discussed with Junhui, he tells me bcache is on top of md
raid5, then it is not clear to me how the raid5 taints bio from upper
layer (bcache).

Can anybody have a look and give me any hint ?

P.S. Junhui's fix from bcache side is good, I just want to figure out
what exactly the root cause is.

Thanks in advance.

Coly Li

On 27/02/2018 11:47 AM, Coly Li wrote:
> On 27/02/2018 9:46 AM, tang.jun...@zte.com.cn wrote:
>> From: Tang Junhui 
>>
>> Kernel crashed when run fio in a RAID5 backend bcache device, the call 
>> trace is bellow:
>> [  440.012034] kernel BUG at block/blk-ioc.c:146!
>> [  440.012696] invalid opcode:  [#1] SMP NOPTI
>> [  440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
>> [  440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16
>> /2015
>> [  440.028615] RIP: 0010:put_io_context+0x8b/0x90
>> [  440.029246] RSP: 0018:a8c882b43af8 EFLAGS: 00010246
>> [  440.029990] RAX:  RBX: a8c88294fca0 RCX: 00
>> 0f4240
>> [  440.031006] RDX: 0004 RSI: 0286 RDI: a8c882
>> 94fca0
>> [  440.032030] RBP: a8c882b43b10 R08: 0003 R09: 949cb8
>> 0c1700
>> [  440.033206] R10: 0104 R11: b71c R12: 000
>> 01000
>> [  440.034222] R13:  R14: 949cad84db70 R15: 949cb11
>> bd1e0
>> [  440.035239] FS:  () GS:949cba28() knlGS:
>> 
>> [  440.060190] CS:  0010 DS:  ES:  CR0: 80050033
>> [  440.084967] CR2: 7ff0493ef000 CR3: 0002f1e0a002 CR4: 001
>> 606e0
>> [  440.110498] Call Trace:
>> [  440.135443]  bio_disassociate_task+0x1b/0x60
>> [  440.160355]  bio_free+0x1b/0x60
>> [  440.184666]  bio_put+0x23/0x30
>> [  440.208272]  search_free+0x23/0x40 [bcache]
>> [  440.231448]  cached_dev_write_complete+0x31/0x70 [bcache]
>> [  440.254468]  closure_put+0xb6/0xd0 [bcache]
>> [  440.277087]  request_endio+0x30/0x40 [bcache]
>> [  440.298703]  bio_endio+0xa1/0x120
>> [  440.319644]  handle_stripe+0x418/0x2270 [raid456]
>> [  440.340614]  ? load_balance+0x17b/0x9c0
>> [  440.360506]  handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
>> [  440.380675]  ? __release_stripe+0x15/0x20 [raid456]
>> [  440.400132]  raid5d+0x3ed/0x5d0 [raid456]
>> [  440.419193]  ? schedule+0x36/0x80
>> [  440.437932]  ? schedule_timeout+0x1d2/0x2f0
>> [  440.456136]  md_thread+0x122/0x150
>> [  440.473687]  ? wait_woken+0x80/0x80
>> [  440.491411]  kthread+0x102/0x140
>> [  440.508636]  ? find_pers+0x70/0x70
>> [  440.524927]  ? kthread_associate_blkcg+0xa0/0xa0
>> [  440.541791]  ret_from_fork+0x35/0x40
>> [  440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb 
>> c2 
>> 48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 <0f> 
>> 0b
>> 0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41 
>> [  440.610020] RIP: put_io_context+0x8b/0x90 RSP: a8c882b43af8
>> [  440.628575] ---[ end trace a1fd79d85643a73e ]--
>>
>> All the crash issue happened when a bypass IO coming, in such scenario
>> s->iop.bio is pointed to the s->orig_bio. In search_free(), it finishes the
>> s->orig_bio by calling bio_complete(), and after that, s->iop.bio became
>> invalid, then kernel would crash when calling bio_put(). Maybe its upper
>> layer's faulty, since bio should not be freed before we calling bio_put(),
>> but we'd better calling bio_put() first before calling bio_complete() to
>> notify upper layer ending this bio.
>>
>> This patch moves bio_complete() under bio_put() to avoid kernel crash.
>>
>> Reported-by: Matthias Ferdinand 
>> Tested-by: Matthias Ferdinand 
>> Signed-off-by: Tang Junhui 
> 
> Hi Junhui,
> 
> The problem is from drivers/md/raid5.c:raid5_end_write_request(), there
> is a bio_reset(bi) at tail of the function. This function assumes the
> reference count of bio is 1, and reset object memory.
> 
> IMHO your fix is the simplest, and maybe it is the correct way as your
> change: we should try best to decrease bio reference count before
> calling bio_endio().
> 
> Reviewed-by: Coly Li 
> 
> Thanks.
> 
> Coly Li
> 
>> ---
>>  drivers/md/bcache/request.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/

Re: [PATCH 2/4] block: bio_check_eod() needs to consider partition

2018-02-26 Thread Jiufei Xue



On 2018/2/27 上午8:11, Christoph Hellwig wrote:
> The issue looks real, but please try to do this without another
> partition lookup. 
> 
Yes, I think the partition size test can be moved to
blk_partition_remap(). I will send version 2 later.

Re: [PATCH] bcache: check all uuid elements when start flash only volumes

2018-02-26 Thread Michael Lyle

Reviewed-by: Michael Lyle 

I have this and the other fix in the queue for testing tomorrow and will
pass along a changeset to Jens if all looks good.  Thanks!

Mike

On 02/26/2018 08:15 AM, Coly Li wrote:
> On 27/02/2018 12:10 AM, Coly Li wrote:
>> Commit 2831231d4c3f ("bcache: reduce cache_set devices iteration by
>> devices_max_used") adds c->devices_max_used to reduce iteration of
>> c->uuids elements, this value is updated in bcache_device_attach().
>>
>> But for flash only volume, when calling flash_devs_run(), the function
>> bcache_device_attach() is not called yet and c->devices_max_used is not
>> updated. The unexpected result is, the flash only volume won't be run
>> by flash_devs_run().
>>
>> This patch fixes the issue by iterate all c->uuids elements in
>> flash_devs_run(). c->devices_max_used will be updated properly when
>> bcache_device_attach() gets called.
>>
>> Fixes: 2831231d4c3f ("bcache: reduce cache_set devices iteration by 
>> devices_max_used")
>> Reported-by: Tang Junhui 
>> Signed-off-by: Coly Li 
>> Cc: Michael Lyle 
>> ---
>>  drivers/md/bcache/super.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index 312895788036..4d1d8dfb2d2a 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1274,7 +1274,7 @@ static int flash_devs_run(struct cache_set *c)
>>  struct uuid_entry *u;
>>  
>>  for (u = c->uuids;
>> - u < c->uuids + c->devices_max_used && !ret;
>> + u < c->uuids + c->nr_uuids && !ret;
>>   u++)
>>  if (UUID_FLASH_ONLY(u))
>>  ret = flash_dev_run(c, u);
>>
> 
> Hi Jens and Michael,
> 
> Could you please have a look and pick this fix into 4.16 ? Commit
> 2831231d4c3f ("bcache: reduce cache_set devices iteration by
> devices_max_used") is just merged in 4.16-rc1, so I don't CC stable for
> this fix.
> 
> And I still need a peer code reviewer on this patch, any Reviewed-by is
> welcome.
> 
> Thanks to Junhui Tang for reporting this issue today.
> 
> Coly Li
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Re: [PATCH] bcache: fix kernel crashes when run fio in a RAID backend bcache device

2018-02-26 Thread Coly Li

On 27/02/2018 9:46 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> Kernel crashed when run fio in a RAID5 backend bcache device, the call 
> trace is bellow:
> [  440.012034] kernel BUG at block/blk-ioc.c:146!
> [  440.012696] invalid opcode:  [#1] SMP NOPTI
> [  440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
> [  440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16
> /2015
> [  440.028615] RIP: 0010:put_io_context+0x8b/0x90
> [  440.029246] RSP: 0018:a8c882b43af8 EFLAGS: 00010246
> [  440.029990] RAX:  RBX: a8c88294fca0 RCX: 00
> 0f4240
> [  440.031006] RDX: 0004 RSI: 0286 RDI: a8c882
> 94fca0
> [  440.032030] RBP: a8c882b43b10 R08: 0003 R09: 949cb8
> 0c1700
> [  440.033206] R10: 0104 R11: b71c R12: 000
> 01000
> [  440.034222] R13:  R14: 949cad84db70 R15: 949cb11
> bd1e0
> [  440.035239] FS:  () GS:949cba28() knlGS:
> 
> [  440.060190] CS:  0010 DS:  ES:  CR0: 80050033
> [  440.084967] CR2: 7ff0493ef000 CR3: 0002f1e0a002 CR4: 001
> 606e0
> [  440.110498] Call Trace:
> [  440.135443]  bio_disassociate_task+0x1b/0x60
> [  440.160355]  bio_free+0x1b/0x60
> [  440.184666]  bio_put+0x23/0x30
> [  440.208272]  search_free+0x23/0x40 [bcache]
> [  440.231448]  cached_dev_write_complete+0x31/0x70 [bcache]
> [  440.254468]  closure_put+0xb6/0xd0 [bcache]
> [  440.277087]  request_endio+0x30/0x40 [bcache]
> [  440.298703]  bio_endio+0xa1/0x120
> [  440.319644]  handle_stripe+0x418/0x2270 [raid456]
> [  440.340614]  ? load_balance+0x17b/0x9c0
> [  440.360506]  handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
> [  440.380675]  ? __release_stripe+0x15/0x20 [raid456]
> [  440.400132]  raid5d+0x3ed/0x5d0 [raid456]
> [  440.419193]  ? schedule+0x36/0x80
> [  440.437932]  ? schedule_timeout+0x1d2/0x2f0
> [  440.456136]  md_thread+0x122/0x150
> [  440.473687]  ? wait_woken+0x80/0x80
> [  440.491411]  kthread+0x102/0x140
> [  440.508636]  ? find_pers+0x70/0x70
> [  440.524927]  ? kthread_associate_blkcg+0xa0/0xa0
> [  440.541791]  ret_from_fork+0x35/0x40
> [  440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2 
> 48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 <0f> 
> 0b
> 0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41 
> [  440.610020] RIP: put_io_context+0x8b/0x90 RSP: a8c882b43af8
> [  440.628575] ---[ end trace a1fd79d85643a73e ]--
> 
> All the crash issue happened when a bypass IO coming, in such scenario
> s->iop.bio is pointed to the s->orig_bio. In search_free(), it finishes the
> s->orig_bio by calling bio_complete(), and after that, s->iop.bio became
> invalid, then kernel would crash when calling bio_put(). Maybe its upper
> layer's faulty, since bio should not be freed before we calling bio_put(),
> but we'd better calling bio_put() first before calling bio_complete() to
> notify upper layer ending this bio.
> 
> This patch moves bio_complete() under bio_put() to avoid kernel crash.
> 
> Reported-by: Matthias Ferdinand 
> Tested-by: Matthias Ferdinand 
> Signed-off-by: Tang Junhui 

Hi Junhui,

The problem is from drivers/md/raid5.c:raid5_end_write_request(), there
is a bio_reset(bi) at tail of the function. This function assumes the
reference count of bio is 1, and reset object memory.

IMHO your fix is the simplest, and maybe it is the correct way as your
change: we should try best to decrease bio reference count before
calling bio_endio().

Reviewed-by: Coly Li 

Thanks.

Coly Li

> ---
>  drivers/md/bcache/request.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 643c3021..eb4e305 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -637,11 +637,11 @@ static void do_bio_hook(struct search *s, struct bio 
> *orig_bio)
>  static void search_free(struct closure *cl)
>  {
>   struct search *s = container_of(cl, struct search, cl);
> - bio_complete(s);
>  
>   if (s->iop.bio)
>   bio_put(s->iop.bio);
>  
> + bio_complete(s);
>   closure_debug_destroy(cl);
>   mempool_free(s, s->d->c->search);
>  }
>

Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-26 Thread Joseph Qi

Hi Tejun,

Could you please help take a look at this and give a confirmation?

Thanks,
Joseph

On 18/2/24 09:45, Joseph Qi wrote:
> Hi Tejun,
> 
> On 18/2/23 22:23, Tejun Heo wrote:
>> Hello,
>>
>> On Fri, Feb 23, 2018 at 09:56:54AM +0800, xuejiufei wrote:
 On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
> I still don't get how css_tryget can work here.
>
> The race happens when:
> 1) writeback kworker has found the blkg with rcu;
> 2) blkcg is during offlining and blkg_destroy() has already been called.
> Then, writeback kworker will take queue lock and access the blkg with
> refcount 0.

 Yeah, then tryget would fail and it should go through the root.

>>> In this race, the refcount of blkg becomes zero and is destroyed.
>>> However css may still have refcount, and css_tryget can return success
>>> before other callers put the refcount.
>>> So I don't get how css_tryget can fix this race? Or I wonder if we can
>>> add another function blkg_tryget?
>>
>> IIRC, as long as the blkcg and the device are there, the blkgs aren't
>> gonna be destroyed.  So, if you have a ref to the blkcg through
>> tryget, the blkg shouldn't go away.
>>
> 
> Maybe we have misunderstanding here.
> 
> In this case, blkg doesn't go away as we have rcu protect, but
> blkg_destroy() can be called, in which blkg_put() will put the last
> refcnt and then schedule __blkg_release_rcu().
> 
> css refcnt can't prevent blkcg css from offlining, instead it is css
> online_cnt.
> 
> css_tryget() will only get a refcnt of blkcg css, but can't be
> guaranteed to fail when css is confirmed to kill.
> 
> The race sequence:
> writeback kworker   cgroup_rmdir
>   cgroup_destroy_locked
> kill_css
>   css_killed_ref_fn
> css_killed_work_fn
>   offline_css
> blkcg_css_offline
>   blkcg_bio_issue_check
> rcu_read_lock
> blkg_lookup
>   spin_trylock(q->queue_lock)
>   blkg_destroy
>   spin_unlock(q->queue_lock)
> blk_throtl_bio
>   spin_lock_irq(q->queue_lock)
>   spin_unlock_irq(q->queue_lock)
> rcu_read_unlock
> 
> Thanks,
> Joseph
>

[PATCH] bcache: fix kernel crashes when run fio in a RAID backend bcache device

2018-02-26 Thread tang . junhui

From: Tang Junhui 

Kernel crashed when run fio in a RAID5 backend bcache device, the call 
trace is bellow:
[  440.012034] kernel BUG at block/blk-ioc.c:146!
[  440.012696] invalid opcode:  [#1] SMP NOPTI
[  440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
[  440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16
/2015
[  440.028615] RIP: 0010:put_io_context+0x8b/0x90
[  440.029246] RSP: 0018:a8c882b43af8 EFLAGS: 00010246
[  440.029990] RAX:  RBX: a8c88294fca0 RCX: 00
0f4240
[  440.031006] RDX: 0004 RSI: 0286 RDI: a8c882
94fca0
[  440.032030] RBP: a8c882b43b10 R08: 0003 R09: 949cb8
0c1700
[  440.033206] R10: 0104 R11: b71c R12: 000
01000
[  440.034222] R13:  R14: 949cad84db70 R15: 949cb11
bd1e0
[  440.035239] FS:  () GS:949cba28() knlGS:

[  440.060190] CS:  0010 DS:  ES:  CR0: 80050033
[  440.084967] CR2: 7ff0493ef000 CR3: 0002f1e0a002 CR4: 001
606e0
[  440.110498] Call Trace:
[  440.135443]  bio_disassociate_task+0x1b/0x60
[  440.160355]  bio_free+0x1b/0x60
[  440.184666]  bio_put+0x23/0x30
[  440.208272]  search_free+0x23/0x40 [bcache]
[  440.231448]  cached_dev_write_complete+0x31/0x70 [bcache]
[  440.254468]  closure_put+0xb6/0xd0 [bcache]
[  440.277087]  request_endio+0x30/0x40 [bcache]
[  440.298703]  bio_endio+0xa1/0x120
[  440.319644]  handle_stripe+0x418/0x2270 [raid456]
[  440.340614]  ? load_balance+0x17b/0x9c0
[  440.360506]  handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
[  440.380675]  ? __release_stripe+0x15/0x20 [raid456]
[  440.400132]  raid5d+0x3ed/0x5d0 [raid456]
[  440.419193]  ? schedule+0x36/0x80
[  440.437932]  ? schedule_timeout+0x1d2/0x2f0
[  440.456136]  md_thread+0x122/0x150
[  440.473687]  ? wait_woken+0x80/0x80
[  440.491411]  kthread+0x102/0x140
[  440.508636]  ? find_pers+0x70/0x70
[  440.524927]  ? kthread_associate_blkcg+0xa0/0xa0
[  440.541791]  ret_from_fork+0x35/0x40
[  440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2 
48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 <0f> 0b
0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41 
[  440.610020] RIP: put_io_context+0x8b/0x90 RSP: a8c882b43af8
[  440.628575] ---[ end trace a1fd79d85643a73e ]--

All the crash issue happened when a bypass IO coming, in such scenario
s->iop.bio is pointed to the s->orig_bio. In search_free(), it finishes the
s->orig_bio by calling bio_complete(), and after that, s->iop.bio became
invalid, then kernel would crash when calling bio_put(). Maybe its upper
layer's faulty, since bio should not be freed before we calling bio_put(),
but we'd better calling bio_put() first before calling bio_complete() to
notify upper layer ending this bio.

This patch moves bio_complete() under bio_put() to avoid kernel crash.

Reported-by: Matthias Ferdinand 
Tested-by: Matthias Ferdinand 
Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 643c3021..eb4e305 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -637,11 +637,11 @@ static void do_bio_hook(struct search *s, struct bio 
*orig_bio)
 static void search_free(struct closure *cl)
 {
struct search *s = container_of(cl, struct search, cl);
-   bio_complete(s);
 
if (s->iop.bio)
bio_put(s->iop.bio);
 
+   bio_complete(s);
closure_debug_destroy(cl);
mempool_free(s, s->d->c->search);
 }
-- 
1.8.3.1

Re: [PATCH 2/4] block: bio_check_eod() needs to consider partition

2018-02-26 Thread Christoph Hellwig

The issue looks real, but please try to do this without another
partition lookup.

Re: [PATCH 1/4] block: fix the count of PGPGOUT for WRITE_SAME

2018-02-26 Thread Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig

Re: [PATCH 1/6] genhd: Fix leaked module reference for NVME devices

2018-02-26 Thread Christoph Hellwig

On Mon, Feb 26, 2018 at 01:01:37PM +0100, Jan Kara wrote:
> Commit 8ddcd653257c "block: introduce GENHD_FL_HIDDEN" added handling of
> hidden devices to get_gendisk() but forgot to drop module reference
> which is also acquired by get_disk(). Drop the reference as necessary.
> 
> Arguably the function naming here is misleading as put_disk() is *not*
> the counterpart of get_disk() but let's fix that in the follow up
> commit since that will be more intrusive.

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCH 1/4] block: fix the count of PGPGOUT for WRITE_SAME

2018-02-26 Thread Bart Van Assche

On Mon, 2018-02-26 at 20:04 +0800, Jiufei Xue wrote:
> The vm counters is counted in sectors, so we should do the conversation
> in submit_bio.
> 
> Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and
> partitions index")
> 
> Signed-off-by: Jiufei Xue 
> ---
>  block/blk-core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2d1a7bb..6d82c4f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2434,7 +2434,7 @@ blk_qc_t submit_bio(struct bio *bio)
>   unsigned int count;
>  
>   if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
> - count = queue_logical_block_size(bio->bi_disk->queue);
> + count = queue_logical_block_size(bio->bi_disk->queue) 
> >> 9;
>   else
>   count = bio_sectors(bio);

Since this is a fix for a kernel v4.14 change, please add a
"Cc: sta...@vger.kernel.org" tag.

Thanks,

Bart.

Re: [PATCH 0/2] blk-mq: kyber: fix domain token leak during requeue

2018-02-26 Thread Omar Sandoval

On Fri, Feb 23, 2018 at 11:36:55PM +0800, Ming Lei wrote:
> Hi,
> 
> This two patch fixes domain token leak on kyber scheduler, and actually
> fixes one IO hang issue.
> 
> Thanks,
> 
> Ming Lei (2):
>   blk-mq: don't call io sched's .requeue_request when requeueing rq to
> ->dispatch
>   block: kyber: fix domain token leak during requeue
> 
>  block/blk-mq.c| 4 +++-
>  block/kyber-iosched.c | 1 +
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> -- 
> 2.9.5
> 

I didn't get to add my reviewed-by before Jens queued this up, but
thanks for fixing this, Ming.

Re: [PATCH] nbd: fix return value in error handling path

2018-02-26 Thread Omar Sandoval

On Mon, Feb 12, 2018 at 11:14:55AM -0600, Gustavo A. R. Silva wrote:
> It seems that the proper value to return in this particular case is the
> one contained into variable new_index instead of ret.
> 
> Addresses-Coverity-ID: 1465148 ("Copy-paste error")
> Fixes: e46c7287b1c2 ("nbd: add a basic netlink interface")

Reviewed-by: Omar Sandoval 

> Signed-off-by: Gustavo A. R. Silva 
> ---
>  drivers/block/nbd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 5f2a424..86258b0 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -1591,7 +1591,7 @@ static int nbd_genl_connect(struct sk_buff *skb, struct 
> genl_info *info)
>   if (new_index < 0) {
>   mutex_unlock(&nbd_index_mutex);
>   printk(KERN_ERR "nbd: failed to add new 
> device\n");
> - return ret;
> + return new_index;
>   }
>   nbd = idr_find(&nbd_index_idr, new_index);
>   }
> -- 
> 2.7.4
>

Re: [PATCH RFC] sbitmap: Use lock/unlock atomic bitops

2018-02-26 Thread Omar Sandoval

On Sun, Feb 18, 2018 at 05:05:06AM -0800, Tejun Heo wrote:
> sbitmap is used to allocate tags.  The free and alloc paths use a
> memory ordering scheme similar to the one used by waitqueue, where the
> waiter and waker synchronize around set_current_state().
> 
> This doesn't seem sufficient for sbitmap given that a tag may get
> released and re-acquired without the allocator blocking at all.  Once
> the bit for the tag is cleared, the tag may be reused immediately and
> due to the lack of memory ordering around bit clearing, its memory
> accesses may race against the ones from before the clearing.
> 
> Given that the bits are the primary synchronization mechanism, they
> should be ordered memory-wise.  This patch replaces waitqueue-style
> memory barriers with clear_bit_unlock() in sbitmap_clear_bit() and
> test_and_set_bit_lock() in __sbitmap_get_word().
> 
> Signed-off-by: Tejun Heo 
> ---
> Hello,
> 
> Spotted this while verifying the timeout fix.  I didn't check the
> whole code, so although unlikely it's possible that the removed mb's
> are needed from elsewhere, so the RFC.  Only boot tested.
> 
> Thanks.
> 
>  include/linux/sbitmap.h |3 ++-
>  lib/sbitmap.c   |   17 ++---
>  2 files changed, 8 insertions(+), 12 deletions(-)
> 
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -297,7 +297,8 @@ static inline void sbitmap_set_bit(struc
>  
>  static inline void sbitmap_clear_bit(struct sbitmap *sb, unsigned int bitnr)
>  {
> - clear_bit(SB_NR_TO_BIT(sb, bitnr), __sbitmap_word(sb, bitnr));
> + /* paired with test_and_set_bit_lock() in __sbitmap_get_word() */
> + clear_bit_unlock(SB_NR_TO_BIT(sb, bitnr), __sbitmap_word(sb, bitnr));
>  }

I agree that we want this, but for a different reason than you described
in your changelog: a sbitmap can be used without a sbitmap_queue, so it
should provide the proper memory ordering. For the sbitmap_queue case,
the compiler/processor could also reorder something after the
clear_bit() and before the smp_mb__after_atomic(), which is also wrong.

>  static inline int sbitmap_test_bit(struct sbitmap *sb, unsigned int bitnr)
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -100,7 +100,8 @@ static int __sbitmap_get_word(unsigned l
>   return -1;
>   }
>  
> - if (!test_and_set_bit(nr, word))
> + /* paired with clear_bit_unlock() in sbitmap_clear_bit() */
> + if (!test_and_set_bit_lock(nr, word))
>   break;

test_and_set_bit_lock() is an ACQUIRE operation which has weaker
guarantees than the full barrier implied by test_and_set_bit(), but
ACQUIRE is enough here, so I agree with this part, too.

>   hint = nr + 1;
> @@ -432,14 +433,9 @@ static void sbq_wake_up(struct sbitmap_q
>   int wait_cnt;
>  
>   /*
> -  * Pairs with the memory barrier in set_current_state() to ensure the
> -  * proper ordering of clear_bit()/waitqueue_active() in the waker and
> -  * test_and_set_bit()/prepare_to_wait()/finish_wait() in the waiter. See
> -  * the comment on waitqueue_active(). This is __after_atomic because we
> -  * just did clear_bit() in the caller.
> +  * Memory ordering is handled by sbitmap_clear_bit() and
> +  * __sbitmap_get_word().
>*/
> - smp_mb__after_atomic();
> -

This, I'm not convinced that we can get rid of. clear_bit_unlock() is a
RELEASE operation, not a full barrier, so the waitqueue_active() read
can be reordered before the clear_bit() store. Imagine we get this
interleaving:

waitqueue_active() -> false |
| /* bitmap is full, allocation fails */
| prepare_to_wait()
clear_bit_unlock()  |

We should've woken up the waiter, but we didn't.

>   ws = sbq_wake_ptr(sbq);
>   if (!ws)
>   return;
> @@ -481,10 +477,9 @@ void sbitmap_queue_wake_all(struct sbitm
>   int i, wake_index;
>  
>   /*
> -  * Pairs with the memory barrier in set_current_state() like in
> -  * sbq_wake_up().
> +  * Memory ordering is handled by sbitmap_clear_bit() and
> +  * __sbitmap_get_word().
>*/
> - smp_mb();

Same idea here.

>   wake_index = atomic_read(&sbq->wake_index);
>   for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
>   struct sbq_wait_state *ws = &sbq->ws[wake_index];

So I think we want a patch for the test_and_set_bit_lock() and
clear_bit_unlock(), but the rest should stay as-is.

Re: [PATCH V15 06/22] mmc: block: Add blk-mq support

2018-02-26 Thread Dmitry Osipenko

On 22.02.2018 20:54, Dmitry Osipenko wrote:
> On 22.02.2018 10:42, Adrian Hunter wrote:
>> On 21/02/18 22:50, Dmitry Osipenko wrote:
>>> On 29.11.2017 16:41, Adrian Hunter wrote:
 Define and use a blk-mq queue. Discards and flushes are processed
 synchronously, but reads and writes asynchronously. In order to support
 slow DMA unmapping, DMA unmapping is not done until after the next request
 is started. That means the request is not completed until then. If there is
 no next request then the completion is done by queued work.
>>>
>>> Hello,
>>>
>>> I'm using (running linux-next and doing some upstream development for) some 
>>> old
>>> NVIDIA Tegra tablet that has built-in (internal) and external MMC's and 
>>> with the
>>> blk-mq being enabled I'm observing a soft lockup. The lockup seems is
>>> reproducible quite reliably by running fsck on any MMC partition, sometimes
>>> kernels lockups on boot during probing partitions table (weirdly only when 
>>> both
>>> SDHCI's are present, i.e. internal storage enabled in DT and external SD is
>>> inserted/enabled) and it also lockups pretty quickly in a case of just a 
>>> general
>>> use. Reverting mmc/ commits up to 1bec43a3b18 ("Remove option not to use
>>> blk-mq") and disabling CONFIG_MMC_MQ_DEFAULT makes everything working fine
>>> again. There is also a third SDHCI populated with built-in WiFi/Bluetooth 
>>> SDIO
>>> and I'm observing odd MMC timeouts with the blk-mq enabled, disabling
>>> CONFIG_MMC_MQ_DEFAULT fixes these timeouts as well.
>>>
>>> Any thoughts?
>>
>> SDIO (unless it is a combo card) should be unaffected by changes to the
>> block driver.

I don't know whether it's a combo card or not. Where I can find info about that?
Is it mentioned in sysfs somewhere? Alternatively you may take a brief look at
what brcmfmac driver does, maybe it will tell you immediately whether blk-mq
affects it or not. And if it's not affected, then it could be that there is some
other issue that is masked by a properly working block driver.

>> I don't have any ideas.  Adding more NVIDIA people.
>>>
>>> WiFi issue
>>> 
>>>
>>> [   38.247006] mmc2: Timeout waiting for hardware interrupt.
>>> [   38.247027] brcmfmac: brcmf_escan_timeout: timer expired
>>> [   38.247036] mmc2: sdhci:  SDHCI REGISTER DUMP ===
>>> [   38.247047] mmc2: sdhci: Sys addr:  0x | Version:  0x0001
>>> [   38.247055] mmc2: sdhci: Blk size:  0x7008 | Blk cnt:  0x
>>> [   38.247062] mmc2: sdhci: Argument:  0x2108 | Trn mode: 0x0013
>>> [   38.247070] mmc2: sdhci: Present:   0x01d7 | Host ctl: 0x0013
>>> [   38.247077] mmc2: sdhci: Power: 0x0001 | Blk gap:  0x
>>> [   38.247084] mmc2: sdhci: Wake-up:   0x | Clock:0x0007
>>> [   38.247091] mmc2: sdhci: Timeout:   0x000e | Int stat: 0x
>>> [   38.247098] mmc2: sdhci: Int enab:  0x02ff000b | Sig enab: 0x02fc000b
>>> [   38.247105] mmc2: sdhci: AC12 err:  0x | Slot int: 0x
>>> [   38.247112] mmc2: sdhci: Caps:  0x61ff30b0 | Caps_1:   0x
>>> [   38.247119] mmc2: sdhci: Cmd:   0x353a | Max curr: 0x0001
>>> [   38.247126] mmc2: sdhci: Resp[0]:   0x1800 | Resp[1]:  0x08002db5
>>> [   38.247133] mmc2: sdhci: Resp[2]:   0x16da8000 | Resp[3]:  0x0400
>>> [   38.247139] mmc2: sdhci: Host ctl2: 0x
>>> [   38.247146] mmc2: sdhci: ADMA Err:  0x | ADMA Ptr: 0x17c47200
>>> [   38.247152] mmc2: sdhci: 
>>> [   38.247250] brcmfmac: brcmf_sdio_readframes: read 520 bytes from channel 
>>> 1
>>> failed: -84
>>> [   38.247274] brcmfmac: brcmf_sdio_rxfail: abort command, terminate frame, 
>>> send NAK
>>> [   40.807019] brcmfmac: brcmf_sdio_bus_rxctl: resumed on timeout
>>> [   40.807042] brcmfmac: brcmf_notify_escan_complete: Scan abort failed
>>> [   48.487007] mmc2: Timeout waiting for hardware interrupt.
>>> [   48.487057] mmc2: sdhci:  SDHCI REGISTER DUMP ===
>>> [   48.487096] mmc2: sdhci: Sys addr:  0x | Version:  0x0001
>>> [   48.487128] mmc2: sdhci: Blk size:  0x7040 | Blk cnt:  0x0001
>>> [   48.487160] mmc2: sdhci: Argument:  0x2140 | Trn mode: 0x0013
>>> [   48.487191] mmc2: sdhci: Present:   0x01d7 | Host ctl: 0x0013
>>> [   48.487221] mmc2: sdhci: Power: 0x0001 | Blk gap:  0x
>>> [   48.487251] mmc2: sdhci: Wake-up:   0x | Clock:0x0007
>>> [   48.487281] mmc2: sdhci: Timeout:   0x000e | Int stat: 0x
>>> [   48.487313] mmc2: sdhci: Int enab:  0x02ff000b | Sig enab: 0x02fc000b
>>> [   48.487343] mmc2: sdhci: AC12 err:  0x | Slot int: 0x
>>> [   48.487374] mmc2: sdhci: Caps:  0x61ff30b0 | Caps_1:   0x
>>> [   48.487404] mmc2: sdhci: Cmd:   0x353a | Max curr: 0x0001
>>> [   48.487435] mmc2: sdhci: Resp[0]:   0x1000 | Resp[1]:  0x08002db5
>>> [   48.487466] mmc2: sdhci

Re: [PATCH 3/4] block: display the correct diskname for bio

2018-02-26 Thread Omar Sandoval

On Mon, Feb 26, 2018 at 08:04:41PM +0800, Jiufei Xue wrote:
> bio_devname use __bdevname to display the device name, and can
> only show the major and minor of the part0.
> Fix this by using disk_name to display the correct name.

Besides adding a Fixes: tag like Jens said,

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jiufei Xue 
> ---
>  block/partition-generic.c | 6 ++
>  include/linux/bio.h   | 4 +---
>  2 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/block/partition-generic.c b/block/partition-generic.c
> index 91622db..08dabcd 100644
> --- a/block/partition-generic.c
> +++ b/block/partition-generic.c
> @@ -51,6 +51,12 @@ const char *bdevname(struct block_device *bdev, char *buf)
>  
>  EXPORT_SYMBOL(bdevname);
>  
> +const char *bio_devname(struct bio *bio, char *buf)
> +{
> + return disk_name(bio->bi_disk, bio->bi_partno, buf);
> +}
> +EXPORT_SYMBOL(bio_devname);
> +
>  /*
>   * There's very little reason to use this, you should really
>   * have a struct block_device just about everywhere and use
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index d0eb659..ce547a2 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -511,6 +511,7 @@ extern struct bio *bio_copy_user_iov(struct request_queue 
> *,
>  extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
>  extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
>  extern unsigned int bvec_nr_vecs(unsigned short idx);
> +extern const char *bio_devname(struct bio *bio, char *buffer);
>  
>  #define bio_set_dev(bio, bdev)   \
>  do { \
> @@ -529,9 +530,6 @@ extern struct bio *bio_copy_user_iov(struct request_queue 
> *,
>  #define bio_dev(bio) \
>   disk_devt((bio)->bi_disk)
>  
> -#define bio_devname(bio, buf) \
> - __bdevname(bio_dev(bio), (buf))
> -
>  #ifdef CONFIG_BLK_CGROUP
>  int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state 
> *blkcg_css);
>  void bio_disassociate_task(struct bio *bio);
> -- 
> 1.9.4
> 
>

Re: [PATCH 2/4] block: bio_check_eod() needs to consider partition

2018-02-26 Thread Omar Sandoval

On Mon, Feb 26, 2018 at 08:04:38PM +0800, Jiufei Xue wrote:
> bio_check_eod() should get the capacity of partiton, not the whole
> disk.
> 
> Signed-off-by: Jiufei Xue 
> ---
>  block/blk-core.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 6d82c4f..ef6677d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2023,7 +2023,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
> struct bio *bio)
>   return BLK_QC_T_NONE;
>  }
>  
> -static void handle_bad_sector(struct bio *bio)
> +static void handle_bad_sector(struct bio *bio, sector_t maxsector)
>  {
>   char b[BDEVNAME_SIZE];
>  
> @@ -2031,7 +2031,7 @@ static void handle_bad_sector(struct bio *bio)
>   printk(KERN_INFO "%s: rw=%d, want=%Lu, limit=%Lu\n",
>   bio_devname(bio, b), bio->bi_opf,
>   (unsigned long long)bio_end_sector(bio),
> - (long long)get_capacity(bio->bi_disk));
> + (long long)maxsector);
>  }
>  
>  #ifdef CONFIG_FAIL_MAKE_REQUEST
> @@ -2131,12 +2131,20 @@ static inline int blk_partition_remap(struct bio *bio)
>  static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
>  {
>   sector_t maxsector;
> + struct hd_struct *part;
>  
>   if (!nr_sectors)
>   return 0;
>  
>   /* Test device or partition size, when known. */
> - maxsector = get_capacity(bio->bi_disk);
> + rcu_read_lock();
> + part = __disk_get_part(bio->bi_disk, bio->bi_partno);
> + if (part)
> + maxsector = part_nr_sects_read(part);
> + else
> + maxsector = get_capacity(bio->bi_disk);

This case means that bio->bi_partno was invalid, right? Shouldn't this
be an error? I see that this is copied from guard_bio_eod(), but that
serves a slightly different purpose.

> + rcu_read_unlock();
> +
>   if (maxsector) {
>   sector_t sector = bio->bi_iter.bi_sector;
>  
> @@ -2146,7 +2154,7 @@ static inline int bio_check_eod(struct bio *bio, 
> unsigned int nr_sectors)
>* without checking the size of the device, e.g., when
>* mounting a device.
>*/
> - handle_bad_sector(bio);
> + handle_bad_sector(bio, maxsector);
>   return 1;
>   }
>   }
> -- 
> 1.9.4
> 
>

[PATCH][RESEND] lightnvm: Avoid validation of default op value

2018-02-26 Thread Heiner Litz

Fixes: 38401d231de65 ("lightnvm: set target over-provision on create ioctl")

Signed-off-by: Heiner Litz 
---
 drivers/lightnvm/core.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index dcc9e621e651..74b53fc1b813 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -304,11 +304,9 @@ static int __nvm_config_extended(struct nvm_dev *dev,
}
 
/* op not set falls into target's default */
-   if (e->op == 0x)
+   if (e->op == 0x) {
e->op = NVM_TARGET_DEFAULT_OP;
-
-   if (e->op < NVM_TARGET_MIN_OP ||
-   e->op > NVM_TARGET_MAX_OP) {
+   } else if (e->op < NVM_TARGET_MIN_OP || e->op > NVM_TARGET_MAX_OP) {
pr_err("nvm: invalid over provisioning value\n");
return -EINVAL;
}
-- 
2.14.1

Re: [PATCH 1/4] block: fix the count of PGPGOUT for WRITE_SAME

2018-02-26 Thread Omar Sandoval

On Mon, Feb 26, 2018 at 08:04:35PM +0800, Jiufei Xue wrote:
> The vm counters is counted in sectors, so we should do the conversation
> in submit_bio.
> 
> Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and
> partitions index")

The Fixes line shouldn't be wrapped. Besides that,

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jiufei Xue 
> ---
>  block/blk-core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2d1a7bb..6d82c4f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2434,7 +2434,7 @@ blk_qc_t submit_bio(struct bio *bio)
>   unsigned int count;
>  
>   if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
> - count = queue_logical_block_size(bio->bi_disk->queue);
> + count = queue_logical_block_size(bio->bi_disk->queue) 
> >> 9;
>   else
>   count = bio_sectors(bio);
>  
> -- 
> 1.9.4
>

Re: [PATCH][RESEND] lightnvm: Avoid validation of default op value

2018-02-26 Thread Javier Gonzalez

> On 26 Feb 2018, at 21.22, Heiner Litz  wrote:
> 
> Fixes: 38401d231de65 ("lightnvm: set target over-provision on create ioctl")
> 
> Signed-off-by: Heiner Litz 
> ---
> drivers/lightnvm/core.c | 6 ++
> 1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
> index dcc9e621e651..74b53fc1b813 100644
> --- a/drivers/lightnvm/core.c
> +++ b/drivers/lightnvm/core.c
> @@ -304,11 +304,9 @@ static int __nvm_config_extended(struct nvm_dev *dev,
>   }
> 
>   /* op not set falls into target's default */
> - if (e->op == 0x)
> + if (e->op == 0x) {
>   e->op = NVM_TARGET_DEFAULT_OP;
> -
> - if (e->op < NVM_TARGET_MIN_OP ||
> - e->op > NVM_TARGET_MAX_OP) {
> + } else if (e->op < NVM_TARGET_MIN_OP || e->op > NVM_TARGET_MAX_OP) {
>   pr_err("nvm: invalid over provisioning value\n");
>   return -EINVAL;
>   }
> --
> 2.14.1

Thanks for applying the changes. Looks good to me now.



signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] lightnvm: Avoid validation of default op value Avoid validation of default op value, if default value is given. In preparation of using the rsvd field of the extended config for additional

2018-02-26 Thread Javier Gonzalez

> On 26 Feb 2018, at 20.45, Heiner Litz  wrote:
> 
> Fixes: 38401d231de65 ("lightnvm: set target over-provision on create ioctl")
> 
> Signed-off-by: Heiner Litz 
> ---
> drivers/lightnvm/core.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
> index dcc9e621e651..4e6095e2af06 100644
> --- a/drivers/lightnvm/core.c
> +++ b/drivers/lightnvm/core.c
> @@ -304,10 +304,9 @@ static int __nvm_config_extended(struct nvm_dev *dev,
>   }
> 
>   /* op not set falls into target's default */
> - if (e->op == 0x)
> + if (e->op == 0x) {
>   e->op = NVM_TARGET_DEFAULT_OP;
> -
> - if (e->op < NVM_TARGET_MIN_OP ||
> + } else if (e->op < NVM_TARGET_MIN_OP ||
>   e->op > NVM_TARGET_MAX_OP) {

While you are at it, can you fix this and put it in a single line - not
sure why this ended up this way...

>   pr_err("nvm: invalid over provisioning value\n");
>   return -EINVAL;
> --
> 2.14.1

Apart from the comment, the patch looks good. However, the description
went over to the subject. I guess Matias can fix this when picking it
up.

Reviewed-by: Javier González 



signature.asc
Description: Message signed with OpenPGP

[PATCH] lightnvm: Avoid validation of default op value Avoid validation of default op value, if default value is given. In preparation of using the rsvd field of the extended config for additional par

2018-02-26 Thread Heiner Litz

 Fixes: 38401d231de65 ("lightnvm: set target over-provision on create ioctl")

Signed-off-by: Heiner Litz 
---
 drivers/lightnvm/core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index dcc9e621e651..4e6095e2af06 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -304,10 +304,9 @@ static int __nvm_config_extended(struct nvm_dev *dev,
}
 
/* op not set falls into target's default */
-   if (e->op == 0x)
+   if (e->op == 0x) {
e->op = NVM_TARGET_DEFAULT_OP;
-
-   if (e->op < NVM_TARGET_MIN_OP ||
+   } else if (e->op < NVM_TARGET_MIN_OP ||
e->op > NVM_TARGET_MAX_OP) {
pr_err("nvm: invalid over provisioning value\n");
return -EINVAL;
-- 
2.14.1

Re: [PATCH] blktrace_api.h: fix comment for struct blk_user_trace_setup

2018-02-26 Thread Jens Axboe

On 1/26/18 5:58 PM, Eric Biggers wrote:
> From: Eric Biggers 
> 
> 'struct blk_user_trace_setup' is passed to BLKTRACESETUP, not
> BLKTRACESTART.

Applied, thanks.

-- 
Jens Axboe

Re: [PATCH] blktrace_api.h: fix comment for struct blk_user_trace_setup

2018-02-26 Thread Eric Biggers

On Fri, Jan 26, 2018 at 04:58:06PM -0800, Eric Biggers wrote:
> From: Eric Biggers 
> 
> 'struct blk_user_trace_setup' is passed to BLKTRACESETUP, not
> BLKTRACESTART.
> 
> Signed-off-by: Eric Biggers 
> ---
>  include/uapi/linux/blktrace_api.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/blktrace_api.h 
> b/include/uapi/linux/blktrace_api.h
> index 20d1490d6377..3c50e07ee833 100644
> --- a/include/uapi/linux/blktrace_api.h
> +++ b/include/uapi/linux/blktrace_api.h
> @@ -131,7 +131,7 @@ enum {
>  #define BLKTRACE_BDEV_SIZE   32
>  
>  /*
> - * User setup structure passed with BLKTRACESTART
> + * User setup structure passed with BLKTRACESETUP
>   */
>  struct blk_user_trace_setup {
>   char name[BLKTRACE_BDEV_SIZE];  /* output */
> -- 
> 2.16.0.rc1.238.g530d649a79-goog
> 

Ping.

Re: [PATCH 17/19] lightnvm: pblk: implement get log report chunk

2018-02-26 Thread Matias Bjørling


On 02/26/2018 02:17 PM, Javier González wrote:

From: Javier González 

In preparation of pblk supporting 2.0, implement the get log report
chunk in pblk.

This patch only replicates de bad block functionality as the rest of the
metadata requires new pblk functionality (e.g., wear-index to implement
wear-leveling). 



"This functionality will come in future patches."

I think the  above belongs in the cover letter, not in the patch 
description.




Signed-off-by: Javier González 
---
  drivers/lightnvm/pblk-core.c  | 121 +++
  drivers/lightnvm/pblk-init.c  | 218 ++
  drivers/lightnvm/pblk-sysfs.c |  67 +
  drivers/lightnvm/pblk.h   |  20 
  4 files changed, 346 insertions(+), 80 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1d7c6313f3d9..ce6a7cfdba66 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -44,11 +44,12 @@ static void pblk_line_mark_bb(struct work_struct *work)
  }
  
  static void pblk_mark_bb(struct pblk *pblk, struct pblk_line *line,

-struct ppa_addr *ppa)
+struct ppa_addr ppa_addr)
  {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
-   int pos = pblk_ppa_to_pos(geo, *ppa);
+   struct ppa_addr *ppa;
+   int pos = pblk_ppa_to_pos(geo, ppa_addr);
  
  	pr_debug("pblk: erase failed: line:%d, pos:%d\n", line->id, pos);

atomic_long_inc(&pblk->erase_failed);
@@ -58,6 +59,15 @@ static void pblk_mark_bb(struct pblk *pblk, struct pblk_line 
*line,
pr_err("pblk: attempted to erase bb: line:%d, pos:%d\n",
line->id, pos);
  
+	/* Not necessary to mark bad blocks on 2.0 spec. */

+   if (geo->c.version == NVM_OCSSD_SPEC_20)
+   return;
+
+   ppa = kmalloc(sizeof(struct ppa_addr), GFP_ATOMIC);
+   if (!ppa)
+   return;
+
+   *ppa = ppa_addr;
pblk_gen_run_ws(pblk, NULL, ppa, pblk_line_mark_bb,
GFP_ATOMIC, pblk->bb_wq);
  }
@@ -69,16 +79,8 @@ static void __pblk_end_io_erase(struct pblk *pblk, struct 
nvm_rq *rqd)
line = &pblk->lines[pblk_ppa_to_line(rqd->ppa_addr)];
atomic_dec(&line->left_seblks);
  
-	if (rqd->error) {

-   struct ppa_addr *ppa;
-
-   ppa = kmalloc(sizeof(struct ppa_addr), GFP_ATOMIC);
-   if (!ppa)
-   return;
-
-   *ppa = rqd->ppa_addr;
-   pblk_mark_bb(pblk, line, ppa);
-   }
+   if (rqd->error)
+   pblk_mark_bb(pblk, line, rqd->ppa_addr);
  
  	atomic_dec(&pblk->inflight_io);

  }
@@ -92,6 +94,50 @@ static void pblk_end_io_erase(struct nvm_rq *rqd)
mempool_free(rqd, pblk->e_rq_pool);
  }
  
+/*

+ * Get information for all chunks from the device.
+ *
+ * The caller is responsible for freeing the returned structure
+ */
+struct nvm_chk_meta *pblk_chunk_get_info(struct pblk *pblk)
+{
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = &dev->geo;
+   struct nvm_chk_meta *meta;
+   struct ppa_addr ppa;
+   unsigned long len;
+   int ret;
+
+   ppa.ppa = 0;
+
+   len = geo->all_chunks * sizeof(*meta);
+   meta = kzalloc(len, GFP_KERNEL);
+   if (!meta)
+   return ERR_PTR(-ENOMEM);
+
+   ret = nvm_get_chunk_meta(dev, meta, ppa, geo->all_chunks);
+   if (ret) {
+   pr_err("pblk: could not get chunk metadata (%d)\n", ret);
+   kfree(meta);
+   return ERR_PTR(-EIO);
+   }
+
+   return meta;
+}
+
+struct nvm_chk_meta *pblk_chunk_get_off(struct pblk *pblk,
+ struct nvm_chk_meta *meta,
+ struct ppa_addr ppa)
+{
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = &dev->geo;
+   int ch_off = ppa.m.grp * geo->c.num_chk * geo->num_lun;
+   int lun_off = ppa.m.pu * geo->c.num_chk;
+   int chk_off = ppa.m.chk;
+
+   return meta + ch_off + lun_off + chk_off;
+}
+
  void __pblk_map_invalidate(struct pblk *pblk, struct pblk_line *line,
   u64 paddr)
  {
@@ -1094,10 +1140,38 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
pblk_line *line,
return 1;
  }
  
+static int pblk_prepare_new_line(struct pblk *pblk, struct pblk_line *line)

+{
+   struct pblk_line_meta *lm = &pblk->lm;
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = &dev->geo;
+   int blk_to_erase = atomic_read(&line->blk_in_line);
+   int i;
+
+   for (i = 0; i < lm->blk_per_line; i++) {
+   int state = line->chks[i].state;
+   struct pblk_lun *rlun = &pblk->luns[i];
+
+   /* Free chunks should not be erased */
+   if (state &

Re: [PATCH V3 00/19] lightnvm: pblk: implement 2.0 support

2018-02-26 Thread Matias Bjørling


On 02/26/2018 07:27 PM, Javier Gonzalez wrote:

On 26 Feb 2018, at 19.24, Matias Bjørling  wrote:

On 02/26/2018 07:21 PM, Javier Gonzalez wrote:

On 26 Feb 2018, at 19.19, Matias Bjørling  wrote:

On 02/26/2018 02:16 PM, Javier González wrote:

# Changes since V2:
Apply Matias' feedback:
  - Remove generic nvm_id identify structure.
  - Do not remap capabilities (cap) to media and controlled capabilities
(mccap). Instead, add a comment to prevent confusion when
crosschecking with 2.0 spec.
  - Change maxoc and maxocpu defaults from 1 block to the max number of
blocks.
  - Re-implement the generic geometry to use nvm_geo on both device and
targets. Maintain nvm_common_geo to make it easier to copy the common
part of the geometry (without having to overwrite target-specific
fields, which is ugly and error prone). Matias, if you still want to
get rid of this, we can do it.


I do, the variables should go directly in nvm_geo. Thanks.

Ok. Is the rest ok with you?

I'll go through it when the rebase is posted. Most of the patches is
dependent on the first patch.


As it is now, it will be basically %s/geo->c.X/geo->X/g. If you can look
at the first patch now, I can put all comments in a single version more,
instead of going through one extra cycle...




Also the variable dev_geo -> geo. I'm good with just getting the updated 
version before posting the full patch set again.

Re: [PATCH V3 00/19] lightnvm: pblk: implement 2.0 support

2018-02-26 Thread Javier Gonzalez

> On 26 Feb 2018, at 19.24, Matias Bjørling  wrote:
> 
> On 02/26/2018 07:21 PM, Javier Gonzalez wrote:
>>> On 26 Feb 2018, at 19.19, Matias Bjørling  wrote:
>>> 
>>> On 02/26/2018 02:16 PM, Javier González wrote:
 # Changes since V2:
 Apply Matias' feedback:
  - Remove generic nvm_id identify structure.
  - Do not remap capabilities (cap) to media and controlled capabilities
(mccap). Instead, add a comment to prevent confusion when
crosschecking with 2.0 spec.
  - Change maxoc and maxocpu defaults from 1 block to the max number of
blocks.
  - Re-implement the generic geometry to use nvm_geo on both device and
targets. Maintain nvm_common_geo to make it easier to copy the common
part of the geometry (without having to overwrite target-specific
fields, which is ugly and error prone). Matias, if you still want to
get rid of this, we can do it.
>>> 
>>> I do, the variables should go directly in nvm_geo. Thanks.
>> Ok. Is the rest ok with you?
> I'll go through it when the rebase is posted. Most of the patches is
> dependent on the first patch.

As it is now, it will be basically %s/geo->c.X/geo->X/g. If you can look
at the first patch now, I can put all comments in a single version more,
instead of going through one extra cycle...




signature.asc
Description: Message signed with OpenPGP

Re: [PATCH V3 00/19] lightnvm: pblk: implement 2.0 support

2018-02-26 Thread Matias Bjørling


On 02/26/2018 07:21 PM, Javier Gonzalez wrote:

On 26 Feb 2018, at 19.19, Matias Bjørling  wrote:

On 02/26/2018 02:16 PM, Javier González wrote:

# Changes since V2:
Apply Matias' feedback:
  - Remove generic nvm_id identify structure.
  - Do not remap capabilities (cap) to media and controlled capabilities
(mccap). Instead, add a comment to prevent confusion when
crosschecking with 2.0 spec.
  - Change maxoc and maxocpu defaults from 1 block to the max number of
blocks.
  - Re-implement the generic geometry to use nvm_geo on both device and
targets. Maintain nvm_common_geo to make it easier to copy the common
part of the geometry (without having to overwrite target-specific
fields, which is ugly and error prone). Matias, if you still want to
get rid of this, we can do it.


I do, the variables should go directly in nvm_geo. Thanks.


Ok. Is the rest ok with you?

I'll go through it when the rebase is posted. Most of the patches is 
dependent on the first patch.

Re: [PATCH] lightnvm: pblk: remove unused variable

2018-02-26 Thread Javier Gonzalez


> On 26 Feb 2018, at 19.20, Matias Bjørling  wrote:
> 
> On 02/26/2018 02:18 PM, Javier González wrote:
>> Remove unused variable after a previous cleanup (a8112b631adb)
>> Signed-off-by: Javier González 
>> ---
>>  drivers/lightnvm/pblk-core.c | 3 ---
>>  1 file changed, 3 deletions(-)
>> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
>> index ce6a7cfdba66..e6cb4317bb50 100644
>> --- a/drivers/lightnvm/pblk-core.c
>> +++ b/drivers/lightnvm/pblk-core.c
>> @@ -1071,7 +1071,6 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
>> pblk_line *line,
>>  struct nvm_geo *geo = &dev->geo;
>>  struct pblk_line_meta *lm = &pblk->lm;
>>  struct pblk_line_mgmt *l_mg = &pblk->l_mg;
>> -int nr_bb = 0;
>>  u64 off;
>>  int bit = -1;
>>  int emeta_secs;
>> @@ -1087,8 +1086,6 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
>> pblk_line *line,
>>  bitmap_or(line->map_bitmap, line->map_bitmap, l_mg->bb_aux,
>>  lm->sec_per_line);
>>  line->sec_in_line -= geo->c.clba;
>> -if (bit >= lm->emeta_bb)
>> -nr_bb++;
>>  }
>>  /* Mark smeta metadata sectors as bad sectors */
> 
> Should Fixes be added?

It's not a bug as such, that's why I didn't add it. But be welcome to do
so if you think it's better.


signature.asc
Description: Message signed with OpenPGP

Re: [PATCH V3 00/19] lightnvm: pblk: implement 2.0 support

2018-02-26 Thread Javier Gonzalez

> On 26 Feb 2018, at 19.19, Matias Bjørling  wrote:
> 
> On 02/26/2018 02:16 PM, Javier González wrote:
>> # Changes since V2:
>> Apply Matias' feedback:
>>  - Remove generic nvm_id identify structure.
>>  - Do not remap capabilities (cap) to media and controlled capabilities
>>(mccap). Instead, add a comment to prevent confusion when
>>crosschecking with 2.0 spec.
>>  - Change maxoc and maxocpu defaults from 1 block to the max number of
>>blocks.
>>  - Re-implement the generic geometry to use nvm_geo on both device and
>>targets. Maintain nvm_common_geo to make it easier to copy the common
>>part of the geometry (without having to overwrite target-specific
>>fields, which is ugly and error prone). Matias, if you still want to
>>get rid of this, we can do it.
> 
> I do, the variables should go directly in nvm_geo. Thanks.

Ok. Is the rest ok with you?

Javier


signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] lightnvm: pblk: remove unused variable

2018-02-26 Thread Matias Bjørling


On 02/26/2018 02:18 PM, Javier González wrote:

Remove unused variable after a previous cleanup (a8112b631adb)

Signed-off-by: Javier González 
---
  drivers/lightnvm/pblk-core.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index ce6a7cfdba66..e6cb4317bb50 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1071,7 +1071,6 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
pblk_line *line,
struct nvm_geo *geo = &dev->geo;
struct pblk_line_meta *lm = &pblk->lm;
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
-   int nr_bb = 0;
u64 off;
int bit = -1;
int emeta_secs;
@@ -1087,8 +1086,6 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
pblk_line *line,
bitmap_or(line->map_bitmap, line->map_bitmap, l_mg->bb_aux,
lm->sec_per_line);
line->sec_in_line -= geo->c.clba;
-   if (bit >= lm->emeta_bb)
-   nr_bb++;
}
  
  	/* Mark smeta metadata sectors as bad sectors */




Should Fixes be added?

Re: [PATCH V3 00/19] lightnvm: pblk: implement 2.0 support

2018-02-26 Thread Matias Bjørling


On 02/26/2018 02:16 PM, Javier González wrote:

# Changes since V2:

Apply Matias' feedback:
  - Remove generic nvm_id identify structure.
  - Do not remap capabilities (cap) to media and controlled capabilities
(mccap). Instead, add a comment to prevent confusion when
crosschecking with 2.0 spec.
  - Change maxoc and maxocpu defaults from 1 block to the max number of
blocks.
  - Re-implement the generic geometry to use nvm_geo on both device and
targets. Maintain nvm_common_geo to make it easier to copy the common
part of the geometry (without having to overwrite target-specific
fields, which is ugly and error prone). Matias, if you still want to
get rid of this, we can do it.


I do, the variables should go directly in nvm_geo. Thanks.

Re: [PATCH 0/6 v2] block: Fix races in bdev - gendisk handling

2018-02-26 Thread Jens Axboe

On Mon, Feb 26 2018, Jan Kara wrote:
> On Mon 26-02-18 09:04:40, Jens Axboe wrote:
> > On 2/26/18 5:01 AM, Jan Kara wrote:
> > > Hello,
> > > 
> > > these patches fix races happening when devices are frequently destroyed 
> > > and
> > > recreated in association of block device inode with corresponding gendisk.
> > > Generally when such race happen it results in use-after-free issues, block
> > > device page cache inconsistencies, or other problems. I have verified 
> > > these
> > > patches fix use-after-free issues that could be reproduced by frequent 
> > > creation
> > > and destruction of loop device. Hou Tao has verified that races reported 
> > > by
> > > him in [1] related to gendisk-blkdev association were also fixed. Jens, 
> > > can
> > > you please merge these patches? Thanks!
> > 
> > What are you thinking in terms of target? I'd like to get it into 4.16,
> > but I'd need to know if you are comfortable with that.
> 
> Yeah, 4.16 should be OK. The patches are quite conservative.

OK good, let's get it queued up. Thanks Jan.

-- 
Jens Axboe

Re: [PATCH 0/6 v2] block: Fix races in bdev - gendisk handling

2018-02-26 Thread Jan Kara

On Mon 26-02-18 09:04:40, Jens Axboe wrote:
> On 2/26/18 5:01 AM, Jan Kara wrote:
> > Hello,
> > 
> > these patches fix races happening when devices are frequently destroyed and
> > recreated in association of block device inode with corresponding gendisk.
> > Generally when such race happen it results in use-after-free issues, block
> > device page cache inconsistencies, or other problems. I have verified these
> > patches fix use-after-free issues that could be reproduced by frequent 
> > creation
> > and destruction of loop device. Hou Tao has verified that races reported by
> > him in [1] related to gendisk-blkdev association were also fixed. Jens, can
> > you please merge these patches? Thanks!
> 
> What are you thinking in terms of target? I'd like to get it into 4.16,
> but I'd need to know if you are comfortable with that.

Yeah, 4.16 should be OK. The patches are quite conservative.

Honza

-- 
Jan Kara 
SUSE Labs, CR

Re: [PATCH] bcache: check all uuid elements when start flash only volumes

2018-02-26 Thread Coly Li

On 27/02/2018 12:10 AM, Coly Li wrote:
> Commit 2831231d4c3f ("bcache: reduce cache_set devices iteration by
> devices_max_used") adds c->devices_max_used to reduce iteration of
> c->uuids elements, this value is updated in bcache_device_attach().
> 
> But for flash only volume, when calling flash_devs_run(), the function
> bcache_device_attach() is not called yet and c->devices_max_used is not
> updated. The unexpected result is, the flash only volume won't be run
> by flash_devs_run().
> 
> This patch fixes the issue by iterate all c->uuids elements in
> flash_devs_run(). c->devices_max_used will be updated properly when
> bcache_device_attach() gets called.
> 
> Fixes: 2831231d4c3f ("bcache: reduce cache_set devices iteration by 
> devices_max_used")
> Reported-by: Tang Junhui 
> Signed-off-by: Coly Li 
> Cc: Michael Lyle 
> ---
>  drivers/md/bcache/super.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 312895788036..4d1d8dfb2d2a 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1274,7 +1274,7 @@ static int flash_devs_run(struct cache_set *c)
>   struct uuid_entry *u;
>  
>   for (u = c->uuids;
> -  u < c->uuids + c->devices_max_used && !ret;
> +  u < c->uuids + c->nr_uuids && !ret;
>u++)
>   if (UUID_FLASH_ONLY(u))
>   ret = flash_dev_run(c, u);
> 

Hi Jens and Michael,

Could you please have a look and pick this fix into 4.16 ? Commit
2831231d4c3f ("bcache: reduce cache_set devices iteration by
devices_max_used") is just merged in 4.16-rc1, so I don't CC stable for
this fix.

And I still need a peer code reviewer on this patch, any Reviewed-by is
welcome.

Thanks to Junhui Tang for reporting this issue today.

Coly Li

[PATCH] bcache: check all uuid elements when start flash only volumes

2018-02-26 Thread Coly Li

Commit 2831231d4c3f ("bcache: reduce cache_set devices iteration by
devices_max_used") adds c->devices_max_used to reduce iteration of
c->uuids elements, this value is updated in bcache_device_attach().

But for flash only volume, when calling flash_devs_run(), the function
bcache_device_attach() is not called yet and c->devices_max_used is not
updated. The unexpected result is, the flash only volume won't be run
by flash_devs_run().

This patch fixes the issue by iterate all c->uuids elements in
flash_devs_run(). c->devices_max_used will be updated properly when
bcache_device_attach() gets called.

Fixes: 2831231d4c3f ("bcache: reduce cache_set devices iteration by 
devices_max_used")
Reported-by: Tang Junhui 
Signed-off-by: Coly Li 
Cc: Michael Lyle 
---
 drivers/md/bcache/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 312895788036..4d1d8dfb2d2a 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1274,7 +1274,7 @@ static int flash_devs_run(struct cache_set *c)
struct uuid_entry *u;
 
for (u = c->uuids;
-u < c->uuids + c->devices_max_used && !ret;
+u < c->uuids + c->nr_uuids && !ret;
 u++)
if (UUID_FLASH_ONLY(u))
ret = flash_dev_run(c, u);
-- 
2.16.1

Re: [PATCH][RESEND] block: copy ioprio in __bio_clone_fast()

2018-02-26 Thread Jens Axboe

On 2/26/18 6:29 AM, Hannes Reinecke wrote:
> We need to copy the io priority, too; otherwise the clone will run
> with a different priority than the original one.

What about bio_clone_bioset()?

-- 
Jens Axboe

Re: [PATCH 0/6 v2] block: Fix races in bdev - gendisk handling

2018-02-26 Thread Jens Axboe

On 2/26/18 5:01 AM, Jan Kara wrote:
> Hello,
> 
> these patches fix races happening when devices are frequently destroyed and
> recreated in association of block device inode with corresponding gendisk.
> Generally when such race happen it results in use-after-free issues, block
> device page cache inconsistencies, or other problems. I have verified these
> patches fix use-after-free issues that could be reproduced by frequent 
> creation
> and destruction of loop device. Hou Tao has verified that races reported by
> him in [1] related to gendisk-blkdev association were also fixed. Jens, can
> you please merge these patches? Thanks!

What are you thinking in terms of target? I'd like to get it into 4.16,
but I'd need to know if you are comfortable with that.

-- 
Jens Axboe

Re: [PATCH 0/4] fix a few problems in block layer

2018-02-26 Thread Jens Axboe

On 2/26/18 5:01 AM, Jiufei Xue wrote:
> I have found a few problems while reviewing the patch 74d46992e0d9
> ("block: replace bi_bdev with a gendisk pointer and partitions index"),
> So fix them.

Looks good to me. It's important to include a Fixes line in
the individual patches, though. Otherwise this information
is lost, and it's very useful for folks backporting fixes.

Also, patch 4/4 should have some small sensible commit
message.

Care to resend with these items fixed?

-- 
Jens Axboe

Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem

2018-02-26 Thread Michal Hocko

On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> Hi,
> 
> The Fedora arm-32 build VMs have a somewhat long standing problem
> of hanging when running mkfs.ext4 with a bunch of processes stuck
> in D state. This has been seen as far back as 4.13 but is still
> present on 4.14:
> 
[...]
> This looks like everything is blocked on the writeback completing but
> the writeback has been throttled. According to the infra team, this problem
> is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> quite match since this seems to be completely stuck. Any suggestions to
> narrow the problem down?

How much dirtyable memory does the system have? We do allow only lowmem
to be dirtyable by default on 32b highmem systems. Maybe you have the
lowmem mostly consumed by the kernel memory. Have you tried to enable
highmem_is_dirtyable?
-- 
Michal Hocko
SUSE Labs

[PATCH][RESEND] block: copy ioprio in __bio_clone_fast()

2018-02-26 Thread Hannes Reinecke

We need to copy the io priority, too; otherwise the clone will run
with a different priority than the original one.

Fixes: 43b62ce3ff0a ("block: move bio io prio to a new field")
Signed-off-by: Hannes Reinecke 
---
 block/bio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/bio.c b/block/bio.c
index e1708db48258..e079911c640f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -605,6 +605,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_write_hint = bio_src->bi_write_hint;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
+   bio->bi_ioprio = bio_src->bi_ioprio;
 
bio_clone_blkcg_association(bio, bio_src);
 }
-- 
2.12.3

[PATCH 01/19] lightnvm: simplify geometry structure.

2018-02-26 Thread Javier González

Currently, the device geometry is stored redundantly in the nvm_id and
nvm_geo structures at a device level. Moreover, when instantiating
targets on a specific number of LUNs, these structures are replicated
and manually modified to fit the instance channel and LUN partitioning.

Instead, create a generic geometry around nvm_geo, which can be used by
(i) the underlying device to describe the geometry of the whole device,
and (ii) instances to describe their geometry independently.

Since these share a big part of the geometry, create a nvm_common_geo
structure that keeps the static geoometry values that are shared across
instances.

As we introduce support for 2.0, these structures allow to abstract
spec. specific values and present a common geometry to targets.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c  | 131 ++-
 drivers/lightnvm/pblk-core.c |  16 +-
 drivers/lightnvm/pblk-gc.c   |   2 +-
 drivers/lightnvm/pblk-init.c | 123 +++---
 drivers/lightnvm/pblk-read.c |   2 +-
 drivers/lightnvm/pblk-recovery.c |  14 +-
 drivers/lightnvm/pblk-rl.c   |   2 +-
 drivers/lightnvm/pblk-sysfs.c|  39 +++--
 drivers/lightnvm/pblk-write.c|   2 +-
 drivers/lightnvm/pblk.h  |  93 +--
 drivers/nvme/host/lightnvm.c | 346 +++
 include/linux/lightnvm.h | 208 ---
 12 files changed, 514 insertions(+), 464 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 689c97b97775..43e3d6bb5be6 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -111,6 +111,7 @@ static void nvm_release_luns_err(struct nvm_dev *dev, int 
lun_begin,
 static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, int clear)
 {
struct nvm_dev *dev = tgt_dev->parent;
+   struct nvm_geo *dev_geo = &dev->geo;
struct nvm_dev_map *dev_map = tgt_dev->map;
int i, j;
 
@@ -122,7 +123,7 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, 
int clear)
if (clear) {
for (j = 0; j < ch_map->nr_luns; j++) {
int lun = j + lun_offs[j];
-   int lunid = (ch * dev->geo.nr_luns) + lun;
+   int lunid = (ch * dev_geo->nr_luns) + lun;
 
WARN_ON(!test_and_clear_bit(lunid,
dev->lun_map));
@@ -143,19 +144,20 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
  u16 lun_begin, u16 lun_end,
  u16 op)
 {
+   struct nvm_geo *dev_geo = &dev->geo;
struct nvm_tgt_dev *tgt_dev = NULL;
struct nvm_dev_map *dev_rmap = dev->rmap;
struct nvm_dev_map *dev_map;
struct ppa_addr *luns;
int nr_luns = lun_end - lun_begin + 1;
int luns_left = nr_luns;
-   int nr_chnls = nr_luns / dev->geo.nr_luns;
-   int nr_chnls_mod = nr_luns % dev->geo.nr_luns;
-   int bch = lun_begin / dev->geo.nr_luns;
-   int blun = lun_begin % dev->geo.nr_luns;
+   int nr_chnls = nr_luns / dev_geo->nr_luns;
+   int nr_chnls_mod = nr_luns % dev_geo->nr_luns;
+   int bch = lun_begin / dev_geo->nr_luns;
+   int blun = lun_begin % dev_geo->nr_luns;
int lunid = 0;
int lun_balanced = 1;
-   int prev_nr_luns;
+   int sec_per_lun, prev_nr_luns;
int i, j;
 
nr_chnls = (nr_chnls_mod == 0) ? nr_chnls : nr_chnls + 1;
@@ -173,15 +175,15 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
if (!luns)
goto err_luns;
 
-   prev_nr_luns = (luns_left > dev->geo.nr_luns) ?
-   dev->geo.nr_luns : luns_left;
+   prev_nr_luns = (luns_left > dev_geo->nr_luns) ?
+   dev_geo->nr_luns : luns_left;
for (i = 0; i < nr_chnls; i++) {
struct nvm_ch_map *ch_rmap = &dev_rmap->chnls[i + bch];
int *lun_roffs = ch_rmap->lun_offs;
struct nvm_ch_map *ch_map = &dev_map->chnls[i];
int *lun_offs;
-   int luns_in_chnl = (luns_left > dev->geo.nr_luns) ?
-   dev->geo.nr_luns : luns_left;
+   int luns_in_chnl = (luns_left > dev_geo->nr_luns) ?
+   dev_geo->nr_luns : luns_left;
 
if (lun_balanced && prev_nr_luns != luns_in_chnl)
lun_balanced = 0;
@@ -215,18 +217,22 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
if (!tgt_dev)
goto err_ch;
 
-   memcpy(&tgt_dev->geo, &dev->geo, sizeof(struct nvm_geo));
/* Target device only owns a portion of the physical device */
tgt_dev->geo.nr_chnls = nr_chnls;
-   tgt_dev

[PATCH 05/19] lightnvm: complete geo structure with maxoc*

2018-02-26 Thread Javier González

Complete the generic geometry structure with the maxoc and maxocpu
felds, present in the 2.0 spec.

Signed-off-by: Javier González 
---
 drivers/nvme/host/lightnvm.c | 9 +
 include/linux/lightnvm.h | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index a48950a37b8d..a8f35e8b7953 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -323,6 +323,13 @@ static int nvme_nvm_setup_12(struct nvme_nvm_id12 *id,
dev_geo->c.ws_opt = sec_per_pg;
dev_geo->c.mw_cunits = 8;   /* default to MLC safe values */
 
+   /* Do not impose values for maximum number of open blocks as it is
+* unspecified in 1.2. Users of 1.2 must be aware of this and eventually
+* specify these values through a quirk if restrictions apply.
+*/
+   dev_geo->c.maxoc = dev_geo->all_luns * dev_geo->c.num_chk;
+   dev_geo->c.maxocpu = dev_geo->c.num_chk;
+
dev_geo->c.cap = le32_to_cpu(src->mccap);
 
dev_geo->c.trdt = le32_to_cpu(src->trdt);
@@ -413,6 +420,8 @@ static int nvme_nvm_setup_20(struct nvme_nvm_id20 *id,
dev_geo->c.ws_min = le32_to_cpu(id->ws_min);
dev_geo->c.ws_opt = le32_to_cpu(id->ws_opt);
dev_geo->c.mw_cunits = le32_to_cpu(id->mw_cunits);
+   dev_geo->c.maxoc = le32_to_cpu(id->maxoc);
+   dev_geo->c.maxocpu = le32_to_cpu(id->maxocpu);
 
dev_geo->c.cap = le32_to_cpu(id->mccap);
 
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 57d00644a1c6..e8e6bfd4d0d4 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -280,6 +280,8 @@ struct nvm_common_geo {
u32 ws_min; /* minimum write size */
u32 ws_opt; /* optimal write size */
u32 mw_cunits;  /* distance required for successful read */
+   u32 maxoc;  /* maximum open chunks */
+   u32 maxocpu;/* maximum open chunks per parallel unit */
 
/* device capabilities. Note that this represents capabilities in 1.2
 * and media and controller capabilities in 2.0
-- 
2.7.4

[PATCH 02/19] lightnvm: add controller capabilities to 2.0

2018-02-26 Thread Javier González

Assign missing mccap value on 2.0 path

Signed-off-by: Javier González 
---
 drivers/nvme/host/lightnvm.c | 4 +++-
 include/linux/lightnvm.h | 8 +---
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 8f541331b1a9..10392a664b50 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -318,7 +318,7 @@ static int nvme_nvm_setup_12(struct nvme_nvm_id12 *id,
dev_geo->c.ws_opt = sec_per_pg;
dev_geo->c.mw_cunits = 8;   /* default to MLC safe values */
 
-   dev_geo->c.mccap = le32_to_cpu(src->mccap);
+   dev_geo->c.cap = le32_to_cpu(src->mccap);
 
dev_geo->c.trdt = le32_to_cpu(src->trdt);
dev_geo->c.trdm = le32_to_cpu(src->trdm);
@@ -399,6 +399,8 @@ static int nvme_nvm_setup_20(struct nvme_nvm_id20 *id,
dev_geo->c.ws_opt = le32_to_cpu(id->ws_opt);
dev_geo->c.mw_cunits = le32_to_cpu(id->mw_cunits);
 
+   dev_geo->c.cap = le32_to_cpu(id->mccap);
+
dev_geo->c.trdt = le32_to_cpu(id->trdt);
dev_geo->c.trdm = le32_to_cpu(id->trdm);
dev_geo->c.tprt = le32_to_cpu(id->twrt);
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 4537c0140386..42704c4ca9af 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -273,8 +273,10 @@ struct nvm_common_geo {
u32 ws_opt; /* optimal write size */
u32 mw_cunits;  /* distance required for successful read */
 
-   /* device capabilities */
-   u32 mccap;
+   /* device capabilities. Note that this represents capabilities in 1.2
+* and media and controller capabilities in 2.0
+*/
+   u32 cap;
 
/* device timings */
u32 trdt;   /* Avg. Tread (ns) */
@@ -289,7 +291,7 @@ struct nvm_common_geo {
 
/* 1.2 compatibility */
u8  vmnt;
-   u32 cap;
+   u32 mccap;
u32 dom;
 
u8  mtype;
-- 
2.7.4

[PATCH 04/19] lightnvm: add shorten OCSSD version in geo

2018-02-26 Thread Javier González

Create a shorten version to use in the generic geometry.

Signed-off-by: Javier González 
---
 drivers/nvme/host/lightnvm.c | 6 ++
 include/linux/lightnvm.h | 8 
 2 files changed, 14 insertions(+)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 8befb60eeacb..a48950a37b8d 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -299,6 +299,9 @@ static int nvme_nvm_setup_12(struct nvme_nvm_id12 *id,
dev_geo->major_ver_id = id->ver_id;
dev_geo->minor_ver_id = 2;
 
+   /* Set compacted version for upper layers */
+   dev_geo->c.version = NVM_OCSSD_SPEC_12;
+
dev_geo->nr_chnls = src->num_ch;
dev_geo->nr_luns = src->num_lun;
dev_geo->all_luns = dev_geo->nr_chnls * dev_geo->nr_luns;
@@ -385,6 +388,9 @@ static int nvme_nvm_setup_20(struct nvme_nvm_id20 *id,
dev_geo->major_ver_id = id->mjr;
dev_geo->minor_ver_id = id->mnr;
 
+   /* Set compacted version for upper layers */
+   dev_geo->c.version = NVM_OCSSD_SPEC_20;
+
if (!(dev_geo->major_ver_id == 2 && dev_geo->minor_ver_id == 0)) {
pr_err("nvm: OCSSD version not supported (v%d.%d)\n",
dev_geo->major_ver_id, dev_geo->minor_ver_id);
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index ccaf8a30202d..57d00644a1c6 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -23,6 +23,11 @@ enum {
 #define NVM_LUN_BITS (8)
 #define NVM_CH_BITS  (7)
 
+enum {
+   NVM_OCSSD_SPEC_12 = 12,
+   NVM_OCSSD_SPEC_20 = 20,
+};
+
 struct ppa_addr {
/* Generic structure for all addresses */
union {
@@ -262,6 +267,9 @@ enum {
 
 /* Device common geometry */
 struct nvm_common_geo {
+   /* kernel short version */
+   u8  version;
+
/* chunk geometry */
u32 num_chk;/* chunks per lun */
u32 clba;   /* sectors per chunk */
-- 
2.7.4

[PATCH 03/19] lightnvm: add minor version to generic geometry

2018-02-26 Thread Javier González

Separate the version between major and minor on the generic geometry and
represent it through sysfs in the 2.0 path. The 1.2 path only shows the major
version to preserve the existing user space interface.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c  |  4 ++--
 drivers/nvme/host/lightnvm.c | 25 -
 include/linux/lightnvm.h |  3 ++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 43e3d6bb5be6..96f4e62d383b 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -897,8 +897,8 @@ static int nvm_init(struct nvm_dev *dev)
goto err;
}
 
-   pr_debug("nvm: ver:%u nvm_vendor:%x\n",
-   dev_geo->ver_id,
+   pr_debug("nvm: ver:%u.%u nvm_vendor:%x\n",
+   dev_geo->major_ver_id, dev_geo->minor_ver_id,
dev_geo->c.vmnt);
 
ret = nvm_core_init(dev);
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 10392a664b50..8befb60eeacb 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -295,7 +295,9 @@ static int nvme_nvm_setup_12(struct nvme_nvm_id12 *id,
return -EINVAL;
}
 
-   dev_geo->ver_id = id->ver_id;
+   /* 1.2 spec. only reports a single version id - unfold */
+   dev_geo->major_ver_id = id->ver_id;
+   dev_geo->minor_ver_id = 2;
 
dev_geo->nr_chnls = src->num_ch;
dev_geo->nr_luns = src->num_lun;
@@ -380,7 +382,14 @@ static void nvme_nvm_set_addr_20(struct nvm_addr_format 
*dst,
 static int nvme_nvm_setup_20(struct nvme_nvm_id20 *id,
 struct nvm_geo *dev_geo)
 {
-   dev_geo->ver_id = id->mjr;
+   dev_geo->major_ver_id = id->mjr;
+   dev_geo->minor_ver_id = id->mnr;
+
+   if (!(dev_geo->major_ver_id == 2 && dev_geo->minor_ver_id == 0)) {
+   pr_err("nvm: OCSSD version not supported (v%d.%d)\n",
+   dev_geo->major_ver_id, dev_geo->minor_ver_id);
+   return -EINVAL;
+   }
 
dev_geo->nr_chnls = le16_to_cpu(id->num_grp);
dev_geo->nr_luns = le16_to_cpu(id->num_pu);
@@ -920,7 +929,13 @@ static ssize_t nvm_dev_attr_show(struct device *dev,
attr = &dattr->attr;
 
if (strcmp(attr->name, "version") == 0) {
-   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->ver_id);
+   if (dev_geo->major_ver_id == 1)
+   return scnprintf(page, PAGE_SIZE, "%u\n",
+   dev_geo->major_ver_id);
+   else
+   return scnprintf(page, PAGE_SIZE, "%u.%u\n",
+   dev_geo->major_ver_id,
+   dev_geo->minor_ver_id);
} else if (strcmp(attr->name, "capabilities") == 0) {
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.cap);
} else if (strcmp(attr->name, "read_typ") == 0) {
@@ -1174,7 +1189,7 @@ int nvme_nvm_register_sysfs(struct nvme_ns *ns)
if (!ndev)
return -EINVAL;
 
-   switch (dev_geo->ver_id) {
+   switch (dev_geo->major_ver_id) {
case 1:
return sysfs_create_group(&disk_to_dev(ns->disk)->kobj,
&nvm_dev_attr_group_12);
@@ -1191,7 +1206,7 @@ void nvme_nvm_unregister_sysfs(struct nvme_ns *ns)
struct nvm_dev *ndev = ns->ndev;
struct nvm_geo *dev_geo = &ndev->geo;
 
-   switch (dev_geo->ver_id) {
+   switch (dev_geo->major_ver_id) {
case 1:
sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
&nvm_dev_attr_group_12);
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 42704c4ca9af..ccaf8a30202d 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -309,7 +309,8 @@ struct nvm_common_geo {
 /* Instance geometry */
 struct nvm_geo {
/* device reported version */
-   u8  ver_id;
+   u8  major_ver_id;
+   u8  minor_ver_id;
 
/* instance specific geometry */
int nr_chnls;
-- 
2.7.4

[PATCH 06/19] lightnvm: pblk: check for supported version

2018-02-26 Thread Javier González

At this point, only 1.2 spec is supported, thus check for it. Also,
since device-side L2P is only supported in the 1.2 spec, make sure to
only check its value under 1.2.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-init.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 95ecb0ec736b..6e0567971cf6 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -990,9 +990,15 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
struct pblk *pblk;
int ret;
 
-   if (dev->geo.c.dom & NVM_RSP_L2P) {
+   if (geo->c.version != NVM_OCSSD_SPEC_12) {
+   pr_err("pblk: OCSSD version not supported (%u)\n",
+   geo->c.version);
+   return ERR_PTR(-EINVAL);
+   }
+
+   if (geo->c.version == NVM_OCSSD_SPEC_12 && geo->c.dom & NVM_RSP_L2P) {
pr_err("pblk: host-side L2P table not supported. (%x)\n",
-   dev->geo.c.dom);
+   geo->c.dom);
return ERR_PTR(-EINVAL);
}
 
-- 
2.7.4

[PATCH 07/19] lightnvm: complete 2.0 values in sysfs

2018-02-26 Thread Javier González

Add missing geometry values to sysfs. Namely, maxoc and maxocpu.

Signed-off-by: Javier González 
---
 drivers/nvme/host/lightnvm.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index a8f35e8b7953..4ea553c91756 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -1066,6 +1066,10 @@ static ssize_t nvm_dev_attr_show_20(struct device *dev,
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.ws_min);
} else if (strcmp(attr->name, "ws_opt") == 0) {
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.ws_opt);
+   } else if (strcmp(attr->name, "maxoc") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.maxoc);
+   } else if (strcmp(attr->name, "maxocpu") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.maxocpu);
} else if (strcmp(attr->name, "mw_cunits") == 0) {
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.mw_cunits);
} else if (strcmp(attr->name, "write_typ") == 0) {
@@ -1163,6 +1167,8 @@ static NVM_DEV_ATTR_20_RO(chunks);
 static NVM_DEV_ATTR_20_RO(clba);
 static NVM_DEV_ATTR_20_RO(ws_min);
 static NVM_DEV_ATTR_20_RO(ws_opt);
+static NVM_DEV_ATTR_20_RO(maxoc);
+static NVM_DEV_ATTR_20_RO(maxocpu);
 static NVM_DEV_ATTR_20_RO(mw_cunits);
 static NVM_DEV_ATTR_20_RO(write_typ);
 static NVM_DEV_ATTR_20_RO(write_max);
@@ -1179,6 +1185,8 @@ static struct attribute *nvm_dev_attrs_20[] = {
&dev_attr_clba.attr,
&dev_attr_ws_min.attr,
&dev_attr_ws_opt.attr,
+   &dev_attr_maxoc.attr,
+   &dev_attr_maxocpu.attr,
&dev_attr_mw_cunits.attr,
 
&dev_attr_read_typ.attr,
-- 
2.7.4

[PATCH 13/19] lightnvm: make address conversions depend on generic device

2018-02-26 Thread Javier González

On address conversions, use the generic device, instead of the target
device. This allows to use conversions outside of the target's realm.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c  |  4 ++--
 include/linux/lightnvm.h | 20 ++--
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 92ea6e39dd64..fae5e8c6fcd0 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -588,7 +588,7 @@ static void nvm_ppa_tgt_to_dev(struct nvm_tgt_dev *tgt_dev,
 
for (i = 0; i < nr_ppas; i++) {
nvm_map_to_dev(tgt_dev, &ppa_list[i]);
-   ppa_list[i] = generic_to_dev_addr(tgt_dev, ppa_list[i]);
+   ppa_list[i] = generic_to_dev_addr(tgt_dev->parent, ppa_list[i]);
}
 }
 
@@ -598,7 +598,7 @@ static void nvm_ppa_dev_to_tgt(struct nvm_tgt_dev *tgt_dev,
int i;
 
for (i = 0; i < nr_ppas; i++) {
-   ppa_list[i] = dev_to_generic_addr(tgt_dev, ppa_list[i]);
+   ppa_list[i] = dev_to_generic_addr(tgt_dev->parent, ppa_list[i]);
nvm_map_to_tgt(tgt_dev, &ppa_list[i]);
}
 }
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 8cda1125c34a..2b909592568a 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -405,15 +405,15 @@ struct nvm_dev {
struct list_head targets;
 };
 
-static inline struct ppa_addr generic_to_dev_addr(struct nvm_tgt_dev *tgt_dev,
+static inline struct ppa_addr generic_to_dev_addr(struct nvm_dev *dev,
  struct ppa_addr r)
 {
-   struct nvm_geo *geo = &tgt_dev->geo;
+   struct nvm_geo *dev_geo = &dev->geo;
struct ppa_addr l;
 
-   if (geo->c.version == NVM_OCSSD_SPEC_12) {
+   if (dev_geo->c.version == NVM_OCSSD_SPEC_12) {
struct nvm_addr_format_12 *ppaf =
-   (struct nvm_addr_format_12 *)&geo->c.addrf;
+   (struct nvm_addr_format_12 *)&dev_geo->c.addrf;
 
l.ppa = ((u64)r.g.ch) << ppaf->ch_offset;
l.ppa |= ((u64)r.g.lun) << ppaf->lun_offset;
@@ -422,7 +422,7 @@ static inline struct ppa_addr generic_to_dev_addr(struct 
nvm_tgt_dev *tgt_dev,
l.ppa |= ((u64)r.g.pl) << ppaf->pln_offset;
l.ppa |= ((u64)r.g.sec) << ppaf->sec_offset;
} else {
-   struct nvm_addr_format *lbaf = &geo->c.addrf;
+   struct nvm_addr_format *lbaf = &dev_geo->c.addrf;
 
l.ppa = ((u64)r.m.grp) << lbaf->ch_offset;
l.ppa |= ((u64)r.m.pu) << lbaf->lun_offset;
@@ -433,17 +433,17 @@ static inline struct ppa_addr generic_to_dev_addr(struct 
nvm_tgt_dev *tgt_dev,
return l;
 }
 
-static inline struct ppa_addr dev_to_generic_addr(struct nvm_tgt_dev *tgt_dev,
+static inline struct ppa_addr dev_to_generic_addr(struct nvm_dev *dev,
  struct ppa_addr r)
 {
-   struct nvm_geo *geo = &tgt_dev->geo;
+   struct nvm_geo *dev_geo = &dev->geo;
struct ppa_addr l;
 
l.ppa = 0;
 
-   if (geo->c.version == NVM_OCSSD_SPEC_12) {
+   if (dev_geo->c.version == NVM_OCSSD_SPEC_12) {
struct nvm_addr_format_12 *ppaf =
-   (struct nvm_addr_format_12 *)&geo->c.addrf;
+   (struct nvm_addr_format_12 *)&dev_geo->c.addrf;
 
l.g.ch = (r.ppa & ppaf->ch_mask) >> ppaf->ch_offset;
l.g.lun = (r.ppa & ppaf->lun_mask) >> ppaf->lun_offset;
@@ -452,7 +452,7 @@ static inline struct ppa_addr dev_to_generic_addr(struct 
nvm_tgt_dev *tgt_dev,
l.g.pl = (r.ppa & ppaf->pln_mask) >> ppaf->pln_offset;
l.g.sec = (r.ppa & ppaf->sec_mask) >> ppaf->sec_offset;
} else {
-   struct nvm_addr_format *lbaf = &geo->c.addrf;
+   struct nvm_addr_format *lbaf = &dev_geo->c.addrf;
 
l.m.grp = (r.ppa & lbaf->ch_mask) >> lbaf->ch_offset;
l.m.pu = (r.ppa & lbaf->lun_mask) >> lbaf->lun_offset;
-- 
2.7.4

[PATCH 18/19] lightnvm: pblk: refactor init/exit sequences

2018-02-26 Thread Javier González

Refactor init and exit sequences to improve readability. In the way, fix
bad free ordering on the init error path.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-init.c | 535 +--
 1 file changed, 267 insertions(+), 268 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index cf4f49d48aed..7f88fb937440 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -103,7 +103,40 @@ static void pblk_l2p_free(struct pblk *pblk)
vfree(pblk->trans_map);
 }
 
-static int pblk_l2p_init(struct pblk *pblk)
+static int pblk_l2p_recover(struct pblk *pblk, bool factory_init)
+{
+   struct pblk_line *line = NULL;
+
+   if (factory_init) {
+   pblk_setup_uuid(pblk);
+   } else {
+   line = pblk_recov_l2p(pblk);
+   if (IS_ERR(line)) {
+   pr_err("pblk: could not recover l2p table\n");
+   return -EFAULT;
+   }
+   }
+
+#ifdef CONFIG_NVM_DEBUG
+   pr_info("pblk init: L2P CRC: %x\n", pblk_l2p_crc(pblk));
+#endif
+
+   /* Free full lines directly as GC has not been started yet */
+   pblk_gc_free_full_lines(pblk);
+
+   if (!line) {
+   /* Configure next line for user data */
+   line = pblk_line_get_first_data(pblk);
+   if (!line) {
+   pr_err("pblk: line list corrupted\n");
+   return -EFAULT;
+   }
+   }
+
+   return 0;
+}
+
+static int pblk_l2p_init(struct pblk *pblk, bool factory_init)
 {
sector_t i;
struct ppa_addr ppa;
@@ -119,7 +152,7 @@ static int pblk_l2p_init(struct pblk *pblk)
for (i = 0; i < pblk->rl.nr_secs; i++)
pblk_trans_map_set(pblk, i, ppa);
 
-   return 0;
+   return pblk_l2p_recover(pblk, factory_init);
 }
 
 static void pblk_rwb_free(struct pblk *pblk)
@@ -268,86 +301,113 @@ static int pblk_core_init(struct pblk *pblk)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
+   int max_write_ppas;
+
+   atomic64_set(&pblk->user_wa, 0);
+   atomic64_set(&pblk->pad_wa, 0);
+   atomic64_set(&pblk->gc_wa, 0);
+   pblk->user_rst_wa = 0;
+   pblk->pad_rst_wa = 0;
+   pblk->gc_rst_wa = 0;
+
+   atomic_long_set(&pblk->nr_flush, 0);
+   pblk->nr_flush_rst = 0;
 
pblk->pgs_in_buffer = geo->c.mw_cunits * geo->c.ws_opt * geo->all_luns;
 
+   pblk->min_write_pgs = geo->c.ws_opt * (geo->c.csecs / PAGE_SIZE);
+   max_write_ppas = pblk->min_write_pgs * geo->all_luns;
+   pblk->max_write_pgs = (max_write_ppas < NVM_MAX_VLBA) ?
+   max_write_ppas : NVM_MAX_VLBA;
+   pblk_set_sec_per_write(pblk, pblk->min_write_pgs);
+
+   if (pblk->max_write_pgs > PBLK_MAX_REQ_ADDRS) {
+   pr_err("pblk: cannot support device max_phys_sect\n");
+   return -EINVAL;
+   }
+
+   pblk->pad_dist = kzalloc((pblk->min_write_pgs - 1) * sizeof(atomic64_t),
+   GFP_KERNEL);
+   if (!pblk->pad_dist)
+   return -ENOMEM;
+
if (pblk_init_global_caches(pblk))
-   return -ENOMEM;
+   goto fail_free_pad_dist;
 
/* Internal bios can be at most the sectors signaled by the device. */
pblk->page_bio_pool = mempool_create_page_pool(NVM_MAX_VLBA, 0);
if (!pblk->page_bio_pool)
-   goto free_global_caches;
+   goto fail_free_global_caches;
 
pblk->gen_ws_pool = mempool_create_slab_pool(PBLK_GEN_WS_POOL_SIZE,
pblk_ws_cache);
if (!pblk->gen_ws_pool)
-   goto free_page_bio_pool;
+   goto fail_free_page_bio_pool;
 
pblk->rec_pool = mempool_create_slab_pool(geo->all_luns,
pblk_rec_cache);
if (!pblk->rec_pool)
-   goto free_gen_ws_pool;
+   goto fail_free_gen_ws_pool;
 
pblk->r_rq_pool = mempool_create_slab_pool(geo->all_luns,
pblk_g_rq_cache);
if (!pblk->r_rq_pool)
-   goto free_rec_pool;
+   goto fail_free_rec_pool;
 
pblk->e_rq_pool = mempool_create_slab_pool(geo->all_luns,
pblk_g_rq_cache);
if (!pblk->e_rq_pool)
-   goto free_r_rq_pool;
+   goto fail_free_r_rq_pool;
 
pblk->w_rq_pool = mempool_create_slab_pool(geo->all_luns,
pblk_w_rq_cache);
if (!pblk->w_rq_pool)
-   goto free_e_rq_pool;
+   goto fail_free_e_rq_pool;
 
pblk->close_wq = alloc_workqueue("pblk-close-wq",
WQ_MEM_RECLAI

[PATCH 14/19] nvme: make nvme_get_log_ext available

2018-02-26 Thread Javier González

Make nvme_get_log_ext available outside of core.c.

This is in preparation for using it in lightnvm.c

Signed-off-by: Javier González 
---
 drivers/nvme/host/nvme.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1ca08f4993ba..505f797f8c6c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -396,6 +396,9 @@ int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
 int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
 int nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
 
+int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+u8 log_page, void *log, size_t size, size_t offset);
+
 extern const struct attribute_group nvme_ns_id_attr_group;
 extern const struct block_device_operations nvme_ns_head_ops;
 
-- 
2.7.4

[PATCH 19/19] lightnvm: pblk: implement 2.0 support

2018-02-26 Thread Javier González

Implement 2.0 support in pblk. This includes the address formatting and
mapping paths, as well as the sysfs entries for them.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-init.c  |  57 ++--
 drivers/lightnvm/pblk-sysfs.c |  36 ++--
 drivers/lightnvm/pblk.h   | 198 --
 3 files changed, 233 insertions(+), 58 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 7f88fb937440..fe04996cd3ba 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -231,20 +231,63 @@ static int pblk_set_addrf_12(struct nvm_geo *geo,
return dst->blk_offset + src->blk_len;
 }
 
+static int pblk_set_addrf_20(struct nvm_geo *geo,
+struct nvm_addr_format *adst,
+struct pblk_addr_format *udst)
+{
+   struct nvm_addr_format *src = &geo->c.addrf;
+
+   adst->ch_len = get_count_order(geo->num_ch);
+   adst->lun_len = get_count_order(geo->num_lun);
+   adst->chk_len = src->chk_len;
+   adst->sec_len = src->sec_len;
+
+   adst->sec_offset = 0;
+   adst->ch_offset = adst->sec_len;
+   adst->lun_offset = adst->ch_offset + adst->ch_len;
+   adst->chk_offset = adst->lun_offset + adst->lun_len;
+
+   adst->sec_mask = ((1ULL << adst->sec_len) - 1) << adst->sec_offset;
+   adst->chk_mask = ((1ULL << adst->chk_len) - 1) << adst->chk_offset;
+   adst->lun_mask = ((1ULL << adst->lun_len) - 1) << adst->lun_offset;
+   adst->ch_mask = ((1ULL << adst->ch_len) - 1) << adst->ch_offset;
+
+   udst->sec_stripe = geo->c.ws_opt;
+   udst->ch_stripe = geo->num_ch;
+   udst->lun_stripe = geo->num_lun;
+
+   udst->sec_lun_stripe = udst->sec_stripe * udst->ch_stripe;
+   udst->sec_ws_stripe = udst->sec_lun_stripe * udst->lun_stripe;
+
+   return adst->chk_offset + adst->chk_len;
+}
+
 static int pblk_set_addrf(struct pblk *pblk)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
int mod;
 
-   div_u64_rem(geo->c.clba, pblk->min_write_pgs, &mod);
-   if (mod) {
-   pr_err("pblk: bad configuration of sectors/pages\n");
+   switch (geo->c.version) {
+   case NVM_OCSSD_SPEC_12:
+   div_u64_rem(geo->c.clba, pblk->min_write_pgs, &mod);
+   if (mod) {
+   pr_err("pblk: bad configuration of sectors/pages\n");
+   return -EINVAL;
+   }
+
+   pblk->addrf_len = pblk_set_addrf_12(geo, (void *)&pblk->addrf);
+   break;
+   case NVM_OCSSD_SPEC_20:
+   pblk->addrf_len = pblk_set_addrf_20(geo, (void *)&pblk->addrf,
+   &pblk->uaddrf);
+   break;
+   default:
+   pr_err("pblk: OCSSD revision not supported (%d)\n",
+   geo->c.version);
return -EINVAL;
}
 
-   pblk->addrf_len = pblk_set_addrf_12(geo, (void *)&pblk->addrf);
-
return 0;
 }
 
@@ -1113,7 +1156,9 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
struct pblk *pblk;
int ret;
 
-   if (geo->c.version != NVM_OCSSD_SPEC_12) {
+   /* pblk supports 1.2 and 2.0 versions */
+   if (!(geo->c.version == NVM_OCSSD_SPEC_12 ||
+   geo->c.version == NVM_OCSSD_SPEC_20)) {
pr_err("pblk: OCSSD version not supported (%u)\n",
geo->c.version);
return ERR_PTR(-EINVAL);
diff --git a/drivers/lightnvm/pblk-sysfs.c b/drivers/lightnvm/pblk-sysfs.c
index 1ce5b956c622..42ec00e8d393 100644
--- a/drivers/lightnvm/pblk-sysfs.c
+++ b/drivers/lightnvm/pblk-sysfs.c
@@ -113,15 +113,16 @@ static ssize_t pblk_sysfs_ppaf(struct pblk *pblk, char 
*page)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
-   struct nvm_addr_format_12 *ppaf;
-   struct nvm_addr_format_12 *geo_ppaf;
ssize_t sz = 0;
 
-   ppaf = (struct nvm_addr_format_12 *)&pblk->addrf;
-   geo_ppaf = (struct nvm_addr_format_12 *)&geo->c.addrf;
+   if (geo->c.version == NVM_OCSSD_SPEC_12) {
+   struct nvm_addr_format_12 *ppaf =
+   (struct nvm_addr_format_12 *)&pblk->addrf;
+   struct nvm_addr_format_12 *geo_ppaf =
+   (struct nvm_addr_format_12 *)&geo->c.addrf;
 
-   sz = snprintf(page, PAGE_SIZE,
-   
"pblk:(s:%d)ch:%d/%d,lun:%d/%d,blk:%d/%d,pg:%d/%d,pl:%d/%d,sec:%d/%d\n",
+   sz = snprintf(page, PAGE_SIZE,
+   
"pblk:(s:%d)ch:%d/%d,lun:%d/%d,blk:%d/%d,pg:%d/%d,pl:%d/%d,sec:%d/%d\n",
pblk->addrf_len,
ppaf->ch_offset, ppaf->ch_len,

[PATCH] lightnvm: pblk: remove unused variable

2018-02-26 Thread Javier González

Remove unused variable after a previous cleanup (a8112b631adb)

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-core.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index ce6a7cfdba66..e6cb4317bb50 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1071,7 +1071,6 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
pblk_line *line,
struct nvm_geo *geo = &dev->geo;
struct pblk_line_meta *lm = &pblk->lm;
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
-   int nr_bb = 0;
u64 off;
int bit = -1;
int emeta_secs;
@@ -1087,8 +1086,6 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
pblk_line *line,
bitmap_or(line->map_bitmap, line->map_bitmap, l_mg->bb_aux,
lm->sec_per_line);
line->sec_in_line -= geo->c.clba;
-   if (bit >= lm->emeta_bb)
-   nr_bb++;
}
 
/* Mark smeta metadata sectors as bad sectors */
-- 
2.7.4

[PATCH 15/19] lightnvm: implement get log report chunk helpers

2018-02-26 Thread Javier González

From: Javier González 

The 2.0 spec provides a report chunk log page that can be retrieved
using the stangard nvme get log page. This replaces the dedicated
get/put bad block table in 1.2.

This patch implements the helper functions to allow targets retrieve the
chunk metadata using get log page

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c  | 11 +++
 drivers/nvme/host/lightnvm.c | 74 
 include/linux/lightnvm.h | 24 ++
 3 files changed, 109 insertions(+)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index fae5e8c6fcd0..0f4114cb48ee 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -719,6 +719,17 @@ static void nvm_free_rqd_ppalist(struct nvm_tgt_dev 
*tgt_dev,
nvm_dev_dma_free(tgt_dev->parent, rqd->ppa_list, rqd->dma_ppa_list);
 }
 
+int nvm_get_chunk_meta(struct nvm_tgt_dev *tgt_dev, struct nvm_chk_meta *meta,
+   struct ppa_addr ppa, int nchks)
+{
+   struct nvm_dev *dev = tgt_dev->parent;
+
+   nvm_ppa_tgt_to_dev(tgt_dev, &ppa, 1);
+
+   return dev->ops->get_chk_meta(tgt_dev->parent, meta,
+   (sector_t)ppa.ppa, nchks);
+}
+EXPORT_SYMBOL(nvm_get_chunk_meta);
 
 int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
   int nr_ppas, int type)
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 377e6525e53f..7c4af132d072 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -35,6 +35,10 @@ enum nvme_nvm_admin_opcode {
nvme_nvm_admin_set_bb_tbl   = 0xf1,
 };
 
+enum nvme_nvm_log_page {
+   NVME_NVM_LOG_REPORT_CHUNK   = 0xca,
+};
+
 struct nvme_nvm_ph_rw {
__u8opcode;
__u8flags;
@@ -236,6 +240,16 @@ struct nvme_nvm_id20 {
__u8vs[1024];
 };
 
+struct nvme_nvm_chk_meta {
+   __u8state;
+   __u8type;
+   __u8wi;
+   __u8rsvd[5];
+   __le64  slba;
+   __le64  cnlb;
+   __le64  wp;
+};
+
 /*
  * Check we didn't inadvertently grow the command struct
  */
@@ -252,6 +266,9 @@ static inline void _nvme_nvm_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_nvm_bb_tbl) != 64);
BUILD_BUG_ON(sizeof(struct nvme_nvm_id20_addrf) != 8);
BUILD_BUG_ON(sizeof(struct nvme_nvm_id20) != NVME_IDENTIFY_DATA_SIZE);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_chk_meta) != 32);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_chk_meta) !=
+   sizeof(struct nvm_chk_meta));
 }
 
 static void nvme_nvm_set_addr_12(struct nvm_addr_format_12 *dst,
@@ -558,6 +575,61 @@ static int nvme_nvm_set_bb_tbl(struct nvm_dev *nvmdev, 
struct ppa_addr *ppas,
return ret;
 }
 
+/*
+ * Expect the lba in device format
+ */
+static int nvme_nvm_get_chk_meta(struct nvm_dev *ndev,
+struct nvm_chk_meta *meta,
+sector_t slba, int nchks)
+{
+   struct nvm_geo *dev_geo = &ndev->geo;
+   struct nvme_ns *ns = ndev->q->queuedata;
+   struct nvme_ctrl *ctrl = ns->ctrl;
+   struct nvme_nvm_chk_meta *dev_meta = (struct nvme_nvm_chk_meta *)meta;
+   struct ppa_addr ppa;
+   size_t left = nchks * sizeof(struct nvme_nvm_chk_meta);
+   size_t log_pos, offset, len;
+   int ret, i;
+
+   /* Normalize lba address space to obtain log offset */
+   ppa.ppa = slba;
+   ppa = dev_to_generic_addr(ndev, ppa);
+
+   log_pos = ppa.m.chk;
+   log_pos += ppa.m.pu * dev_geo->c.num_chk;
+   log_pos += ppa.m.grp * dev_geo->num_lun * dev_geo->c.num_chk;
+
+   offset = log_pos * sizeof(struct nvme_nvm_chk_meta);
+
+   while (left) {
+   len = min_t(unsigned int, left, ctrl->max_hw_sectors << 9);
+
+   ret = nvme_get_log_ext(ctrl, ns, NVME_NVM_LOG_REPORT_CHUNK,
+   dev_meta, len, offset);
+   if (ret) {
+   dev_err(ctrl->device, "Get REPORT CHUNK log error\n");
+   break;
+   }
+
+   for (i = 0; i < len; i += sizeof(struct nvme_nvm_chk_meta)) {
+   meta->state = dev_meta->state;
+   meta->type = dev_meta->type;
+   meta->wi = dev_meta->wi;
+   meta->slba = le64_to_cpu(dev_meta->slba);
+   meta->cnlb = le64_to_cpu(dev_meta->cnlb);
+   meta->wp = le64_to_cpu(dev_meta->wp);
+
+   meta++;
+   dev_meta++;
+   }
+
+   offset += len;
+   left -= len;
+   }
+
+   return ret;
+}
+
 static inline void nvme_nvm_rqtocmd(struct nvm_rq *rqd, struct nvme_ns *ns,
struct nvme_nvm_command *c)
 {
@@ -689,6 +761,8 @@ static str

[PATCH 16/19] lightnvm: define chunk states

2018-02-26 Thread Javier González

Define chunk states as given in the 2.0 spec. Also, add an extra chunk
state that signals that the chunk is in use by the host. This allows for
the chunk metadata to be "owned" by a target when active, thus
completing the chunk state machine from the host perspective and
facilitating sanity checks. This state is transparent for the device.

Signed-off-by: Javier González 
---
 include/linux/lightnvm.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 776ffdc2e137..74830d5dbd6e 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -232,6 +232,20 @@ struct nvm_addr_format {
u64 rsv_mask[2];
 };
 
+enum {
+   /* Chunk states */
+   NVM_CHK_ST_FREE =   1 << 0,
+   NVM_CHK_ST_CLOSED = 1 << 1,
+   NVM_CHK_ST_OPEN =   1 << 2,
+   NVM_CHK_ST_OFFLINE =1 << 3,
+   NVM_CHK_ST_HOST_USE =   1 << 7,
+
+   /* Chunk types */
+   NVM_CHK_TP_W_SEQ =  1 << 0,
+   NVM_CHK_TP_W_RAN =  1 << 1,
+   NVM_CHK_TP_SZ_SPEC =1 << 4,
+};
+
 /*
  * Note: The structure size is linked to nvme_nvm_chk_meta such that the same
  * buffer can be used when converting from little endian to cpu addressing.
-- 
2.7.4

[PATCH 17/19] lightnvm: pblk: implement get log report chunk

2018-02-26 Thread Javier González

From: Javier González 

In preparation of pblk supporting 2.0, implement the get log report
chunk in pblk.

This patch only replicates de bad block functionality as the rest of the
metadata requires new pblk functionality (e.g., wear-index to implement
wear-leveling). This functionality will come in future patches.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-core.c  | 121 +++
 drivers/lightnvm/pblk-init.c  | 218 ++
 drivers/lightnvm/pblk-sysfs.c |  67 +
 drivers/lightnvm/pblk.h   |  20 
 4 files changed, 346 insertions(+), 80 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1d7c6313f3d9..ce6a7cfdba66 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -44,11 +44,12 @@ static void pblk_line_mark_bb(struct work_struct *work)
 }
 
 static void pblk_mark_bb(struct pblk *pblk, struct pblk_line *line,
-struct ppa_addr *ppa)
+struct ppa_addr ppa_addr)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
-   int pos = pblk_ppa_to_pos(geo, *ppa);
+   struct ppa_addr *ppa;
+   int pos = pblk_ppa_to_pos(geo, ppa_addr);
 
pr_debug("pblk: erase failed: line:%d, pos:%d\n", line->id, pos);
atomic_long_inc(&pblk->erase_failed);
@@ -58,6 +59,15 @@ static void pblk_mark_bb(struct pblk *pblk, struct pblk_line 
*line,
pr_err("pblk: attempted to erase bb: line:%d, pos:%d\n",
line->id, pos);
 
+   /* Not necessary to mark bad blocks on 2.0 spec. */
+   if (geo->c.version == NVM_OCSSD_SPEC_20)
+   return;
+
+   ppa = kmalloc(sizeof(struct ppa_addr), GFP_ATOMIC);
+   if (!ppa)
+   return;
+
+   *ppa = ppa_addr;
pblk_gen_run_ws(pblk, NULL, ppa, pblk_line_mark_bb,
GFP_ATOMIC, pblk->bb_wq);
 }
@@ -69,16 +79,8 @@ static void __pblk_end_io_erase(struct pblk *pblk, struct 
nvm_rq *rqd)
line = &pblk->lines[pblk_ppa_to_line(rqd->ppa_addr)];
atomic_dec(&line->left_seblks);
 
-   if (rqd->error) {
-   struct ppa_addr *ppa;
-
-   ppa = kmalloc(sizeof(struct ppa_addr), GFP_ATOMIC);
-   if (!ppa)
-   return;
-
-   *ppa = rqd->ppa_addr;
-   pblk_mark_bb(pblk, line, ppa);
-   }
+   if (rqd->error)
+   pblk_mark_bb(pblk, line, rqd->ppa_addr);
 
atomic_dec(&pblk->inflight_io);
 }
@@ -92,6 +94,50 @@ static void pblk_end_io_erase(struct nvm_rq *rqd)
mempool_free(rqd, pblk->e_rq_pool);
 }
 
+/*
+ * Get information for all chunks from the device.
+ *
+ * The caller is responsible for freeing the returned structure
+ */
+struct nvm_chk_meta *pblk_chunk_get_info(struct pblk *pblk)
+{
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = &dev->geo;
+   struct nvm_chk_meta *meta;
+   struct ppa_addr ppa;
+   unsigned long len;
+   int ret;
+
+   ppa.ppa = 0;
+
+   len = geo->all_chunks * sizeof(*meta);
+   meta = kzalloc(len, GFP_KERNEL);
+   if (!meta)
+   return ERR_PTR(-ENOMEM);
+
+   ret = nvm_get_chunk_meta(dev, meta, ppa, geo->all_chunks);
+   if (ret) {
+   pr_err("pblk: could not get chunk metadata (%d)\n", ret);
+   kfree(meta);
+   return ERR_PTR(-EIO);
+   }
+
+   return meta;
+}
+
+struct nvm_chk_meta *pblk_chunk_get_off(struct pblk *pblk,
+ struct nvm_chk_meta *meta,
+ struct ppa_addr ppa)
+{
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = &dev->geo;
+   int ch_off = ppa.m.grp * geo->c.num_chk * geo->num_lun;
+   int lun_off = ppa.m.pu * geo->c.num_chk;
+   int chk_off = ppa.m.chk;
+
+   return meta + ch_off + lun_off + chk_off;
+}
+
 void __pblk_map_invalidate(struct pblk *pblk, struct pblk_line *line,
   u64 paddr)
 {
@@ -1094,10 +1140,38 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
pblk_line *line,
return 1;
 }
 
+static int pblk_prepare_new_line(struct pblk *pblk, struct pblk_line *line)
+{
+   struct pblk_line_meta *lm = &pblk->lm;
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = &dev->geo;
+   int blk_to_erase = atomic_read(&line->blk_in_line);
+   int i;
+
+   for (i = 0; i < lm->blk_per_line; i++) {
+   int state = line->chks[i].state;
+   struct pblk_lun *rlun = &pblk->luns[i];
+
+   /* Free chunks should not be erased */
+   if (state & NVM_CHK_ST_FREE) {
+   set_bit(pblk_ppa_to_pos(geo, rlun->chunk_bppa),
+   line->erase_b

[PATCH 11/19] lightnvm: pblk: rename ppaf* to addrf*

2018-02-26 Thread Javier González

In preparation for 2.0 support in pblk, rename variables referring to
the address format to addrf and reserve ppaf for the 1.2 path.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-init.c  |  8 
 drivers/lightnvm/pblk-sysfs.c |  4 ++--
 drivers/lightnvm/pblk.h   | 16 
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 30aec863f136..ec39800eea42 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -80,7 +80,7 @@ static size_t pblk_trans_map_size(struct pblk *pblk)
 {
int entry_size = 8;
 
-   if (pblk->ppaf_bitsize < 32)
+   if (pblk->addrf_len < 32)
entry_size = 4;
 
return entry_size * pblk->rl.nr_secs;
@@ -198,7 +198,7 @@ static int pblk_set_addrf_12(struct nvm_geo *geo,
return dst->blk_offset + src->blk_len;
 }
 
-static int pblk_set_ppaf(struct pblk *pblk)
+static int pblk_set_addrf(struct pblk *pblk)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
@@ -210,7 +210,7 @@ static int pblk_set_ppaf(struct pblk *pblk)
return -EINVAL;
}
 
-   pblk->ppaf_bitsize = pblk_set_addrf_12(geo, (void *)&pblk->ppaf);
+   pblk->addrf_len = pblk_set_addrf_12(geo, (void *)&pblk->addrf);
 
return 0;
 }
@@ -319,7 +319,7 @@ static int pblk_core_init(struct pblk *pblk)
if (!pblk->r_end_wq)
goto free_bb_wq;
 
-   if (pblk_set_ppaf(pblk))
+   if (pblk_set_addrf(pblk))
goto free_r_end_wq;
 
if (pblk_rwb_init(pblk))
diff --git a/drivers/lightnvm/pblk-sysfs.c b/drivers/lightnvm/pblk-sysfs.c
index 81288aa9162a..d3b50741b691 100644
--- a/drivers/lightnvm/pblk-sysfs.c
+++ b/drivers/lightnvm/pblk-sysfs.c
@@ -117,12 +117,12 @@ static ssize_t pblk_sysfs_ppaf(struct pblk *pblk, char 
*page)
struct nvm_addr_format_12 *geo_ppaf;
ssize_t sz = 0;
 
-   ppaf = (struct nvm_addr_format_12 *)&pblk->ppaf;
+   ppaf = (struct nvm_addr_format_12 *)&pblk->addrf;
geo_ppaf = (struct nvm_addr_format_12 *)&geo->c.addrf;
 
sz = snprintf(page, PAGE_SIZE,

"pblk:(s:%d)ch:%d/%d,lun:%d/%d,blk:%d/%d,pg:%d/%d,pl:%d/%d,sec:%d/%d\n",
-   pblk->ppaf_bitsize,
+   pblk->addrf_len,
ppaf->ch_offset, ppaf->ch_len,
ppaf->lun_offset, ppaf->lun_len,
ppaf->blk_offset, ppaf->blk_len,
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 4f7a365436f1..46b29a492f74 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -570,8 +570,8 @@ struct pblk {
struct pblk_line_mgmt l_mg; /* Line management */
struct pblk_line_meta lm;   /* Line metadata */
 
-   struct nvm_addr_format ppaf;
-   int ppaf_bitsize;
+   struct nvm_addr_format addrf;
+   int addrf_len;
 
struct pblk_rb rwb;
 
@@ -948,7 +948,7 @@ static inline struct ppa_addr addr_to_gen_ppa(struct pblk 
*pblk, u64 paddr,
  u64 line_id)
 {
struct nvm_addr_format_12 *ppaf =
-   (struct nvm_addr_format_12 *)&pblk->ppaf;
+   (struct nvm_addr_format_12 *)&pblk->addrf;
struct ppa_addr ppa;
 
ppa.ppa = 0;
@@ -966,7 +966,7 @@ static inline u64 pblk_dev_ppa_to_line_addr(struct pblk 
*pblk,
struct ppa_addr p)
 {
struct nvm_addr_format_12 *ppaf =
-   (struct nvm_addr_format_12 *)&pblk->ppaf;
+   (struct nvm_addr_format_12 *)&pblk->addrf;
u64 paddr;
 
paddr = (u64)p.g.ch << ppaf->ch_offset;
@@ -991,7 +991,7 @@ static inline struct ppa_addr pblk_ppa32_to_ppa64(struct 
pblk *pblk, u32 ppa32)
ppa64.c.is_cached = 1;
} else {
struct nvm_addr_format_12 *ppaf =
-   (struct nvm_addr_format_12 *)&pblk->ppaf;
+   (struct nvm_addr_format_12 *)&pblk->addrf;
 
ppa64.g.ch = (ppa32 & ppaf->ch_mask) >> ppaf->ch_offset;
ppa64.g.lun = (ppa32 & ppaf->lun_mask) >> ppaf->lun_offset;
@@ -1015,7 +1015,7 @@ static inline u32 pblk_ppa64_to_ppa32(struct pblk *pblk, 
struct ppa_addr ppa64)
ppa32 |= 1U << 31;
} else {
struct nvm_addr_format_12 *ppaf =
-   (struct nvm_addr_format_12 *)&pblk->ppaf;
+   (struct nvm_addr_format_12 *)&pblk->addrf;
 
ppa32 |= ppa64.g.ch << ppaf->ch_offset;
ppa32 |= ppa64.g.lun << ppaf->lun_offset;
@@ -1033,7 +1033,7 @@ static inline struct ppa_addr pblk_trans_map_get(struct 
pblk *pblk,
 {
struct ppa_addr ppa;
 
-   if (pblk->ppaf_bitsi

[PATCH 08/19] lightnvm: rename number of channels and luns

2018-02-26 Thread Javier González

Normalize nomenclature for naming number of channels and number of luns
in order to improve readability. Use num_ch and num_lun.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c  | 69 ++--
 drivers/lightnvm/pblk-core.c |  4 +--
 drivers/lightnvm/pblk-init.c | 16 +-
 drivers/lightnvm/pblk.h  |  6 ++--
 drivers/nvme/host/lightnvm.c | 20 ++---
 include/linux/lightnvm.h |  4 +--
 6 files changed, 59 insertions(+), 60 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 96f4e62d383b..45bed54488f9 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -36,13 +36,13 @@ static DECLARE_RWSEM(nvm_lock);
 /* Map between virtual and physical channel and lun */
 struct nvm_ch_map {
int ch_off;
-   int nr_luns;
+   int num_lun;
int *lun_offs;
 };
 
 struct nvm_dev_map {
struct nvm_ch_map *chnls;
-   int nr_chnls;
+   int num_ch;
 };
 
 static struct nvm_target *nvm_find_target(struct nvm_dev *dev, const char 
*name)
@@ -115,15 +115,15 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev 
*tgt_dev, int clear)
struct nvm_dev_map *dev_map = tgt_dev->map;
int i, j;
 
-   for (i = 0; i < dev_map->nr_chnls; i++) {
+   for (i = 0; i < dev_map->num_ch; i++) {
struct nvm_ch_map *ch_map = &dev_map->chnls[i];
int *lun_offs = ch_map->lun_offs;
int ch = i + ch_map->ch_off;
 
if (clear) {
-   for (j = 0; j < ch_map->nr_luns; j++) {
+   for (j = 0; j < ch_map->num_lun; j++) {
int lun = j + lun_offs[j];
-   int lunid = (ch * dev_geo->nr_luns) + lun;
+   int lunid = (ch * dev_geo->num_lun) + lun;
 
WARN_ON(!test_and_clear_bit(lunid,
dev->lun_map));
@@ -149,47 +149,46 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
struct nvm_dev_map *dev_rmap = dev->rmap;
struct nvm_dev_map *dev_map;
struct ppa_addr *luns;
-   int nr_luns = lun_end - lun_begin + 1;
-   int luns_left = nr_luns;
-   int nr_chnls = nr_luns / dev_geo->nr_luns;
-   int nr_chnls_mod = nr_luns % dev_geo->nr_luns;
-   int bch = lun_begin / dev_geo->nr_luns;
-   int blun = lun_begin % dev_geo->nr_luns;
+   int num_lun = lun_end - lun_begin + 1;
+   int luns_left = num_lun;
+   int num_ch = num_lun / dev_geo->num_lun;
+   int num_ch_mod = num_lun % dev_geo->num_lun;
+   int bch = lun_begin / dev_geo->num_lun;
+   int blun = lun_begin % dev_geo->num_lun;
int lunid = 0;
int lun_balanced = 1;
-   int sec_per_lun, prev_nr_luns;
+   int sec_per_lun, prev_num_lun;
int i, j;
 
-   nr_chnls = (nr_chnls_mod == 0) ? nr_chnls : nr_chnls + 1;
+   num_ch = (num_ch_mod == 0) ? num_ch : num_ch + 1;
 
dev_map = kmalloc(sizeof(struct nvm_dev_map), GFP_KERNEL);
if (!dev_map)
goto err_dev;
 
-   dev_map->chnls = kcalloc(nr_chnls, sizeof(struct nvm_ch_map),
-   GFP_KERNEL);
+   dev_map->chnls = kcalloc(num_ch, sizeof(struct nvm_ch_map), GFP_KERNEL);
if (!dev_map->chnls)
goto err_chnls;
 
-   luns = kcalloc(nr_luns, sizeof(struct ppa_addr), GFP_KERNEL);
+   luns = kcalloc(num_lun, sizeof(struct ppa_addr), GFP_KERNEL);
if (!luns)
goto err_luns;
 
-   prev_nr_luns = (luns_left > dev_geo->nr_luns) ?
-   dev_geo->nr_luns : luns_left;
-   for (i = 0; i < nr_chnls; i++) {
+   prev_num_lun = (luns_left > dev_geo->num_lun) ?
+   dev_geo->num_lun : luns_left;
+   for (i = 0; i < num_ch; i++) {
struct nvm_ch_map *ch_rmap = &dev_rmap->chnls[i + bch];
int *lun_roffs = ch_rmap->lun_offs;
struct nvm_ch_map *ch_map = &dev_map->chnls[i];
int *lun_offs;
-   int luns_in_chnl = (luns_left > dev_geo->nr_luns) ?
-   dev_geo->nr_luns : luns_left;
+   int luns_in_chnl = (luns_left > dev_geo->num_lun) ?
+   dev_geo->num_lun : luns_left;
 
-   if (lun_balanced && prev_nr_luns != luns_in_chnl)
+   if (lun_balanced && prev_num_lun != luns_in_chnl)
lun_balanced = 0;
 
ch_map->ch_off = ch_rmap->ch_off = bch;
-   ch_map->nr_luns = luns_in_chnl;
+   ch_map->num_lun = luns_in_chnl;
 
lun_offs = kcalloc(luns_in_chnl, sizeof(int), GFP_KERNEL);
if (!lun_offs)
@@ -211,22 +210,22 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(s

[PATCH V3 00/19] lightnvm: pblk: implement 2.0 support

2018-02-26 Thread Javier González

# Changes since V2:

Apply Matias' feedback:
 - Remove generic nvm_id identify structure.
 - Do not remap capabilities (cap) to media and controlled capabilities
   (mccap). Instead, add a comment to prevent confusion when
   crosschecking with 2.0 spec.
 - Change maxoc and maxocpu defaults from 1 block to the max number of
   blocks.
 - Re-implement the generic geometry to use nvm_geo on both device and
   targets. Maintain nvm_common_geo to make it easier to copy the common
   part of the geometry (without having to overwrite target-specific
   fields, which is ugly and error prone). Matias, if you still want to
   get rid of this, we can do it.
 - Re-order patches with renaming to make them more meaningful. These
   belong to the series, since the name changes are motivated by 2.0
   inclusions. The only exception would be 36d10bfd3234, but I hope it
   is OK I include it here.

Also,
 - Eliminate a dependency between luns and lines in the init/exit
   refactoring.
 - Use the global address format when possible to avoid defaulting on
   the 1.2 path. This will safe headaches if the address format changes
   at some point.

I took out the patch allowing to do bit shifts on non power-of-2 media
formats on pblk's mapping since it requires touching many places that
are not 2.0 related. I'll submit this separately.

# Changes since V1:

Apply Matias' feedback:
 - Rebase on top of Matias' latest patches.
 - Use nvme_get_log_ext to submit report chunk and export it.
 - Re-write report chunk based on Matias' suggestions. Here, I
   maintained the lba interface, but it was necessary to redo the
   address formatting to match the chunk log page format. For pblk,
   this means a double address transformation, but it enables the
   standard path to use lbas, plus, this is not in the fast path.
 - Fold address format together with address transformations.
 - Split the generic geometry patch in different patches.
 - Remove refactoring of ligthnvm's core sysfs.

Feedback not applied:
 - Not letting pblk know about 1.2 and 2.0 bad block paths.
   Since the interfaces for get/set bad block and report chunk are so
   different, moving this logic to core adds assumptions on how the
   targets would want to get the data back. A way of doing this is
   creating a logical report chunk on the 1.2 path, but this would
   mean that values like the wear-index are invalid, which requires
   the target knowledge. I'm open to suggestions here.

Also:
 - Do some further renamings
 - Create a generic address format to make it explicit where we share
   1.2 and 2.0 fields to avoid address formatting in the fast path.
 - Add new fields to sysfs to complete spec and show major/minor
   versions (version and subversion to respect current interface).

Implement 2.0 support in pblk. This includes the address formatting and
mapping paths, as well as the sysfs entries for them.

Javier

Javier González (19):
  lightnvm: simplify geometry structure.
  lightnvm: add controller capabilities to 2.0
  lightnvm: add minor version to generic geometry
  lightnvm: add shorten OCSSD version in geo
  lightnvm: complete geo structure with maxoc*
  lightnvm: pblk: check for supported version
  lightnvm: complete 2.0 values in sysfs
  lightnvm: rename number of channels and luns
  lightnvm: rename sect_* to sec_*
  lightnvm: add support for 2.0 address format
  lightnvm: pblk: rename ppaf* to addrf*
  lightnvn: pblk: use generic address format
  lightnvm: make address conversions depend on generic device
  nvme: make nvme_get_log_ext available
  lightnvm: implement get log report chunk helpers
  lightnvm: define chunk states
  lightnvm: pblk: implement get log report chunk
  lightnvm: pblk: refactor init/exit sequences
  lightnvm: pblk: implement 2.0 support

 drivers/lightnvm/core.c  | 199 +-
 drivers/lightnvm/pblk-core.c | 151 +--
 drivers/lightnvm/pblk-gc.c   |   2 +-
 drivers/lightnvm/pblk-init.c | 833 +++
 drivers/lightnvm/pblk-map.c  |   4 +-
 drivers/lightnvm/pblk-read.c |   2 +-
 drivers/lightnvm/pblk-recovery.c |  14 +-
 drivers/lightnvm/pblk-rl.c   |   2 +-
 drivers/lightnvm/pblk-sysfs.c| 134 ++-
 drivers/lightnvm/pblk-write.c|   2 +-
 drivers/lightnvm/pblk.h  | 255 
 drivers/nvme/host/lightnvm.c | 458 ++---
 drivers/nvme/host/nvme.h |   3 +
 include/linux/lightnvm.h | 338 ++--
 14 files changed, 1559 insertions(+), 838 deletions(-)

-- 
2.7.4

[PATCH 12/19] lightnvn: pblk: use generic address format

2018-02-26 Thread Javier González

Use the generic address format on common address manipulations.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-core.c  | 10 +-
 drivers/lightnvm/pblk-map.c   |  4 ++--
 drivers/lightnvm/pblk-sysfs.c |  4 ++--
 drivers/lightnvm/pblk.h   |  4 ++--
 4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 7c726003a8d2..1d7c6313f3d9 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -885,7 +885,7 @@ int pblk_line_erase(struct pblk *pblk, struct pblk_line 
*line)
}
 
ppa = pblk->luns[bit].bppa; /* set ch and lun */
-   ppa.g.blk = line->id;
+   ppa.a.blk = line->id;
 
atomic_dec(&line->left_eblks);
WARN_ON(test_and_set_bit(bit, line->erase_bitmap));
@@ -1693,8 +1693,8 @@ static void __pblk_down_page(struct pblk *pblk, struct 
ppa_addr *ppa_list,
int i;
 
for (i = 1; i < nr_ppas; i++)
-   WARN_ON(ppa_list[0].g.lun != ppa_list[i].g.lun ||
-   ppa_list[0].g.ch != ppa_list[i].g.ch);
+   WARN_ON(ppa_list[0].a.lun != ppa_list[i].a.lun ||
+   ppa_list[0].a.ch != ppa_list[i].a.ch);
 #endif
 
ret = down_timeout(&rlun->wr_sem, msecs_to_jiffies(3));
@@ -1738,8 +1738,8 @@ void pblk_up_page(struct pblk *pblk, struct ppa_addr 
*ppa_list, int nr_ppas)
int i;
 
for (i = 1; i < nr_ppas; i++)
-   WARN_ON(ppa_list[0].g.lun != ppa_list[i].g.lun ||
-   ppa_list[0].g.ch != ppa_list[i].g.ch);
+   WARN_ON(ppa_list[0].a.lun != ppa_list[i].a.lun ||
+   ppa_list[0].a.ch != ppa_list[i].a.ch);
 #endif
 
rlun = &pblk->luns[pos];
diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c
index 04e08d76ea5f..20dbaa89c9df 100644
--- a/drivers/lightnvm/pblk-map.c
+++ b/drivers/lightnvm/pblk-map.c
@@ -127,7 +127,7 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq 
*rqd,
atomic_dec(&e_line->left_eblks);
 
*erase_ppa = rqd->ppa_list[i];
-   erase_ppa->g.blk = e_line->id;
+   erase_ppa->a.blk = e_line->id;
 
spin_unlock(&e_line->lock);
 
@@ -168,6 +168,6 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq 
*rqd,
set_bit(bit, e_line->erase_bitmap);
atomic_dec(&e_line->left_eblks);
*erase_ppa = pblk->luns[bit].bppa; /* set ch and lun */
-   erase_ppa->g.blk = e_line->id;
+   erase_ppa->a.blk = e_line->id;
}
 }
diff --git a/drivers/lightnvm/pblk-sysfs.c b/drivers/lightnvm/pblk-sysfs.c
index d3b50741b691..ccfb3abd2773 100644
--- a/drivers/lightnvm/pblk-sysfs.c
+++ b/drivers/lightnvm/pblk-sysfs.c
@@ -39,8 +39,8 @@ static ssize_t pblk_sysfs_luns_show(struct pblk *pblk, char 
*page)
sz += snprintf(page + sz, PAGE_SIZE - sz,
"pblk: pos:%d, ch:%d, lun:%d - %d\n",
i,
-   rlun->bppa.g.ch,
-   rlun->bppa.g.lun,
+   rlun->bppa.a.ch,
+   rlun->bppa.a.lun,
active);
}
 
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 46b29a492f74..6e1fcd1a538a 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -936,12 +936,12 @@ static inline int pblk_pad_distance(struct pblk *pblk)
 
 static inline int pblk_ppa_to_line(struct ppa_addr p)
 {
-   return p.g.blk;
+   return p.a.blk;
 }
 
 static inline int pblk_ppa_to_pos(struct nvm_geo *geo, struct ppa_addr p)
 {
-   return p.g.lun * geo->num_ch + p.g.ch;
+   return p.a.lun * geo->num_ch + p.a.ch;
 }
 
 static inline struct ppa_addr addr_to_gen_ppa(struct pblk *pblk, u64 paddr,
-- 
2.7.4

[PATCH 09/19] lightnvm: rename sect_* to sec_*

2018-02-26 Thread Javier González

Rename abbreviations for sector from sect_* to sec_* as most of the code
uses this format and it is confusing when using the different
structures.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-init.c  | 8 
 drivers/lightnvm/pblk-sysfs.c | 4 ++--
 drivers/lightnvm/pblk.h   | 8 
 drivers/nvme/host/lightnvm.c  | 8 
 include/linux/lightnvm.h  | 8 
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 7e3297be517d..30aec863f136 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -179,16 +179,16 @@ static int pblk_set_addrf_12(struct nvm_geo *geo,
dst->blk_len = src->blk_len;
dst->pg_len = src->pg_len;
dst->pln_len = src->pln_len;
-   dst->sect_len = src->sect_len;
+   dst->sec_len = src->sec_len;
 
-   dst->sect_offset = 0;
-   dst->pln_offset = dst->sect_len;
+   dst->sec_offset = 0;
+   dst->pln_offset = dst->sec_len;
dst->ch_offset = dst->pln_offset + dst->pln_len;
dst->lun_offset = dst->ch_offset + dst->ch_len;
dst->pg_offset = dst->lun_offset + dst->lun_len;
dst->blk_offset = dst->pg_offset + dst->pg_len;
 
-   dst->sec_mask = ((1ULL << dst->sect_len) - 1) << dst->sect_offset;
+   dst->sec_mask = ((1ULL << dst->sec_len) - 1) << dst->sec_offset;
dst->pln_mask = ((1ULL << dst->pln_len) - 1) << dst->pln_offset;
dst->ch_mask = ((1ULL << dst->ch_len) - 1) << dst->ch_offset;
dst->lun_mask = ((1ULL << dst->lun_len) - 1) << dst->lun_offset;
diff --git a/drivers/lightnvm/pblk-sysfs.c b/drivers/lightnvm/pblk-sysfs.c
index 5eb21a279361..81288aa9162a 100644
--- a/drivers/lightnvm/pblk-sysfs.c
+++ b/drivers/lightnvm/pblk-sysfs.c
@@ -128,7 +128,7 @@ static ssize_t pblk_sysfs_ppaf(struct pblk *pblk, char 
*page)
ppaf->blk_offset, ppaf->blk_len,
ppaf->pg_offset, ppaf->pg_len,
ppaf->pln_offset, ppaf->pln_len,
-   ppaf->sect_offset, ppaf->sect_len);
+   ppaf->sec_offset, ppaf->sec_len);
 
sz += snprintf(page + sz, PAGE_SIZE - sz,

"device:ch:%d/%d,lun:%d/%d,blk:%d/%d,pg:%d/%d,pl:%d/%d,sec:%d/%d\n",
@@ -137,7 +137,7 @@ static ssize_t pblk_sysfs_ppaf(struct pblk *pblk, char 
*page)
geo_ppaf->blk_offset, geo_ppaf->blk_len,
geo_ppaf->pg_offset, geo_ppaf->pg_len,
geo_ppaf->pln_offset, geo_ppaf->pln_len,
-   geo_ppaf->sect_offset, geo_ppaf->sect_len);
+   geo_ppaf->sec_offset, geo_ppaf->sec_len);
 
return sz;
 }
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 1f32284b0aec..4f7a365436f1 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -957,7 +957,7 @@ static inline struct ppa_addr addr_to_gen_ppa(struct pblk 
*pblk, u64 paddr,
ppa.g.lun = (paddr & ppaf->lun_mask) >> ppaf->lun_offset;
ppa.g.ch = (paddr & ppaf->ch_mask) >> ppaf->ch_offset;
ppa.g.pl = (paddr & ppaf->pln_mask) >> ppaf->pln_offset;
-   ppa.g.sec = (paddr & ppaf->sec_mask) >> ppaf->sect_offset;
+   ppa.g.sec = (paddr & ppaf->sec_mask) >> ppaf->sec_offset;
 
return ppa;
 }
@@ -973,7 +973,7 @@ static inline u64 pblk_dev_ppa_to_line_addr(struct pblk 
*pblk,
paddr |= (u64)p.g.lun << ppaf->lun_offset;
paddr |= (u64)p.g.pg << ppaf->pg_offset;
paddr |= (u64)p.g.pl << ppaf->pln_offset;
-   paddr |= (u64)p.g.sec << ppaf->sect_offset;
+   paddr |= (u64)p.g.sec << ppaf->sec_offset;
 
return paddr;
 }
@@ -998,7 +998,7 @@ static inline struct ppa_addr pblk_ppa32_to_ppa64(struct 
pblk *pblk, u32 ppa32)
ppa64.g.blk = (ppa32 & ppaf->blk_mask) >> ppaf->blk_offset;
ppa64.g.pg = (ppa32 & ppaf->pg_mask) >> ppaf->pg_offset;
ppa64.g.pl = (ppa32 & ppaf->pln_mask) >> ppaf->pln_offset;
-   ppa64.g.sec = (ppa32 & ppaf->sec_mask) >> ppaf->sect_offset;
+   ppa64.g.sec = (ppa32 & ppaf->sec_mask) >> ppaf->sec_offset;
}
 
return ppa64;
@@ -1022,7 +1022,7 @@ static inline u32 pblk_ppa64_to_ppa32(struct pblk *pblk, 
struct ppa_addr ppa64)
ppa32 |= ppa64.g.blk << ppaf->blk_offset;
ppa32 |= ppa64.g.pg << ppaf->pg_offset;
ppa32 |= ppa64.g.pl << ppaf->pln_offset;
-   ppa32 |= ppa64.g.sec << ppaf->sect_offset;
+   ppa32 |= ppa64.g.sec << ppaf->sec_offset;
}
 
return ppa32;
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index f3f524dc1389..377e6525e53f 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -262,21 +262,21 @@ static void nvme_nvm_set_addr_12(struct 
nvm_addr_format_12 *dst,
dst->blk_len = src->blk_len;
dst->pg_len =

[PATCH 10/19] lightnvm: add support for 2.0 address format

2018-02-26 Thread Javier González

Add support for 2.0 address format. Also, align address bits for 1.2 and
2.0 to be able to operate on channel and luns without requiring a format
conversion. Use a generic address format for this purpose.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c  |  20 -
 include/linux/lightnvm.h | 105 ++-
 2 files changed, 86 insertions(+), 39 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 45bed54488f9..92ea6e39dd64 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -196,8 +196,8 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
 
for (j = 0; j < luns_in_chnl; j++) {
luns[lunid].ppa = 0;
-   luns[lunid].g.ch = i;
-   luns[lunid++].g.lun = j;
+   luns[lunid].a.ch = i;
+   luns[lunid++].a.lun = j;
 
lun_offs[j] = blun;
lun_roffs[j + blun] = blun;
@@ -563,22 +563,22 @@ static void nvm_unregister_map(struct nvm_dev *dev)
 static void nvm_map_to_dev(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *p)
 {
struct nvm_dev_map *dev_map = tgt_dev->map;
-   struct nvm_ch_map *ch_map = &dev_map->chnls[p->g.ch];
-   int lun_off = ch_map->lun_offs[p->g.lun];
+   struct nvm_ch_map *ch_map = &dev_map->chnls[p->a.ch];
+   int lun_off = ch_map->lun_offs[p->a.lun];
 
-   p->g.ch += ch_map->ch_off;
-   p->g.lun += lun_off;
+   p->a.ch += ch_map->ch_off;
+   p->a.lun += lun_off;
 }
 
 static void nvm_map_to_tgt(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *p)
 {
struct nvm_dev *dev = tgt_dev->parent;
struct nvm_dev_map *dev_rmap = dev->rmap;
-   struct nvm_ch_map *ch_rmap = &dev_rmap->chnls[p->g.ch];
-   int lun_roff = ch_rmap->lun_offs[p->g.lun];
+   struct nvm_ch_map *ch_rmap = &dev_rmap->chnls[p->a.ch];
+   int lun_roff = ch_rmap->lun_offs[p->a.lun];
 
-   p->g.ch -= ch_rmap->ch_off;
-   p->g.lun -= lun_roff;
+   p->a.ch -= ch_rmap->ch_off;
+   p->a.lun -= lun_roff;
 }
 
 static void nvm_ppa_tgt_to_dev(struct nvm_tgt_dev *tgt_dev,
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 18bb9e9774d6..8cda1125c34a 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -16,12 +16,21 @@ enum {
NVM_IOTYPE_GC = 1,
 };
 
-#define NVM_BLK_BITS (16)
-#define NVM_PG_BITS  (16)
-#define NVM_SEC_BITS (8)
-#define NVM_PL_BITS  (8)
-#define NVM_LUN_BITS (8)
-#define NVM_CH_BITS  (7)
+/* common format */
+#define NVM_GEN_CH_BITS  (8)
+#define NVM_GEN_LUN_BITS (8)
+#define NVM_GEN_BLK_BITS (16)
+#define NVM_GEN_RESERVED (32)
+
+/* 1.2 format */
+#define NVM_12_PG_BITS  (16)
+#define NVM_12_PL_BITS  (4)
+#define NVM_12_SEC_BITS (4)
+#define NVM_12_RESERVED (8)
+
+/* 2.0 format */
+#define NVM_20_SEC_BITS (24)
+#define NVM_20_RESERVED (8)
 
 enum {
NVM_OCSSD_SPEC_12 = 12,
@@ -31,16 +40,34 @@ enum {
 struct ppa_addr {
/* Generic structure for all addresses */
union {
+   /* generic device format */
struct {
-   u64 blk : NVM_BLK_BITS;
-   u64 pg  : NVM_PG_BITS;
-   u64 sec : NVM_SEC_BITS;
-   u64 pl  : NVM_PL_BITS;
-   u64 lun : NVM_LUN_BITS;
-   u64 ch  : NVM_CH_BITS;
-   u64 reserved: 1;
+   u64 ch  : NVM_GEN_CH_BITS;
+   u64 lun : NVM_GEN_LUN_BITS;
+   u64 blk : NVM_GEN_BLK_BITS;
+   u64 reserved: NVM_GEN_RESERVED;
+   } a;
+
+   /* 1.2 device format */
+   struct {
+   u64 ch  : NVM_GEN_CH_BITS;
+   u64 lun : NVM_GEN_LUN_BITS;
+   u64 blk : NVM_GEN_BLK_BITS;
+   u64 pg  : NVM_12_PG_BITS;
+   u64 pl  : NVM_12_PL_BITS;
+   u64 sec : NVM_12_SEC_BITS;
+   u64 reserved: NVM_12_RESERVED;
} g;
 
+   /* 2.0 device format */
+   struct {
+   u64 grp : NVM_GEN_CH_BITS;
+   u64 pu  : NVM_GEN_LUN_BITS;
+   u64 chk : NVM_GEN_BLK_BITS;
+   u64 sec : NVM_20_SEC_BITS;
+   u64 reserved: NVM_20_RESERVED;
+   } m;
+
struct {
u64 line: 63;
u64 is_cached   : 1;
@@ -382,16 +409,26 @@ static inline struct ppa_addr generic_to_dev_addr(struct 
nvm_tgt_dev *tgt_dev,
  struct p

Re: [PATCH] lightnvm: centralize permission check for lightnvm ioctl

2018-02-26 Thread Matias Bjørling


On 02/26/2018 01:11 PM, Johannes Thumshirn wrote:

Currently all functions for handling the lightnvm core ioctl commands
do a check for CAP_SYS_ADMIN.

Change this to fail early in nvm_ctl_ioctl(), so we don't have to
duplicate the permission checks all over.

Signed-off-by: Johannes Thumshirn 
---
  drivers/lightnvm/core.c | 21 +++--
  1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index dcc9e621e651..74d0bda94c11 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -1040,9 +1040,6 @@ static long nvm_ioctl_info(struct file *file, void __user 
*arg)
struct nvm_tgt_type *tt;
int tgt_iter = 0;
  
-	if (!capable(CAP_SYS_ADMIN))

-   return -EPERM;
-
info = memdup_user(arg, sizeof(struct nvm_ioctl_info));
if (IS_ERR(info))
return -EFAULT;
@@ -1081,9 +1078,6 @@ static long nvm_ioctl_get_devices(struct file *file, void 
__user *arg)
struct nvm_dev *dev;
int i = 0;
  
-	if (!capable(CAP_SYS_ADMIN))

-   return -EPERM;
-
devices = kzalloc(sizeof(struct nvm_ioctl_get_devices), GFP_KERNEL);
if (!devices)
return -ENOMEM;
@@ -1124,9 +1118,6 @@ static long nvm_ioctl_dev_create(struct file *file, void 
__user *arg)
  {
struct nvm_ioctl_create create;
  
-	if (!capable(CAP_SYS_ADMIN))

-   return -EPERM;
-
if (copy_from_user(&create, arg, sizeof(struct nvm_ioctl_create)))
return -EFAULT;
  
@@ -1162,9 +1153,6 @@ static long nvm_ioctl_dev_remove(struct file *file, void __user *arg)

struct nvm_dev *dev;
int ret = 0;
  
-	if (!capable(CAP_SYS_ADMIN))

-   return -EPERM;
-
if (copy_from_user(&remove, arg, sizeof(struct nvm_ioctl_remove)))
return -EFAULT;
  
@@ -1189,9 +1177,6 @@ static long nvm_ioctl_dev_init(struct file *file, void __user *arg)

  {
struct nvm_ioctl_dev_init init;
  
-	if (!capable(CAP_SYS_ADMIN))

-   return -EPERM;
-
if (copy_from_user(&init, arg, sizeof(struct nvm_ioctl_dev_init)))
return -EFAULT;
  
@@ -1208,9 +1193,6 @@ static long nvm_ioctl_dev_factory(struct file *file, void __user *arg)

  {
struct nvm_ioctl_dev_factory fact;
  
-	if (!capable(CAP_SYS_ADMIN))

-   return -EPERM;
-
if (copy_from_user(&fact, arg, sizeof(struct nvm_ioctl_dev_factory)))
return -EFAULT;
  
@@ -1226,6 +1208,9 @@ static long nvm_ctl_ioctl(struct file *file, uint cmd, unsigned long arg)

  {
void __user *argp = (void __user *)arg;
  
+	if (!capable(CAP_SYS_ADMIN))

+   return -EPERM;
+
switch (cmd) {
case NVM_INFO:
return nvm_ioctl_info(file, argp);



Thanks Johannes. Applied.

Re: v4.16-rc2: virtio-block + ext4 lockdep splats / sleeping from invalid context

2018-02-26 Thread Jan Kara

On Mon 26-02-18 11:38:19, Mark Rutland wrote:
> On Mon, Feb 26, 2018 at 11:52:56AM +0100, Jan Kara wrote:
> > On Fri 23-02-18 15:47:36, Mark Rutland wrote:
> > > Hi all,
> > > 
> > > While fuzzing arm64/v4.16-rc2 with syzkaller, I simultaneously hit a
> > > number of splats in the block layer:
> > > 
> > > * inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage in
> > >   jbd2_trans_will_send_data_barrier
> > > 
> > > * BUG: sleeping function called from invalid context at mm/mempool.c:320
> > > 
> > > * WARNING: CPU: 0 PID: 0 at block/blk.h:297 
> > > generic_make_request_checks+0x670/0x750
> > > 
> > > ... I've included the full splats at the end of the mail.
> > > 
> > > These all happen in the context of the virtio block IRQ handler, so I
> > > wonder if this calls something that doesn't expect to be called from IRQ
> > > context. Is it valid to call blk_mq_complete_request() or
> > > blk_mq_end_request() from an IRQ handler?
> > 
> > No, it's likely a bug in detection whether IO completion should be deferred
> > to a workqueue or not. Does attached patch fix the problem? I don't see
> > exactly this being triggered by the syzkaller but it's close enough :)
> > 
> > Honza
> 
> That seems to be it!
> 
> With the below patch applied, I can't trigger the bug after ~10 minutes,
> whereas prior to the patch I can trigger it in ~10 seconds. I'll leave
> that running for a while just in case there's another part to the
> problem, but FWIW:
> 
> Tested-by: Mark Rutland 

Thanks for testing! Sent the patch to Jens for inclusion.

Honza
-- 
Jan Kara 
SUSE Labs, CR

[PATCH] lightnvm: centralize permission check for lightnvm ioctl

2018-02-26 Thread Johannes Thumshirn

Currently all functions for handling the lightnvm core ioctl commands
do a check for CAP_SYS_ADMIN.

Change this to fail early in nvm_ctl_ioctl(), so we don't have to
duplicate the permission checks all over.

Signed-off-by: Johannes Thumshirn 
---
 drivers/lightnvm/core.c | 21 +++--
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index dcc9e621e651..74d0bda94c11 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -1040,9 +1040,6 @@ static long nvm_ioctl_info(struct file *file, void __user 
*arg)
struct nvm_tgt_type *tt;
int tgt_iter = 0;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
info = memdup_user(arg, sizeof(struct nvm_ioctl_info));
if (IS_ERR(info))
return -EFAULT;
@@ -1081,9 +1078,6 @@ static long nvm_ioctl_get_devices(struct file *file, void 
__user *arg)
struct nvm_dev *dev;
int i = 0;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
devices = kzalloc(sizeof(struct nvm_ioctl_get_devices), GFP_KERNEL);
if (!devices)
return -ENOMEM;
@@ -1124,9 +1118,6 @@ static long nvm_ioctl_dev_create(struct file *file, void 
__user *arg)
 {
struct nvm_ioctl_create create;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
if (copy_from_user(&create, arg, sizeof(struct nvm_ioctl_create)))
return -EFAULT;
 
@@ -1162,9 +1153,6 @@ static long nvm_ioctl_dev_remove(struct file *file, void 
__user *arg)
struct nvm_dev *dev;
int ret = 0;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
if (copy_from_user(&remove, arg, sizeof(struct nvm_ioctl_remove)))
return -EFAULT;
 
@@ -1189,9 +1177,6 @@ static long nvm_ioctl_dev_init(struct file *file, void 
__user *arg)
 {
struct nvm_ioctl_dev_init init;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
if (copy_from_user(&init, arg, sizeof(struct nvm_ioctl_dev_init)))
return -EFAULT;
 
@@ -1208,9 +1193,6 @@ static long nvm_ioctl_dev_factory(struct file *file, void 
__user *arg)
 {
struct nvm_ioctl_dev_factory fact;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
if (copy_from_user(&fact, arg, sizeof(struct nvm_ioctl_dev_factory)))
return -EFAULT;
 
@@ -1226,6 +1208,9 @@ static long nvm_ctl_ioctl(struct file *file, uint cmd, 
unsigned long arg)
 {
void __user *argp = (void __user *)arg;
 
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
switch (cmd) {
case NVM_INFO:
return nvm_ioctl_info(file, argp);
-- 
2.13.6

[PATCH 2/4] block: bio_check_eod() needs to consider partition

2018-02-26 Thread Jiufei Xue

bio_check_eod() should get the capacity of partiton, not the whole
disk.

Signed-off-by: Jiufei Xue 
---
 block/blk-core.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d82c4f..ef6677d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2023,7 +2023,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
return BLK_QC_T_NONE;
 }
 
-static void handle_bad_sector(struct bio *bio)
+static void handle_bad_sector(struct bio *bio, sector_t maxsector)
 {
char b[BDEVNAME_SIZE];
 
@@ -2031,7 +2031,7 @@ static void handle_bad_sector(struct bio *bio)
printk(KERN_INFO "%s: rw=%d, want=%Lu, limit=%Lu\n",
bio_devname(bio, b), bio->bi_opf,
(unsigned long long)bio_end_sector(bio),
-   (long long)get_capacity(bio->bi_disk));
+   (long long)maxsector);
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
@@ -2131,12 +2131,20 @@ static inline int blk_partition_remap(struct bio *bio)
 static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
 {
sector_t maxsector;
+   struct hd_struct *part;
 
if (!nr_sectors)
return 0;
 
/* Test device or partition size, when known. */
-   maxsector = get_capacity(bio->bi_disk);
+   rcu_read_lock();
+   part = __disk_get_part(bio->bi_disk, bio->bi_partno);
+   if (part)
+   maxsector = part_nr_sects_read(part);
+   else
+   maxsector = get_capacity(bio->bi_disk);
+   rcu_read_unlock();
+
if (maxsector) {
sector_t sector = bio->bi_iter.bi_sector;
 
@@ -2146,7 +2154,7 @@ static inline int bio_check_eod(struct bio *bio, unsigned 
int nr_sectors)
 * without checking the size of the device, e.g., when
 * mounting a device.
 */
-   handle_bad_sector(bio);
+   handle_bad_sector(bio, maxsector);
return 1;
}
}
-- 
1.9.4

[PATCH 1/4] block: fix the count of PGPGOUT for WRITE_SAME

2018-02-26 Thread Jiufei Xue

The vm counters is counted in sectors, so we should do the conversation
in submit_bio.

Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and
partitions index")

Signed-off-by: Jiufei Xue 
---
 block/blk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2d1a7bb..6d82c4f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2434,7 +2434,7 @@ blk_qc_t submit_bio(struct bio *bio)
unsigned int count;
 
if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
-   count = queue_logical_block_size(bio->bi_disk->queue);
+   count = queue_logical_block_size(bio->bi_disk->queue) 
>> 9;
else
count = bio_sectors(bio);
 
-- 
1.9.4

[PATCH 4/4] block: fix a typo

2018-02-26 Thread Jiufei Xue

Signed-off-by: Jiufei Xue 
---
 drivers/block/pktcdvd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 531a091..c61d20c 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1122,7 +1122,7 @@ static int pkt_start_recovery(struct packet_data *pkt)
pkt->sector = new_sector;
 
bio_reset(pkt->bio);
-   bio_set_set(pkt->bio, pd->bdev);
+   bio_set_dev(pkt->bio, pd->bdev);
bio_set_op_attrs(pkt->bio, REQ_OP_WRITE, 0);
pkt->bio->bi_iter.bi_sector = new_sector;
pkt->bio->bi_iter.bi_size = pkt->frames * CD_FRAMESIZE;
-- 
1.9.4

[PATCH 3/4] block: display the correct diskname for bio

2018-02-26 Thread Jiufei Xue

bio_devname use __bdevname to display the device name, and can
only show the major and minor of the part0.
Fix this by using disk_name to display the correct name.

Signed-off-by: Jiufei Xue 
---
 block/partition-generic.c | 6 ++
 include/linux/bio.h   | 4 +---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 91622db..08dabcd 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -51,6 +51,12 @@ const char *bdevname(struct block_device *bdev, char *buf)
 
 EXPORT_SYMBOL(bdevname);
 
+const char *bio_devname(struct bio *bio, char *buf)
+{
+   return disk_name(bio->bi_disk, bio->bi_partno, buf);
+}
+EXPORT_SYMBOL(bio_devname);
+
 /*
  * There's very little reason to use this, you should really
  * have a struct block_device just about everywhere and use
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d0eb659..ce547a2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -511,6 +511,7 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,
 extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
 extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
 extern unsigned int bvec_nr_vecs(unsigned short idx);
+extern const char *bio_devname(struct bio *bio, char *buffer);
 
 #define bio_set_dev(bio, bdev) \
 do {   \
@@ -529,9 +530,6 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,
 #define bio_dev(bio) \
disk_devt((bio)->bi_disk)
 
-#define bio_devname(bio, buf) \
-   __bdevname(bio_dev(bio), (buf))
-
 #ifdef CONFIG_BLK_CGROUP
 int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state 
*blkcg_css);
 void bio_disassociate_task(struct bio *bio);
-- 
1.9.4

[PATCH 0/4] fix a few problems in block layer

2018-02-26 Thread Jiufei Xue

I have found a few problems while reviewing the patch 74d46992e0d9
("block: replace bi_bdev with a gendisk pointer and partitions index"),
So fix them.

Jiufei Xue (4)
block: fix the count of PGPGOUT for WRITE_SAME
block: bio_check_eod() needs to consider partition
block: display the correct disk name for bio
block: fix a typo and modify the documentation for bio

block/blk-core.c  |   18 +-
block/partition-generic.c |6 --
drivers/block/pktcdvd.c   |2 +-
include/linux/bio.h   |4 +++-
4 files changed, 9 insertions(+), 21 deletions(-)

[PATCH 4/6] genhd: Fix use after free in __blkdev_get()

2018-02-26 Thread Jan Kara

When two blkdev_open() calls race with device removal and recreation,
__blkdev_get() can use looked up gendisk after it is freed:

CPU0CPU1CPU2
del_gendisk(disk);
  
bdev_unhash_inode(inode);
blkdev_open()   blkdev_open()
  bdev = bd_acquire(inode);
- creates and returns new inode
  bdev = bd_acquire(inode);
- returns the same inode
  __blkdev_get(devt)  __blkdev_get(devt)
disk = get_gendisk(devt);
  - got structure of device going away


  disk = get_gendisk(devt);
- got new device structure
  if (!bdev->bd_openers) {
does the first open
  }
if (!bdev->bd_openers)
  - false
} else {
  put_disk_and_module(disk)
- remember this was old device - this was last ref and disk is
  now freed
}
disk_unblock_events(disk); -> oops

Fix the problem by making sure we drop reference to disk in
__blkdev_get() only after we are really done with it.

Reported-by: Hou Tao 
Tested-by: Hou Tao 
Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1dbbf847911a..fe41a76769fa 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1409,6 +1409,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
int ret;
int partno;
int perm = 0;
+   bool first_open = false;
 
if (mode & FMODE_READ)
perm |= MAY_READ;
@@ -1435,6 +1436,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
disk_block_events(disk);
mutex_lock_nested(&bdev->bd_mutex, for_part);
if (!bdev->bd_openers) {
+   first_open = true;
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
@@ -1520,14 +1522,15 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
if (ret)
goto out_unlock_bdev;
}
-   /* only one opener holds refs to the module and disk */
-   put_disk_and_module(disk);
}
bdev->bd_openers++;
if (for_part)
bdev->bd_part_count++;
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
+   /* only one opener holds refs to the module and disk */
+   if (!first_open)
+   put_disk_and_module(disk);
return 0;
 
  out_clear:
-- 
2.13.6

[PATCH 2/6] genhd: Rename get_disk() to get_disk_and_module()

2018-02-26 Thread Jan Kara

Rename get_disk() to get_disk_and_module() to make sure what the
function does. It's not a great name but at least it is now clear that
put_disk() is not it's counterpart.

Signed-off-by: Jan Kara 
---
 block/genhd.c   | 10 --
 drivers/block/amiflop.c |  2 +-
 drivers/block/ataflop.c |  2 +-
 drivers/block/brd.c |  2 +-
 drivers/block/floppy.c  |  2 +-
 drivers/block/loop.c|  2 +-
 drivers/block/swim.c|  2 +-
 drivers/block/z2ram.c   |  2 +-
 drivers/ide/ide-probe.c |  2 +-
 include/linux/genhd.h   |  2 +-
 10 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 5098bffe6ba6..21b2843b27d0 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -547,7 +547,7 @@ static int exact_lock(dev_t devt, void *data)
 {
struct gendisk *p = data;
 
-   if (!get_disk(p))
+   if (!get_disk_and_module(p))
return -1;
return 0;
 }
@@ -809,7 +809,7 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
 
spin_lock_bh(&ext_devt_lock);
part = idr_find(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
-   if (part && get_disk(part_to_disk(part))) {
+   if (part && get_disk_and_module(part_to_disk(part))) {
*partno = part->partno;
disk = part_to_disk(part);
}
@@ -1456,7 +1456,7 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 }
 EXPORT_SYMBOL(__alloc_disk_node);
 
-struct kobject *get_disk(struct gendisk *disk)
+struct kobject *get_disk_and_module(struct gendisk *disk)
 {
struct module *owner;
struct kobject *kobj;
@@ -1474,15 +1474,13 @@ struct kobject *get_disk(struct gendisk *disk)
return kobj;
 
 }
-
-EXPORT_SYMBOL(get_disk);
+EXPORT_SYMBOL(get_disk_and_module);
 
 void put_disk(struct gendisk *disk)
 {
if (disk)
kobject_put(&disk_to_dev(disk)->kobj);
 }
-
 EXPORT_SYMBOL(put_disk);
 
 static void set_disk_ro_uevent(struct gendisk *gd, int ro)
diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index e5aa62fcf5a8..3aaf6af3ec23 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -1758,7 +1758,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (unit[drive].type->code == FD_NODRIVE)
return NULL;
*part = 0;
-   return get_disk(unit[drive].gendisk);
+   return get_disk_and_module(unit[drive].gendisk);
 }
 
 static int __init amiga_floppy_probe(struct platform_device *pdev)
diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index 8bc3b9fd8dd2..dfb2c2622e5a 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -1917,7 +1917,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (drive >= FD_MAX_UNITS || type > NUM_DISK_MINORS)
return NULL;
*part = 0;
-   return get_disk(unit[drive].disk);
+   return get_disk_and_module(unit[drive].disk);
 }
 
 static int __init atari_floppy_init (void)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 8028a3a7e7fd..deea78e485da 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -456,7 +456,7 @@ static struct kobject *brd_probe(dev_t dev, int *part, void 
*data)
 
mutex_lock(&brd_devices_mutex);
brd = brd_init_one(MINOR(dev) / max_part, &new);
-   kobj = brd ? get_disk(brd->brd_disk) : NULL;
+   kobj = brd ? get_disk_and_module(brd->brd_disk) : NULL;
mutex_unlock(&brd_devices_mutex);
 
if (new)
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index eae484acfbbc..8ec7235fc93b 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4505,7 +4505,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (((*part >> 2) & 0x1f) >= ARRAY_SIZE(floppy_type))
return NULL;
*part = 0;
-   return get_disk(disks[drive]);
+   return get_disk_and_module(disks[drive]);
 }
 
 static int __init do_floppy_init(void)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d5fe720cf149..87855b5123a6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1922,7 +1922,7 @@ static struct kobject *loop_probe(dev_t dev, int *part, 
void *data)
if (err < 0)
kobj = NULL;
else
-   kobj = get_disk(lo->lo_disk);
+   kobj = get_disk_and_module(lo->lo_disk);
mutex_unlock(&loop_index_mutex);
 
*part = 0;
diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 84434d3ea19b..64e066eba72e 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -799,7 +799,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
return NULL;
 
*part = 0;
-   return get_disk(swd->unit[drive].disk);
+   return get_disk_and_module(swd->unit[drive].disk);
 }
 
 static int swim_add_floppy(struct swi

[PATCH 1/6] genhd: Fix leaked module reference for NVME devices

2018-02-26 Thread Jan Kara

Commit 8ddcd653257c "block: introduce GENHD_FL_HIDDEN" added handling of
hidden devices to get_gendisk() but forgot to drop module reference
which is also acquired by get_disk(). Drop the reference as necessary.

Arguably the function naming here is misleading as put_disk() is *not*
the counterpart of get_disk() but let's fix that in the follow up
commit since that will be more intrusive.

Fixes: 8ddcd653257c18a669fcb75ee42c37054908e0d6
CC: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 block/genhd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 88a53c188cb7..5098bffe6ba6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -817,7 +817,10 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
}
 
if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+   struct module *owner = disk->fops->owner;
+
put_disk(disk);
+   module_put(owner);
disk = NULL;
}
return disk;
-- 
2.13.6

[PATCH 6/6] blockdev: Avoid two active bdev inodes for one device

2018-02-26 Thread Jan Kara

When blkdev_open() races with device removal and creation it can happen
that unhashed bdev inode gets associated with newly created gendisk
like:

CPU0CPU1
blkdev_open()
  bdev = bd_acquire()
del_gendisk()
  bdev_unhash_inode(bdev);
remove device
create new device with the same number
  __blkdev_get()
disk = get_gendisk()
  - gets reference to gendisk of the new device

Now another blkdev_open() will not find original 'bdev' as it got
unhashed, create a new one and associate it with the same 'disk' at
which point problems start as we have two independent page caches for
one device.

Fix the problem by verifying that the bdev inode didn't get unhashed
before we acquired gendisk reference. That way we make sure gendisk can
get associated only with visible bdev inodes.

Tested-by: Hou Tao 
Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index fe41a76769fa..fe09ef9c21f3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1058,6 +1058,27 @@ static int bd_prepare_to_claim(struct block_device *bdev,
return 0;
 }
 
+static struct gendisk *bdev_get_gendisk(struct block_device *bdev, int *partno)
+{
+   struct gendisk *disk = get_gendisk(bdev->bd_dev, partno);
+
+   if (!disk)
+   return NULL;
+   /*
+* Now that we hold gendisk reference we make sure bdev we looked up is
+* not stale. If it is, it means device got removed and created before
+* we looked up gendisk and we fail open in such case. Associating
+* unhashed bdev with newly created gendisk could lead to two bdevs
+* (and thus two independent caches) being associated with one device
+* which is bad.
+*/
+   if (inode_unhashed(bdev->bd_inode)) {
+   put_disk_and_module(disk);
+   return NULL;
+   }
+   return disk;
+}
+
 /**
  * bd_start_claiming - start claiming a block device
  * @bdev: block device of interest
@@ -1094,7 +1115,7 @@ static struct block_device *bd_start_claiming(struct 
block_device *bdev,
 * @bdev might not have been initialized properly yet, look up
 * and grab the outer block device the hard way.
 */
-   disk = get_gendisk(bdev->bd_dev, &partno);
+   disk = bdev_get_gendisk(bdev, &partno);
if (!disk)
return ERR_PTR(-ENXIO);
 
@@ -1429,7 +1450,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
  restart:
 
ret = -ENXIO;
-   disk = get_gendisk(bdev->bd_dev, &partno);
+   disk = bdev_get_gendisk(bdev, &partno);
if (!disk)
goto out;
 
-- 
2.13.6

[PATCH 0/6 v2] block: Fix races in bdev - gendisk handling

2018-02-26 Thread Jan Kara

Hello,

these patches fix races happening when devices are frequently destroyed and
recreated in association of block device inode with corresponding gendisk.
Generally when such race happen it results in use-after-free issues, block
device page cache inconsistencies, or other problems. I have verified these
patches fix use-after-free issues that could be reproduced by frequent creation
and destruction of loop device. Hou Tao has verified that races reported by
him in [1] related to gendisk-blkdev association were also fixed. Jens, can
you please merge these patches? Thanks!

Changes since v1:
* Added tested-by tags

Honza

[1] https://www.spinics.net/lists/linux-block/msg20015.html

[PATCH 5/6] genhd: Fix BUG in blkdev_open()

2018-02-26 Thread Jan Kara

When two blkdev_open() calls for a partition race with device removal
and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
blkdev_open(). The race can happen as follows:

CPU0CPU1CPU2
del_gendisk()
  
bdev_unhash_inode(part1);

blkdev_open(part1, O_EXCL)  blkdev_open(part1, O_EXCL)
  bdev = bd_acquire() bdev = bd_acquire()
  blkdev_get(bdev)
bd_start_claiming(bdev)
  - finds old inode 'whole'
  bd_prepare_to_claim() -> 0
  
bdev_unhash_inode(whole);


  blkdev_get(bdev);
bd_start_claiming(bdev)
  - finds new inode 'whole'
  bd_prepare_to_claim()
- this also succeeds as we have
  different 'whole' here...
- bad things happen now as we
  have two exclusive openers of
  the same bdev

The problem here is that block device opens can see various intermediate
states while gendisk is shutting down and then being recreated.

We fix the problem by introducing new lookup_sem in gendisk that
synchronizes gendisk deletion with get_gendisk() and furthermore by
making sure that get_gendisk() does not return gendisk that is being (or
has been) deleted. This makes sure that once we ever manage to look up
newly created bdev inode, we are also guaranteed that following
get_gendisk() will either return failure (and we fail open) or it
returns gendisk for the new device and following bdget_disk() will
return new bdev inode (i.e., blkdev_open() follows the path as if it is
completely run after new device is created).

Reported-and-analyzed-by: Hou Tao 
Tested-by: Hou Tao 
Signed-off-by: Jan Kara 
---
 block/genhd.c | 21 -
 include/linux/genhd.h |  1 +
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 4c0590434591..9656f9e9f99e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -717,6 +717,11 @@ void del_gendisk(struct gendisk *disk)
blk_integrity_del(disk);
disk_del_events(disk);
 
+   /*
+* Block lookups of the disk until all bdevs are unhashed and the
+* disk is marked as dead (GENHD_FL_UP cleared).
+*/
+   down_write(&disk->lookup_sem);
/* invalidate stuff */
disk_part_iter_init(&piter, disk,
 DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
@@ -731,6 +736,7 @@ void del_gendisk(struct gendisk *disk)
bdev_unhash_inode(disk_devt(disk));
set_capacity(disk, 0);
disk->flags &= ~GENHD_FL_UP;
+   up_write(&disk->lookup_sem);
 
if (!(disk->flags & GENHD_FL_HIDDEN))
sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
@@ -816,9 +822,21 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
spin_unlock_bh(&ext_devt_lock);
}
 
-   if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+   if (!disk)
+   return NULL;
+
+   /*
+* Synchronize with del_gendisk() to not return disk that is being
+* destroyed.
+*/
+   down_read(&disk->lookup_sem);
+   if (unlikely((disk->flags & GENHD_FL_HIDDEN) ||
+!(disk->flags & GENHD_FL_UP))) {
+   up_read(&disk->lookup_sem);
put_disk_and_module(disk);
disk = NULL;
+   } else {
+   up_read(&disk->lookup_sem);
}
return disk;
 }
@@ -1418,6 +1436,7 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
kfree(disk);
return NULL;
}
+   init_rwsem(&disk->lookup_sem);
disk->node_id = node_id;
if (disk_expand_part_tbl(disk, 0)) {
free_part_stats(&disk->part0);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 7f5906fe1b70..c826b0b5232a 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -198,6 +198,7 @@ struct gendisk {
void *private_data;
 
int flags;
+   struct rw_semaphore lookup_sem;
struct kobject *slave_dir;
 
struct timer_rand_state *random;
-- 
2.13.6

[PATCH 3/6] genhd: Add helper put_disk_and_module()

2018-02-26 Thread Jan Kara

Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is opencoded in several places.

Signed-off-by: Jan Kara 
---
 block/blk-cgroup.c| 11 ++-
 block/genhd.c | 20 
 fs/block_dev.c| 19 +--
 include/linux/genhd.h |  1 +
 4 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4117524ca45b..c2033a232a44 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -812,7 +812,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
blkcg_policy *pol,
struct gendisk *disk;
struct request_queue *q;
struct blkcg_gq *blkg;
-   struct module *owner;
unsigned int major, minor;
int key_len, part, ret;
char *body;
@@ -904,9 +903,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
blkcg_policy *pol,
spin_unlock_irq(q->queue_lock);
rcu_read_unlock();
 fail:
-   owner = disk->fops->owner;
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
/*
 * If queue was bypassing, we should retry.  Do so after a
 * short msleep().  It isn't strictly necessary but queue
@@ -931,13 +928,9 @@ EXPORT_SYMBOL_GPL(blkg_conf_prep);
 void blkg_conf_finish(struct blkg_conf_ctx *ctx)
__releases(ctx->disk->queue->queue_lock) __releases(rcu)
 {
-   struct module *owner;
-
spin_unlock_irq(ctx->disk->queue->queue_lock);
rcu_read_unlock();
-   owner = ctx->disk->fops->owner;
-   put_disk(ctx->disk);
-   module_put(owner);
+   put_disk_and_module(ctx->disk);
 }
 EXPORT_SYMBOL_GPL(blkg_conf_finish);
 
diff --git a/block/genhd.c b/block/genhd.c
index 21b2843b27d0..4c0590434591 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -817,10 +817,7 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
}
 
if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
-   struct module *owner = disk->fops->owner;
-
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
disk = NULL;
}
return disk;
@@ -1483,6 +1480,21 @@ void put_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(put_disk);
 
+/*
+ * This is a counterpart of get_disk_and_module() and thus also of
+ * get_gendisk().
+ */
+void put_disk_and_module(struct gendisk *disk)
+{
+   if (disk) {
+   struct module *owner = disk->fops->owner;
+
+   put_disk(disk);
+   module_put(owner);
+   }
+}
+EXPORT_SYMBOL(put_disk_and_module);
+
 static void set_disk_ro_uevent(struct gendisk *gd, int ro)
 {
char event[] = "DISK_RO=1";
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..1dbbf847911a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -,8 +,7 @@ static struct block_device *bd_start_claiming(struct 
block_device *bdev,
else
whole = bdgrab(bdev);
 
-   module_put(disk->fops->owner);
-   put_disk(disk);
+   put_disk_and_module(disk);
if (!whole)
return ERR_PTR(-ENOMEM);
 
@@ -1407,7 +1406,6 @@ static void __blkdev_put(struct block_device *bdev, 
fmode_t mode, int for_part);
 static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 {
struct gendisk *disk;
-   struct module *owner;
int ret;
int partno;
int perm = 0;
@@ -1433,7 +1431,6 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
disk = get_gendisk(bdev->bd_dev, &partno);
if (!disk)
goto out;
-   owner = disk->fops->owner;
 
disk_block_events(disk);
mutex_lock_nested(&bdev->bd_mutex, for_part);
@@ -1463,8 +1460,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
bdev->bd_queue = NULL;
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
goto restart;
}
}
@@ -1525,8 +1521,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
goto out_unlock_bdev;
}
/* only one opener holds refs to the module and disk */
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
}
bdev->bd_openers++;
if (for_part)
@@ -1546,8 +1541,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
  out_unlock_bdev:
mutex_unlock(&bdev->bd_mutex);

Re: [PATCH] block: copy ioprio in __bio_clone_fast()

2018-02-26 Thread Hannes Reinecke

On 02/26/2018 11:55 AM, Hannes Reinecke wrote:
> We need to copy the io priority, too; otherwise the clone will run
> with a different priority than the original one.
> 
> Fixes: 43b62ce3ff0a ("block: move bio io prio to a new field")
> Signed-off-by: Hannes Reinecke 
> ---
>  block/bio.c  |   1 +
>  drivers/vhost/npiv.c | 379 
> +--
>  2 files changed, 37 insertions(+), 343 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index e1708db48258..e079911c640f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -605,6 +605,7 @@ void __bio_clone_fast(struct bio *bio, struct bio 
> *bio_src)
>   bio->bi_write_hint = bio_src->bi_write_hint;
>   bio->bi_iter = bio_src->bi_iter;
>   bio->bi_io_vec = bio_src->bi_io_vec;
> + bio->bi_ioprio = bio_src->bi_ioprio;
>  
>   bio_clone_blkcg_association(bio, bio_src);
>  }
> diff --git a/drivers/vhost/npiv.c b/drivers/vhost/npiv.c
> index 20e2a66e332d..3527996aab3f 100644
> --- a/drivers/vhost/npiv.c
> +++ b/drivers/vhost/npiv.c
> @@ -13,7 +13,7 @@

Gnaa.
Please ignore.

Will be sending the correct patch.

-- 
Dr. Hannes ReineckezSeries & Storage
h...@suse.com  +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: v4.16-rc2: virtio-block + ext4 lockdep splats / sleeping from invalid context

2018-02-26 Thread Mark Rutland

On Mon, Feb 26, 2018 at 11:52:56AM +0100, Jan Kara wrote:
> On Fri 23-02-18 15:47:36, Mark Rutland wrote:
> > Hi all,
> > 
> > While fuzzing arm64/v4.16-rc2 with syzkaller, I simultaneously hit a
> > number of splats in the block layer:
> > 
> > * inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage in
> >   jbd2_trans_will_send_data_barrier
> > 
> > * BUG: sleeping function called from invalid context at mm/mempool.c:320
> > 
> > * WARNING: CPU: 0 PID: 0 at block/blk.h:297 
> > generic_make_request_checks+0x670/0x750
> > 
> > ... I've included the full splats at the end of the mail.
> > 
> > These all happen in the context of the virtio block IRQ handler, so I
> > wonder if this calls something that doesn't expect to be called from IRQ
> > context. Is it valid to call blk_mq_complete_request() or
> > blk_mq_end_request() from an IRQ handler?
> 
> No, it's likely a bug in detection whether IO completion should be deferred
> to a workqueue or not. Does attached patch fix the problem? I don't see
> exactly this being triggered by the syzkaller but it's close enough :)
> 
>   Honza

That seems to be it!

With the below patch applied, I can't trigger the bug after ~10 minutes,
whereas prior to the patch I can trigger it in ~10 seconds. I'll leave
that running for a while just in case there's another part to the
problem, but FWIW:

Tested-by: Mark Rutland 

Thanks,
Mark.

> From 501d97ed88f5020a55a0de4d546df5ad11461cea Mon Sep 17 00:00:00 2001
> From: Jan Kara 
> Date: Mon, 26 Feb 2018 11:36:52 +0100
> Subject: [PATCH] direct-io: Fix sleep in atomic due to sync AIO
> 
> Commit e864f39569f4 "fs: add RWF_DSYNC aand RWF_SYNC" added additional
> way for direct IO to become synchronous and thus trigger fsync from the
> IO completion handler. Then commit 9830f4be159b "fs: Use RWF_* flags for
> AIO operations" allowed these flags to be set for AIO as well. However
> that commit forgot to update the condition checking whether the IO
> completion handling should be defered to a workqueue and thus AIO DIO
> with RWF_[D]SYNC set will call fsync() from IRQ context resulting in
> sleep in atomic.
> 
> Fix the problem by checking directly iocb flags (the same way as it is
> done in dio_complete()) instead of checking all conditions that could
> lead to IO being synchronous.
> 
> CC: Christoph Hellwig 
> CC: Goldwyn Rodrigues 
> CC: sta...@vger.kernel.org
> Reported-by: Mark Rutland 
> Fixes: 9830f4be159b29399d107bffb99e0132bc5aedd4
> Signed-off-by: Jan Kara 
> ---
>  fs/direct-io.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index a0ca9e48e993..1357ef563893 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1274,8 +1274,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode 
> *inode,
>*/
>   if (dio->is_async && iov_iter_rw(iter) == WRITE) {
>   retval = 0;
> - if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> - IS_SYNC(iocb->ki_filp->f_mapping->host))
> + if (iocb->ki_flags & IOCB_DSYNC)
>   retval = dio_set_defer_completion(dio);
>   else if (!dio->inode->i_sb->s_dio_done_wq) {
>   /*
> -- 
> 2.13.6
>

[PATCH] block: copy ioprio in __bio_clone_fast()

2018-02-26 Thread Hannes Reinecke

We need to copy the io priority, too; otherwise the clone will run
with a different priority than the original one.

Fixes: 43b62ce3ff0a ("block: move bio io prio to a new field")
Signed-off-by: Hannes Reinecke 
---
 block/bio.c  |   1 +
 drivers/vhost/npiv.c | 379 +--
 2 files changed, 37 insertions(+), 343 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index e1708db48258..e079911c640f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -605,6 +605,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_write_hint = bio_src->bi_write_hint;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
+   bio->bi_ioprio = bio_src->bi_ioprio;
 
bio_clone_blkcg_association(bio, bio_src);
 }
diff --git a/drivers/vhost/npiv.c b/drivers/vhost/npiv.c
index 20e2a66e332d..3527996aab3f 100644
--- a/drivers/vhost/npiv.c
+++ b/drivers/vhost/npiv.c
@@ -13,7 +13,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
  *
- /
+ */
 
 #include 
 #include 
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -87,8 +88,6 @@ struct vhost_npiv_cmd {
struct vhost_virtqueue *tvc_vq;
/* Pointer to vhost nexus memory */
struct vhost_npiv_nexus *tvc_nexus;
-   /* The TCM I/O descriptor that is accessed via container_of() */
-   struct se_cmd tvc_se_cmd;
/* work item used for cmwq dispatch to vhost_npiv_submission_work() */
struct work_struct work;
/* Copy of the incoming SCSI command descriptor block (CDB) */
@@ -101,11 +100,6 @@ struct vhost_npiv_cmd {
struct vhost_npiv_inflight *inflight;
 };
 
-struct vhost_npiv_nexus {
-   /* Pointer to TCM session for I_T Nexus */
-   struct se_session *tvn_se_sess;
-};
-
 struct vhost_npiv_tpg {
/* Vhost port target portal group tag for TCM */
u16 tport_tpgt;
@@ -121,25 +115,10 @@ struct vhost_npiv_tpg {
struct mutex tv_tpg_mutex;
/* Pointer to the TCM VHost I_T Nexus for this TPG endpoint */
struct vhost_npiv_nexus *tpg_nexus;
-   /* Pointer back to vhost_npiv_tport */
-   struct vhost_npiv_tport *tport;
-   /* Returned by vhost_npiv_make_tpg() */
-   struct se_portal_group se_tpg;
/* Pointer back to vhost_npiv, protected by tv_tpg_mutex */
struct vhost_npiv *vhost_npiv;
 };
 
-struct vhost_npiv_tport {
-   /* SCSI protocol the tport is providing */
-   u8 tport_proto_id;
-   /* Binary World Wide unique Port Name for Vhost Target port */
-   u64 tport_wwpn;
-   /* ASCII formatted WWPN for Vhost Target port */
-   char tport_name[VHOST_NPIV_NAMELEN];
-   /* Returned by vhost_npiv_make_tport() */
-   struct se_wwn tport_wwn;
-};
-
 struct vhost_npiv_evt {
/* event to be sent to guest */
struct virtio_scsi_event event;
@@ -195,6 +174,7 @@ struct vhost_npiv {
 };
 
 static struct workqueue_struct *vhost_npiv_workqueue;
+static struct mempool *vhost_npiv_cmd_pool;
 
 /* Global spinlock to protect vhost_npiv TPG list for vhost IOCTL access */
 static DEFINE_MUTEX(vhost_npiv_mutex);
@@ -253,132 +233,6 @@ static void vhost_npiv_put_inflight(struct 
vhost_npiv_inflight *inflight)
kref_put(&inflight->kref, vhost_npiv_done_inflight);
 }
 
-static int vhost_npiv_check_true(struct se_portal_group *se_tpg)
-{
-   return 1;
-}
-
-static int vhost_npiv_check_false(struct se_portal_group *se_tpg)
-{
-   return 0;
-}
-
-static char *vhost_npiv_get_fabric_name(void)
-{
-   return "vhost";
-}
-
-static char *vhost_npiv_get_fabric_wwn(struct se_portal_group *se_tpg)
-{
-   struct vhost_npiv_tpg *tpg = container_of(se_tpg,
-   struct vhost_npiv_tpg, se_tpg);
-   struct vhost_npiv_tport *tport = tpg->tport;
-
-   return &tport->tport_name[0];
-}
-
-static u16 vhost_npiv_get_tpgt(struct se_portal_group *se_tpg)
-{
-   struct vhost_npiv_tpg *tpg = container_of(se_tpg,
-   struct vhost_npiv_tpg, se_tpg);
-   return tpg->tport_tpgt;
-}
-
-static int vhost_npiv_check_prot_fabric_only(struct se_portal_group *se_tpg)
-{
-   struct vhost_npiv_tpg *tpg = container_of(se_tpg,
-   struct vhost_npiv_tpg, se_tpg);
-
-   return tpg->tv_fabric_prot_type;
-}
-
-static u32 vhost_npiv_tpg_get_inst_index(struct se_portal_group *se_tpg)
-{
-   return 1;
-}
-
-static void vhost_npiv_release_cmd(struct se_cmd *se_cmd)
-{
-   struct vhost_npiv_cmd *tv_cmd = container_of(se_cmd,
-   struct vhost_npiv_cmd, tvc_se_cmd);
-   struct se_session *se_sess = tv_cmd->tvc_nexus->tvn_se_sess;
-   int i;
-
-   if (tv_cmd->tvc_sgl_count) {
-   for (i = 0; i < tv_cmd->tvc_sgl_

Re: v4.16-rc2: virtio-block + ext4 lockdep splats / sleeping from invalid context

2018-02-26 Thread Jan Kara

On Fri 23-02-18 15:47:36, Mark Rutland wrote:
> Hi all,
> 
> While fuzzing arm64/v4.16-rc2 with syzkaller, I simultaneously hit a
> number of splats in the block layer:
> 
> * inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage in
>   jbd2_trans_will_send_data_barrier
> 
> * BUG: sleeping function called from invalid context at mm/mempool.c:320
> 
> * WARNING: CPU: 0 PID: 0 at block/blk.h:297 
> generic_make_request_checks+0x670/0x750
> 
> ... I've included the full splats at the end of the mail.
> 
> These all happen in the context of the virtio block IRQ handler, so I
> wonder if this calls something that doesn't expect to be called from IRQ
> context. Is it valid to call blk_mq_complete_request() or
> blk_mq_end_request() from an IRQ handler?

No, it's likely a bug in detection whether IO completion should be deferred
to a workqueue or not. Does attached patch fix the problem? I don't see
exactly this being triggered by the syzkaller but it's close enough :)

Honza


> Syzkaller came up with a minimized reproducer, but it's a bit wacky (the
> fcntl and bpf calls should have no practical effect), and I haven't
> managed to come up with a C reproducer.
> 
> Any ideas?
> 
> Thanks,
> Mark.
> 
> 
> Syzkaller reproducer:
> # {Threaded:true Collide:true Repeat:false Procs:1 Sandbox:setuid Fault:false 
> FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true 
> WaitRepeat:false Debug:false Repro:false}
> mmap(&(0x7f00/0x24000)=nil, 0x24000, 0x3, 0x32, 0x, 
> 0x0)
> r0 = openat(0xff9c, &(0x7f019000-0x8)='./file0\x00', 0x42, 
> 0x0)
> fcntl$setstatus(r0, 0x4, 0x1)
> ftruncate(r0, 0x400)
> io_setup(0x1f, &(0x7f018000)=0x0)
> io_submit(r1, 0x1, &(0x7f01d000-0x28)=[&(0x7f01b000)={0x0, 0x0, 0x0, 
> 0x1, 0x0, r0, 
> &(0x7f022000-0x1000)="0
>  
> 000",
>  0x200, 0x0, 0x0, 0x0, 0x0}])
> bpf$BPF_PROG_ATTACH(0x, &(0x7f01b000)={0x0, 0x0, 0x3, 0x2}, 
> 0x4000)
> 
> 
> Full splat:
> [  162.337073] 
> [  162.338055] WARNING: inconsistent lock state
> [  162.339017] 4.16.0-rc2 #1 Not tainted
> [  162.339797] 
> [  162.340725] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage.
> [  162.342030] swapper/0/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
> [  162.343061]  (&journal->j_state_lock){+?++}, at: [<3b9c3e4b>] 
> jbd2_trans_will_send_data_barrier+0x44/0xc8
> [  162.353187] {HARDIRQ-ON-W} state was registered at:
> [  162.354433]   lock_acquire+0x48/0x68
> [  162.358640]   _raw_write_lock+0x3c/0x50
> [  162.360716]   ext4_init_journal_params.isra.6+0x40/0xa0
> [  162.363445]   ext4_fill_super+0x25cc/0x2e88
> [  162.364481]   mount_bdev+0x19c/0x1d8
> [  162.365345]   ext4_mount+0x14/0x20
> [  162.366130]   mount_fs+0x34/0x160
> [  162.366790]   vfs_kern_mount.part.8+0x54/0x160
> [  162.367874]   do_mount+0x540/0xd40
> [  162.373776]   SyS_mount+0x68/0x100
> [  162.374467]   mount_block_root+0x11c/0x28c
> [  162.376558]   mount_root+0x130/0x164
> [  162.380753]   prepare_namespace+0x138/0x180
> [  162.381729]   kernel_init_freeable+0x25c/0x280
> [  162.382625]   kernel_init+0x10/0x100
> [  162.383337]   ret_from_fork+0x10/0x18
> [  162.384072] irq event stamp: 3670810
> [  162.384787] hardirqs last  enabled at (3670805): [] 
> arch_cpu_idle+0x14/0x28
> [  162.386505] hardirqs last disabled at (3670806): [<341112e2>] 
> el1_irq+0x74/0x130
> [  162.388107] softirqs last  enabled at (3670810): [] 
> _local_bh_enable+0x20/0x40
> [  162.389880] softirqs last disabled at (3670809): [] 
> irq_enter+0x54/0x70
> [  162.391443]
> [  162.391443] other info that might help us debug this:
> [  162.392680]  Possible unsafe locking scenario:
> [  162.392680]
> [  162.405967]CPU0
> [  162.406513]
> [  162.407055]   lock(&journal->j_state_lock);
> [

81 matches

Mail list logo