Re: [RESEND PATCH 1/3] completion: Add support for initializing completion with lockdep_map

2017-10-20 Thread Byungchul Park
On Sat, Oct 21, 2017 at 4:58 AM, Bart Van Assche  wrote:
> Sorry, but I'm not sure that's the best possible answer. In my opinion,
> avoiding dependencies between completion objects and other lock objects,
> e.g. by avoiding waiting on a completion object while holding a mutex, is a
> far superior strategy to adding cross-release checking to completion
> objects. The former strategy makes it unnecessary to add cross-release
> checking to completion objects because it ensures that these completion
> objects cannot get involved in a deadlock. The latter

That would be true if we could force it. But do you think that's possible?

> strategy can lead to false positive deadlock reports by the lockdep code,

Where do you think false positives come from? They come from assigning
lock classes incorrectly where we should take more care, rather than from
the lockdep code itself. The same applies to cross-release.

> something none of us wants.
>
> A possible alternative strategy could be to enable cross-release checking
> only for those completion objects for which waiting occurs inside a critical
> section.

Of course, it already does that. Cross-release doesn't consider any waiting
that happens outside of critical sections at all, which is exactly what it
should do.
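
For illustration, here is a minimal sketch (hypothetical driver code, not
taken from these patches) of the kind of pattern that *is* considered,
i.e. a wait on a completion inside a critical section:

#include <linux/mutex.h>
#include <linux/completion.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(foo_lock);
static DECLARE_COMPLETION(foo_done);

static void foo_work_fn(struct work_struct *work)
{
	/* ... do the actual work ... */
	complete(&foo_done);
}
static DECLARE_WORK(foo_work, foo_work_fn);

static void foo_submit_and_wait(void)
{
	mutex_lock(&foo_lock);
	schedule_work(&foo_work);
	/* wait inside the critical section: foo_lock -> foo_done dependency */
	wait_for_completion(&foo_done);
	mutex_unlock(&foo_lock);
}

If foo_work_fn() (or anything else on the path towards complete(&foo_done))
ever takes foo_lock itself, the two contexts deadlock.  With cross-release
the completion participates in the dependency graph, so lockdep can report
the inverted ordering instead of the system silently hanging.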

> As explained in another e-mail thread, unlike the lock inversion checking
> performed by the <= v4.13 lockdep code, cross-release checking is a heuristic
> that does not have a sound theoretical basis. The lock validator is an

It's not a heuristic; it's built on the same theoretical basis as the
<= v4.13 lockdep. I mean, the key basis is:

   1) What causes a deadlock
   2) What a dependency is
   3) Build a dependency once one is identified

> important tool for kernel developers. It is important that it produces as few
> false positives as possible. Since the cross-release checks are enabled
> automatically when enabling lockdep, I think it is normal that I, as a kernel
> developer, care that the cross-release checks produce as few false positives
> as possible.

No doubt. That's why I proposed these patches.


Re: [bug report] regression bisected to "block: Make most scsi_req_init() calls implicit"

2017-10-20 Thread Bart Van Assche
On Fri, 2017-10-20 at 16:54 -0600, dann frazier wrote:
> hey,
>   I'm seeing a regression when executing 'dmraid -r -c' in an arm64
> QEMU guest, which I've bisected to the following commit:
> 
>   ca18d6f7 "block: Make most scsi_req_init() calls implicit"
> 
> I haven't had time to try to debug it yet, but wanted to get
> the report out there before the weekend. Here's the crash:
> 
> [  138.519885] usercopy: kernel memory overwrite attempt detected to  
>  (null) () (6 bytes)
> [  138.521562] kernel BUG at mm/usercopy.c:72!
> [  138.522294] Internal error: Oops - BUG: 0 [#1] SMP
> [  138.523105] Modules linked in: nls_utf8 isofs nls_iso8859_1 qemu_fw_cfg 
> ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi 
> scsi_transport_iscsi ip_tables
> x_tables autofs4 btrfs zstd_decompress zstd_compress xxhash raid10 raid456 
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
> libcrc32c raid1 raid0
> multipath linear aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce 
> sha2_ce sha256_arm64 sha1_ce virtio_net virtio_blk aes_neon_bs aes_neon_blk 
> crypto_simd cryptd
> aes_arm64
> [  138.531307] CPU: 62 PID: 2271 Comm: dmraid Not tainted 4.14.0-rc5+ #20
> [  138.532512] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [  138.533796] task: 8003cba2e900 task.stack: 110e8000
> [  138.534887] PC is at __check_object_size+0x114/0x200
> [  138.535800] LR is at __check_object_size+0x114/0x200
> [  138.536711] pc : [] lr : [] pstate: 
> 00400145
> [  138.538073] sp : 110ebb00
> [  138.538682] x29: 110ebb00 x28:  
> [  138.539658] x27: d88e1110 x26: 8003e8d3d800 
> [  138.540633] x25: 0802001d x24: 8003e1131920 
> [  138.541621] x23: 0006 x22: 0006 
> [  138.542596] x21:  x20: 0006 
> [  138.543571] x19:  x18:  
> [  138.544548] x17: 83380ce0 x16: 082dd3b0 
> [  138.545525] x15: 093c8c08 x14: 6c756e2820202020 
> [  138.546511] x13: 202020202020206f x12: 7420646574636574 
> [  138.547489] x11: 093c9658 x10: 086ae800 
> [  138.548466] x9 : 7265766f2079726f x8 : 0017 
> [  138.549445] x7 : 6c756e3c2820296c x6 : 8003eeb51c28 
> [  138.550434] x5 : 8003eeb51c28 x4 :  
> [  138.551411] x3 : 8003eeb59ec8 x2 : d4a0cd0f45236000 
> [  138.552388] x1 :  x0 : 0059 
> [  138.553364] Process dmraid (pid: 2271, stack limit = 0x110e8000)
> [  138.554593] Call trace:
> [  138.555043] Exception stack(0x110eb9c0 to 0x110ebb00)
> [  138.556214] b9c0: 0059  d4a0cd0f45236000 
> 8003eeb59ec8
> [  138.557653] b9e0:  8003eeb51c28 8003eeb51c28 
> 6c756e3c2820296c
> [  138.559082] ba00: 0017 7265766f2079726f 086ae800 
> 093c9658
> [  138.560510] ba20: 7420646574636574 202020202020206f 6c756e2820202020 
> 093c8c08
> [  138.561950] ba40: 082dd3b0 83380ce0  
> 
> [  138.563379] ba60: 0006  0006 
> 0006
> [  138.564805] ba80: 8003e1131920 0802001d 8003e8d3d800 
> d88e1110
> [  138.566238] baa0:  110ebb00 082c0e5c 
> 110ebb00
> [  138.567666] bac0: 082c0e5c 00400145 08e25a80 
> 
> [  138.569090] bae0: 0001 0006 110ebb00 
> 082c0e5c
> [  138.570523] [] __check_object_size+0x114/0x200
> [  138.571628] [] sg_io+0x120/0x438
> [  138.572507] [] scsi_cmd_ioctl+0x594/0x728
> [  138.573531] [] scsi_cmd_blk_ioctl+0x50/0x60
> [  138.574594] [] virtblk_ioctl+0x60/0x80 [virtio_blk]
> [  138.575769] [] blkdev_ioctl+0x5e4/0xb50
> [  138.576756] [] block_ioctl+0x50/0x68
> [  138.577698] [] do_vfs_ioctl+0xc4/0x940
> [  138.578671] [] SyS_ioctl+0x8c/0xa8
> [  138.579581] Exception stack(0x110ebec0 to 0x110ec000)
> [  138.580752] bec0: 0005 2285 d88e10b8 
> 0006
> [  138.582199] bee0:  0004 83416648 
> 0050
> [  138.583623] bf00: 001d 0003 0012 
> 0011
> [  138.585050] bf20: 83409000 00ff 8309dc70 
> 0531
> [  138.586490] bf40: 8344a360 83380ce0 00dc 
> 83478948
> [  138.587918] bf60: 0004 17ee7f90 0005 
> 17ede920
> [  138.589346] bf80: 17ee7f60 0003 83416648 
> 17ee7f60
> [  138.590785] bfa0: d88e1218 d88e1090 834166dc 
> d88e1090
> [  138.592215] bfc0: 83380cec 8000 0005 
> 001d
> [  138.593649] bfe0: 

[bug report] regression bisected to "block: Make most scsi_req_init() calls implicit"

2017-10-20 Thread dann frazier
hey,
  I'm seeing a regression when executing 'dmraid -r -c' in an arm64
QEMU guest, which I've bisected to the following commit:

  ca18d6f7 "block: Make most scsi_req_init() calls implicit"

I haven't had time to try to debug it yet, but wanted to get
the report out there before the weekend. Here's the crash:

[  138.519885] usercopy: kernel memory overwrite attempt detected to   
(null) () (6 bytes)
[  138.521562] kernel BUG at mm/usercopy.c:72!
[  138.522294] Internal error: Oops - BUG: 0 [#1] SMP
[  138.523105] Modules linked in: nls_utf8 isofs nls_iso8859_1 qemu_fw_cfg 
ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi 
scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_decompress 
zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 
aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha256_arm64 
sha1_ce virtio_net virtio_blk aes_neon_bs aes_neon_blk crypto_simd cryptd 
aes_arm64
[  138.531307] CPU: 62 PID: 2271 Comm: dmraid Not tainted 4.14.0-rc5+ #20
[  138.532512] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[  138.533796] task: 8003cba2e900 task.stack: 110e8000
[  138.534887] PC is at __check_object_size+0x114/0x200
[  138.535800] LR is at __check_object_size+0x114/0x200
[  138.536711] pc : [] lr : [] pstate: 
00400145
[  138.538073] sp : 110ebb00
[  138.538682] x29: 110ebb00 x28:  
[  138.539658] x27: d88e1110 x26: 8003e8d3d800 
[  138.540633] x25: 0802001d x24: 8003e1131920 
[  138.541621] x23: 0006 x22: 0006 
[  138.542596] x21:  x20: 0006 
[  138.543571] x19:  x18:  
[  138.544548] x17: 83380ce0 x16: 082dd3b0 
[  138.545525] x15: 093c8c08 x14: 6c756e2820202020 
[  138.546511] x13: 202020202020206f x12: 7420646574636574 
[  138.547489] x11: 093c9658 x10: 086ae800 
[  138.548466] x9 : 7265766f2079726f x8 : 0017 
[  138.549445] x7 : 6c756e3c2820296c x6 : 8003eeb51c28 
[  138.550434] x5 : 8003eeb51c28 x4 :  
[  138.551411] x3 : 8003eeb59ec8 x2 : d4a0cd0f45236000 
[  138.552388] x1 :  x0 : 0059 
[  138.553364] Process dmraid (pid: 2271, stack limit = 0x110e8000)
[  138.554593] Call trace:
[  138.555043] Exception stack(0x110eb9c0 to 0x110ebb00)
[  138.556214] b9c0: 0059  d4a0cd0f45236000 
8003eeb59ec8
[  138.557653] b9e0:  8003eeb51c28 8003eeb51c28 
6c756e3c2820296c
[  138.559082] ba00: 0017 7265766f2079726f 086ae800 
093c9658
[  138.560510] ba20: 7420646574636574 202020202020206f 6c756e2820202020 
093c8c08
[  138.561950] ba40: 082dd3b0 83380ce0  

[  138.563379] ba60: 0006  0006 
0006
[  138.564805] ba80: 8003e1131920 0802001d 8003e8d3d800 
d88e1110
[  138.566238] baa0:  110ebb00 082c0e5c 
110ebb00
[  138.567666] bac0: 082c0e5c 00400145 08e25a80 

[  138.569090] bae0: 0001 0006 110ebb00 
082c0e5c
[  138.570523] [] __check_object_size+0x114/0x200
[  138.571628] [] sg_io+0x120/0x438
[  138.572507] [] scsi_cmd_ioctl+0x594/0x728
[  138.573531] [] scsi_cmd_blk_ioctl+0x50/0x60
[  138.574594] [] virtblk_ioctl+0x60/0x80 [virtio_blk]
[  138.575769] [] blkdev_ioctl+0x5e4/0xb50
[  138.576756] [] block_ioctl+0x50/0x68
[  138.577698] [] do_vfs_ioctl+0xc4/0x940
[  138.578671] [] SyS_ioctl+0x8c/0xa8
[  138.579581] Exception stack(0x110ebec0 to 0x110ec000)
[  138.580752] bec0: 0005 2285 d88e10b8 
0006
[  138.582199] bee0:  0004 83416648 
0050
[  138.583623] bf00: 001d 0003 0012 
0011
[  138.585050] bf20: 83409000 00ff 8309dc70 
0531
[  138.586490] bf40: 8344a360 83380ce0 00dc 
83478948
[  138.587918] bf60: 0004 17ee7f90 0005 
17ede920
[  138.589346] bf80: 17ee7f60 0003 83416648 
17ee7f60
[  138.590785] bfa0: d88e1218 d88e1090 834166dc 
d88e1090
[  138.592215] bfc0: 83380cec 8000 0005 
001d
[  138.593649] bfe0:    

[  138.595091] [] el0_svc_naked+0x24/0x28
[  138.596071] Code: aa1403e5 aa1303e3 9119a0c0 97f9d96d (d421) 
[  138.597193] ---[ end trace b7eecd0b21001177 ]---

Here's the ioctl as repor

Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart

2017-10-20 Thread Bart Van Assche
On Fri, 2017-10-20 at 11:39 +0200, Roman Penyaev wrote:
> But what bothers me is these looong loops inside blk_mq_sched_restart(),
> and since you are the author of the original 6d8c6c0f97ad ("blk-mq: Restart
> a single queue if tag sets are shared") I want to ask what was the original
> problem which you attempted to fix?  Likely I am missing some test scenario
> which would be great to know about.

Long loops? How many queues share the same tag set on your setup? How many
hardware queues does your block driver create per request queue?

Commit 6d8c6c0f97ad is something I came up with to fix queue lockups in the
SCSI and dm-mq drivers.

Bart.

Re: [RESEND PATCH 1/3] completion: Add support for initializing completion with lockdep_map

2017-10-20 Thread Bart Van Assche
On Fri, 2017-10-20 at 08:34 +0200, Thomas Gleixner wrote:
> On Thu, 19 Oct 2017, Bart Van Assche wrote:
> > Are there any completion objects for which the cross-release checking is
> > useful?
> 
> All of them by definition.

Sorry, but I'm not sure that's the best possible answer. In my opinion,
avoiding dependencies between completion objects and other lock objects,
e.g. by avoiding waiting on a completion object while holding a mutex, is a
far superior strategy to adding cross-release checking to completion
objects. The former strategy makes it unnecessary to add cross-release
checking to completion objects because it ensures that these completion
objects cannot get involved in a deadlock. The latter
strategy can lead to false positive deadlock reports by the lockdep code,
something none of us wants.

A possible alternative strategy could be to enable cross-release checking
only for those completion objects for which waiting occurs inside a critical
section.

> > Are there any wait_for_completion() callers that hold a mutex or
> > other locking object?
> 
> Yes, there are also cross completion dependencies. There have been such
> bugs and I expect more to be unearthed.
> 
> I really have to ask what your motivation is to fight the lockdep coverage
> of synchronization objects tooth and nail?

As explained in another e-mail thread, unlike the lock inversion checking
performed by the <= v4.13 lockdep code, cross-release checking is a heuristic
that does not have a sound theoretical basis. The lock validator is an
important tool for kernel developers. It is important that it produces as few
false positives as possible. Since the cross-release checks are enabled
automatically when enabling lockdep, I think it is normal that I, as a kernel
developer, care that the cross-release checks produce as few false positives
as possible.

Bart.

Re: [PATCH 15/17] nvme: track shared namespaces

2017-10-20 Thread Javier González
Javier
> On 18 Oct 2017, at 18.52, Christoph Hellwig  wrote:
> 
> Introduce a new struct nvme_ns_head that holds information about an actual
> namespace, unlike struct nvme_ns, which only holds the per-controller
> namespace information.  For private namespaces there is a 1:1 relation of
> the two, but for shared namespaces this lets us discover all the paths to
> it.  For now only the identifiers are moved to the new structure, but most
> of the information in struct nvme_ns should eventually move over.
> 
> To allow lockless path lookup the list of nvme_ns structures per
> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> structure through call_srcu.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Keith Busch 
> ---
> drivers/nvme/host/core.c | 192 +--
> drivers/nvme/host/lightnvm.c |  14 ++--
> drivers/nvme/host/nvme.h |  21 -
> 3 files changed, 192 insertions(+), 35 deletions(-)
> 
> 
LGTM

Reviewed-by: Javier González 





Re: [PATCH 9/9] bsg: split handling of SCSI CDBs vs transport requeues

2017-10-20 Thread Benjamin Block
On Fri, Oct 20, 2017 at 06:26:30PM +0200, Christoph Hellwig wrote:
> On Thu, Oct 19, 2017 at 05:59:33PM +0200, Benjamin Block wrote:
> > > +#define ptr64(val) ((void __user *)(uintptr_t)(val))
> > 
> > Better to reflect the special property, that it is a user pointer, in
> > the name of the macro. Maybe something like user_ptr(64). The same
> > comment for the same macro in bsg.c.
> 
> Not sure it's worth it especially now that Martin has merged the patch.

He did? I only saw a mail saying that he picked up patches 2-5. So I think
all the bsg changes are still open.

(Maybe I just missed that, I haven't exactly followed the list very
closely as of late)

> But given how many interfaces we have all over the kernel that use a u64
> to store a user pointer in ioctls and similar it might make sense to
> lift a helper like this to a generic header.  In that case we'll need
> a more descriptive name for sure.
> 
> > > +static int bsg_transport_check_proto(struct sg_io_v4 *hdr)
> > > +{
> > > + if (hdr->protocol != BSG_PROTOCOL_SCSI  ||
> > > + hdr->subprotocol != BSG_SUB_PROTOCOL_SCSI_TRANSPORT)
> > > + return -EINVAL;
> > > + if (!capable(CAP_SYS_RAWIO))
> > > + return -EPERM;
> > 
> > Any particular reason why this is not symmetric with bsg_scsi? IOW
> > permission checking done in bsg_transport_fill_hdr(), like it is done in
> > bsg_scsi_fill_hdr()?
> > 
> > We might save some time copying memory with this (we also only talk
> > about ~20 bytes here), but on the other hand the interface would be more
> > clean otherwise IMO (if we already do restructure the interface) -
> > similar callbacks have similar responsibilities.
> 
> I could move the capable check around, not sure why I had done it that
> way, it's been a while.  Probably because blk_verify_command needs the
> CDB while a simple capable() check does not.

That was my guess, too. I just thought it would be more consistent otherwise.
It's not a big thing, really.


Beste Grüße / Best regards,
  - Benjamin Block
-- 
Linux on z Systems Development / IBM Systems & Technology Group
  IBM Deutschland Research & Development GmbH 
Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294



Re: [PATCH 9/9] bsg: split handling of SCSI CDBs vs transport requeues

2017-10-20 Thread Christoph Hellwig
On Thu, Oct 19, 2017 at 05:59:33PM +0200, Benjamin Block wrote:
> > +#define ptr64(val) ((void __user *)(uintptr_t)(val))
> 
> Better to reflect the special property, that it is a user pointer, in
> the name of the macro. Maybe something like user_ptr(64). The same
> comment for the same macro in bsg.c.

Not sure it's worth it especially now that Martin has merged the patch.
But given how many interfaces we have all over the kernel that use a u64
to store a user pointer in ioctls and similar it might make sense to
lift a helper like this to a generic header.  In that case we'll need
a more descriptive name for sure.
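
For reference, a rough sketch of what Benjamin's user_ptr() suggestion could
look like once lifted into a common header (sketch only; the helper name is
just his proposal, and if I remember correctly include/linux/kernel.h already
carries a u64_to_user_ptr() macro along these lines):

#include <linux/types.h>
#include <linux/compiler.h>

/* generic u64 -> user pointer conversion for ioctl-style interfaces */
static inline void __user *user_ptr(u64 val)
{
	return (void __user *)(uintptr_t)val;
}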

> > +static int bsg_transport_check_proto(struct sg_io_v4 *hdr)
> > +{
> > +   if (hdr->protocol != BSG_PROTOCOL_SCSI  ||
> > +   hdr->subprotocol != BSG_SUB_PROTOCOL_SCSI_TRANSPORT)
> > +   return -EINVAL;
> > +   if (!capable(CAP_SYS_RAWIO))
> > +   return -EPERM;
> 
> Any particular reason why this is not symmetric with bsg_scsi? IOW
> permission checking done in bsg_transport_fill_hdr(), like it is done in
> bsg_scsi_fill_hdr()?
> 
> We might save some time copying memory with this (we also only talk
> about ~20 bytes here), but on the other hand the interface would be more
> clean otherwise IMO (if we already do restructure the interface) -
> similar callbacks have similar responsibilities.

I could move the capable check around, not sure why I had done it that
way, it's been a while.  Probably because blk_verify_command needs the
CDB while a simple capable() check does not.

> If I understand this right, this reflects the old code, just written
> down a little differently.
> 
> But I wonder why we do that? Wouldn't that be interesting to know for
> userspace, if more was received than it allocated space for? Isn't that the
> typical residual overrun case (similar to LUN scanning in SCSI common
> code), and din_resid is signed after all? Well I guess it could be an
> ABI break, I don't know.
> 
> Ah well, at least the documentation for 'struct sg_io_v4' makes no such
> restriction (that it cannot be below 0).
> 
> Just a thought I had while reading it.

Maybe it would, but I really didn't want to change behavior.  If we
were to redo transport passthrough I would do it totally differently today.

> > +   job->reply = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp);
> 
> One suggestion here. Maybe we could get rid of this implicit knowledge
> about SCSI_SENSE_BUFFERSIZE being the max size for a bsg-reply?
> Especially if we use this patch and get rid of other similarities (like
> using scsi_request).
> 
> Maybe we can just define a extra macro in bsg-lib.c, or in one of the
> headers, and define its size to be SCSI_SENSE_BUFFERSIZE (for now) and
> then use that in all cases.
> 
> I tried something similar some time ago if you remember, but I couldn't
> follow up because other stuff got more important in the meantime. One
> could also statically check the transport reply-types against that.
> 
> This way, should we need to change that value for a specific transport, we
> only need to change one knob, and not 10 (I guess SCSI_SENSE_BUFFERSIZE
> could not be changed for such cases ;) ).

There shouldn't be any dependencies on SCSI_SENSE_BUFFERSIZE left,
so yes, this could be cleaned up.  Great opportunity for a follow on
patch.
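
As a sketch of what that follow-up could look like (names are made up here,
not an actual patch; for now the constant would simply mirror today's
SCSI_SENSE_BUFFERSIZE, as Benjamin suggests):

/* in a shared header, e.g. include/linux/bsg-lib.h */
#define BSG_MAX_REPLY_SIZE	SCSI_SENSE_BUFFERSIZE

/* in bsg-lib.c, instead of open-coding the SCSI constant */
static int bsg_alloc_reply(struct bsg_job *job, gfp_t gfp)
{
	job->reply = kzalloc(BSG_MAX_REPLY_SIZE, gfp);
	if (!job->reply)
		return -ENOMEM;
	job->reply_len = BSG_MAX_REPLY_SIZE;
	return 0;
}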

> > -   /* if the LLD has been removed then the bsg_unregister_queue will
> > -* eventually be called and the class_dev was freed, so we can no
> > -* longer use this request_queue. Return no such address.
> > -*/
> 
> Why remove the comment? Has that changed?

Nothing has changed, but then again it's standard behavior, so the comment
doesn't really add any value.

> > +   rq->timeout = msecs_to_jiffies(hdr->timeout);
> > +   if (!rq->timeout)
> > +   rq->timeout = rq->q->sg_timeout;
> 
> No need to use the rq pointer, you already have a variable q with the
> same content.

True.

> > -   ret = blk_rq_map_user(q, rq, NULL, dxferp, dxfer_len,
> > - GFP_KERNEL);
> > -   if (ret)
> > -   goto out;
> > +   ret = blk_rq_map_user(q, rq, NULL, ptr64(hdr->din_xferp),
> > +   hdr->din_xfer_len, GFP_KERNEL);
> > +   } else {
> > +   ret = blk_rq_map_user(q, rq, NULL, NULL, 0, GFP_KERNEL);
> 
> Why do we behave differently in this case now? To prevent special
> handling elsewhere? Otherwise it seems a bit pointless/error-prone
> mapping zero length to nothing.

Yes, this could be removed again.  I'll send a follow up.


[PATCH v2] block: fix peeking requests during PM

2017-10-20 Thread Christoph Hellwig
We need to look for an active PM request until the next softbarrier
instead of looking for the first non-PM request.  Otherwise any cause
of request reordering might starve the PM request(s).

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 35 ++-
 1 file changed, 18 insertions(+), 17 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 14f7674fa0b1..480029e4fb54 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2464,20 +2464,22 @@ void blk_account_io_done(struct request *req)
  * Don't process normal requests when queue is suspended
  * or in the process of suspending/resuming
  */
-static struct request *blk_pm_peek_request(struct request_queue *q,
-  struct request *rq)
+static bool blk_pm_allow_request(struct request *rq)
 {
-   if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
-   (q->rpm_status != RPM_ACTIVE && !(rq->rq_flags & RQF_PM))))
-   return NULL;
-   else
-   return rq;
+   switch (rq->q->rpm_status) {
+   case RPM_RESUMING:
+   case RPM_SUSPENDING:
+   return rq->rq_flags & RQF_PM;
+   case RPM_SUSPENDED:
+   return false;
+   }
+
+   return true;
 }
 #else
-static inline struct request *blk_pm_peek_request(struct request_queue *q,
- struct request *rq)
+static bool blk_pm_allow_request(struct request *rq)
 {
-   return rq;
+   return true;
 }
 #endif
 
@@ -2525,9 +2527,12 @@ static struct request *elv_next_request(struct 
request_queue *q)
WARN_ON_ONCE(q->mq_ops);
 
while (1) {
-   if (!list_empty(&q->queue_head)) {
-   rq = list_entry_rq(q->queue_head.next);
-   return rq;
+   list_for_each_entry(rq, &q->queue_head, queuelist) {
+   if (blk_pm_allow_request(rq))
+   return rq;
+
+   if (rq->rq_flags & RQF_SOFTBARRIER)
+   break;
}
 
/*
@@ -2578,10 +2583,6 @@ struct request *blk_peek_request(struct request_queue *q)
WARN_ON_ONCE(q->mq_ops);
 
while ((rq = elv_next_request(q)) != NULL) {
-   rq = blk_pm_peek_request(q, rq);
-   if (!rq)
-   break;
-
if (!(rq->rq_flags & RQF_STARTED)) {
/*
 * This is the first time the device driver
-- 
2.14.2



Re: [PATCH v2 4/4] lockdep: Assign a lock_class per gendisk used for wait_for_completion()

2017-10-20 Thread Christoph Hellwig
The Subject prefix for this should be "block:".

> @@ -945,7 +945,7 @@ int submit_bio_wait(struct bio *bio)
>  {
>   struct submit_bio_ret ret;
>  
> - init_completion(&ret.event);
> + init_completion_with_map(&ret.event, &bio->bi_disk->lockdep_map);

FYI, I have an outstanding patch to simplify this a lot, which
switches this to DECLARE_COMPLETION_ONSTACK.  I can delay this or let
you pick it up with your series, but we'll need a variant of
DECLARE_COMPLETION_ONSTACK with the lockdep annotations.
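
Something along these lines, as a rough sketch only (building on the
init_completion_with_map() API from patch 1/3; the real macro would probably
want a COMPLETION_INITIALIZER_ONSTACK()-style initializer instead of a
separate init call):

#define DECLARE_COMPLETION_ONSTACK_MAP(work, map)			\
	struct completion work;						\
	init_completion_with_map(&work, &(map))

so that submit_bio_wait() could keep the on-stack style:

	DECLARE_COMPLETION_ONSTACK_MAP(done, bio->bi_disk->lockdep_map);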

Patch below for reference:

---
From d65b89843c9f82c0744643515ba51dd10e66e67b Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Thu, 5 Oct 2017 18:31:02 +0200
Subject: block: use DECLARE_COMPLETION_ONSTACK in submit_bio_wait

Simplify the code by getting rid of the submit_bio_ret structure.

Signed-off-by: Christoph Hellwig 
---
 block/bio.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 8338304ea256..4e18e959fc0a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -917,17 +917,9 @@ int bio_iov_iter_get_pages(struct bio *bio, struct 
iov_iter *iter)
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
-struct submit_bio_ret {
-   struct completion event;
-   int error;
-};
-
 static void submit_bio_wait_endio(struct bio *bio)
 {
-   struct submit_bio_ret *ret = bio->bi_private;
-
-   ret->error = blk_status_to_errno(bio->bi_status);
-   complete(&ret->event);
+   complete(bio->bi_private);
 }
 
 /**
@@ -943,16 +935,15 @@ static void submit_bio_wait_endio(struct bio *bio)
  */
 int submit_bio_wait(struct bio *bio)
 {
-   struct submit_bio_ret ret;
+   DECLARE_COMPLETION_ONSTACK(done);
 
-   init_completion(&ret.event);
-   bio->bi_private = &ret;
+   bio->bi_private = &done;
bio->bi_end_io = submit_bio_wait_endio;
bio->bi_opf |= REQ_SYNC;
submit_bio(bio);
-   wait_for_completion_io(&ret.event);
+   wait_for_completion_io(&done);
 
-   return ret.error;
+   return blk_status_to_errno(bio->bi_status);
 }
 EXPORT_SYMBOL(submit_bio_wait);
 
-- 
2.14.2



Re: [GIT PULL] nvme fixes for 4.14

2017-10-20 Thread Jens Axboe
On 10/20/2017 08:17 AM, Christoph Hellwig wrote:
> Hi Jens,
> 
> below are two regression fixes each for RDMA and FC, and a fix for a SQHD
> update race in the target.
> 
> The following changes since commit 639812a1ed9bf49ae2c026086fbf975339cd1eef:
> 
>   nbd: don't set the device size until we're connected (2017-10-09 12:29:22 
> -0600)
> 
> are available in the git repository at:
> 
>   git://git.infradead.org/nvme.git nvme-4.14

Pulled, thanks.

-- 
Jens Axboe



[GIT PULL] nvme fixes for 4.14

2017-10-20 Thread Christoph Hellwig
Hi Jens,

below are two regression fixes each for RDMA and FC, and a fix for a SQHD
update race in the target.

The following changes since commit 639812a1ed9bf49ae2c026086fbf975339cd1eef:

  nbd: don't set the device size until we're connected (2017-10-09 12:29:22 
-0600)

are available in the git repository at:

  git://git.infradead.org/nvme.git nvme-4.14

for you to fetch changes up to f04b9cc87b5fc466b1b7231ba7b078e885956c5b:

  nvme-rdma: Fix error status return in tagset allocation failure (2017-10-19 
17:13:51 +0200)


James Smart (3):
  nvme-fc: fix iowait hang
  nvme-fc: retry initial controller connections 3 times
  nvmet: synchronize sqhd update

Sagi Grimberg (2):
  nvme-rdma: Fix possible double free in reconnect flow
  nvme-rdma: Fix error status return in tagset allocation failure

 drivers/nvme/host/fc.c  | 37 +
 drivers/nvme/host/rdma.c| 16 
 drivers/nvme/target/core.c  | 15 ---
 drivers/nvme/target/nvmet.h |  2 +-
 4 files changed, 58 insertions(+), 12 deletions(-)


Re: [PATCH V8 00/14] mmc: Add Command Queue support

2017-10-20 Thread Adrian Hunter
On 19/10/17 14:44, Adrian Hunter wrote:
> On 18/10/17 09:16, Adrian Hunter wrote:
>> On 11/10/17 16:58, Ulf Hansson wrote:
>>> On 11 October 2017 at 14:58, Adrian Hunter  wrote:
 On 11/10/17 15:13, Ulf Hansson wrote:
> On 10 October 2017 at 15:31, Adrian Hunter  
> wrote:
>> On 10/10/17 16:08, Ulf Hansson wrote:
>>> [...]
>>>
>>>
>>> I have also run some test on my ux500 board and enabling the blkmq
>>> path via the new MMC Kconfig option. My idea was to run some iozone
>>> comparisons between the legacy path and the new blkmq path, but I 
>>> just
>>> couldn't get to that point because of the following errors.
>>>
>>> I am using a Kingston 4GB SDHC card, which is detected and mounted
>>> nicely. However, when I decide to do some writes to the card I get 
>>> the
>>> following errors.
>>>
>>> root@ME:/mnt/sdcard dd if=/dev/zero of=testfile bs=8192 count=5000 
>>> conv=fsync
>>> [  463.714294] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  464.722656] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  466.081481] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  467.111236] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  468.669647] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  469.685699] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  471.043334] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  472.052337] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  473.342651] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  474.323760] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  475.544769] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  476.539031] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  477.748474] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>> [  478.724182] mmci-pl18x 80126000.sdi0_per1: error during DMA 
>>> transfer!
>>>
>>> I haven't yet got the point of investigating this any further, and
>>> unfortunate I have a busy schedule with traveling next week. I will 
>>> do
>>> my best to look into this as soon as I can.
>>>
>>> Perhaps you have some ideas?
>>
>> The behaviour depends on whether you have MMC_CAP_WAIT_WHILE_BUSY. 
>> Try
>> changing that and see if it makes a difference.
>
> Yes, it does! I disabled MMC_CAP_WAIT_WHILE_BUSY (and its
> corresponding code in mmci.c) and the errors goes away.
>
> When I use MMC_CAP_WAIT_WHILE_BUSY I get these problems:
>
> [  223.820983] mmci-pl18x 80126000.sdi0_per1: error during DMA 
> transfer!
> [  224.815795] mmci-pl18x 80126000.sdi0_per1: error during DMA 
> transfer!
> [  226.034881] mmci-pl18x 80126000.sdi0_per1: error during DMA 
> transfer!
> [  227.112884] mmci-pl18x 80126000.sdi0_per1: error during DMA 
> transfer!
> [  227.220275] mmc0: Card stuck in wrong state! mmcblk0 
> mmc_blk_card_stuck
> [  228.686798] mmci-pl18x 80126000.sdi0_per1: error during DMA 
> transfer!
> [  229.892150] mmci-pl18x 80126000.sdi0_per1: error during DMA 
> transfer!
> [  231.031890] mmci-pl18x 80126000.sdi0_per1: error during DMA 
> transfer!
> [  232.239013] mmci-pl18x 80126000.sdi0_per1: error during DMA 
> transfer!
> 5000+0 records in
> 5000+0 records out
> root@ME:/mnt/sdcard
>
> I looked at the new blkmq code from patch v10 13/15. It seems like the
> MMC_CAP_WAIT_WHILE_BUSY is used to determine whether the async request
> mechanism should be used or not. Perhaps I didn't looked close enough,
> but maybe you could elaborate on why this seems to be the case!?

 MMC_CAP_WAIT_WHILE_BUSY is necessary because it means that a data 
 transfer
 request has finished when the host controller calls 
 mmc_request_done(). i.e.
 polling the card is not necessary.
>>>
>>> Well, that is a rather big change on its own. Earlier we polled with
>>> CMD13 to verify that the card has moved back to the transfer state, in
>>> case it was a write. And that was no matter of MMC_CAP_WAIT_WHILE_BUSY
>>> was set or not. Right!?
>>
>> Yes
>>
>>>
>>> I am not sure it's a good idea to bypass that validation, it seems
>>> fragile to rely 

RE: [PATCH 0/6] v4 block refcount conversion patches

2017-10-20 Thread Reshetova, Elena

> Elena Reshetova  writes:
> > Elena Reshetova (6):
> >   block: convert bio.__bi_cnt from atomic_t to refcount_t
> >   block: convert blk_queue_tag.refcnt from atomic_t to refcount_t
> >   block: convert blkcg_gq.refcnt from atomic_t to refcount_t
> >   block: convert io_context.active_ref from atomic_t to refcount_t
> >   block: convert bsg_device.ref_count from atomic_t to refcount_t
> >   drivers, block: convert xen_blkif.refcnt from atomic_t to refcount_t
> 
> Hi Elena,
> 
> While the bsg ref_count is cheap, do you have any numbers how the other
> conversions compare in performance (throughput and latency) vs atomics?
Hi Johannes,

The performance would depend on which "breed" of refcount_t is used underneath. 
We currently have 3 versions:

- refcount_t defaults to atomic_t (no CONFIG_REFCOUNT_FULL enabled, no arch
  support). Impact is zero in this case, since only plain atomic functions
  are used.
- refcount_t uses an arch-specific implementation (the arch enables
  ARCH_HAS_REFCOUNT). Impact depends on the arch implementation; currently
  only x86 provides one.
- refcount_t uses the "full", arch-independent implementation.

Here are cycle numbers for comparing these 3 (https://lwn.net/Articles/728626/):
Just copy pasting for convenience:

">These are the cycle counts comparing a loop of refcount_inc() from 1
>to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
>unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
>(refcount_t-full), and this overflow-protected refcount (refcount_t-fast):

>2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
>   cycles  protections
> atomic_t 82249267387  none
> refcount_t-fast  82211446892  overflow, untested dec-to-zero
> refcount_t-full 144814735193  overflow, untested dec-to-zero, inc-from-zero"

So the middle option (called refcount_t-fast here), with the arch-specific
implementation, has a negligible impact. The "full" one is more expensive,
but it is disabled by default anyway, so only people who want strict
security enable it.
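
To make the difference concrete, here is a very rough sketch (simplified
names and types, not the actual refcount.h code) of what the default
atomic_t-backed increment versus a "full"-style checked increment look like:

#include <linux/atomic.h>
#include <linux/bug.h>
#include <linux/kernel.h>

typedef struct { atomic_t refs; } refcount_sketch_t;

/* Default (no CONFIG_REFCOUNT_FULL, no arch support): a plain atomic op. */
static inline void refcount_inc_fast(refcount_sketch_t *r)
{
	atomic_inc(&r->refs);
}

/* "Full" flavour: refuse inc-from-zero and saturate instead of wrapping. */
static inline void refcount_inc_full(refcount_sketch_t *r)
{
	int old = atomic_read(&r->refs);
	int prev, new;

	for (;;) {
		if (WARN_ONCE(old == 0, "refcount_inc on 0; possible use-after-free"))
			return;
		new = (old == INT_MAX) ? INT_MAX : old + 1;	/* saturate */
		prev = atomic_cmpxchg(&r->refs, old, new);
		if (prev == old)
			break;
		old = prev;
	}
}

The extra read/compare-exchange loop and the checks are where the cycle
difference in the numbers above comes from.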

Are these numbers convincing enough that we don't have to measure
the block devices? :)

Best Regards,
Elena.


> 
> It should be quite easy to measure against a null_blk device.
> 
> Thanks a lot,
>Johannes
> 
> --
> Johannes Thumshirn  Storage
> jthumsh...@suse.de+49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart

2017-10-20 Thread Roman Penyaev
Hi Bart,

On Thu, Oct 19, 2017 at 7:47 PM, Bart Van Assche  wrote:
> On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:
>> the patch below fixes queue stalling when shared hctx marked for restart
>> (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.  The
>> root cause is that hctxs are shared between queues, but 'shared_hctx_restart'
>> belongs to the particular queue, which in fact may not need to be restarted,
>> thus we return from blk_mq_sched_restart() and leave shared hctx of another
>> queue never restarted.
>>
>> The fix is to make shared_hctx_restart counter belong not to the queue, but
>> to tags, thereby counter will reflect real number of shared hctx needed to
>> be restarted.
>
> Hello Roman,
>
> The patch you posted looks fine to me but seeing this patch and the patch
> description makes me wonder why this had not been noticed before.

This is a good question, which I could not answer.  I tried to simulate the
same behaviour (completion timings, completion pinning, number of submission
queues, shared tags, etc.) on null_blk, but what I see is that
*_sched_restart() never observes 'shared_hctx_restart', literally never
(I added a counter where we take that path and start looking for an hctx to
restart, and the counter stays 0).

That made me nervous, and I eventually gave up.  After some time I want to
return to that and try to reproduce the problem on something else, say nvme.

> Are you perhaps using a block driver that returns BLK_STS_RESOURCE more
> often than other block drivers? Did you perhaps run into this with the
> Infiniband network block device (IBNBD) driver?

Yep, this is IBNBD, but in these tests I tested with mq scheduler, shared tags
and 1 hctx for each queue (blk device),  thus I never run out of internal tags
and never return BLK_STS_RESOURCE.

Indeed, the unmodified IBNBD does internal tag management.  This was needed
because each queue (block device) was created with a number of hctxs
(nr_hw_queues) equal to the number of CPUs on the system, but a blk-mq tag
set is shared only between hctxs, not globally, which led to the need to
return BLK_STS_RESOURCE and restart queues.

But with the mq scheduler the situation changed: 1 hctx with shared tags can
be specified for all of the hundreds of devices without any performance
impact.

Testing this configuration (1 hctx, shared tags, mq-deadline) immediately
shows these two problems: request stalling and slow loops inside
blk_mq_sched_restart().

> No matter what driver triggered this, I think this bug should be fixed.

Yes, the queue stalling can easily be fixed.  I can resend the current patch
with a shorter description that targets only this particular bug, if no one
else has objections, comments, etc.

But what bothers me is these looong loops inside blk_mq_sched_restart(),
and since you are the author of the original 6d8c6c0f97ad ("blk-mq: Restart
a single queue if tag sets are shared") I want to ask what was the original
problem which you attempted to fix?  Likely I am missing some test scenario
which would be great to know about.
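
For completeness, a rough sketch of the counter move I described above
(sketch only, simplified, not the posted patch; the helper name is made up):

struct blk_mq_tags {
	/* ... existing fields ... */
	atomic_t	shared_hctx_restart;	/* was: q->shared_hctx_restart */
};

static void mark_restart_shared(struct blk_mq_hw_ctx *hctx)
{
	if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state) &&
	    (hctx->flags & BLK_MQ_F_TAG_SHARED))
		atomic_inc(&hctx->tags->shared_hctx_restart);
}

That way every queue sharing the tag set observes the same count, and
blk_mq_sched_restart() cannot skip a shared hctx just because the queue it
happens to run on did not mark anything for restart.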

--
Roman


Re: [PATCH 0/6] v4 block refcount conversion patches

2017-10-20 Thread Johannes Thumshirn
Elena Reshetova  writes:
> Elena Reshetova (6):
>   block: convert bio.__bi_cnt from atomic_t to refcount_t
>   block: convert blk_queue_tag.refcnt from atomic_t to refcount_t
>   block: convert blkcg_gq.refcnt from atomic_t to refcount_t
>   block: convert io_context.active_ref from atomic_t to refcount_t
>   block: convert bsg_device.ref_count from atomic_t to refcount_t
>   drivers, block: convert xen_blkif.refcnt from atomic_t to refcount_t

Hi Elena,

While the bsg ref_count is cheap, do you have any numbers how the other
conversions compare in performance (throughput and latency) vs atomics?

It should be quite easy to measure against a null_blk device.

Thanks a lot,
   Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH 1/6] block: convert bio.__bi_cnt from atomic_t to refcount_t

2017-10-20 Thread Elena Reshetova
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to a use-after-free situation and be exploitable.

The variable bio.__bi_cnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook 
Reviewed-by: David Windsor 
Reviewed-by: Hans Liljestrand 
Signed-off-by: Elena Reshetova 
---
 block/bio.c   | 6 +++---
 fs/btrfs/volumes.c| 2 +-
 include/linux/bio.h   | 4 ++--
 include/linux/blk_types.h | 3 ++-
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 101c2a9..58edc1b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -279,7 +279,7 @@ void bio_init(struct bio *bio, struct bio_vec *table,
 {
memset(bio, 0, sizeof(*bio));
atomic_set(&bio->__bi_remaining, 1);
-   atomic_set(&bio->__bi_cnt, 1);
+   refcount_set(&bio->__bi_cnt, 1);
 
bio->bi_io_vec = table;
bio->bi_max_vecs = max_vecs;
@@ -557,12 +557,12 @@ void bio_put(struct bio *bio)
if (!bio_flagged(bio, BIO_REFFED))
bio_free(bio);
else {
-   BIO_BUG_ON(!atomic_read(&bio->__bi_cnt));
+   BIO_BUG_ON(!refcount_read(&bio->__bi_cnt));
 
/*
 * last put frees it
 */
-   if (atomic_dec_and_test(&bio->__bi_cnt))
+   if (refcount_dec_and_test(&bio->__bi_cnt))
bio_free(bio);
}
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b397375..11812ee 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -450,7 +450,7 @@ static noinline void run_scheduled_bios(struct btrfs_device 
*device)
waitqueue_active(&fs_info->async_submit_wait))
wake_up(&fs_info->async_submit_wait);
 
-   BUG_ON(atomic_read(&cur->__bi_cnt) == 0);
+   BUG_ON(refcount_read(&cur->__bi_cnt) == 0);
 
/*
 * if we're doing the sync list, record that our
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 275c91c..0fa4dd2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -253,7 +253,7 @@ static inline void bio_get(struct bio *bio)
 {
bio->bi_flags |= (1 << BIO_REFFED);
smp_mb__before_atomic();
-   atomic_inc(&bio->__bi_cnt);
+   refcount_inc(&bio->__bi_cnt);
 }
 
 static inline void bio_cnt_set(struct bio *bio, unsigned int count)
@@ -262,7 +262,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned 
int count)
bio->bi_flags |= (1 << BIO_REFFED);
smp_mb__before_atomic();
}
-   atomic_set(&bio->__bi_cnt, count);
+   refcount_set(&bio->__bi_cnt, count);
 }
 
 static inline bool bio_flagged(struct bio *bio, unsigned int bit)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a2d2aa7..1ec370e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 
 struct bio_set;
 struct bio;
@@ -104,7 +105,7 @@ struct bio {
 
unsigned short  bi_max_vecs;/* max bvl_vecs we can hold */
 
-   atomic_t__bi_cnt;   /* pin count */
+   refcount_t  __bi_cnt;   /* pin count */
 
struct bio_vec  *bi_io_vec; /* the actual vec list */
 
-- 
2.7.4



[PATCH 0/6] v4 block refcount conversion patches

2017-10-20 Thread Elena Reshetova
Changes in v4:
 - Improved commit messages and signoff info.
 - Rebase on top of linux-next as of yesterday.
 - WARN_ONs are restored since x86 refcount_t does not WARN on zero

Changes in v3:
No changes in patches apart from trivial rebases, but now by
default refcount_t = atomic_t and uses all atomic standard operations
unless CONFIG_REFCOUNT_FULL is enabled. This is a compromise for
systems that are performance-critical and cannot accept even a
slight delay in the refcounter operations.

Changes in v2:
WARNs that are no longer needed are removed, since refcount_t warns by itself.
BUG_ONs are left as they are, since refcount_t doesn't BUG by default.

This series, for block subsystem, replaces atomic_t reference
counters with the new refcount_t type and API (see include/linux/refcount.h).
By doing this we prevent intentional or accidental
underflows or overflows that can lead to use-after-free vulnerabilities.

The patches are fully independent and can be cherry-picked separately.
If there are no objections to the patches, please merge them via respective 
trees.

Elena Reshetova (6):
  block: convert bio.__bi_cnt from atomic_t to refcount_t
  block: convert blk_queue_tag.refcnt from atomic_t to refcount_t
  block: convert blkcg_gq.refcnt from atomic_t to refcount_t
  block: convert io_context.active_ref from atomic_t to refcount_t
  block: convert bsg_device.ref_count from atomic_t to refcount_t
  drivers, block: convert xen_blkif.refcnt from atomic_t to refcount_t

 block/bfq-iosched.c|  2 +-
 block/bio.c|  6 +++---
 block/blk-cgroup.c |  2 +-
 block/blk-ioc.c|  4 ++--
 block/blk-tag.c|  8 
 block/bsg.c|  9 +
 block/cfq-iosched.c|  4 ++--
 drivers/block/xen-blkback/common.h |  7 ---
 drivers/block/xen-blkback/xenbus.c |  2 +-
 fs/btrfs/volumes.c |  2 +-
 include/linux/bio.h|  4 ++--
 include/linux/blk-cgroup.h | 11 ++-
 include/linux/blk_types.h  |  3 ++-
 include/linux/blkdev.h |  3 ++-
 include/linux/iocontext.h  |  7 ---
 15 files changed, 40 insertions(+), 34 deletions(-)

-- 
2.7.4



[PATCH 3/6] block: convert blkcg_gq.refcnt from atomic_t to refcount_t

2017-10-20 Thread Elena Reshetova
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to a use-after-free situation and be exploitable.

The variable blkcg_gq.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook 
Reviewed-by: David Windsor 
Reviewed-by: Hans Liljestrand 
Signed-off-by: Elena Reshetova 
---
 block/blk-cgroup.c |  2 +-
 include/linux/blk-cgroup.h | 11 ++-
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d3f56ba..1e7cedc 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -107,7 +107,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, 
struct request_queue *q,
blkg->q = q;
INIT_LIST_HEAD(&blkg->q_node);
blkg->blkcg = blkcg;
-   atomic_set(&blkg->refcnt, 1);
+   refcount_set(&blkg->refcnt, 1);
 
/* root blkg uses @q->root_rl, init rl only for !root blkgs */
if (blkcg != &blkcg_root) {
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 9d92153..c95d29d 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* percpu_counter batch for blkg_[rw]stats, per-cpu drift doesn't matter */
 #define BLKG_STAT_CPU_BATCH(INT_MAX / 2)
@@ -122,7 +123,7 @@ struct blkcg_gq {
struct request_list rl;
 
/* reference count */
-   atomic_trefcnt;
+   refcount_t  refcnt;
 
/* is this blkg online? protected by both blkcg and q locks */
boolonline;
@@ -354,8 +355,8 @@ static inline int blkg_path(struct blkcg_gq *blkg, char 
*buf, int buflen)
  */
 static inline void blkg_get(struct blkcg_gq *blkg)
 {
-   WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
-   atomic_inc(&blkg->refcnt);
+   WARN_ON_ONCE(refcount_read(&blkg->refcnt) == 0);
+   refcount_inc(&blkg->refcnt);
 }
 
 void __blkg_release_rcu(struct rcu_head *rcu);
@@ -366,8 +367,8 @@ void __blkg_release_rcu(struct rcu_head *rcu);
  */
 static inline void blkg_put(struct blkcg_gq *blkg)
 {
-   WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
-   if (atomic_dec_and_test(&blkg->refcnt))
+   WARN_ON_ONCE(refcount_read(&blkg->refcnt) == 0);
+   if (refcount_dec_and_test(&blkg->refcnt))
call_rcu(&blkg->rcu_head, __blkg_release_rcu);
 }
 
-- 
2.7.4



[PATCH 2/6] block: convert blk_queue_tag.refcnt from atomic_t to refcount_t

2017-10-20 Thread Elena Reshetova
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to a use-after-free situation and be exploitable.

The variable blk_queue_tag.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook 
Reviewed-by: David Windsor 
Reviewed-by: Hans Liljestrand 
Signed-off-by: Elena Reshetova 
---
 block/blk-tag.c| 8 
 include/linux/blkdev.h | 3 ++-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/block/blk-tag.c b/block/blk-tag.c
index e1a9c15..a7263e3 100644
--- a/block/blk-tag.c
+++ b/block/blk-tag.c
@@ -35,7 +35,7 @@ EXPORT_SYMBOL(blk_queue_find_tag);
  */
 void blk_free_tags(struct blk_queue_tag *bqt)
 {
-   if (atomic_dec_and_test(&bqt->refcnt)) {
+   if (refcount_dec_and_test(&bqt->refcnt)) {
BUG_ON(find_first_bit(bqt->tag_map, bqt->max_depth) <
bqt->max_depth);
 
@@ -130,7 +130,7 @@ static struct blk_queue_tag *__blk_queue_init_tags(struct 
request_queue *q,
if (init_tag_map(q, tags, depth))
goto fail;
 
-   atomic_set(&tags->refcnt, 1);
+   refcount_set(&tags->refcnt, 1);
tags->alloc_policy = alloc_policy;
tags->next_tag = 0;
return tags;
@@ -180,7 +180,7 @@ int blk_queue_init_tags(struct request_queue *q, int depth,
queue_flag_set(QUEUE_FLAG_QUEUED, q);
return 0;
} else
-   atomic_inc(&tags->refcnt);
+   refcount_inc(&tags->refcnt);
 
/*
 * assign it, all done
@@ -225,7 +225,7 @@ int blk_queue_resize_tags(struct request_queue *q, int 
new_depth)
 * Currently cannot replace a shared tag map with a new
 * one, so error out if this is the case
 */
-   if (atomic_read(&bqt->refcnt) != 1)
+   if (refcount_read(&bqt->refcnt) != 1)
return -EBUSY;
 
/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 02fa42d..1fefdbb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct module;
 struct scsi_ioctl_command;
@@ -295,7 +296,7 @@ struct blk_queue_tag {
unsigned long *tag_map; /* bit map of free/busy tags */
int max_depth;  /* what we will send to device */
int real_max_depth; /* what the array can hold */
-   atomic_t refcnt;/* map can be shared */
+   refcount_t refcnt;  /* map can be shared */
int alloc_policy;   /* tag allocation policy */
int next_tag;   /* next tag */
 };
-- 
2.7.4



[PATCH 6/6] drivers, block: convert xen_blkif.refcnt from atomic_t to refcount_t

2017-10-20 Thread Elena Reshetova
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to a use-after-free situation and be exploitable.

The variable xen_blkif.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook 
Reviewed-by: David Windsor 
Reviewed-by: Hans Liljestrand 
Signed-off-by: Elena Reshetova 
---
 drivers/block/xen-blkback/common.h | 7 ---
 drivers/block/xen-blkback/xenbus.c | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/block/xen-blkback/common.h 
b/drivers/block/xen-blkback/common.h
index ecb35fe..0c3320d 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -319,7 +320,7 @@ struct xen_blkif {
struct xen_vbd  vbd;
/* Back pointer to the backend_info. */
struct backend_info *be;
-   atomic_trefcnt;
+   refcount_t  refcnt;
/* for barrier (drain) requests */
struct completion   drain_complete;
atomic_tdrain;
@@ -372,10 +373,10 @@ struct pending_req {
 (_v)->bdev->bd_part->nr_sects : \
  get_capacity((_v)->bdev->bd_disk))
 
-#define xen_blkif_get(_b) (atomic_inc(&(_b)->refcnt))
+#define xen_blkif_get(_b) (refcount_inc(&(_b)->refcnt))
 #define xen_blkif_put(_b)  \
do {\
-   if (atomic_dec_and_test(&(_b)->refcnt)) \
+   if (refcount_dec_and_test(&(_b)->refcnt))   \
schedule_work(&(_b)->free_work);\
} while (0)
 
diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index 21c1be1..5955b61 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -176,7 +176,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
return ERR_PTR(-ENOMEM);
 
blkif->domid = domid;
-   atomic_set(&blkif->refcnt, 1);
+   refcount_set(&blkif->refcnt, 1);
init_completion(&blkif->drain_complete);
INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);
 
-- 
2.7.4



[PATCH 4/6] block: convert io_context.active_ref from atomic_t to refcount_t

2017-10-20 Thread Elena Reshetova
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to a use-after-free situation and be exploitable.

The variable io_context.active_ref is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook 
Reviewed-by: David Windsor 
Reviewed-by: Hans Liljestrand 
Signed-off-by: Elena Reshetova 
---
 block/bfq-iosched.c   | 2 +-
 block/blk-ioc.c   | 4 ++--
 block/cfq-iosched.c   | 4 ++--
 include/linux/iocontext.h | 7 ---
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a4783da..1ec9b22 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4030,7 +4030,7 @@ static void bfq_update_has_short_ttime(struct bfq_data 
*bfqd,
 * bfqq. Otherwise check average think time to
 * decide whether to mark as has_short_ttime
 */
-   if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
+   if (refcount_read(&bic->icq.ioc->active_ref) == 0 ||
(bfq_sample_valid(bfqq->ttime.ttime_samples) &&
 bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle))
has_short_ttime = false;
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 63898d2..69704d2 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -176,7 +176,7 @@ void put_io_context_active(struct io_context *ioc)
unsigned long flags;
struct io_cq *icq;
 
-   if (!atomic_dec_and_test(&ioc->active_ref)) {
+   if (!refcount_dec_and_test(&ioc->active_ref)) {
put_io_context(ioc);
return;
}
@@ -275,7 +275,7 @@ int create_task_io_context(struct task_struct *task, gfp_t 
gfp_flags, int node)
/* initialize */
atomic_long_set(&ioc->refcount, 1);
atomic_set(&ioc->nr_tasks, 1);
-   atomic_set(&ioc->active_ref, 1);
+   refcount_set(&ioc->active_ref, 1);
spin_lock_init(&ioc->lock);
INIT_RADIX_TREE(&ioc->icq_tree, GFP_ATOMIC | __GFP_HIGH);
INIT_HLIST_HEAD(&ioc->icq_list);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9f342ef..e6d5d6d 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2941,7 +2941,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 * task has exited, don't wait
 */
cic = cfqd->active_cic;
-   if (!cic || !atomic_read(&cic->icq.ioc->active_ref))
+   if (!cic || !refcount_read(&cic->icq.ioc->active_ref))
return;
 
/*
@@ -3933,7 +3933,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct 
cfq_queue *cfqq,
 
if (cfqq->next_rq && req_noidle(cfqq->next_rq))
enable_idle = 0;
-   else if (!atomic_read(&cic->icq.ioc->active_ref) ||
+   else if (!refcount_read(&cic->icq.ioc->active_ref) ||
 !cfqd->cfq_slice_idle ||
 (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
enable_idle = 0;
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index df38db2..a1e28c3 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 enum {
@@ -96,7 +97,7 @@ struct io_cq {
  */
 struct io_context {
atomic_long_t refcount;
-   atomic_t active_ref;
+   refcount_t active_ref;
atomic_t nr_tasks;
 
/* all the fields below are protected by this lock */
@@ -128,9 +129,9 @@ struct io_context {
 static inline void get_io_context_active(struct io_context *ioc)
 {
WARN_ON_ONCE(atomic_long_read(&ioc->refcount) <= 0);
-   WARN_ON_ONCE(atomic_read(&ioc->active_ref) <= 0);
+   WARN_ON_ONCE(refcount_read(&ioc->active_ref) == 0);
atomic_long_inc(&ioc->refcount);
-   atomic_inc(&ioc->active_ref);
+   refcount_inc(&ioc->active_ref);
 }
 
 static inline void ioc_task_link(struct io_context *ioc)
-- 
2.7.4



[PATCH 5/6] block: convert bsg_device.ref_count from atomic_t to refcount_t

2017-10-20 Thread Elena Reshetova
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to a use-after-free situation and be exploitable.

The variable bsg_device.ref_count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook 
Reviewed-by: David Windsor 
Reviewed-by: Hans Liljestrand 
Signed-off-by: Elena Reshetova 
---
 block/bsg.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index ee1335c..6c98422 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -38,7 +39,7 @@ struct bsg_device {
struct list_head busy_list;
struct list_head done_list;
struct hlist_node dev_list;
-   atomic_t ref_count;
+   refcount_t ref_count;
int queued_cmds;
int done_cmds;
wait_queue_head_t wq_done;
@@ -710,7 +711,7 @@ static int bsg_put_device(struct bsg_device *bd)
 
mutex_lock(&bsg_mutex);
 
-   do_free = atomic_dec_and_test(&bd->ref_count);
+   do_free = refcount_dec_and_test(&bd->ref_count);
if (!do_free) {
mutex_unlock(&bsg_mutex);
goto out;
@@ -768,7 +769,7 @@ static struct bsg_device *bsg_add_device(struct inode 
*inode,
 
bsg_set_block(bd, file);
 
-   atomic_set(&bd->ref_count, 1);
+   refcount_set(&bd->ref_count, 1);
mutex_lock(&bsg_mutex);
hlist_add_head(&bd->dev_list, bsg_dev_idx_hash(iminor(inode)));
 
@@ -788,7 +789,7 @@ static struct bsg_device *__bsg_get_device(int minor, 
struct request_queue *q)
 
hlist_for_each_entry(bd, bsg_dev_idx_hash(minor), dev_list) {
if (bd->queue == q) {
-   atomic_inc(&bd->ref_count);
+   refcount_inc(&bd->ref_count);
goto found;
}
}
-- 
2.7.4