[PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-04 Thread Kashyap Desai
Problem statement:
Whenever the driver tries to get an outstanding request via scsi_host_find_tag,
the block layer may return stale entries instead of the actual outstanding
request. The kernel panics if the stale entry is inaccessible or its memory has
been reused.
Fix:
Undo the request mapping in blk_mq_put_driver_tag once the request is returned.

More detail:
Whenever an SDEV entry is created, the block layer allocates separate tags
and static requests. Those requests are no longer valid after the SDEV is
deleted from the system. On the fly, the block layer maps static rqs to rqs
as below from blk_mq_get_driver_tag():
from blk_mq_get_driver_tag()

data.hctx->tags->rqs[rq->tag] = rq;

The above mapping is for active in-use requests, and it is the same mapping
that is referred to in scsi_host_find_tag().
After running some IOs, “data.hctx->tags->rqs[rq->tag]” will contain some
entries which are never reset in the block layer.
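For illustration, a rough sketch (not the exact kernel code) of why such a
stale entry is dangerous: the lookup simply indexes the rqs[] array and hands
back whatever pointer is stored there, so a driver calling scsi_host_find_tag()
(which ends up in blk_mq_tag_to_rq() on blk-mq) can receive a pointer into
memory that has since been freed or reused:

/* Simplified, illustrative version of the tag-to-request lookup. */
struct request *example_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
{
	if (tag < tags->nr_tags)
		return tags->rqs[tag];	/* may still point at a freed/reused request */
	return NULL;
}

With the change below, __blk_mq_put_driver_tag() clears the slot when the
driver tag is released, so a later lookup sees NULL instead of a dangling
pointer.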

There will be a kernel panic if the request pointed to by
“data.hctx->tags->rqs[rq->tag]” belongs to an sdev that has been removed,
because all the memory allocated for requests associated with that sdev
might have been reused or become inaccessible to the driver.
Kernel panic snippet -

BUG: unable to handle kernel paging request at ff800010
IP: [] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
PGD aa4414067 PUD 0
Oops:  [#1] SMP
Call Trace:
 [] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
 [] scsih_shutdown+0x55/0x100 [mpt3sas]

Cc: 
Signed-off-by: Kashyap Desai 
Signed-off-by: Sreekanth Reddy 


---
 block/blk-mq.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47..57432be 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -175,6 +175,7 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
struct request *rq)
 {
+hctx->tags->rqs[rq->tag] = NULL;
 blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
 rq->tag = -1;

-- 
1.8.3.1


Re: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-04 Thread Ming Lei
On Tue, Dec 04, 2018 at 03:30:11PM +0530, Kashyap Desai wrote:
> Problem statement:
> Whenever the driver tries to get an outstanding request via scsi_host_find_tag,
> the block layer may return stale entries instead of the actual outstanding
> request. The kernel panics if the stale entry is inaccessible or its memory
> has been reused.
> Fix:
> Undo the request mapping in blk_mq_put_driver_tag once the request is returned.
> 
> More detail:
> Whenever an SDEV entry is created, the block layer allocates separate tags
> and static requests. Those requests are no longer valid after the SDEV is
> deleted from the system. On the fly, the block layer maps static rqs to rqs
> as below from blk_mq_get_driver_tag()
> 
> data.hctx->tags->rqs[rq->tag] = rq;
> 
> The above mapping is for active in-use requests, and it is the same mapping
> that is referred to in scsi_host_find_tag().
> After running some IOs, “data.hctx->tags->rqs[rq->tag]” will contain some
> entries which are never reset in the block layer.

However, if rq & rq->tag are valid, data.hctx->tags->rqs[rq->tag] should
point to an active request rather than a stale one, right?

Thanks,
Ming


Re: [PATCH v5 0/5] lightnvm: Flexible metadata

2018-12-04 Thread Hans Holmberg
On Mon, Dec 3, 2018 at 9:51 AM Hans Holmberg
 wrote:
>
> Great! The tests(rocksdb, pblk recovery and the generic xfs suite)
> completed successfully on one of our disks, so feel free to add:
>
> Tested-by: Hans Holmberg 
>

The tests completed fine on our hardware (which has metadata support),
but fail when I run the code on a qemu disk without metadata support.
I started to look into what goes wrong, and it seems the ppa list gets
corrupted during smeta and padding writes.

Building the kernel with CONFIG_NVM_PBLK_DEBUG, I see these errors:
[  159.735409] pblk : ppa: (boundary: 1) cache line: 9223372036854775807

I also got this during recovery:

[ 5384.511939] BUG: KASAN: use-after-free in pblk_get_packed_meta+0x5e/0x151
[ 5384.513036] Read of size 8 at addr 8881edc220e0 by task nvme/9147

There might of course be something very wrong with my qemu setup, but
it looks more like something went wrong in the later revisions of
the patch set (I tested one of the first versions and did not see
these issues then).

Igor: If you want it, I can share my qemu environment, more debugging
info and a couple of fixups I've done while looking into this.

Thanks,
Hans

> Thanks,
> Hans
> On Fri, Nov 30, 2018 at 2:03 PM Hans Holmberg
>  wrote:
> >
> > I just started a regression test on this patch set that'll run over
> > the weekend. I'll add a tested-by if everything checks out.
> >
> > All the best,
> > Hans
> > On Fri, Nov 30, 2018 at 12:49 PM Igor Konopko  
> > wrote:
> > >
> > > This series of patches extends the way pblk can
> > > store L2P sector metadata. After this set of changes
> > > any size of NVMe metadata is supported in pblk.
> > > There is also support for the case without NVMe metadata.
> > >
> > > Changes v4 --> v5:
> > > -rebase on top of ocssd/for-4.21/core
> > >
> > > Changes v3 --> v4:
> > > -rename nvm_alloc_dma_pool() to nvm_create_dma_pool()
> > > -split pblk_get_meta() calls and lba setting into
> > > two operations for better core readability
> > > -fixing compilation with CONFIG_NVM disabled
> > > -getting rid of unnecessary memcpy for packed metadata
> > > on write path
> > > -support for drives with oob size >0 and <16B in packed
> > > metadata mode
> > > -minor commit message updates
> > >
> > > Changes v2 --> v3:
> > > -Rebase on top of ocssd/for-4.21/core
> > > -get/set_meta_lba helpers were removed
> > > -dma reallocation was replaced with single allocation
> > > -oob metadata size was added to pblk structure
> > > -proper checks on pblk creation were added
> > >
> > > Changes v1 --> v2:
> > > -Revert sector meta size back to 16b for pblk
> > > -Dma pool for larger oob meta are handled in core instead of pblk
> > > -Pblk oob meta helpers use __le64 as input/output instead of u64
> > > -Other minor fixes based on v1 patch review
> > >
> > > Igor Konopko (5):
> > >   lightnvm: pblk: Move lba list to partial read context
> > >   lightnvm: pblk: Helpers for OOB metadata
> > >   lightnvm: Flexible DMA pool entry size
> > >   lightnvm: Disable interleaved metadata
> > >   lightnvm: pblk: Support for packed metadata
> > >
> > >  drivers/lightnvm/core.c  |  9 --
> > >  drivers/lightnvm/pblk-core.c | 61 +++--
> > >  drivers/lightnvm/pblk-init.c | 44 +--
> > >  drivers/lightnvm/pblk-map.c  | 20 +++-
> > >  drivers/lightnvm/pblk-rb.c   |  3 ++
> > >  drivers/lightnvm/pblk-read.c | 66 +++-
> > >  drivers/lightnvm/pblk-recovery.c | 25 +--
> > >  drivers/lightnvm/pblk-sysfs.c|  7 +
> > >  drivers/lightnvm/pblk-write.c|  9 +++---
> > >  drivers/lightnvm/pblk.h  | 24 +--
> > >  drivers/nvme/host/lightnvm.c |  6 ++--
> > >  include/linux/lightnvm.h |  3 +-
> > >  12 files changed, 209 insertions(+), 68 deletions(-)
> > >
> > > --
> > > 2.14.5
> > >


Re: [PATCH 02/27] aio: clear IOCB_HIPRI

2018-12-04 Thread Christoph Hellwig
On Fri, Nov 30, 2018 at 10:14:31AM -0700, Jens Axboe wrote:
> On 11/30/18 10:13 AM, Christoph Hellwig wrote:
> > I think we'll need to queue this up for 4.21 ASAP independent of the
> > rest, given that with separate poll queues userspace could otherwise
> > submit I/O that will never get polled for anywhere.
> 
> Probably a good idea, I can just move it to my 4.21 branch, it's not
> strictly dependent on the series.

So, can you add it to the 4.21 branch?


Re: [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT

2018-12-04 Thread Christoph Hellwig
On Fri, Nov 30, 2018 at 10:17:49AM -0700, Jens Axboe wrote:
> > Setting REQ_NOWAIT from inside the block layer will make the code that
> > submits requests harder to review. Have you considered to make this code
> > fail I/O if REQ_NOWAIT has not been set and to require that the context
> > that submits I/O sets REQ_NOWAIT?
> 
> It's technically still feasible to do for sync polled IO, it's only
> the async case that makes it a potential deadlock.

I wonder if we want a REQ_ASYNC_POLL compound flag #define that sets
REQ_POLL and REQ_NOWAIT to make this blindly obvious.
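For reference, the compound flag suggested above would presumably look
something like this (a sketch only; assuming the polled-IO flag is spelled
REQ_HIPRI as elsewhere in this series):

#define REQ_ASYNC_POLL	(REQ_HIPRI | REQ_NOWAIT)

so any async submitter asking for polled completion automatically gets the
nowait semantics as well.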


Re: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-04 Thread Bart Van Assche

On 12/4/18 2:00 AM, Kashyap Desai wrote:

Problem statement:
Whenever the driver tries to get an outstanding request via scsi_host_find_tag,
the block layer may return stale entries instead of the actual outstanding
request. The kernel panics if the stale entry is inaccessible or its memory
has been reused.
Fix:
Undo the request mapping in blk_mq_put_driver_tag once the request is returned.

More detail:
Whenever an SDEV entry is created, the block layer allocates separate tags
and static requests. Those requests are no longer valid after the SDEV is
deleted from the system. On the fly, the block layer maps static rqs to rqs
as below from blk_mq_get_driver_tag()

data.hctx->tags->rqs[rq->tag] = rq;

The above mapping is for active in-use requests, and it is the same mapping
that is referred to in scsi_host_find_tag().
After running some IOs, “data.hctx->tags->rqs[rq->tag]” will contain some
entries which are never reset in the block layer.

There will be a kernel panic if the request pointed to by
“data.hctx->tags->rqs[rq->tag]” belongs to an sdev that has been removed,
because all the memory allocated for requests associated with that sdev
might have been reused or become inaccessible to the driver.
Kernel panic snippet -

BUG: unable to handle kernel paging request at ff800010
IP: [] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
PGD aa4414067 PUD 0
Oops:  [#1] SMP
Call Trace:
  [] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
  [] scsih_shutdown+0x55/0x100 [mpt3sas]

Cc: 
Signed-off-by: Kashyap Desai 
Signed-off-by: Sreekanth Reddy 


---
  block/blk-mq.h | 1 +
  1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47..57432be 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -175,6 +175,7 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
  static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 struct request *rq)
  {
+hctx->tags->rqs[rq->tag] = NULL;
  blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
  rq->tag = -1;


No SCSI driver should call scsi_host_find_tag() after a request has 
finished. The above patch introduces yet another race and hence can't be 
a proper fix.


Bart.



Re: [PATCH 10/27] aio: don't zero entire aio_kiocb aio_get_req()

2018-12-04 Thread Christoph Hellwig
> - req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
> - if (unlikely(!req))
> - return NULL;
> + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
> + if (req) {
> + percpu_ref_get(&ctx->reqs);
> + req->ki_ctx = ctx;
> + INIT_LIST_HEAD(&req->ki_list);
> + refcount_set(&req->ki_refcnt, 0);
> + req->ki_eventfd = NULL;
> + }
>  
> - percpu_ref_get(&ctx->reqs);
> - INIT_LIST_HEAD(&req->ki_list);
> - refcount_set(&req->ki_refcnt, 0);
> - req->ki_ctx = ctx;
>   return req;

Why the reformatting?  Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig 


Re: [PATCH 11/27] aio: only use blk plugs for > 2 depth submissions

2018-12-04 Thread Christoph Hellwig
On Fri, Nov 30, 2018 at 09:56:30AM -0700, Jens Axboe wrote:
> Plugging is meant to optimize submission of a string of IOs, if we don't
> have more than 2 being submitted, don't bother setting up a plug.
> 
> Signed-off-by: Jens Axboe 

Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 12/27] aio: use iocb_put() instead of open coding it

2018-12-04 Thread Christoph Hellwig
On Fri, Nov 30, 2018 at 09:56:31AM -0700, Jens Axboe wrote:
> Replace the percpu_ref_put() + kmem_cache_free() with a call to
> iocb_put() instead.
> 
> Signed-off-by: Jens Axboe 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio

2018-12-04 Thread Christoph Hellwig
On Fri, Nov 30, 2018 at 07:21:02PM +, Al Viro wrote:
> On Fri, Nov 30, 2018 at 09:56:43AM -0700, Jens Axboe wrote:
> > For an ITER_KVEC, we can just iterate the iov and add the pages
> > to the bio directly.
> 
> > +   page = virt_to_page(kv->iov_base);
> > +   size = bio_add_page(bio, page, kv->iov_len,
> > +   offset_in_page(kv->iov_base));
> 
> Who said that you *can* do virt_to_page() on those?  E.g. vmalloc()'ed
> addresses are fine for ITER_KVEC, etc.

In this particular case it seems the caller knows what kind of pages we
have, but we need to properly document this at the very least.

Note that in the completely generic case iov_base could also point to
memory without a struct page at all (e.g. ioremap()ed memory).


Re: [PATCH 01/13] block: move queues types to the block layer

2018-12-04 Thread Christoph Hellwig
On Mon, Dec 03, 2018 at 04:49:56PM -0800, Sagi Grimberg wrote:
>> @@ -103,12 +101,17 @@ static inline struct blk_mq_hw_ctx 
>> *blk_mq_map_queue(struct request_queue *q,
>>   unsigned int flags,
>>   unsigned int cpu)
>>   {
>> -int hctx_type = 0;
>> +enum hctx_type type = HCTX_TYPE_DEFAULT;
>> +
>> +if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
>> +((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
>> +type = HCTX_TYPE_POLL;
>>   -  if (q->mq_ops->rq_flags_to_type)
>> -hctx_type = q->mq_ops->rq_flags_to_type(q, flags);
>> +else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
>> + ((flags & REQ_OP_MASK) == REQ_OP_READ))
>> +type = HCTX_TYPE_READ;
>
> Nit, there seems to be an extra newline that can be omitted here before
> the else if statement (if I'm reading this correctly)...

Empty lines can always be omitted, but in this case I actually like it
as it seems to help readability.


Re: [PATCH v6 0/2] arm64: provide a NEON-accelerated XOR algorithm extension

2018-12-04 Thread Christoph Hellwig
Why does this go to linux-block?


Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled

2018-12-04 Thread Christoph Hellwig
On Mon, Dec 03, 2018 at 04:54:15PM -0800, Sagi Grimberg wrote:
>
>> @@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>>  if (nr_io_queues == 0)
>>  return 0;
>> +
>> +clear_bit(NVMEQ_ENABLED, &adminq->flags);
>>   
>
> This is a change of behavior, looks correct though as we can fail
> nvme_setup_irqs after we freed the admin vector. Needs documentation 
> though..

I have a hard time parsing the above, can you point to where the problem
is?


Re: [PATCH v6 0/2] arm64: provide a NEON-accelerated XOR algorithm extension

2018-12-04 Thread Ard Biesheuvel
On Tue, 4 Dec 2018 at 16:01, Christoph Hellwig  wrote:
>
> Why does this go to linux-block?

Because xor_blocks() is part of the RAID driver?


Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues

2018-12-04 Thread Christoph Hellwig
On Mon, Dec 03, 2018 at 04:58:25PM -0800, Sagi Grimberg wrote:
>
>> +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)
>
> Do we still need to carry the tag around?

Yes, the timeout handler polls for a specific tag.


Re: [PATCH v6 0/2] arm64: provide a NEON-accelerated XOR algorithm extension

2018-12-04 Thread Will Deacon
On Tue, Dec 04, 2018 at 07:01:24AM -0800, Christoph Hellwig wrote:
> Why does this go to linux-block?

FWIW, I'll take this via arm64 so feel free to ignore.

Will


Re: [PATCH 10/13] nvme-mpath: remove I/O polling support

2018-12-04 Thread Christoph Hellwig
On Mon, Dec 03, 2018 at 05:11:43PM -0800, Sagi Grimberg wrote:
>> If it really becomes an issue we
>> should rework the nvme code to also skip the multipath code for any
>> private namespace, even if that could mean some trouble when rescanning.
>>
>
> This requires some explanation? skip the multipath code how?

We currently always go through the multipath node as long as the
controller is multipath capable.  If we care about e.g. polling
on a private namespace on a dual ported U.2 drive we'd have to make
sure we don't go through the multipath device node for private namespaces
that can only have one path, but only for shared namespaces.


Re: [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues

2018-12-04 Thread Christoph Hellwig
On Mon, Dec 03, 2018 at 05:00:59PM -0800, Sagi Grimberg wrote:
>
>> @@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
>> bool shutdown)
>>  nvme_stop_queues(&dev->ctrl);
>>  if (!dead && dev->ctrl.queue_count > 0) {
>> -nvme_disable_io_queues(dev);
>> +if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
>> +
>
> Would be nice if the opcode change would be kept inside but still
> split like:

I actually like not having another wrapper to stop through..

Keith, Jens, any preference?


Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio

2018-12-04 Thread Jens Axboe
On 12/4/18 7:55 AM, Christoph Hellwig wrote:
> On Fri, Nov 30, 2018 at 07:21:02PM +, Al Viro wrote:
>> On Fri, Nov 30, 2018 at 09:56:43AM -0700, Jens Axboe wrote:
>>> For an ITER_KVEC, we can just iterate the iov and add the pages
>>> to the bio directly.
>>
>>> +   page = virt_to_page(kv->iov_base);
>>> +   size = bio_add_page(bio, page, kv->iov_len,
>>> +   offset_in_page(kv->iov_base));
>>
>> Who said that you *can* do virt_to_page() on those?  E.g. vmalloc()'ed
>> addresses are fine for ITER_KVEC, etc.
> 
> In this particular case it seems the caller knows what kind of pages we
> have, but we need to properly document this at the very least.
> 
> Note that in the completely generic case iov_base could also point to
> memory without a struct page at all (e.g. ioremap()ed memory).

That's why I went to ITER_BVEC instead, that's much saner for this
use case.

-- 
Jens Axboe



Re: [PATCH 10/27] aio: don't zero entire aio_kiocb aio_get_req()

2018-12-04 Thread Jens Axboe
On 12/4/18 7:49 AM, Christoph Hellwig wrote:
>> -req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
>> -if (unlikely(!req))
>> -return NULL;
>> +req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
>> +if (req) {
>> +percpu_ref_get(&ctx->reqs);
>> +req->ki_ctx = ctx;
>> +INIT_LIST_HEAD(&req->ki_list);
>> +refcount_set(&req->ki_refcnt, 0);
>> +req->ki_eventfd = NULL;
>> +}
>>  
>> -percpu_ref_get(&ctx->reqs);
>> -INIT_LIST_HEAD(&req->ki_list);
>> -refcount_set(&req->ki_refcnt, 0);
>> -req->ki_ctx = ctx;
>>  return req;
> 
> Why the reformatting?  Otherwise this looks fine to me:
> 
> Reviewed-by: Christoph Hellwig 

Probably just the (over) abuse of likely/unlikely in aio.c. I can get
rid of it.

-- 
Jens Axboe



Re: [PATCH v6 0/2] arm64: provide a NEON-accelerated XOR algorithm extension

2018-12-04 Thread Christoph Hellwig
On Tue, Dec 04, 2018 at 04:02:36PM +0100, Ard Biesheuvel wrote:
> On Tue, 4 Dec 2018 at 16:01, Christoph Hellwig  wrote:
> >
> > Why does this go to linux-block?
> 
> Because xor_blocks() is part of the RAID driver?

The only caller of xor_blocks() seems to be btrfs.  And the RAID code has its
own list anyway.


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Keith Busch
On Mon, Dec 03, 2018 at 05:33:06PM -0800, Sagi Grimberg wrote:
> >>
> > Yes, I'm very much in favour of this, too.
> > We always have this IMO slightly weird notion of stopping the queue, set 
> > some error flags in the driver, then _restarting_ the queue, just so 
> > that the driver then sees the error flag and terminates the requests.
> > Which I always found quite counter-intuitive.
> 
> What about requests that come in after the iteration runs? how are those
> terminated?

If we've reached a dead state, I think you'd want to start a queue freeze
before running the terminating iterator.
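Roughly the sequence being discussed, as a sketch (the example_ names and the
terminating callback are hypothetical; nvme_start_freeze(), nvme_stop_queues()
and blk_mq_tagset_busy_iter() are the existing helpers):

/* Hypothetical terminating iterator callback. */
static void example_terminate_rq(struct request *rq, void *data, bool reserved)
{
	/* A real driver would use its cancel helper, e.g. nvme_cancel_request(). */
	blk_mq_end_request(rq, BLK_STS_IOERR);
}

static void example_teardown(struct nvme_ctrl *ctrl)
{
	nvme_start_freeze(ctrl);	/* no new requests enter the queues */
	nvme_stop_queues(ctrl);		/* quiesce: no more ->queue_rq() calls */
	blk_mq_tagset_busy_iter(ctrl->tagset,
				example_terminate_rq, NULL);
}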


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread James Smart

On 12/4/2018 7:46 AM, Keith Busch wrote:

On Mon, Dec 03, 2018 at 05:33:06PM -0800, Sagi Grimberg wrote:

Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue, set
some error flags in the driver, then _restarting_ the queue, just so
that the driver then sees the error flag and terminates the requests.
Which I always found quite counter-intuitive.

What about requests that come in after the iteration runs? how are those
terminated?

If we've reached a dead state, I think you'd want to start a queue freeze
before running the terminating iterator.


For the requests that come in after the iterator, the nvmf_check_ready() 
routine, which validates controller state, will catch and bounce them.


Keith states why we froze it in the past.  Whatever the iterator is,
I'd prefer we not get abort calls on I/O that has yet to be successfully
started.


-- james



Re: [PATCH 02/27] aio: clear IOCB_HIPRI

2018-12-04 Thread Jens Axboe
On 12/4/18 7:46 AM, Christoph Hellwig wrote:
> On Fri, Nov 30, 2018 at 10:14:31AM -0700, Jens Axboe wrote:
>> On 11/30/18 10:13 AM, Christoph Hellwig wrote:
>>> I think we'll need to queue this up for 4.21 ASAP independent of the
>>> rest, given that with separate poll queues userspace could otherwise
>>> submit I/O that will never get polled for anywhere.
>>
>> Probably a good idea, I can just move it to my 4.21 branch, it's not
>> strictly dependent on the series.
> 
> So, can you add it to the 4.21 branch?

Done

-- 
Jens Axboe



RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-04 Thread Kashyap Desai
> On Tue, Dec 04, 2018 at 03:30:11PM +0530, Kashyap Desai wrote:
> > Problem statement:
> > Whenever the driver tries to get an outstanding request via
> > scsi_host_find_tag, the block layer may return stale entries instead of
> > the actual outstanding request. The kernel panics if the stale entry is
> > inaccessible or its memory has been reused.
> > Fix:
> > Undo the request mapping in blk_mq_put_driver_tag once the request is
> > returned.
> >
> > More detail:
> > Whenever an SDEV entry is created, the block layer allocates separate tags
> > and static requests. Those requests are no longer valid after the SDEV is
> > deleted from the system. On the fly, the block layer maps static rqs to
> > rqs as below from blk_mq_get_driver_tag()
> >
> > data.hctx->tags->rqs[rq->tag] = rq;
> >
> > The above mapping is for active in-use requests, and it is the same
> > mapping that is referred to in scsi_host_find_tag().
> > After running some IOs, “data.hctx->tags->rqs[rq->tag]” will contain some
> > entries which are never reset in the block layer.
>
> However, if rq & rq->tag is valid, data.hctx->tags->rqs[rq->tag] should
> have pointed to one active request instead of the stale one, right?

Yes, that is my understanding and what I learned from this issue.
Side note -

At driver load whenever driver does scsi_add_host_with_dma(), it follows
below code path in block layer.

scsi_mq_setup_tags
  ->blk_mq_alloc_tag_set
  -> blk_mq_alloc_rq_maps
 -> __blk_mq_alloc_rq_maps

SML creates two sets of request pools: one per HBA and the other per SDEV. I
was confused about why SML creates a request pool per HBA.

>
> Thanks,
> Ming


Re: [PATCH 01/13] block: move queues types to the block layer

2018-12-04 Thread Sagi Grimberg




Nit, there seems to be an extra newline that can be omitted here before
the else if statement (if I'm reading this correctly)...


Empty lines can always be omitted, but in this case I actually like it
as it seems to help readability.


If you think it's useful I'm fine with it as is...


Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled

2018-12-04 Thread Sagi Grimberg




@@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
if (nr_io_queues == 0)
return 0;
+   
+   clear_bit(NVMEQ_ENABLED, &adminq->flags);
   


This is a change of behavior, looks correct though as we can fail
nvme_setup_irqs after we freed the admin vector. Needs documentation
though..


I have a hard time parsing the above, can you point to where the problem
is?


This flag replaces cq_vector = -1 for indicating the queue state, so I'd
expect it not to appear where the former didn't. In this case we clear
the flag in a place where we did not touch cq_vector before, which seems
correct to me though.


Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues

2018-12-04 Thread Sagi Grimberg




+static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)


Do we still need to carry the tag around?


Yes, the timeout handler polls for a specific tag.


Does it have to? The documentation suggests that we missed
an interrupt, so it is probably waiting on the completion queue.

I'd say it'd be better if the tag search were implemented
in the timeout handler alone so we don't have to pass the tag around
for everyone... Thoughts?


Re: [PATCH 10/13] nvme-mpath: remove I/O polling support

2018-12-04 Thread Sagi Grimberg




If it really becomes an issue we
should rework the nvme code to also skip the multipath code for any
private namespace, even if that could mean some trouble when rescanning.



This requires some explanation? skip the multipath code how?


We currently always go through the multipath node as long as the
controller is multipath capable.  If we care about e.g. polling
on a private namespace on a dual ported U.2 drive we'd have to make
sure we don't go through the multipath device node for private namespaces
that can only have one path, but only for shared namespaces.


But we'd still use the multipath node for shared namespaces (and also
polling if needed). I agree that private namespaces can skip the
multipath node.


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Sagi Grimberg




Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue, set
some error flags in the driver, then _restarting_ the queue, just so
that the driver then sees the error flag and terminates the requests.
Which I always found quite counter-intuitive.

What about requests that come in after the iteration runs? how are those
terminated?

If we've reached a dead state, I think you'd want to start a queue freeze
before running the terminating iterator.


For the requests that come in after the iterator, the nvmf_check_ready() 
routine, which validates controller state, will catch and bounce them.


The point of this patch was to omit the check in queue_rq like Keith
did in patch #2.


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Sagi Grimberg




Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue, set
some error flags in the driver, then _restarting_ the queue, just so
that the driver then sees the error flag and terminates the requests.
Which I always found quite counter-intuitive.


What about requests that come in after the iteration runs? how are those
terminated?


If we've reached a dead state, I think you'd want to start a queue freeze
before running the terminating iterator.


It's not necessarily dead; in fabrics we need to handle disconnections
that last for a while before we are able to reconnect (for a variety of
reasons), and we need a way to fail I/O for failover (or requeue, or
block; it's up to the upper layer). It's less of a "last resort" action
than in the pci case.

Does this guarantee that after freeze+iter we won't get queued with any
other request? If not, then we still need to unfreeze and fail at
queue_rq.


[PATCH 2/3] blktests: add python scripts for parsing fio json output

2018-12-04 Thread Josef Bacik
I wrote these scripts for xfstests and am just copying them over to blktests
as well.  The terse output support for fio in blktests doesn't capture all of
the various fio performance numbers we may want to look at; these scripts
give us a lot of flexibility around getting specific values out of the
fio results data.

Signed-off-by: Josef Bacik 
---
 src/FioResultDecoder.py | 64 +
 src/fio-key-value.py| 28 ++
 2 files changed, 92 insertions(+)
 create mode 100644 src/FioResultDecoder.py
 create mode 100644 src/fio-key-value.py

diff --git a/src/FioResultDecoder.py b/src/FioResultDecoder.py
new file mode 100644
index ..d004140c0fdf
--- /dev/null
+++ b/src/FioResultDecoder.py
@@ -0,0 +1,64 @@
+# SPDX-License-Identifier: GPL-2.0
+
+import json
+
+class FioResultDecoder(json.JSONDecoder):
+"""Decoder for decoding fio result json to an object for our database
+
+This decodes the json output from fio into an object that can be directly
+inserted into our database.  This just strips out the fields we don't care
+about and collapses the read/write/trim classes into a flat value structure
+inside of the jobs object.
+
+For example
+"write" : {
+"io_bytes" : 313360384,
+"bw" : 1016,
+}
+
+Gets collapsed to
+
+"write_io_bytes" : 313360384,
+"write_bw": 1016,
+
+Currently any dict under 'jobs' gets dropped, with the exception of 'read',
+'write', and 'trim'.  For those sub-sections we drop any dicts under those.
+
+Attempt to keep this as generic as possible, we don't want to break every
+time fio changes its json output format.
+"""
+_ignore_types = ['dict', 'list']
+_override_keys = ['lat_ns', 'lat']
+_io_ops = ['read', 'write', 'trim']
+
+_transform_keys = { 'lat': 'lat_ns' }
+
+def decode(self, json_string):
+"""This does the dirty work of converting everything"""
+default_obj = super(FioResultDecoder, self).decode(json_string)
+obj = {}
+obj['global'] = {}
+obj['global']['time'] = default_obj['time']
+obj['jobs'] = []
+for job in default_obj['jobs']:
+new_job = {}
+for key,value in job.iteritems():
+if key not in self._io_ops:
+if value.__class__.__name__ in self._ignore_types:
+continue
+new_job[key] = value
+continue
+for k,v in value.iteritems():
+if k in self._override_keys:
+if k in self._transform_keys:
+k = self._transform_keys[k]
+for subk,subv in v.iteritems():
+collapsed_key = "{}_{}_{}".format(key, k, subk)
+new_job[collapsed_key] = subv
+continue
+if v.__class__.__name__ in self._ignore_types:
+continue
+collapsed_key = "{}_{}".format(key, k)
+new_job[collapsed_key] = v
+obj['jobs'].append(new_job)
+return obj
diff --git a/src/fio-key-value.py b/src/fio-key-value.py
new file mode 100644
index ..208e9a453a19
--- /dev/null
+++ b/src/fio-key-value.py
@@ -0,0 +1,28 @@
+# SPDX-License-Identifier: GPL-2.0
+
+import FioResultDecoder
+import json
+import argparse
+import sys
+import platform
+
+parser = argparse.ArgumentParser()
+parser.add_argument('-j', '--jobname', type=str,
+help="The jobname we want our key from.",
+required=True)
+parser.add_argument('-k', '--key', type=str,
+help="The key we want the value of", required=True)
+parser.add_argument('result', type=str,
+help="The result file to read")
+args = parser.parse_args()
+
+json_data = open(args.result)
+data = json.load(json_data, cls=FioResultDecoder.FioResultDecoder)
+
+for job in data['jobs']:
+if job['jobname'] == args.jobname:
+if args.key not in job:
+print('')
+else:
+print("{}".format(job[args.key]))
+break
-- 
2.14.3



[PATCH 1/3] blktests: add cgroup2 infrastructure

2018-12-04 Thread Josef Bacik
In order to test io.latency and other cgroup-related things we need some
supporting helpers to set up and tear down cgroup2.  This adds support
for checking that we can even configure cgroup2 things, sets them up if
need be, and then adds the cleanup stuff to the main cleanup function so
everything is always left in a clean state.

Signed-off-by: Josef Bacik 
---
 check |  2 ++
 common/rc | 48 
 2 files changed, 50 insertions(+)

diff --git a/check b/check
index ebd87c097e25..1c9dbc518fa1 100755
--- a/check
+++ b/check
@@ -294,6 +294,8 @@ _cleanup() {
done
unset RESTORE_CPUS_ONLINE
fi
+
+   _cleanup_cgroup2
 }
 
 _call_test() {
diff --git a/common/rc b/common/rc
index 8a892bcd5fde..a785f2329687 100644
--- a/common/rc
+++ b/common/rc
@@ -202,3 +202,51 @@ _test_dev_in_hotplug_slot() {
 _filter_xfs_io_error() {
sed -e 's/^\(.*\)64\(: .*$\)/\1\2/'
 }
+
+_cgroup2_base_dir()
+{
+   grep cgroup2 /proc/mounts | awk '{ print $2 }'
+}
+
+_cleanup_cgroup2()
+{
+   _dir=$(_cgroup2_base_dir)/blktests
+   [ -d "${_dir}" ] || return
+
+   for i in $(find ${_dir} -type d | tac)
+   do
+   rmdir $i
+   done
+}
+
+_have_cgroup2()
+{
+   if ! grep -q 'cgroup2' /proc/mounts; then
+   SKIP_REASON="This test requires cgroup2"
+   return 1
+   fi
+   return 0
+}
+
+_have_cgroup2_controller_file()
+{
+   _have_cgroup2 || return 1
+
+   _controller=$1
+   _file=$2
+   _dir=$(_cgroup2_base_dir)
+
+   if ! grep -q ${_controller} ${_dir}/cgroup.controllers; then
+   SKIP_REASON="No support for ${_controller} cgroup controller"
+   return 1
+   fi
+
+   mkdir ${_dir}/blktests
+   echo "+${_controller}" > ${_dir}/cgroup.subtree_control
+   if [ ! -f ${_dir}/blktests/${_file} ]; then
+   _cleanup_cgroup2
+   SKIP_REASON="Cgroup file ${_file} doesn't exist"
+   return 1
+   fi
+   return 0
+}
-- 
2.14.3



[PATCH 0/3] io.latency test for blktests

2018-12-04 Thread Josef Bacik
This patchset adds a test to verify io.latency is working properly, along with
all the supporting code to run that test.

First is the cgroup2 infrastructure which is fairly straightforward.  Just
verifies we have cgroup2, and gives us the helpers to check and make sure we
have the right controllers in place for the test.

The second patch brings over some python scripts I put in xfstests for parsing
the fio json output.  I looked at the existing fio performance stuff in
blktests, but we only capture bw stuff, which is wonky with this particular test
because once the fast group is finished the slow group is allowed to go as fast
as it wants.  So I needed this to pull out actual jobtime spent.  This will give
us flexibility to pull out other fio performance data in the future.

The final patch is the test itself.  It simply runs a job by itself to get a
baseline view of the disk performance.  Then it creates 2 cgroups, one fast and
one slow, and runs the same job simultaneously in both groups.  The result
should be that the fast group takes just slightly longer time than the baseline
(I use a 15% threshold to be safe), and that the slow one takes considerably
longer.  Thanks,

Josef


[PATCH 3/3] blktests: block/025: an io.latency test

2018-12-04 Thread Josef Bacik
This is a test to verify io.latency is working properly.  It does this
by first running a fio job by itself to figure out how fast it runs.
Then we calculate some thresholds, set up 2 cgroups, a fast and a slow
cgroup, and then run the same job in both groups simultaneously.  We
should see the slow group get throttled until the first cgroup is able
to finish, and then the slow cgroup will be allowed to finish.

Signed-off-by: Josef Bacik 
---
 tests/block/025 | 133 
 tests/block/025.out |   1 +
 2 files changed, 134 insertions(+)
 create mode 100644 tests/block/025
 create mode 100644 tests/block/025.out

diff --git a/tests/block/025 b/tests/block/025
new file mode 100644
index ..88ce7cca944e
--- /dev/null
+++ b/tests/block/025
@@ -0,0 +1,133 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+#
+# Test io.latency to make sure it's protecting the higher priority group
+# properly.
+
+. tests/block/rc
+
+DESCRIPTION="test the io.latency interface to make sure it's working right"
+
+requires() {
+   _have_cgroup2_controller_file io io.latency && _have_fio && \
+   _have_program python
+}
+
+_fio_results_key() {
+   _job=$1
+   _key=$2
+   _resultfile=$3
+
+   python src/fio-key-value.py -k $_key -j $_job $_resultfile
+}
+
+test_device() {
+   local fio_config_single fio_config_double fio_results fio_args qd
+
+   fio_config_single=${TMPDIR}/single.fio
+   fio_config_double=${TMPDIR}/double.fio
+   fio_results=${TMPDIR}/025.json
+   fio_args="--output-format=json --output=${fio_results}"
+   qd=$(cat ${TEST_DEV_SYSFS}/queue/nr_requests)
+
+   cat << EOF > "${fio_config_single}"
+   [fast]
+   filename=${TEST_DEV}
+   direct=1
+   allrandrepeat=1
+   readwrite=randrw
+   size=4G
+   ioengine=libaio
+   iodepth=${qd}
+   fallocate=none
+   randseed=12345
+EOF
+
+   cat << EOF > "${fio_config_double}"
+   [global]
+   filename=${TEST_DEV}
+   direct=1
+   allrandrepeat=1
+   readwrite=randrw
+   size=4G
+   ioengine=libaio
+   iodepth=${qd}
+   fallocate=none
+   randseed=12345
+
+   [fast]
+   cgroup=blktests/fast
+
+   [slow]
+   cgroup=blktests/slow
+EOF
+   # We run the test once so we have an idea of how fast this workload will
+   # go with nobody else doing IO on the device.
+   if ! fio ${fio_args} ${fio_config_single}; then
+   echo "fio exited with status $?"
+   return 1
+   fi
+
+   _time_taken=$(_fio_results_key fast job_runtime ${fio_results})
+   if [ "${_time_taken}" = "" ]; then
+   echo "fio doesn't report job_runtime"
+   return 1
+   fi
+
+   echo "normal time taken ${_time_taken}" >> $FULL
+
+   # There's no way to predict how the two workloads are going to affect
+   # each other, so we want to set thresholds to something reasonable so
+   # we can verify io.latency is doing something.  This means we set 15%
+   # for the fast cgroup, just to give us enough wiggle room as throttling
+   # doesn't happen immediately.  But if we have a super fast disk we could
+   # run both groups really fast and make it under our fast threshold, so
+   # we need to set a threshold for the slow group at 50%.  We assume that
+   # if it was faster than 50% of the fast threshold then we probably
+   # didn't throttle and we can assume io.latency is broken.
+   _fast_thresh=$((${_time_taken} + ${_time_taken} * 15 / 100))
+   _slow_thresh=$((${_time_taken} + ${_time_taken} * 50 / 100))
+   echo "fast threshold time is ${_fast_thresh}" >> $FULL
+   echo "slow threshold time is ${_slow_thresh}" >> $FULL
+
+   # Create the cgroup files
+   _dir=$(_cgroup2_base_dir)/blktests
+   echo "+io" > ${_dir}/cgroup.subtree_control
+   mkdir ${_dir}/fast
+   mkdir ${_dir}/slow
+
+   # We set the target to 1usec because we could have a fast device that is
+   # capable of remarkable IO latencies that would skew the test.  It needs
+   # to be low enough that we do actually throttle the slow group,
+   # otherwise the test will fail when there's nothing wrong.
+   _major=$((0x$(stat -c "%t" ${TEST_DEV})))
+   _minor=$((0x$(stat -c "%T" ${TEST_DEV})))
+   echo "${_major}:${_minor} is our device" >> $FULL
+   if ! echo "${_major}:${_minor} target=1" > ${_dir}/fast/io.latency; then
+   echo "Failed to set our latency target"
+   return 1
+   fi
+
+   if ! fio ${fio_args} ${fio_config_double}; then
+   echo "fio exited with status $?"
+   return 1
+   fi
+
+   _fast_time=$(_fio_results_key fast job_runtime ${fio_results})
+   echo "Fast time ${_fast_time}" >> $FULL
+   _slow_time=$(_fio_results_key slow job_runtime ${fio_results})
+   echo "Slow time

Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Keith Busch
On Tue, Dec 04, 2018 at 09:38:29AM -0800, Sagi Grimberg wrote:
> 
> > > > Yes, I'm very much in favour of this, too.
> > > > We always have this IMO slightly weird notion of stopping the queue, set
> > > > some error flags in the driver, then _restarting_ the queue, just so
> > > > that the driver then sees the error flag and terminates the requests.
> > > > Which I always found quite counter-intuitive.
> > > 
> > > What about requests that come in after the iteration runs? how are those
> > > terminated?
> > 
> > If we've reached a dead state, I think you'd want to start a queue freeze
> > before running the terminating iterator.
> 
> Its not necessarily dead, in fabrics we need to handle disconnections
> that last for a while before we are able to reconnect (for a variety of
> reasons) and we need a way to fail I/O for failover (or requeue, or
> block its up to the upper layer). Its less of a "last resort" action
> like in the pci case.
> 
> Does this guarantee that after freeze+iter we won't get queued with any
> other request? If not then we still need to unfreeze and fail at
> queue_rq.

It sounds like there are different scenarios to consider.

For the dead controller, we call blk_cleanup_queue() at the end which
ends callers who blocked on entering.

If you're doing a failover, you'd replace the freeze with a current path
update in order to prevent new requests from entering.

In either case, you don't need checks in queue_rq. The queue_rq check
is redundant with the quiesce state that blk-mq already provides.

Once quiesced, the proposed iterator can handle the final termination
of the request, perform failover, or some other lld specific action
depending on your situation.


[PATCH 3/3] block: convert io-latency to use rq_qos_wait

2018-12-04 Thread Josef Bacik
Now that we have this common helper, convert io-latency over to use it
as well.

Signed-off-by: Josef Bacik 
---
 block/blk-iolatency.c | 31 ---
 1 file changed, 8 insertions(+), 23 deletions(-)

diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 5f7f1773be61..ac74b41f4131 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -262,15 +262,15 @@ static inline void iolat_update_total_lat_avg(struct 
iolatency_grp *iolat,
   stat->rqs.mean);
 }
 
-static inline bool iolatency_may_queue(struct iolatency_grp *iolat,
-  wait_queue_entry_t *wait,
-  bool first_block)
+static void iolat_cleanup_cb(struct rq_wait *rqw, void *private_data)
 {
-   struct rq_wait *rqw = &iolat->rq_wait;
+   atomic_dec(&rqw->inflight);
+   wake_up(&rqw->wait);
+}
 
-   if (first_block && waitqueue_active(&rqw->wait) &&
-   rqw->wait.head.next != &wait->entry)
-   return false;
+static bool iolat_acquire_inflight(struct rq_wait *rqw, void *private_data)
+{
+   struct iolatency_grp *iolat = private_data;
return rq_wait_inc_below(rqw, iolat->rq_depth.max_depth);
 }
 
@@ -281,8 +281,6 @@ static void __blkcg_iolatency_throttle(struct rq_qos *rqos,
 {
struct rq_wait *rqw = &iolat->rq_wait;
unsigned use_delay = atomic_read(&lat_to_blkg(iolat)->use_delay);
-   DEFINE_WAIT(wait);
-   bool first_block = true;
 
if (use_delay)
blkcg_schedule_throttle(rqos->q, use_memdelay);
@@ -299,20 +297,7 @@ static void __blkcg_iolatency_throttle(struct rq_qos *rqos,
return;
}
 
-   if (iolatency_may_queue(iolat, &wait, first_block))
-   return;
-
-   do {
-   prepare_to_wait_exclusive(&rqw->wait, &wait,
- TASK_UNINTERRUPTIBLE);
-
-   if (iolatency_may_queue(iolat, &wait, first_block))
-   break;
-   first_block = false;
-   io_schedule();
-   } while (1);
-
-   finish_wait(&rqw->wait, &wait);
+   rq_qos_wait(rqw, iolat, iolat_acquire_inflight, iolat_cleanup_cb);
 }
 
 #define SCALE_DOWN_FACTOR 2
-- 
2.14.3



[PATCH 1/3] block: add rq_qos_wait to rq_qos

2018-12-04 Thread Josef Bacik
Originally when I split out the common code from blk-wbt into rq_qos I
left the wbt_wait() where it was and simply copied and modified it
slightly to work for io-latency.  However they are both basically the
same thing, and as time has gone on wbt_wait() has ended up much smarter
and kinder than it was when I copied it into io-latency, which means
io-latency has lost out on these improvements.

Since they are the same thing essentially except for a few minor things,
create rq_qos_wait() that replicates what wbt_wait() currently does with
callbacks that can be passed in for the snowflakes to do their own thing
as appropriate.

Signed-off-by: Josef Bacik 
---
 block/blk-rq-qos.c | 86 ++
 block/blk-rq-qos.h |  6 
 2 files changed, 92 insertions(+)

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 80f603b76f61..e932ef9d2718 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -176,6 +176,92 @@ void rq_depth_scale_down(struct rq_depth *rqd, bool 
hard_throttle)
rq_depth_calc_max_depth(rqd);
 }
 
+struct rq_qos_wait_data {
+   struct wait_queue_entry wq;
+   struct task_struct *task;
+   struct rq_wait *rqw;
+   acquire_inflight_cb_t *cb;
+   void *private_data;
+   bool got_token;
+};
+
+static int rq_qos_wake_function(struct wait_queue_entry *curr,
+   unsigned int mode, int wake_flags, void *key)
+{
+   struct rq_qos_wait_data *data = container_of(curr,
+struct rq_qos_wait_data,
+wq);
+
+   /*
+* If we fail to get a budget, return -1 to interrupt the wake up loop
+* in __wake_up_common.
+*/
+   if (!data->cb(data->rqw, data->private_data))
+   return -1;
+
+   data->got_token = true;
+   list_del_init(&curr->entry);
+   wake_up_process(data->task);
+   return 1;
+}
+
+/**
+ * rq_qos_wait - throttle on a rqw if we need to
+ * @private_data - caller provided specific data
+ * @acquire_inflight_cb - inc the rqw->inflight counter if we can
+ * @cleanup_cb - the callback to cleanup in case we race with a waker
+ *
+ * This provides a uniform place for the rq_qos users to do their throttling.
+ * Since you can end up with a lot of things sleeping at once, this manages the
+ * waking up based on the resources available.  The acquire_inflight_cb should
+ * inc the rqw->inflight if we have the ability to do so, or return false if not
+ * and then we will sleep until the room becomes available.
+ *
+ * cleanup_cb is in case that we race with a waker and need to cleanup the
+ * inflight count accordingly.
+ */
+void rq_qos_wait(struct rq_wait *rqw, void *private_data,
+acquire_inflight_cb_t *acquire_inflight_cb,
+cleanup_cb_t *cleanup_cb)
+{
+   struct rq_qos_wait_data data = {
+   .wq = {
+   .func   = rq_qos_wake_function,
+   .entry  = LIST_HEAD_INIT(data.wq.entry),
+   },
+   .task = current,
+   .rqw = rqw,
+   .cb = acquire_inflight_cb,
+   .private_data = private_data,
+   };
+   bool has_sleeper;
+
+   has_sleeper = wq_has_sleeper(&rqw->wait);
+   if (!has_sleeper && acquire_inflight_cb(rqw, private_data))
+   return;
+
+   prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
+   do {
+   if (data.got_token)
+   break;
+   if (!has_sleeper && acquire_inflight_cb(rqw, private_data)) {
+   finish_wait(&rqw->wait, &data.wq);
+
+   /*
+* We raced with wbt_wake_function() getting a token,
+* which means we now have two. Put our local token
+* and wake anyone else potentially waiting for one.
+*/
+   if (data.got_token)
+   cleanup_cb(rqw, private_data);
+   break;
+   }
+   io_schedule();
+   has_sleeper = false;
+   } while (1);
+   finish_wait(&rqw->wait, &data.wq);
+}
+
 void rq_qos_exit(struct request_queue *q)
 {
while (q->rq_qos) {
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 6e09e98b93ea..8678875de420 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -93,6 +93,12 @@ static inline void rq_qos_del(struct request_queue *q, 
struct rq_qos *rqos)
}
 }
 
+typedef bool (acquire_inflight_cb_t)(struct rq_wait *rqw, void *private_data);
+typedef void (cleanup_cb_t)(struct rq_wait *rqw, void *private_data);
+
+void rq_qos_wait(struct rq_wait *rqw, void *private_data,
+acquire_inflight_cb_t *acquire_inflight_cb,
+cleanup_cb_t *cleanup_cb);
 bool rq_wait_inc_below(struct 

[PATCH 2/3] block: convert wbt_wait() to use rq_qos_wait()

2018-12-04 Thread Josef Bacik
Now that we have rq_qos_wait() in place, convert wbt_wait() over to
using it with its specific callbacks.

Signed-off-by: Josef Bacik 
---
 block/blk-wbt.c | 65 ++---
 1 file changed, 11 insertions(+), 54 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index d051ebfb4852..40207edd1d89 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -489,31 +489,21 @@ static inline unsigned int get_limit(struct rq_wb *rwb, 
unsigned long rw)
 }
 
 struct wbt_wait_data {
-   struct wait_queue_entry wq;
-   struct task_struct *task;
struct rq_wb *rwb;
-   struct rq_wait *rqw;
+   enum wbt_flags wb_acct;
unsigned long rw;
-   bool got_token;
 };
 
-static int wbt_wake_function(struct wait_queue_entry *curr, unsigned int mode,
-int wake_flags, void *key)
+static bool wbt_inflight_cb(struct rq_wait *rqw, void *private_data)
 {
-   struct wbt_wait_data *data = container_of(curr, struct wbt_wait_data,
-   wq);
-
-   /*
-* If we fail to get a budget, return -1 to interrupt the wake up
-* loop in __wake_up_common.
-*/
-   if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
-   return -1;
+   struct wbt_wait_data *data = private_data;
+   return rq_wait_inc_below(rqw, get_limit(data->rwb, data->rw));
+}
 
-   data->got_token = true;
-   list_del_init(&curr->entry);
-   wake_up_process(data->task);
-   return 1;
+static void wbt_cleanup_cb(struct rq_wait *rqw, void *private_data)
+{
+   struct wbt_wait_data *data = private_data;
+   wbt_rqw_done(data->rwb, rqw, data->wb_acct);
 }
 
 /*
@@ -525,45 +515,12 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags 
wb_acct,
 {
struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
struct wbt_wait_data data = {
-   .wq = {
-   .func   = wbt_wake_function,
-   .entry  = LIST_HEAD_INIT(data.wq.entry),
-   },
-   .task = current,
.rwb = rwb,
-   .rqw = rqw,
+   .wb_acct = wb_acct,
.rw = rw,
};
-   bool has_sleeper;
-
-   has_sleeper = wq_has_sleeper(&rqw->wait);
-   if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
-   return;
-
-   prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
-   do {
-   if (data.got_token)
-   break;
-
-   if (!has_sleeper &&
-   rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
-   finish_wait(&rqw->wait, &data.wq);
-
-   /*
-* We raced with wbt_wake_function() getting a token,
-* which means we now have two. Put our local token
-* and wake anyone else potentially waiting for one.
-*/
-   if (data.got_token)
-   wbt_rqw_done(rwb, rqw, wb_acct);
-   break;
-   }
-
-   io_schedule();
-   has_sleeper = false;
-   } while (1);
 
-   finish_wait(&rqw->wait, &data.wq);
+   rq_qos_wait(rqw, &data, wbt_inflight_cb, wbt_cleanup_cb);
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
-- 
2.14.3



[PATCH 0/3] Unify the throttling code for wbt and io-latency

2018-12-04 Thread Josef Bacik
Originally, when I wrote io-latency and the rq_qos code to provide a common base
between wbt and io-latency, I left out the throttling part.  The two were basically
the same, but slightly different in each case.  The difference was enough, and the
code simple enough, that I just copied it into io-latency, modified it for what I
needed, and carried on.

Since then Jens has fixed a few issues with wakeups in the naive approach.
Before, you could easily cycle waiters back to the end of the line if they were
woken up without the ability to actually do their IO yet.  But because this fix
was only in wbt, io-latency didn't get it.

Resolve this by creating a unified interface for doing the throttling, and then
just handle the differences between the two users with user specific callbacks.
This allows us to have one place where we have to mess with wakeups, and gives
each user the ability to be their own special snowflake.
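
As a sketch of the resulting contract, a hypothetical third rq_qos user would
only have to provide the two callbacks (the my_* names below are made up;
rq_qos_wait(), rq_wait_inc_below() and the callback typedefs are the ones added
in patch 1/3):

struct my_qos_data {
	struct rq_wait *rqw;
	unsigned int max_inflight;
};

static bool my_acquire_inflight(struct rq_wait *rqw, void *private_data)
{
	struct my_qos_data *d = private_data;

	/* Take an inflight slot if we are below our limit, else we sleep. */
	return rq_wait_inc_below(rqw, d->max_inflight);
}

static void my_cleanup(struct rq_wait *rqw, void *private_data)
{
	/* We raced with a waker and hold an extra token: put it back. */
	atomic_dec(&rqw->inflight);
	wake_up(&rqw->wait);
}

static void my_throttle(struct my_qos_data *d)
{
	rq_qos_wait(d->rqw, d, my_acquire_inflight, my_cleanup);
}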

Jens, I based this on for-next from 12/03, let me know if you want a different
base.  I tested this with my blktests test.  Thanks,

Josef



Re: [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT

2018-12-04 Thread Jens Axboe
On 12/4/18 7:48 AM, Christoph Hellwig wrote:
> On Fri, Nov 30, 2018 at 10:17:49AM -0700, Jens Axboe wrote:
>>> Setting REQ_NOWAIT from inside the block layer will make the code that
>>> submits requests harder to review. Have you considered to make this code
>>> fail I/O if REQ_NOWAIT has not been set and to require that the context
>>> that submits I/O sets REQ_NOWAIT?
>>
>> It's technically still feasible to do for sync polled IO, it's only
>> the async case that makes it a potential deadlock.
> 
> I wonder if we want a REQ_ASYNC_POLL compound flag #define that sets
> REQ_POLL and REQ_NOWAIT to make this blindly obvious.

Yeah that might make sense, all the async cases should certainly use it,
and sync can keep using REQ_POLL. I'll add that and fold where I can.

-- 
Jens Axboe



Re: [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues

2018-12-04 Thread Jens Axboe
On 12/4/18 8:05 AM, Christoph Hellwig wrote:
> On Mon, Dec 03, 2018 at 05:00:59PM -0800, Sagi Grimberg wrote:
>>
>>> @@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
>>> bool shutdown)
>>> nvme_stop_queues(&dev->ctrl);
>>> if (!dead && dev->ctrl.queue_count > 0) {
>>> -   nvme_disable_io_queues(dev);
>>> +   if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
>>> +   
>>
>> Would be nice if the opcode change would be kept inside but still
>> split like:
> 
> I actually like not having another wrapper to stop through..
> 
> Keith, Jens, any preference?

Fine either way, prefer not having a wrapper.

-- 
Jens Axboe



Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues

2018-12-04 Thread Jens Axboe
On 12/4/18 10:13 AM, Sagi Grimberg wrote:
> 
 +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int 
 tag)
>>>
>>> Do we still need to carry the tag around?
>>
>> Yes, the timeout handler polls for a specific tag.
> 
> Does it have to? the documentation suggests that we missed
> an interrupt, so it is probably waiting on the completion queue.
> 
> I'd say that it'd be better if the tag search would be implemented
> on the timeout handler alone so we don't have to pass the tag around
> for everyone... Thoughts?

Without that you don't know if that's the request that completed. You
have to be able to look that up from the timeout handler.

-- 
Jens Axboe



Re: block and nvme polling improvements V3

2018-12-04 Thread Jens Axboe
On 12/2/18 9:46 AM, Christoph Hellwig wrote:
> Hi all,
> 
> this series optimizes a few bits in the block layer and nvme code
> related to polling.
> 
> It starts by moving the queue types recently introduce entirely into
> the block layer instead of requiring an indirect call for them.
> 
> It then switches nvme and the block layer to only allow polling
> with separate poll queues, which allows us to realize the following
> benefits:
> 
>  - poll queues can safely avoid disabling irqs on any locks
>(we already do that in NVMe, but it isn't 100% kosher as-is)
>  - regular interrupt driven queues can drop the CQ lock entirely,
>as we won't race for completing CQs
> 
> Then we drop the NVMe RDMA code, as it doesn't follow the new mode,
> and remove the nvme multipath polling code including the block hooks
> for it, which didn't make much sense to start with given that we
> started bypassing the multipath code for single controller subsystems
> early on.  Last but not least we enable polling in the block layer
> by default if the underlying driver has poll queues, as that already
> requires explicit user action.
> 
> Note that it would be really nice to have polling back for RDMA with
> dedicated poll queues, but that might take a while.  Also based on
> Jens' polling aio patches we could now implement a model in nvmet
> where we have a thread polling both the backend nvme device and
> the RDMA CQs, which might give us some pretty nice performance
> (I know Sagi looked into something similar a while ago).

Applied, thanks.

-- 
Jens Axboe



Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread James Smart




On 12/4/2018 9:23 AM, Sagi Grimberg wrote:



Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue, set
some error flags in the driver, then _restarting_ the queue, just so
that the driver then sees the error flag and terminates the requests.
Which I always found quite counter-intuitive.

What about requests that come in after the iteration runs? how are those
terminated?

If we've reached a dead state, I think you'd want to start a queue freeze
before running the terminating iterator.


For the requests that come in after the iterator, the 
nvmf_check_ready() routine, which validates controller state, will 
catch and bounce them.


The point of this patch was to omit the check in queue_rq like Keith
did in patch #2.


well - I'm not sure that's possible. The fabrics will have different 
time constraints vs pci.


-- james



Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread James Smart




On 12/4/2018 9:48 AM, Keith Busch wrote:
> On Tue, Dec 04, 2018 at 09:38:29AM -0800, Sagi Grimberg wrote:
> > > > > Yes, I'm very much in favour of this, too.
> > > > > We always have this IMO slightly weird notion of stopping the queue, set
> > > > > some error flags in the driver, then _restarting_ the queue, just so
> > > > > that the driver then sees the error flag and terminates the requests.
> > > > > Which I always found quite counter-intuitive.
> > > > What about requests that come in after the iteration runs? how are those
> > > > terminated?
> > > If we've reached a dead state, I think you'd want to start a queue freeze
> > > before running the terminating iterator.
> > Its not necessarily dead, in fabrics we need to handle disconnections
> > that last for a while before we are able to reconnect (for a variety of
> > reasons) and we need a way to fail I/O for failover (or requeue, or
> > block its up to the upper layer). Its less of a "last resort" action
> > like in the pci case.
> > 
> > Does this guarantee that after freeze+iter we won't get queued with any
> > other request? If not then we still need to unfreeze and fail at
> > queue_rq.
> It sounds like there are different scenarios to consider.
> 
> For the dead controller, we call blk_cleanup_queue() at the end which
> ends callers who blocked on entering.
> 
> If you're doing a failover, you'd replace the freeze with a current path
> update in order to prevent new requests from entering.

and if you're not multipath ?  I assume you want the io queues to be
frozen so they queue there - which can block threads such as ns
verification. It's good to have them live, as todays checks bounce the
io, letting the thread terminate as its in a reset/reconnect state,
which allows those threads to exit out or finish before a new reconnect
kicks them off again. We've already been fighting deadlocks with the
reset/delete/rescan paths and these io paths. suspending the queues
completely over the reconnect will likely create more issues in this area.

> In either case, you don't need checks in queue_rq. The queue_rq check
> is redundant with the quiesce state that blk-mq already provides.

I disagree.  The cases I've run into are on the admin queue - where we
are sending io to initialize the controller when another error/reset
occurs, and the checks are required to identify/reject the "old"
initialization commands, with another state check allowing them to
proceed on the "new" initialization commands.  And there are also cases
for ioctls and other things that occur during the middle of those
initialization steps that need to be weeded out.  The Admin queue has
to be kept live to allow the initialization commands on the new controller.

state checks are also needed for those namespace validation cases

> Once quiesced, the proposed iterator can handle the final termination
> of the request, perform failover, or some other lld specific action
> depending on your situation.

I don't believe they can remain frozen, definitely not for the admin queue.

-- james

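As a rough sketch of the terminating iterator being discussed, here is what the sweep could look like using the existing tagset-wide blk_mq_tagset_busy_iter() helper (the patch under review exports a per-queue variant whose exact form isn't shown here); the callback name and the error status below are placeholders only:

#include <linux/blk-mq.h>

/* Placeholder callback: fail each outstanding request back to blk-mq. */
static void example_cancel_request(struct request *rq, void *data, bool reserved)
{
        blk_mq_end_request(rq, BLK_STS_IOERR);
}

/* After dispatch has been stopped, sweep everything the driver still owns. */
static void example_terminate_outstanding(struct blk_mq_tag_set *set)
{
        blk_mq_tagset_busy_iter(set, example_cancel_request, NULL);
}
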


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Keith Busch
On Tue, Dec 04, 2018 at 11:33:33AM -0800, James Smart wrote:
> 
> 
> On 12/4/2018 9:48 AM, Keith Busch wrote:
> > On Tue, Dec 04, 2018 at 09:38:29AM -0800, Sagi Grimberg wrote:
> > > > > > Yes, I'm very much in favour of this, too.
> > > > > > We always have this IMO slightly weird notion of stopping the 
> > > > > > queue, set
> > > > > > some error flags in the driver, then _restarting_ the queue, just so
> > > > > > that the driver then sees the error flag and terminates the 
> > > > > > requests.
> > > > > > Which I always found quite counter-intuitive.
> > > > > What about requests that come in after the iteration runs? how are 
> > > > > those
> > > > > terminated?
> > > > If we've reached a dead state, I think you'd want to start a queue 
> > > > freeze
> > > > before running the terminating iterator.
> > > Its not necessarily dead, in fabrics we need to handle disconnections
> > > that last for a while before we are able to reconnect (for a variety of
> > > reasons) and we need a way to fail I/O for failover (or requeue, or
> > > block its up to the upper layer). Its less of a "last resort" action
> > > like in the pci case.
> > > 
> > > Does this guarantee that after freeze+iter we won't get queued with any
> > > other request? If not then we still need to unfreeze and fail at
> > > queue_rq.
> > It sounds like there are different scenarios to consider.
> > 
> > For the dead controller, we call blk_cleanup_queue() at the end which
> > ends callers who blocked on entering.
> > 
> > If you're doing a failover, you'd replace the freeze with a current path
> > update in order to prevent new requests from entering.
> and if you're not multipath ?    I assume you want the io queues to be
> frozen so they queue there - which can block threads such as ns
> verification. It's good to have them live, as todays checks bounce the io,
> letting the thread terminate as its in a reset/reconnect state, which allows
> those threads to exit out or finish before a new reconnect kicks them off
> again. We've already been fighting deadlocks with the reset/delete/rescan
> paths and these io paths. suspending the queues completely over the
> reconnect will likely create more issues in this area.
> 
> 
> > In either case, you don't need checks in queue_rq. The queue_rq check
> > is redundant with the quiesce state that blk-mq already provides.
> 
> I disagree.  The cases I've run into are on the admin queue - where we are
> sending io to initialize the controller when another error/reset occurs, and
> the checks are required to identify/reject the "old" initialization
> commands, with another state check allowing them to proceed on the "new"
> initialization commands.  And there are also cases for ioctls and other
> things that occur during the middle of those initialization steps that need
> to be weeded out.   The Admin queue has to be kept live to allow the
> initialization commands on the new controller.
> 
> state checks are also needed for those namespace validation cases
> 
> > 
> > Once quiesced, the proposed iterator can handle the final termination
> > of the request, perform failover, or some other lld specific action
> > depending on your situation.
> 
> I don't believe they can remain frozen, definitely not for the admin queue.
> -- james

Quiesced and frozen carry different semantics.

My understanding of the nvme-fc implementation is that it returns
BLK_STS_RESOURCE in the scenario you've described where the admin
command can't be executed at the moment. That just has the block layer
requeue it for later resubmission 3 milliseconds later, which will
continue to return the same status code until you're really ready for
it.

What I'm proposing is that instead of using that return code, you may
have nvme-fc control when to dispatch those queued requests by utilizing
the blk-mq quiesce on/off states. Is there a reason that wouldn't work?
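
A minimal sketch of the quiesce-based flow proposed above; blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are the existing blk-mq interfaces, while the surrounding transport hooks are invented here purely for illustration:

#include <linux/blk-mq.h>

/*
 * Association dropped: park dispatch so no further ->queue_rq() calls are
 * made, instead of bouncing each request with BLK_STS_RESOURCE.
 */
static void example_transport_pause(struct request_queue *q)
{
        blk_mq_quiesce_queue(q);
}

/* Reconnected: let whatever queued up in the meantime dispatch again. */
static void example_transport_resume(struct request_queue *q)
{
        blk_mq_unquiesce_queue(q);
}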


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Keith Busch
On Tue, Dec 04, 2018 at 02:21:17PM -0700, Keith Busch wrote:
> On Tue, Dec 04, 2018 at 11:33:33AM -0800, James Smart wrote:
> > On 12/4/2018 9:48 AM, Keith Busch wrote:
> > > Once quiesced, the proposed iterator can handle the final termination
> > > of the request, perform failover, or some other lld specific action
> > > depending on your situation.
> > 
> > I don't believe they can remain frozen, definitely not for the admin queue.
> > -- james
> 
> Quiesced and frozen carry different semantics.
> 
> My understanding of the nvme-fc implementation is that it returns
> BLK_STS_RESOURCE in the scenario you've described where the admin
> command can't be executed at the moment. That just has the block layer
> requeue it for later resubmission 3 milliseconds later, which will
> continue to return the same status code until you're really ready for
> it.
> 
> What I'm proposing is that instead of using that return code, you may
> have nvme-fc control when to dispatch those queued requests by utilizing
> the blk-mq quiesce on/off states. Is there a reason that wouldn't work?

BTW, this is digressing from this patch's use case. The proposed iteration
doesn't come into play for the above scenario, which can be handled with
existing interfaces.


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread James Smart




On 12/4/2018 1:21 PM, Keith Busch wrote:
> On Tue, Dec 04, 2018 at 11:33:33AM -0800, James Smart wrote:
> > I disagree.  The cases I've run into are on the admin queue - where we are
> > sending io to initialize the controller when another error/reset occurs, and
> > the checks are required to identify/reject the "old" initialization
> > commands, with another state check allowing them to proceed on the "new"
> > initialization commands.  And there are also cases for ioctls and other
> > things that occur during the middle of those initialization steps that need
> > to be weeded out.  The Admin queue has to be kept live to allow the
> > initialization commands on the new controller.
> > 
> > state checks are also needed for those namespace validation cases
> > 
> > > Once quiesced, the proposed iterator can handle the final termination
> > > of the request, perform failover, or some other lld specific action
> > > depending on your situation.
> > 
> > I don't believe they can remain frozen, definitely not for the admin queue.
> > -- james
> 
> Quiesced and frozen carry different semantics.
> 
> My understanding of the nvme-fc implementation is that it returns
> BLK_STS_RESOURCE in the scenario you've described where the admin
> command can't be executed at the moment. That just has the block layer
> requeue it for later resubmission 3 milliseconds later, which will
> continue to return the same status code until you're really ready for
> it.

BLK_STS_RESOURCE is correct - but for "normal" io, which comes from the
filesystem, etc and are mostly on the io queues.

But if the io originated from other sources, like the core layer via
nvme_alloc_request() - used by lots of service routines for the
transport to initialize the controller, core routines to talk to the
controller, and ioctls from user space - then they are failed with a
PATHING ERROR status.  The status doesn't mean much to these other
places which mainly care if they succeed or not, and if not, they fail
and unwind.  It's pretty critical for these paths to get that error
status as many of those threads do synchronous io.  And this is not just
for nvme-fc. Any transport initializing the controller and getting half
way through it when an error occurs that kills the association will
depend on this behavior.  PCI is a large exception as interaction with a
pci function is very different from sending packets over a network and
detecting network errors.

Io requests, on the io queues, that are flagged as multipath, also are
failed this way rather than requeued.  We would need some iterator here
to classify the type of io (one valid to go down another path) and move
it to another path (but the transport doesn't want to know about other
paths).

> What I'm proposing is that instead of using that return code, you may
> have nvme-fc control when to dispatch those queued requests by utilizing
> the blk-mq quiesce on/off states. Is there a reason that wouldn't work?

and quiesce on/off isn't sufficient to do this.

-- james




Re: [PATCH 3/3] blktests: block/025: an io.latency test

2018-12-04 Thread Federico Motta
On 12/4/18 6:47 PM, Josef Bacik wrote:
> This is a test to verify io.latency is working properly.  It does this
> by first running a fio job by itself to figure out how fast it runs.
> Then we calculate some thresholds, set up 2 cgroups, a fast and a slow
> cgroup, and then run the same job in both groups simultaneously.  We
> should see the slow group get throttled until the first cgroup is able
> to finish, and then the slow cgroup will be allowed to finish.
> 
Hi,
two small typos below.

> Signed-off-by: Josef Bacik 
> ---
>  tests/block/025 | 133 
> 
>  tests/block/025.out |   1 +
>  2 files changed, 134 insertions(+)
>  create mode 100644 tests/block/025
>  create mode 100644 tests/block/025.out
> 
> diff --git a/tests/block/025 b/tests/block/025
> new file mode 100644
> index 000000000000..88ce7cca944e
> --- /dev/null
> +++ b/tests/block/025
> @@ -0,0 +1,133 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-3.0+
> +#
> +# Test io.latency to make sure it's protecting the higher priority group
> +# properly.
> +
> +. tests/block/rc
> +
> +DESCRIPTION="test the io.latency interface to make sure it's working right"
> +
> +requires() {
> + _have_cgroup2_controller_file io io.latency && _have_fio && \
> + _have_program python
> +}
> +
> +_fio_results_key() {
> + _job=$1
> + _key=$2
> + _resultfile=$3
> +
> + python src/fio-key-value.py -k $_key -j $_job $_resultfile
> +}
> +
> +test_device() {
> + local fio_config_single fio_config_double fio_results fio_args qd
> +
> + fio_config_single=${TMPDIR}/single.fio
> + fio_config_double=${TMPDIR}/double.fio
> + fio_results=${TMPDIR}/025.json
> + fio_args="--output-format=json --output=${fio_results}"
> + qd=$(cat ${TEST_DEV_SYSFS}/queue/nr_requests)
> +
> + cat << EOF > "${fio_config_single}"
> + [fast]
> + filename=${TEST_DEV}
> + direct=1
> + allrandrepeat=1
> + readwrite=randrw
> + size=4G
> + ioengine=libaio
> + iodepth=${qd}
> + fallocate=none
> + randseed=12345
> +EOF
> +
> + cat << EOF > "${fio_config_double}"
> + [global]
> + filename=${TEST_DEV}
> + direct=1
> + allrandrepeat=1
> + readwrite=randrw
> + size=4G
> + ioengine=libaio
> + iodepth=${qd}
> + fallocate=none
> + randseed=12345
> +
> + [fast]
> + cgroup=blktests/fast
> +
> + [slow]
> + cgroup=blktests/slow
> +EOF
> + # We run the test once so we have an idea of how fast this workload will
> + # go with nobody else doing IO on the device.
> + if ! fio ${fio_args} ${fio_config_single}; then
> + echo "fio exited with status $?"
> + return 1
> + fi
> +
> + _time_taken=$(_fio_results_key fast job_runtime ${fio_results})
> + if [ "${_time_taken}" = "" ]; then
> + echo "fio doesn't report job_runtime"
> + return 1
> + fi
> +
> + echo "normal time taken ${_time_taken}" >> $FULL
> +
> + # There's no way to predict how the two workloads are going to affect
> + # each other, so we want to set thresholds to something reasonable so
> + # we can verify io.latency is doing something.  This means we set 15%
> + # for the fast cgroup, just to give us enough wiggle room as throttling
> + # doesn't happen immediately.  But if we have a super fast disk we could
> + # run both groups really fast and make it under our fast threshold, so
> + # we need to set a threshold for the slow group at 50%.  We assume that
> + # if it was faster than 50% of the fast threshold then we probably
> + # didn't throttle and we can assume io.latency is broken.
> + _fast_thresh=$((${_time_taken} + ${_time_taken} * 15 / 100))
> + _slow_thresh=$((${_time_taken} + ${_time_taken} * 50 / 100))
> + echo "fast threshold time is ${_fast_thresh}" >> $FULL
> + echo "slow threshold time is ${_slow_thresh}" >> $FULL
> +
> + # Create teh cgroup files
Typo, "the".  ^

> + _dir=$(_cgroup2_base_dir)/blktests
> + echo "+io" > ${_dir}/cgroup.subtree_control
> + mkdir ${_dir}/fast
> + mkdir ${_dir}/slow
> +
> + # We set teh target to 1usec because we could have a fast device that is
Also here, "the". ^

> + # capable of remarkable IO latencies that would skew the test.  It needs
> + # to be low enough that we do actually throttle the slow group,
> + # otherwise the test will fail when there's nothing wrong.
> + _major=$((0x$(stat -c "%t" ${TEST_DEV})))
> + _minor=$((0x$(stat -c "%T" ${TEST_DEV})))
> + echo "${_major}:${_minor} is our device" >> $FULL
> + if ! echo "${_major}:${_minor} target=1" > ${_dir}/fast/io.latency; then
> + echo "Failed to set our latency target"
> + return 1
> + fi
> +
> + if ! fio ${fio_args} ${fio_config_double}; then
> + echo "fio exited with status $?"
> + return 1
> + fi

Re: [PATCH 2/3] blktests: add python scripts for parsing fio json output

2018-12-04 Thread Federico Motta
On 12/4/18 6:47 PM, Josef Bacik wrote:
> I wrote these scripts for xfstests, just copying them over to blktests
> as well.  The terse-output fio support that blktests uses doesn't get all of
> the various fio performance things that we may want to look at; this
> gives us a lot of flexibility around getting specific values out of the
> fio results data.
> 
Hi,
some suggestions below :)

> Signed-off-by: Josef Bacik 
> ---
>  src/FioResultDecoder.py | 64 
> +
>  src/fio-key-value.py| 28 ++
>  2 files changed, 92 insertions(+)
>  create mode 100644 src/FioResultDecoder.py
>  create mode 100644 src/fio-key-value.py
> 
> diff --git a/src/FioResultDecoder.py b/src/FioResultDecoder.py
> new file mode 100644
> index 000000000000..d004140c0fdf
> --- /dev/null
> +++ b/src/FioResultDecoder.py
> @@ -0,0 +1,64 @@
I think a shebang could clarify this runs under python2.

> +# SPDX-License-Identifier: GPL-2.0
> +
> +import json
> +
> +class FioResultDecoder(json.JSONDecoder):
> +"""Decoder for decoding fio result json to an object for our database
> +
> +This decodes the json output from fio into an object that can be directly
> +inserted into our database.  This just strips out the fields we don't 
> care
> +about and collapses the read/write/trim classes into a flat value 
> structure
> +inside of the jobs object.
> +
> +For example
> +"write" : {
> +"io_bytes" : 313360384,
> +"bw" : 1016,
> +}
> +
> +Get's collapsed to
> +
> +"write_io_bytes" : 313360384,
> +"write_bw": 1016,
> +
> +Currently any dict under 'jobs' get's dropped, with the exception of 
> 'read',
> +'write', and 'trim'.  For those sub sections we drop any dict's under 
> those.
> +
> +Attempt to keep this as generic as possible, we don't want to break every
> +time fio changes it's json output format.
> +"""
> +_ignore_types = ['dict', 'list']
> +_override_keys = ['lat_ns', 'lat']
> +_io_ops = ['read', 'write', 'trim']
> +
> +_transform_keys = { 'lat': 'lat_ns' }
> +
> +def decode(self, json_string):
> +"""This does the dirty work of converting everything"""
> +default_obj = super(FioResultDecoder, self).decode(json_string)
> +obj = {}
> +obj['global'] = {}
> +obj['global']['time'] = default_obj['time']
> +obj['jobs'] = []
> +for job in default_obj['jobs']:
> +new_job = {}
> +for key,value in job.iteritems():
> +if key not in self._io_ops:
> +if value.__class__.__name__ in self._ignore_types:
> +continue
> +new_job[key] = value
> +continue
> +for k,v in value.iteritems():
> +if k in self._override_keys:
> +if k in self._transform_keys:
> +k = self._transform_keys[k]
> +for subk,subv in v.iteritems():
> +collapsed_key = "{}_{}_{}".format(key, k, subk)
> +new_job[collapsed_key] = subv
> +continue
> +if v.__class__.__name__ in self._ignore_types:
> +continue
> +collapsed_key = "{}_{}".format(key, k)
> +new_job[collapsed_key] = v
> +obj['jobs'].append(new_job)
> +return obj
> diff --git a/src/fio-key-value.py b/src/fio-key-value.py
> new file mode 100644
> index 000000000000..208e9a453a19
> --- /dev/null
> +++ b/src/fio-key-value.py
> @@ -0,0 +1,28 @@
Also here a shebang could be fine.

> +# SPDX-License-Identifier: GPL-2.0
> +
> +import FioResultDecoder
> +import json
> +import argparse
> +import sys
> +import platform
> +
> +parser = argparse.ArgumentParser()
> +parser.add_argument('-j', '--jobname', type=str,
> +help="The jobname we want our key from.",
> +required=True)
> +parser.add_argument('-k', '--key', type=str,
> +help="The key we want the value of", required=True)
> +parser.add_argument('result', type=str,
If here you set type=open  ^

> +help="The result file read")
> +args = parser.parse_args()
> +
> +json_data = open(args.result)
you could avoid this line ^

> +data = json.load(json_data, cls=FioResultDecoder.FioResultDecoder)
and substitute json_data with args.result which would return a file object.

In short the above piece of patch would become something like:

parser.add_argument('result', type=open, help="The result file read")
args = parser.parse_args()
data = json.load(args.result, cls=FioResultDecoder.FioResultDecoder)

Which still raises IOError if bad arguments are passed.

> +
> +for job in data['jobs']:
> +if job['jobname'] == args.jobname:
> +if args.key not in job:
> +   

[PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Jens Axboe
If we attempt a direct issue to a SCSI device, and it returns BUSY, then
we queue the request up normally. However, the SCSI layer may have
already setup SG tables etc for this particular command. If we later
merge with this request, then the old tables are no longer valid. Once
we issue the IO, we only read/write the original part of the request,
not the new state of it.

This causes data corruption, and is most often noticed with the file
system complaining about the just read data being invalid:

[  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm 
dpkg-query: bad extra_isize 24937 (inode size 256)

because most of it is garbage...

This doesn't happen from the normal issue path, as we will simply defer
the request to the hardware queue dispatch list if we fail. Once it's on
the dispatch list, we never merge with it.

Fix this from the direct issue path by flagging the request as
REQ_NOMERGE so we don't change the size of it before issue.

See also:
  https://bugzilla.kernel.org/show_bug.cgi?id=201685

Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 
'none'")
Signed-off-by: Jens Axboe 

---

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e5b17a..d8f518c6ea38 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
break;
case BLK_STS_RESOURCE:
case BLK_STS_DEV_RESOURCE:
+   /*
+* If direct dispatch fails, we cannot allow any merging on
+* this IO. Drivers (like SCSI) may have set up permanent state
+* for this request, like SG tables and mappings, and if we
+* merge to it later on then we'll still only do IO to the
+* original part.
+*/
+   rq->cmd_flags |= REQ_NOMERGE;
+
blk_mq_update_dispatch_busy(hctx, true);
__blk_mq_requeue_request(rq);
break;

-- 
Jens Axboe



[PATCHSET v5] Support for polled aio

2018-12-04 Thread Jens Axboe
For the grand introduction to this feature, see my original posting
here:

https://lore.kernel.org/linux-block/20181117235317.7366-1-ax...@kernel.dk/

and refer to the previous postings of this patchset for whatever
features were added there. Particularly v4 has some performance results:

https://lore.kernel.org/linux-block/20181130165646.27341-1-ax...@kernel.dk/

New in this version is io_ring_enter(2) and a ring based SQ/CQ
interface. The rings are mapped from user space, an application can
submit IO by writing to its own SQ ring of struct iocbs, and get
completions by reading the CQ ring of io_events. This eliminates the
need to copy iocbs and io_events completely. It also opens up the
possibility of doing polled IO without any system calls at all, if we
add a thread on the kernel side... I've done experiments with that, but
not ready yet. Doesn't require any changes to the API. I can trivially
do 1.2M IOPS with a single thread through this, and 173K IOPS with
QD=1/sync.

You can also find the patches in my aio-poll branch:

http://git.kernel.dk/cgit/linux-block/log/?h=aio-poll

or by cloning:

git://git.kernel.dk/linux-block aio-poll

Patches are against for-4.21/block.


Since v4

- Switch to using ITER_BVEC for user mapped buffers.
- Drop unneeded import_kvec().
- Use RLIMIT_MEMLOCK as a cap on total memory pinned.
- Fix poll check with min_events == 0 actually checking for events
- Add REQ_HIPRI_ASYNC
- Add ring interface and io_ring_enter(2)
- Change io_setup2() system call to accommodate the new ring interface

 Documentation/filesystems/vfs.txt  |3 +
 arch/x86/entry/syscalls/syscall_64.tbl |2 +
 block/bio.c|   33 +-
 fs/aio.c   | 1440 +---
 fs/block_dev.c |   34 +-
 fs/file.c  |   15 +-
 fs/file_table.c|   10 +-
 fs/gfs2/file.c |2 +
 fs/iomap.c |   52 +-
 fs/xfs/xfs_file.c  |1 +
 include/linux/bio.h|1 +
 include/linux/blk_types.h  |2 +
 include/linux/file.h   |2 +
 include/linux/fs.h |5 +-
 include/linux/iomap.h  |1 +
 include/linux/syscalls.h   |5 +
 include/uapi/asm-generic/unistd.h  |4 +-
 include/uapi/linux/aio_abi.h   |   32 +
 kernel/sys_ni.c|1 +
 19 files changed, 1450 insertions(+), 195 deletions(-)

-- 
Jens Axboe
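
For orientation, a hedged user-space sketch of calling the reworked io_setup2() as wired up later in this series; the syscall number 335 is taken from the x86-64 table hunk in that patch, and the flags/pointer arguments are left zero because the posted core patch rejects anything else:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

#define __NR_io_setup2 335      /* x86-64, per the syscall table hunk */

int main(void)
{
        aio_context_t ctx = 0;

        /* nr_events=128; flags, iocbs, user1, user2 must be zero for now */
        if (syscall(__NR_io_setup2, 128, 0, NULL, NULL, NULL, &ctx) < 0) {
                perror("io_setup2");
                return 1;
        }
        printf("got aio context %#llx\n", (unsigned long long)ctx);
        return 0;
}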





[PATCH 05/26] iomap: wire up the iopoll method

2018-12-04 Thread Jens Axboe
From: Christoph Hellwig 

Store the request queue the last bio was submitted to in the iocb
private data in addition to the cookie so that we find the right block
device.  Also refactor the common direct I/O bio submission code into a
nice little helper.

Signed-off-by: Christoph Hellwig 

Modified to use REQ_HIPRI_ASYNC for async polled IO.

Signed-off-by: Jens Axboe 
---
 fs/gfs2/file.c|  2 ++
 fs/iomap.c| 47 +--
 fs/xfs/xfs_file.c |  1 +
 include/linux/iomap.h |  1 +
 4 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 45a17b770d97..358157efc5b7 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = {
.llseek = gfs2_llseek,
.read_iter  = gfs2_file_read_iter,
.write_iter = gfs2_file_write_iter,
+   .iopoll = iomap_dio_iopoll,
.unlocked_ioctl = gfs2_ioctl,
.mmap   = gfs2_mmap,
.open   = gfs2_open,
@@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = {
.llseek = gfs2_llseek,
.read_iter  = gfs2_file_read_iter,
.write_iter = gfs2_file_write_iter,
+   .iopoll = iomap_dio_iopoll,
.unlocked_ioctl = gfs2_ioctl,
.mmap   = gfs2_mmap,
.open   = gfs2_open,
diff --git a/fs/iomap.c b/fs/iomap.c
index d094e5688bd3..bd483fcb7b5a 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1441,6 +1441,32 @@ struct iomap_dio {
};
 };
 
+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
+{
+   struct request_queue *q = READ_ONCE(kiocb->private);
+
+   if (!q)
+   return 0;
+   return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin);
+}
+EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
+
+static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
+   struct bio *bio)
+{
+   atomic_inc(&dio->ref);
+
+   if (dio->iocb->ki_flags & IOCB_HIPRI) {
+   if (!dio->wait_for_completion)
+   bio->bi_opf |= REQ_HIPRI_ASYNC;
+   else
+   bio->bi_opf |= REQ_HIPRI;
+   }
+
+   dio->submit.last_queue = bdev_get_queue(iomap->bdev);
+   dio->submit.cookie = submit_bio(bio);
+}
+
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
struct kiocb *iocb = dio->iocb;
@@ -1553,7 +1579,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
}
 }
 
-static blk_qc_t
+static void
 iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
unsigned len)
 {
@@ -1567,15 +1593,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap 
*iomap, loff_t pos,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
 
-   if (dio->iocb->ki_flags & IOCB_HIPRI)
-   flags |= REQ_HIPRI;
-
get_page(page);
__bio_add_page(bio, page, len, 0);
bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
-
-   atomic_inc(&dio->ref);
-   return submit_bio(bio);
+   iomap_dio_submit_bio(dio, iomap, bio);
 }
 
 static loff_t
@@ -1678,9 +1699,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, 
loff_t length,
bio_set_pages_dirty(bio);
}
 
-   if (dio->iocb->ki_flags & IOCB_HIPRI)
-   bio->bi_opf |= REQ_HIPRI;
-
iov_iter_advance(dio->submit.iter, n);
 
dio->size += n;
@@ -1688,11 +1706,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, 
loff_t length,
copied += n;
 
nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
-
-   atomic_inc(&dio->ref);
-
-   dio->submit.last_queue = bdev_get_queue(iomap->bdev);
-   dio->submit.cookie = submit_bio(bio);
+   iomap_dio_submit_bio(dio, iomap, bio);
} while (nr_pages);
 
/*
@@ -1912,6 +1926,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (dio->flags & IOMAP_DIO_WRITE_FUA)
dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
+   WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie);
+   WRITE_ONCE(iocb->private, dio->submit.last_queue);
+
if (!atomic_dec_and_test(&dio->ref)) {
if (!dio->wait_for_completion)
return -EIOCBQUEUED;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e47425071e65..60c2da41f0fc 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = {
.write_iter = xfs_file_write_iter,
.splice_read= generic_file_splice_read,
.splice_write   = iter_file_splice_write,
+   .iopoll = iomap_dio_iopoll,
.unlocked_ioctl = xfs_file_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = xfs_file_compat_ioctl,
diff --git a/inc

[PATCH 07/26] aio: separate out ring reservation from req allocation

2018-12-04 Thread Jens Axboe
From: Christoph Hellwig 

This is in preparation for certain types of IO not needing a ring
reservation.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
---
 fs/aio.c | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index cf0de61743e8..eaceb40e6cf5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -901,7 +901,7 @@ static void put_reqs_available(struct kioctx *ctx, unsigned 
nr)
local_irq_restore(flags);
 }
 
-static bool get_reqs_available(struct kioctx *ctx)
+static bool __get_reqs_available(struct kioctx *ctx)
 {
struct kioctx_cpu *kcpu;
bool ret = false;
@@ -993,6 +993,14 @@ static void user_refill_reqs_available(struct kioctx *ctx)
spin_unlock_irq(&ctx->completion_lock);
 }
 
+static bool get_reqs_available(struct kioctx *ctx)
+{
+   if (__get_reqs_available(ctx))
+   return true;
+   user_refill_reqs_available(ctx);
+   return __get_reqs_available(ctx);
+}
+
 /* aio_get_req
  * Allocate a slot for an aio request.
  * Returns NULL if no requests are free.
@@ -1001,24 +1009,15 @@ static inline struct aio_kiocb *aio_get_req(struct 
kioctx *ctx)
 {
struct aio_kiocb *req;
 
-   if (!get_reqs_available(ctx)) {
-   user_refill_reqs_available(ctx);
-   if (!get_reqs_available(ctx))
-   return NULL;
-   }
-
req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
if (unlikely(!req))
-   goto out_put;
+   return NULL;
 
percpu_ref_get(&ctx->reqs);
INIT_LIST_HEAD(&req->ki_list);
refcount_set(&req->ki_refcnt, 0);
req->ki_ctx = ctx;
return req;
-out_put:
-   put_reqs_available(ctx, 1);
-   return NULL;
 }
 
 static struct kioctx *lookup_ioctx(unsigned long ctx_id)
@@ -1805,9 +1804,13 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
return -EINVAL;
}
 
+   if (!get_reqs_available(ctx))
+   return -EAGAIN;
+
+   ret = -EAGAIN;
req = aio_get_req(ctx);
if (unlikely(!req))
-   return -EAGAIN;
+   goto out_put_reqs_available;
 
if (iocb.aio_flags & IOCB_FLAG_RESFD) {
/*
@@ -1870,11 +1873,12 @@ static int io_submit_one(struct kioctx *ctx, struct 
iocb __user *user_iocb,
goto out_put_req;
return 0;
 out_put_req:
-   put_reqs_available(ctx, 1);
percpu_ref_put(&ctx->reqs);
if (req->ki_eventfd)
eventfd_ctx_put(req->ki_eventfd);
kmem_cache_free(kiocb_cachep, req);
+out_put_reqs_available:
+   put_reqs_available(ctx, 1);
return ret;
 }
 
-- 
2.17.1



[PATCH 18/26] aio: use fget/fput_many() for file references

2018-12-04 Thread Jens Axboe
On the submission side, add file reference batching to the
aio_submit_state. We get as many references as the number of iocbs we
are submitting, and drop unused ones if we end up switching files. The
assumption here is that we're usually only dealing with one fd, and if
there are multiple, hopefully they are at least somewhat ordered. Could
trivially be extended to cover multiple fds, if needed.

On the completion side we do the same thing, except this is trivially
done just locally in aio_iopoll_reap().

Signed-off-by: Jens Axboe 
---
 fs/aio.c | 106 +++
 1 file changed, 91 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index ae0805dc814c..634b540b0c92 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -253,6 +253,15 @@ struct aio_submit_state {
 */
struct list_head req_list;
unsigned int req_count;
+
+   /*
+* File reference cache
+*/
+   struct file *file;
+   unsigned int fd;
+   unsigned int has_refs;
+   unsigned int used_refs;
+   unsigned int ios_left;
 };
 
 /*-- sysctl variables*/
@@ -1355,7 +1364,8 @@ static long aio_iopoll_reap(struct kioctx *ctx, struct 
io_event __user *evs,
 {
void *iocbs[AIO_IOPOLL_BATCH];
struct aio_kiocb *iocb, *n;
-   int to_free = 0, ret = 0;
+   int file_count, to_free = 0, ret = 0;
+   struct file *file = NULL;
 
/* Shouldn't happen... */
if (*nr_events >= max)
@@ -1372,7 +1382,20 @@ static long aio_iopoll_reap(struct kioctx *ctx, struct 
io_event __user *evs,
list_del(&iocb->ki_list);
iocbs[to_free++] = iocb;
 
-   fput(iocb->rw.ki_filp);
+   /*
+* Batched puts of the same file, to avoid dirtying the
+* file usage count multiple times, if avoidable.
+*/
+   if (!file) {
+   file = iocb->rw.ki_filp;
+   file_count = 1;
+   } else if (file == iocb->rw.ki_filp) {
+   file_count++;
+   } else {
+   fput_many(file, file_count);
+   file = iocb->rw.ki_filp;
+   file_count = 1;
+   }
 
if (evs && copy_to_user(evs + *nr_events, &iocb->ki_ev,
sizeof(iocb->ki_ev))) {
@@ -1382,6 +1405,9 @@ static long aio_iopoll_reap(struct kioctx *ctx, struct 
io_event __user *evs,
(*nr_events)++;
}
 
+   if (file)
+   fput_many(file, file_count);
+
if (to_free)
iocb_put_many(ctx, iocbs, &to_free);
 
@@ -1804,13 +1830,58 @@ static void aio_complete_rw_poll(struct kiocb *kiocb, 
long res, long res2)
}
 }
 
-static int aio_prep_rw(struct aio_kiocb *kiocb, const struct iocb *iocb)
+static void aio_file_put(struct aio_submit_state *state)
+{
+   if (state->file) {
+   int diff = state->has_refs - state->used_refs;
+
+   if (diff)
+   fput_many(state->file, diff);
+   state->file = NULL;
+   }
+}
+
+/*
+ * Get as many references to a file as we have IOs left in this submission,
+ * assuming most submissions are for one file, or at least that each file
+ * has more than one submission.
+ */
+static struct file *aio_file_get(struct aio_submit_state *state, int fd)
+{
+   if (!state)
+   return fget(fd);
+
+   if (!state->file) {
+get_file:
+   state->file = fget_many(fd, state->ios_left);
+   if (!state->file)
+   return NULL;
+
+   state->fd = fd;
+   state->has_refs = state->ios_left;
+   state->used_refs = 1;
+   state->ios_left--;
+   return state->file;
+   }
+
+   if (state->fd == fd) {
+   state->used_refs++;
+   state->ios_left--;
+   return state->file;
+   }
+
+   aio_file_put(state);
+   goto get_file;
+}
+
+static int aio_prep_rw(struct aio_kiocb *kiocb, const struct iocb *iocb,
+  struct aio_submit_state *state)
 {
struct kioctx *ctx = kiocb->ki_ctx;
struct kiocb *req = &kiocb->rw;
int ret;
 
-   req->ki_filp = fget(iocb->aio_fildes);
+   req->ki_filp = aio_file_get(state, iocb->aio_fildes);
if (unlikely(!req->ki_filp))
return -EBADF;
req->ki_pos = iocb->aio_offset;
@@ -1960,7 +2031,8 @@ static void aio_iopoll_iocb_issued(struct 
aio_submit_state *state,
 }
 
 static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
-   bool vectored, bool compat)
+   struct aio_submit_state *state, bool vectored,
+   bool compat)
 {
struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
struct kiocb *req = &kiocb->rw;
@@ -1968,7 +2040,7 @@ s

[PATCH 17/26] fs: add fget_many() and fput_many()

2018-12-04 Thread Jens Axboe
Some use cases repeatedly get and put references to the same file, but
the only exposed interface is doing these one at the time. As each of
these entail an atomic inc or dec on a shared structure, that cost can
add up.

Add fget_many(), which works just like fget(), except it takes an
argument for how many references to get on the file. Ditto fput_many(),
which can drop an arbitrary number of references to a file.

Signed-off-by: Jens Axboe 
---
 fs/file.c| 15 ++-
 fs/file_table.c  | 10 --
 include/linux/file.h |  2 ++
 include/linux/fs.h   |  3 ++-
 4 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 7ffd6e9d103d..ad9870edfd51 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -676,7 +676,7 @@ void do_close_on_exec(struct files_struct *files)
spin_unlock(&files->file_lock);
 }
 
-static struct file *__fget(unsigned int fd, fmode_t mask)
+static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
 {
struct files_struct *files = current->files;
struct file *file;
@@ -691,7 +691,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
 */
if (file->f_mode & mask)
file = NULL;
-   else if (!get_file_rcu(file))
+   else if (!get_file_rcu_many(file, refs))
goto loop;
}
rcu_read_unlock();
@@ -699,15 +699,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
return file;
 }
 
+struct file *fget_many(unsigned int fd, unsigned int refs)
+{
+   return __fget(fd, FMODE_PATH, refs);
+}
+
 struct file *fget(unsigned int fd)
 {
-   return __fget(fd, FMODE_PATH);
+   return fget_many(fd, 1);
 }
 EXPORT_SYMBOL(fget);
 
 struct file *fget_raw(unsigned int fd)
 {
-   return __fget(fd, 0);
+   return __fget(fd, 0, 1);
 }
 EXPORT_SYMBOL(fget_raw);
 
@@ -738,7 +743,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t 
mask)
return 0;
return (unsigned long)file;
} else {
-   file = __fget(fd, mask);
+   file = __fget(fd, mask, 1);
if (!file)
return 0;
return FDPUT_FPUT | (unsigned long)file;
diff --git a/fs/file_table.c b/fs/file_table.c
index e49af4caf15d..6a3964df33e4 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -326,9 +326,9 @@ void flush_delayed_fput(void)
 
 static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
-void fput(struct file *file)
+void fput_many(struct file *file, unsigned int refs)
 {
-   if (atomic_long_dec_and_test(&file->f_count)) {
+   if (atomic_long_sub_and_test(refs, &file->f_count)) {
struct task_struct *task = current;
 
if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
@@ -347,6 +347,12 @@ void fput(struct file *file)
}
 }
 
+void fput(struct file *file)
+{
+   fput_many(file, 1);
+}
+
+
 /*
  * synchronous analog of fput(); for kernel threads that might be needed
  * in some umount() (and thus can't use flush_delayed_fput() without
diff --git a/include/linux/file.h b/include/linux/file.h
index 6b2fb032416c..3fcddff56bc4 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -13,6 +13,7 @@
 struct file;
 
 extern void fput(struct file *);
+extern void fput_many(struct file *, unsigned int);
 
 struct file_operations;
 struct vfsmount;
@@ -44,6 +45,7 @@ static inline void fdput(struct fd fd)
 }
 
 extern struct file *fget(unsigned int fd);
+extern struct file *fget_many(unsigned int fd, unsigned int refs);
 extern struct file *fget_raw(unsigned int fd);
 extern unsigned long __fdget(unsigned int fd);
 extern unsigned long __fdget_raw(unsigned int fd);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6a5f71f8ae06..dc54a65c401a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -952,7 +952,8 @@ static inline struct file *get_file(struct file *f)
atomic_long_inc(&f->f_count);
return f;
 }
-#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count)
+#define get_file_rcu_many(x, cnt) atomic_long_add_unless(&(x)->f_count, (cnt), 
0)
+#define get_file_rcu(x) get_file_rcu_many((x), 1)
 #define fput_atomic(x) atomic_long_add_unless(&(x)->f_count, -1, 1)
 #define file_count(x)  atomic_long_read(&(x)->f_count)
 
-- 
2.17.1
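
A short sketch of the intended usage pattern, mirroring what the aio submit-state file cache in this series does; the function and the submission loop here are illustrative only:

#include <linux/file.h>

static int example_submit_batch(unsigned int fd, unsigned int nr)
{
        struct file *file;
        unsigned int used = 0;

        file = fget_many(fd, nr);               /* one atomic add covers nr refs */
        if (!file)
                return -EBADF;

        /* ... issue up to nr I/Os against 'file', counting them in 'used' ... */

        if (used < nr)
                fput_many(file, nr - used);     /* return the unused references */
        return 0;
}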



[PATCH 06/26] aio: use assigned completion handler

2018-12-04 Thread Jens Axboe
We know this is a read/write request, but in preparation for
having different kinds of those, ensure that we call the assigned
handler instead of assuming it's aio_complete_rq().

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
---
 fs/aio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index 05647d352bf3..cf0de61743e8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1490,7 +1490,7 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t 
ret)
ret = -EINTR;
/*FALLTHRU*/
default:
-   aio_complete_rw(req, ret, 0);
+   req->ki_complete(req, ret, 0);
}
 }
 
-- 
2.17.1



[PATCH 10/26] aio: use iocb_put() instead of open coding it

2018-12-04 Thread Jens Axboe
Replace the percpu_ref_put() + kmem_cache_free() with a call to
iocb_put() instead.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
---
 fs/aio.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index ed6c3914477a..cf93b92bfb1e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1884,10 +1884,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
goto out_put_req;
return 0;
 out_put_req:
-   percpu_ref_put(&ctx->reqs);
if (req->ki_eventfd)
eventfd_ctx_put(req->ki_eventfd);
-   kmem_cache_free(kiocb_cachep, req);
+   iocb_put(req);
 out_put_reqs_available:
put_reqs_available(ctx, 1);
return ret;
-- 
2.17.1



[PATCH 09/26] aio: only use blk plugs for > 2 depth submissions

2018-12-04 Thread Jens Axboe
Plugging is meant to optimize submission of a string of IOs. If we don't
have more than 2 being submitted, don't bother setting up a plug.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
---
 fs/aio.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 522c04864d82..ed6c3914477a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -69,6 +69,12 @@ struct aio_ring {
struct io_event io_events[0];
 }; /* 128 bytes + ring size */
 
+/*
+ * Plugging is meant to work with larger batches of IOs. If we don't
+ * have more than the below, then don't bother setting up a plug.
+ */
+#define AIO_PLUG_THRESHOLD 2
+
 #define AIO_RING_PAGES 8
 
 struct kioctx_table {
@@ -1919,7 +1925,8 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, 
nr,
if (nr > ctx->nr_events)
nr = ctx->nr_events;
 
-   blk_start_plug(&plug);
+   if (nr > AIO_PLUG_THRESHOLD)
+   blk_start_plug(&plug);
for (i = 0; i < nr; i++) {
struct iocb __user *user_iocb;
 
@@ -1932,7 +1939,8 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, 
nr,
if (ret)
break;
}
-   blk_finish_plug(&plug);
+   if (nr > AIO_PLUG_THRESHOLD)
+   blk_finish_plug(&plug);
 
percpu_ref_put(&ctx->users);
return i ? i : ret;
@@ -1959,7 +1967,8 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, 
ctx_id,
if (nr > ctx->nr_events)
nr = ctx->nr_events;
 
-   blk_start_plug(&plug);
+   if (nr > AIO_PLUG_THRESHOLD)
+   blk_start_plug(&plug);
for (i = 0; i < nr; i++) {
compat_uptr_t user_iocb;
 
@@ -1972,7 +1981,8 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, 
ctx_id,
if (ret)
break;
}
-   blk_finish_plug(&plug);
+   if (nr > AIO_PLUG_THRESHOLD)
+   blk_finish_plug(&plug);
 
percpu_ref_put(&ctx->users);
return i ? i : ret;
-- 
2.17.1



[PATCH 08/26] aio: don't zero entire aio_kiocb in aio_get_req()

2018-12-04 Thread Jens Axboe
It's 192 bytes, fairly substantial. Most items don't need to be cleared,
especially not upfront. Clear the ones we do need to clear, and leave
the other ones for setup when the iocb is prepared and submitted.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
---
 fs/aio.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index eaceb40e6cf5..522c04864d82 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1009,14 +1009,15 @@ static inline struct aio_kiocb *aio_get_req(struct 
kioctx *ctx)
 {
struct aio_kiocb *req;
 
-   req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
+   req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
if (unlikely(!req))
return NULL;
 
percpu_ref_get(&ctx->reqs);
+   req->ki_ctx = ctx;
INIT_LIST_HEAD(&req->ki_list);
refcount_set(&req->ki_refcnt, 0);
-   req->ki_ctx = ctx;
+   req->ki_eventfd = NULL;
return req;
 }
 
@@ -1730,6 +1731,10 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, struct 
iocb *iocb)
if (unlikely(!req->file))
return -EBADF;
 
+   req->head = NULL;
+   req->woken = false;
+   req->cancelled = false;
+
apt.pt._qproc = aio_poll_queue_proc;
apt.pt._key = req->events;
apt.iocb = aiocb;
-- 
2.17.1



[PATCH 12/26] aio: abstract out io_event filler helper

2018-12-04 Thread Jens Axboe
Signed-off-by: Jens Axboe 
---
 fs/aio.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 06c8bcc72496..173f1f79dc8f 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1063,6 +1063,15 @@ static inline void iocb_put(struct aio_kiocb *iocb)
}
 }
 
+static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb,
+  long res, long res2)
+{
+   ev->obj = (u64)(unsigned long)iocb->ki_user_iocb;
+   ev->data = iocb->ki_user_data;
+   ev->res = res;
+   ev->res2 = res2;
+}
+
 /* aio_complete
  * Called when the io request on the given iocb is complete.
  */
@@ -1090,10 +1099,7 @@ static void aio_complete(struct aio_kiocb *iocb, long 
res, long res2)
ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
event = ev_page + pos % AIO_EVENTS_PER_PAGE;
 
-   event->obj = (u64)(unsigned long)iocb->ki_user_iocb;
-   event->data = iocb->ki_user_data;
-   event->res = res;
-   event->res2 = res2;
+   aio_fill_event(event, iocb, res, res2);
 
kunmap_atomic(ev_page);
flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
-- 
2.17.1



[PATCH 22/26] block: implement bio helper to add iter bvec pages to bio

2018-12-04 Thread Jens Axboe
For an ITER_BVEC, we can just iterate the iov and add the pages
to the bio directly.

Signed-off-by: Jens Axboe 
---
 block/bio.c | 27 +++
 include/linux/bio.h |  1 +
 2 files changed, 28 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index ab174bce5436..0e466d06f748 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -903,6 +903,33 @@ int bio_iov_iter_get_pages(struct bio *bio, struct 
iov_iter *iter)
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
+/**
+ * bio_iov_bvec_add_pages - add pages from an ITER_BVEC to a bio
+ * @bio: bio to add pages to
+ * @iter: iov iterator describing the region to be added
+ *
+ * Iterate pages in the @iter and add them to the bio. We flag the
+ * @bio with BIO_HOLD_PAGES, telling IO completion not to free them.
+ */
+int bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
+{
+   unsigned short orig_vcnt = bio->bi_vcnt;
+   const struct bio_vec *bv;
+
+   do {
+   size_t size;
+
+   bv = iter->bvec + iter->iov_offset;
+   size = bio_add_page(bio, bv->bv_page, bv->bv_len, 
bv->bv_offset);
+   if (size != bv->bv_len)
+   break;
+   iov_iter_advance(iter, size);
+   } while (iov_iter_count(iter) && !bio_full(bio));
+
+   bio_set_flag(bio, BIO_HOLD_PAGES);
+   return bio->bi_vcnt > orig_vcnt ? 0 : -EINVAL;
+}
+
 static void submit_bio_wait_endio(struct bio *bio)
 {
complete(bio->bi_private);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 056fb627edb3..39ddb53e7e7f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -434,6 +434,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page 
*page,
 void __bio_add_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
+int bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter);
 struct rq_map_data;
 extern struct bio *bio_map_user_iov(struct request_queue *,
struct iov_iter *, gfp_t);
-- 
2.17.1
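
A hedged sketch of how a caller might choose between the two helpers; the wrapper function is made up for illustration, only the two bio helpers and iov_iter_is_bvec() are existing interfaces:

#include <linux/bio.h>
#include <linux/uio.h>

static int example_bio_add_iter_pages(struct bio *bio, struct iov_iter *iter)
{
        /* Bvec iters already carry pinned pages; skip the get_user_pages() path. */
        if (iov_iter_is_bvec(iter))
                return bio_iov_bvec_add_pages(bio, iter);
        return bio_iov_iter_get_pages(bio, iter);
}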



[PATCH 02/26] block: add REQ_HIPRI_ASYNC

2018-12-04 Thread Jens Axboe
For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already
has async polled IO in-flight, but can't wait for them to complete
since polled requests must be actively found and reaped.

Signed-off-by: Jens Axboe 
---
 include/linux/blk_types.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c0ba1a038ff3..1d9c3da0f2a1 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -346,6 +346,7 @@ enum req_flag_bits {
 #define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
 #define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP)
 #define REQ_HIPRI  (1ULL << __REQ_HIPRI)
+#define REQ_HIPRI_ASYNC(REQ_HIPRI | REQ_NOWAIT)
 
 #define REQ_DRV(1ULL << __REQ_DRV)
 #define REQ_SWAP   (1ULL << __REQ_SWAP)
-- 
2.17.1



[PATCH 21/26] block: add BIO_HOLD_PAGES flag

2018-12-04 Thread Jens Axboe
For user mapped IO, we do get_user_pages() upfront, and then do a
put_page() on each page at end_io time to release the page reference. In
preparation for having permanently mapped pages, add a BIO_HOLD_PAGES
flag that tells us not to release the pages, the caller will do that.

Signed-off-by: Jens Axboe 
---
 block/bio.c   | 6 --
 include/linux/blk_types.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 03895cc0d74a..ab174bce5436 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1635,7 +1635,8 @@ static void bio_dirty_fn(struct work_struct *work)
next = bio->bi_private;
 
bio_set_pages_dirty(bio);
-   bio_release_pages(bio);
+   if (!bio_flagged(bio, BIO_HOLD_PAGES))
+   bio_release_pages(bio);
bio_put(bio);
}
 }
@@ -1651,7 +1652,8 @@ void bio_check_pages_dirty(struct bio *bio)
goto defer;
}
 
-   bio_release_pages(bio);
+   if (!bio_flagged(bio, BIO_HOLD_PAGES))
+   bio_release_pages(bio);
bio_put(bio);
return;
 defer:
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1d9c3da0f2a1..344e5b61aa4e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -227,6 +227,7 @@ struct bio {
 #define BIO_TRACE_COMPLETION 10/* bio_endio() should trace the final 
completion
 * of this bio. */
 #define BIO_QUEUE_ENTERED 11   /* can use blk_queue_enter_live() */
+#define BIO_HOLD_PAGES 12  /* don't put O_DIRECT pages */
 
 /* See BVEC_POOL_OFFSET below before adding new flags */
 
-- 
2.17.1



[PATCH 19/26] aio: split iocb init from allocation

2018-12-04 Thread Jens Axboe
In preparation for having pre-allocated requests that we then just
need to initialize before use.

Signed-off-by: Jens Axboe 
---
 fs/aio.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 634b540b0c92..416bb2e365e0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1099,6 +1099,16 @@ static bool get_reqs_available(struct kioctx *ctx)
return __get_reqs_available(ctx);
 }
 
+static void aio_iocb_init(struct kioctx *ctx, struct aio_kiocb *req)
+{
+   percpu_ref_get(&ctx->reqs);
+   req->ki_ctx = ctx;
+   INIT_LIST_HEAD(&req->ki_list);
+   req->ki_flags = 0;
+   refcount_set(&req->ki_refcnt, 0);
+   req->ki_eventfd = NULL;
+}
+
 /* aio_get_req
  * Allocate a slot for an aio request.
  * Returns NULL if no requests are free.
@@ -1111,12 +1121,7 @@ static inline struct aio_kiocb *aio_get_req(struct 
kioctx *ctx)
if (unlikely(!req))
return NULL;
 
-   percpu_ref_get(&ctx->reqs);
-   req->ki_ctx = ctx;
-   INIT_LIST_HEAD(&req->ki_list);
-   req->ki_flags = 0;
-   refcount_set(&req->ki_refcnt, 0);
-   req->ki_eventfd = NULL;
+   aio_iocb_init(ctx, req);
return req;
 }
 
-- 
2.17.1



[PATCH 20/26] aio: batch aio_kiocb allocation

2018-12-04 Thread Jens Axboe
Similarly to how we use the state->ios_left to know how many references
to get to a file, we can use it to allocate the aio_kiocb's we need in
bulk.

Signed-off-by: Jens Axboe 
---
 fs/aio.c | 47 +++
 1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 416bb2e365e0..1735ec556089 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -240,6 +240,8 @@ struct aio_kiocb {
};
 };
 
+#define AIO_IOPOLL_BATCH   8
+
 struct aio_submit_state {
struct kioctx *ctx;
 
@@ -254,6 +256,13 @@ struct aio_submit_state {
struct list_head req_list;
unsigned int req_count;
 
+   /*
+* aio_kiocb alloc cache
+*/
+   void *iocbs[AIO_IOPOLL_BATCH];
+   unsigned int free_iocbs;
+   unsigned int cur_iocb;
+
/*
 * File reference cache
 */
@@ -1113,15 +1122,35 @@ static void aio_iocb_init(struct kioctx *ctx, struct 
aio_kiocb *req)
  * Allocate a slot for an aio request.
  * Returns NULL if no requests are free.
  */
-static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
+static struct aio_kiocb *aio_get_req(struct kioctx *ctx,
+struct aio_submit_state *state)
 {
struct aio_kiocb *req;
 
-   req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
-   if (unlikely(!req))
-   return NULL;
+   if (!state)
+   req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
+   else if (!state->free_iocbs) {
+   size_t size;
+
+   size = min_t(size_t, state->ios_left, ARRAY_SIZE(state->iocbs));
+   size = kmem_cache_alloc_bulk(kiocb_cachep, GFP_KERNEL, size,
+   state->iocbs);
+   if (size < 0)
+   return ERR_PTR(size);
+   else if (!size)
+   return ERR_PTR(-ENOMEM);
+   state->free_iocbs = size - 1;
+   state->cur_iocb = 1;
+   req = state->iocbs[0];
+   } else {
+   req = state->iocbs[state->cur_iocb];
+   state->free_iocbs--;
+   state->cur_iocb++;
+   }
+
+   if (req)
+   aio_iocb_init(ctx, req);
 
-   aio_iocb_init(ctx, req);
return req;
 }
 
@@ -1359,8 +1388,6 @@ static bool aio_read_events(struct kioctx *ctx, long 
min_nr, long nr,
return ret < 0 || *i >= min_nr;
 }
 
-#define AIO_IOPOLL_BATCH   8
-
 /*
  * Process completed iocb iopoll entries, copying the result to userspace.
  */
@@ -2357,7 +2384,7 @@ static int __io_submit_one(struct kioctx *ctx, const 
struct iocb *iocb,
return -EAGAIN;
 
ret = -EAGAIN;
-   req = aio_get_req(ctx);
+   req = aio_get_req(ctx, state);
if (unlikely(!req))
goto out_put_reqs_available;
 
@@ -2489,6 +2516,9 @@ static void aio_submit_state_end(struct aio_submit_state 
*state)
if (!list_empty(&state->req_list))
aio_flush_state_reqs(state->ctx, state);
aio_file_put(state);
+   if (state->free_iocbs)
+   kmem_cache_free_bulk(kiocb_cachep, state->free_iocbs,
+   &state->iocbs[state->cur_iocb]);
 }
 
 /*
@@ -2500,6 +2530,7 @@ static void aio_submit_state_start(struct 
aio_submit_state *state,
state->ctx = ctx;
INIT_LIST_HEAD(&state->req_list);
state->req_count = 0;
+   state->free_iocbs = 0;
state->file = NULL;
state->ios_left = max_ios;
 #ifdef CONFIG_BLOCK
-- 
2.17.1



[PATCH 13/26] aio: add io_setup2() system call

2018-12-04 Thread Jens Axboe
This is just like io_setup(), except it adds a flags argument to let the
caller control/define some of the io_context behavior.

Outside of the flags, we add an iocb array and two user pointers for
future use.

Signed-off-by: Jens Axboe 
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 fs/aio.c   | 69 --
 include/linux/syscalls.h   |  3 ++
 include/uapi/asm-generic/unistd.h  |  4 +-
 kernel/sys_ni.c|  1 +
 5 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..67c357225fb0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
 332common  statx   __x64_sys_statx
 333common  io_pgetevents   __x64_sys_io_pgetevents
 334common  rseq__x64_sys_rseq
+335common  io_setup2   __x64_sys_io_setup2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/aio.c b/fs/aio.c
index 173f1f79dc8f..26631d6872d2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -100,6 +100,8 @@ struct kioctx {
 
unsigned long   user_id;
 
+   unsigned intflags;
+
struct __percpu kioctx_cpu *cpu;
 
/*
@@ -686,10 +688,8 @@ static void aio_nr_sub(unsigned nr)
spin_unlock(&aio_nr_lock);
 }
 
-/* ioctx_alloc
- * Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
- */
-static struct kioctx *ioctx_alloc(unsigned nr_events)
+static struct kioctx *io_setup_flags(unsigned long ctxid,
+unsigned int nr_events, unsigned int flags)
 {
struct mm_struct *mm = current->mm;
struct kioctx *ctx;
@@ -701,6 +701,12 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 */
unsigned int max_reqs = nr_events;
 
+   if (unlikely(ctxid || nr_events == 0)) {
+   pr_debug("EINVAL: ctx %lu nr_events %u\n",
+ctxid, nr_events);
+   return ERR_PTR(-EINVAL);
+   }
+
/*
 * We keep track of the number of available ringbuffer slots, to prevent
 * overflow (reqs_available), and we also use percpu counters for this.
@@ -726,6 +732,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
if (!ctx)
return ERR_PTR(-ENOMEM);
 
+   ctx->flags = flags;
ctx->max_reqs = max_reqs;
 
spin_lock_init(&ctx->ctx_lock);
@@ -1281,6 +1288,34 @@ static long read_events(struct kioctx *ctx, long min_nr, 
long nr,
return ret;
 }
 
+SYSCALL_DEFINE6(io_setup2, u32, nr_events, u32, flags, struct iocb __user *,
+   iocbs, void __user *, user1, void __user *, user2,
+   aio_context_t __user *, ctxp)
+{
+   struct kioctx *ioctx;
+   unsigned long ctx;
+   long ret;
+
+   if (flags || user1 || user2)
+   return -EINVAL;
+
+   ret = get_user(ctx, ctxp);
+   if (unlikely(ret))
+   goto out;
+
+   ioctx = io_setup_flags(ctx, nr_events, flags);
+   ret = PTR_ERR(ioctx);
+   if (IS_ERR(ioctx))
+   goto out;
+
+   ret = put_user(ioctx->user_id, ctxp);
+   if (ret)
+   kill_ioctx(current->mm, ioctx, NULL);
+   percpu_ref_put(&ioctx->users);
+out:
+   return ret;
+}
+
 /* sys_io_setup:
  * Create an aio_context capable of receiving at least nr_events.
  * ctxp must not point to an aio_context that already exists, and
@@ -1296,7 +1331,7 @@ static long read_events(struct kioctx *ctx, long min_nr, 
long nr,
  */
 SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
 {
-   struct kioctx *ioctx = NULL;
+   struct kioctx *ioctx;
unsigned long ctx;
long ret;
 
@@ -1304,14 +1339,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, 
aio_context_t __user *, ctxp)
if (unlikely(ret))
goto out;
 
-   ret = -EINVAL;
-   if (unlikely(ctx || nr_events == 0)) {
-   pr_debug("EINVAL: ctx %lu nr_events %u\n",
-ctx, nr_events);
-   goto out;
-   }
-
-   ioctx = ioctx_alloc(nr_events);
+   ioctx = io_setup_flags(ctx, nr_events, 0);
ret = PTR_ERR(ioctx);
if (!IS_ERR(ioctx)) {
ret = put_user(ioctx->user_id, ctxp);
@@ -1327,7 +1355,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, 
aio_context_t __user *, ctxp)
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, u32 __user *, ctx32p)
 {
-   struct kioctx *ioctx = NULL;
+   struct kioctx *ioctx;
unsigned long ctx;
long ret;
 
@@ -1335,23 +1363,14 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, 
u32 __user *, ctx32p)
if (unlikely(ret))
goto out;
 
-   ret = -EINVAL;
-   if

[PATCH 11/26] aio: split out iocb copy from io_submit_one()

2018-12-04 Thread Jens Axboe
In preparation for handing in iocbs in a different fashion as well. Also
make it clear that the iocb being passed in isn't modified, by marking
it const throughout.

Signed-off-by: Jens Axboe 
---
 fs/aio.c | 68 +++-
 1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index cf93b92bfb1e..06c8bcc72496 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1420,7 +1420,7 @@ static void aio_complete_rw(struct kiocb *kiocb, long 
res, long res2)
aio_complete(iocb, res, res2);
 }
 
-static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
+static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 {
int ret;
 
@@ -1461,7 +1461,7 @@ static int aio_prep_rw(struct kiocb *req, struct iocb 
*iocb)
return ret;
 }
 
-static int aio_setup_rw(int rw, struct iocb *iocb, struct iovec **iovec,
+static int aio_setup_rw(int rw, const struct iocb *iocb, struct iovec **iovec,
bool vectored, bool compat, struct iov_iter *iter)
 {
void __user *buf = (void __user *)(uintptr_t)iocb->aio_buf;
@@ -1500,8 +1500,8 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t 
ret)
}
 }
 
-static ssize_t aio_read(struct kiocb *req, struct iocb *iocb, bool vectored,
-   bool compat)
+static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb,
+   bool vectored, bool compat)
 {
struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
struct iov_iter iter;
@@ -1533,8 +1533,8 @@ static ssize_t aio_read(struct kiocb *req, struct iocb 
*iocb, bool vectored,
return ret;
 }
 
-static ssize_t aio_write(struct kiocb *req, struct iocb *iocb, bool vectored,
-   bool compat)
+static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb,
+bool vectored, bool compat)
 {
struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
struct iov_iter iter;
@@ -1589,7 +1589,8 @@ static void aio_fsync_work(struct work_struct *work)
aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0);
 }
 
-static int aio_fsync(struct fsync_iocb *req, struct iocb *iocb, bool datasync)
+static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb,
+bool datasync)
 {
if (unlikely(iocb->aio_buf || iocb->aio_offset || iocb->aio_nbytes ||
iocb->aio_rw_flags))
@@ -1717,7 +1718,7 @@ aio_poll_queue_proc(struct file *file, struct 
wait_queue_head *head,
add_wait_queue(head, &pt->iocb->poll.wait);
 }
 
-static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
+static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 {
struct kioctx *ctx = aiocb->ki_ctx;
struct poll_iocb *req = &aiocb->poll;
@@ -1789,27 +1790,23 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, struct 
iocb *iocb)
return 0;
 }
 
-static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
-bool compat)
+static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
+  struct iocb __user *user_iocb, bool compat)
 {
struct aio_kiocb *req;
-   struct iocb iocb;
ssize_t ret;
 
-   if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb
-   return -EFAULT;
-
/* enforce forwards compatibility on users */
-   if (unlikely(iocb.aio_reserved2)) {
+   if (unlikely(iocb->aio_reserved2)) {
pr_debug("EINVAL: reserve field set\n");
return -EINVAL;
}
 
/* prevent overflows */
if (unlikely(
-   (iocb.aio_buf != (unsigned long)iocb.aio_buf) ||
-   (iocb.aio_nbytes != (size_t)iocb.aio_nbytes) ||
-   ((ssize_t)iocb.aio_nbytes < 0)
+   (iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
+   (iocb->aio_nbytes != (size_t)iocb->aio_nbytes) ||
+   ((ssize_t)iocb->aio_nbytes < 0)
   )) {
pr_debug("EINVAL: overflow check\n");
return -EINVAL;
@@ -1823,14 +1820,14 @@ static int io_submit_one(struct kioctx *ctx, struct 
iocb __user *user_iocb,
if (unlikely(!req))
goto out_put_reqs_available;
 
-   if (iocb.aio_flags & IOCB_FLAG_RESFD) {
+   if (iocb->aio_flags & IOCB_FLAG_RESFD) {
/*
 * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
 * instance of the file* now. The file descriptor must be
 * an eventfd() fd, and will be signaled for each completed
 * event using the eventfd_signal() function.
 */
-   req->ki_eventfd = eventfd_ctx_fdget((int) iocb.aio_resfd);
+   req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
if (IS_ERR(req->ki_eventfd)) {
ret = PTR_ERR(req->ki_eventf

[PATCH 04/26] block: use REQ_HIPRI_ASYNC for non-sync polled IO

2018-12-04 Thread Jens Axboe
Tell the block layer if it's a sync or async polled request, so it can
do the right thing.

Signed-off-by: Jens Axboe 
---
 fs/block_dev.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6de8d35f6e41..b8f574615792 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -402,8 +402,12 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
 
nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
if (!nr_pages) {
-   if (iocb->ki_flags & IOCB_HIPRI)
-   bio->bi_opf |= REQ_HIPRI;
+   if (iocb->ki_flags & IOCB_HIPRI) {
+   if (!is_sync)
+   bio->bi_opf |= REQ_HIPRI_ASYNC;
+   else
+   bio->bi_opf |= REQ_HIPRI;
+   }
 
qc = submit_bio(bio);
WRITE_ONCE(iocb->ki_cookie, qc);
-- 
2.17.1



[PATCH 15/26] aio: support for IO polling

2018-12-04 Thread Jens Axboe
Add polled variants of PREAD/PREADV and PWRITE/PWRITEV. These act
like their non-polled counterparts, except we expect to poll for
completion of them. The polling happens at io_getevents() time, and
works just like non-polled IO.

To set up an io_context for polled IO, the application must call
io_setup2() with IOCTX_FLAG_IOPOLL as one of the flags. It is illegal
to mix and match polled and non-polled IO on an io_context.

Polled IO doesn't support the user mapped completion ring. Events
must be reaped through the io_getevents() system call. For non-irq
driven poll devices, there's no way to support completion reaping
from userspace by just looking at the ring. The application itself
is the one that pulls completion entries.
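
For reference, the intended usage from an application looks roughly like
the sketch below: io_setup2() is the raw-syscall wrapper sketched in the
io_setup2 patch, io_submit()/io_getevents() are assumed to be thin
wrappers around the existing raw syscalls, fd is opened with O_DIRECT and
buf is suitably aligned (error handling trimmed).

static int polled_read(int fd, void *buf, size_t len)
{
	aio_context_t ctx = 0;
	struct iocb iocb = { 0 }, *iocbs[1] = { &iocb };
	struct io_event ev;

	/* polled context: all IO on it must be polled as well */
	if (io_setup2(64, IOCTX_FLAG_IOPOLL, NULL, NULL, NULL, &ctx) < 0)
		return -1;

	iocb.aio_fildes = fd;
	iocb.aio_lio_opcode = IOCB_CMD_PREAD;
	iocb.aio_buf = (unsigned long) buf;
	iocb.aio_nbytes = len;

	if (io_submit(ctx, 1, iocbs) != 1)
		return -1;

	/* no user-mapped ring with IOPOLL: reap through io_getevents() */
	return io_getevents(ctx, 1, 1, &ev, NULL) == 1 ? 0 : -1;
}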

Signed-off-by: Jens Axboe 
---
 fs/aio.c | 393 +++
 include/uapi/linux/aio_abi.h |   3 +
 2 files changed, 360 insertions(+), 36 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index bb6f07ca6940..5d34317c2929 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -153,6 +153,18 @@ struct kioctx {
atomic_treqs_available;
} cacheline_aligned_in_smp;
 
+   /* iopoll submission state */
+   struct {
+   spinlock_t poll_lock;
+   struct list_head poll_submitted;
+   } cacheline_aligned_in_smp;
+
+   /* iopoll completion state */
+   struct {
+   struct list_head poll_completing;
+   struct mutex getevents_lock;
+   } cacheline_aligned_in_smp;
+
struct {
spinlock_t  ctx_lock;
struct list_head active_reqs;   /* used for cancellation */
@@ -205,14 +217,27 @@ struct aio_kiocb {
__u64   ki_user_data;   /* user's data for completion */
 
struct list_headki_list;/* the aio core uses this
-* for cancellation */
+* for cancellation, or for
+* polled IO */
+
+   unsigned long   ki_flags;
+#define IOCB_POLL_COMPLETED0   /* polled IO has completed */
+#define IOCB_POLL_EAGAIN   1   /* polled submission got EAGAIN */
+
refcount_t  ki_refcnt;
 
-   /*
-* If the aio_resfd field of the userspace iocb is not zero,
-* this is the underlying eventfd context to deliver events to.
-*/
-   struct eventfd_ctx  *ki_eventfd;
+   union {
+   /*
+* If the aio_resfd field of the userspace iocb is not zero,
+* this is the underlying eventfd context to deliver events to.
+*/
+   struct eventfd_ctx  *ki_eventfd;
+
+   /*
+* For polled IO, stash completion info here
+*/
+   struct io_event ki_ev;
+   };
 };
 
 /*-- sysctl variables*/
@@ -233,6 +258,7 @@ static const unsigned int iocb_page_shift =
ilog2(PAGE_SIZE / sizeof(struct iocb));
 
 static void aio_useriocb_unmap(struct kioctx *);
+static void aio_iopoll_reap_events(struct kioctx *);
 
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
@@ -471,11 +497,15 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned 
int nr_events)
int i;
struct file *file;
 
-   /* Compensate for the ring buffer's head/tail overlap entry */
-   nr_events += 2; /* 1 is required, 2 for good luck */
-
+   /*
+* Compensate for the ring buffer's head/tail overlap entry.
+* IO polling doesn't require any io event entries
+*/
size = sizeof(struct aio_ring);
-   size += sizeof(struct io_event) * nr_events;
+   if (!(ctx->flags & IOCTX_FLAG_IOPOLL)) {
+   nr_events += 2; /* 1 is required, 2 for good luck */
+   size += sizeof(struct io_event) * nr_events;
+   }
 
nr_pages = PFN_UP(size);
if (nr_pages < 0)
@@ -758,6 +788,11 @@ static struct kioctx *io_setup_flags(unsigned long ctxid,
 
INIT_LIST_HEAD(&ctx->active_reqs);
 
+   spin_lock_init(&ctx->poll_lock);
+   INIT_LIST_HEAD(&ctx->poll_submitted);
+   INIT_LIST_HEAD(&ctx->poll_completing);
+   mutex_init(&ctx->getevents_lock);
+
if (percpu_ref_init(&ctx->users, free_ioctx_users, 0, GFP_KERNEL))
goto err;
 
@@ -829,11 +864,15 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx 
*ctx,
 {
struct kioctx_table *table;
 
+   mutex_lock(&ctx->getevents_lock);
spin_lock(&mm->ioctx_lock);
if (atomic_xchg(&ctx->dead, 1)) {
spin_unlock(&mm->ioctx_lock);
+   mutex_unlock(&ctx->getevents_lock);
return -EINVAL;
}
+   aio_iopoll_reap_events(ctx);
+   mutex_unlock(&ctx->getevents_lock);
 
table = rcu_dereference_raw(mm->ioctx_

[PATCH 14/26] aio: add support for having user mapped iocbs

2018-12-04 Thread Jens Axboe
For io_submit(), we have to first copy each pointer to an iocb, then
copy the iocb. The latter is 64 bytes in size, and that's a lot of
copying for a single IO.

Add support for setting IOCTX_FLAG_USERIOCB through the new io_setup2()
system call, which allows the iocbs to reside in userspace. If this flag
is used, then io_submit() doesn't take pointers to iocbs anymore, it
takes an index value into the array of iocbs instead. Similarly, for
io_getevents(), the iocb->obj will be the index, not the pointer to the
iocb.

See the change made to fio to support this feature; it's pretty
trivial to adapt to. For applications, like fio, that previously
embedded the iocb inside an application-private structure, some sort
of lookup table/structure is needed to find the private IO structure
from the index at io_getevents() time.

http://git.kernel.dk/cgit/fio/commit/?id=3c3168e91329c83880c91e5abc28b9d6b940fd95
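
For reference, a rough sketch of the user-mapped iocb flow, building on
the includes and wrappers from the earlier sketches (error handling
trimmed; fd and buf are assumed). The iocb array must be page aligned,
and with IOCTX_FLAG_USERIOCB the value handed to io_submit() is the
index of the iocb in that array rather than a pointer to it:

#define NR_EVENTS	128

static int useriocb_read(int fd, void *buf)
{
	struct iocb *iocbs, *idx;
	aio_context_t ctx = 0;

	if (posix_memalign((void **) &iocbs, 4096,
			   NR_EVENTS * sizeof(*iocbs)))
		return -1;
	memset(iocbs, 0, NR_EVENTS * sizeof(*iocbs));

	if (io_setup2(NR_EVENTS, IOCTX_FLAG_USERIOCB, iocbs, NULL, NULL,
		      &ctx) < 0)
		return -1;

	/* fill in slot 0, then submit it by index */
	iocbs[0].aio_fildes = fd;
	iocbs[0].aio_lio_opcode = IOCB_CMD_PREAD;
	iocbs[0].aio_buf = (unsigned long) buf;
	iocbs[0].aio_nbytes = 4096;

	idx = (struct iocb *) (uintptr_t) 0;	/* index 0, not a pointer */
	return io_submit(ctx, 1, &idx) == 1 ? 0 : -1;
}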

Signed-off-by: Jens Axboe 
---
 fs/aio.c | 126 +++
 include/uapi/linux/aio_abi.h |   2 +
 2 files changed, 116 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 26631d6872d2..bb6f07ca6940 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -92,6 +92,11 @@ struct ctx_rq_wait {
atomic_t count;
 };
 
+struct aio_mapped_range {
+   struct page **pages;
+   long nr_pages;
+};
+
 struct kioctx {
struct percpu_ref   users;
atomic_tdead;
@@ -127,6 +132,8 @@ struct kioctx {
struct page **ring_pages;
longnr_pages;
 
+   struct aio_mapped_range iocb_range;
+
struct rcu_work free_rwork; /* see free_ioctx() */
 
/*
@@ -222,6 +229,11 @@ static struct vfsmount *aio_mnt;
 static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
+static const unsigned int iocb_page_shift =
+   ilog2(PAGE_SIZE / sizeof(struct iocb));
+
+static void aio_useriocb_unmap(struct kioctx *);
+
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
struct file *file;
@@ -578,6 +590,7 @@ static void free_ioctx(struct work_struct *work)
  free_rwork);
pr_debug("freeing %p\n", ctx);
 
+   aio_useriocb_unmap(ctx);
aio_free_ring(ctx);
free_percpu(ctx->cpu);
percpu_ref_exit(&ctx->reqs);
@@ -1288,6 +1301,70 @@ static long read_events(struct kioctx *ctx, long min_nr, 
long nr,
return ret;
 }
 
+static struct iocb *aio_iocb_from_index(struct kioctx *ctx, int index)
+{
+   struct iocb *iocb;
+
+   iocb = page_address(ctx->iocb_range.pages[index >> iocb_page_shift]);
+   index &= ((1 << iocb_page_shift) - 1);
+   return iocb + index;
+}
+
+static void aio_unmap_range(struct aio_mapped_range *range)
+{
+   int i;
+
+   if (!range->nr_pages)
+   return;
+
+   for (i = 0; i < range->nr_pages; i++)
+   put_page(range->pages[i]);
+
+   kfree(range->pages);
+   range->pages = NULL;
+   range->nr_pages = 0;
+}
+
+static int aio_map_range(struct aio_mapped_range *range, void __user *uaddr,
+size_t size, int gup_flags)
+{
+   int nr_pages, ret;
+
+   if ((unsigned long) uaddr & ~PAGE_MASK)
+   return -EINVAL;
+
+   nr_pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+   range->pages = kzalloc(nr_pages * sizeof(struct page *), GFP_KERNEL);
+   if (!range->pages)
+   return -ENOMEM;
+
+   down_write(¤t->mm->mmap_sem);
+   ret = get_user_pages((unsigned long) uaddr, nr_pages, gup_flags,
+   range->pages, NULL);
+   up_write(¤t->mm->mmap_sem);
+
+   if (ret < nr_pages) {
+   kfree(range->pages);
+   return -ENOMEM;
+   }
+
+   range->nr_pages = nr_pages;
+   return 0;
+}
+
+static void aio_useriocb_unmap(struct kioctx *ctx)
+{
+   aio_unmap_range(&ctx->iocb_range);
+}
+
+static int aio_useriocb_map(struct kioctx *ctx, struct iocb __user *iocbs)
+{
+   size_t size = sizeof(struct iocb) * ctx->max_reqs;
+
+   return aio_map_range(&ctx->iocb_range, iocbs, size, 0);
+}
+
 SYSCALL_DEFINE6(io_setup2, u32, nr_events, u32, flags, struct iocb __user *,
iocbs, void __user *, user1, void __user *, user2,
aio_context_t __user *, ctxp)
@@ -1296,7 +1373,9 @@ SYSCALL_DEFINE6(io_setup2, u32, nr_events, u32, flags, 
struct iocb __user *,
unsigned long ctx;
long ret;
 
-   if (flags || user1 || user2)
+   if (user1 || user2)
+   return -EINVAL;
+   if (flags & ~IOCTX_FLAG_USERIOCB)
return -EINVAL;
 
ret = get_user(ctx, ctxp);
@@ -1308,9 +1387,17 @@ SYSCALL_DEFINE6(io_setup2, u32, nr_events, u32, flags, 
struct iocb __user *,
if (IS_ERR(ioctx))
goto out;
 
+   if (flags

[PATCH 16/26] aio: add submission side request cache

2018-12-04 Thread Jens Axboe
We have to add each submitted polled request to the io_context
poll_submitted list, which means we have to grab the poll_lock. We
already use the block plug to batch submissions if we're doing a batch
of IO submissions; extend that to cover the poll requests internally as
well.

Signed-off-by: Jens Axboe 
---
 fs/aio.c | 136 +--
 1 file changed, 113 insertions(+), 23 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 5d34317c2929..ae0805dc814c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -240,6 +240,21 @@ struct aio_kiocb {
};
 };
 
+struct aio_submit_state {
+   struct kioctx *ctx;
+
+   struct blk_plug plug;
+#ifdef CONFIG_BLOCK
+   struct blk_plug_cb plug_cb;
+#endif
+
+   /*
+* Polled iocbs that have been submitted, but not added to the ctx yet
+*/
+   struct list_head req_list;
+   unsigned int req_count;
+};
+
 /*-- sysctl variables*/
 static DEFINE_SPINLOCK(aio_nr_lock);
 unsigned long aio_nr;  /* current system wide number of aio requests */
@@ -257,6 +272,15 @@ static const struct address_space_operations aio_ctx_aops;
 static const unsigned int iocb_page_shift =
ilog2(PAGE_SIZE / sizeof(struct iocb));
 
+/*
+ * We rely on block level unplugs to flush pending requests, if we schedule
+ */
+#ifdef CONFIG_BLOCK
+static const bool aio_use_state_req_list = true;
+#else
+static const bool aio_use_state_req_list = false;
+#endif
+
 static void aio_useriocb_unmap(struct kioctx *);
 static void aio_iopoll_reap_events(struct kioctx *);
 
@@ -1887,13 +1911,28 @@ static inline void aio_rw_done(struct kiocb *req, 
ssize_t ret)
}
 }
 
+/*
+ * Called either at the end of IO submission, or through a plug callback
+ * because we're going to schedule. Moves out local batch of requests to
+ * the ctx poll list, so they can be found for polling + reaping.
+ */
+static void aio_flush_state_reqs(struct kioctx *ctx,
+struct aio_submit_state *state)
+{
+   spin_lock(&ctx->poll_lock);
+   list_splice_tail_init(&state->req_list, &ctx->poll_submitted);
+   spin_unlock(&ctx->poll_lock);
+   state->req_count = 0;
+}
+
 /*
  * After the iocb has been issued, it's safe to be found on the poll list.
  * Adding the kiocb to the list AFTER submission ensures that we don't
  * find it from a io_getevents() thread before the issuer is done accessing
  * the kiocb cookie.
  */
-static void aio_iopoll_iocb_issued(struct aio_kiocb *kiocb)
+static void aio_iopoll_iocb_issued(struct aio_submit_state *state,
+  struct aio_kiocb *kiocb)
 {
/*
 * For fast devices, IO may have already completed. If it has, add
@@ -1903,12 +1942,21 @@ static void aio_iopoll_iocb_issued(struct aio_kiocb 
*kiocb)
const int front_add = test_bit(IOCB_POLL_COMPLETED, &kiocb->ki_flags);
struct kioctx *ctx = kiocb->ki_ctx;
 
-   spin_lock(&ctx->poll_lock);
-   if (front_add)
-   list_add(&kiocb->ki_list, &ctx->poll_submitted);
-   else
-   list_add_tail(&kiocb->ki_list, &ctx->poll_submitted);
-   spin_unlock(&ctx->poll_lock);
+   if (!state || !aio_use_state_req_list) {
+   spin_lock(&ctx->poll_lock);
+   if (front_add)
+   list_add(&kiocb->ki_list, &ctx->poll_submitted);
+   else
+   list_add_tail(&kiocb->ki_list, &ctx->poll_submitted);
+   spin_unlock(&ctx->poll_lock);
+   } else {
+   if (front_add)
+   list_add(&kiocb->ki_list, &state->req_list);
+   else
+   list_add_tail(&kiocb->ki_list, &state->req_list);
+   if (++state->req_count >= AIO_IOPOLL_BATCH)
+   aio_flush_state_reqs(ctx, state);
+   }
 }
 
 static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
@@ -2204,7 +2252,8 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const 
struct iocb *iocb)
 }
 
 static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
-  struct iocb __user *user_iocb, bool compat)
+  struct iocb __user *user_iocb,
+  struct aio_submit_state *state, bool compat)
 {
struct aio_kiocb *req;
ssize_t ret;
@@ -2308,7 +2357,7 @@ static int __io_submit_one(struct kioctx *ctx, const 
struct iocb *iocb,
ret = -EAGAIN;
goto out_put_req;
}
-   aio_iopoll_iocb_issued(req);
+   aio_iopoll_iocb_issued(state, req);
}
return 0;
 out_put_req:
@@ -2322,7 +2371,7 @@ static int __io_submit_one(struct kioctx *ctx, const 
struct iocb *iocb,
 }
 
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
-bool compat)
+struct a

[PATCH 03/26] block: wire up block device iopoll method

2018-12-04 Thread Jens Axboe
From: Christoph Hellwig 

Just call blk_poll on the iocb cookie, we can derive the block device
from the inode trivially.

Reviewed-by: Johannes Thumshirn 
Signed-off-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
---
 fs/block_dev.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e1886cc7048f..6de8d35f6e41 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -281,6 +281,14 @@ struct blkdev_dio {
 
 static struct bio_set blkdev_dio_pool;
 
+static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
+{
+   struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
+   struct request_queue *q = bdev_get_queue(bdev);
+
+   return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
+}
+
 static void blkdev_bio_end_io(struct bio *bio)
 {
struct blkdev_dio *dio = bio->bi_private;
@@ -398,6 +406,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
bio->bi_opf |= REQ_HIPRI;
 
qc = submit_bio(bio);
+   WRITE_ONCE(iocb->ki_cookie, qc);
break;
}
 
@@ -2070,6 +2079,7 @@ const struct file_operations def_blk_fops = {
.llseek = block_llseek,
.read_iter  = blkdev_read_iter,
.write_iter = blkdev_write_iter,
+   .iopoll = blkdev_iopoll,
.mmap   = generic_file_mmap,
.fsync  = blkdev_fsync,
.unlocked_ioctl = block_ioctl,
-- 
2.17.1



[PATCH 01/26] fs: add an iopoll method to struct file_operations

2018-12-04 Thread Jens Axboe
From: Christoph Hellwig 

This new method is used to explicitly poll for I/O completion for an
iocb.  It must be called for any iocb submitted asynchronously (that
is with a non-null ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct iocb to store
the polling cookie.

TODO: we can probably union ki_cookie with the existing hint and I/O
priority fields to avoid struct kiocb growth.

Reviewed-by: Johannes Thumshirn 
Signed-off-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
---
 Documentation/filesystems/vfs.txt | 3 +++
 include/linux/fs.h| 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index 5f71a252e2e0..d9dc5e4d82b9 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+   int (*iopoll)(struct kiocb *kiocb, bool spin);
int (*iterate) (struct file *, struct dir_context *);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -902,6 +903,8 @@ otherwise noted.
 
   write_iter: possibly asynchronous write with iov_iter as source
 
+  iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
   iterate: called when the VFS needs to read the directory contents
 
   iterate_shared: called when the VFS needs to read the directory contents
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a1ab233e6469..6a5f71f8ae06 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -310,6 +310,7 @@ struct kiocb {
int ki_flags;
u16 ki_hint;
u16 ki_ioprio; /* See linux/ioprio.h */
+   unsigned intki_cookie; /* for ->iopoll */
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1781,6 +1782,7 @@ struct file_operations {
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+   int (*iopoll)(struct kiocb *kiocb, bool spin);
int (*iterate) (struct file *, struct dir_context *);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
-- 
2.17.1



[PATCH 25/26] aio: split old ring complete out from aio_complete()

2018-12-04 Thread Jens Axboe
Signed-off-by: Jens Axboe 
---
 fs/aio.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 1c8a8bb631a9..39aaffd6d436 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1218,12 +1218,9 @@ static void aio_fill_event(struct io_event *ev, struct 
aio_kiocb *iocb,
ev->res2 = res2;
 }
 
-/* aio_complete
- * Called when the io request on the given iocb is complete.
- */
-static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
+static void aio_ring_complete(struct kioctx *ctx, struct aio_kiocb *iocb,
+ long res, long res2)
 {
-   struct kioctx   *ctx = iocb->ki_ctx;
struct aio_ring *ring;
struct io_event *ev_page, *event;
unsigned tail, pos, head;
@@ -1273,6 +1270,16 @@ static void aio_complete(struct aio_kiocb *iocb, long 
res, long res2)
spin_unlock_irqrestore(&ctx->completion_lock, flags);
 
pr_debug("added to ring %p at [%u]\n", iocb, tail);
+}
+
+/* aio_complete
+ * Called when the io request on the given iocb is complete.
+ */
+static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
+{
+   struct kioctx *ctx = iocb->ki_ctx;
+
+   aio_ring_complete(ctx, iocb, res, res2);
 
/*
 * Check if the user asked us to deliver the result through an
-- 
2.17.1



[PATCH 23/26] fs: add support for mapping an ITER_BVEC for O_DIRECT

2018-12-04 Thread Jens Axboe
This adds support for sync/async O_DIRECT to make a bvec type iter
for bdev access, as well as iomap.

Signed-off-by: Jens Axboe 
---
 fs/block_dev.c | 16 
 fs/iomap.c |  5 -
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index b8f574615792..236c6abe649d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -219,7 +219,10 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
iov_iter *iter,
bio.bi_end_io = blkdev_bio_end_io_simple;
bio.bi_ioprio = iocb->ki_ioprio;
 
-   ret = bio_iov_iter_get_pages(&bio, iter);
+   if (iov_iter_is_bvec(iter))
+   ret = bio_iov_bvec_add_pages(&bio, iter);
+   else
+   ret = bio_iov_iter_get_pages(&bio, iter);
if (unlikely(ret))
goto out;
ret = bio.bi_iter.bi_size;
@@ -326,8 +329,9 @@ static void blkdev_bio_end_io(struct bio *bio)
struct bio_vec *bvec;
int i;
 
-   bio_for_each_segment_all(bvec, bio, i)
-   put_page(bvec->bv_page);
+   if (!bio_flagged(bio, BIO_HOLD_PAGES))
+   bio_for_each_segment_all(bvec, bio, i)
+   put_page(bvec->bv_page);
bio_put(bio);
}
 }
@@ -381,7 +385,11 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
bio->bi_end_io = blkdev_bio_end_io;
bio->bi_ioprio = iocb->ki_ioprio;
 
-   ret = bio_iov_iter_get_pages(bio, iter);
+   if (iov_iter_is_bvec(iter))
+   ret = bio_iov_bvec_add_pages(bio, iter);
+   else
+   ret = bio_iov_iter_get_pages(bio, iter);
+
if (unlikely(ret)) {
bio->bi_status = BLK_STS_IOERR;
bio_endio(bio);
diff --git a/fs/iomap.c b/fs/iomap.c
index bd483fcb7b5a..f00e5e198c57 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1673,7 +1673,10 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, 
loff_t length,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
 
-   ret = bio_iov_iter_get_pages(bio, &iter);
+   if (iov_iter_is_bvec(&iter))
+   ret = bio_iov_bvec_add_pages(bio, &iter);
+   else
+   ret = bio_iov_iter_get_pages(bio, &iter);
if (unlikely(ret)) {
/*
 * We have to stop part way through an IO. We must fall
-- 
2.17.1



[PATCH 24/26] aio: add support for pre-mapped user IO buffers

2018-12-04 Thread Jens Axboe
If we have fixed user buffers, we can map them into the kernel when we
set up the io_context. That avoids the need to do get_user_pages() for
each and every IO.

To utilize this feature, the application must set both
IOCTX_FLAG_USERIOCB, to provide iocbs in userspace, and then
IOCTX_FLAG_FIXEDBUFS. The latter tells aio that the mapped iocbs
already contain valid destination buffers and sizes. These buffers can
then be mapped into the kernel for the lifetime of the io_context, as
opposed to just the duration of each single IO.

Only works with non-vectored read/write commands for now, not with
PREADV/PWRITEV.

A limit of 4M is imposed as the largest buffer we currently support.
There's nothing preventing us from going larger, but we need some cap,
and 4M seemed like it would definitely be big enough. RLIMIT_MEMLOCK
is used to cap the total amount of memory pinned.

See the fio change for how to utilize this feature:

http://git.kernel.dk/cgit/fio/commit/?id=2041bd343da1c1e955253f62374588718c64f0f3
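
For reference, a sketch of registering fixed buffers, continuing the
user-mapped iocb sketch from the USERIOCB patch (BUF_SIZE, the bufs[]
array and the wrappers are assumed). Each iocb slot must already carry
a valid aio_buf/aio_nbytes pair when io_setup2() is called, since that
is the point where the pages get pinned:

	for (i = 0; i < NR_EVENTS; i++) {
		if (posix_memalign(&bufs[i], 4096, BUF_SIZE))
			return -1;
		iocbs[i].aio_buf = (unsigned long) bufs[i];
		iocbs[i].aio_nbytes = BUF_SIZE;	/* capped at 4M per buffer */
	}

	if (io_setup2(NR_EVENTS, IOCTX_FLAG_USERIOCB | IOCTX_FLAG_FIXEDBUFS,
		      iocbs, NULL, NULL, &ctx) < 0)
		return -1;

	/*
	 * Later submissions reuse slot i and its pre-mapped buffer; only
	 * non-vectored IOCB_CMD_PREAD/IOCB_CMD_PWRITE work here, and the
	 * total pinned memory is bounded by RLIMIT_MEMLOCK.
	 */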

Signed-off-by: Jens Axboe 
---
 fs/aio.c | 193 ---
 include/uapi/linux/aio_abi.h |   1 +
 2 files changed, 177 insertions(+), 17 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 1735ec556089..1c8a8bb631a9 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -97,6 +98,11 @@ struct aio_mapped_range {
long nr_pages;
 };
 
+struct aio_mapped_ubuf {
+   struct bio_vec *bvec;
+   unsigned int nr_bvecs;
+};
+
 struct kioctx {
struct percpu_ref   users;
atomic_tdead;
@@ -132,6 +138,8 @@ struct kioctx {
struct page **ring_pages;
longnr_pages;
 
+   struct aio_mapped_ubuf  *user_bufs;
+
struct aio_mapped_range iocb_range;
 
struct rcu_work free_rwork; /* see free_ioctx() */
@@ -301,6 +309,7 @@ static const bool aio_use_state_req_list = false;
 
 static void aio_useriocb_unmap(struct kioctx *);
 static void aio_iopoll_reap_events(struct kioctx *);
+static void aio_iocb_buffer_unmap(struct kioctx *);
 
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
@@ -662,6 +671,7 @@ static void free_ioctx(struct work_struct *work)
  free_rwork);
pr_debug("freeing %p\n", ctx);
 
+   aio_iocb_buffer_unmap(ctx);
aio_useriocb_unmap(ctx);
aio_free_ring(ctx);
free_percpu(ctx->cpu);
@@ -1672,6 +1682,122 @@ static int aio_useriocb_map(struct kioctx *ctx, struct 
iocb __user *iocbs)
return aio_map_range(&ctx->iocb_range, iocbs, size, 0);
 }
 
+static void aio_iocb_buffer_unmap(struct kioctx *ctx)
+{
+   int i, j;
+
+   if (!ctx->user_bufs)
+   return;
+
+   for (i = 0; i < ctx->max_reqs; i++) {
+   struct aio_mapped_ubuf *amu = &ctx->user_bufs[i];
+
+   for (j = 0; j < amu->nr_bvecs; j++)
+   put_page(amu->bvec[j].bv_page);
+
+   kfree(amu->bvec);
+   amu->nr_bvecs = 0;
+   }
+
+   kfree(ctx->user_bufs);
+   ctx->user_bufs = NULL;
+}
+
+static int aio_iocb_buffer_map(struct kioctx *ctx)
+{
+   unsigned long total_pages, page_limit;
+   struct page **pages = NULL;
+   int i, j, got_pages = 0;
+   struct iocb *iocb;
+   int ret = -EINVAL;
+
+   ctx->user_bufs = kzalloc(ctx->max_reqs * sizeof(struct aio_mapped_ubuf),
+   GFP_KERNEL);
+   if (!ctx->user_bufs)
+   return -ENOMEM;
+
+   /* Don't allow more pages than we can safely lock */
+   total_pages = 0;
+   page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+   for (i = 0; i < ctx->max_reqs; i++) {
+   struct aio_mapped_ubuf *amu = &ctx->user_bufs[i];
+   unsigned long off, start, end, ubuf;
+   int pret, nr_pages;
+   size_t size;
+
+   iocb = aio_iocb_from_index(ctx, i);
+
+   /*
+* Don't impose further limits on the size and buffer
+* constraints here, we'll -EINVAL later when IO is
+* submitted if they are wrong.
+*/
+   ret = -EFAULT;
+   if (!iocb->aio_buf)
+   goto err;
+
+   /* arbitrary limit, but we need something */
+   if (iocb->aio_nbytes > SZ_4M)
+   goto err;
+
+   ubuf = iocb->aio_buf;
+   end = (ubuf + iocb->aio_nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
+   start = ubuf >> PAGE_SHIFT;
+   nr_pages = end - start;
+
+   ret = -ENOMEM;
+   if (total_pages + nr_pages > page_limit)
+   goto err;
+
+   if (!pages || nr_pages > got_pages) {
+   kfree(pages);
+  

[PATCH 26/26] aio: add support for submission/completion rings

2018-12-04 Thread Jens Axboe
Experimental support for submitting and completing IO through rings
shared between the application and kernel.

The submission rings are struct iocb, like we would submit through
io_submit(), and the completion rings are struct io_event, like we
would pass in (and copy back) from io_getevents().

A new system call is added for this, io_ring_enter(). This system
call submits IO that is queued in the SQ ring, and/or completes IO
and stores the results in the CQ ring.

This could be augmented with a kernel thread that does the submission
and polling; then the application would never have to enter the
kernel to do IO.

Sample application: http://brick.kernel.dk/snaps/aio-ring.c
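
For reference, a sketch of how an application could reap completions
straight off the CQ ring. The header layout below is an assumption (the
kernel side only shows that the ring carries head, tail and nr_events,
and that the io_event array starts one io_event-sized slot into the
mapping); see the sample application above for the real layout:

#include <stdint.h>
#include <string.h>
#include <linux/aio_abi.h>

struct cq_ring_hdr {			/* assumed layout */
	uint32_t head;
	uint32_t tail;
	uint32_t nr_events;
};

static int reap_one(void *cq_mem, struct io_event *ev)
{
	struct cq_ring_hdr *hdr = cq_mem;
	struct io_event *events = (struct io_event *) cq_mem + 1;
	uint32_t head = hdr->head;

	if (head == __atomic_load_n(&hdr->tail, __ATOMIC_ACQUIRE))
		return 0;			/* ring is empty */

	memcpy(ev, &events[head], sizeof(*ev));
	if (++head == hdr->nr_events)
		head = 0;
	__atomic_store_n(&hdr->head, head, __ATOMIC_RELEASE);
	return 1;
}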

Signed-off-by: Jens Axboe 
---
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/aio.c   | 312 +++--
 include/linux/syscalls.h   |   4 +-
 include/uapi/linux/aio_abi.h   |  26 +++
 4 files changed, 323 insertions(+), 20 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 67c357225fb0..55a26700a637 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -344,6 +344,7 @@
 333common  io_pgetevents   __x64_sys_io_pgetevents
 334common  rseq__x64_sys_rseq
 335common  io_setup2   __x64_sys_io_setup2
+336common  io_ring_enter   __x64_sys_io_ring_enter
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/aio.c b/fs/aio.c
index 39aaffd6d436..6024c6943d7d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -142,6 +142,10 @@ struct kioctx {
 
struct aio_mapped_range iocb_range;
 
+   /* if used, completion and submission rings */
+   struct aio_mapped_range sq_ring;
+   struct aio_mapped_range cq_ring;
+
struct rcu_work free_rwork; /* see free_ioctx() */
 
/*
@@ -297,6 +301,8 @@ static const struct address_space_operations aio_ctx_aops;
 
 static const unsigned int iocb_page_shift =
ilog2(PAGE_SIZE / sizeof(struct iocb));
+static const unsigned int event_page_shift =
+   ilog2(PAGE_SIZE / sizeof(struct io_event));
 
 /*
  * We rely on block level unplugs to flush pending requests, if we schedule
@@ -307,6 +313,7 @@ static const bool aio_use_state_req_list = true;
 static const bool aio_use_state_req_list = false;
 #endif
 
+static void aio_scqring_unmap(struct kioctx *);
 static void aio_useriocb_unmap(struct kioctx *);
 static void aio_iopoll_reap_events(struct kioctx *);
 static void aio_iocb_buffer_unmap(struct kioctx *);
@@ -673,6 +680,7 @@ static void free_ioctx(struct work_struct *work)
 
aio_iocb_buffer_unmap(ctx);
aio_useriocb_unmap(ctx);
+   aio_scqring_unmap(ctx);
aio_free_ring(ctx);
free_percpu(ctx->cpu);
percpu_ref_exit(&ctx->reqs);
@@ -1218,6 +1226,47 @@ static void aio_fill_event(struct io_event *ev, struct 
aio_kiocb *iocb,
ev->res2 = res2;
 }
 
+static struct io_event *__aio_get_cqring_ev(struct aio_io_event_ring *ring,
+   struct aio_mapped_range *range,
+   unsigned *next_tail)
+{
+   struct io_event *ev;
+   unsigned tail;
+
+   smp_rmb();
+   tail = READ_ONCE(ring->tail);
+   *next_tail = tail + 1;
+   if (*next_tail == ring->nr_events)
+   *next_tail = 0;
+   if (*next_tail == READ_ONCE(ring->head))
+   return NULL;
+
+   /* io_event array starts offset one into the mapped range */
+   tail++;
+   ev = page_address(range->pages[tail >> event_page_shift]);
+   tail &= ((1 << event_page_shift) - 1);
+   return ev + tail;
+}
+
+static void aio_commit_cqring(struct kioctx *ctx, unsigned next_tail)
+{
+   struct aio_io_event_ring *ring;
+
+   ring = page_address(ctx->cq_ring.pages[0]);
+   if (next_tail != ring->tail) {
+   ring->tail = next_tail;
+   smp_wmb();
+   }
+}
+
+static struct io_event *aio_peek_cqring(struct kioctx *ctx, unsigned *ntail)
+{
+   struct aio_io_event_ring *ring;
+
+   ring = page_address(ctx->cq_ring.pages[0]);
+   return __aio_get_cqring_ev(ring, &ctx->cq_ring, ntail);
+}
+
 static void aio_ring_complete(struct kioctx *ctx, struct aio_kiocb *iocb,
  long res, long res2)
 {
@@ -1279,7 +1328,17 @@ static void aio_complete(struct aio_kiocb *iocb, long 
res, long res2)
 {
struct kioctx *ctx = iocb->ki_ctx;
 
-   aio_ring_complete(ctx, iocb, res, res2);
+   if (ctx->flags & IOCTX_FLAG_SCQRING) {
+   struct io_event *ev;
+   unsigned int tail;
+
+   /* Can't fail, we have a ring reservation */
+   ev = aio_peek_cqring(ctx, &tail);
+   aio_fill_event(ev, iocb, res, res2);
+   ai

Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Ming Lei
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> we queue the request up normally. However, the SCSI layer may have
> already setup SG tables etc for this particular command. If we later
> merge with this request, then the old tables are no longer valid. Once
> we issue the IO, we only read/write the original part of the request,
> not the new state of it.
> 
> This causes data corruption, and is most often noticed with the file
> system complaining about the just read data being invalid:
> 
> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm 
> dpkg-query: bad extra_isize 24937 (inode size 256)
> 
> because most of it is garbage...
> 
> This doesn't happen from the normal issue path, as we will simply defer
> the request to the hardware queue dispatch list if we fail. Once it's on
> the dispatch list, we never merge with it.
> 
> Fix this from the direct issue path by flagging the request as
> REQ_NOMERGE so we don't change the size of it before issue.
> 
> See also:
>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> 
> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 
> 'none'")
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3f91c6e5b17a..d8f518c6ea38 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   break;
>   case BLK_STS_RESOURCE:
>   case BLK_STS_DEV_RESOURCE:
> + /*
> +  * If direct dispatch fails, we cannot allow any merging on
> +  * this IO. Drivers (like SCSI) may have set up permanent state
> +  * for this request, like SG tables and mappings, and if we
> +  * merge to it later on then we'll still only do IO to the
> +  * original part.
> +  */
> + rq->cmd_flags |= REQ_NOMERGE;
> +
>   blk_mq_update_dispatch_busy(hctx, true);
>   __blk_mq_requeue_request(rq);
>   break;
> 

Not sure it is enough to just mark it as NOMERGE, for example, driver
may have setup the .special_vec for discard, and NOMERGE may not prevent
request from entering elevator queue completely. Cause 'rq.rb_node' and
'rq.special_vec' share same space.
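
For reference, the overlap comes from the anonymous union in struct
request (include/linux/blkdev.h), roughly as below (abbreviated); a
request whose special_vec has been set up by the driver therefore must
not go back into an elevator that wants to use rb_node:

	union {
		struct rb_node rb_node;		/* sort/lookup in the IO scheduler */
		struct bio_vec special_vec;	/* e.g. single-range discard payload */
		void *completion_data;
	};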

So how about inserting this request via blk_mq_request_bypass_insert()
in case that direct issue returns BUSY? Then it is invariant that
any request queued via .queue_rq() won't enter scheduler queue.

--

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e5b17a..4b2db0b2909e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1764,7 +1764,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
if (bypass_insert)
return BLK_STS_RESOURCE;
 
-   blk_mq_sched_insert_request(rq, false, run_queue, false);
+   blk_mq_request_bypass_insert(rq, run_queue);
return BLK_STS_OK;
 }
 
@@ -1780,7 +1780,7 @@ static void blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
-   blk_mq_sched_insert_request(rq, false, true, false);
+   blk_mq_request_bypass_insert(rq, true);
else if (ret != BLK_STS_OK)
blk_mq_end_request(rq, ret);
 


Thanks,
Ming


Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Guenter Roeck
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> we queue the request up normally. However, the SCSI layer may have
> already setup SG tables etc for this particular command. If we later
> merge with this request, then the old tables are no longer valid. Once
> we issue the IO, we only read/write the original part of the request,
> not the new state of it.
> 
> This causes data corruption, and is most often noticed with the file
> system complaining about the just read data being invalid:
> 
> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm 
> dpkg-query: bad extra_isize 24937 (inode size 256)
> 
> because most of it is garbage...
> 
> This doesn't happen from the normal issue path, as we will simply defer
> the request to the hardware queue dispatch list if we fail. Once it's on
> the dispatch list, we never merge with it.
> 
> Fix this from the direct issue path by flagging the request as
> REQ_NOMERGE so we don't change the size of it before issue.
> 
> See also:
>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> 
> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 
> 'none'")
> Signed-off-by: Jens Axboe 

Tested-by: Guenter Roeck 

... on two systems affected by the problem.

> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3f91c6e5b17a..d8f518c6ea38 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   break;
>   case BLK_STS_RESOURCE:
>   case BLK_STS_DEV_RESOURCE:
> + /*
> +  * If direct dispatch fails, we cannot allow any merging on
> +  * this IO. Drivers (like SCSI) may have set up permanent state
> +  * for this request, like SG tables and mappings, and if we
> +  * merge to it later on then we'll still only do IO to the
> +  * original part.
> +  */
> + rq->cmd_flags |= REQ_NOMERGE;
> +
>   blk_mq_update_dispatch_busy(hctx, true);
>   __blk_mq_requeue_request(rq);
>   break;


Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Jens Axboe
On 12/4/18 6:37 PM, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>> we queue the request up normally. However, the SCSI layer may have
>> already setup SG tables etc for this particular command. If we later
>> merge with this request, then the old tables are no longer valid. Once
>> we issue the IO, we only read/write the original part of the request,
>> not the new state of it.
>>
>> This causes data corruption, and is most often noticed with the file
>> system complaining about the just read data being invalid:
>>
>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: 
>> comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>
>> because most of it is garbage...
>>
>> This doesn't happen from the normal issue path, as we will simply defer
>> the request to the hardware queue dispatch list if we fail. Once it's on
>> the dispatch list, we never merge with it.
>>
>> Fix this from the direct issue path by flagging the request as
>> REQ_NOMERGE so we don't change the size of it before issue.
>>
>> See also:
>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case 
>> of 'none'")
>> Signed-off-by: Jens Axboe 
>>
>> ---
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 3f91c6e5b17a..d8f518c6ea38 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
>> blk_mq_hw_ctx *hctx,
>>  break;
>>  case BLK_STS_RESOURCE:
>>  case BLK_STS_DEV_RESOURCE:
>> +/*
>> + * If direct dispatch fails, we cannot allow any merging on
>> + * this IO. Drivers (like SCSI) may have set up permanent state
>> + * for this request, like SG tables and mappings, and if we
>> + * merge to it later on then we'll still only do IO to the
>> + * original part.
>> + */
>> +rq->cmd_flags |= REQ_NOMERGE;
>> +
>>  blk_mq_update_dispatch_busy(hctx, true);
>>  __blk_mq_requeue_request(rq);
>>  break;
>>
> 
> Not sure it is enough to just mark it as NOMERGE, for example, driver
> may have setup the .special_vec for discard, and NOMERGE may not prevent
> request from entering elevator queue completely. Cause 'rq.rb_node' and
> 'rq.special_vec' share same space.

We should rather limit the scope of the direct dispatch instead. It
doesn't make sense to do for anything but read/write anyway.

> So how about inserting this request via blk_mq_request_bypass_insert()
> in case that direct issue returns BUSY? Then it is invariant that
> any request queued via .queue_rq() won't enter scheduler queue.

I did consider this, but I didn't want to experiment with exercising
a new path for an important bug fix. You do realize that your original
patch has been corrupting data for months? I think a little caution
is in order here.

-- 
Jens Axboe



Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Jens Axboe
On 12/4/18 7:16 PM, Jens Axboe wrote:
> On 12/4/18 6:37 PM, Ming Lei wrote:
>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>>> we queue the request up normally. However, the SCSI layer may have
>>> already setup SG tables etc for this particular command. If we later
>>> merge with this request, then the old tables are no longer valid. Once
>>> we issue the IO, we only read/write the original part of the request,
>>> not the new state of it.
>>>
>>> This causes data corruption, and is most often noticed with the file
>>> system complaining about the just read data being invalid:
>>>
>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: 
>>> comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>>
>>> because most of it is garbage...
>>>
>>> This doesn't happen from the normal issue path, as we will simply defer
>>> the request to the hardware queue dispatch list if we fail. Once it's on
>>> the dispatch list, we never merge with it.
>>>
>>> Fix this from the direct issue path by flagging the request as
>>> REQ_NOMERGE so we don't change the size of it before issue.
>>>
>>> See also:
>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>>
>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case 
>>> of 'none'")
>>> Signed-off-by: Jens Axboe 
>>>
>>> ---
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 3f91c6e5b17a..d8f518c6ea38 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
>>> blk_mq_hw_ctx *hctx,
>>> break;
>>> case BLK_STS_RESOURCE:
>>> case BLK_STS_DEV_RESOURCE:
>>> +   /*
>>> +* If direct dispatch fails, we cannot allow any merging on
>>> +* this IO. Drivers (like SCSI) may have set up permanent state
>>> +* for this request, like SG tables and mappings, and if we
>>> +* merge to it later on then we'll still only do IO to the
>>> +* original part.
>>> +*/
>>> +   rq->cmd_flags |= REQ_NOMERGE;
>>> +
>>> blk_mq_update_dispatch_busy(hctx, true);
>>> __blk_mq_requeue_request(rq);
>>> break;
>>>
>>
>> Not sure it is enough to just mark it as NOMERGE, for example, driver
>> may have setup the .special_vec for discard, and NOMERGE may not prevent
>> request from entering elevator queue completely. Cause 'rq.rb_node' and
>> 'rq.special_vec' share same space.
> 
> We should rather limit the scope of the direct dispatch instead. It
> doesn't make sense to do for anything but read/write anyway.
> 
>> So how about inserting this request via blk_mq_request_bypass_insert()
>> in case that direct issue returns BUSY? Then it is invariant that
>> any request queued via .queue_rq() won't enter scheduler queue.
> 
> I did consider this, but I didn't want to experiment with exercising
> a new path for an important bug fix. You do realize that your original
> patch has been corrupting data for months? I think a little caution
> is in order here.

Here's a further limiting version. And we seriously need to clean up the
direct issue paths, it's ridiculous.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e5b17a..3262d83b9e07 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
break;
case BLK_STS_RESOURCE:
case BLK_STS_DEV_RESOURCE:
+   /*
+* If direct dispatch fails, we cannot allow any merging on
+* this IO. Drivers (like SCSI) may have set up permanent state
+* for this request, like SG tables and mappings, and if we
+* merge to it later on then we'll still only do IO to the
+* original part.
+*/
+   rq->cmd_flags |= REQ_NOMERGE;
+
blk_mq_update_dispatch_busy(hctx, true);
__blk_mq_requeue_request(rq);
break;
@@ -1727,6 +1736,18 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
return ret;
 }
 
+/*
+ * Don't allow direct dispatch of anything but regular reads/writes,
+ * as some of the other commands can potentially share request space
+ * with data we need for the IO scheduler. If we attempt a direct dispatch
+ * on those and fail, we can't safely add it to the scheduler afterwards
+ * without potentially overwriting data that the driver has already written.
+ */
+static bool blk_rq_can_direct_dispatch(struct request *rq)
+{
+   return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE;
+}
+
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
blk_qc_t *cookie,

Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Jens Axboe
On 12/4/18 6:38 PM, Guenter Roeck wrote:
> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>> we queue the request up normally. However, the SCSI layer may have
>> already setup SG tables etc for this particular command. If we later
>> merge with this request, then the old tables are no longer valid. Once
>> we issue the IO, we only read/write the original part of the request,
>> not the new state of it.
>>
>> This causes data corruption, and is most often noticed with the file
>> system complaining about the just read data being invalid:
>>
>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: 
>> comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>
>> because most of it is garbage...
>>
>> This doesn't happen from the normal issue path, as we will simply defer
>> the request to the hardware queue dispatch list if we fail. Once it's on
>> the dispatch list, we never merge with it.
>>
>> Fix this from the direct issue path by flagging the request as
>> REQ_NOMERGE so we don't change the size of it before issue.
>>
>> See also:
>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case 
>> of 'none'")
>> Signed-off-by: Jens Axboe 
> 
> Tested-by: Guenter Roeck 
> 
> ... on two systems affected by the problem.

Thanks for testing! And for being persistent in reproducing and
providing clues for getting this nailed.

-- 
Jens Axboe



Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Ming Lei
On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
> On 12/4/18 6:37 PM, Ming Lei wrote:
> > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> >> we queue the request up normally. However, the SCSI layer may have
> >> already setup SG tables etc for this particular command. If we later
> >> merge with this request, then the old tables are no longer valid. Once
> >> we issue the IO, we only read/write the original part of the request,
> >> not the new state of it.
> >>
> >> This causes data corruption, and is most often noticed with the file
> >> system complaining about the just read data being invalid:
> >>
> >> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: 
> >> comm dpkg-query: bad extra_isize 24937 (inode size 256)
> >>
> >> because most of it is garbage...
> >>
> >> This doesn't happen from the normal issue path, as we will simply defer
> >> the request to the hardware queue dispatch list if we fail. Once it's on
> >> the dispatch list, we never merge with it.
> >>
> >> Fix this from the direct issue path by flagging the request as
> >> REQ_NOMERGE so we don't change the size of it before issue.
> >>
> >> See also:
> >>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> >>
> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case 
> >> of 'none'")
> >> Signed-off-by: Jens Axboe 
> >>
> >> ---
> >>
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index 3f91c6e5b17a..d8f518c6ea38 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
> >> blk_mq_hw_ctx *hctx,
> >>break;
> >>case BLK_STS_RESOURCE:
> >>case BLK_STS_DEV_RESOURCE:
> >> +  /*
> >> +   * If direct dispatch fails, we cannot allow any merging on
> >> +   * this IO. Drivers (like SCSI) may have set up permanent state
> >> +   * for this request, like SG tables and mappings, and if we
> >> +   * merge to it later on then we'll still only do IO to the
> >> +   * original part.
> >> +   */
> >> +  rq->cmd_flags |= REQ_NOMERGE;
> >> +
> >>blk_mq_update_dispatch_busy(hctx, true);
> >>__blk_mq_requeue_request(rq);
> >>break;
> >>
> > 
> > Not sure it is enough to just mark it as NOMERGE, for example, driver
> > may have setup the .special_vec for discard, and NOMERGE may not prevent
> > request from entering elevator queue completely. Cause 'rq.rb_node' and
> > 'rq.special_vec' share same space.
> 
> We should rather limit the scope of the direct dispatch instead. It
> doesn't make sense to do for anything but read/write anyway.

discard is kind of write, and it isn't treated very specially in make
request path, except for multi-range discard.

> 
> > So how about inserting this request via blk_mq_request_bypass_insert()
> > in case that direct issue returns BUSY? Then it is invariant that
> > any request queued via .queue_rq() won't enter scheduler queue.
> 
> I did consider this, but I didn't want to experiment with exercising
> a new path for an important bug fix. You do realize that your original
> patch has been corrupting data for months? I think a little caution
> is in order here.

But marking NOMERGE still may have a hole on re-insert discard request as
mentioned above.

Given we never allow to re-insert queued request to scheduler queue
except for 6ce3dd6eec1, I think it is the correct thing to do, and the
fix is simple too.

Thanks,
Ming


Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Jens Axboe
On 12/4/18 7:27 PM, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
>> On 12/4/18 6:37 PM, Ming Lei wrote:
>>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
 If we attempt a direct issue to a SCSI device, and it returns BUSY, then
 we queue the request up normally. However, the SCSI layer may have
 already setup SG tables etc for this particular command. If we later
 merge with this request, then the old tables are no longer valid. Once
 we issue the IO, we only read/write the original part of the request,
 not the new state of it.

 This causes data corruption, and is most often noticed with the file
 system complaining about the just read data being invalid:

 [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: 
 comm dpkg-query: bad extra_isize 24937 (inode size 256)

 because most of it is garbage...

 This doesn't happen from the normal issue path, as we will simply defer
 the request to the hardware queue dispatch list if we fail. Once it's on
 the dispatch list, we never merge with it.

 Fix this from the direct issue path by flagging the request as
 REQ_NOMERGE so we don't change the size of it before issue.

 See also:
   https://bugzilla.kernel.org/show_bug.cgi?id=201685

 Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case 
 of 'none'")
 Signed-off-by: Jens Axboe 

 ---

 diff --git a/block/blk-mq.c b/block/blk-mq.c
 index 3f91c6e5b17a..d8f518c6ea38 100644
 --- a/block/blk-mq.c
 +++ b/block/blk-mq.c
 @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
 blk_mq_hw_ctx *hctx,
break;
case BLK_STS_RESOURCE:
case BLK_STS_DEV_RESOURCE:
 +  /*
 +   * If direct dispatch fails, we cannot allow any merging on
 +   * this IO. Drivers (like SCSI) may have set up permanent state
 +   * for this request, like SG tables and mappings, and if we
 +   * merge to it later on then we'll still only do IO to the
 +   * original part.
 +   */
 +  rq->cmd_flags |= REQ_NOMERGE;
 +
blk_mq_update_dispatch_busy(hctx, true);
__blk_mq_requeue_request(rq);
break;

>>>
>>> Not sure it is enough to just mark it as NOMERGE, for example, driver
>>> may have setup the .special_vec for discard, and NOMERGE may not prevent
>>> request from entering elevator queue completely. Cause 'rq.rb_node' and
>>> 'rq.special_vec' share same space.
>>
>> We should rather limit the scope of the direct dispatch instead. It
>> doesn't make sense to do for anything but read/write anyway.
> 
> discard is kind of write, and it isn't treated very specially in make
> request path, except for multi-range discard.

The point of direct dispatch is to reduce latencies for requests,
discards are so damn slow on ALL devices anyway that it doesn't make any
sense to try direct dispatch to begin with, regardless of whether it
is possible or not.

>>> So how about inserting this request via blk_mq_request_bypass_insert()
>>> in case that direct issue returns BUSY? Then it is invariant that
>>> any request queued via .queue_rq() won't enter scheduler queue.
>>
>> I did consider this, but I didn't want to experiment with exercising
>> a new path for an important bug fix. You do realize that your original
>> patch has been corrupting data for months? I think a little caution
>> is in order here.
> 
> But marking NOMERGE still may have a hole on re-insert discard request as
> mentioned above.

What I said was further limit the scope of direct dispatch, which means
not allowing anything that isn't a read/write.

> Given we never allow to re-insert queued request to scheduler queue
> except for 6ce3dd6eec1, I think it is the correct thing to do, and the
> fix is simple too.

As I said, it's not the time to experiment. This issue has been there
since 4.19-rc1. The alternative is yanking both those patches, and then
looking at it later when the direct issue path has been cleaned up
first.

-- 
Jens Axboe



Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Ming Lei
On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote:
> On 12/4/18 7:27 PM, Ming Lei wrote:
> > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
> >> On 12/4/18 6:37 PM, Ming Lei wrote:
> >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>  If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>  we queue the request up normally. However, the SCSI layer may have
>  already setup SG tables etc for this particular command. If we later
>  merge with this request, then the old tables are no longer valid. Once
>  we issue the IO, we only read/write the original part of the request,
>  not the new state of it.
> 
>  This causes data corruption, and is most often noticed with the file
>  system complaining about the just read data being invalid:
> 
>  [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: 
>  comm dpkg-query: bad extra_isize 24937 (inode size 256)
> 
>  because most of it is garbage...
> 
>  This doesn't happen from the normal issue path, as we will simply defer
>  the request to the hardware queue dispatch list if we fail. Once it's on
>  the dispatch list, we never merge with it.
> 
>  Fix this from the direct issue path by flagging the request as
>  REQ_NOMERGE so we don't change the size of it before issue.
> 
>  See also:
>    https://bugzilla.kernel.org/show_bug.cgi?id=201685
> 
>  Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in 
>  case of 'none'")
>  Signed-off-by: Jens Axboe 
> 
>  ---
> 
>  diff --git a/block/blk-mq.c b/block/blk-mq.c
>  index 3f91c6e5b17a..d8f518c6ea38 100644
>  --- a/block/blk-mq.c
>  +++ b/block/blk-mq.c
>  @@ -1715,6 +1715,15 @@ static blk_status_t 
>  __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>   break;
>   case BLK_STS_RESOURCE:
>   case BLK_STS_DEV_RESOURCE:
>  +/*
>  + * If direct dispatch fails, we cannot allow any 
>  merging on
>  + * this IO. Drivers (like SCSI) may have set up 
>  permanent state
>  + * for this request, like SG tables and mappings, and 
>  if we
>  + * merge to it later on then we'll still only do IO to 
>  the
>  + * original part.
>  + */
>  +rq->cmd_flags |= REQ_NOMERGE;
>  +
>   blk_mq_update_dispatch_busy(hctx, true);
>   __blk_mq_requeue_request(rq);
>   break;
> 
> >>>
> >>> Not sure it is enough to just mark it as NOMERGE, for example, driver
> >>> may have setup the .special_vec for discard, and NOMERGE may not prevent
> >>> request from entering elevator queue completely. Cause 'rq.rb_node' and
> >>> 'rq.special_vec' share same space.
> >>
> >> We should rather limit the scope of the direct dispatch instead. It
> >> doesn't make sense to do for anything but read/write anyway.
> > 
> > discard is kind of write, and it isn't treated very specially in make
> > request path, except for multi-range discard.
> 
> The point of direct dispatch is to reduce latencies for requests,
> discards are so damn slow on ALL devices anyway that it doesn't make any
> sense to try direct dispatch to begin with, regardless of whether it
> possible or not.

A SCSI MQ device may benefit from direct dispatch through reduced lock contention.

> 
> >>> So how about inserting this request via blk_mq_request_bypass_insert()
> >>> in case that direct issue returns BUSY? Then it is invariant that
> >>> any request queued via .queue_rq() won't enter scheduler queue.
> >>
> >> I did consider this, but I didn't want to experiment with exercising
> >> a new path for an important bug fix. You do realize that your original
> >> patch has been corrupting data for months? I think a little caution
> >> is in order here.
> > 
> > But marking NOMERGE still may have a hole on re-insert discard request as
> > mentioned above.
> 
> What I said was further limit the scope of direct dispatch, which means
> not allowing anything that isn't a read/write.

IMO, the conservative approach is to take the one used in the legacy IO
path, in which a queued request is never re-inserted into the scheduler
queue except on requeue, and even then RQF_DONTPREP is cleared before the
request is requeued to the scheduler.
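
A sketch of that conservative fallback, assuming the 4.20-era
blk_mq_request_bypass_insert(rq, run_queue) signature and a blk_status_t
ret returned by the direct-issue attempt:

	/*
	 * Sketch only: on BUSY from direct issue, put the already-prepared
	 * request straight onto the hctx dispatch list, where it can neither
	 * be merged with nor re-enter the I/O scheduler.
	 */
	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
		blk_mq_request_bypass_insert(rq, true);

Whether this lives in __blk_mq_issue_directly() itself or in its caller is
a detail; the invariant is that a request seen by .queue_rq() never
reaches the scheduler afterwards.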

> 
> > Given we never allow to re-insert queued request to scheduler queue
> > except for 6ce3dd6eec1, I think it is the correct thing to do, and the
> > fix is simple too.
> 
> As I said, it's not the time to experiment. This issue has been there
> since 4.19-rc1. The alternative is yanking both those patches, and then
> looking at it later when the direct issue path has been cleaned up
> first.

The issue should have been there from v4.1, especially after commit
f984

Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Ming Lei
On Wed, Dec 05, 2018 at 10:58:02AM +0800, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote:
> > On 12/4/18 7:27 PM, Ming Lei wrote:
> > > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
> > >> On 12/4/18 6:37 PM, Ming Lei wrote:
> > >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> >  If we attempt a direct issue to a SCSI device, and it returns BUSY, 
> >  then
> >  we queue the request up normally. However, the SCSI layer may have
> >  already setup SG tables etc for this particular command. If we later
> >  merge with this request, then the old tables are no longer valid. Once
> >  we issue the IO, we only read/write the original part of the request,
> >  not the new state of it.
> > 
> >  This causes data corruption, and is most often noticed with the file
> >  system complaining about the just read data being invalid:
> > 
> >  [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode 
> >  #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
> > 
> >  because most of it is garbage...
> > 
> >  This doesn't happen from the normal issue path, as we will simply defer
> >  the request to the hardware queue dispatch list if we fail. Once it's 
> >  on
> >  the dispatch list, we never merge with it.
> > 
> >  Fix this from the direct issue path by flagging the request as
> >  REQ_NOMERGE so we don't change the size of it before issue.
> > 
> >  See also:
> >    https://bugzilla.kernel.org/show_bug.cgi?id=201685
> > 
> >  Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in 
> >  case of 'none'")
> >  Signed-off-by: Jens Axboe 
> > 
> >  ---
> > 
> >  diff --git a/block/blk-mq.c b/block/blk-mq.c
> >  index 3f91c6e5b17a..d8f518c6ea38 100644
> >  --- a/block/blk-mq.c
> >  +++ b/block/blk-mq.c
> >  @@ -1715,6 +1715,15 @@ static blk_status_t 
> >  __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> > break;
> > case BLK_STS_RESOURCE:
> > case BLK_STS_DEV_RESOURCE:
> >  +  /*
> >  +   * If direct dispatch fails, we cannot allow any 
> >  merging on
> >  +   * this IO. Drivers (like SCSI) may have set up 
> >  permanent state
> >  +   * for this request, like SG tables and mappings, and 
> >  if we
> >  +   * merge to it later on then we'll still only do IO to 
> >  the
> >  +   * original part.
> >  +   */
> >  +  rq->cmd_flags |= REQ_NOMERGE;
> >  +
> > blk_mq_update_dispatch_busy(hctx, true);
> > __blk_mq_requeue_request(rq);
> > break;
> > 
> > >>>
> > >>> Not sure it is enough to just mark it as NOMERGE, for example, driver
> > >>> may have setup the .special_vec for discard, and NOMERGE may not prevent
> > >>> request from entering elevator queue completely. Cause 'rq.rb_node' and
> > >>> 'rq.special_vec' share same space.
> > >>
> > >> We should rather limit the scope of the direct dispatch instead. It
> > >> doesn't make sense to do for anything but read/write anyway.
> > > 
> > > discard is kind of write, and it isn't treated very specially in make
> > > request path, except for multi-range discard.
> > 
> > The point of direct dispatch is to reduce latencies for requests,
> > discards are so damn slow on ALL devices anyway that it doesn't make any
> > sense to try direct dispatch to begin with, regardless of whether it
> > possible or not.
> 
> SCSI MQ device may benefit from direct dispatch from reduced lock contention.
> 
> > 
> > >>> So how about inserting this request via blk_mq_request_bypass_insert()
> > >>> in case that direct issue returns BUSY? Then it is invariant that
> > >>> any request queued via .queue_rq() won't enter scheduler queue.
> > >>
> > >> I did consider this, but I didn't want to experiment with exercising
> > >> a new path for an important bug fix. You do realize that your original
> > >> patch has been corrupting data for months? I think a little caution
> > >> is in order here.
> > > 
> > > But marking NOMERGE still may have a hole on re-insert discard request as
> > > mentioned above.
> > 
> > What I said was further limit the scope of direct dispatch, which means
> > not allowing anything that isn't a read/write.
> 
> IMO, the conservative approach is to take the one used in legacy io
> path, in which it is never allowed to re-insert queued request to
> scheduler queue except for requeue, however RQF_DONTPREP is cleared
> before requeuing request to scheduler.
> 
> > 
> > > Given we never allow to re-insert queued request to scheduler queue
> > > except for 6ce3dd6eec1, I think it is the correct thing to do, and the
> > > fix is simple too.
> > 
> > As I said, it's not the time 

Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Jens Axboe
On 12/4/18 7:58 PM, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote:
>> On 12/4/18 7:27 PM, Ming Lei wrote:
>>> On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
 On 12/4/18 6:37 PM, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>> we queue the request up normally. However, the SCSI layer may have
>> already setup SG tables etc for this particular command. If we later
>> merge with this request, then the old tables are no longer valid. Once
>> we issue the IO, we only read/write the original part of the request,
>> not the new state of it.
>>
>> This causes data corruption, and is most often noticed with the file
>> system complaining about the just read data being invalid:
>>
>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: 
>> comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>
>> because most of it is garbage...
>>
>> This doesn't happen from the normal issue path, as we will simply defer
>> the request to the hardware queue dispatch list if we fail. Once it's on
>> the dispatch list, we never merge with it.
>>
>> Fix this from the direct issue path by flagging the request as
>> REQ_NOMERGE so we don't change the size of it before issue.
>>
>> See also:
>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in 
>> case of 'none'")
>> Signed-off-by: Jens Axboe 
>>
>> ---
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 3f91c6e5b17a..d8f518c6ea38 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1715,6 +1715,15 @@ static blk_status_t 
>> __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>>  break;
>>  case BLK_STS_RESOURCE:
>>  case BLK_STS_DEV_RESOURCE:
>> +/*
>> + * If direct dispatch fails, we cannot allow any 
>> merging on
>> + * this IO. Drivers (like SCSI) may have set up 
>> permanent state
>> + * for this request, like SG tables and mappings, and 
>> if we
>> + * merge to it later on then we'll still only do IO to 
>> the
>> + * original part.
>> + */
>> +rq->cmd_flags |= REQ_NOMERGE;
>> +
>>  blk_mq_update_dispatch_busy(hctx, true);
>>  __blk_mq_requeue_request(rq);
>>  break;
>>
>
> Not sure it is enough to just mark it as NOMERGE, for example, driver
> may have setup the .special_vec for discard, and NOMERGE may not prevent
> request from entering elevator queue completely. Cause 'rq.rb_node' and
> 'rq.special_vec' share same space.

 We should rather limit the scope of the direct dispatch instead. It
 doesn't make sense to do for anything but read/write anyway.
>>>
>>> discard is kind of write, and it isn't treated very specially in make
>>> request path, except for multi-range discard.
>>
>> The point of direct dispatch is to reduce latencies for requests,
>> discards are so damn slow on ALL devices anyway that it doesn't make any
>> sense to try direct dispatch to begin with, regardless of whether it
>> possible or not.
> 
> SCSI MQ device may benefit from direct dispatch from reduced lock contention.
> 
>>
> So how about inserting this request via blk_mq_request_bypass_insert()
> in case that direct issue returns BUSY? Then it is invariant that
> any request queued via .queue_rq() won't enter scheduler queue.

 I did consider this, but I didn't want to experiment with exercising
 a new path for an important bug fix. You do realize that your original
 patch has been corrupting data for months? I think a little caution
 is in order here.
>>>
>>> But marking NOMERGE still may have a hole on re-insert discard request as
>>> mentioned above.
>>
>> What I said was further limit the scope of direct dispatch, which means
>> not allowing anything that isn't a read/write.
> 
> IMO, the conservative approach is to take the one used in legacy io
> path, in which it is never allowed to re-insert queued request to
> scheduler queue except for requeue, however RQF_DONTPREP is cleared
> before requeuing request to scheduler.

I don't agree. Whether you agree or not, I'm doing the simple fix for
4.20 and queueing it up for 4.21 as well. Then I hope Jianchao can
rework his direct dispatch cleanup patches for 4.21 on top of that,
and then we can experiment on top of that with other solutions.

>>> Given we never allow to re-insert queued request to scheduler queue
>>> except for 6ce3dd6eec1, I think it is the correct thing to do, and t

Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Jens Axboe
On 12/4/18 8:03 PM, Ming Lei wrote:
> On Wed, Dec 05, 2018 at 10:58:02AM +0800, Ming Lei wrote:
>> On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote:
>>> On 12/4/18 7:27 PM, Ming Lei wrote:
 On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
> On 12/4/18 6:37 PM, Ming Lei wrote:
>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>>> we queue the request up normally. However, the SCSI layer may have
>>> already setup SG tables etc for this particular command. If we later
>>> merge with this request, then the old tables are no longer valid. Once
>>> we issue the IO, we only read/write the original part of the request,
>>> not the new state of it.
>>>
>>> This causes data corruption, and is most often noticed with the file
>>> system complaining about the just read data being invalid:
>>>
>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode 
>>> #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>>
>>> because most of it is garbage...
>>>
>>> This doesn't happen from the normal issue path, as we will simply defer
>>> the request to the hardware queue dispatch list if we fail. Once it's on
>>> the dispatch list, we never merge with it.
>>>
>>> Fix this from the direct issue path by flagging the request as
>>> REQ_NOMERGE so we don't change the size of it before issue.
>>>
>>> See also:
>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>>
>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in 
>>> case of 'none'")
>>> Signed-off-by: Jens Axboe 
>>>
>>> ---
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 3f91c6e5b17a..d8f518c6ea38 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -1715,6 +1715,15 @@ static blk_status_t 
>>> __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>>> break;
>>> case BLK_STS_RESOURCE:
>>> case BLK_STS_DEV_RESOURCE:
>>> +   /*
>>> +* If direct dispatch fails, we cannot allow any 
>>> merging on
>>> +* this IO. Drivers (like SCSI) may have set up 
>>> permanent state
>>> +* for this request, like SG tables and mappings, and 
>>> if we
>>> +* merge to it later on then we'll still only do IO to 
>>> the
>>> +* original part.
>>> +*/
>>> +   rq->cmd_flags |= REQ_NOMERGE;
>>> +
>>> blk_mq_update_dispatch_busy(hctx, true);
>>> __blk_mq_requeue_request(rq);
>>> break;
>>>
>>
>> Not sure it is enough to just mark it as NOMERGE, for example, driver
>> may have setup the .special_vec for discard, and NOMERGE may not prevent
>> request from entering elevator queue completely. Cause 'rq.rb_node' and
>> 'rq.special_vec' share same space.
>
> We should rather limit the scope of the direct dispatch instead. It
> doesn't make sense to do for anything but read/write anyway.

 discard is kind of write, and it isn't treated very specially in make
 request path, except for multi-range discard.
>>>
>>> The point of direct dispatch is to reduce latencies for requests,
>>> discards are so damn slow on ALL devices anyway that it doesn't make any
>>> sense to try direct dispatch to begin with, regardless of whether it
>>> possible or not.
>>
>> SCSI MQ device may benefit from direct dispatch from reduced lock contention.
>>
>>>
>> So how about inserting this request via blk_mq_request_bypass_insert()
>> in case that direct issue returns BUSY? Then it is invariant that
>> any request queued via .queue_rq() won't enter scheduler queue.
>
> I did consider this, but I didn't want to experiment with exercising
> a new path for an important bug fix. You do realize that your original
> patch has been corrupting data for months? I think a little caution
> is in order here.

 But marking NOMERGE still may have a hole on re-insert discard request as
 mentioned above.
>>>
>>> What I said was further limit the scope of direct dispatch, which means
>>> not allowing anything that isn't a read/write.
>>
>> IMO, the conservative approach is to take the one used in legacy io
>> path, in which it is never allowed to re-insert queued request to
>> scheduler queue except for requeue, however RQF_DONTPREP is cleared
>> before requeuing request to scheduler.
>>
>>>
 Given we never allow to re-insert queued request to scheduler queue
 except for 6ce3dd6eec1, I think it is the correct thing to do, and the
 fix is simple too.
>>>
>>> As I said, it's not the time to experiment. This issue has been there
>>> since 4.1

RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-04 Thread Kashyap Desai
+ Linux-scsi

> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 9497b47..57432be 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -175,6 +175,7 @@ static inline bool
> > blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> >   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> >  struct request *rq)
> >   {
> > +hctx->tags->rqs[rq->tag] = NULL;
> >   blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> >   rq->tag = -1;
>
> No SCSI driver should call scsi_host_find_tag() after a request has
> finished. The above patch introduces yet another race and hence can't be
> a proper fix.

Bart, many SCSI drivers use scsi_host_find_tag() to walk tag ids up to the
maximum and find pending IO in the firmware.
One use case is HBA firmware recovery: as part of recovery, the driver may
need to traverse the list and return the pending SCSI commands to the SML
for retry.
A quick grep of the SCSI code shows that snic_scsi, qla4xxx, fnic and
mpt3sas use the scsi_host_find_tag() API for the same purpose.
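
For context, the pattern those drivers use is roughly the following (a
simplified sketch, not copied from any one driver; shost and max_tags are
whatever the LLD tracks, and <scsi/scsi_host.h> plus <scsi/scsi_tcq.h>
are assumed):

static void lld_walk_outstanding_cmds(struct Scsi_Host *shost,
				      unsigned int max_tags)
{
	unsigned int tag;

	for (tag = 0; tag < max_tags; tag++) {
		/* may return a stale entry from hctx->tags->rqs[] today */
		struct scsi_cmnd *scmd = scsi_host_find_tag(shost, tag);

		if (!scmd)
			continue;
		/* driver-specific: return scmd to the SML for retry */
	}
}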

Without this patch we hit a very basic kernel panic due to a page fault.
This is not an issue in the non-mq code path; that path uses
blk_map_queue_find_tag(), and that particular API does not return stale
requests.

Kashyap

>
> Bart.


Re: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-04 Thread Bart Van Assche
On Tue, 2018-12-04 at 22:17 +0530, Kashyap Desai wrote:
> + Linux-scsi
> 
> > > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > > index 9497b47..57432be 100644
> > > --- a/block/blk-mq.h
> > > +++ b/block/blk-mq.h
> > > @@ -175,6 +175,7 @@ static inline bool
> > > blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> > >   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> > >  struct request *rq)
> > >   {
> > > +hctx->tags->rqs[rq->tag] = NULL;
> > >   blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> > >   rq->tag = -1;
> > 
> > No SCSI driver should call scsi_host_find_tag() after a request has
> > finished. The above patch introduces yet another race and hence can't be
> > a proper fix.
> 
> Bart, many scsi drivers use scsi_host_find_tag() to traverse max tag_id to
> find out pending IO in firmware.
> One of the use case is -  HBA firmware recovery.  In case of firmware
> recovery, driver may require to traverse the list and return back pending
> scsi command to SML for retry.
> I quickly grep the scsi code and found that snic_scsi, qla4xxx, fnic,
> mpt3sas are using API scsi_host_find_tag for the same purpose.
> 
> Without this patch, we hit very basic kernel panic due to page fault.  This
> is not an issue in non-mq code path. Non-mq path use
> blk_map_queue_find_tag() and that particular API does not provide stale
> requests.

As I wrote before, your patch doesn't fix the race you described but only
makes the race window smaller. If you want an example of how to use
scsi_host_find_tag() properly, have a look at the SRP initiator driver
(drivers/infiniband/ulp/srp). That driver uses scsi_host_find_tag() without
triggering any NULL pointer dereferences. The approach used in that driver
also works when having to support HBA firmware recovery.
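
Another way to stay on the safe side, in the spirit of the suggestion
above, is to let the block layer hand the driver only requests it still
considers started, e.g. via blk_mq_tagset_busy_iter(). The sketch below
assumes scsi-mq (where the scsi_cmnd is the request PDU) and the 4.20-era
callback signature; the helper names are made up:

static void lld_recover_one(struct request *rq, void *priv, bool reserved)
{
	struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(rq);

	/* driver-specific recovery of scmd goes here */
}

static void lld_recover_all(struct Scsi_Host *shost)
{
	blk_mq_tagset_busy_iter(&shost->tag_set, lld_recover_one, NULL);
}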

Bart.


RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-04 Thread Kashyap Desai
> -Original Message-
> From: Bart Van Assche [mailto:bvanass...@acm.org]
> Sent: Tuesday, December 4, 2018 10:45 PM
> To: Kashyap Desai; linux-block; Jens Axboe; Ming Lei; linux-scsi
> Cc: Suganath Prabu Subramani; Sreekanth Reddy; Sathya Prakash Veerichetty
> Subject: Re: [PATCH] blk-mq: Set request mapping to NULL in
> blk_mq_put_driver_tag
>
> On Tue, 2018-12-04 at 22:17 +0530, Kashyap Desai wrote:
> > + Linux-scsi
> >
> > > > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > > > index 9497b47..57432be 100644
> > > > --- a/block/blk-mq.h
> > > > +++ b/block/blk-mq.h
> > > > @@ -175,6 +175,7 @@ static inline bool
> > > > blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> > > >   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx
> *hctx,
> > > >  struct request *rq)
> > > >   {
> > > > +hctx->tags->rqs[rq->tag] = NULL;
> > > >   blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> > > >   rq->tag = -1;
> > >
> > > No SCSI driver should call scsi_host_find_tag() after a request has
> > > finished. The above patch introduces yet another race and hence can't
> > > be
> > > a proper fix.
> >
> > Bart, many scsi drivers use scsi_host_find_tag() to traverse max tag_id
> > to
> > find out pending IO in firmware.
> > One of the use case is -  HBA firmware recovery.  In case of firmware
> > recovery, driver may require to traverse the list and return back
> > pending
> > scsi command to SML for retry.
> > I quickly grep the scsi code and found that snic_scsi, qla4xxx, fnic,
> > mpt3sas are using API scsi_host_find_tag for the same purpose.
> >
> > Without this patch, we hit very basic kernel panic due to page fault.
> > This
> > is not an issue in non-mq code path. Non-mq path use
> > blk_map_queue_find_tag() and that particular API does not provide stale
> > requests.
>
> As I wrote before, your patch doesn't fix the race you described but only
> makes the race window smaller.
Hi Bart,

Let me explain the issue. It is not a race but a very straightforward
problem. Say we have one scsi_device, /dev/sda, and a total of 100 IOs
have been submitted and completed.
All 100 IOs are *completed*. Now, as part of firmware recovery, the
driver tries to find its outstanding IOs using scsi_host_find_tag().
The block layer will return all 100 commands to the driver even though
none of them is actually outstanding. With this patch, only the *actual*
outstanding commands are returned.
If scsi_device /dev/sda has not been removed from the OS, the driver
accessing the scmd of those 100 commands is still a safe memory access.

Now consider the case where scsi_device /dev/sda has been removed and
the driver performs firmware recovery. This time the driver will crash
while accessing the scmd (randomly, depending on how the memory was
reused).

Along with this patch, the low level driver should make sure that all
request queues at the block layer are quiesced.
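
A sketch of that quiescing step, assuming scsi-mq so that every
scsi_device has a blk-mq request_queue (the helper name is made up; the
matching blk_mq_unquiesce_queue() calls would follow once recovery
completes):

static void lld_quiesce_all_queues(struct Scsi_Host *shost)
{
	struct scsi_device *sdev;

	/* stop new dispatches on every LUN before walking the tag space */
	shost_for_each_device(sdev, shost)
		blk_mq_quiesce_queue(sdev->request_queue);
}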

> If you want an example of how to use
> scsi_host_find_tag() properly, have a look at the SRP initiator driver
> (drivers/infiniband/ulp/srp). That driver uses scsi_host_find_tag()
> without
> triggering any NULL pointer dereferences.

I was not able to find the exact context in srp, but I checked the srp
code and it looks like that driver only uses scsi_host_find_tag() to get
the scmd for a live command.

> The approach used in that driver
> also works when having to support HBA firmware recovery.
>
> Bart.


[PATCH v5 00/14] block: always associate blkg and refcount cleanup

2018-12-04 Thread Dennis Zhou
Hi everyone,

A special case with dm is the flush bio, which is statically initialized
before the block device is opened and associated with a disk. This
caused blkg association to throw an NPE. 0005 addresses this case by
moving the association onto the flush path.

With v4 moving association to piggyback off of bio_set_dev(), the
special case above caused an NPE. I was overly cautious in v4 and added
the bio_has_queue() check, which is now removed in v5.

Also, the addition of bio_set_dev_only() wasn't quite right because
writeback and swap share the same bio init paths in many places. The
safer thing to do is to double-associate on those paths and split out
the bio init paths in a follow-up series.

Changes in v5: 
All:  Fixed minor grammar and syntactic issues.
0004: Removed bio_has_queue() for being overly cautious.
0005: New, properly addressed the static flush_bio in md. 
0006: Removed the rcu lock in blkcg_bio_issue_check() as the bio will
  own a ref on the blkg so it is unnecessary.
0011: Consolidated bio_associate_blkg_from_css() (removed __ version).

From v4:
This is respin of v3 [1] with fixes for the errors reported in [2] and
[3]. v3 was reverted in [4].

The issue in [3] was that bio->bi_disk->queue and blkg->q were out
of sync. So when I changed blk_get_rl() to use blkg->q, the wrong queue
was returned and the elevator from q->elevator->type threw an NPE. Note
that with v4.21 the old block stack was removed, so that patch was
dropped. I did backport it to v4.20 and verified this series does not
encounter the error.

The biggest changes in v4 are when association occurs and clearly
defining the cases where association should happen.
  1. Association is now done when the device is set to keep blkg->q and
 bio->bi_disk->queue in sync.
  2. When a bio is submitted directly to the device, it will not be
 associated with a blkg. This is because a blkg represents the
 relationship between a blkcg and a request_queue. Going directly to
 the device means the request_queue may not exist meaning no blkg
 will exist.

The patch updating blk_get_rl() was dropped (v3 10/12). The patch to
always associate a blkg from v3 (v3 04/12) was fixed and split into
patches 0004 and 0005. 0011 is new removing bio_disassociate_task().

Summarizing the ideas of this series:
  1. Gracefully handle failure to create a blkg by walking up the blkg
     tree rather than falling through to root (see the sketch after this
     list).
  2. Associate a bio with a blkg in core logic rather than per
 controller logic.
  3. Rather than have a css and blkg reference, hold just a blkg ref
 as it also holds a css ref.
  4. Switch to percpu ref counting for blkg.
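
As a sketch of idea 1, under the assumption that blkg_lookup() and
blkcg_parent() are used under rcu_read_lock() as elsewhere in blk-cgroup
(the helper name is illustrative, not the series' exact code):

static struct blkcg_gq *closest_blkg(struct blkcg *blkcg,
				     struct request_queue *q)
{
	struct blkcg_gq *blkg = NULL;

	/* caller holds rcu_read_lock() */
	while (blkcg) {
		blkg = blkg_lookup(blkcg, q);
		if (blkg)
			break;
		/* creation failed or no blkg yet: try the parent blkcg */
		blkcg = blkcg_parent(blkcg);
	}
	return blkg;
}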

[1] https://lore.kernel.org/lkml/20180911184137.35897-1-dennissz...@gmail.com/ 
[2] https://lore.kernel.org/lkml/13987.1539646...@turing-police.cc.vt.edu/
[3] https://marc.info/?l=linux-cgroups&m=154110436103723
[4] https://lore.kernel.org/lkml/20181101212410.47569-1-den...@kernel.org/

This patchset contains the following 14 patches:
  0001-blkcg-fix-ref-count-issue-with-bio_blkcg-using-task_.patch
  0002-blkcg-update-blkg_lookup_create-to-do-locking.patch
  0003-blkcg-convert-blkg_lookup_create-to-find-closest-blk.patch
  0004-blkcg-introduce-common-blkg-association-logic.patch
  0005-dm-set-flush-bio-device-on-demand.patch
  0006-blkcg-associate-blkg-when-associating-a-device.patch
  0007-blkcg-consolidate-bio_issue_init-to-be-a-part-of-cor.patch
  0008-blkcg-associate-a-blkg-for-pages-being-evicted-by-sw.patch
  0009-blkcg-associate-writeback-bios-with-a-blkg.patch
  0010-blkcg-remove-bio-bi_css-and-instead-use-bio-bi_blkg.patch
  0011-blkcg-remove-additional-reference-to-the-css.patch
  0012-blkcg-remove-bio_disassociate_task.patch
  0013-blkcg-change-blkg-reference-counting-to-use-percpu_r.patch
  0014-blkcg-rename-blkg_try_get-to-blkg_tryget.patch

This patchset is on top of linux-block#for-4.21/block 154989e45fd8.

diffstats below:

Dennis Zhou (14):
  blkcg: fix ref count issue with bio_blkcg() using task_css
  blkcg: update blkg_lookup_create() to do locking
  blkcg: convert blkg_lookup_create() to find closest blkg
  blkcg: introduce common blkg association logic
  dm: set the static flush bio device on demand
  blkcg: associate blkg when associating a device
  blkcg: consolidate bio_issue_init() to be a part of core
  blkcg: associate a blkg for pages being evicted by swap
  blkcg: associate writeback bios with a blkg
  blkcg: remove bio->bi_css and instead use bio->bi_blkg
  blkcg: remove additional reference to the css
  blkcg: remove bio_disassociate_task()
  blkcg: change blkg reference counting to use percpu_ref
  blkcg: rename blkg_try_get() to blkg_tryget()

 Documentation/admin-guide/cgroup-v2.rst |   8 +-
 block/bfq-cgroup.c  |   4 +-
 block/bfq-iosched.c |   2 +-
 block/bio.c | 156 +++-
 block/blk-cgroup.c  |  95 ---
 block/bl

[PATCH 08/14] blkcg: associate a blkg for pages being evicted by swap

2018-12-04 Thread Dennis Zhou
A prior patch in this series added blkg association to bios issued by
cgroups. There are two other paths that we want to attribute work back
to the appropriate cgroup: swap and writeback. Here we modify how the
swap path tags bios with the blkg. Writeback will be tackled in the next
patch.
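
One ordering detail worth calling out, based on the mm/page_io.c hunk
below: by the time the blkg is looked up, get_swap_bio() has already set
the bio's device, so bio->bi_disk->queue is valid. A minimal sketch of
the required ordering in __swap_writepage() (assuming the existing
get_swap_bio() helper):

	bio = get_swap_bio(GFP_NOIO, page, end_write_func); /* sets the device */
	bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc);
	bio_associate_blkg_from_page(bio, page);    /* needs bi_disk->queue */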

Signed-off-by: Dennis Zhou 
Reviewed-by: Josef Bacik 
Acked-by: Tejun Heo 
---
 block/bio.c | 62 +++--
 include/linux/bio.h |  6 ++---
 mm/page_io.c|  2 +-
 3 files changed, 42 insertions(+), 28 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 90089124b512..f0f069c1823c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1957,30 +1957,6 @@ EXPORT_SYMBOL(bioset_init_from_src);
 
 #ifdef CONFIG_BLK_CGROUP
 
-#ifdef CONFIG_MEMCG
-/**
- * bio_associate_blkcg_from_page - associate a bio with the page's blkcg
- * @bio: target bio
- * @page: the page to lookup the blkcg from
- *
- * Associate @bio with the blkcg from @page's owning memcg.  This works like
- * every other associate function wrt references.
- */
-int bio_associate_blkcg_from_page(struct bio *bio, struct page *page)
-{
-   struct cgroup_subsys_state *blkcg_css;
-
-   if (unlikely(bio->bi_css))
-   return -EBUSY;
-   if (!page->mem_cgroup)
-   return 0;
-   blkcg_css = cgroup_get_e_css(page->mem_cgroup->css.cgroup,
-&io_cgrp_subsys);
-   bio->bi_css = blkcg_css;
-   return 0;
-}
-#endif /* CONFIG_MEMCG */
-
 /**
  * bio_associate_blkcg - associate a bio with the specified blkcg
  * @bio: target bio
@@ -2045,6 +2021,44 @@ static void __bio_associate_blkg(struct bio *bio, struct 
blkcg_gq *blkg)
bio->bi_blkg = blkg_try_get_closest(blkg);
 }
 
+static void __bio_associate_blkg_from_css(struct bio *bio,
+ struct cgroup_subsys_state *css)
+{
+   struct blkcg_gq *blkg;
+
+   rcu_read_lock();
+
+   blkg = blkg_lookup_create(css_to_blkcg(css), bio->bi_disk->queue);
+   __bio_associate_blkg(bio, blkg);
+
+   rcu_read_unlock();
+}
+
+#ifdef CONFIG_MEMCG
+/**
+ * bio_associate_blkg_from_page - associate a bio with the page's blkg
+ * @bio: target bio
+ * @page: the page to lookup the blkcg from
+ *
+ * Associate @bio with the blkg from @page's owning memcg and the respective
+ * request_queue.  This works like every other associate function wrt
+ * references.
+ */
+void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
+{
+   struct cgroup_subsys_state *css;
+
+   if (unlikely(bio->bi_css))
+   return;
+   if (!page->mem_cgroup)
+   return;
+
+   css = cgroup_get_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys);
+   bio->bi_css = css;
+   __bio_associate_blkg_from_css(bio, css);
+}
+#endif /* CONFIG_MEMCG */
+
 /**
  * bio_associate_blkg - associate a bio with a blkg
  * @bio: target bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 6ee2ea8b378a..f13572c254a7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -505,10 +505,10 @@ do {  \
disk_devt((bio)->bi_disk)
 
 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
-int bio_associate_blkcg_from_page(struct bio *bio, struct page *page);
+void bio_associate_blkg_from_page(struct bio *bio, struct page *page);
 #else
-static inline int bio_associate_blkcg_from_page(struct bio *bio,
-   struct page *page) {  return 0; 
}
+static inline void bio_associate_blkg_from_page(struct bio *bio,
+   struct page *page) { }
 #endif
 
 #ifdef CONFIG_BLK_CGROUP
diff --git a/mm/page_io.c b/mm/page_io.c
index 5bdfd21c1bd9..3475733b1926 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -339,7 +339,7 @@ int __swap_writepage(struct page *page, struct 
writeback_control *wbc,
goto out;
}
bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc);
-   bio_associate_blkcg_from_page(bio, page);
+   bio_associate_blkg_from_page(bio, page);
count_swpout_vm_event(page);
set_page_writeback(page);
unlock_page(page);
-- 
2.17.1



[PATCH 14/14] blkcg: rename blkg_try_get() to blkg_tryget()

2018-12-04 Thread Dennis Zhou
blkg reference counting now uses percpu_ref rather than atomic_t. Let's
make this consistent with css_tryget. This renames blkg_try_get() to
blkg_tryget(), which now returns a bool rather than the blkg or %NULL.
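
The caller-side change is mechanical; the pattern in the diff below boils
down to:

	/* before */
	blkg = blkg_try_get(blkg);
	if (!blkg)
		goto out;

	/* after */
	if (!blkg_tryget(blkg))
		goto out;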

Signed-off-by: Dennis Zhou 
Reviewed-by: Josef Bacik 
Acked-by: Tejun Heo 
---
 block/bio.c|  2 +-
 block/blk-cgroup.c |  3 +--
 block/blk-iolatency.c  |  2 +-
 include/linux/blk-cgroup.h | 12 +---
 4 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7ec5316e6ecc..06760543ec81 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1990,7 +1990,7 @@ static void __bio_associate_blkg(struct bio *bio, struct 
blkcg_gq *blkg)
 {
bio_disassociate_blkg(bio);
 
-   bio->bi_blkg = blkg_try_get_closest(blkg);
+   bio->bi_blkg = blkg_tryget_closest(blkg);
 }
 
 /**
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2ca7611fe274..6bd0619a7d6e 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1736,8 +1736,7 @@ void blkcg_maybe_throttle_current(void)
blkg = blkg_lookup(blkcg, q);
if (!blkg)
goto out;
-   blkg = blkg_try_get(blkg);
-   if (!blkg)
+   if (!blkg_tryget(blkg))
goto out;
rcu_read_unlock();
 
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 5a79f06a730d..0b14c3d57769 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -698,7 +698,7 @@ static void blkiolatency_timer_fn(struct timer_list *t)
 * We could be exiting, don't access the pd unless we have a
 * ref on the blkg.
 */
-   if (!blkg_try_get(blkg))
+   if (!blkg_tryget(blkg))
continue;
 
iolat = blkg_to_lat(blkg);
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 05909cf31ed1..64ac89130ae8 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -490,27 +490,25 @@ static inline void blkg_get(struct blkcg_gq *blkg)
 }
 
 /**
- * blkg_try_get - try and get a blkg reference
+ * blkg_tryget - try and get a blkg reference
  * @blkg: blkg to get
  *
  * This is for use when doing an RCU lookup of the blkg.  We may be in the 
midst
  * of freeing this blkg, so we can only use it if the refcnt is not zero.
  */
-static inline struct blkcg_gq *blkg_try_get(struct blkcg_gq *blkg)
+static inline bool blkg_tryget(struct blkcg_gq *blkg)
 {
-   if (percpu_ref_tryget(&blkg->refcnt))
-   return blkg;
-   return NULL;
+   return percpu_ref_tryget(&blkg->refcnt);
 }
 
 /**
- * blkg_try_get_closest - try and get a blkg ref on the closet blkg
+ * blkg_tryget_closest - try and get a blkg ref on the closet blkg
  * @blkg: blkg to get
  *
  * This walks up the blkg tree to find the closest non-dying blkg and returns
  * the blkg that it did association with as it may not be the passed in blkg.
  */
-static inline struct blkcg_gq *blkg_try_get_closest(struct blkcg_gq *blkg)
+static inline struct blkcg_gq *blkg_tryget_closest(struct blkcg_gq *blkg)
 {
while (!percpu_ref_tryget(&blkg->refcnt))
blkg = blkg->parent;
-- 
2.17.1



[PATCH 13/14] blkcg: change blkg reference counting to use percpu_ref

2018-12-04 Thread Dennis Zhou
Every bio is now associated with a blkg, putting blkg_get, blkg_try_get,
and blkg_put on the hot path. Switch the refcnt in blkg over to
percpu_ref.
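
The shape of the conversion, distilled from the diff below: the refcount
is initialised with a release callback at blkg creation, taken with a
tryget on the hot path, and killed (rather than put) when the blkg is
destroyed, with percpu_ref_exit() running from the RCU release path:

	/* creation (blkg_create) */
	ret = percpu_ref_init(&blkg->refcnt, blkg_release, 0,
			      GFP_NOWAIT | __GFP_NOWARN);

	/* hot path (blkg_try_get_closest) */
	while (!percpu_ref_tryget(&blkg->refcnt))
		blkg = blkg->parent;

	/* teardown (blkg_destroy) drops the reference taken at creation */
	percpu_ref_kill(&blkg->refcnt);

	/* blkg_release() -> call_rcu() -> percpu_ref_exit() + blkg_free() */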

Signed-off-by: Dennis Zhou 
Acked-by: Tejun Heo 
---
 block/blk-cgroup.c | 41 --
 include/linux/blk-cgroup.h | 15 +-
 2 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 120f2e2835fb..2ca7611fe274 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -81,6 +81,37 @@ static void blkg_free(struct blkcg_gq *blkg)
kfree(blkg);
 }
 
+static void __blkg_release(struct rcu_head *rcu)
+{
+   struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
+
+   percpu_ref_exit(&blkg->refcnt);
+
+   /* release the blkcg and parent blkg refs this blkg has been holding */
+   css_put(&blkg->blkcg->css);
+   if (blkg->parent)
+   blkg_put(blkg->parent);
+
+   wb_congested_put(blkg->wb_congested);
+
+   blkg_free(blkg);
+}
+
+/*
+ * A group is RCU protected, but having an rcu lock does not mean that one
+ * can access all the fields of blkg and assume these are valid.  For
+ * example, don't try to follow throtl_data and request queue links.
+ *
+ * Having a reference to blkg under an rcu allows accesses to only values
+ * local to groups like group stats and group rate limits.
+ */
+static void blkg_release(struct percpu_ref *ref)
+{
+   struct blkcg_gq *blkg = container_of(ref, struct blkcg_gq, refcnt);
+
+   call_rcu(&blkg->rcu_head, __blkg_release);
+}
+
 /**
  * blkg_alloc - allocate a blkg
  * @blkcg: block cgroup the new blkg is associated with
@@ -107,7 +138,6 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, 
struct request_queue *q,
blkg->q = q;
INIT_LIST_HEAD(&blkg->q_node);
blkg->blkcg = blkcg;
-   atomic_set(&blkg->refcnt, 1);
 
for (i = 0; i < BLKCG_MAX_POLS; i++) {
struct blkcg_policy *pol = blkcg_policy[i];
@@ -207,6 +237,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg_get(blkg->parent);
}
 
+   ret = percpu_ref_init(&blkg->refcnt, blkg_release, 0,
+ GFP_NOWAIT | __GFP_NOWARN);
+   if (ret)
+   goto err_cancel_ref;
+
/* invoke per-policy init */
for (i = 0; i < BLKCG_MAX_POLS; i++) {
struct blkcg_policy *pol = blkcg_policy[i];
@@ -239,6 +274,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg_put(blkg);
return ERR_PTR(ret);
 
+err_cancel_ref:
+   percpu_ref_exit(&blkg->refcnt);
 err_put_congested:
wb_congested_put(wb_congested);
 err_put_css:
@@ -367,7 +404,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
 * Put the reference taken at the time of creation so that when all
 * queues are gone, group can be destroyed.
 */
-   blkg_put(blkg);
+   percpu_ref_kill(&blkg->refcnt);
 }
 
 /**
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 68cae7961060..05909cf31ed1 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -123,7 +123,7 @@ struct blkcg_gq {
struct blkcg_gq *parent;
 
/* reference count */
-   atomic_trefcnt;
+   struct percpu_ref   refcnt;
 
/* is this blkg online? protected by both blkcg and q locks */
boolonline;
@@ -486,8 +486,7 @@ static inline int blkg_path(struct blkcg_gq *blkg, char 
*buf, int buflen)
  */
 static inline void blkg_get(struct blkcg_gq *blkg)
 {
-   WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
-   atomic_inc(&blkg->refcnt);
+   percpu_ref_get(&blkg->refcnt);
 }
 
 /**
@@ -499,7 +498,7 @@ static inline void blkg_get(struct blkcg_gq *blkg)
  */
 static inline struct blkcg_gq *blkg_try_get(struct blkcg_gq *blkg)
 {
-   if (atomic_inc_not_zero(&blkg->refcnt))
+   if (percpu_ref_tryget(&blkg->refcnt))
return blkg;
return NULL;
 }
@@ -513,23 +512,19 @@ static inline struct blkcg_gq *blkg_try_get(struct 
blkcg_gq *blkg)
  */
 static inline struct blkcg_gq *blkg_try_get_closest(struct blkcg_gq *blkg)
 {
-   while (!atomic_inc_not_zero(&blkg->refcnt))
+   while (!percpu_ref_tryget(&blkg->refcnt))
blkg = blkg->parent;
 
return blkg;
 }
 
-void __blkg_release_rcu(struct rcu_head *rcu);
-
 /**
  * blkg_put - put a blkg reference
  * @blkg: blkg to put
  */
 static inline void blkg_put(struct blkcg_gq *blkg)
 {
-   WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
-   if (atomic_dec_and_test(&blkg->refcnt))
-   call_rcu(&blkg->rcu_head, __blkg_release_rcu);
+   percpu_ref_put(&blkg->refcnt);
 }
 
 /**
-- 
2.17.1



[PATCH 12/14] blkcg: remove bio_disassociate_task()

2018-12-04 Thread Dennis Zhou
Now that a bio only holds a blkg reference, cleanup is simply a matter of
putting back that reference. Remove bio_disassociate_task(), as it just
calls bio_disassociate_blkg(), and call the latter directly.

Signed-off-by: Dennis Zhou 
Acked-by: Tejun Heo 
---
 block/bio.c | 11 +--
 include/linux/bio.h |  2 --
 2 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index ce1e512dca5a..7ec5316e6ecc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -244,7 +244,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned 
long *idx,
 
 void bio_uninit(struct bio *bio)
 {
-   bio_disassociate_task(bio);
+   bio_disassociate_blkg(bio);
 }
 EXPORT_SYMBOL(bio_uninit);
 
@@ -2073,15 +2073,6 @@ void bio_associate_blkg(struct bio *bio)
 }
 EXPORT_SYMBOL_GPL(bio_associate_blkg);
 
-/**
- * bio_disassociate_task - undo bio_associate_current()
- * @bio: target bio
- */
-void bio_disassociate_task(struct bio *bio)
-{
-   bio_disassociate_blkg(bio);
-}
-
 /**
  * bio_clone_blkg_association - clone blkg association from src to dst bio
  * @dst: destination bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 84e1c4dc703a..7380b094dcca 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -516,7 +516,6 @@ void bio_disassociate_blkg(struct bio *bio);
 void bio_associate_blkg(struct bio *bio);
 void bio_associate_blkg_from_css(struct bio *bio,
 struct cgroup_subsys_state *css);
-void bio_disassociate_task(struct bio *bio);
 void bio_clone_blkg_association(struct bio *dst, struct bio *src);
 #else  /* CONFIG_BLK_CGROUP */
 static inline void bio_disassociate_blkg(struct bio *bio) { }
@@ -524,7 +523,6 @@ static inline void bio_associate_blkg(struct bio *bio) { }
 static inline void bio_associate_blkg_from_css(struct bio *bio,
   struct cgroup_subsys_state *css)
 { }
-static inline void bio_disassociate_task(struct bio *bio) { }
 static inline void bio_clone_blkg_association(struct bio *dst,
  struct bio *src) { }
 #endif /* CONFIG_BLK_CGROUP */
-- 
2.17.1



[PATCH 10/14] blkcg: remove bio->bi_css and instead use bio->bi_blkg

2018-12-04 Thread Dennis Zhou
Prior patches ensured that any bio that interacts with a request_queue
is properly associated with a blkg. This makes bio->bi_css unnecessary
as blkg maintains a reference to blkcg already.

This removes the bio field bi_css and transfers corresponding uses to
access via bi_blkg.
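
With bi_css gone, the blkcg is reachable only through the blkg, so the
accessor ends up looking roughly like the sketch below (the blk-cgroup.h
hunk is truncated in this excerpt, so treat this as the intended shape
rather than the verbatim code):

static inline struct blkcg *bio_blkcg(struct bio *bio)
{
	if (bio && bio->bi_blkg)
		return bio->bi_blkg->blkcg;
	return NULL;
}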

Signed-off-by: Dennis Zhou 
Reviewed-by: Josef Bacik 
Acked-by: Tejun Heo 
---
 block/bio.c| 59 +-
 block/bounce.c |  2 +-
 drivers/block/loop.c   |  5 ++--
 drivers/md/raid0.c |  2 +-
 include/linux/bio.h| 11 +++
 include/linux/blk-cgroup.h |  8 +++---
 include/linux/blk_types.h  |  7 +++--
 kernel/trace/blktrace.c|  4 +--
 8 files changed, 32 insertions(+), 66 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b42477b6a225..2b6bc7b805ec 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -610,7 +610,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
 
-   bio_clone_blkcg_association(bio, bio_src);
+   bio_clone_blkg_association(bio, bio_src);
blkcg_bio_issue_init(bio);
 }
 EXPORT_SYMBOL(__bio_clone_fast);
@@ -1957,34 +1957,6 @@ EXPORT_SYMBOL(bioset_init_from_src);
 
 #ifdef CONFIG_BLK_CGROUP
 
-/**
- * bio_associate_blkcg - associate a bio with the specified blkcg
- * @bio: target bio
- * @blkcg_css: css of the blkcg to associate
- *
- * Associate @bio with the blkcg specified by @blkcg_css.  Block layer will
- * treat @bio as if it were issued by a task which belongs to the blkcg.
- *
- * This function takes an extra reference of @blkcg_css which will be put
- * when @bio is released.  The caller must own @bio and is responsible for
- * synchronizing calls to this function.  If @blkcg_css is %NULL, a call to
- * blkcg_get_css() finds the current css from the kthread or task.
- */
-int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
-{
-   if (unlikely(bio->bi_css))
-   return -EBUSY;
-
-   if (blkcg_css)
-   css_get(blkcg_css);
-   else
-   blkcg_css = blkcg_get_css();
-
-   bio->bi_css = blkcg_css;
-   return 0;
-}
-EXPORT_SYMBOL_GPL(bio_associate_blkcg);
-
 /**
  * bio_disassociate_blkg - puts back the blkg reference if associated
  * @bio: target bio
@@ -1994,6 +1966,8 @@ EXPORT_SYMBOL_GPL(bio_associate_blkcg);
 void bio_disassociate_blkg(struct bio *bio)
 {
if (bio->bi_blkg) {
+   /* a ref is always taken on css */
+   css_put(&bio_blkcg(bio)->css);
blkg_put(bio->bi_blkg);
bio->bi_blkg = NULL;
}
@@ -2047,7 +2021,6 @@ void bio_associate_blkg_from_css(struct bio *bio,
 struct cgroup_subsys_state *css)
 {
css_get(css);
-   bio->bi_css = css;
__bio_associate_blkg_from_css(bio, css);
 }
 EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css);
@@ -2066,13 +2039,10 @@ void bio_associate_blkg_from_page(struct bio *bio, 
struct page *page)
 {
struct cgroup_subsys_state *css;
 
-   if (unlikely(bio->bi_css))
-   return;
if (!page->mem_cgroup)
return;
 
css = cgroup_get_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys);
-   bio->bi_css = css;
__bio_associate_blkg_from_css(bio, css);
 }
 #endif /* CONFIG_MEMCG */
@@ -2094,8 +2064,10 @@ void bio_associate_blkg(struct bio *bio)
 
rcu_read_lock();
 
-   bio_associate_blkcg(bio, NULL);
-   blkcg = bio_blkcg(bio);
+   if (bio->bi_blkg)
+   blkcg = bio->bi_blkg->blkcg;
+   else
+   blkcg = css_to_blkcg(blkcg_get_css());
 
if (!blkcg->css.parent) {
__bio_associate_blkg(bio, q->root_blkg);
@@ -2115,27 +2087,22 @@ EXPORT_SYMBOL_GPL(bio_associate_blkg);
  */
 void bio_disassociate_task(struct bio *bio)
 {
-   if (bio->bi_css) {
-   css_put(bio->bi_css);
-   bio->bi_css = NULL;
-   }
bio_disassociate_blkg(bio);
 }
 
 /**
- * bio_clone_blkcg_association - clone blkcg association from src to dst bio
+ * bio_clone_blkg_association - clone blkg association from src to dst bio
  * @dst: destination bio
  * @src: source bio
  */
-void bio_clone_blkcg_association(struct bio *dst, struct bio *src)
+void bio_clone_blkg_association(struct bio *dst, struct bio *src)
 {
-   if (src->bi_css)
-   WARN_ON(bio_associate_blkcg(dst, src->bi_css));
-
-   if (src->bi_blkg)
+   if (src->bi_blkg) {
+   css_get(&bio_blkcg(src)->css);
__bio_associate_blkg(dst, src->bi_blkg);
+   }
 }
-EXPORT_SYMBOL_GPL(bio_clone_blkcg_association);
+EXPORT_SYMBOL_GPL(bio_clone_blkg_association);
 #endif /* CONFIG_BLK_CGROUP */
 
 static void __init biovec_init_slabs(void)
diff --git a/block/bounce.c b/block/bounce.c
index cfb96d5170d0..ffb9e9ecfa7e 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@

[PATCH 09/14] blkcg: associate writeback bios with a blkg

2018-12-04 Thread Dennis Zhou
One of the goals of this series is to remove the separate reference to
the bio's css; the css can and should be accessed via bio_blkcg(). In
this patch, wbc_init_bio() now requires the bio to already have a device
associated with it.
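
Concretely, the calling convention changes from "any time between
allocation and submission" to "after the device has been set", as the
fs/buffer.c and fs/ext4 hunks below show. The required ordering is
simply:

	bio = bio_alloc(GFP_NOIO, 1);
	bio_set_dev(bio, bh->b_bdev);	/* device (and thus queue) first ... */
	...
	wbc_init_bio(wbc, bio);		/* ... then writeback/cgroup init */
	wbc_account_io(wbc, bh->b_page, bh->b_size);
	submit_bio(bio);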

Signed-off-by: Dennis Zhou 
Reviewed-by: Josef Bacik 
Acked-by: Tejun Heo 
---
 Documentation/admin-guide/cgroup-v2.rst |  8 +---
 block/bio.c | 18 ++
 fs/buffer.c | 10 +-
 fs/ext4/page-io.c   |  2 +-
 include/linux/bio.h |  5 +
 include/linux/writeback.h   |  5 +++--
 6 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 476722b7b636..baf19bf28385 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1879,8 +1879,10 @@ following two functions.
 
   wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and
-   associates the bio with the inode's owner cgroup.  Can be
-   called anytime between bio allocation and submission.
+   associates the bio with the inode's owner cgroup and the
+   corresponding request queue.  This must be called after
+   a queue (device) has been associated with the bio and
+   before submission.
 
   wbc_account_io(@wbc, @page, @bytes)
Should be called for each data segment being written out.
@@ -1899,7 +1901,7 @@ the configuration, the bio may be executed at a lower 
priority and if
 the writeback session is holding shared resources, e.g. a journal
 entry, may lead to priority inversion.  There is no one easy solution
 for the problem.  Filesystems can try to work around specific problem
-cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+cases by skipping wbc_init_bio() and using bio_associate_blkg()
 directly.
 
 
diff --git a/block/bio.c b/block/bio.c
index f0f069c1823c..b42477b6a225 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2034,6 +2034,24 @@ static void __bio_associate_blkg_from_css(struct bio 
*bio,
rcu_read_unlock();
 }
 
+/**
+ * bio_associate_blkg_from_css - associate a bio with a specified css
+ * @bio: target bio
+ * @css: target css
+ *
+ * Associate @bio with the blkg found by combining the css's blkg and the
+ * request_queue of the @bio.  This takes a reference on the css that will
+ * be put upon freeing of @bio.
+ */
+void bio_associate_blkg_from_css(struct bio *bio,
+struct cgroup_subsys_state *css)
+{
+   css_get(css);
+   bio->bi_css = css;
+   __bio_associate_blkg_from_css(bio, css);
+}
+EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css);
+
 #ifdef CONFIG_MEMCG
 /**
  * bio_associate_blkg_from_page - associate a bio with the page's blkg
diff --git a/fs/buffer.c b/fs/buffer.c
index 1286c2b95498..d60d61e8ed7d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3060,11 +3060,6 @@ static int submit_bh_wbc(int op, int op_flags, struct 
buffer_head *bh,
 */
bio = bio_alloc(GFP_NOIO, 1);
 
-   if (wbc) {
-   wbc_init_bio(wbc, bio);
-   wbc_account_io(wbc, bh->b_page, bh->b_size);
-   }
-
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio_set_dev(bio, bh->b_bdev);
bio->bi_write_hint = write_hint;
@@ -3084,6 +3079,11 @@ static int submit_bh_wbc(int op, int op_flags, struct 
buffer_head *bh,
op_flags |= REQ_PRIO;
bio_set_op_attrs(bio, op, op_flags);
 
+   if (wbc) {
+   wbc_init_bio(wbc, bio);
+   wbc_account_io(wbc, bh->b_page, bh->b_size);
+   }
+
submit_bio(bio);
return 0;
 }
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index db7590178dfc..2aa62d58d8dd 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -374,13 +374,13 @@ static int io_submit_init_bio(struct ext4_io_submit *io,
bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES);
if (!bio)
return -ENOMEM;
-   wbc_init_bio(io->io_wbc, bio);
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio_set_dev(bio, bh->b_bdev);
bio->bi_end_io = ext4_end_bio;
bio->bi_private = ext4_get_io_end(io->io_end);
io->io_bio = bio;
io->io_next_block = bh->b_blocknr;
+   wbc_init_bio(io->io_wbc, bio);
return 0;
 }
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f13572c254a7..f0438061a5a3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -515,6 +515,8 @@ static inline void bio_associate_blkg_from_page(struct bio 
*bio,
 int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state 
*blkcg_css);
 void bio_disassociate_blkg(struct bio *bio);
 void bio_associate_blkg(struct bio *bio);
+void bio_associate_blkg_from_css(struct bio *bio,
+struct cgroup_subsys_state *css);
 void bio_disassoci
