On 1/9/19 8:17 PM, Jaegeuk Kim wrote:
> If we don't drop the caches used with the old offset or block_size, we can get old data
> from the new offset/block_size, which gives unexpected data to the user.
>
> For example, Martijn found a loopback bug in the below scenario.
1) LOOP_SET_FD loads first two pages on loop file
If we don't drop the caches used with the old offset or block_size, we can get old data
from the new offset/block_size, which gives unexpected data to the user.
For example, Martijn found a loopback bug in the below scenario.
1) LOOP_SET_FD loads first two pages on loop file
2) LOOP_SET_STATUS64 changes the offset
Add a hint indicating whether a read was served out of the page cache or
whether it hit the media. This is useful for buffered async IO; O_DIRECT reads
would never have this set (for obvious reasons).
If the read hit the page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT
set.
Signed-off-by: Jens Axboe
---
fs/io_ur
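As a rough sketch of how an application might consume this when reaping
completions (the reap helper and the stats fields are illustrative assumptions;
only cqe->flags and IOCQE_FLAG_CACHEHIT come from the patch description):

	struct io_uring_cqe *cqe;

	while ((cqe = get_next_completed_cqe(ring)) != NULL) {	/* hypothetical helper */
		if (cqe->flags & IOCQE_FLAG_CACHEHIT)
			stats.cache_hits++;	/* buffered read served from the page cache */
		else
			stats.media_reads++;	/* read had to go to the device */
		handle_completion(cqe);		/* hypothetical */
	}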
This enables an application to do IO without ever entering the kernel.
By using the SQ ring to fill in new events and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/p
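In rough pseudo-C, the application's fast path then becomes a pure
shared-memory loop; the ring helpers below are placeholders for the SQ-tail
and CQ-head updates, not functions from the patch:

	for (;;) {
		while (have_new_io())
			append_to_sq_ring(ring, prepare_sqe());	/* bump SQ tail, no syscall */
		while (reap_from_cq_ring(ring, &cqe))		/* read CQ head, no syscall */
			handle_completion(&cqe);
	}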
Add support for backing the io_uring fd with either a thread, or a
workqueue and letting those handle the submission for us. This can
be used to reduce overhead for submission, or to always make submission
async. The latter is particularly useful for buffered aio, which is
now fully async with this
Similar to how we use state->ios_left to know how many references to a file to
get, we can use it to allocate the io_kiocb's we need in
bulk.
Signed-off-by: Jens Axboe
---
fs/io_uring.c | 71 +--
1 file changed, 52 insertions(+), 19 deletions
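A minimal sketch of the bulk-allocation idea using the generic slab helper;
this shows the pattern, not the patch's exact code (batch size and error
handling here are assumptions):

	void *reqs[8];		/* batch size picked for the sketch */
	unsigned nr = min_t(unsigned, state->ios_left, ARRAY_SIZE(reqs));

	/* one slab call instead of one kmem_cache_alloc() per iocb */
	nr = kmem_cache_alloc_bulk(kiocb_cachep, GFP_KERNEL, nr, reqs);
	if (!nr)
		return -ENOMEM;	/* bulk alloc frees partials and returns 0 on failure */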
Add support for polled read and write commands. These act like their
non-polled counterparts, except we expect to poll for completion of
them.
To use polling, io_uring_setup() must be used with the
IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
polled and non-polled IO on an io
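For instance, a userspace setup of a polled ring could look like the sketch
below, assuming the syscall number and struct io_uring_params layout of the
eventually merged interface (the posted patchset may differ):

	/* needs <sys/syscall.h>, <unistd.h>, <string.h>, <linux/io_uring.h> */
	static int setup_polled_ring(unsigned entries)
	{
		struct io_uring_params p;

		memset(&p, 0, sizeof(p));
		p.flags = IORING_SETUP_IOPOLL;	/* all IO on this ring is polled for */

		return syscall(__NR_io_uring_setup, entries, &p);
	}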
On the submission side, add file reference batching to the
io_submit_state. We get as many references as the number of iocbs we
are submitting, and drop unused ones if we end up switching files. The
assumption here is that we're usually only dealing with one fd, and if
there are multiple, hopefully
Some use cases repeatedly get and put references to the same file, but
the only exposed interface does these one at a time. As each of
these entail an atomic inc or dec on a shared structure, that cost can
add up.
Add fget_many(), which works just like fget(), except it takes an
argument fo
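The intended usage pattern is roughly the following; fput_many() is the
matching put-side helper from the same change, and the surrounding submission
logic is paraphrased:

	struct file *file;

	/* take all the references we might need with a single atomic add */
	file = fget_many(fd, state->ios_left);
	if (!file)
		return -EBADF;

	/* ... submit one or more requests against 'file' ... */

	/* drop whatever was not handed off to in-flight requests */
	fput_many(file, unused_refs);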
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe
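As an illustration, filling one submission entry for a vectored read could
look like the sketch below; the field names follow the io_uring_sqe layout
that was eventually merged, so treat them as assumptions against this posting:

	struct io_uring_sqe *sqe = get_next_sqe(ring);	/* hypothetical ring helper */

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READV;
	sqe->fd = fd;
	sqe->addr = (unsigned long) iov;	/* iovec array describing the buffers */
	sqe->len = nr_iovs;
	sqe->off = file_offset;
	sqe->user_data = tag;			/* echoed back in the matching cqe */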
For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already
has async polled IO in-flight, but can't wait for them to complete
since polled requests must be actively found and reaped.
Utilize the helper in the blockdev DIRE
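A minimal sketch of the non-sleeping allocation this implies; the GFP flags
are the usual kernel idiom for this situation, not necessarily the exact
choice in the patch:

	req = kmem_cache_alloc(kiocb_cachep, GFP_NOWAIT | __GFP_NOWARN);
	if (!req) {
		/* don't sleep: the submitter may first need to reap its own
		 * polled completions to free up requests, then retry */
		return -EAGAIN;
	}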
We have to add each submitted polled request to the io_ring_ctx
poll_submitted list, which means we have to grab the poll_lock. We
already use the block plug to batch submissions if we're doing a batch
of IO submissions; extend that to cover the poll requests internally as
well.
Signed-off-by: Jen
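The submission side then wraps the batch in a plug, roughly as in this
sketch (io_submit_one() is a placeholder for the per-iocb submission path):

	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);
	for (i = 0; i < nr; i++)
		io_submit_one(ctx, &iocbs[i]);	/* polled reqs collected in the plug */
	blk_finish_plug(&plug);	/* flush: poll list updated under one poll_lock grab */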
Here's v2 of the io_uring interface. See the v1 posting for some more info:
https://lore.kernel.org/linux-block/20190108165645.19311-1-ax...@kernel.dk/
The data structures changed, to improve the symmetry of the submission
and completion side. The io_uring_iocb is now io_uring_sqe, but it
otherwi
From: Christoph Hellwig
This new method is used to explicitly poll for I/O completion for an
iocb. It must be called for any iocb submitted asynchronously (that
is with a non-null ki_complete) which has the IOCB_HIPRI flag set.
The method is assisted by a new ki_cookie field in struct kiocb to
For an ITER_BVEC, we can just iterate the iov and add the pages
to the bio directly. This requires that the caller doesn't release
the pages on IO completion; we add a BIO_HOLD_PAGES flag for that.
The current two callers of bio_iov_iter_get_pages() are updated to
check if they need to release pa
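On the completion side, the new flag gates the page release along the lines
of this sketch; bio_flagged(), bio_for_each_segment_all() and put_page() are
existing helpers, but the function body is paraphrased from the description,
not copied from the patch:

	static void release_bio_pages(struct bio *bio)	/* name is illustrative */
	{
		struct bio_vec *bvec;
		int i;

		if (bio_flagged(bio, BIO_HOLD_PAGES))
			return;		/* caller still owns the pages, don't put them */

		bio_for_each_segment_all(bvec, bio, i)
			put_page(bvec->bv_page);
	}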
If we have fixed user buffers, we can map them into the kernel when we
set up the io_context. That avoids the need to do get_user_pages() for
each and every IO.
To utilize this feature, the application must pass in an array of iovecs
that contain the desired buffer addresses and lengths. These buff
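From the application's side this means describing every buffer up front,
something like the sketch below; how the iovec array is actually handed to
the kernel differs between this posting and the later merged interface, so
the final comment is only a placeholder:

	/* needs <sys/uio.h> and <stdlib.h> */
	struct iovec iov[8];
	int i;

	for (i = 0; i < 8; i++) {
		if (posix_memalign(&iov[i].iov_base, 4096, 128 * 1024))
			return -ENOMEM;
		iov[i].iov_len = 128 * 1024;
	}
	/* hand 'iov' to the ring at setup/registration time; fixed-buffer
	   reads and writes then reference a buffer index, not a fresh address */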
From: Christoph Hellwig
Store the request queue the last bio was submitted to in the iocb
private data in addition to the cookie so that we find the right block
device. Also refactor the common direct I/O bio submission code into a
nice little helper.
Signed-off-by: Christoph Hellwig
Modified
From: Christoph Hellwig
Just call blk_poll on the iocb cookie, we can derive the block device
from the inode trivially.
Reviewed-by: Johannes Thumshirn
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
fs/block_dev.c | 10 ++
1 file changed, 10 insertions(+)
diff --git
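Reconstructed from the description, the added method is essentially the
following (the exact shape in the posted diff may differ slightly):

	static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
	{
		struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
		struct request_queue *q = bdev_get_queue(bdev);

		return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
	}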
On 1/9/19 1:59 PM, Jonathan Corbet wrote:
> Commit 5f0ed774ed29 ("block: sum requests in the plug structure") removed
> the request_count parameter from blk_attempt_plug_merge(), but did not
> remove the associated kerneldoc comment, introducing this warning to the
> docs build:
>
> ./block/bl
Commit 5f0ed774ed29 ("block: sum requests in the plug structure") removed
the request_count parameter from blk_attempt_plug_merge(), but did not
remove the associated kerneldoc comment, introducing this warning to the
docs build:
./block/blk-core.c:685: warning: Excess function parameter 'requ
On Tue, 2018-12-18 at 14:41 -0800, Jaegeuk Kim wrote:
> [ ... ]
Please post new versions of a patch as a new e-mail thread instead of
as a reply to a previous e-mail.
> [ ... ]
>
> if (lo->lo_offset != info->lo_offset ||
> lo->lo_sizelimit != info->lo_sizelimit) {
> +
On 1/9/19 12:06 PM, Christoph Hellwig wrote:
>> +struct iocb_submit {
>> +	const struct io_uring_iocb *iocb;
>> +	unsigned int index;
>> +};
>> +
>> +struct io_work {
>> +	struct work_struct work;
>> +	struct io_ring_ctx *ctx;
>> +	struct io_uring_iocb iocb;
>> +	unsigned iocb_ind
On 1/9/19 12:03 PM, Christoph Hellwig wrote:
> On Wed, Jan 09, 2019 at 09:57:59AM -0700, Jens Axboe wrote:
>> On 1/9/19 5:13 AM, Christoph Hellwig wrote:
>>>> +	if (!state)
>>>> +		req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
>>>
>>> Just return an error here if kmem_cache_alloc fails
On 1/9/19 11:30 AM, Christoph Hellwig wrote:
> On Wed, Jan 09, 2019 at 08:53:31AM -0700, Jens Axboe wrote:
>>>> +static int io_setup_rw(int rw, const struct io_uring_iocb *iocb,
>>>> +		       struct iovec **iovec, struct iov_iter *iter)
>>>> +{
>>>> +	void __user *buf = (void __user *)(ui
> +struct iocb_submit {
> + const struct io_uring_iocb *iocb;
> + unsigned int index;
> +};
> +
> +struct io_work {
> + struct work_struct work;
> + struct io_ring_ctx *ctx;
> + struct io_uring_iocb iocb;
> + unsigned iocb_index;
> +};
I think we should use struct iocb_subm
On Wed, Jan 09, 2019 at 09:57:59AM -0700, Jens Axboe wrote:
> On 1/9/19 5:13 AM, Christoph Hellwig wrote:
> >> + if (!state)
> >> + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
> >
> > Just return an error here if kmem_cache_alloc fails.
> >
> >> + if (req)
> >> + io_req_
This looks good. I wonder if there is any good way to prevent other
drivers from picking up this bug by using a better interface, but
that should not delay your fix.
On Wed, Jan 09, 2019 at 08:53:31AM -0700, Jens Axboe wrote:
> >> +static int io_setup_rw(int rw, const struct io_uring_iocb *iocb,
> >> + struct iovec **iovec, struct iov_iter *iter)
> >> +{
> >> + void __user *buf = (void __user *)(uintptr_t)iocb->addr;
> >> + size_t ret;
> >> +
On 1/9/19 5:16 AM, Christoph Hellwig wrote:
>> +static int io_setup_rw(int rw, struct io_kiocb *kiocb,
>> + const struct io_uring_iocb *iocb, struct iovec **iovec,
>> + struct iov_iter *iter, bool kaddr)
>> {
>> void __user *buf = (void __user *)(uintptr_t)
On 1/9/19 5:13 AM, Christoph Hellwig wrote:
>> +	if (!state)
>> +		req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
>
> Just return an error here if kmem_cache_alloc fails.
>
>> +	if (req)
>> +		io_req_init(ctx, req);
>
> Because all the other ones can't reach this w
On 1/9/19 5:12 AM, Christoph Hellwig wrote:
> On Tue, Jan 08, 2019 at 09:56:39AM -0700, Jens Axboe wrote:
>> In preparation for having pre-allocated requests that we then just
>> need to initialize before use.
>>
>> Signed-off-by: Jens Axboe
>> ---
>> fs/io_uring.c | 13 +
>> 1 file
On 9 Jan 2019, at 11:00, Matthew Wilcox wrote:
> On Tue, Jan 08, 2019 at 09:56:29AM -0700, Jens Axboe wrote:
>> After some arm twisting from Christoph, I finally caved and divorced
>> the
>> aio-poll patches from aio/libaio itself. The io_uring interface
>> itself
>> is useful and efficient, and
On Tue, Jan 08, 2019 at 09:56:29AM -0700, Jens Axboe wrote:
> After some arm twisting from Christoph, I finally caved and divorced the
> aio-poll patches from aio/libaio itself. The io_uring interface itself
> is useful and efficient, and after rebasing all the new goodies on top
> of that, there w
On 1/9/19 5:11 AM, Christoph Hellwig wrote:
> On Tue, Jan 08, 2019 at 09:56:35AM -0700, Jens Axboe wrote:
>> Add polled variants of the read and write commands. These act like their
>> non-polled counterparts, except we expect to poll for completion of
>> them.
>
> These don't really need command
On 1/9/19 5:10 AM, Christoph Hellwig wrote:
>> index 293733f61594..9ef9987b4192 100644
>> --- a/fs/Makefile
>> +++ b/fs/Makefile
>> @@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
>> obj-$(CONFIG_TIMERFD) += timerfd.o
>> obj-$(CONFIG_EVENTFD) += even
This patch bumps up write-hint count to support four new, in-kernel
hints.
Signed-off-by: Kanchan Joshi
---
include/linux/blkdev.h | 5 -
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 338604d..df07759 100644
--- a/include/l
submit_bh and write_dirty_buffer do not take a write-hint as a parameter.
This patch introduces variants which do.
Signed-off-by: Kanchan Joshi
---
fs/buffer.c | 18 --
include/linux/buffer_head.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git
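Since submit_bh() already funnels into submit_bh_wbc(), which accepts a write
hint internally, the new variant is presumably a thin wrapper along these
lines (the wrapper name is an assumption, not taken from the patch):

	int submit_bh_with_hint(int op, int op_flags, struct buffer_head *bh,
				enum rw_hint hint)
	{
		/* same as submit_bh(), but let the caller choose the hint that
		 * ends up in bio->bi_write_hint */
		return submit_bh_wbc(op, op_flags, bh, hint, NULL);
	}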
Existing write-hints are exposed to user-mode. There is a possibility
of conflict if the kernel happens to use those. This patch introduces four
write-hints for exclusive kernel-mode use.
Signed-off-by: Kanchan Joshi
---
include/linux/fs.h | 5 +
1 file changed, 5 insertions(+)
diff --git a/inclu
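Illustratively, this extends enum rw_hint in include/linux/fs.h with values
above the user-visible RWH_* range; the existing entries below match current
fs.h, while the names of the new kernel-only entries are placeholders:

	enum rw_hint {
		WRITE_LIFE_NOT_SET	= 0,
		WRITE_LIFE_NONE		= RWH_WRITE_LIFE_NONE,
		WRITE_LIFE_SHORT	= RWH_WRITE_LIFE_SHORT,
		WRITE_LIFE_MEDIUM	= RWH_WRITE_LIFE_MEDIUM,
		WRITE_LIFE_LONG		= RWH_WRITE_LIFE_LONG,
		WRITE_LIFE_EXTREME	= RWH_WRITE_LIFE_EXTREME,
		/* new: reserved for in-kernel users, never set via fcntl() */
		WRITE_LIFE_KERN_1,
		WRITE_LIFE_KERN_2,
		WRITE_LIFE_KERN_3,
		WRITE_LIFE_KERN_4,
	};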
Towards supporting write-hints/streams for the filesystem journal.
Here is the v1 patch for background -
https://marc.info/?l=linux-fsdevel&m=15637519020&w=2
For NAND-based SSDs, mixing data with different life-times reduces the
efficiency of internal garbage-collection. During FS operations, a series
of journal updates will follow/precede a series of data/meta updates, causing
intermixing inside the SSD. By passing a write-hint with the journal, its writes
can be iso
Thanks; noted.
On Wed, Jan 9, 2019 at 9:39 AM Jens Axboe wrote:
>
> On 1/8/19 2:56 PM, John Pittman wrote:
> > Of the tunables available for the bfq I/O scheduler,
> > the only one missing from the documentation in
> > 'Documentation/block/bfq-iosched.txt' is slice_idle_us.
> > Add this tunable
On 1/8/19 2:56 PM, John Pittman wrote:
> Of the tunables available for the bfq I/O scheduler,
> the only one missing from the documentation in
> 'Documentation/block/bfq-iosched.txt' is slice_idle_us.
> Add this tunable to the documentation and a short
> explanation of its purpose.
Applied, but I
> +static int io_setup_rw(int rw, struct io_kiocb *kiocb,
> +const struct io_uring_iocb *iocb, struct iovec **iovec,
> +struct iov_iter *iter, bool kaddr)
> {
> void __user *buf = (void __user *)(uintptr_t)iocb->addr;
> size_t ret;
>
> - re
> +	if (!state)
> +		req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
Just return an error here if kmem_cache_alloc fails.
> +	if (req)
> +		io_req_init(ctx, req);
Because all the other ones can't reach this with a NULL req.
On Tue, Jan 08, 2019 at 09:56:39AM -0700, Jens Axboe wrote:
> In preparation for having pre-allocated requests that we then just
> need to initialize before use.
>
> Signed-off-by: Jens Axboe
> ---
> fs/io_uring.c | 13 +
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff
On Tue, Jan 08, 2019 at 09:56:35AM -0700, Jens Axboe wrote:
> Add polled variants of the read and write commands. These act like their
> non-polled counterparts, except we expect to poll for completion of
> them.
These don't really need command variants, but a different type of context.
>
> index 293733f61594..9ef9987b4192 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
> obj-$(CONFIG_TIMERFD)		+= timerfd.o
> obj-$(CONFIG_EVENTFD)		+= eventfd.o
> obj-$(CONFIG_USERFAULTFD)	+= userfa
On 09/01/2019 02:35, Damien Le Moal wrote:
> From: Shin'ichiro Kawasaki
> +_test_dev_is_zoned() {
> + local zoned_file="${TEST_DEV_SYSFS}/queue/zoned"
> + if grep -q -e "none" "${zoned_file}" ; then
Nit: I think we can leave the zoned_file variable out
if grep -qe "none" "${TEST_D