Re: [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring

2019-06-19 Thread Maxim Levitsky
On Wed, 2019-06-19 at 11:14 +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 17, 2019 at 03:26:50PM +0300, Maxim Levitsky wrote:
> > On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> > > +if (!cqes) {
> > > +break;
> > > +}
> > > +LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> > > +ret = cqes->res;
> > > +
> > > +if (ret == luringcb->qiov->size) {
> > > +ret = 0;
> > > +} else if (ret >= 0) {
> > 
> > 
> > You should very carefully check the allowed return values here.
> > 
> > It looks like you can get '-EINTR' here, which would ask you to rerun the
> > read operation; otherwise you will get the number of bytes read, which
> > might be less than what was asked for. That implies you need to retry the
> > read operation with the remainder of the buffer rather than zero the end
> > of the buffer, IMHO.
> > 
> > (0 is returned on EOF according to 'read' semantics, which I think are
> > used here, thus a short read might not be an EOF.)
> > 
> > 
> > Looking at linux-aio.c though, I do see that it just passes through the
> > returned value with no special treatment, including no check for -EINTR.
> > 
> > I assume that since aio is Linux specific, and it only supports direct IO,
> > it happens to get away with assuming no short reads/-EINTR (but since
> > libaio has very sparse documentation I can't verify this).
> > 
> > On the other hand, the aio=threads implementation actually does everything
> > as specified in the 'write' manpage, retrying the reads on -EINTR and
> > doing additional reads if fewer than the required number of bytes were
> > read.
> > 
> > Looking at the io_uring implementation in the kernel, I see that it does
> > support synchronous (non O_DIRECT) mode, and in this case it goes through
> > the same ->read_iter, which is pretty much the same path that regular
> > read() takes, and so it might return short reads and/or -EINTR.
> 
> Interesting point.  Investigating EINTR should at least be a TODO
> comment and needs to be resolved before io_uring lands in a QEMU
> release.
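For reference, the read(2) retry semantics described above boil down to a
loop like this hypothetical helper (read_full is an illustration, not part
of the patch):

#include <errno.h>
#include <unistd.h>

static ssize_t read_full(int fd, void *buf, size_t count, off_t offset)
{
    size_t done = 0;

    while (done < count) {
        ssize_t n = pread(fd, (char *)buf + done, count - done,
                          offset + done);
        if (n < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted: rerun the read */
            }
            return -errno;          /* genuine error */
        }
        if (n == 0) {
            break;                  /* EOF: the short read is final */
        }
        done += n;                  /* short read: retry the remainder */
    }
    return done;
}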
> 
> > > +static int ioq_submit(LuringState *s)
> > > +{
> > > +int ret = 0;
> > > +LuringAIOCB *luringcb, *luringcb_next;
> > > +
> > > +while (s->io_q.in_queue > 0) {
> > > +QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
> > > +  luringcb_next) {
> > 
> > I am torn about the 'sq_overflow' name. It seems to me that it's not
> > immediately clear that these are the requests that are waiting because
> > the io_uring got full, but I can't think of a better name right now.
> > 
> > Maybe add a comment here to explain what is going on?
> 
> Hmm...I suggested this name because I thought it was clear.  But the
> fact that it puzzled you proves it wasn't clear :-).
> 
> Can anyone think of a better name?  It's the queue we keep in QEMU to
> hold requests while the io_uring sq ring is full.
> 
> > Also, maybe we could somehow utilize the plug/unplug facility to avoid
> > reaching that state in the first place? Maybe the block layer has some
> > kind of 'max outstanding requests' limit that could be used?
> > 
> > In my nvme-mdev I opted not to process the input queues when such a
> > condition is detected, but here you can't, as the block layer pretty much
> > calls you to process the requests.
> 
> Block layer callers are allowed to submit as many I/O requests as they
> like and there is no feedback mechanism.  It's up to linux-aio.c and
> io_uring.c to handle the case where host kernel I/O submission resources
> are exhausted.
> 
> Plug/unplug is a batching performance optimization to reduce the number
> of io_uring_enter() calls but it does not stop the callers from
> submitting more I/O requests.  So plug/unplug isn't directly applicable
> here.
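Plug/unplug itself is just a deferral mechanism; a minimal sketch, modeled
on linux-aio.c (the luring_* names here are assumed, not taken from the
patch):

void luring_io_plug(LuringState *s)
{
    s->io_q.plugged++;              /* defer submission while plugged */
}

void luring_io_unplug(LuringState *s)
{
    assert(s->io_q.plugged);
    if (--s->io_q.plugged == 0 &&
        !s->io_q.blocked && s->io_q.in_queue > 0) {
        ioq_submit(s);              /* one batched io_uring_enter() */
    }
}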

Thanks for the explanation! I guess we can leave the name as is, but add a
comment in the place where the queue is accessed, along the lines of the
sketch below.
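Such a comment could read along these lines (a sketch of the suggestion,
not committed code):

/*
 * Requests that could not get an sqe because the io_uring submission
 * queue was full are parked on sq_overflow; drain them first so that
 * requests are submitted in FIFO order.
 */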



> 
> > > +static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> > > +uint64_t offset, int type)
> > > +{
> > > +struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> > > +if (!sqes) {
> > > +sqes = &luringcb->sqeq;
> > > +QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
> > > +}
> > > +
> > > +switch (type) {
> > > +case QEMU_AIO_WRITE:
> > > +io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
> > > + luringcb->qiov->niov, offset);
> > > +break;
> > > +case QEMU_AIO_READ:
> > > +io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
> > > +luringcb->qiov->niov, offset);
> > > +break;
> > > +case QEMU_AIO_FLUSH:
> > > +io_uring_prep_fsync(sqes, fd, 0);
> > > +break;
> > > +default:
> > > +fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> > > +__func__, type);

Re: [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring

2019-06-19 Thread Stefan Hajnoczi
On Mon, Jun 17, 2019 at 03:26:50PM +0300, Maxim Levitsky wrote:
> On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> > +if (!cqes) {
> > +break;
> > +}
> > +LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> > +ret = cqes->res;
> > +
> > +if (ret == luringcb->qiov->size) {
> > +ret = 0;
> > +} else if (ret >= 0) {
> 
> 
> You should very carefully check the allowed return values here.
> 
> It looks like you can get '-EINTR' here, which would ask you to rerun the
> read operation; otherwise you will get the number of bytes read, which
> might be less than what was asked for. That implies you need to retry the
> read operation with the remainder of the buffer rather than zero the end
> of the buffer, IMHO.
> 
> (0 is returned on EOF according to 'read' semantics, which I think are
> used here, thus a short read might not be an EOF.)
> 
> 
> Looking at linux-aio.c though, I do see that it just passes through the
> returned value with no special treatment, including no check for -EINTR.
> 
> I assume that since aio is Linux specific, and it only supports direct IO,
> it happens to get away with assuming no short reads/-EINTR (but since
> libaio has very sparse documentation I can't verify this).
> 
> On the other hand, the aio=threads implementation actually does everything
> as specified in the 'write' manpage, retrying the reads on -EINTR and
> doing additional reads if fewer than the required number of bytes were
> read.
> 
> Looking at the io_uring implementation in the kernel, I see that it does
> support synchronous (non O_DIRECT) mode, and in this case it goes through
> the same ->read_iter, which is pretty much the same path that regular
> read() takes, and so it might return short reads and/or -EINTR.

Interesting point.  Investigating EINTR should at least be a TODO
comment and needs to be resolved before io_uring lands in a QEMU
release.

> > +static int ioq_submit(LuringState *s)
> > +{
> > +int ret = 0;
> > +LuringAIOCB *luringcb, *luringcb_next;
> > +
> > +while (s->io_q.in_queue > 0) {
> > +QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
> > +  luringcb_next) {
> 
> I am torn about the 'sq_overflow' name. It seems to me that it's not
> immediately clear that these are the requests that are waiting because
> the io_uring got full, but I can't think of a better name right now.
> 
> Maybe add a comment here to explain what is going on?

Hmm...I suggested this name because I thought it was clear.  But the
fact that it puzzled you proves it wasn't clear :-).

Can anyone think of a better name?  It's the queue we keep in QEMU to
hold requests while the io_uring sq ring is full.

> Also, maybe we could somehow utilize the plug/unplug facility to avoid
> reaching that state in the first place? Maybe the block layer has some
> kind of 'max outstanding requests' limit that could be used?
> 
> In my nvme-mdev I opted not to process the input queues when such a
> condition is detected, but here you can't, as the block layer pretty much
> calls you to process the requests.

Block layer callers are allowed to submit as many I/O requests as they
like and there is no feedback mechanism.  It's up to linux-aio.c and
io_uring.c to handle the case where host kernel I/O submission resources
are exhausted.

Plug/unplug is a batching performance optimization to reduce the number
of io_uring_enter() calls but it does not stop the callers from
submitting more I/O requests.  So plug/unplug isn't directly applicable
here.

> > +static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> > +uint64_t offset, int type)
> > +{
> > +struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> > +if (!sqes) {
> > +sqes = &luringcb->sqeq;
> > +QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
> > +}
> > +
> > +switch (type) {
> > +case QEMU_AIO_WRITE:
> > +io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
> > + luringcb->qiov->niov, offset);
> > +break;
> > +case QEMU_AIO_READ:
> > +io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
> > +luringcb->qiov->niov, offset);
> > +break;
> > +case QEMU_AIO_FLUSH:
> > +io_uring_prep_fsync(sqes, fd, 0);
> > +break;
> > +default:
> > +fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> > +__func__, type);
> 
> Nitpick: don't we use some kind of error printing function, like
> 'error_setg', rather than fprintf?

Here we're not in a context where an Error object can be returned (e.g.
printed by the QMP monitor).  We only have an errno return value that
the emulated storage controller may squash down further to a single
EIO-type error code.

'type' is a QEMU-internal value so the default case is 
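Given that errno-only context, one plausible ending for the default branch
(abort() is an assumption here, not Stefan's confirmed conclusion):

default:
    /* 'type' is QEMU-internal, so reaching here is a programming error
     * rather than a runtime I/O failure. */
    fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
            __func__, type);
    abort();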

Re: [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring

2019-06-17 Thread Maxim Levitsky
On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> Aborts when sqe fails to be set as sqes cannot be returned to the ring.
> 
> Signed-off-by: Aarushi Mehta 
> ---
>  MAINTAINERS |   7 +
>  block/Makefile.objs |   3 +
>  block/io_uring.c| 314 
>  include/block/aio.h |  16 +-
>  include/block/raw-aio.h |  12 ++
>  5 files changed, 351 insertions(+), 1 deletion(-)
>  create mode 100644 block/io_uring.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7be1225415..49f896796e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2516,6 +2516,13 @@ F: block/file-posix.c
>  F: block/file-win32.c
>  F: block/win32-aio.c
>  
> +Linux io_uring
> +M: Aarushi Mehta 
> +R: Stefan Hajnoczi 
> +L: qemu-bl...@nongnu.org
> +S: Maintained
> +F: block/io_uring.c
> +
>  qcow2
>  M: Kevin Wolf 
>  M: Max Reitz 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index ae11605c9f..8fde7a23a5 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -18,6 +18,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
>  block-obj-$(CONFIG_WIN32) += file-win32.o win32-aio.o
>  block-obj-$(CONFIG_POSIX) += file-posix.o
>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
> +block-obj-$(CONFIG_LINUX_IO_URING) += io_uring.o
>  block-obj-y += null.o mirror.o commit.o io.o create.o
>  block-obj-y += throttle-groups.o
>  block-obj-$(CONFIG_LINUX) += nvme.o
> @@ -61,5 +62,7 @@ block-obj-$(if $(CONFIG_LZFSE),m,n) += dmg-lzfse.o
>  dmg-lzfse.o-libs   := $(LZFSE_LIBS)
>  qcow.o-libs:= -lz
>  linux-aio.o-libs   := -laio
> +io_uring.o-cflags  := $(LINUX_IO_URING_CFLAGS)
> +io_uring.o-libs:= $(LINUX_IO_URING_LIBS)
>  parallels.o-cflags := $(LIBXML2_CFLAGS)
>  parallels.o-libs   := $(LIBXML2_LIBS)
> diff --git a/block/io_uring.c b/block/io_uring.c
> new file mode 100644
> index 00..f327c7ef96
> --- /dev/null
> +++ b/block/io_uring.c
> @@ -0,0 +1,314 @@
> +/*
> + * Linux io_uring support.
> + *
> + * Copyright (C) 2009 IBM, Corp.
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2019 Aarushi Mehta
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include <liburing.h>
> +#include "qemu-common.h"
> +#include "block/aio.h"
> +#include "qemu/queue.h"
> +#include "block/block.h"
> +#include "block/raw-aio.h"
> +#include "qemu/coroutine.h"
> +#include "qapi/error.h"
> +
> +#define MAX_EVENTS 128
> +
> +typedef struct LuringAIOCB {
> +Coroutine *co;
> +struct io_uring_sqe sqeq;
> +ssize_t ret;
> +QEMUIOVector *qiov;
> +bool is_read;
> +QSIMPLEQ_ENTRY(LuringAIOCB) next;
> +} LuringAIOCB;
> +
> +typedef struct LuringQueue {
> +int plugged;
> +unsigned int in_queue;
> +unsigned int in_flight;
> +bool blocked;
> +QSIMPLEQ_HEAD(, LuringAIOCB) sq_overflow;
> +} LuringQueue;
> +
> +typedef struct LuringState {
> +AioContext *aio_context;
> +
> +struct io_uring ring;
> +
> +/* io queue for submit at batch.  Protected by AioContext lock. */
> +LuringQueue io_q;
> +
> +/* I/O completion processing.  Only runs in I/O thread.  */
> +QEMUBH *completion_bh;
> +} LuringState;
> +
> +/**
> + * ioq_submit:
> + * @s: AIO state
> + *
> + * Queues pending sqes and submits them
> + *
> + */
> +static int ioq_submit(LuringState *s);
> +
> +/**
> + * qemu_luring_process_completions:
> + * @s: AIO state
> + *
> + * Fetches completed I/O requests, consumes cqes and invokes their callbacks.
> + *
> + */
> +static void qemu_luring_process_completions(LuringState *s)
> +{
> +struct io_uring_cqe *cqes;
> +int ret;
> +
> +/*
> + * Request completion callbacks can run the nested event loop.
> + * Schedule ourselves so the nested event loop will "see" remaining
> + * completed requests and process them.  Without this, completion
> + * callbacks that wait for other requests using a nested event loop
> + * would hang forever.
> + */

About that qemu_bh_schedule: the code is copied from linux-aio.c, where it
was added with the commit below.

Author: Stefan Hajnoczi 
Date:   Mon Aug 4 16:56:33 2014 +0100

linux-aio: avoid deadlock in nested aio_poll() calls

If two Linux AIO request completions are fetched in the same
io_getevents() call, QEMU will deadlock if request A's callback waits
for request B to complete using an aio_poll() loop.  This was reported
to happen with the mirror blockjob.

This patch moves completion processing into a BH and makes it resumable.
Nested event loops can resume completion processing so that request B
will complete and the deadlock will not occur.

Cc: Kevin Wolf 
Cc: Paolo Bonzini 
Cc: Ming Lei 
Cc: Marcin Gibuła 
Reported-by: Marcin Gibuła 
Signed-off-by: Stefan Hajnoczi 
Tested-by: Marcin Gibuła 


I kind of opened a Pandora's box by researching that area 
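
The pattern that commit introduced, reduced to a skeleton (illustrative
only; the patch's full loop appears later in this thread):

static void qemu_luring_process_completions(LuringState *s)
{
    /* Schedule the BH first: if a completion callback below runs a
     * nested aio_poll(), the BH re-enters this function so cqes that
     * were already fetched in this batch still get processed. */
    qemu_bh_schedule(s->completion_bh);

    /* ... peek cqes, set luringcb->ret, wake coroutines ... */

    /* Everything was consumed in this pass; drop the pending BH. */
    qemu_bh_cancel(s->completion_bh);
}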

Re: [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring

2019-06-12 Thread Stefan Hajnoczi
On Tue, Jun 11, 2019 at 07:17:14PM +0800, Fam Zheng wrote:
> On Mon, 06/10 19:18, Aarushi Mehta wrote:
> > +/* Prevent infinite loop if submission is refused */
> > +if (ret <= 0) {
> > +if (ret == -EAGAIN) {
> > +continue;
> > +}
> > +break;
> > +}
> > +s->io_q.in_flight += ret;
> > +s->io_q.in_queue  -= ret;
> > +}
> > +s->io_q.blocked = (s->io_q.in_queue > 0);
> 
> I'm confused about s->io_q.blocked. ioq_submit is where it gets updated, but
> if it becomes true, calling ioq_submit will be fenced. So how does it get
> cleared?

When blocked, additional I/O requests are not submitted until the next
completion.  See qemu_luring_process_completions_and_submit() for the
code path where ioq_submit() gets called again.

Stefan
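
A sketch of that path, with the function name taken from the reply above
and a body that is an assumption modeled on linux-aio.c:

static void qemu_luring_process_completions_and_submit(LuringState *s)
{
    aio_context_acquire(s->aio_context);
    qemu_luring_process_completions(s);     /* frees in_flight slots */

    if (!s->io_q.plugged && s->io_q.in_queue > 0) {
        ioq_submit(s);                      /* re-evaluates io_q.blocked */
    }
    aio_context_release(s->aio_context);
}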




Re: [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring

2019-06-11 Thread Fam Zheng
On Mon, 06/10 19:18, Aarushi Mehta wrote:
> Aborts when sqe fails to be set as sqes cannot be returned to the ring.
> 
> Signed-off-by: Aarushi Mehta 
> ---
>  MAINTAINERS |   7 +
>  block/Makefile.objs |   3 +
>  block/io_uring.c| 314 
>  include/block/aio.h |  16 +-
>  include/block/raw-aio.h |  12 ++
>  5 files changed, 351 insertions(+), 1 deletion(-)
>  create mode 100644 block/io_uring.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7be1225415..49f896796e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2516,6 +2516,13 @@ F: block/file-posix.c
>  F: block/file-win32.c
>  F: block/win32-aio.c
>  
> +Linux io_uring
> +M: Aarushi Mehta 
> +R: Stefan Hajnoczi 
> +L: qemu-bl...@nongnu.org
> +S: Maintained
> +F: block/io_uring.c
> +
>  qcow2
>  M: Kevin Wolf 
>  M: Max Reitz 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index ae11605c9f..8fde7a23a5 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -18,6 +18,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
>  block-obj-$(CONFIG_WIN32) += file-win32.o win32-aio.o
>  block-obj-$(CONFIG_POSIX) += file-posix.o
>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
> +block-obj-$(CONFIG_LINUX_IO_URING) += io_uring.o
>  block-obj-y += null.o mirror.o commit.o io.o create.o
>  block-obj-y += throttle-groups.o
>  block-obj-$(CONFIG_LINUX) += nvme.o
> @@ -61,5 +62,7 @@ block-obj-$(if $(CONFIG_LZFSE),m,n) += dmg-lzfse.o
>  dmg-lzfse.o-libs   := $(LZFSE_LIBS)
>  qcow.o-libs:= -lz
>  linux-aio.o-libs   := -laio
> +io_uring.o-cflags  := $(LINUX_IO_URING_CFLAGS)
> +io_uring.o-libs:= $(LINUX_IO_URING_LIBS)
>  parallels.o-cflags := $(LIBXML2_CFLAGS)
>  parallels.o-libs   := $(LIBXML2_LIBS)
> diff --git a/block/io_uring.c b/block/io_uring.c
> new file mode 100644
> index 00..f327c7ef96
> --- /dev/null
> +++ b/block/io_uring.c
> @@ -0,0 +1,314 @@
> +/*
> + * Linux io_uring support.
> + *
> + * Copyright (C) 2009 IBM, Corp.
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2019 Aarushi Mehta
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include <liburing.h>
> +#include "qemu-common.h"
> +#include "block/aio.h"
> +#include "qemu/queue.h"
> +#include "block/block.h"
> +#include "block/raw-aio.h"
> +#include "qemu/coroutine.h"
> +#include "qapi/error.h"
> +
> +#define MAX_EVENTS 128
> +
> +typedef struct LuringAIOCB {

I have to say it is a good name.

> +Coroutine *co;
> +struct io_uring_sqe sqeq;
> +ssize_t ret;
> +QEMUIOVector *qiov;
> +bool is_read;
> +QSIMPLEQ_ENTRY(LuringAIOCB) next;
> +} LuringAIOCB;
> +
> +typedef struct LuringQueue {
> +int plugged;
> +unsigned int in_queue;
> +unsigned int in_flight;
> +bool blocked;
> +QSIMPLEQ_HEAD(, LuringAIOCB) sq_overflow;
> +} LuringQueue;
> +
> +typedef struct LuringState {
> +AioContext *aio_context;
> +
> +struct io_uring ring;
> +
> +/* io queue for submit at batch.  Protected by AioContext lock. */
> +LuringQueue io_q;
> +
> +/* I/O completion processing.  Only runs in I/O thread.  */
> +QEMUBH *completion_bh;
> +} LuringState;
> +
> +/**
> + * ioq_submit:
> + * @s: AIO state
> + *
> + * Queues pending sqes and submits them
> + *
> + */
> +static int ioq_submit(LuringState *s);
> +
> +/**
> + * qemu_luring_process_completions:
> + * @s: AIO state
> + *
> + * Fetches completed I/O requests, consumes cqes and invokes their callbacks.
> + *
> + */
> +static void qemu_luring_process_completions(LuringState *s)
> +{
> +struct io_uring_cqe *cqes;
> +int ret;
> +
> +/*
> + * Request completion callbacks can run the nested event loop.
> + * Schedule ourselves so the nested event loop will "see" remaining
> + * completed requests and process them.  Without this, completion
> + * callbacks that wait for other requests using a nested event loop
> + * would hang forever.
> + */
> +qemu_bh_schedule(s->completion_bh);
> +
> +while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
> +if (!cqes) {
> +break;
> +}
> +LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> +ret = cqes->res;

Declarations should be at the beginning of the code block.

> +
> +if (ret == luringcb->qiov->size) {
> +ret = 0;
> +} else if (ret >= 0) {
> +/* Short Read/Write */
> +if (luringcb->is_read) {
> +/* Read, pad with zeroes */
> +qemu_iovec_memset(luringcb->qiov, ret, 0,
> +luringcb->qiov->size - ret);

Should you check that (ret < luringcb->qiov->size), since ret comes from an
external source?

Either way, ret should be assigned 0, I think.
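
A sketch of that hardening (the shape is assumed, not the final patch):

if (ret >= 0 && ret < luringcb->qiov->size) {
    if (luringcb->is_read) {
        /* Short read: pad the unread tail of the buffer with zeroes. */
        qemu_iovec_memset(luringcb->qiov, ret, 0,
                          luringcb->qiov->size - ret);
        ret = 0;            /* report success after padding */
    } else {
        ret = -ENOSPC;      /* short write */
    }
}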

> +} else {
> +ret = -ENOSPC;;

s/;;/;/

> +}
> +}
> +luringcb->ret = ret;
> +
> +

[Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring

2019-06-10 Thread Aarushi Mehta
Aborts when sqe fails to be set as sqes cannot be returned to the ring.

Signed-off-by: Aarushi Mehta 
---
 MAINTAINERS |   7 +
 block/Makefile.objs |   3 +
 block/io_uring.c| 314 
 include/block/aio.h |  16 +-
 include/block/raw-aio.h |  12 ++
 5 files changed, 351 insertions(+), 1 deletion(-)
 create mode 100644 block/io_uring.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7be1225415..49f896796e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2516,6 +2516,13 @@ F: block/file-posix.c
 F: block/file-win32.c
 F: block/win32-aio.c
 
+Linux io_uring
+M: Aarushi Mehta 
+R: Stefan Hajnoczi 
+L: qemu-bl...@nongnu.org
+S: Maintained
+F: block/io_uring.c
+
 qcow2
 M: Kevin Wolf 
 M: Max Reitz 
diff --git a/block/Makefile.objs b/block/Makefile.objs
index ae11605c9f..8fde7a23a5 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -18,6 +18,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
 block-obj-$(CONFIG_WIN32) += file-win32.o win32-aio.o
 block-obj-$(CONFIG_POSIX) += file-posix.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
+block-obj-$(CONFIG_LINUX_IO_URING) += io_uring.o
 block-obj-y += null.o mirror.o commit.o io.o create.o
 block-obj-y += throttle-groups.o
 block-obj-$(CONFIG_LINUX) += nvme.o
@@ -61,5 +62,7 @@ block-obj-$(if $(CONFIG_LZFSE),m,n) += dmg-lzfse.o
 dmg-lzfse.o-libs   := $(LZFSE_LIBS)
 qcow.o-libs:= -lz
 linux-aio.o-libs   := -laio
+io_uring.o-cflags  := $(LINUX_IO_URING_CFLAGS)
+io_uring.o-libs:= $(LINUX_IO_URING_LIBS)
 parallels.o-cflags := $(LIBXML2_CFLAGS)
 parallels.o-libs   := $(LIBXML2_LIBS)
diff --git a/block/io_uring.c b/block/io_uring.c
new file mode 100644
index 00..f327c7ef96
--- /dev/null
+++ b/block/io_uring.c
@@ -0,0 +1,314 @@
+/*
+ * Linux io_uring support.
+ *
+ * Copyright (C) 2009 IBM, Corp.
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Copyright (C) 2019 Aarushi Mehta
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include <liburing.h>
+#include "qemu-common.h"
+#include "block/aio.h"
+#include "qemu/queue.h"
+#include "block/block.h"
+#include "block/raw-aio.h"
+#include "qemu/coroutine.h"
+#include "qapi/error.h"
+
+#define MAX_EVENTS 128
+
+typedef struct LuringAIOCB {
+Coroutine *co;
+struct io_uring_sqe sqeq;
+ssize_t ret;
+QEMUIOVector *qiov;
+bool is_read;
+QSIMPLEQ_ENTRY(LuringAIOCB) next;
+} LuringAIOCB;
+
+typedef struct LuringQueue {
+int plugged;
+unsigned int in_queue;
+unsigned int in_flight;
+bool blocked;
+QSIMPLEQ_HEAD(, LuringAIOCB) sq_overflow;
+} LuringQueue;
+
+typedef struct LuringState {
+AioContext *aio_context;
+
+struct io_uring ring;
+
+/* io queue for submit at batch.  Protected by AioContext lock. */
+LuringQueue io_q;
+
+/* I/O completion processing.  Only runs in I/O thread.  */
+QEMUBH *completion_bh;
+} LuringState;
+
+/**
+ * ioq_submit:
+ * @s: AIO state
+ *
+ * Queues pending sqes and submits them
+ *
+ */
+static int ioq_submit(LuringState *s);
+
+/**
+ * qemu_luring_process_completions:
+ * @s: AIO state
+ *
+ * Fetches completed I/O requests, consumes cqes and invokes their callbacks.
+ *
+ */
+static void qemu_luring_process_completions(LuringState *s)
+{
+struct io_uring_cqe *cqes;
+int ret;
+
+/*
+ * Request completion callbacks can run the nested event loop.
+ * Schedule ourselves so the nested event loop will "see" remaining
+ * completed requests and process them.  Without this, completion
+ * callbacks that wait for other requests using a nested event loop
+ * would hang forever.
+ */
+qemu_bh_schedule(s->completion_bh);
+
+while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
+if (!cqes) {
+break;
+}
+LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
+ret = cqes->res;
+
+if (ret == luringcb->qiov->size) {
+ret = 0;
+} else if (ret >= 0) {
+/* Short Read/Write */
+if (luringcb->is_read) {
+/* Read, pad with zeroes */
+qemu_iovec_memset(luringcb->qiov, ret, 0,
+luringcb->qiov->size - ret);
+} else {
+ret = -ENOSPC;;
+}
+}
+luringcb->ret = ret;
+
+io_uring_cqe_seen(&s->ring, cqes);
+cqes = NULL;
+/* Change counters one-by-one because we can be nested. */
+s->io_q.in_flight--;
+
+/*
+ * If the coroutine is already entered it must be in ioq_submit()
+ * and will notice luringcb->ret has been filled in when it
+ * eventually runs later. Coroutines cannot be entered recursively
+ * so avoid doing that!
+ */
+if (!qemu_coroutine_entered(luringcb->co)) {
+aio_co_wake(luringcb->co);
+}
+}
+qemu_bh_cancel(s->completion_bh);
+}
+