qcow2-rs v0.1 and rublk-qcow2

2024-01-09 Thread Ming Lei
Hello,

qcow2-rs[1] is a pure Rust library for reading/writing qcow2 images. It is
based on rsd's[2] internal qcow2 implementation, but with lots of changes. So
far it:

- supports read/write on the data file, the backing file and compressed images

- provides a block-device-like interface; the minimum read/write unit is
aligned with the block size of the image, so direct IO can be supported (see
the sketch after this list)

- loads & stores the L2 table and refcount blocks in slices; the minimum
slice size is the block size, and the maximum is the cluster size

- is built over Rust async/await; low-level IO handling is abstracted by
async traits, so multiple low-level IO engines can be supported. So far it is
verified on tokio-uring[3], raw Linux sync IO syscalls, and io-uring[4] with
the smol[5] runtime
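
For the direct IO point above, here is a minimal sketch of opening a file for
O_DIRECT from Rust (assuming the libc crate; with O_DIRECT the offset, the
length and the buffer address must all be block-size aligned, which is why
the minimum IO unit matches the image block size):

    use std::fs::{File, OpenOptions};
    use std::os::unix::fs::OpenOptionsExt;

    // Open the image for direct IO; every read/write issued on this file
    // must then be aligned to the logical block size.
    fn open_direct(path: &str) -> std::io::Result<File> {
        OpenOptions::new()
            .read(true)
            .write(true)
            .custom_flags(libc::O_DIRECT)
            .open(path)
    }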

Thanks to the excellent async/.await support, all IO (including meta IO) is
actually handled in an async way, while the programming looks just like
writing sync code, so the library can be well designed & implemented, and it
is easy to add new features & do further optimization on the current code
base.

rublk-qcow2[6] wires together qcow2-rs, libublk-rs[7], smol (LocalExecutor)
and io-uring, and provides a block device interface for qcow2 images in about
500 LoC.

In the rublk-qcow2 async implementation, each io-uring future is mapped to a
(waker, result) pair stored in a HashMap keyed by the unique cqe.user_data.
This simple approach works; it may slow things down a bit, but performance is
still not bad. In a simple 'fio/t/io_uring $DEV' test, the IOPS of
rublk-qcow2 is 20% better than vdpa-virtio-blk with the same
settings (cache.direct=on,aio=io_uring) when reading from a fully allocated
image in my test VM.
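
The mechanism is roughly the following (a simplified, single-threaded sketch
with illustrative names, not the actual rublk-qcow2 code; RefCell suffices
because everything runs on one thread under smol's LocalExecutor): the
submitting future registers its waker under the user_data it put into the
SQE, and the CQE reaper fills in the result and wakes it.

    use std::cell::RefCell;
    use std::collections::HashMap;
    use std::task::Waker;

    // Keyed by the unique cqe.user_data carried through the io_uring SQE.
    #[derive(Default)]
    struct CompletionMap {
        entries: RefCell<HashMap<u64, (Option<Waker>, Option<i32>)>>,
    }

    impl CompletionMap {
        // Called from Future::poll(): take the result if the CQE already
        // arrived, otherwise (re)register the waker and stay pending.
        fn poll_result(&self, user_data: u64, waker: &Waker) -> Option<i32> {
            let mut map = self.entries.borrow_mut();
            if let Some(&(_, Some(res))) = map.get(&user_data) {
                map.remove(&user_data);
                return Some(res);
            }
            let entry = map.entry(user_data).or_insert((None, None));
            entry.0 = Some(waker.clone());
            None
        }

        // Called by the CQE reaper loop: store the result and wake the
        // future that submitted this SQE, if it is already parked.
        fn complete(&self, user_data: u64, result: i32) {
            let mut map = self.entries.borrow_mut();
            let entry = map.entry(user_data).or_insert((None, None));
            entry.1 = Some(result);
            if let Some(w) = entry.0.take() {
                w.wake();
            }
        }
    }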

The initial motivation was to support rublk-qcow2, but I couldn't find any
Rust qcow2 library with read/write support, simple interfaces and efficient
AIO support, so it finally evolved into a generic qcow2 library. Many qcow2
test cases have been added. One utility is also included in this project,
which can dump qcow2 meta, show meta-related statistics of the image,
check image meta integrity & host cluster leaks, format qcow2 images,
read & write, ...

Any comments are welcome!



[1] qcow2-rs
https://github.com/ublk-org/qcow2-rs

[2] rsd
https://gitlab.com/hreitz/rsd/-/tree/main/src/node/qcow2?ref_type=heads

[3] tokio-uring
https://docs.rs/tokio-uring

[4] io-uring
https://docs.rs/io-uring

[5] smol
https://docs.rs/smol

[6] rublk-qcow2
https://github.com/ublk-org/rublk

[7] libublk-rs
https://github.com/ublk-org/libublk-rs



Thanks, 
Ming



Re: ublk-qcow2: ublk-qcow2 is available

2022-10-12 Thread Ming Lei
On Wed, Oct 12, 2022 at 10:15:28AM -0400, Stefan Hajnoczi wrote:
> On Thu, 6 Oct 2022 at 06:14, Richard W.M. Jones  wrote:
> >
> > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > qemu-nbd doesn't use io_uring to handle the backend IO,
> >
> > Would this be fixed by your (not yet upstream) libblkio driver for
> > qemu?
> 
> I was wrong, qemu-nbd has syntax to use io_uring:
> 
>   $ qemu-nbd ... --image-opts driver=file,filename=test.img,aio=io_uring

Yeah, I saw the option. Previously, when I tried io_uring via:

qemu-nbd -c /dev/nbd11 -n --aio=io_uring $my_file

it complained 'qemu-nbd: Invalid aio mode 'io_uring'', even though
'qemu-nbd --help' does say that io_uring is supported.

Today I tried it again on Fedora 37; it looks like it starts working with
--aio=io_uring, but the IOPS is basically the same as with --aio=native, and
the IO trace shows that io_uring is used by qemu-nbd.


Thanks,
Ming



Re: ublk-qcow2: ublk-qcow2 is available

2022-10-07 Thread Ming Lei
On Fri, Oct 07, 2022 at 07:21:51PM +0800, Yongji Xie wrote:
> On Fri, Oct 7, 2022 at 6:51 PM Ming Lei  wrote:
> >
> > On Fri, Oct 07, 2022 at 06:04:29PM +0800, Yongji Xie wrote:
> > > On Thu, Oct 6, 2022 at 7:24 PM Ming Lei  wrote:
> > > >
> > > > On Wed, Oct 05, 2022 at 08:21:45AM -0400, Stefan Hajnoczi wrote:
> > > > > On Wed, 5 Oct 2022 at 00:19, Ming Lei  wrote:
> > > > > >
> > > > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > > > On Tue, 4 Oct 2022 at 05:44, Ming Lei  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > > > > ublk-qcow2 is available now.
> > > > > > > > >
> > > > > > > > > Cool, thanks for sharing!
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > So far it provides basic read/write function, and 
> > > > > > > > > > compression and snapshot
> > > > > > > > > > aren't supported yet. The target/backend implementation is 
> > > > > > > > > > completely
> > > > > > > > > > based on io_uring, and share the same io_uring with ublk IO 
> > > > > > > > > > command
> > > > > > > > > > handler, just like what ublk-loop does.
> > > > > > > > > >
> > > > > > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > > > > >
> > > > > > > > > > - building one complicated target from scratch helps 
> > > > > > > > > > libublksrv APIs/functions
> > > > > > > > > >   become mature/stable more quickly, since qcow2 is 
> > > > > > > > > > complicated and needs more
> > > > > > > > > >   requirement from libublksrv compared with other simple 
> > > > > > > > > > ones(loop, null)
> > > > > > > > > >
> > > > > > > > > > - there are several attempts of implementing qcow2 driver 
> > > > > > > > > > in kernel, such as
> > > > > > > > > >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel 
> > > > > > > > > > qcow2(ro)`` [4], so ublk-qcow2
> > > > > > > > > >   might useful be for covering requirement in this field
> > > > > > > > > >
> > > > > > > > > > - performance comparison with qemu-nbd, and it was my 1st 
> > > > > > > > > > thought to evaluate
> > > > > > > > > >   performance of ublk/io_uring backend by writing one 
> > > > > > > > > > ublk-qcow2 since ublksrv
> > > > > > > > > >   is started
> > > > > > > > > >
> > > > > > > > > > - help to abstract common building block or design pattern 
> > > > > > > > > > for writing new ublk
> > > > > > > > > >   target/backend
> > > > > > > > > >
> > > > > > > > > > So far it basically passes xfstest(XFS) test by using 
> > > > > > > > > > ublk-qcow2 block
> > > > > > > > > > device as TEST_DEV, and kernel building workload is 
> > > > > > > > > > verified too. Also
> > > > > > > > > > soft update approach is applied in meta flushing, and meta 
> > > > > > > > > > data
> > > > > > > > > > integrity is guaranteed, 'make test T=qcow2/040' covers 
> > > > > > > > > > this kind of
> > > > > > > > > > test, and only cluster leak is reported during this test.
> > > > > > > > > >
> > > > > > > > > > The performance data looks much better compared with 
> > > > > > > > > > qemu-nbd, see
> > > > > > > > > > details in commit log[1], README[5]

Re: ublk-qcow2: ublk-qcow2 is available

2022-10-07 Thread Ming Lei
On Thu, Oct 06, 2022 at 02:29:55PM -0400, Stefan Hajnoczi wrote:
> On Thu, Oct 06, 2022 at 11:09:48PM +0800, Ming Lei wrote:
> > On Thu, Oct 06, 2022 at 09:59:40AM -0400, Stefan Hajnoczi wrote:
> > > On Thu, Oct 06, 2022 at 06:26:15PM +0800, Ming Lei wrote:
> > > > On Wed, Oct 05, 2022 at 11:11:32AM -0400, Stefan Hajnoczi wrote:
> > > > > On Tue, Oct 04, 2022 at 01:57:50AM +0200, Denis V. Lunev wrote:
> > > > > > On 10/3/22 21:53, Stefan Hajnoczi wrote:
> > > > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > > ublk-qcow2 is available now.
> > > > > > > Cool, thanks for sharing!
> > > > > > yep
> > > > > > 
> > > > > > > > So far it provides basic read/write function, and compression 
> > > > > > > > and snapshot
> > > > > > > > aren't supported yet. The target/backend implementation is 
> > > > > > > > completely
> > > > > > > > based on io_uring, and share the same io_uring with ublk IO 
> > > > > > > > command
> > > > > > > > handler, just like what ublk-loop does.
> > > > > > > > 
> > > > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > > > 
> > > > > > > > - building one complicated target from scratch helps libublksrv 
> > > > > > > > APIs/functions
> > > > > > > >become mature/stable more quickly, since qcow2 is 
> > > > > > > > complicated and needs more
> > > > > > > >requirement from libublksrv compared with other simple 
> > > > > > > > ones(loop, null)
> > > > > > > > 
> > > > > > > > - there are several attempts of implementing qcow2 driver in 
> > > > > > > > kernel, such as
> > > > > > > >``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` 
> > > > > > > > [4], so ublk-qcow2
> > > > > > > >might useful be for covering requirement in this field
> > > > > > There is one important thing to keep in mind about all 
> > > > > > partly-userspace
> > > > > > implementations though:
> > > > > > * any single allocation happened in the context of the
> > > > > >    userspace daemon through try_to_free_pages() in
> > > > > >    kernel has a possibility to trigger the operation,
> > > > > >    which will require userspace daemon action, which
> > > > > >    is inside the kernel now.
> > > > > > * the probability of this is higher in the overcommitted
> > > > > >    environment
> > > > > > 
> > > > > > This was the main motivation of us in favor for the in-kernel
> > > > > > implementation.
> > > > > 
> > > > > CCed Josef Bacik because the Linux NBD driver has dealt with memory
> > > > > reclaim hangs in the past.
> > > > > 
> > > > > Josef: Any thoughts on userspace block drivers (whether NBD or ublk) 
> > > > > and
> > > > > how to avoid hangs in memory reclaim?
> > > > 
> > > > If I remember correctly, there isn't new report after the last 
> > > > NBD(TCMU) deadlock
> > > > in memory reclaim was addressed by 8d19f1c8e193 ("prctl: 
> > > > PR_{G,S}ET_IO_FLUSHER
> > > > to support controlling memory reclaim").
> > > 
> > > Denis: I'm trying to understand the problem you described. Is this
> > > correct:
> > > 
> > > Due to memory pressure, the kernel reclaims pages and submits a write to
> > > a ublk block device. The userspace process attempts to allocate memory
> > > in order to service the write request, but it gets stuck because there
> > > is no memory available. As a result reclaim gets stuck, the system is
> > > unable to free more memory and therefore it hangs?
> > 
> > The process should be killed in this situation if PR_SET_IO_FLUSHER
> > is applied since the page allocation is done in VM fault handler.
> 
> Thanks for mentioning PR_SET_IO_FLUSHER. There is more info in commit
> 8d19f1c8e1937baf74e1962aae9f90fa3aeab463 ("prctl: PR_{G,S}ET_IO_FLUSHER
> to support controlling memory reclaim").
> 
> It requires CAP_SYS_RESOURCE :/. This makes me wonder whether
> unprivileged ublk will ever be possible.

IMO, it shouldn't be a blocker; there are lots of choices for us:

- unprivileged ublk can simply not call it; if such an IO hang is triggered,
ublksrv is capable of figuring out the problem, then killing & recovering the
device.

- set PR_IO_FLUSHER for the current task in
ublk_ch_uring_cmd(UBLK_IO_FETCH_REQ); the userspace equivalent is sketched
below

- ...
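
For reference, opting in from userspace is a single prctl(2) call; a minimal
sketch using the libc crate (the constant value is taken from
<linux/prctl.h>, Linux 5.6+, and the call needs CAP_SYS_RESOURCE):

    // Mark the current thread as an IO flusher so memory reclaim treats
    // it like a block-layer thread. Requires CAP_SYS_RESOURCE.
    const PR_SET_IO_FLUSHER: libc::c_int = 57; // <linux/prctl.h>, Linux 5.6+

    fn set_io_flusher() -> std::io::Result<()> {
        let r = unsafe {
            libc::prctl(PR_SET_IO_FLUSHER, 1 as libc::c_ulong,
                        0 as libc::c_ulong, 0 as libc::c_ulong,
                        0 as libc::c_ulong)
        };
        if r != 0 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(())
    }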

> 
> I think this addresses Denis' concern about hangs, but it doesn't solve
> them because I/O will fail. The real solution is probably what you
> mentioned...

So far I haven't seen a real report yet, and it may never be an issue if a
proper swap device/file is configured.


Thanks, 
Ming



Re: ublk-qcow2: ublk-qcow2 is available

2022-10-07 Thread Ming Lei
On Fri, Oct 07, 2022 at 06:04:29PM +0800, Yongji Xie wrote:
> On Thu, Oct 6, 2022 at 7:24 PM Ming Lei  wrote:
> >
> > On Wed, Oct 05, 2022 at 08:21:45AM -0400, Stefan Hajnoczi wrote:
> > > On Wed, 5 Oct 2022 at 00:19, Ming Lei  wrote:
> > > >
> > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > On Tue, 4 Oct 2022 at 05:44, Ming Lei  wrote:
> > > > > >
> > > > > > On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > > ublk-qcow2 is available now.
> > > > > > >
> > > > > > > Cool, thanks for sharing!
> > > > > > >
> > > > > > > >
> > > > > > > > So far it provides basic read/write function, and compression 
> > > > > > > > and snapshot
> > > > > > > > aren't supported yet. The target/backend implementation is 
> > > > > > > > completely
> > > > > > > > based on io_uring, and share the same io_uring with ublk IO 
> > > > > > > > command
> > > > > > > > handler, just like what ublk-loop does.
> > > > > > > >
> > > > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > > >
> > > > > > > > - building one complicated target from scratch helps libublksrv 
> > > > > > > > APIs/functions
> > > > > > > >   become mature/stable more quickly, since qcow2 is complicated 
> > > > > > > > and needs more
> > > > > > > >   requirement from libublksrv compared with other simple 
> > > > > > > > ones(loop, null)
> > > > > > > >
> > > > > > > > - there are several attempts of implementing qcow2 driver in 
> > > > > > > > kernel, such as
> > > > > > > >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` 
> > > > > > > > [4], so ublk-qcow2
> > > > > > > >   might useful be for covering requirement in this field
> > > > > > > >
> > > > > > > > - performance comparison with qemu-nbd, and it was my 1st 
> > > > > > > > thought to evaluate
> > > > > > > >   performance of ublk/io_uring backend by writing one 
> > > > > > > > ublk-qcow2 since ublksrv
> > > > > > > >   is started
> > > > > > > >
> > > > > > > > - help to abstract common building block or design pattern for 
> > > > > > > > writing new ublk
> > > > > > > >   target/backend
> > > > > > > >
> > > > > > > > So far it basically passes xfstest(XFS) test by using 
> > > > > > > > ublk-qcow2 block
> > > > > > > > device as TEST_DEV, and kernel building workload is verified 
> > > > > > > > too. Also
> > > > > > > > soft update approach is applied in meta flushing, and meta data
> > > > > > > > integrity is guaranteed, 'make test T=qcow2/040' covers this 
> > > > > > > > kind of
> > > > > > > > test, and only cluster leak is reported during this test.
> > > > > > > >
> > > > > > > > The performance data looks much better compared with qemu-nbd, 
> > > > > > > > see
> > > > > > > > details in commit log[1], README[5] and STATUS[6]. And the test 
> > > > > > > > covers both
> > > > > > > > empty image and pre-allocated image, for example of 
> > > > > > > > pre-allocated qcow2
> > > > > > > > image(8GB):
> > > > > > > >
> > > > > > > > - qemu-nbd (make test T=qcow2/002)
> > > > > > >
> > > > > > > Single queue?
> > > > > >
> > > > > > Yeah.
> > > > > >
> > > > > > >
> > > > > > > > randwrite(4k): jobs 1, iops 24605
> > > > > > > > randread(4k): jobs 1, iops 30938
> > > > > > > > randrw(4

Re: ublk-qcow2: ublk-qcow2 is available

2022-10-06 Thread Ming Lei
On Thu, Oct 06, 2022 at 09:59:40AM -0400, Stefan Hajnoczi wrote:
> On Thu, Oct 06, 2022 at 06:26:15PM +0800, Ming Lei wrote:
> > On Wed, Oct 05, 2022 at 11:11:32AM -0400, Stefan Hajnoczi wrote:
> > > On Tue, Oct 04, 2022 at 01:57:50AM +0200, Denis V. Lunev wrote:
> > > > On 10/3/22 21:53, Stefan Hajnoczi wrote:
> > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > ublk-qcow2 is available now.
> > > > > Cool, thanks for sharing!
> > > > yep
> > > > 
> > > > > > So far it provides basic read/write function, and compression and 
> > > > > > snapshot
> > > > > > aren't supported yet. The target/backend implementation is 
> > > > > > completely
> > > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > > handler, just like what ublk-loop does.
> > > > > > 
> > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > 
> > > > > > - building one complicated target from scratch helps libublksrv 
> > > > > > APIs/functions
> > > > > >become mature/stable more quickly, since qcow2 is complicated 
> > > > > > and needs more
> > > > > >requirement from libublksrv compared with other simple 
> > > > > > ones(loop, null)
> > > > > > 
> > > > > > - there are several attempts of implementing qcow2 driver in 
> > > > > > kernel, such as
> > > > > >``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], 
> > > > > > so ublk-qcow2
> > > > > >might useful be for covering requirement in this field
> > > > There is one important thing to keep in mind about all partly-userspace
> > > > implementations though:
> > > > * any single allocation happened in the context of the
> > > >    userspace daemon through try_to_free_pages() in
> > > >    kernel has a possibility to trigger the operation,
> > > >    which will require userspace daemon action, which
> > > >    is inside the kernel now.
> > > > * the probability of this is higher in the overcommitted
> > > >    environment
> > > > 
> > > > This was the main motivation of us in favor for the in-kernel
> > > > implementation.
> > > 
> > > CCed Josef Bacik because the Linux NBD driver has dealt with memory
> > > reclaim hangs in the past.
> > > 
> > > Josef: Any thoughts on userspace block drivers (whether NBD or ublk) and
> > > how to avoid hangs in memory reclaim?
> > 
> > If I remember correctly, there isn't new report after the last NBD(TCMU) 
> > deadlock
> > in memory reclaim was addressed by 8d19f1c8e193 ("prctl: 
> > PR_{G,S}ET_IO_FLUSHER
> > to support controlling memory reclaim").
> 
> Denis: I'm trying to understand the problem you described. Is this
> correct:
> 
> Due to memory pressure, the kernel reclaims pages and submits a write to
> a ublk block device. The userspace process attempts to allocate memory
> in order to service the write request, but it gets stuck because there
> is no memory available. As a result reclaim gets stuck, the system is
> unable to free more memory and therefore it hangs?

The process should be killed in this situation if PR_SET_IO_FLUSHER is
applied, since the page allocation is done in the VM fault handler.

Firstly, in theory the userspace part should provide a forward-progress
guarantee in the IO handling code path, such as reserving/mlocking pages for
this situation. However, this issue isn't unique to nbd or ublk; all
userspace block devices have this potential risk, and vduse
is no exception, IMO.
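
A minimal sketch of that kind of reservation, assuming a Rust daemon and the
libc crate (mlockall pins current and future pages, subject to
RLIMIT_MEMLOCK / CAP_IPC_LOCK, so the IO path never page-faults under
reclaim):

    // Lock all current and future pages of the daemon so the IO handling
    // path cannot page-fault while the kernel is reclaiming memory.
    fn lock_all_memory() -> std::io::Result<()> {
        let r = unsafe { libc::mlockall(libc::MCL_CURRENT | libc::MCL_FUTURE) };
        if r != 0 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(())
    }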

Secondly, with proper/enough swap space, I think this kind of issue is hard
to trigger.

Finally, the ublk driver has added user recovery commands for recovering from
a crash, and ublksrv will support them soon.

Thanks,
Ming



Re: ublk-qcow2: ublk-qcow2 is available

2022-10-06 Thread Ming Lei
On Wed, Oct 05, 2022 at 08:21:45AM -0400, Stefan Hajnoczi wrote:
> On Wed, 5 Oct 2022 at 00:19, Ming Lei  wrote:
> >
> > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > On Tue, 4 Oct 2022 at 05:44, Ming Lei  wrote:
> > > >
> > > > On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > ublk-qcow2 is available now.
> > > > >
> > > > > Cool, thanks for sharing!
> > > > >
> > > > > >
> > > > > > So far it provides basic read/write function, and compression and 
> > > > > > snapshot
> > > > > > aren't supported yet. The target/backend implementation is 
> > > > > > completely
> > > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > > handler, just like what ublk-loop does.
> > > > > >
> > > > > > Follows the main motivations of ublk-qcow2:
> > > > > >
> > > > > > - building one complicated target from scratch helps libublksrv 
> > > > > > APIs/functions
> > > > > >   become mature/stable more quickly, since qcow2 is complicated and 
> > > > > > needs more
> > > > > >   requirement from libublksrv compared with other simple ones(loop, 
> > > > > > null)
> > > > > >
> > > > > > - there are several attempts of implementing qcow2 driver in 
> > > > > > kernel, such as
> > > > > >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], 
> > > > > > so ublk-qcow2
> > > > > >   might useful be for covering requirement in this field
> > > > > >
> > > > > > - performance comparison with qemu-nbd, and it was my 1st thought 
> > > > > > to evaluate
> > > > > >   performance of ublk/io_uring backend by writing one ublk-qcow2 
> > > > > > since ublksrv
> > > > > >   is started
> > > > > >
> > > > > > - help to abstract common building block or design pattern for 
> > > > > > writing new ublk
> > > > > >   target/backend
> > > > > >
> > > > > > So far it basically passes xfstest(XFS) test by using ublk-qcow2 
> > > > > > block
> > > > > > device as TEST_DEV, and kernel building workload is verified too. 
> > > > > > Also
> > > > > > soft update approach is applied in meta flushing, and meta data
> > > > > > integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > > > test, and only cluster leak is reported during this test.
> > > > > >
> > > > > > The performance data looks much better compared with qemu-nbd, see
> > > > > > details in commit log[1], README[5] and STATUS[6]. And the test 
> > > > > > covers both
> > > > > > empty image and pre-allocated image, for example of pre-allocated 
> > > > > > qcow2
> > > > > > image(8GB):
> > > > > >
> > > > > > - qemu-nbd (make test T=qcow2/002)
> > > > >
> > > > > Single queue?
> > > >
> > > > Yeah.
> > > >
> > > > >
> > > > > > randwrite(4k): jobs 1, iops 24605
> > > > > > randread(4k): jobs 1, iops 30938
> > > > > > randrw(4k): jobs 1, iops read 13981 write 14001
> > > > > > rw(512k): jobs 1, iops read 724 write 728
> > > > >
> > > > > Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > > command-line should be similar to this:
> > > > >
> > > > >   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > >
> > > > Not found virtio_vdpa module even though I enabled all the following
> > > > options:
> > > >
> > > > --- vDPA drivers
> > > >  vDPA device simulator core
> > > >vDPA simulator for networking device
> > > >vDPA simulator for block device
> > > >  VDUSE (vDPA Device in Userspace) support
> > > >  Intel IFC VF vDPA driver
>

Re: ublk-qcow2: ublk-qcow2 is available

2022-10-06 Thread Ming Lei
On Wed, Oct 05, 2022 at 11:11:32AM -0400, Stefan Hajnoczi wrote:
> On Tue, Oct 04, 2022 at 01:57:50AM +0200, Denis V. Lunev wrote:
> > On 10/3/22 21:53, Stefan Hajnoczi wrote:
> > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > ublk-qcow2 is available now.
> > > Cool, thanks for sharing!
> > yep
> > 
> > > > So far it provides basic read/write function, and compression and 
> > > > snapshot
> > > > aren't supported yet. The target/backend implementation is completely
> > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > handler, just like what ublk-loop does.
> > > > 
> > > > Follows the main motivations of ublk-qcow2:
> > > > 
> > > > - building one complicated target from scratch helps libublksrv 
> > > > APIs/functions
> > > >become mature/stable more quickly, since qcow2 is complicated and 
> > > > needs more
> > > >requirement from libublksrv compared with other simple ones(loop, 
> > > > null)
> > > > 
> > > > - there are several attempts of implementing qcow2 driver in kernel, 
> > > > such as
> > > >``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so 
> > > > ublk-qcow2
> > > >might useful be for covering requirement in this field
> > There is one important thing to keep in mind about all partly-userspace
> > implementations though:
> > * any single allocation happened in the context of the
> >    userspace daemon through try_to_free_pages() in
> >    kernel has a possibility to trigger the operation,
> >    which will require userspace daemon action, which
> >    is inside the kernel now.
> > * the probability of this is higher in the overcommitted
> >    environment
> > 
> > This was the main motivation of us in favor for the in-kernel
> > implementation.
> 
> CCed Josef Bacik because the Linux NBD driver has dealt with memory
> reclaim hangs in the past.
> 
> Josef: Any thoughts on userspace block drivers (whether NBD or ublk) and
> how to avoid hangs in memory reclaim?

If I remember correctly, there hasn't been a new report since the last
NBD (TCMU) deadlock in memory reclaim was addressed by 8d19f1c8e193 ("prctl:
PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim").


Thanks, 
Ming



Re: ublk-qcow2: ublk-qcow2 is available

2022-10-04 Thread Ming Lei
On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> On Tue, 4 Oct 2022 at 05:44, Ming Lei  wrote:
> >
> > On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > ublk-qcow2 is available now.
> > >
> > > Cool, thanks for sharing!
> > >
> > > >
> > > > So far it provides basic read/write function, and compression and 
> > > > snapshot
> > > > aren't supported yet. The target/backend implementation is completely
> > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > handler, just like what ublk-loop does.
> > > >
> > > > Follows the main motivations of ublk-qcow2:
> > > >
> > > > - building one complicated target from scratch helps libublksrv 
> > > > APIs/functions
> > > >   become mature/stable more quickly, since qcow2 is complicated and 
> > > > needs more
> > > >   requirement from libublksrv compared with other simple ones(loop, 
> > > > null)
> > > >
> > > > - there are several attempts of implementing qcow2 driver in kernel, 
> > > > such as
> > > >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so 
> > > > ublk-qcow2
> > > >   might useful be for covering requirement in this field
> > > >
> > > > - performance comparison with qemu-nbd, and it was my 1st thought to 
> > > > evaluate
> > > >   performance of ublk/io_uring backend by writing one ublk-qcow2 since 
> > > > ublksrv
> > > >   is started
> > > >
> > > > - help to abstract common building block or design pattern for writing 
> > > > new ublk
> > > >   target/backend
> > > >
> > > > So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > device as TEST_DEV, and kernel building workload is verified too. Also
> > > > soft update approach is applied in meta flushing, and meta data
> > > > integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > test, and only cluster leak is reported during this test.
> > > >
> > > > The performance data looks much better compared with qemu-nbd, see
> > > > details in commit log[1], README[5] and STATUS[6]. And the test covers 
> > > > both
> > > > empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > image(8GB):
> > > >
> > > > - qemu-nbd (make test T=qcow2/002)
> > >
> > > Single queue?
> >
> > Yeah.
> >
> > >
> > > > randwrite(4k): jobs 1, iops 24605
> > > > randread(4k): jobs 1, iops 30938
> > > > randrw(4k): jobs 1, iops read 13981 write 14001
> > > > rw(512k): jobs 1, iops read 724 write 728
> > >
> > > Please try qemu-storage-daemon's VDUSE export type as well. The
> > > command-line should be similar to this:
> > >
> > >   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> >
> > Not found virtio_vdpa module even though I enabled all the following
> > options:
> >
> > --- vDPA drivers
> >  vDPA device simulator core
> >vDPA simulator for networking device
> >vDPA simulator for block device
> >  VDUSE (vDPA Device in Userspace) support
> >  Intel IFC VF vDPA driver
> >  Virtio PCI bridge vDPA driver
> >  vDPA driver for Alibaba ENI
> >
> > BTW, my test environment is VM and the shared data is done in VM too, and
> > can virtio_vdpa be used inside VM?
> 
> I hope Xie Yongji can help explain how to benchmark VDUSE.
> 
> virtio_vdpa is available inside guests too. Please check that
> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> drivers" menu.
> 
> >
> > >   # modprobe vduse
> > >   # qemu-storage-daemon \
> > >   --blockdev 
> > > file,filename=test.qcow2,cache.direct=of|off,aio=native,node-name=file \
> > >   --blockdev qcow2,file=file,node-name=qcow2 \
> > >   --object iothread,id=iothread0 \
> > >   --export 
> > > vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > >   # vdpa dev add name vduse0 mgmtdev vduse
>

Re: ublk-qcow2: ublk-qcow2 is available

2022-10-04 Thread Ming Lei
On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > ublk-qcow2 is available now.
> 
> Cool, thanks for sharing!
> 
> > 
> > So far it provides basic read/write function, and compression and snapshot
> > aren't supported yet. The target/backend implementation is completely
> > based on io_uring, and share the same io_uring with ublk IO command
> > handler, just like what ublk-loop does.
> > 
> > Follows the main motivations of ublk-qcow2:
> > 
> > - building one complicated target from scratch helps libublksrv 
> > APIs/functions
> >   become mature/stable more quickly, since qcow2 is complicated and needs 
> > more
> >   requirement from libublksrv compared with other simple ones(loop, null)
> > 
> > - there are several attempts of implementing qcow2 driver in kernel, such as
> >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so 
> > ublk-qcow2
> >   might useful be for covering requirement in this field
> > 
> > - performance comparison with qemu-nbd, and it was my 1st thought to 
> > evaluate
> >   performance of ublk/io_uring backend by writing one ublk-qcow2 since 
> > ublksrv
> >   is started
> > 
> > - help to abstract common building block or design pattern for writing new 
> > ublk
> >   target/backend
> > 
> > So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > device as TEST_DEV, and kernel building workload is verified too. Also
> > soft update approach is applied in meta flushing, and meta data
> > integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > test, and only cluster leak is reported during this test.
> > 
> > The performance data looks much better compared with qemu-nbd, see
> > details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > empty image and pre-allocated image, for example of pre-allocated qcow2
> > image(8GB):
> > 
> > - qemu-nbd (make test T=qcow2/002)
> 
> Single queue?

Yeah.

> 
> > randwrite(4k): jobs 1, iops 24605
> > randread(4k): jobs 1, iops 30938
> > randrw(4k): jobs 1, iops read 13981 write 14001
> > rw(512k): jobs 1, iops read 724 write 728
> 
> Please try qemu-storage-daemon's VDUSE export type as well. The
> command-line should be similar to this:
> 
>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel

The virtio_vdpa module is not found even though I enabled all the following
options:

--- vDPA drivers 
 vDPA device simulator core   
   vDPA simulator for networking device   
   vDPA simulator for block device
 VDUSE (vDPA Device in Userspace) support 
 Intel IFC VF vDPA driver 
 Virtio PCI bridge vDPA driver
 vDPA driver for Alibaba ENI

BTW, my test environment is a VM and the shared data was collected in the VM
too; can virtio_vdpa be used inside a VM?

>   # modprobe vduse
>   # qemu-storage-daemon \
>   --blockdev 
> file,filename=test.qcow2,cache.direct=of|off,aio=native,node-name=file \
>   --blockdev qcow2,file=file,node-name=qcow2 \
>   --object iothread,id=iothread0 \
>   --export 
> vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
>   # vdpa dev add name vduse0 mgmtdev vduse
> 
> A virtio-blk device should appear and xfstests can be run on it
> (typically /dev/vda unless you already have other virtio-blk devices).
> 
> Afterwards you can destroy the device using:
> 
>   # vdpa dev del vduse0
> 
> > 
> > - ublk-qcow2 (make test T=qcow2/022)
> 
> There are a lot of other factors not directly related to NBD vs ublk. In
> order to get an apples-to-apples comparison with qemu-* a ublk export
> type is needed in qemu-storage-daemon. That way only the difference is
> the ublk interface and the rest of the code path is identical, making it
> possible to compare NBD, VDUSE, ublk, etc more precisely.

Maybe not true.

ublk-qcow2 uses io_uring to handle all backend IO (including meta IO)
completely, and so far a single io_uring/pthread handles all qcow2 IOs and IO
commands.


thanks, 
Ming



Re: [RFC v4 5/9] qemu-iotests: test new zone operations.

2022-07-27 Thread Ming Lei
On Wed, Jul 27, 2022 at 10:34:56AM -0400, Stefan Hajnoczi wrote:
> On Mon, 11 Jul 2022 at 22:21, Sam Li  wrote:
> >
> > We have added new block layer APIs of zoned block devices. Test it with:
> > (1) Create a null_blk device, run each zone operation on it and see
> > whether reporting right zone information.
> >
> > Signed-off-by: Sam Li 
> > ---
> >  tests/qemu-iotests/tests/zoned.sh | 69 +++
> >  1 file changed, 69 insertions(+)
> >  create mode 100755 tests/qemu-iotests/tests/zoned.sh
> >
> > diff --git a/tests/qemu-iotests/tests/zoned.sh 
> > b/tests/qemu-iotests/tests/zoned.sh
> > new file mode 100755
> > index 00..e14a3a420e
> > --- /dev/null
> > +++ b/tests/qemu-iotests/tests/zoned.sh
> > @@ -0,0 +1,69 @@
> > +#!/usr/bin/env bash
> > +#
> > +# Test zone management operations.
> > +#
> > +
> > +seq="$(basename $0)"
> > +echo "QA output created by $seq"
> > +status=1 # failure is the default!
> > +
> > +_cleanup()
> > +{
> > +  _cleanup_test_img
> > +  sudo rmmod null_blk
> > +}
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +# get standard environment, filters and checks
> > +. ./common.rc
> > +. ./common.filter
> > +. ./common.qemu
> > +
> > +# This test only runs on Linux hosts with raw image files.
> > +_supported_fmt raw
> > +_supported_proto file
> > +_supported_os Linux
> > +
> > +QEMU_IO="build/qemu-io"
> > +IMG="--image-opts driver=zoned_host_device,filename=/dev/nullb0"
> > +QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
> > +
> > +echo "Testing a null_blk device"
> > +echo "Simple cases: if the operations work"
> > +sudo modprobe null_blk nr_devices=1 zoned=1
> 
> Jens & Ming Lei:
> 
> null_blk is not ideal for automated test cases because it requires
> root and is global. If two tests are run at the same time they will
> interfere with each other. There is no way to have independent
> instances of the null_blk kernel module and the /dev/nullb0 device on
> a single Linux system. I think this test case can be merged for now
> but if there's time I suggest investigating alternatives.
> 
> Maybe the new Linux ublk_drv driver is a better choice if it supports
> unprivileged operation with multiple independent block devices (plus

This can be a big topic, and IMO the IO path needs to be reviewed wrt.
security risk. Also, if one block device is stuck, it shouldn't affect
others; so far that can be a problem at least in sync/writeback: if one
device hangs, writeback on all block devices may not move on.

> zoned storage!). It would be necessary to write a userspace null block
> device that supports zoned storage if that doesn't exist already. I
> have CCed Ming Lei and Jens Axboe for thoughts.

IMO, it shouldn't be hard to simulate zoned storage from userspace; the
patches[1] for setting device parameters have just been sent out, and then
zoned parameters could easily be sent to the driver with the added commands.

ublk can even be used to implement a zoned target: for example, ublk is
exposed as one normal disk, but the backing disk could be a real zoned/ZNS
device.

[1] 
https://lore.kernel.org/linux-block/20220727141628.985429-1-ming@redhat.com/T/#mb5518cf418b63fb6ca649f0df199137e50907a29

thanks,
Ming




Re: [Qemu-devel] Slow boot in QEMU with virtio-scsi disks

2018-08-11 Thread Ming Lei
On Sat, Aug 11, 2018 at 5:47 PM, Oleksandr Natalenko
 wrote:
> Hi.
>
> I'd like to resurrect previous discussion [1] regarding slow kernel boot
> inside QEMU with virtio-scsi disks attached and blk_mq enabled.
>
> Symptom:
>
> [2.830857] ata1: SATA max UDMA/133 abar m4096@0x98002000 port 0x98002100
> irq 36
> [2.834559] ata2: SATA max UDMA/133 abar m4096@0x98002000 port 0x98002180
> irq 36
> [2.837746] ata3: SATA max UDMA/133 abar m4096@0x98002000 port 0x98002200
> irq 36
> [2.841861] ata4: SATA max UDMA/133 abar m4096@0x98002000 port 0x98002280
> irq 36
> [2.847899] ata5: SATA max UDMA/133 abar m4096@0x98002000 port 0x98002300
> irq 36
> [2.853229] ata6: SATA max UDMA/133 abar m4096@0x98002000 port 0x98002380
> irq 36
> [3.172159] ata1: SATA link down (SStatus 0 SControl 300)
> [3.183552] ata5: SATA link down (SStatus 0 SControl 300)
> [3.189925] ata3: SATA link down (SStatus 0 SControl 300)
> [3.196156] ata6: SATA link down (SStatus 0 SControl 300)
> [3.201136] ata2: SATA link down (SStatus 0 SControl 300)
> [3.208559] ata4: SATA link down (SStatus 0 SControl 300)
> [   16.480972] sd 0:0:1:0: Power-on or device reset occurred
> [   16.481591] sd 0:0:0:0: [sda] 16777216 512-byte logical blocks: (8.59
> GB/8.00 GiB)
> [   16.481671] sd 0:0:0:0: [sda] Write Protect is off
> [   16.481815] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled,
> doesn't support DPO or FUA
> [   16.491325]  sda: sda1 sda2
> [   16.517532] sd 0:0:1:0: [sdb] 16777216 512-byte logical blocks: (8.59
> GB/8.00 GiB)
> [   16.525131] sr 0:0:2:0: Power-on or device reset occurred
> [   16.525974] sd 0:0:1:0: [sdb] Write Protect is off
> [   16.530946] sr 0:0:2:0: [sr0] scsi3-mmc drive: 16x/50x cd/rw xa/form2
> cdda tray
> [   16.543592] cdrom: Uniform CD-ROM driver Revision: 3.20
> [   16.549815] sd 0:0:1:0: [sdb] Write cache: disabled, read cache: enabled,
> doesn't support DPO or FUA
> [   16.549833] sd 0:0:0:0: [sda] Attached SCSI disk
> [   16.572055]  sdb: sdb1 sdb2
> [   16.580463] sd 0:0:1:0: [sdb] Attached SCSI disk
>
> (note the hang that lasts for 13 seconds)
>
> The disks are attached to the VM in the following manner:
>
> -device virtio-scsi,id=scsi -device scsi-hd,drive=hd1 -drive
> if=none,media=disk,id=hd1,file=sda.img,format=raw
>
> What I've tested so far:
>
> * 4.14.62  + virtio-scsi +blk_mq == slow boot
> * 4.14.62  + virtio-scsi + no blk_mq == fast boot
> * 4.17.13  + virtio-scsi +blk_mq == slow boot
> * 4.18-rc8 + virtio-scsi +blk_mq == slow boot
>
> QEMU is of v2.12.1, runs with "-machine q35,accel=kvm -cpu host". Also, if
> virtio-scsi disks are replaced with SATA disks, the hang does not occur
> (although, QEMU has other issues with SATA, but that's another story [3]).
>
> Apparently, the commit that was mentioned in [2],
> b5b6e8c8d3b4cbeb447a0f10c7d5de3caa573299, forces blk_mq for virtio_scsi, so
> it cannot be disabled for new kernels.
>
> Any hint on how to avoid this hang while still having virtio-scsi disks and
> blk_mq enabled please?

Please test for-4.19/block:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-4.19/block

This slow boot issue should have been fixed by the following commits:

1311326cf475 blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()
97889f9ac24f blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
5815839b3ca1 blk-mq: introduce new lock for protecting hctx->dispatch_wait
2278d69f030f blk-mq: don't pass **hctx to blk_mq_mark_tag_wait()
8ab6bb9ee8d0 blk-mq: cleanup blk_mq_get_driver_tag()


Thanks,
Ming Lei



Re: [Qemu-devel] [PATCH] block: fix big write

2014-12-10 Thread Ming Lei
On Wed, Dec 10, 2014 at 11:02 PM, Paolo Bonzini  wrote:
>
>
> On 10/12/2014 15:35, Ming Lei wrote:
>>>> It is _not_ never happen at all, and easy to be triggered when using
>>>> mkfs.
>>>
>>> mkfs is not something to optimize for, it's just something that should
>>> work.  (Also, some hardware may time out if you do write same with too
>>> high a block count).
>>
>> I don't think it is related with the hardware time out issue since your
>> patch still splits the block count into 2G - 1, and both are same wrt.
>> block count.
>
> If the guest sends a 1TB WRITE SAME, it's more likely to time out.

The guest implementation should have an upper limit to bound the
corresponding timeout, like SD_MAX_WS16_BLOCKS in Linux.

>
>>> Both Linux and Windows will always use UNMAP on QEMU, except for the
>>> small time period where Linux used WRITE SAME and this bug was
>>> discovered.  And all versions of Linux that used WRITE SAME honored the
>>> max_ws_blocks field.
>>
>> Not sure how you get the conclusion.
>
> Because the WRITE SAME patch was submitted ~1 month ago.
>
> Windows uses UNMAP because Microsoft says so.
>
>> Secondly SBC-3 draft doesn't describe the priority explicitly among
>> UNMAP, WRITE SAME 10, and WRITE SAME 16, so it is driver's
>> freedom to take anyone in theory.
>
> Sure, but WRITE SAME with UNMAP doesn't make sense if you do not have
> LBPRZ, which QEMU does not set.  In fact the only sensible things to do are:
>
> - use WRITE SAME if LBPRZ
>
> - use UNMAP if !LBPRZ
>
> So any sensible guest will use UNMAP.
>
>> Finally blkdev_issue_zeroout() can send WRITE SAME(10/16) directly
>> and it can be from user space, fs, and block drivers.
>
> That is WRITE SAME without UNMAP, it is not used by mkfs, and Linux has
> always honored max_write_same_blocks for it (defaulting to a 65535 block
> limit for older devices that did not report a limit).

From QEMU's view, blk_aio_write_zeroes() still needs to handle the case
without UNMAP, and the default 65535 is just Linux's current implementation;
a recent patch even tries to increase the default setting. Also, the default
limit might be bigger on other OSes.

>
> So what *concrete* case would be fixed by adding extra little-used code
> in QEMU to do the split?
>
> Paolo
>
>> Thanks,
>> Ming Lei
>>
>>



Re: [Qemu-devel] [PATCH] block: fix big write

2014-12-10 Thread Ming Lei
On Wed, Dec 10, 2014 at 8:55 PM, Paolo Bonzini  wrote:
>
>
> On 10/12/2014 13:23, Ming Lei wrote:
>>> >
>>> > Again, we're talking of 2GB and this is something that should never
>>> > happen in practice.
>> Again, the 2GB limit can be avoided if I/O is splitted, isn't it?
>
> Sure.  The guest kernel is doing the split.

QEMU SCSI can do it transparently.

>
>> It is _not_ never happen at all, and easy to be triggered when using
>> mkfs.
>
> mkfs is not something to optimize for, it's just something that should
> work.  (Also, some hardware may time out if you do write same with too
> high a block count).

I don't think it is related to the hardware timeout issue, since your patch
still splits the block count into 2G - 1, and both are the same wrt. block
count.

>
> Both Linux and Windows will always use UNMAP on QEMU, except for the
> small time period where Linux used WRITE SAME and this bug was
> discovered.  And all versions of Linux that used WRITE SAME honored the
> max_ws_blocks field.

Not sure how you reached that conclusion.

Firstly, some old host SCSI drivers can't respect the max_ws_blocks block
limit, since it was only introduced last year.

Secondly, the SBC-3 draft doesn't explicitly describe the priority among
UNMAP, WRITE SAME 10, and WRITE SAME 16, so in theory the driver is free to
pick any of them.

Finally, blkdev_issue_zeroout() can send WRITE SAME(10/16) directly, and it
can be called from user space, filesystems, and block drivers.

Thanks,
Ming Lei



Re: [Qemu-devel] [PATCH] block: fix big write

2014-12-10 Thread Ming Lei
On Wed, Dec 10, 2014 at 5:56 PM, Paolo Bonzini  wrote:
>
>
> On 10/12/2014 02:41, Ming Lei wrote:
>> On Wed, Dec 10, 2014 at 1:45 AM, Paolo Bonzini  wrote:
>>>
>>>
>>> On 08/12/2014 08:19, Ming Lei wrote:
>>>>>>
>>>>>> Alternatively, I'd accept a SCSI patch setting max_ws_blocks and friends
>>>>>> to 2GB - 1 block.
>>>> It should be better to not introduce the limit and split the writes
>>>> into size of 2GB - 1 block since there is only the limit for write zero.
>>>
>>> Why? That's exactly what the max_ws_blocks is for, and there's code in
>>> the guest already to do the split.  We're talking about 2GB, not 1MB.
>>
>> The split in write same does not cover write zero, and that is the problem.
>> Otherwise write same just works fine. That said write same of QEMU SCSI
>> can work well without max write same sectors limit.
>>
>> If we introduce the limit of max write same sectors, this limit will be put
>> on both write zero and write non-zero path.
>
> Yeah, but the 2GB limit happens also for the regular I/O path.  The
> quirk is that it doesn't happen for non-zero WRITE SAME, not the other
> way round.
>
>> Also "MAXIMUM WRITE SAME LENGTH" is just introduced on sbc3r35
>> in Jan, 2013, and some old host drivers may not support it, and not using
>> the limit should have better compatibility.
>
> Again, we're talking of 2GB and this is something that should never
> happen in practice.

Again, the 2GB limit can be avoided if the I/O is split, can't it?

It is _not_ something that never happens; it is easy to trigger when using
mkfs.

Thanks,



Re: [Qemu-devel] [PATCH] block: fix big write

2014-12-09 Thread Ming Lei
On Wed, Dec 10, 2014 at 1:45 AM, Paolo Bonzini  wrote:
>
>
> On 08/12/2014 08:19, Ming Lei wrote:
>>> >
>>> > Alternatively, I'd accept a SCSI patch setting max_ws_blocks and friends
>>> > to 2GB - 1 block.
>> It should be better to not introduce the limit and split the writes
>> into size of 2GB - 1 block since there is only the limit for write zero.
>
> Why? That's exactly what the max_ws_blocks is for, and there's code in
> the guest already to do the split.  We're talking about 2GB, not 1MB.

The split in write same does not cover write zero, and that is the problem;
otherwise write same just works fine. That said, QEMU SCSI's write same can
work well without a max write same sectors limit.

If we introduce a max write same sectors limit, it will be applied on both
the write-zero and the write-non-zero paths.

Also, "MAXIMUM WRITE SAME LENGTH" was only introduced in sbc3r35 in Jan 2013,
so some old host drivers may not support it, and not using the limit should
give better compatibility.

Thanks,
Ming Lei



[Qemu-devel] [PATCH] SCSI: split write_zero in units of at most INT_MAX bytes

2014-12-08 Thread Ming Lei
Currently the block layer can not handle a big write well when the write
size is bigger than INT_MAX, so split the write zero into smaller chunks to
meet the block layer's requirement.

This patch fixes a WRITE SAME 16 failure on the Linux VM side.

Cc: Max Reitz 
Signed-off-by: Ming Lei 
---
 hw/scsi/scsi-disk.c |   67 +++
 1 file changed, 57 insertions(+), 10 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 2f75d7d..a843f9b 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -42,6 +42,9 @@ do { printf("scsi-disk: " fmt , ## __VA_ARGS__); } while (0)
 #include 
 #endif
 
+/* bytes in one single write should be held in one 'int' variable */
+#define SCSI_WRITE_ZERO_MAX (INT_MAX)
+
 #define SCSI_WRITE_SAME_MAX 524288
 #define SCSI_DMA_BUF_SIZE   131072
 #define SCSI_MAX_INQUIRY_LEN256
@@ -1618,10 +1621,53 @@ typedef struct WriteSameCBData {
 SCSIDiskReq *r;
 int64_t sector;
 int nb_sectors;
+int curr_sectors;
+int flags;
 QEMUIOVector qiov;
 struct iovec iov;
 } WriteSameCBData;
 
+static void scsi_write_zero_complete(void *opaque, int ret)
+{
+WriteSameCBData *data = opaque;
+SCSIDiskReq *r = data->r;
+SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
+
+assert(r->req.aiocb != NULL);
+r->req.aiocb = NULL;
+block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
+if (r->req.io_canceled) {
+scsi_req_cancel_complete(&r->req);
+goto done;
+}
+
+if (ret < 0) {
+if (scsi_handle_rw_error(r, -ret)) {
+goto done;
+}
+}
+
+data->nb_sectors -= data->curr_sectors;
+data->sector += data->curr_sectors;
+data->curr_sectors = MIN(data->nb_sectors, SCSI_WRITE_ZERO_MAX / 512);
+if (data->nb_sectors) {
+block_acct_start(blk_get_stats(s->qdev.conf.blk), &r->acct,
+ data->curr_sectors * s->qdev.blocksize,
+ BLOCK_ACCT_WRITE);
+r->req.aiocb = blk_aio_write_zeroes(s->qdev.conf.blk,
+data->sector,
+data->curr_sectors,
+data->flags, scsi_write_zero_complete, data);
+return;
+}
+
+scsi_req_complete(&r->req, GOOD);
+
+done:
+scsi_req_unref(&r->req);
+g_free(data);
+}
+
 static void scsi_write_same_complete(void *opaque, int ret)
 {
 WriteSameCBData *data = opaque;
@@ -1686,25 +1732,26 @@ static void scsi_disk_emulate_write_same(SCSIDiskReq 
*r, uint8_t *inbuf)
 return;
 }
 
+data = g_new0(WriteSameCBData, 1);
+data->r = r;
+data->sector = r->req.cmd.lba * (s->qdev.blocksize / 512);
+data->nb_sectors = nb_sectors * (s->qdev.blocksize / 512);
+data->curr_sectors = MIN(data->nb_sectors, SCSI_WRITE_ZERO_MAX / 512);
 if (buffer_is_zero(inbuf, s->qdev.blocksize)) {
-int flags = (req->cmd.buf[1] & 0x8) ? BDRV_REQ_MAY_UNMAP : 0;
+data->flags = (req->cmd.buf[1] & 0x8) ? BDRV_REQ_MAY_UNMAP : 0;
 
 /* The request is used as the AIO opaque value, so add a ref.  */
 scsi_req_ref(&r->req);
 block_acct_start(blk_get_stats(s->qdev.conf.blk), &r->acct,
- nb_sectors * s->qdev.blocksize,
-BLOCK_ACCT_WRITE);
+ data->curr_sectors * s->qdev.blocksize,
+ BLOCK_ACCT_WRITE);
 r->req.aiocb = blk_aio_write_zeroes(s->qdev.conf.blk,
-r->req.cmd.lba * (s->qdev.blocksize / 512),
-nb_sectors * (s->qdev.blocksize / 512),
-flags, scsi_aio_complete, r);
+data->sector,
+data->curr_sectors,
+data->flags, scsi_write_zero_complete, data);
 return;
 }
 
-data = g_new0(WriteSameCBData, 1);
-data->r = r;
-data->sector = r->req.cmd.lba * (s->qdev.blocksize / 512);
-data->nb_sectors = nb_sectors * (s->qdev.blocksize / 512);
 data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);
 data->iov.iov_base = buf = blk_blockalign(s->qdev.conf.blk,
   data->iov.iov_len);
-- 
1.7.9.5




Re: [Qemu-devel] [PATCH] block: fix big write

2014-12-07 Thread Ming Lei
On Sat, Dec 6, 2014 at 12:33 AM, Paolo Bonzini  wrote:
>
>
> On 05/12/2014 17:15, Ming Lei wrote:
>> From: Ming Lei 
>>
>> QEMU block should have supported to read/write at most
>> 0x7f * 512 bytes, unfortunately INT_MAX is used to check
>> bytes in both bdrv_co_do_writev() and bdrv_check_byte_request(),
>> so cause write failure if nr_sectors is equal or more
>> than 0x40.
>>
>> There are still other INT_MAX usages in block.c, and they might
>> need to change to UINT_MAX too in future, but at least
>> this patch's change can make SCSI WRITE SAME 16 workable.
>>
>> Cc: qemu-sta...@nongnu.org
>> Signed-off-by: Ming Lei 
>
> Alternatively, I'd accept a SCSI patch setting max_ws_blocks and friends
> to 2GB - 1 block.

It would be better not to introduce the limit, and instead split the writes
into chunks of 2GB - 1 blocks, since the limit only exists for write zero.
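
The proposed splitting is mechanical; roughly (a sketch in Rust for brevity,
not QEMU code, carving one huge write-zeroes request into chunks the block
layer accepts):

    // Each chunk stays below INT_MAX bytes, rounded down to whole
    // 512-byte sectors; write_zero stands in for the actual submission.
    const MAX_CHUNK: u64 = (i32::MAX as u64 / 512) * 512;

    fn zero_in_chunks(mut offset: u64, mut len: u64,
                      mut write_zero: impl FnMut(u64, u64)) {
        while len > 0 {
            let this = len.min(MAX_CHUNK);
            write_zero(offset, this);
            offset += this;
            len -= this;
        }
    }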

Thanks,



[Qemu-devel] [PATCH] block: fix big write

2014-12-05 Thread Ming Lei
From: Ming Lei 

QEMU block should support reading/writing at most
0x7f * 512 bytes; unfortunately INT_MAX is used to check
the byte count in both bdrv_co_do_writev() and bdrv_check_byte_request(),
causing write failures if nr_sectors is equal to or more
than 0x40.

There are still other INT_MAX usages in block.c, and they might
need to be changed to UINT_MAX too in the future, but at least
this patch's change makes SCSI WRITE SAME 16 workable.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Ming Lei 
---
 block.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index a612594..ddc18c2 100644
--- a/block.c
+++ b/block.c
@@ -2607,7 +2607,7 @@ static int bdrv_check_byte_request(BlockDriverState *bs, 
int64_t offset,
 {
 int64_t len;
 
-if (size > INT_MAX) {
+if (size > UINT_MAX) {
 return -EIO;
 }
 
@@ -3420,7 +3420,7 @@ static int coroutine_fn 
bdrv_co_do_writev(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 BdrvRequestFlags flags)
 {
-if (nb_sectors < 0 || nb_sectors > (INT_MAX >> BDRV_SECTOR_BITS)) {
+if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
 return -EINVAL;
 }
 
-- 
1.7.9.5




[Qemu-devel] [PATCH] scsi-disk: provide "max write same length" for block limits VPD page

2014-12-05 Thread Ming Lei
Since QEMU claims to support UNMAP, WRITE SAME and WRITE SAME 16
in the LBP VPD page, it is better to provide "max write same length"
in the response to the block limits VPD page, because:

- the T10 SBC-3 draft doesn't describe the priority explicitly when all
three are enabled
- the host driver may prefer WRITE SAME/WRITE SAME 16, and then try to parse
"max write same length"

Signed-off-by: Ming Lei 
---
 hw/scsi/scsi-disk.c |   14 ++
 1 file changed, 14 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 2f75d7d..b15bf4f 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -674,6 +674,20 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
uint8_t *outbuf)
 outbuf[29] = (unmap_sectors >> 16) & 0xff;
 outbuf[30] = (unmap_sectors >> 8) & 0xff;
 outbuf[31] = unmap_sectors & 0xff;
+
+/*
+ * maximum write same length, just borrow max unmap
+ * count because write same command need to support
+ * unmap
+ */
+outbuf[36] = 0;
+outbuf[37] = 0;
+outbuf[38] = 0;
+outbuf[39] = 0;
+outbuf[40] = (max_unmap_sectors >> 24) & 0xff;
+outbuf[41] = (max_unmap_sectors >> 16) & 0xff;
+outbuf[42] = (max_unmap_sectors >> 8) & 0xff;
+outbuf[43] = max_unmap_sectors & 0xff;
 break;
 }
 case 0xb2: /* thin provisioning */
-- 
1.7.9.5




Re: [Qemu-devel] [RFC PATCH 3/3] linux-aio: Don't reenter request coroutine recursively

2014-12-04 Thread Ming Lei
On Thu, Dec 4, 2014 at 11:39 PM, Kevin Wolf  wrote:
> On 04.12.2014 at 16:22, Ming Lei wrote:
>> On Thu, Dec 4, 2014 at 10:37 PM, Kevin Wolf  wrote:
>> > On 26.11.2014 at 15:46, Kevin Wolf wrote:
>> >> When getting an error while submitting requests, we must be careful to
>> >> wake up only inactive coroutines. Therefore we must special-case the
>> >> currently active coroutine and communicate an error for that request
>> >> using the ordinary return value of ioq_submit().
>> >>
>> >> Signed-off-by: Kevin Wolf 
>> >> ---
>> >>  block/linux-aio.c | 23 ---
>> >>  1 file changed, 16 insertions(+), 7 deletions(-)
>> >
>> > Ming, did you have a look at this patch specifically? Does it fix the
>> > issue that you tried to address with a much more complex callback-based
>> > patch?
>>
>> I think your patch can't fix my issue.
>>
As I explained, we have to handle -EAGAIN and partial submission, which can
be triggered quite easily in the multi-queue case, and in other cases like
NVMe.
>
> Yes, this is an alternative only to the first patch of your series. If

No, most of my first patch is for handling -EAGAIN and partial
submission, so both of my patches are needed for the issue.

> we go that route, your second patch would still be needed as well, of
> course. It should be relatively obvious how to change it to apply on top
> of this one.
>
> Kevin
>



Re: [Qemu-devel] [RFC PATCH 3/3] linux-aio: Don't reenter request coroutine recursively

2014-12-04 Thread Ming Lei
On Thu, Dec 4, 2014 at 10:37 PM, Kevin Wolf  wrote:
> On 26.11.2014 at 15:46, Kevin Wolf wrote:
>> When getting an error while submitting requests, we must be careful to
>> wake up only inactive coroutines. Therefore we must special-case the
>> currently active coroutine and communicate an error for that request
>> using the ordinary return value of ioq_submit().
>>
>> Signed-off-by: Kevin Wolf 
>> ---
>>  block/linux-aio.c | 23 ---
>>  1 file changed, 16 insertions(+), 7 deletions(-)
>
> Ming, did you have a look at this patch specifically? Does it fix the
> issue that you tried to address with a much more complex callback-based
> patch?

I think your patch can't fix my issue.

As I explained, we have to handle -EAGAIN and partial submission,
which can be triggered quite easily in the multi-queue case, and in other
cases like NVMe.

Thanks,

>
> I'd like to go forward with this as both Peter and I have measured
> considerable performance improvements with our optimisation proposals,
> and this series is an important part of it.
>
> Kevin
>
>
>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>> index 99b259d..fd8f0e4 100644
>> --- a/block/linux-aio.c
>> +++ b/block/linux-aio.c
>> @@ -177,12 +177,19 @@ static int ioq_submit(struct qemu_laio_state *s)
>>  container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
>>
>>  laiocb->ret = (ret < 0) ? ret : -EIO;
>> -qemu_laio_process_completion(s, laiocb);
>> +if (laiocb->co != qemu_coroutine_self()) {
>> +qemu_coroutine_enter(laiocb->co, NULL);
>> +} else {
>> +/* The return value is used for the currently active coroutine.
>> + * We're always in ioq_enqueue() here, ioq_submit() never runs 
>> from
>> + * a request's coroutine.*/
>> +ret = laiocb->ret;
>> +}
>>  }
>>  return ret;
>>  }
>>
>> -static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>> +static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>>  {
>>  unsigned int idx = s->io_q.idx;
>>
>> @@ -191,7 +198,9 @@ static void ioq_enqueue(struct qemu_laio_state *s, 
>> struct iocb *iocb)
>>
>>  /* submit immediately if queue is full */
>>  if (idx == s->io_q.size) {
>> -ioq_submit(s);
>> +return ioq_submit(s);
>> +} else {
>> +return 0;
>>  }
>>  }
>>
>> @@ -253,11 +262,11 @@ int laio_submit_co(BlockDriverState *bs, void *aio_ctx, int fd,
>>
>>  if (!s->io_q.plugged) {
>>  ret = io_submit(s->ctx, 1, &iocbs);
>> -if (ret < 0) {
>> -return ret;
>> -}
>>  } else {
>> -ioq_enqueue(s, iocbs);
>> +ret = ioq_enqueue(s, iocbs);
>> +}
>> +if (ret < 0) {
>> +return ret;
>>  }
>>
>>  qemu_coroutine_yield();
>> --
>> 1.8.3.1
>>
>



Re: [Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems

2014-12-01 Thread Ming Lei
On Mon, Dec 1, 2014 at 8:41 PM, Paolo Bonzini  wrote:
>
>
> On 01/12/2014 02:28, Ming Lei wrote:
>>> > -   (unsigned long)(1000000000 * duration) / maxcycles);
>>> > +   (unsigned long)(1000000000.0 * duration / maxcycles));
>> One more single bracket.
>
> I don't understand?

Sorry, it is my fault, :-(

Thanks



[Qemu-devel] [PATCH v7 2/3] linux-aio: handling -EAGAIN for !s->io_q.plugged case

2014-12-01 Thread Ming Lei
Previously, -EAGAIN was simply ignored in the !s->io_q.plugged case,
and it could easily cause -EIO to be returned to the VM, e.g. with an
NVMe device.

This patch handles -EAGAIN via the io queue in the !s->io_q.plugged
case, and the request will be retried in the following aio completion cb.

Most of the time, -EAGAIN only happens if there is pending I/O, but in
the Linux kernel AIO implementation io_submit() might also return it
when kmem_cache_alloc(GFP_KERNEL) returns NULL. So 'pending' is
introduced in 'struct qemu_laio_state' for tracking active I/O, and
-EAGAIN is only handled when there is pending I/O.

Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Paolo Bonzini 
Suggested-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   32 
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 53c5616..9403b17 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -56,6 +56,7 @@ typedef struct {
 } LaioQueue;
 
 struct qemu_laio_state {
+unsigned long pending;
 io_context_t ctx;
 EventNotifier e;
 
@@ -98,6 +99,7 @@ static void qemu_laio_process_completion(struct qemu_laio_state *s,
 }
 }
 }
+s->pending--;
 laiocb->common.cb(laiocb->common.opaque, ret);
 
 qemu_aio_unref(laiocb);
@@ -179,6 +181,7 @@ static void laio_cancel(BlockAIOCB *blockacb)
 return;
 }
 
+laiocb->ctx->pending--;
 laiocb->common.cb(laiocb->common.opaque, laiocb->ret);
 }
 
@@ -280,8 +283,13 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 s->io_q.iocbs[idx++] = iocb;
 s->io_q.idx = idx;
 
-/* submit immediately if queue depth is above 2/3 */
-if (idx > s->io_q.size * 2 / 3) {
+/*
+ * This is reached in two cases: queue not plugged but io_submit
+ * returned -EAGAIN, or queue plugged.  In the latter case, start
+ * submitting some I/O if the queue is getting too full.  In the
+ * former case, instead, wait until an I/O operation is completed.
+ */
+if (s->io_q.plugged && unlikely(idx > s->io_q.size * 2 / 3)) {
 ioq_submit(s);
 }
 
@@ -346,15 +354,23 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
 }
 io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
 
-if (!s->io_q.plugged) {
-if (io_submit(s->ctx, 1, &iocbs) < 0) {
-goto out_free_aiocb;
-}
-} else {
-if (ioq_enqueue(s, iocbs) < 0) {
+/* Switch to queue mode until -EAGAIN is handled */
+if (!s->io_q.plugged && !s->io_q.idx) {
+int ret = io_submit(s->ctx, 1, &iocbs);
+if (ret >= 0) {
+return &laiocb->common;
+} else if (ret != -EAGAIN || (ret == -EAGAIN && !s->pending)) {
 goto out_free_aiocb;
 }
+/*
+ * In case of -EAGAIN, only queue the req if there is pending
+ * I/O and it is resubmitted in completion of pending I/O
+ */
+}
+if (ioq_enqueue(s, iocbs) < 0) {
+goto out_free_aiocb;
 }
+s->pending++;
 return &laiocb->common;
 
 out_free_aiocb:
-- 
1.7.9.5




[Qemu-devel] [PATCH v7 3/3] linux-aio: remove 'node' from 'struct qemu_laiocb'

2014-12-01 Thread Ming Lei
No one uses the 'node' field any more, so remove it
from 'struct qemu_laiocb'; this saves 16 bytes in the
struct on 64-bit arches.

Reviewed-by: Kevin Wolf 
Reviewed-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 9403b17..cad3848 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -35,7 +35,6 @@ struct qemu_laiocb {
 size_t nbytes;
 QEMUIOVector *qiov;
 bool is_read;
-QLIST_ENTRY(qemu_laiocb) node;
 };
 
 /*
-- 
1.7.9.5




[Qemu-devel] [PATCH v7 1/3] linux-aio: fix submit aio as a batch

2014-12-01 Thread Ming Lei
In the submit path, we can't complete requests directly,
otherwise "Co-routine re-entered recursively" may be caused,
so this patch fixes the issue with the ideas below:

- for -EAGAIN or partial submission, retry the submission
in the following completion cb, which is run in BH context
- for partial submission, update the io queue too
- in case the io queue is full, submit queued requests
immediately and return failure to the caller
- for other failures, abort all queued requests in BH
context, and requests won't be allowed to be submitted until
the aborting is handled

Reviewed-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |  116 -
 1 file changed, 97 insertions(+), 19 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index d92513b..53c5616 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -38,11 +38,21 @@ struct qemu_laiocb {
 QLIST_ENTRY(qemu_laiocb) node;
 };
 
+/*
+ * TODO: support to batch I/O from multiple bs in one same
+ * AIO context, one important use case is multi-lun scsi,
+ * so in future the IO queue should be per AIO context.
+ */
 typedef struct {
 struct iocb *iocbs[MAX_QUEUED_IO];
 int plugged;
 unsigned int size;
 unsigned int idx;
+
+/* abort queued requests in BH context */
+QEMUBH *abort_bh;
+bool aborting;
+int abort_ret;
 } LaioQueue;
 
 struct qemu_laio_state {
@@ -59,6 +69,8 @@ struct qemu_laio_state {
 int event_max;
 };
 
+static int ioq_submit(struct qemu_laio_state *s);
+
 static inline ssize_t io_event_ret(struct io_event *ev)
 {
 return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
@@ -135,6 +147,11 @@ static void qemu_laio_completion_bh(void *opaque)
 
 qemu_laio_process_completion(s, laiocb);
 }
+
+/* Handle -EAGAIN or partial submission */
+if (s->io_q.idx) {
+ioq_submit(s);
+}
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -175,47 +192,100 @@ static void ioq_init(LaioQueue *io_q)
 io_q->size = MAX_QUEUED_IO;
 io_q->idx = 0;
 io_q->plugged = 0;
+io_q->aborting = false;
 }
 
+/* Always return >= 0 and it means how many requests are submitted */
 static int ioq_submit(struct qemu_laio_state *s)
 {
-int ret, i = 0;
+int ret;
 int len = s->io_q.idx;
 
-do {
-ret = io_submit(s->ctx, len, s->io_q.iocbs);
-} while (i++ < 3 && ret == -EAGAIN);
-
-/* empty io queue */
-s->io_q.idx = 0;
+if (!len) {
+return 0;
+}
 
+ret = io_submit(s->ctx, len, s->io_q.iocbs);
 if (ret < 0) {
-i = 0;
-} else {
-i = ret;
+/* retry in following completion cb */
+if (ret == -EAGAIN) {
+return 0;
+}
+
+/*
+ * Abort in BH context for avoiding Co-routine re-entered,
+ * and update io queue at that time
+ */
+s->io_q.aborting = true;
+s->io_q.abort_ret = ret;
+qemu_bh_schedule(s->io_q.abort_bh);
+ret = 0;
 }
 
-for (; i < len; i++) {
-struct qemu_laiocb *laiocb =
-container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
+/*
+ * update io queue, and retry will be started automatically
+ * in following completion cb for the remainder
+ */
+if (ret > 0) {
+if (ret < len) {
+memmove(&s->io_q.iocbs[0], &s->io_q.iocbs[ret],
+(len - ret) * sizeof(struct iocb *));
+}
+s->io_q.idx -= ret;
+}
+
+return ret;
+}
 
-laiocb->ret = (ret < 0) ? ret : -EIO;
+static void ioq_abort_bh(void *opaque)
+{
+struct qemu_laio_state *s = opaque;
+int i;
+
+for (i = 0; i < s->io_q.idx; i++) {
+struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
+  struct qemu_laiocb,
+  iocb);
+laiocb->ret = s->io_q.abort_ret;
 qemu_laio_process_completion(s, laiocb);
 }
-return ret;
+
+s->io_q.idx = 0;
+s->io_q.aborting = false;
 }
 
-static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
+static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 {
 unsigned int idx = s->io_q.idx;
 
+/* Request can't be allowed to submit until aborting is handled */
+if (unlikely(s->io_q.aborting)) {
+return -EIO;
+}
+
+if (unlikely(idx == s->io_q.size)) {
+ioq_submit(s);
+
+if (unlikely(s->io_q.aborting)) {
+return -EIO;
+}
+idx = s->io_q.idx;
+}
+
+/* It has to return now if queue is still full */
+if (unlikely(idx == s->io_q.size)) {
+return -EAGAIN;
+}
+
 s->io_q.iocbs[idx++] = iocb;
  

[Qemu-devel] [PATCH v7 0/3] linux-aio: fix batch submission

2014-12-01 Thread Ming Lei
The 1st patch fixes batch submission.

The 2nd one fixes -EAGAIN for non-batch case.

The 3rd one is a cleanup.

This patchset is split from the previous patchset (dataplane: optimization
and multi virtqueue support), as suggested by Stefan.

V7:
- add protection for aborting in laio_attach_aio_context(), as suggested
by Stefan, 1/3
- patch style, return real aborting failure to caller, as suggested
by Kevin, 1/3
- track pending I/O and only handle -EAGAIN if there is pending I/O,
pointed out by Kevin, 2/3

V6:
- don't pass ioq_submit() return value to ioq_enqueue(), as suggested
by Stefan
- fix one build failure introduced in V5, reported by Stefan

V5:
- in case of submission failure, return -EIO for new coming requests
until aborting is handled
- in patch2, follow Paolo's suggestion about ioq_enqueue() changes

V4:
- abort requests in BH to avoid potential "Co-routine re-entered
recursively"
- remove 'enqueue' parameter to ioq_submit() to simplify the change
- beautify code as suggested by Paolo

V3:
- rebase on QEMU master
V2:
- code style fix and commit log fix as suggested by Benoît Canet
V1:
- rebase on latest QEMU master

 block/linux-aio.c |  139 -
 1 file changed, 116 insertions(+), 23 deletions(-)


Thanks,
Ming Lei





Re: [Qemu-devel] [PATCH 0/7] coroutine: optimizations

2014-11-30 Thread Ming Lei
On Mon, 01 Dec 2014 08:05:17 +0100
Peter Lieven  wrote:

> On 01.12.2014 06:55, Ming Lei wrote:
> > On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini  wrote:
> >> As discussed in the other thread, this brings speedups from
> >> dropping the coroutine mutex (which serializes multiple iothreads,
> >> too) and using ELF thread-local storage.
> >>
> >> The speedup in perf/cost is about 30% (190->145).  Windows port tested
> >> with tests/test-coroutine.exe under Wine.
> > The data is very nice, and in my laptop, 'perf cost' can be decreased
> > from 244ns to 174ns.
> >
> > BTW, the cost by using coroutine to run function isn't only from these
> > helpers(*_yield, *_enter, *_create, and perf-cost just measures
> > this part of cost), but also some implicit/invisible part. I have some
> > test cases which can show the problem. If someone is interested,
> > I can post them in list.
> 
> Of course, maybe the problem can be solved or impaired.

OK, please try the patch below:

From 917d5cc0a273f9825b10abd52152c54e08c81ef8 Mon Sep 17 00:00:00 2001
From: Ming Lei 
Date: Mon, 1 Dec 2014 11:11:23 +0800
Subject: [PATCH] test-coroutine: introduce perf-cost-with-load

The perf/cost test case only covers the explicit cost of
using a coroutine.

This patch adds an open/close file test case, and from this
case we can see that there is also some implicit or invisible
cost beyond the cost measured by /perf/cost.

In my environment, the following is the test result after applying
this patch and running perf/cost and perf/cost-with-load:

{*LOG(start):{/perf/cost}:LOG*}
/perf/cost: {*LOG(message):{Run operation 40000000 iterations 7.539413 s, 5305K operations/s, 188ns per coroutine}:LOG*}
OK
{*LOG(stop):(0;0;7.539497):LOG*}

{*LOG(start):{/perf/cost-with-load}:LOG*}
/perf/cost-with-load: {*LOG(message):{Run operation 1000000 iterations 2.648014 s, 377K operations/s, 2648ns per operation without using coroutine}:LOG*}
{*LOG(message):{Run operation 1000000 iterations 2.919133 s, 342K operations/s, 2919ns per operation, 271ns(cost introduced by coroutine) per operation with using coroutine}:LOG*}
OK
{*LOG(stop):(0;0;5.567333):LOG*}

From the above data, we can see that 188ns is the measured cost of
running one coroutine, but in /perf/cost-with-load the actual cost
introduced is 271ns (2919ns - 2648ns), so the extra 83ns of cost is
invisible and implicit.

A similar result can be found in the following test cases too:
- read from /dev/nullb0 opened with O_DIRECT
(a sort of AIO read simulation; /dev/nullbX needs a 3.13+
kernel and 'modprobe null_blk'; this case can show +150ns
extra cost)
- the statvfs() syscall: there is ~30ns extra cost for running
one statvfs() with a coroutine
---
 tests/test-coroutine.c |   67 
 1 file changed, 67 insertions(+)

diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
index 27d1b6f..7323a91 100644
--- a/tests/test-coroutine.c
+++ b/tests/test-coroutine.c
@@ -311,6 +311,72 @@ static void perf_baseline(void)
 maxcycles, duration);
 }
 
+static void perf_cost_load_worker(void *opaque)
+{
+int fd;
+
+fd = open("/proc/self/exe", O_RDONLY);
+assert(fd >= 0);
+close(fd);
+}
+
+static __attribute__((noinline)) void perf_cost_load_func(void *opaque)
+{
+perf_cost_load_worker(opaque);
+qemu_coroutine_yield();
+}
+
+static double perf_cost_load(unsigned long maxcycles, bool use_co)
+{
+unsigned long i = 0;
+double duration;
+
+g_test_timer_start();
+if (use_co) {
+Coroutine *co;
+while (i++ < maxcycles) {
+co = qemu_coroutine_create(perf_cost_load_func);
+qemu_coroutine_enter(co, &i);
+qemu_coroutine_enter(co, NULL);
+}
+} else {
+while (i++ < maxcycles) {
+perf_cost_load_worker(&i);
+}
+}
+duration = g_test_timer_elapsed();
+
+return duration;
+}
+
+static void perf_cost_with_load(void)
+{
+const unsigned long maxcycles = 1000000;
+double duration;
+unsigned long ops;
+unsigned long cost_co, cost;
+
+duration = perf_cost_load(maxcycles, false);
+ops = (long)(maxcycles / (duration * 1000));
+cost = (unsigned long)(1000000000.0 * duration / maxcycles);
+g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
+   "%luns per operation without using coroutine",
+   maxcycles,
+   duration, ops,
+   cost);
+
+duration = perf_cost_load(maxcycles, true);
+ops = (long)(maxcycles / (duration * 1000));
+cost_co = (unsigned long)(1000000000.0 * duration / maxcycles);

Re: [Qemu-devel] [PATCH 0/7] coroutine: optimizations

2014-11-30 Thread Ming Lei
On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini  wrote:
> As discussed in the other thread, this brings speedups from
> dropping the coroutine mutex (which serializes multiple iothreads,
> too) and using ELF thread-local storage.
>
> The speedup in perf/cost is about 30% (190->145).  Windows port tested
> with tests/test-coroutine.exe under Wine.

The data is very nice, and on my laptop 'perf cost' can be decreased
from 244ns to 174ns.

BTW, the cost of using a coroutine to run a function doesn't only come
from these helpers (*_yield, *_enter, *_create; perf/cost just measures
this part of the cost), but also from some implicit/invisible part.
I have some test cases which can show the problem. If someone is
interested, I can post them to the list.
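
To make the measured pattern concrete, here is roughly what "using a
coroutine to run a function" means in this discussion (an illustrative
sketch against the 2014-era QEMU coroutine API, mirroring perf_cost()
in tests/test-coroutine.c; do_work() is a made-up stand-in workload):

#include "block/coroutine.h"    /* QEMU's coroutine header at the time */

/* made-up stand-in workload; any function body works here */
static void do_work(void *opaque)
{
}

/* entry point, executed on the coroutine's own stack */
static void coroutine_fn run_func(void *opaque)
{
    do_work(opaque);
    qemu_coroutine_yield();     /* switch back to the caller */
}

/* one "operation": create, enter, yield, re-enter, terminate;
 * the helpers are the explicit cost measured by perf/cost, while
 * the stack switches and cache effects are the implicit part */
static void run_once(void *arg)
{
    Coroutine *co = qemu_coroutine_create(run_func);
    qemu_coroutine_enter(co, arg);   /* runs until the yield */
    qemu_coroutine_enter(co, NULL);  /* resumes; coroutine terminates */
}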


Thanks,
Ming Lei



Re: [Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems

2014-11-30 Thread Ming Lei
On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini  wrote:
> unsigned long is not large enough to represent 1000000000 * duration there.
> Just use floating point.
>
> Signed-off-by: Paolo Bonzini 
> ---
>  tests/test-coroutine.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
> index e22fae1..27d1b6f 100644
> --- a/tests/test-coroutine.c
> +++ b/tests/test-coroutine.c
> @@ -337,7 +337,7 @@ static void perf_cost(void)
> "%luns per coroutine",
> maxcycles,
> duration, ops,
> -   (unsigned long)(1000000000 * duration) / maxcycles);
> +   (unsigned long)(1000000000.0 * duration / maxcycles));

One more single bracket.

thanks,
Ming Lei



Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-28 Thread Ming Lei
On 11/28/14, Markus Armbruster  wrote:
> Ming Lei  writes:
>
>> On 11/28/14, Markus Armbruster  wrote:
>>> Ming Lei  writes:
>>>
>>>> Hi Kevin,
>>>>
>>>> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf  wrote:
>>>>> This improves the performance of requests because an ACB doesn't need
>>>>> to
>>>>> be allocated on the heap any more. It also makes the code nicer and
>>>>> smaller.
>>>>
>>>> I am not sure it is a good way to optimize linux aio:
>>>>
>>>> - for a raw image with some constraints, the coroutine can be avoided
>>>> since io_submit() won't sleep most of the time
>>>>
>>>> - handling one coroutine takes more time than handling malloc,
>>>> memset and free on a small buffer; see the test data:
>>>>
>>>>  --   241ns per coroutine
>>>
>>> What do you mean by "coroutine" here?  Create + destroy?  Yield?
>>
>> Please see perf_cost() in tests/test-coroutine.c
>
> static __attribute__((noinline)) void perf_cost_func(void *opaque)
> {
> qemu_coroutine_yield();
> }
>
> static void perf_cost(void)
> {
> const unsigned long maxcycles = 40000000;
> unsigned long i = 0;
> double duration;
> unsigned long ops;
> Coroutine *co;
>
> g_test_timer_start();
> while (i++ < maxcycles) {
> co = qemu_coroutine_create(perf_cost_func);
> qemu_coroutine_enter(co, &i);
> qemu_coroutine_enter(co, NULL);
> }
> duration = g_test_timer_elapsed();
> ops = (long)(maxcycles / (duration * 1000));
>
> g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
>"%luns per coroutine",
>maxcycles,
>duration, ops,
>(unsigned long)(1000000000 * duration) / maxcycles);
> }
>
> This tests create, enter, yield, reenter, terminate, destroy.  The cost
> of create + destroy may well dominate.

Actually there shouldn't be much cost from create and destroy,
thanks to the coroutine pool.

>
> If we create and destroy coroutines for each AIO request, we're doing it
> wrong.  I doubt Kevin's doing it *that* wrong ;)
>
> Anyway, let's benchmark the real code instead of putting undue trust in
> tests/test-coroutine.c micro-benchmarks.

I don't think the micro-benchmark is untrustworthy.

That is the direct cost of the coroutine, and that cost can't be avoided
at all, not to mention the cost of switching stacks.

If you google the test data I posted previously, it shows that bypassing
the coroutine can increase throughput by ~50% for a raw image in the
case of linux aio; that is a real test case, not a micro-benchmark.


Thanks,
Ming Lei



Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-28 Thread Ming Lei
On 11/28/14, Markus Armbruster  wrote:
> Ming Lei  writes:
>
>> Hi Kevin,
>>
>> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf  wrote:
>>> This improves the performance of requests because an ACB doesn't need to
>>> be allocated on the heap any more. It also makes the code nicer and
>>> smaller.
>>
>> I am not sure it is a good way to optimize linux aio:
>>
>> - for a raw image with some constraints, the coroutine can be avoided
>> since io_submit() won't sleep most of the time
>>
>> - handling one coroutine takes more time than handling malloc,
>> memset and free on a small buffer; see the test data:
>>
>>  --   241ns per coroutine
>
> What do you mean by "coroutine" here?  Create + destroy?  Yield?

Please see perf_cost() in tests/test-coroutine.c

Thanks,
Ming Lei



Re: [Qemu-devel] [PATCH v6 1/3] linux-aio: fix submit aio as a batch

2014-11-27 Thread Ming Lei
On Fri, Nov 28, 2014 at 12:58 AM, Paolo Bonzini  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
>
>
> On 27/11/2014 17:56, Stefan Hajnoczi wrote:
>> I am asking for a one-line if statement to avoid introducing a
>> subtle assumption that would be a pain to debug in the future.
>> It's so easy to add that I'm against merging the patch without this
>> protection.
>
> Yeah, either that or an assertion must be there.

OK.

Thanks,
Ming Lei



Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines

2014-11-27 Thread Ming Lei
Hi Kevin,

On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf  wrote:
> This improves the performance of requests because an ACB doesn't need to
> be allocated on the heap any more. It also makes the code nicer and
> smaller.

I am not sure it is a good way to optimize linux aio:

- for a raw image with some constraints, the coroutine can be avoided
since io_submit() won't sleep most of the time

- handling one coroutine takes more time than handling malloc,
memset and free on a small buffer; see the test data:

 --   241ns per coroutine
 --   61ns per (malloc, memset, free for 128bytes)
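
For reference, the 61ns figure can be reproduced with a trivial
standalone loop like the sketch below (my assumption of the
methodology, not the actual test program; build with -O2, and note
that an aggressive compiler may elide the unused allocation):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const long long maxcycles = 10000000;
    struct timespec t0, t1;
    long long i, ns;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < maxcycles; i++) {
        /* the per-request pattern being costed: a small
         * ACB-sized heap allocation, cleared and freed again */
        void *p = malloc(128);
        memset(p, 0, 128);
        free(p);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
         (t1.tv_nsec - t0.tv_nsec);
    printf("%lldns per (malloc, memset, free)\n", ns / maxcycles);
    return 0;
}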

I still think we should figure out a fast path to avoid the coroutine
for linux-aio with a raw image; otherwise it can't scale well for
high-IOPS devices.

Also, we can use a simple buffer pool to avoid the dynamic allocation
easily, can't we?

>
> As a side effect, the codepath taken by aio=threads is changed to use
> paio_submit_co(). This doesn't change the performance at this point.
>
> Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:
>
>       |      aio=native       |      aio=threads
>       | before   | with patch | before   | with patch
> ------+----------+------------+----------+------------
> run 1 | 29.921s  | 26.932s    | 35.286s  | 35.447s
> run 2 | 29.793s  | 26.252s    | 35.276s  | 35.111s
> run 3 | 30.186s  | 27.114s    | 35.042s  | 34.921s
> run 4 | 30.425s  | 26.600s    | 35.169s  | 34.968s
> run 5 | 30.041s  | 26.263s    | 35.224s  | 35.000s
>
> TODO: Do some more serious benchmarking in VMs with less variance.
> Results of a quick fio run are vaguely positive.

I will do the test with Paolo's fast-path approach under a VM I/O
workload.

Thanks,
Ming Lei



Re: [Qemu-devel] [PATCH v6 2/3] linux-aio: handling -EAGAIN for !s->io_q.plugged case

2014-11-27 Thread Ming Lei
Hi Kevin,

On Wed, Nov 26, 2014 at 7:27 PM, Kevin Wolf  wrote:
> Am 25.11.2014 um 08:23 hat Ming Lei geschrieben:
>> Previously, -EAGAIN was simply ignored in the !s->io_q.plugged case,
>> and it could easily cause -EIO to be returned to the VM, e.g. with an
>> NVMe device.
>>
>> This patch handles -EAGAIN via the io queue in the !s->io_q.plugged
>> case, and the request will be retried in the following aio completion cb.
>>
>> Reviewed-by: Paolo Bonzini 
>> Suggested-by: Paolo Bonzini 
>> Signed-off-by: Ming Lei 
>> ---
>>  block/linux-aio.c |   24 
>>  1 file changed, 16 insertions(+), 8 deletions(-)
>>
>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>> index 11ac828..ac25722 100644
>> --- a/block/linux-aio.c
>> +++ b/block/linux-aio.c
>> @@ -282,8 +282,13 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>>  s->io_q.iocbs[idx++] = iocb;
>>  s->io_q.idx = idx;
>>
>> -/* submit immediately if queue depth is above 2/3 */
>> -if (idx > s->io_q.size * 2 / 3) {
>> +/*
>> + * This is reached in two cases: queue not plugged but io_submit
>> + * returned -EAGAIN, or queue plugged.  In the latter case, start
>> + * submitting some I/O if the queue is getting too full.  In the
>> + * former case, instead, wait until an I/O operation is completed.
>> + */
>
> Are we guaranteed that an I/O operation is in flight when we get
> -EAGAIN? The manpage of io_submit isn't very clear on this,
> "insufficient resources" could be for any reason.
>

That is a good question.

From fs/aio.c in the Linux kernel, io_submit_one() returns -EAGAIN
either when there aren't enough requests left of those reserved in
io_setup(), or when kmem_cache_alloc(GFP_KERNEL) returns NULL.

In the former case, it means an I/O operation is in flight.

In the latter case, it should be very difficult to trigger, since a
GFP_KERNEL allocation will wait for memory reclaim.

So most of the time it is reasonable to resubmit on completion for
-EAGAIN. When there is no pending I/O, we can still handle the very
unlikely case either by returning failure to the caller or by trying
to submit in a BH. Does that make sense to you?
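
The reserved-slot behaviour can be poked at from user space with a
sketch like the one below (illustrative only; build with -laio, needs
'modprobe null_blk'; whether and when -EAGAIN actually fires depends
on the kernel version and on how fast the device completes I/O):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <libaio.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    void *buf;
    int i, ret;
    int fd = open("/dev/nullb0", O_RDONLY | O_DIRECT);

    /* ask the kernel to reserve only a handful of aio requests */
    if (fd < 0 || io_setup(8, &ctx) < 0 ||
        posix_memalign(&buf, 512, 512) != 0) {
        return 1;
    }
    /* submit without reaping completions; once the reserved
     * requests are used up, io_submit() fails with -EAGAIN */
    for (i = 0; i < 65536; i++) {
        io_prep_pread(&cb, fd, buf, 512, 0);
        ret = io_submit(ctx, 1, cbs);
        if (ret == -EAGAIN) {
            printf("-EAGAIN after %d submissions\n", i);
            return 0;
        }
        if (ret < 0) {
            printf("io_submit: %d\n", ret);
            return 1;
        }
    }
    printf("no -EAGAIN observed (completions may free slots too fast)\n");
    return 0;
}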

Thanks,
Ming Lei



Re: [Qemu-devel] [PATCH v6 1/3] linux-aio: fix submit aio as a batch

2014-11-27 Thread Ming Lei
On Wed, Nov 26, 2014 at 7:18 PM, Kevin Wolf  wrote:
> Am 25.11.2014 um 08:23 hat Ming Lei geschrieben:
>> In the submit path, we can't complete requests directly,
>> otherwise "Co-routine re-entered recursively" may be caused,
>> so this patch fixes the issue with the ideas below:
>>
>>   - for -EAGAIN or partial completion, retry the submission
>>   in the following completion cb, which is run in BH context
>>   - for partial completion, update the io queue too
>>   - in case the io queue is full, submit queued requests
>>   immediately and return failure to the caller
>>   - for other failures, abort all queued requests in BH
>> context, and requests won't be allowed to be submitted until
>> the aborting is handled
>>
>> Reviewed-by: Paolo Bonzini 
>> Signed-off-by: Ming Lei 
>
> This looks like a quite complex fix to this problem, and it introduces
> new error cases, too (while aborting, requests fail now, but they really
> should just be waiting).

"Co-routine re-entered recursively" is only triggered when io_submit()
returns failure(either -EAGAIN or other error) or partial completion, so
this patch actually is for handling failure of io_submit() and making
linux-aio more reliably.

After the io queue was introduced, it became a bit easy to trigger the
failure, especially in the multi-queue case, or when the VM driver sets
a longer queue depth. So all cases (-EAGAIN, other failures and partial
completion) have to be considered too.

So this isn't a needlessly complex fix; the actual problem is complex
too. Not to mention that previously linux-aio didn't handle -EAGAIN at
all.
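
For readers following along, the io_submit() contract that forces all
of these cases looks roughly like this (a sketch of the semantics only,
not the QEMU code; submit_some() is a made-up name):

#include <libaio.h>

/* io_submit() either queues a prefix of the array and returns its
 * length, or queues nothing and returns a negative error code for
 * the first iocb. */
static int submit_some(io_context_t ctx, struct iocb **iocbs, int len)
{
    int ret = io_submit(ctx, len, iocbs);

    if (ret >= 0) {
        /* full (ret == len) or partial (ret < len) submission:
         * iocbs[ret..len-1] must stay queued and be retried later */
        return ret;
    }
    if (ret == -EAGAIN) {
        /* nothing is in flight from this call; keep everything
         * queued and retry after some completions are reaped */
        return 0;
    }
    /* hard error: nothing was queued, so the queued requests have
     * to be failed (in a BH in the actual patch, to avoid
     * re-entering the submitting coroutine) */
    return ret;
}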

>
> I'm wondering if this is the time to convert the linux-aio interface to
> coroutines finally. It wouldn't only be a performance optimisation, but
> would potentially also simplify this code.

Even with coroutines, the above io_submit() issues have to be
considered too.

>
>>  block/linux-aio.c |  114 -
>>  1 file changed, 95 insertions(+), 19 deletions(-)
>>
>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>> index d92513b..11ac828 100644
>> --- a/block/linux-aio.c
>> +++ b/block/linux-aio.c
>> @@ -38,11 +38,20 @@ struct qemu_laiocb {
>>  QLIST_ENTRY(qemu_laiocb) node;
>>  };
>>
>> +/*
>> + * TODO: support to batch I/O from multiple bs in one same
>> + * AIO context, one important use case is multi-lun scsi,
>> + * so in future the IO queue should be per AIO context.
>> + */
>>  typedef struct {
>>  struct iocb *iocbs[MAX_QUEUED_IO];
>>  int plugged;
>>  unsigned int size;
>>  unsigned int idx;
>> +
>> +/* abort queued requests in BH context */
>> +QEMUBH *abort_bh;
>> +bool  aborting;
>
> Two spaces.

OK.

>
>>  } LaioQueue;
>>
>>  struct qemu_laio_state {
>> @@ -59,6 +68,8 @@ struct qemu_laio_state {
>>  int event_max;
>>  };
>>
>> +static int ioq_submit(struct qemu_laio_state *s);
>> +
>>  static inline ssize_t io_event_ret(struct io_event *ev)
>>  {
>>  return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
>> @@ -91,6 +102,13 @@ static void qemu_laio_process_completion(struct qemu_laio_state *s,
>>  qemu_aio_unref(laiocb);
>>  }
>>
>> +static void qemu_laio_start_retry(struct qemu_laio_state *s)
>> +{
>> +if (s->io_q.idx) {
>> +ioq_submit(s);
>> +}
>> +}
>> +
>>  /* The completion BH fetches completed I/O requests and invokes their
>>   * callbacks.
>>   *
>> @@ -135,6 +153,8 @@ static void qemu_laio_completion_bh(void *opaque)
>>
>>  qemu_laio_process_completion(s, laiocb);
>>  }
>> +
>> +qemu_laio_start_retry(s);
>>  }
>
> Why is qemu_laio_start_retry() a separate function? This is the only
> caller.

OK.

>
>>  static void qemu_laio_completion_cb(EventNotifier *e)
>> @@ -175,47 +195,99 @@ static void ioq_init(LaioQueue *io_q)
>>  io_q->size = MAX_QUEUED_IO;
>>  io_q->idx = 0;
>>  io_q->plugged = 0;
>> +io_q->aborting = false;
>>  }
>>
>> +/* Always return >= 0 and it means how many requests are submitted */
>>  static int ioq_submit(struct qemu_laio_state *s)
>>  {
>> -int ret, i = 0;
>> +int ret;
>>  int len = s->io_q.idx;
>>
>> -do {
>> -ret = io_submit(s->ctx, len, s->io_q.iocbs);
>> -} while (i++ < 3 && ret == -EAGAIN);
>> -
>> -/* empty io queue */
>>

Re: [Qemu-devel] [PATCH v6 1/3] linux-aio: fix submit aio as a batch

2014-11-26 Thread Ming Lei
On Wed, Nov 26, 2014 at 12:18 AM, Stefan Hajnoczi  wrote:
>>
>> You mean the abort BH may not have a chance to run before its deletion
>> in the detach callback?
>
> Exactly.  Any time you schedule a BH you need to be aware of things that
> may happen before the BH is invoked.
>
>> If so, bdrv_drain_all() from bdrv_set_aio_context() should have
>> handled the pending BH, right?
>
> I'm not sure if it's good to make subtle assumptions like that.  If the
> code changes they will break.

IMO, that should be the purpose of bdrv_drain_all(), at least from
its comment:

 /* ensure there are no in-flight requests */

If that changes in the future, the general problem will have to be considered.

> Since it is very easy to protect against this case (the code I posted
> before), it seems worthwhile to be on the safe side.

Given that the potential problem doesn't exist in the current tree,
could you agree to merge it first?

BTW, there isn't this sort of handling for linux-aio's 'completion_bh' either, :-)

Thanks,
Ming Lei



Re: [Qemu-devel] [PATCH v6 1/3] linux-aio: fix submit aio as a batch

2014-11-25 Thread Ming Lei
On Tue, Nov 25, 2014 at 9:08 PM, Stefan Hajnoczi  wrote:
> On Tue, Nov 25, 2014 at 03:23:11PM +0800, Ming Lei wrote:
>> @@ -296,12 +370,14 @@ void laio_detach_aio_context(void *s_, AioContext *old_context)
>>
>>  aio_set_event_notifier(old_context, &s->e, NULL);
>>  qemu_bh_delete(s->completion_bh);
>> +qemu_bh_delete(s->io_q.abort_bh);
>>  }
>>
>>  void laio_attach_aio_context(void *s_, AioContext *new_context)
>>  {
>>  struct qemu_laio_state *s = s_;
>>
>> +s->io_q.abort_bh = aio_bh_new(new_context, ioq_abort_bh, s);
>>  s->completion_bh = aio_bh_new(new_context, qemu_laio_completion_bh, s);
>>  aio_set_event_notifier(new_context, &s->e, qemu_laio_completion_cb);
>>  }
>
> These functions are incomplete when ->aborting == true.  I can't think

Could you explain a bit when ->aborting is true during the attach callback?

> of a reason why we are guaranteed never to hit that state, and fixing it
> is easy.  Just add the following to the end of
> laio_attach_aio_context():
>
> if (s->aborting) {
> qemu_bh_schedule(s->io_q.abort_bh);
> }

You mean the abort BH may not have a chance to run before its deletion
in the detach callback?

If so, bdrv_drain_all() from bdrv_set_aio_context() should have
handled the pending BH, right?


Thanks,
Ming Lei



[Qemu-devel] [PATCH v6 2/3] linux-aio: handling -EAGAIN for !s->io_q.plugged case

2014-11-24 Thread Ming Lei
Previously, -EAGAIN was simply ignored in the !s->io_q.plugged case,
and it could easily cause -EIO to be returned to the VM, e.g. with an
NVMe device.

This patch handles -EAGAIN via the io queue in the !s->io_q.plugged
case, and the request will be retried in the following aio completion cb.

Reviewed-by: Paolo Bonzini 
Suggested-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 11ac828..ac25722 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -282,8 +282,13 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 s->io_q.iocbs[idx++] = iocb;
 s->io_q.idx = idx;
 
-/* submit immediately if queue depth is above 2/3 */
-if (idx > s->io_q.size * 2 / 3) {
+/*
+ * This is reached in two cases: queue not plugged but io_submit
+ * returned -EAGAIN, or queue plugged.  In the latter case, start
+ * submitting some I/O if the queue is getting too full.  In the
+ * former case, instead, wait until an I/O operation is completed.
+ */
+if (s->io_q.plugged && unlikely(idx > s->io_q.size * 2 / 3)) {
 ioq_submit(s);
 }
 
@@ -348,15 +353,18 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
 }
 io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
 
-if (!s->io_q.plugged) {
-if (io_submit(s->ctx, 1, &iocbs) < 0) {
-goto out_free_aiocb;
-}
-} else {
-if (ioq_enqueue(s, iocbs) < 0) {
+/* Switch to queue mode until -EAGAIN is handled */
+if (!s->io_q.plugged && !s->io_q.idx) {
+int ret = io_submit(s->ctx, 1, &iocbs);
+if (ret >= 0) {
+return &laiocb->common;
+} else if (ret != -EAGAIN) {
 goto out_free_aiocb;
 }
 }
+if (ioq_enqueue(s, iocbs) < 0) {
+goto out_free_aiocb;
+}
 return &laiocb->common;
 
 out_free_aiocb:
-- 
1.7.9.5




[Qemu-devel] [PATCH v6 0/3] linux-aio: fix batch submission

2014-11-24 Thread Ming Lei
The 1st patch fixes batch submission.

The 2nd one fixes -EAGAIN for non-batch case.

The 3rd one is a cleanup.

This patchset is split from the previous patchset (dataplane: optimization
and multi virtqueue support), as suggested by Stefan.

V6:
- don't pass ioq_submit() return value to ioq_enqueue(), as suggested
by Stefan
- fix one build failure introduced in V5, reported by Stefan

V5:
- in case of submission failure, return -EIO for new coming requests
until aborting is handled
- in patch2, follow Paolo's suggestion about ioq_enqueue() changes

V4:
- abort requests in BH to avoid potential "Co-routine re-entered
recursively"
- remove 'enqueue' parameter to ioq_submit() to simplify the change
- beautify code as suggested by Paolo

V3:
- rebase on QEMU master
V2:
- code style fix and commit log fix as suggested by Benoît Canet
V1:
- rebase on latest QEMU master

 block/linux-aio.c |  129 +++--
 1 file changed, 106 insertions(+), 23 deletions(-)

Thanks,
Ming Lei




[Qemu-devel] [PATCH v6 1/3] linux-aio: fix submit aio as a batch

2014-11-24 Thread Ming Lei
In the submit path, we can't complete requests directly,
otherwise "Co-routine re-entered recursively" may be caused,
so this patch fixes the issue with the ideas below:

- for -EAGAIN or partial completion, retry the submission
in the following completion cb, which is run in BH context
- for partial completion, update the io queue too
- in case the io queue is full, submit queued requests
immediately and return failure to the caller
- for other failures, abort all queued requests in BH
context, and requests won't be allowed to be submitted until
the aborting is handled

Reviewed-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |  114 -
 1 file changed, 95 insertions(+), 19 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index d92513b..11ac828 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -38,11 +38,20 @@ struct qemu_laiocb {
 QLIST_ENTRY(qemu_laiocb) node;
 };
 
+/*
+ * TODO: support to batch I/O from multiple bs in one same
+ * AIO context, one important use case is multi-lun scsi,
+ * so in future the IO queue should be per AIO context.
+ */
 typedef struct {
 struct iocb *iocbs[MAX_QUEUED_IO];
 int plugged;
 unsigned int size;
 unsigned int idx;
+
+/* abort queued requests in BH context */
+QEMUBH *abort_bh;
+bool  aborting;
 } LaioQueue;
 
 struct qemu_laio_state {
@@ -59,6 +68,8 @@ struct qemu_laio_state {
 int event_max;
 };
 
+static int ioq_submit(struct qemu_laio_state *s);
+
 static inline ssize_t io_event_ret(struct io_event *ev)
 {
 return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
@@ -91,6 +102,13 @@ static void qemu_laio_process_completion(struct qemu_laio_state *s,
 qemu_aio_unref(laiocb);
 }
 
+static void qemu_laio_start_retry(struct qemu_laio_state *s)
+{
+if (s->io_q.idx) {
+ioq_submit(s);
+}
+}
+
 /* The completion BH fetches completed I/O requests and invokes their
  * callbacks.
  *
@@ -135,6 +153,8 @@ static void qemu_laio_completion_bh(void *opaque)
 
 qemu_laio_process_completion(s, laiocb);
 }
+
+qemu_laio_start_retry(s);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -175,47 +195,99 @@ static void ioq_init(LaioQueue *io_q)
 io_q->size = MAX_QUEUED_IO;
 io_q->idx = 0;
 io_q->plugged = 0;
+io_q->aborting = false;
 }
 
+/* Always return >= 0 and it means how many requests are submitted */
 static int ioq_submit(struct qemu_laio_state *s)
 {
-int ret, i = 0;
+int ret;
 int len = s->io_q.idx;
 
-do {
-ret = io_submit(s->ctx, len, s->io_q.iocbs);
-} while (i++ < 3 && ret == -EAGAIN);
-
-/* empty io queue */
-s->io_q.idx = 0;
+if (!len) {
+return 0;
+}
 
+ret = io_submit(s->ctx, len, s->io_q.iocbs);
 if (ret < 0) {
-i = 0;
-} else {
-i = ret;
+/* retry in following completion cb */
+if (ret == -EAGAIN) {
+return 0;
+}
+
+/*
+ * Abort in BH context for avoiding Co-routine re-entered,
+ * and update io queue at that time
+ */
+qemu_bh_schedule(s->io_q.abort_bh);
+s->io_q.aborting = true;
+ret = 0;
 }
 
-for (; i < len; i++) {
-struct qemu_laiocb *laiocb =
-container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
+/*
+ * update io queue, and retry will be started automatically
+ * in following completion cb for the remainder
+ */
+if (ret > 0) {
+if (ret < len) {
+memmove(&s->io_q.iocbs[0], &s->io_q.iocbs[ret],
+(len - ret) * sizeof(struct iocb *));
+}
+s->io_q.idx -= ret;
+}
 
-laiocb->ret = (ret < 0) ? ret : -EIO;
+return ret;
+}
+
+static void ioq_abort_bh(void *opaque)
+{
+struct qemu_laio_state *s = opaque;
+int i;
+
+for (i = 0; i < s->io_q.idx; i++) {
+struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
+  struct qemu_laiocb,
+  iocb);
+laiocb->ret = -EIO;
 qemu_laio_process_completion(s, laiocb);
 }
-return ret;
+
+s->io_q.idx = 0;
+s->io_q.aborting = false;
 }
 
-static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
+static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 {
 unsigned int idx = s->io_q.idx;
 
+/* Request can't be allowed to submit until aborting is handled */
+if (unlikely(s->io_q.aborting)) {
+return -EIO;
+}
+
+if (unlikely(idx == s->io_q.size)) {
+ioq_submit(s);
+
+if (unlikely(s->io_q.aborting)) {
+return -EIO;
+}
+idx = s

[Qemu-devel] [PATCH v6 3/3] linux-aio: remove 'node' from 'struct qemu_laiocb'

2014-11-24 Thread Ming Lei
No one uses the 'node' field any more, so remove it
from 'struct qemu_laiocb'; this saves 16 bytes in the
struct on 64-bit arches.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index ac25722..0f96610 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -35,7 +35,6 @@ struct qemu_laiocb {
 size_t nbytes;
 QEMUIOVector *qiov;
 bool is_read;
-QLIST_ENTRY(qemu_laiocb) node;
 };
 
 /*
-- 
1.7.9.5




Re: [Qemu-devel] [PATCH v5 0/3] linux-aio: fix batch submission

2014-11-24 Thread Ming Lei
On Mon, Nov 24, 2014 at 10:31 PM, Paolo Bonzini  wrote:
>
>
> On 24/11/2014 15:29, Ming Lei wrote:
>> The 1st patch fixes batch submission.
>>
>> The 2nd one fixes -EAGAIN for non-batch case.
>>
>> The 3rd one is a cleanup.
>>
>> This patchset is split from the previous patchset (dataplane: optimization
>> and multi virtqueue support), as suggested by Stefan.
>>
>> V5:
>> - in case of submission failure, return -EIO for new coming requests
>>   until aborting is handled
>>   - in patch2, follow Paolo's suggestion about ioq_enqueue() changes
>>
>> V4:
>>   - abort requests in BH to avoid potential "Co-routine re-entered
>> recursively"
>>   - remove 'enqueue' parameter to ioq_submit() to simplify the change
>>   - beautify code as suggested by Paolo
>>
>> v3:
>> - rebase on QEMU master
>> v2:
>> - code style fix and commit log fix as suggested by Benoît Canet
>> v1:
>> - rebase on latest QEMU master
>>
>>  block/linux-aio.c |  131 +++--
>>  1 file changed, 107 insertions(+), 24 deletions(-)
>>
>> Thanks,
>> Ming Lei
>>
>>
>
> Reviewed-by: Paolo Bonzini 

Great thanks!



[Qemu-devel] [PATCH v5 3/3] linux-aio: remove 'node' from 'struct qemu_laiocb'

2014-11-24 Thread Ming Lei
No one uses the 'node' field any more, so remove it
from 'struct qemu_laiocb'; this saves 16 bytes in the
struct on 64-bit arches.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 0cb098d..f1b26f5 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -35,7 +35,6 @@ struct qemu_laiocb {
 size_t nbytes;
 QEMUIOVector *qiov;
 bool is_read;
-QLIST_ENTRY(qemu_laiocb) node;
 };
 
 /*
-- 
1.7.9.5




[Qemu-devel] [PATCH v5 2/3] linux-aio: handling -EAGAIN for !s->io_q.plugged case

2014-11-24 Thread Ming Lei
Previously, -EAGAIN was simply ignored in the !s->io_q.plugged case,
and it could easily cause -EIO to be returned to the VM, e.g. with an
NVMe device.

This patch handles -EAGAIN via the io queue in the !s->io_q.plugged
case, and the request will be retried in the following aio completion cb.

Suggested-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 70312a4..0cb098d 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -282,12 +282,17 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 s->io_q.iocbs[idx++] = iocb;
 s->io_q.idx = idx;
 
-/* submit immediately if queue depth is above 2/3 */
-if (idx > s->io_q.size * 2 / 3) {
-return ioq_submit(s);
+/*
+ * This is reached in two cases: queue not plugged but io_submit
+ * returned -EAGAIN, or queue plugged.  In the latter case, start
+ * submitting some I/O if the queue is getting too full.  In the
+ * former case, instead, wait until an I/O operation is completed.
+ */
+if (!s->io_q.plugged || likely(idx < s->io_q.size * 2 / 3)) {
+return 0;
 }
 
-return 0;
+return ioq_submit(s);
 }
 
 void laio_io_plug(BlockDriverState *bs, void *aio_ctx)
@@ -348,15 +353,18 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
 }
 io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
 
-if (!s->io_q.plugged) {
-if (io_submit(s->ctx, 1, &iocbs) < 0) {
-goto out_free_aiocb;
-}
-} else {
-if (ioq_enqueue(s, iocbs) < 0) {
+/* Switch to queue mode until -EAGAIN is handled */
+if (!s->io_q.plugged && !s->io_q.idx) {
+int ret = io_submit(s->ctx, 1, &iocbs);
+if (ret >= 0) {
+return &laiocb->common;
+} else if (ret != -EAGAIN) {
 goto out_free_aiocb;
 }
 }
+if (ioq_enqueue(s, iocbs) < 0) {
+goto out_free_aiocb;
+}
 return &laiocb->common;
 
 out_free_aiocb:
-- 
1.7.9.5




[Qemu-devel] [PATCH v5 0/3] linux-aio: fix batch submission

2014-11-24 Thread Ming Lei
The 1st patch fixes batch submission.

The 2nd one fixes -EAGAIN for non-batch case.

The 3rd one is a cleanup.

This patchset is split from the previous patchset (dataplane: optimization
and multi virtqueue support), as suggested by Stefan.

V5:
- in case of submission failure, return -EIO for new coming requests
  until aborting is handled
- in patch2, follow Paolo's suggestion about ioq_enqueue() changes

V4:
- abort requests in BH to avoid potential "Co-routine re-entered
recursively"
- remove 'enqueue' parameter to ioq_submit() to simplify the change
- beautify code as suggested by Paolo

v3:
- rebase on QEMU master
v2:
- code style fix and commit log fix as suggested by Benoît Canet
v1:
- rebase on latest QEMU master

 block/linux-aio.c |  131 +++--
 1 file changed, 107 insertions(+), 24 deletions(-)

Thanks,
Ming Lei





[Qemu-devel] [PATCH v5 1/3] linux-aio: fix submit aio as a batch

2014-11-24 Thread Ming Lei
In the submit path, we can't complete requests directly,
otherwise "Co-routine re-entered recursively" may be caused,
so this patch fixes the issue with the ideas below:

- for -EAGAIN or partial completion, retry the submission
in the following completion cb, which is run in BH context
- for partial completion, update the io queue too
- in case the io queue is full, submit queued requests
immediately and return failure to the caller
- for other failures, abort all queued requests in BH
context, and requests won't be allowed to be submitted until
the aborting is handled

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |  116 -
 1 file changed, 96 insertions(+), 20 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index d92513b..70312a4 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -38,11 +38,20 @@ struct qemu_laiocb {
 QLIST_ENTRY(qemu_laiocb) node;
 };
 
+/*
+ * TODO: support to batch I/O from multiple bs in one same
+ * AIO context, one important use case is multi-lun scsi,
+ * so in future the IO queue should be per AIO context.
+ */
 typedef struct {
 struct iocb *iocbs[MAX_QUEUED_IO];
 int plugged;
 unsigned int size;
 unsigned int idx;
+
+/* abort queued requests in BH context */
+QEMUBH *abort_bh;
+bool  aborting;
 } LaioQueue;
 
 struct qemu_laio_state {
@@ -59,6 +68,8 @@ struct qemu_laio_state {
 int event_max;
 };
 
+static int ioq_submit(struct qemu_laio_state *s);
+
 static inline ssize_t io_event_ret(struct io_event *ev)
 {
 return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
@@ -91,6 +102,13 @@ static void qemu_laio_process_completion(struct qemu_laio_state *s,
 qemu_aio_unref(laiocb);
 }
 
+static void qemu_laio_start_retry(struct qemu_laio_state *s)
+{
+if (s->io_q.idx) {
+ioq_submit(s);
+}
+}
+
 /* The completion BH fetches completed I/O requests and invokes their
  * callbacks.
  *
@@ -135,6 +153,8 @@ static void qemu_laio_completion_bh(void *opaque)
 
 qemu_laio_process_completion(s, laiocb);
 }
+
+qemu_laio_start_retry(s);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -175,47 +195,99 @@ static void ioq_init(LaioQueue *io_q)
 io_q->size = MAX_QUEUED_IO;
 io_q->idx = 0;
 io_q->plugged = 0;
+io_q->aborting = false;
 }
 
+/* Always return >= 0 and it means how many requests are submitted */
 static int ioq_submit(struct qemu_laio_state *s)
 {
-int ret, i = 0;
+int ret;
 int len = s->io_q.idx;
 
-do {
-ret = io_submit(s->ctx, len, s->io_q.iocbs);
-} while (i++ < 3 && ret == -EAGAIN);
-
-/* empty io queue */
-s->io_q.idx = 0;
+if (!len) {
+return 0;
+}
 
+ret = io_submit(s->ctx, len, s->io_q.iocbs);
 if (ret < 0) {
-i = 0;
-} else {
-i = ret;
+/* retry in following completion cb */
+if (ret == -EAGAIN) {
+return 0;
+}
+
+/*
+ * Abort in BH context for avoiding Co-routine re-entered,
+ * and update io queue at that time
+ */
+qemu_bh_schedule(s->io_q.abort_bh);
+s->io_q.aborting = true;
+ret = 0;
 }
 
-for (; i < len; i++) {
-struct qemu_laiocb *laiocb =
-container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
+/*
+ * update io queue, and retry will be started automatically
+ * in following completion cb for the remainder
+ */
+if (ret > 0) {
+if (ret < len) {
+memmove(&s->io_q.iocbs[0], &s->io_q.iocbs[ret],
+(len - ret) * sizeof(struct iocb *));
+}
+s->io_q.idx -= ret;
+}
 
-laiocb->ret = (ret < 0) ? ret : -EIO;
+return ret;
+}
+
+static void ioq_abort_bh(void *opaque)
+{
+struct qemu_laio_state *s = opaque;
+int i;
+
+for (i = 0; i < s->io_q.idx; i++) {
+struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
+  struct qemu_laiocb,
+  iocb);
+laiocb->ret = -EIO;
 qemu_laio_process_completion(s, laiocb);
 }
-return ret;
+
+s->io_q.idx = 0;
+io_q->aborting = false;
 }
 
-static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
+static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 {
 unsigned int idx = s->io_q.idx;
 
+/* Request can't be allowed to submit until aborting is handled */
+if (unlikely(s->io_q.aborting)) {
+return -EIO;
+}
+
+if (unlikely(idx == s->io_q.size)) {
+ioq_submit(s);
+
+if (unlikely(s->io_q.aborting)) {
+return -EIO;
+}
+idx = s->io_q.idx;
+}
+
+/* It

Re: [Qemu-devel] [PATCH v4 1/3] linux-aio: fix submit aio as a batch

2014-11-24 Thread Ming Lei
On Mon, Nov 24, 2014 at 7:39 PM, Paolo Bonzini  wrote:
>
>
> On 24/11/2014 12:31, Ming Lei wrote:
>> +static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>>  {
>>  unsigned int idx = s->io_q.idx;
>>
>> +if (unlikely(idx == s->io_q.size)) {
>> +ioq_submit(s);
>> +return -EAGAIN;
>
> Only return -EAGAIN if ioq_submit(s) returns 0?  Otherwise reload idx
> and go on.

Good point.

Thanks,
Ming Lei



Re: [Qemu-devel] [PATCH v4 2/3] linux-aio: handling -EAGAIN for !s->io_q.plugged case

2014-11-24 Thread Ming Lei
On Mon, Nov 24, 2014 at 7:38 PM, Paolo Bonzini  wrote:
>
>
> On 24/11/2014 12:31, Ming Lei wrote:
>> +/* don't submit until next completion for -EAGAIN of non plug case */
>> +if (unlikely(!s->io_q.plugged)) {
>> +return 0;
>> +}
>> +
>>  /* submit immediately if queue depth is above 2/3 */
>>  if (idx > s->io_q.size * 2 / 3) {
>>  return ioq_submit(s);
>   }
>
>   return 0;
>
> so I fail to see why my proposal:
>
>
> /* This is reached in two cases: queue not plugged but io_submit
>  * returned -EAGAIN, or queue plugged.  In the latter case, start
>  * submitting some I/O if the queue is getting too full.  In the
>  * former case, instead, wait until an I/O operation is completed.
>  */
> if (likely(idx <= s->io_q.size * 2 / 3) || unlikely(!s->io_q.plugged)) {
> return 0;
> }
>
> return ioq_submit(s);
>
> was wrong.  Can you explain?

I didn't say your proposal is wrong, and this patch is correct too,
just without the fat comment.

The only difference is that this patch returns immediately in the
!s->io_q.plugged case after putting the request into the io queue.

Thanks,



[Qemu-devel] [PATCH v4 3/3] linux-aio: remove 'node' from 'struct qemu_laiocb'

2014-11-24 Thread Ming Lei
No one uses the 'node' field any more, so remove it
from 'struct qemu_laiocb'; this saves 16 bytes in the
struct on 64-bit arches.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 974e4f9..f36e96c 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -35,7 +35,6 @@ struct qemu_laiocb {
 size_t nbytes;
 QEMUIOVector *qiov;
 bool is_read;
-QLIST_ENTRY(qemu_laiocb) node;
 };
 
 /*
-- 
1.7.9.5




[Qemu-devel] [PATCH v4 0/3] linux-aio: fix batch submission

2014-11-24 Thread Ming Lei
The 1st patch fixes batch submission.

The 2nd one fixes -EAGAIN for non-batch case.

The 3rd one is a cleanup.

This patchset is split from the previous patchset (dataplane: optimization
and multi virtqueue support), as suggested by Stefan.

V4:
- abort requests in BH to avoid potential "Co-routine re-entered
recursively"
- remove 'enqueue' parameter to ioq_submit() to simplify the change
- beautify code as suggested by Paolo

v3:
- rebase on QEMU master
v2:
- code style fix and commit log fix as suggested by Benoît Canet
v1:
- rebase on latest QEMU master

 block/linux-aio.c |  108 +
 1 file changed, 84 insertions(+), 24 deletions(-)

Thanks
Ming Lei





[Qemu-devel] [PATCH v4 2/3] linux-aio: handling -EAGAIN for !s->io_q.plugged case

2014-11-24 Thread Ming Lei
Previously, -EAGAIN was simply ignored in the !s->io_q.plugged case,
and it could easily cause -EIO to be returned to the VM, e.g. with an
NVMe device.

This patch handles -EAGAIN via the io queue in the !s->io_q.plugged
case, and the request will be retried in the following aio completion cb.

Suggested-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 11fcedb..974e4f9 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -259,6 +259,11 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 s->io_q.iocbs[idx++] = iocb;
 s->io_q.idx = idx;
 
+/* don't submit until next completion for -EAGAIN of non plug case */
+if (unlikely(!s->io_q.plugged)) {
+return 0;
+}
+
 /* submit immediately if queue depth is above 2/3 */
 if (idx > s->io_q.size * 2 / 3) {
 return ioq_submit(s);
@@ -325,15 +330,18 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
 }
 io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
 
-if (!s->io_q.plugged) {
-if (io_submit(s->ctx, 1, &iocbs) < 0) {
-goto out_free_aiocb;
-}
-} else {
-if (ioq_enqueue(s, iocbs) < 0) {
+/* Switch to queue mode until -EAGAIN is handled */
+if (!s->io_q.plugged && !s->io_q.idx) {
+int ret = io_submit(s->ctx, 1, &iocbs);
+if (ret >= 0) {
+return &laiocb->common;
+} else if (ret != -EAGAIN) {
 goto out_free_aiocb;
 }
 }
+if (ioq_enqueue(s, iocbs) < 0) {
+goto out_free_aiocb;
+}
 return &laiocb->common;
 
 out_free_aiocb:
-- 
1.7.9.5




[Qemu-devel] [PATCH v4 1/3] linux-aio: fix submit aio as a batch

2014-11-24 Thread Ming Lei
In the submit path, we can't complete requests directly,
otherwise "Co-routine re-entered recursively" may be caused,
so this patch fixes the issue with the ideas below:

- for -EAGAIN or partial completion, retry the submission
in the following completion cb, which is run in BH context
- for partial completion, update the io queue too
- in case the io queue is full, submit queued requests
immediately and return failure to the caller
- for other failures, abort all queued requests in BH context

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   93 +
 1 file changed, 73 insertions(+), 20 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index d92513b..11fcedb 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -38,11 +38,19 @@ struct qemu_laiocb {
 QLIST_ENTRY(qemu_laiocb) node;
 };
 
+/*
+ * TODO: support to batch I/O from multiple bs in one same
+ * AIO context, one important use case is multi-lun scsi,
+ * so in future the IO queue should be per AIO context.
+ */
 typedef struct {
 struct iocb *iocbs[MAX_QUEUED_IO];
 int plugged;
 unsigned int size;
 unsigned int idx;
+
+/* abort queued requests in BH context */
+QEMUBH *abort_bh;
 } LaioQueue;
 
 struct qemu_laio_state {
@@ -59,6 +67,8 @@ struct qemu_laio_state {
 int event_max;
 };
 
+static int ioq_submit(struct qemu_laio_state *s);
+
 static inline ssize_t io_event_ret(struct io_event *ev)
 {
 return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
@@ -91,6 +101,13 @@ static void qemu_laio_process_completion(struct qemu_laio_state *s,
 qemu_aio_unref(laiocb);
 }
 
+static void qemu_laio_start_retry(struct qemu_laio_state *s)
+{
+if (s->io_q.idx) {
+ioq_submit(s);
+}
+}
+
 /* The completion BH fetches completed I/O requests and invokes their
  * callbacks.
  *
@@ -135,6 +152,8 @@ static void qemu_laio_completion_bh(void *opaque)
 
 qemu_laio_process_completion(s, laiocb);
 }
+
+qemu_laio_start_retry(s);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -177,45 +196,75 @@ static void ioq_init(LaioQueue *io_q)
 io_q->plugged = 0;
 }
 
+/* always return >= 0 */
 static int ioq_submit(struct qemu_laio_state *s)
 {
-int ret, i = 0;
+int ret;
 int len = s->io_q.idx;
 
-do {
-ret = io_submit(s->ctx, len, s->io_q.iocbs);
-} while (i++ < 3 && ret == -EAGAIN);
-
-/* empty io queue */
-s->io_q.idx = 0;
+if (!len) {
+return 0;
+}
 
+ret = io_submit(s->ctx, len, s->io_q.iocbs);
 if (ret < 0) {
-i = 0;
-} else {
-i = ret;
+/* retry in following completion cb */
+if (ret == -EAGAIN) {
+return 0;
+}
+
+/* abort in BH context for avoiding Co-routine re-entered */
+qemu_bh_schedule(s->io_q.abort_bh);
+ret = len;
 }
 
-for (; i < len; i++) {
-struct qemu_laiocb *laiocb =
-container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
+/*
+ * update io queue, and retry will be started automatically
+ * in following completion cb for the remainder
+ */
+if (ret > 0) {
+if (ret < len) {
+memmove(&s->io_q.iocbs[0], &s->io_q.iocbs[ret],
+(len - ret) * sizeof(struct iocb *));
+}
+s->io_q.idx -= ret;
+}
+
+return ret;
+}
+
+static void ioq_abort_bh(void *opaque)
+{
+struct qemu_laio_state *s = opaque;
+int i;
 
-laiocb->ret = (ret < 0) ? ret : -EIO;
+for (i = 0; i < s->io_q.idx; i++) {
+struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
+  struct qemu_laiocb,
+  iocb);
+laiocb->ret = -EIO;
 qemu_laio_process_completion(s, laiocb);
 }
-return ret;
 }
 
-static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
+static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 {
 unsigned int idx = s->io_q.idx;
 
+if (unlikely(idx == s->io_q.size)) {
+ioq_submit(s);
+return -EAGAIN;
+}
+
 s->io_q.iocbs[idx++] = iocb;
 s->io_q.idx = idx;
 
-/* submit immediately if queue is full */
-if (idx == s->io_q.size) {
-ioq_submit(s);
+/* submit immediately if queue depth is above 2/3 */
+if (idx > s->io_q.size * 2 / 3) {
+return ioq_submit(s);
 }
+
+return 0;
 }
 
 void laio_io_plug(BlockDriverState *bs, void *aio_ctx)
@@ -281,7 +330,9 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
 goto out_free_aiocb;
 }
 } else {
-ioq_enqueue(s, iocbs);
+if (ioq_enqueue(s, iocbs) < 0) {
+goto out_free

Re: [Qemu-devel] [PATCH v3 2/3] linux-aio: handling -EAGAIN for !s->io_q.plugged case

2014-11-24 Thread Ming Lei
On Mon, Nov 24, 2014 at 5:52 PM, Paolo Bonzini  wrote:
>
>
> On 18/11/2014 15:06, Paolo Bonzini wrote:
>>> > +/* don't submit until next completion for -EAGAIN of non plug case */
>>> > +if (unlikely(!s->io_q.plugged)) {
>>> > +return 0;
>>> > +}
>>> > +
>> Is this an optimization or a fix for something?
>
> Ah, if !io_q.plugged we must be in the -EAGAIN case, so no need to
> "submit immediately if queue depth is above 2/3".  Can you rewrite the

The above comment is only for the io_q.plugged case; for !io_q.plugged
with -EAGAIN, the requests are submitted in the retry path.

> comment like this:
>
> /* This is reached in two cases: queue not plugged but io_submit
>  * returned -EAGAIN, or queue plugged.  In the latter case, start
>  * submitting some I/O if the queue is getting too full.  In the
>  * former case, instead, wait until an I/O operation is completed.
>  */
> if (likely(idx <= s->io_q.size * 2 / 3) || unlikely(!s->io_q.plugged) {
> return 0;
>     }

Yes, the above change needs the corresponding comment, but this
patch doesn't, since the 1st case won't reach here, :-)

Thanks,
Ming Lei



Re: [Qemu-devel] [PATCH v3 1/3] linux-aio: fix submit aio as a batch

2014-11-24 Thread Ming Lei
On Mon, Nov 24, 2014 at 5:47 PM, Paolo Bonzini  wrote:
>
>
> On 22/11/2014 13:16, Ming Lei wrote:
>> > > +static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>> > >  {
>> > >  unsigned int idx = s->io_q.idx;
>> > >
>> > > +if (unlikely(idx == s->io_q.size)) {
>> > > +return -1;
>> >
>> > return -EAGAIN?
>>
>> It means the io queue is full, so the code has to fail the current
>> request.
>
> Right, but "-1" is "-EPERM" which doesn't make much sense.  I think the
> right thing to do is to return EAGAIN, and let the owner figure it out.
>  For example, a SCSI device might return a "BUSY" status code.

We can do that, but laio_submit() doesn't return the code to callers.
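
For illustration, here is a standalone sketch (plain C, hypothetical;
the real laio_submit() does not do this) of propagating the queue-full
error to the caller, so that e.g. a SCSI device model could turn
-EAGAIN into a BUSY status:

#include <errno.h>
#include <stdio.h>

static int queue_len;
static const int queue_cap = 4;

/* toy submit path: hand -EAGAIN back instead of swallowing it */
static int submit_io(void)
{
    if (queue_len == queue_cap) {
        return -EAGAIN;        /* queue full: let the caller decide */
    }
    queue_len++;
    return 0;
}

int main(void)
{
    int i;

    for (i = 0; i < 6; i++) {
        if (submit_io() == -EAGAIN) {
            printf("request %d: would report BUSY to the guest\n", i);
        } else {
            printf("request %d: queued\n", i);
        }
    }
    return 0;
}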


Thanks,
Ming Lei



Re: [Qemu-devel] [PATCH 00/13] linux-aio/virtio-scsi: support AioContext wide IO submission as batch

2014-11-22 Thread Ming Lei
On Tue, Nov 18, 2014 at 9:57 PM, Paolo Bonzini  wrote:
>
>
> On 09/11/2014 08:42, Ming Lei wrote:
>> This patch implements AioContext wide IO submission as batch, and
>> the idea behind is very simple:
>>
>>   - linux native aio(io_submit) supports to enqueue read/write requests
>>   to different files
>>
>>   - in one AioContext, I/O requests from VM can be submitted to different
>>   backend in host, one typical example is multi-lun scsi
>>
>> This patch changes 'struct qemu_laio_state' as per AioContext, and
>> multiple 'bs' can be associted with one single instance of
>> 'struct qemu_laio_state', then AioContext wide IO submission as batch
>> becomes easy to implement.
>>
>> One simple test in my laptop shows ~20% throughput improvement
>> on randread from VM(using AioContext wide IO batch vs. not using io batch)
>> with below config:
>>
>>   -drive 
>> id=drive_scsi1-0-0-0,if=none,format=raw,cache=none,aio=native,file=/dev/nullb2
>>  \
>>   -drive 
>> id=drive_scsi1-0-0-1,if=none,format=raw,cache=none,aio=native,file=/dev/nullb3
>>  \
>>   -device 
>> virtio-scsi-pci,num_queues=4,id=scsi1,addr=07,iothread=iothread0 \
>>   -device 
>> scsi-disk,bus=scsi1.0,channel=0,scsi-id=1,lun=0,drive=drive_scsi1-0-0-0,id=scsi1-0-0-0
>>  \
>>   -device 
>> scsi-disk,bus=scsi1.0,channel=0,scsi-id=1,lun=1,drive=drive_scsi1-0-0-1,id=scsi1-0-0-1
>>  \
>>
>> BTW, maybe more boost can be obtained since ~33K/sec write() system call
>> can be observed when this test case is running, and it might be a recent
>> regression(BH?).
>
> Ming,
>
> these patches are interesting.  I would like to compare them with the
> opposite approach (and, I think, more similar to your old work) where
> the qemu_laio_state API is moved entirely into AioContext, with lazy
> allocation (reference-counted too, probably).

Yes, it can be done in that way, but the feature is Linux native AIO
specific, so it might not be good to put it into AioContext.

Basically most of the implementation should be the same, and the
difference is where the io queue is put.

>
> Most of the patches would be the same, but you would replace
> aio_attach_aio_bs/aio_detach_aio_bs with something like
> aio_native_get/aio_native_unref.  Ultimately block/{linux,win32}-aio.c
> could be merged into block/aio-{posix,win32}.c, but you do not have to
> do that now.
>
> Could you try that?  This way we can see which API turns out to be nicer.

I can try that; could you share which APIs you prefer?

IMO, the APIs can be defined flexibly in this patchset, and the
AioContext parameter alone is enough.
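
For illustration, a minimal sketch of the lazy, reference-counted
per-AioContext state suggested above: the names aio_native_get() and
aio_native_unref() come from Paolo's mail, everything else is an
assumption written as standalone C rather than real QEMU code.

#include <stdlib.h>

typedef struct AioNativeState {
    int refcnt;
    /* the io_context_t, io queue and events array would live here */
} AioNativeState;

typedef struct ToyAioContext {
    AioNativeState *native;    /* lazily allocated on first get */
} ToyAioContext;

static AioNativeState *aio_native_get(ToyAioContext *ctx)
{
    if (!ctx->native) {
        /* first user in this context allocates the shared state */
        ctx->native = calloc(1, sizeof(*ctx->native));
    }
    ctx->native->refcnt++;
    return ctx->native;
}

static void aio_native_unref(ToyAioContext *ctx)
{
    if (ctx->native && --ctx->native->refcnt == 0) {
        free(ctx->native);     /* last user releases everything */
        ctx->native = NULL;
    }
}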

Thanks,



Re: [Qemu-devel] [PATCH v3 2/3] linux-aio: handling -EAGAIN for !s->io_q.plugged case

2014-11-22 Thread Ming Lei
On Tue, Nov 18, 2014 at 10:06 PM, Paolo Bonzini  wrote:
>
>
> On 06/11/2014 16:10, Ming Lei wrote:
>> +/* don't submit until next completion for -EAGAIN of non plug case */
>> +if (unlikely(!s->io_q.plugged)) {
>> +return 0;
>> +}
>> +
>
> Is this an optimization or a fix for something?

It is for avoiding an unnecessary submission which would just cause
another -EAGAIN.

>
>> +/*
>> + * Switch to queue mode until -EAGAIN is handled, we suppose
>> + * there is always uncompleted I/O, so try to enqueue it first,
>> + * and will be submitted again in following aio completion cb.
>> + */
>> +if (ret == -EAGAIN) {
>> +goto enqueue;
>> +} else if (ret < 0) {
>>  goto out_free_aiocb;
>>  }
>
> Better:
>
>  if (!s->io_q.plugged && !s->io_q.idx) {
> ret = io_submit(s->ctx, 1, &iocbs);
> if (ret >= 0) {
> return &laiocb->common;
> }
> if (ret != -EAGAIN) {
> goto out_free_aiocb;
> }
> }

Right.

Thanks,
Ming Lei



Re: [Qemu-devel] [PATCH v3 1/3] linux-aio: fix submit aio as a batch

2014-11-22 Thread Ming Lei
On Tue, Nov 18, 2014 at 10:18 PM, Paolo Bonzini  wrote:
>
>
>> @@ -137,6 +145,12 @@ static void qemu_laio_completion_bh(void *opaque)
>>  }
>>  }
>>
>> +static void qemu_laio_start_retry(struct qemu_laio_state *s)
>> +{
>> +if (s->io_q.idx)
>> +qemu_bh_schedule(s->io_q.retry);
>> +}
>> +
>>  static void qemu_laio_completion_cb(EventNotifier *e)
>>  {
>>  struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);
>> @@ -144,6 +158,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
>>  if (event_notifier_test_and_clear(&s->e)) {
>>  qemu_bh_schedule(s->completion_bh);
>>  }
>> +qemu_laio_start_retry(s);
>
> I think you do not even need two bottom halves.  Just call ioq_submit
> from completion_bh instead, after the call to io_getevents.

Yes, that can save one BH; actually the patch was written when
there was no completion BH, :-)

>
>>  }
>>
>>  static void laio_cancel(BlockAIOCB *blockacb)
>> @@ -163,6 +178,9 @@ static void laio_cancel(BlockAIOCB *blockacb)
>>  }
>>
>>  laiocb->common.cb(laiocb->common.opaque, laiocb->ret);
>> +
>> +/* check if there are requests in io queue */
>> +qemu_laio_start_retry(laiocb->ctx);
>>  }
>>
>>  static const AIOCBInfo laio_aiocb_info = {
>> @@ -177,45 +195,80 @@ static void ioq_init(LaioQueue *io_q)
>>  io_q->plugged = 0;
>>  }
>>
>> -static int ioq_submit(struct qemu_laio_state *s)
>> +static void abort_queue(struct qemu_laio_state *s)
>> +{
>> +int i;
>> +for (i = 0; i < s->io_q.idx; i++) {
>> +struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
>> +  struct qemu_laiocb,
>> +  iocb);
>> +laiocb->ret = -EIO;
>> +qemu_laio_process_completion(s, laiocb);
>> +}
>> +}
>> +
>> +static int ioq_submit(struct qemu_laio_state *s, bool enqueue)
>>  {
>>  int ret, i = 0;
>>  int len = s->io_q.idx;
>> +int j = 0;
>>
>> -do {
>> -ret = io_submit(s->ctx, len, s->io_q.iocbs);
>> -} while (i++ < 3 && ret == -EAGAIN);
>> +if (!len) {
>> +return 0;
>> +}
>>
>> -/* empty io queue */
>> -s->io_q.idx = 0;
>> +ret = io_submit(s->ctx, len, s->io_q.iocbs);
>> +if (ret == -EAGAIN) { /* retry in following completion cb */
>> +return 0;
>> +} else if (ret < 0) {
>> +if (enqueue) {
>> +return ret;
>> +}
>>
>> -if (ret < 0) {
>> -i = 0;
>> -} else {
>> -i = ret;
>> +/* in non-queue path, all IOs have to be completed */
>> +abort_queue(s);
>> +ret = len;
>> +} else if (ret == 0) {
>> +goto out;
>
> No need for goto; just move the "for" loop inside this conditional.  Or
> better, just use memmove.  That is:
>
> if (ret < 0) {
> if (ret == -EAGAIN) {
> return 0;
> }
> if (enqueue) {
> return ret;
> }
>
> abort_queue(s);
> ret = len;
> }
>
> if (ret > 0) {
> memmove(...)
> s->io_q.idx -= ret;
> }
> return ret;

The above is better.

>> + * update io queue, for partial completion, retry will be
>> + * started automatically in following completion cb.
>> + */
>> +s->io_q.idx -= ret;
>> +
>>  return ret;
>>  }
>>
>> -static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>> +static void ioq_submit_retry(void *opaque)
>> +{
>> +struct qemu_laio_state *s = opaque;
>> +ioq_submit(s, false);
>> +}
>> +
>> +static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>>  {
>>  unsigned int idx = s->io_q.idx;
>>
>> +if (unlikely(idx == s->io_q.size)) {
>> +return -1;
>
> return -EAGAIN?

It means the io queue is full, so the code has to fail the current
request.

Thanks,
Ming Lei



[Qemu-devel] [PATCH] virtio-scsi: dataplane: suppress guest notification

2014-11-11 Thread Ming Lei
This patch uses vring_should_notify() to suppress
guest notification; it looks like the notification frequency
can be decreased from ~33K/sec to ~2K/sec in my test
environment.

Suggested-by: Stefan Hajnoczi 
Signed-off-by: Ming Lei 
---
 hw/scsi/virtio-scsi-dataplane.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index df17229..be3b042 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -93,9 +93,14 @@ VirtIOSCSIReq *virtio_scsi_pop_req_vring(VirtIOSCSI *s,
 
 void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req)
 {
+VirtIODevice *vdev = VIRTIO_DEVICE(req->vring->parent);
+
 vring_push(&req->vring->vring, &req->elem,
req->qsgl.size + req->resp_iov.size);
-event_notifier_set(&req->vring->guest_notifier);
+
+if (vring_should_notify(vdev, &req->vring->vring)) {
+event_notifier_set(&req->vring->guest_notifier);
+}
 }
 
 static void virtio_scsi_iothread_handle_ctrl(EventNotifier *notifier)
-- 
1.7.9.5




Re: [Qemu-devel] [PATCH v1 2/2] virtio-scsi: dataplane: notify guest as batch

2014-11-11 Thread Ming Lei
On Wed, Nov 12, 2014 at 1:28 AM, Stefan Hajnoczi  wrote:
> On Tue, Nov 11, 2014 at 09:17:10AM +0800, Ming Lei wrote:
>> +static void notify_guest_bh(void *opaque)
>> +{
>> +VirtIOSCSI *s = opaque;
>> +unsigned int qid;
>> +uint64_t pending = s->pending_guest_notify;
>> +
>> +s->pending_guest_notify = 0;
>> +
>> +while ((qid = ffsl(pending))) {
>> +qid--;
>> +event_notifier_set(&s->cmd_vrings[qid]->guest_notifier);
>> +pending &= ~(1 << qid);
>> +}
>> +}
>
> Looks like we're not honoring virtio's usual interrupt mitigation
> mechanism (e.g. EVENT_IDX) for virtio-scsi.  Why is
> vring_should_notify() not used?

Yes, that should be used, but in virtio-blk the BH can still suppress
about ~13K/sec of write() calls, and for virtio-scsi it should be more
since there may be more queues.
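
A sketch of what honoring EVENT_IDX inside the notify BH might look
like, reusing vring_should_notify() the same way the standalone
virtio-scsi patch in this thread does; this is an untested assumption,
not part of the posted series:

static void notify_guest_bh(void *opaque)
{
    VirtIOSCSI *s = opaque;
    VirtIODevice *vdev = VIRTIO_DEVICE(s);
    unsigned int qid;
    uint64_t pending = s->pending_guest_notify;

    s->pending_guest_notify = 0;

    while ((qid = ffsl(pending))) {
        qid--;
        /* only kick the guest when EVENT_IDX says it wants a kick */
        if (vring_should_notify(vdev, &s->cmd_vrings[qid]->vring)) {
            event_notifier_set(&s->cmd_vrings[qid]->guest_notifier);
        }
        pending &= ~(1ULL << qid);
    }
}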

>
>> +
>>  static void virtio_scsi_bad_req(void)
>>  {
>>  error_report("wrong size for virtio-scsi headers");
>> @@ -824,7 +839,12 @@ void virtio_scsi_common_realize(DeviceState *dev, Error 
>> **errp,
>>  }
>>
>>  if (s->conf.iothread) {
>> -virtio_scsi_set_iothread(VIRTIO_SCSI(s), s->conf.iothread);
>> +VirtIOSCSI *vis = VIRTIO_SCSI(s);
>> +
>> +QEMU_BUILD_BUG_ON(VIRTIO_PCI_QUEUE_MAX > 64);
>> +virtio_scsi_set_iothread(vis, s->conf.iothread);
>> +vis->pending_guest_notify = 0;
>> +vis->guest_notify_bh = aio_bh_new(vis->ctx, notify_guest_bh, vis);
>
> Have you checked state transitions like PCI or virtio reset?  They need
> to cancel the BH and clear ->pending_guest_notify.
>
> There is also a transition from dataplane to non-dataplane mode for live
> migration.
>
> Please make sure no interrupts are dropped or stale data is kept across
> these transitions.

Good catch; I will make sure of that in the next version, and virtio-blk
dataplane needs to consider that too.
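
For instance, a hypothetical stop/reset hook along the lines Stefan
asks for (the fields reuse the names from the patch and qemu_bh_cancel()
is an existing QEMU helper, but this function itself is an assumption):

static void virtio_scsi_guest_notify_cleanup(VirtIOSCSI *s)
{
    /* drop any scheduled-but-not-yet-run batched notification ... */
    qemu_bh_cancel(s->guest_notify_bh);
    /* ... and forget the stale per-queue pending bits */
    s->pending_guest_notify = 0;
}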

Thanks,



Re: [Qemu-devel] [RFC PATCH 0/8] Add Generic PCI host device update

2014-11-10 Thread Ming Lei
On Tue, Nov 11, 2014 at 11:24 AM, Ming Lei  wrote:
> On Mon, Nov 10, 2014 at 6:00 PM, alvise rigo
>  wrote:
>> Hi Claudio,
>>
>> On Fri, Nov 7, 2014 at 4:40 PM, Claudio Fontana
>>  wrote:
>>>
>>> Hi Alvise,
>>>
>>> I now got to test the series for my use case, in particular to enable the
>>> ARM 64bit OSv guest (OSv's devices come from pci + virtio).
>>>
>>> Could you respin the series, possibly including also Rob's patches,
>>> addressing the issues which have been raised before?
>>
>> Yes, I hope to have something for the next week.
>
> I have tested this serials with Rob's two patches, and it works fine
> on ARMv7 VM, but ARMv8 VM can't boot.  'git bisect' told
> me the 1st bad patch is below one:
>
>   generic_pci: generate dt node after devices init

That is because you didn't support the 'cortex-a57' and 'host' CPU models,
so please add them in your next version.

Thanks
Ming Lei



Re: [Qemu-devel] [RFC PATCH 0/8] Add Generic PCI host device update

2014-11-10 Thread Ming Lei
On Mon, Nov 10, 2014 at 6:00 PM, alvise rigo
 wrote:
> Hi Claudio,
>
> On Fri, Nov 7, 2014 at 4:40 PM, Claudio Fontana
>  wrote:
>>
>> Hi Alvise,
>>
>> I now got to test the series for my use case, in particular to enable the
>> ARM 64bit OSv guest (OSv's devices come from pci + virtio).
>>
>> Could you respin the series, possibly including also Rob's patches,
>> addressing the issues which have been raised before?
>
> Yes, I hope to have something for the next week.

I have tested this series with Rob's two patches, and it works fine
on an ARMv7 VM, but an ARMv8 VM can't boot.  'git bisect' told
me the first bad patch is the one below:

  generic_pci: generate dt node after devices init


Thanks,
-- 
Ming Lei



Re: [Qemu-devel] [PATCH v1 2/2] virtio-scsi: dataplane: notify guest as batch

2014-11-10 Thread Ming Lei
On Tue, Nov 11, 2014 at 10:52 AM, Fam Zheng  wrote:
> On Tue, 11/11 10:29, Ming Lei wrote:
>> On Tue, Nov 11, 2014 at 9:52 AM, Fam Zheng  wrote:
>> > On Tue, 11/11 09:17, Ming Lei wrote:
>> >> It isn't necessery to notify guest each time when one request
>> >> is completed, and it should be enough to just notify one time
>> >> for each running of virtio_scsi_iothread_handle_cmd().
>> >>
>> >> This patch supresses about 30K/sec write on eventfd.
>> >>
>> >> Signed-off-by: Ming Lei 
>> >> ---
>> >>  hw/scsi/virtio-scsi-dataplane.c |4 +++-
>> >>  hw/scsi/virtio-scsi.c   |   26 +-
>> >>  include/hw/virtio/virtio-scsi.h |4 
>> >>  3 files changed, 32 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/hw/scsi/virtio-scsi-dataplane.c 
>> >> b/hw/scsi/virtio-scsi-dataplane.c
>> >> index df17229..294515a 100644
>> >> --- a/hw/scsi/virtio-scsi-dataplane.c
>> >> +++ b/hw/scsi/virtio-scsi-dataplane.c
>> >> @@ -62,6 +62,7 @@ static VirtIOSCSIVring 
>> >> *virtio_scsi_vring_init(VirtIOSCSI *s,
>> >>  aio_set_event_notifier(s->ctx, &r->host_notifier, handler);
>> >>
>> >>  r->parent = s;
>> >> +r->qid = n;
>> >>
>> >>  if (!vring_setup(&r->vring, VIRTIO_DEVICE(s), n)) {
>> >>  fprintf(stderr, "virtio-scsi: VRing setup failed\n");
>> >> @@ -95,7 +96,8 @@ void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req)
>> >>  {
>> >>  vring_push(&req->vring->vring, &req->elem,
>> >> req->qsgl.size + req->resp_iov.size);
>> >> -event_notifier_set(&req->vring->guest_notifier);
>> >> +req->dev->pending_guest_notify |= 1 << (req->vring->qid - 2);
>> >> +qemu_bh_schedule(req->dev->guest_notify_bh);
>> >>  }
>> >>
>> >>  static void virtio_scsi_iothread_handle_ctrl(EventNotifier *notifier)
>> >> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
>> >> index 6e34a2c..98411ef 100644
>> >> --- a/hw/scsi/virtio-scsi.c
>> >> +++ b/hw/scsi/virtio-scsi.c
>> >> @@ -86,6 +86,21 @@ static void virtio_scsi_complete_req(VirtIOSCSIReq 
>> >> *req)
>> >>  virtio_scsi_free_req(req);
>> >>  }
>> >>
>> >> +static void notify_guest_bh(void *opaque)
>> >> +{
>> >> +VirtIOSCSI *s = opaque;
>> >> +unsigned int qid;
>> >> +uint64_t pending = s->pending_guest_notify;
>> >> +
>> >> +s->pending_guest_notify = 0;
>> >> +
>> >> +while ((qid = ffsl(pending))) {
>> >> +qid--;
>> >> +event_notifier_set(&s->cmd_vrings[qid]->guest_notifier);
>> >
>> > Don't we need to handle ctrlq and eventq as well?
>>
Most of the time there is no frequent activity in the ctrl and event
vqs, so it isn't necessary to involve them.
>
> I mean virtio_scsi_vring_push_notify is not only used to notify cmd, it should
> also work for ctrl and event, because its caller virtio_scsi_complete_req is
> also called in virtio_scsi_handle_ctrl_req and virtio_scsi_push_event.
>
>>
>> >
>> >> +pending &= ~(1 << qid);
>> >> +}
>> >> +}
>> >> +
>> >>  static void virtio_scsi_bad_req(void)
>> >>  {
>> >>  error_report("wrong size for virtio-scsi headers");
>> >> @@ -824,7 +839,12 @@ void virtio_scsi_common_realize(DeviceState *dev, 
>> >> Error **errp,
>> >>  }
>> >>
>> >>  if (s->conf.iothread) {
>> >> -virtio_scsi_set_iothread(VIRTIO_SCSI(s), s->conf.iothread);
>> >> +VirtIOSCSI *vis = VIRTIO_SCSI(s);
>> >> +
>> >> +QEMU_BUILD_BUG_ON(VIRTIO_PCI_QUEUE_MAX > 64);
>> >> +virtio_scsi_set_iothread(vis, s->conf.iothread);
>> >> +vis->pending_guest_notify = 0;
>> >> +vis->guest_notify_bh = aio_bh_new(vis->ctx, notify_guest_bh, 
>> >> vis);
>> >>  }
>> >>  }
>> >>
>> >> @@ -901,7 +921,11 @@ void virtio_scsi_common_unrealize(DeviceState *dev, 
>> >> Error **errp)
>> >>  {

Re: [Qemu-devel] [PATCH v1 2/2] virtio-scsi: dataplane: notify guest as batch

2014-11-10 Thread Ming Lei
On Tue, Nov 11, 2014 at 9:52 AM, Fam Zheng  wrote:
> On Tue, 11/11 09:17, Ming Lei wrote:
>> It isn't necessery to notify guest each time when one request
>> is completed, and it should be enough to just notify one time
>> for each running of virtio_scsi_iothread_handle_cmd().
>>
>> This patch supresses about 30K/sec write on eventfd.
>>
>> Signed-off-by: Ming Lei 
>> ---
>>  hw/scsi/virtio-scsi-dataplane.c |4 +++-
>>  hw/scsi/virtio-scsi.c   |   26 +-
>>  include/hw/virtio/virtio-scsi.h |4 
>>  3 files changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/scsi/virtio-scsi-dataplane.c 
>> b/hw/scsi/virtio-scsi-dataplane.c
>> index df17229..294515a 100644
>> --- a/hw/scsi/virtio-scsi-dataplane.c
>> +++ b/hw/scsi/virtio-scsi-dataplane.c
>> @@ -62,6 +62,7 @@ static VirtIOSCSIVring *virtio_scsi_vring_init(VirtIOSCSI 
>> *s,
>>  aio_set_event_notifier(s->ctx, &r->host_notifier, handler);
>>
>>  r->parent = s;
>> +r->qid = n;
>>
>>  if (!vring_setup(&r->vring, VIRTIO_DEVICE(s), n)) {
>>  fprintf(stderr, "virtio-scsi: VRing setup failed\n");
>> @@ -95,7 +96,8 @@ void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req)
>>  {
>>  vring_push(&req->vring->vring, &req->elem,
>> req->qsgl.size + req->resp_iov.size);
>> -event_notifier_set(&req->vring->guest_notifier);
>> +req->dev->pending_guest_notify |= 1 << (req->vring->qid - 2);
>> +qemu_bh_schedule(req->dev->guest_notify_bh);
>>  }
>>
>>  static void virtio_scsi_iothread_handle_ctrl(EventNotifier *notifier)
>> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
>> index 6e34a2c..98411ef 100644
>> --- a/hw/scsi/virtio-scsi.c
>> +++ b/hw/scsi/virtio-scsi.c
>> @@ -86,6 +86,21 @@ static void virtio_scsi_complete_req(VirtIOSCSIReq *req)
>>  virtio_scsi_free_req(req);
>>  }
>>
>> +static void notify_guest_bh(void *opaque)
>> +{
>> +VirtIOSCSI *s = opaque;
>> +unsigned int qid;
>> +uint64_t pending = s->pending_guest_notify;
>> +
>> +s->pending_guest_notify = 0;
>> +
>> +while ((qid = ffsl(pending))) {
>> +qid--;
>> +event_notifier_set(&s->cmd_vrings[qid]->guest_notifier);
>
> Don't we need to handle ctrlq and eventq as well?

Most of the time there is no frequent activity in the ctrl and event
vqs, so it isn't necessary to involve them.

>
>> +pending &= ~(1 << qid);
>> +}
>> +}
>> +
>>  static void virtio_scsi_bad_req(void)
>>  {
>>  error_report("wrong size for virtio-scsi headers");
>> @@ -824,7 +839,12 @@ void virtio_scsi_common_realize(DeviceState *dev, Error 
>> **errp,
>>  }
>>
>>  if (s->conf.iothread) {
>> -virtio_scsi_set_iothread(VIRTIO_SCSI(s), s->conf.iothread);
>> +VirtIOSCSI *vis = VIRTIO_SCSI(s);
>> +
>> +QEMU_BUILD_BUG_ON(VIRTIO_PCI_QUEUE_MAX > 64);
>> +virtio_scsi_set_iothread(vis, s->conf.iothread);
>> +vis->pending_guest_notify = 0;
>> +vis->guest_notify_bh = aio_bh_new(vis->ctx, notify_guest_bh, vis);
>>  }
>>  }
>>
>> @@ -901,7 +921,11 @@ void virtio_scsi_common_unrealize(DeviceState *dev, 
>> Error **errp)
>>  {
>>  VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>>  VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(dev);
>> +VirtIOSCSI *s = VIRTIO_SCSI(vs);
>>
>> +if (vs->conf.iothread) {
>> +qemu_bh_delete(s->guest_notify_bh);
>> +}
>>  g_free(vs->cmd_vqs);
>>  virtio_cleanup(vdev);
>>  }
>> diff --git a/include/hw/virtio/virtio-scsi.h 
>> b/include/hw/virtio/virtio-scsi.h
>> index 9e1a49c..5e6c57e 100644
>> --- a/include/hw/virtio/virtio-scsi.h
>> +++ b/include/hw/virtio/virtio-scsi.h
>> @@ -163,6 +163,7 @@ typedef struct {
>>  Vring vring;
>>  EventNotifier host_notifier;
>>  EventNotifier guest_notifier;
>> +uint32_t qid;
>
> Could this "bool notify_pending"? In this case pending_guest_notify in
> VirtIOSCSI is not necessary. I guess looking into no more than 64
> VirtIOSCSIVring elements in guest_notify_bh is not _that_ expensive so it's 
> not
> worth the complexity.

We need to know which queue the pending notifier is sent to in the BH
handler, and 64 is the limit on the queue number, not the element number.

Actually it is a bit expensive: ~30K/sec of write syscalls are wasted
on the unnecessary & repeated notifying.

And a similar approach is applied in virtio-blk dataplane too, :-)
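
For illustration, a standalone demo of that bitmask walk (plain C;
the 0-based queue ids fall out of ffsl()'s 1-based return value):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>    /* ffsl() is a GNU extension */

int main(void)
{
    unsigned long pending = (1UL << 0) | (1UL << 3) | (1UL << 5);
    int qid;

    while ((qid = ffsl(pending))) {
        qid--;                            /* ffsl() counts from 1 */
        printf("notify queue %d\n", qid); /* prints 0, 3, 5 */
        pending &= ~(1UL << qid);
    }
    return 0;
}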


Thanks,
Ming Lei



[Qemu-devel] [PATCH v1 2/2] virtio-scsi: dataplane: notify guest as batch

2014-11-10 Thread Ming Lei
It isn't necessary to notify the guest each time one request
is completed; it should be enough to notify just once
for each run of virtio_scsi_iothread_handle_cmd().

This patch suppresses about 30K/sec of writes on the eventfd.

Signed-off-by: Ming Lei 
---
 hw/scsi/virtio-scsi-dataplane.c |4 +++-
 hw/scsi/virtio-scsi.c   |   26 +-
 include/hw/virtio/virtio-scsi.h |4 
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index df17229..294515a 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -62,6 +62,7 @@ static VirtIOSCSIVring *virtio_scsi_vring_init(VirtIOSCSI *s,
 aio_set_event_notifier(s->ctx, &r->host_notifier, handler);
 
 r->parent = s;
+r->qid = n;
 
 if (!vring_setup(&r->vring, VIRTIO_DEVICE(s), n)) {
 fprintf(stderr, "virtio-scsi: VRing setup failed\n");
@@ -95,7 +96,8 @@ void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req)
 {
 vring_push(&req->vring->vring, &req->elem,
req->qsgl.size + req->resp_iov.size);
-event_notifier_set(&req->vring->guest_notifier);
+req->dev->pending_guest_notify |= 1 << (req->vring->qid - 2);
+qemu_bh_schedule(req->dev->guest_notify_bh);
 }
 
 static void virtio_scsi_iothread_handle_ctrl(EventNotifier *notifier)
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 6e34a2c..98411ef 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -86,6 +86,21 @@ static void virtio_scsi_complete_req(VirtIOSCSIReq *req)
 virtio_scsi_free_req(req);
 }
 
+static void notify_guest_bh(void *opaque)
+{
+VirtIOSCSI *s = opaque;
+unsigned int qid;
+uint64_t pending = s->pending_guest_notify;
+
+s->pending_guest_notify = 0;
+
+while ((qid = ffsl(pending))) {
+qid--;
+event_notifier_set(&s->cmd_vrings[qid]->guest_notifier);
+pending &= ~(1 << qid);
+}
+}
+
 static void virtio_scsi_bad_req(void)
 {
 error_report("wrong size for virtio-scsi headers");
@@ -824,7 +839,12 @@ void virtio_scsi_common_realize(DeviceState *dev, Error 
**errp,
 }
 
 if (s->conf.iothread) {
-virtio_scsi_set_iothread(VIRTIO_SCSI(s), s->conf.iothread);
+VirtIOSCSI *vis = VIRTIO_SCSI(s);
+
+QEMU_BUILD_BUG_ON(VIRTIO_PCI_QUEUE_MAX > 64);
+virtio_scsi_set_iothread(vis, s->conf.iothread);
+vis->pending_guest_notify = 0;
+vis->guest_notify_bh = aio_bh_new(vis->ctx, notify_guest_bh, vis);
 }
 }
 
@@ -901,7 +921,11 @@ void virtio_scsi_common_unrealize(DeviceState *dev, Error 
**errp)
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(dev);
+VirtIOSCSI *s = VIRTIO_SCSI(vs);
 
+if (vs->conf.iothread) {
+qemu_bh_delete(s->guest_notify_bh);
+}
 g_free(vs->cmd_vqs);
 virtio_cleanup(vdev);
 }
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 9e1a49c..5e6c57e 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -163,6 +163,7 @@ typedef struct {
 Vring vring;
 EventNotifier host_notifier;
 EventNotifier guest_notifier;
+uint32_t qid;
 } VirtIOSCSIVring;
 
 typedef struct VirtIOSCSICommon {
@@ -198,6 +199,9 @@ typedef struct VirtIOSCSI {
 bool dataplane_fenced;
 Error *blocker;
 Notifier migration_state_notifier;
+
+QEMUBH *guest_notify_bh;
+uint64_t pending_guest_notify;
 } VirtIOSCSI;
 
 typedef struct VirtIOSCSIReq {
-- 
1.7.9.5




[Qemu-devel] [PATCH v1 0/2] virtio-scsi: dataplane: one fix and one optimization

2014-11-10 Thread Ming Lei
Hi,

The 1st patch fixes an allocation problem.

The 2nd one suppresses writing to the eventfd a lot (~30K/sec in my test).

V1:
- use g_new() in 1/2
- avoid to dereference VIRTIO_SCSI() directly 2/2
- add build check on queue depth

Thanks,
Ming Lei





[Qemu-devel] [PATCH v1 1/2] virtio-scsi: dataplane: fix allocation for 'cmd_vrings'

2014-11-10 Thread Ming Lei
The size of each element should be sizeof(VirtIOSCSIVring *).

Signed-off-by: Ming Lei 
---
 hw/scsi/virtio-scsi-dataplane.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 855439e..df17229 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -239,7 +239,7 @@ void virtio_scsi_dataplane_start(VirtIOSCSI *s)
 if (!s->event_vring) {
 goto fail_vrings;
 }
-s->cmd_vrings = g_malloc0(sizeof(VirtIOSCSIVring) * vs->conf.num_queues);
+s->cmd_vrings = g_new(VirtIOSCSIVring *, vs->conf.num_queues);
 for (i = 0; i < vs->conf.num_queues; i++) {
 s->cmd_vrings[i] =
 virtio_scsi_vring_init(s, vs->cmd_vqs[i],
-- 
1.7.9.5




Re: [Qemu-devel] [PATCH 1/2] virtio-scsi-dataplane: fix allocation for 'cmd_vrings'

2014-11-10 Thread Ming Lei
On Mon, Nov 10, 2014 at 5:21 PM, Kevin Wolf  wrote:
> Am 10.11.2014 um 10:14 hat Ming Lei geschrieben:
>> On Mon, Nov 10, 2014 at 4:24 PM, Markus Armbruster  wrote:
>> > Ming Lei  writes:
>> >
>> >> The size of each element should be sizeof(VirtIOSCSIVring *).
>> >>
>> >> Signed-off-by: Ming Lei 
>> >> ---
>> >>  hw/scsi/virtio-scsi-dataplane.c |2 +-
>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >>
>> >> diff --git a/hw/scsi/virtio-scsi-dataplane.c 
>> >> b/hw/scsi/virtio-scsi-dataplane.c
>> >> index 855439e..8a7cd9f 100644
>> >> --- a/hw/scsi/virtio-scsi-dataplane.c
>> >> +++ b/hw/scsi/virtio-scsi-dataplane.c
>> >> @@ -239,7 +239,7 @@ void virtio_scsi_dataplane_start(VirtIOSCSI *s)
>> >>  if (!s->event_vring) {
>> >>  goto fail_vrings;
>> >>  }
>> >> -s->cmd_vrings = g_malloc0(sizeof(VirtIOSCSIVring) * 
>> >> vs->conf.num_queues);
>> >> +s->cmd_vrings = g_malloc0(sizeof(VirtIOSCSIVring *) * 
>> >> vs->conf.num_queues);
>> >>  for (i = 0; i < vs->conf.num_queues; i++) {
>> >>  s->cmd_vrings[i] =
>> >>  virtio_scsi_vring_init(s, vs->cmd_vqs[i],
>> >
>> > Please use something like
>> >
>> > s->cmd_vrings = g_new0(VirtIOSCSIVring *, vs->conf.num_queues);
>> >
>> > This one crept in since I cleaned up g_malloc() use globally:
>>
>> Your idea is good, but this one is a fix patch, and I
>> think the g_new() conversion should be done in another
>> patch since the two changes are different logically.
>
> It's not really unrelated: g_new() would have caught the incorrect type
> and made it a compiler error. So changing to g_new() in a patch fixing
> such a bug is actually a logical conclusion and makes it more obvious
> that your patch is correct.

Fair enough, will post v1 with g_new() conversion.
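
For context, the reason g_new() would have caught this at compile time
is that it casts the result to the element type, so a pointer-type
mismatch becomes a compiler warning; a minimal sketch (assumes glib):

#include <glib.h>

typedef struct { int dummy; } Vring;

void demo(unsigned num_queues)
{
    Vring **vrings;

    vrings = g_new0(Vring *, num_queues);      /* correct, no warning */
    /* vrings = g_new0(Vring, num_queues); */  /* would warn: assigning
     * Vring * to Vring **; g_malloc0(sizeof(Vring) * n) warns about
     * nothing, which is how the original bug slipped in. */
    g_free(vrings);
}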

Thanks,



Re: [Qemu-devel] [PATCH 1/2] virtio-scsi-dataplane: fix allocation for 'cmd_vrings'

2014-11-10 Thread Ming Lei
On Mon, Nov 10, 2014 at 4:24 PM, Markus Armbruster  wrote:
> Ming Lei  writes:
>
>> The size of each element should be sizeof(VirtIOSCSIVring *).
>>
>> Signed-off-by: Ming Lei 
>> ---
>>  hw/scsi/virtio-scsi-dataplane.c |2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/scsi/virtio-scsi-dataplane.c 
>> b/hw/scsi/virtio-scsi-dataplane.c
>> index 855439e..8a7cd9f 100644
>> --- a/hw/scsi/virtio-scsi-dataplane.c
>> +++ b/hw/scsi/virtio-scsi-dataplane.c
>> @@ -239,7 +239,7 @@ void virtio_scsi_dataplane_start(VirtIOSCSI *s)
>>  if (!s->event_vring) {
>>  goto fail_vrings;
>>  }
>> -s->cmd_vrings = g_malloc0(sizeof(VirtIOSCSIVring) * 
>> vs->conf.num_queues);
>> +s->cmd_vrings = g_malloc0(sizeof(VirtIOSCSIVring *) * 
>> vs->conf.num_queues);
>>  for (i = 0; i < vs->conf.num_queues; i++) {
>>  s->cmd_vrings[i] =
>>  virtio_scsi_vring_init(s, vs->cmd_vqs[i],
>
> Please use something like
>
> s->cmd_vrings = g_new0(VirtIOSCSIVring *, vs->conf.num_queues);
>
> This one crept in since I cleaned up g_malloc() use globally:

Your idea is good, but this one is a fix patch, and I
think the g_new() conversion should be done in another
patch since the two changes are logically different.

Thanks,



[Qemu-devel] [PATCH 2/2] virtio-scsi-dataplane: notify guest as batch

2014-11-09 Thread Ming Lei
It isn't necessary to notify the guest each time one request
is completed; it should be enough to notify just once
for each run of virtio_scsi_iothread_handle_cmd().

This patch suppresses about 30K/sec of writes on the eventfd.

Signed-off-by: Ming Lei 
---
 hw/scsi/virtio-scsi-dataplane.c |4 +++-
 hw/scsi/virtio-scsi.c   |   22 ++
 include/hw/virtio/virtio-scsi.h |4 
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 8a7cd9f..1fb83de 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -62,6 +62,7 @@ static VirtIOSCSIVring *virtio_scsi_vring_init(VirtIOSCSI *s,
 aio_set_event_notifier(s->ctx, &r->host_notifier, handler);
 
 r->parent = s;
+r->qid = n;
 
 if (!vring_setup(&r->vring, VIRTIO_DEVICE(s), n)) {
 fprintf(stderr, "virtio-scsi: VRing setup failed\n");
@@ -95,7 +96,8 @@ void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req)
 {
 vring_push(&req->vring->vring, &req->elem,
req->qsgl.size + req->resp_iov.size);
-event_notifier_set(&req->vring->guest_notifier);
+req->dev->pending_guest_notify |= 1 << (req->vring->qid - 2);
+qemu_bh_schedule(req->dev->guest_notify_bh);
 }
 
 static void virtio_scsi_iothread_handle_ctrl(EventNotifier *notifier)
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 6e34a2c..1b9c35c 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -86,6 +86,21 @@ static void virtio_scsi_complete_req(VirtIOSCSIReq *req)
 virtio_scsi_free_req(req);
 }
 
+static void notify_guest_bh(void *opaque)
+{
+VirtIOSCSI *s = opaque;
+unsigned int qid;
+uint64_t pending = s->pending_guest_notify;
+
+s->pending_guest_notify = 0;
+
+while ((qid = ffsl(pending))) {
+qid--;
+event_notifier_set(&s->cmd_vrings[qid]->guest_notifier);
+pending &= ~(1 << qid);
+}
+}
+
 static void virtio_scsi_bad_req(void)
 {
 error_report("wrong size for virtio-scsi headers");
@@ -825,6 +840,10 @@ void virtio_scsi_common_realize(DeviceState *dev, Error 
**errp,
 
 if (s->conf.iothread) {
 virtio_scsi_set_iothread(VIRTIO_SCSI(s), s->conf.iothread);
+VIRTIO_SCSI(s)->pending_guest_notify = 0;
+VIRTIO_SCSI(s)->guest_notify_bh = aio_bh_new(VIRTIO_SCSI(s)->ctx,
+ notify_guest_bh,
+ VIRTIO_SCSI(s));
 }
 }
 
@@ -902,6 +921,9 @@ void virtio_scsi_common_unrealize(DeviceState *dev, Error 
**errp)
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(dev);
 
+if (vs->conf.iothread) {
+qemu_bh_delete(VIRTIO_SCSI(vs)->guest_notify_bh);
+}
 g_free(vs->cmd_vqs);
 virtio_cleanup(vdev);
 }
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 9e1a49c..5e6c57e 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -163,6 +163,7 @@ typedef struct {
 Vring vring;
 EventNotifier host_notifier;
 EventNotifier guest_notifier;
+uint32_t qid;
 } VirtIOSCSIVring;
 
 typedef struct VirtIOSCSICommon {
@@ -198,6 +199,9 @@ typedef struct VirtIOSCSI {
 bool dataplane_fenced;
 Error *blocker;
 Notifier migration_state_notifier;
+
+QEMUBH *guest_notify_bh;
+uint64_t pending_guest_notify;
 } VirtIOSCSI;
 
 typedef struct VirtIOSCSIReq {
-- 
1.7.9.5




[Qemu-devel] [PATCH 1/2] virtio-scsi-dataplane: fix allocation for 'cmd_vrings'

2014-11-09 Thread Ming Lei
The size of each element should be sizeof(VirtIOSCSIVring *).

Signed-off-by: Ming Lei 
---
 hw/scsi/virtio-scsi-dataplane.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 855439e..8a7cd9f 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -239,7 +239,7 @@ void virtio_scsi_dataplane_start(VirtIOSCSI *s)
 if (!s->event_vring) {
 goto fail_vrings;
 }
-s->cmd_vrings = g_malloc0(sizeof(VirtIOSCSIVring) * vs->conf.num_queues);
+s->cmd_vrings = g_malloc0(sizeof(VirtIOSCSIVring *) * vs->conf.num_queues);
 for (i = 0; i < vs->conf.num_queues; i++) {
 s->cmd_vrings[i] =
 virtio_scsi_vring_init(s, vs->cmd_vqs[i],
-- 
1.7.9.5




[Qemu-devel] [PATCH 0/2] virtio-scsi-dataplane: one fix and one optimization

2014-11-09 Thread Ming Lei
The 1st patch fixes an allocation problem.

The 2nd one suppresses writing to the eventfd a lot (~30K/sec in my test).

Thanks,
Ming Lei




[Qemu-devel] [PATCH 13/13] virtio-scsi-dataplane: support AioContext wide IO submission as batch

2014-11-08 Thread Ming Lei
Replace the previous per-device usage with the AioContext-wide approach.

Signed-off-by: Ming Lei 
---
 hw/scsi/virtio-scsi-dataplane.c |8 
 hw/scsi/virtio-scsi.c   |2 --
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 9651e6f..7fab14d 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -135,8 +135,12 @@ static void virtio_scsi_iothread_handle_cmd(EventNotifier 
*notifier)
 VirtIOSCSI *s = (VirtIOSCSI *)vring->parent;
 VirtIOSCSIReq *req, *next;
 QTAILQ_HEAD(, VirtIOSCSIReq) reqs = QTAILQ_HEAD_INITIALIZER(reqs);
+AioContext *ctx = s->ctx;
+bool plugged;
 
 event_notifier_test_and_clear(notifier);
+
+plugged = bdrv_aio_io_plug(ctx);
 while ((req = virtio_scsi_pop_req_vring(s, vring))) {
 if (virtio_scsi_handle_cmd_req_prepare(s, req)) {
 QTAILQ_INSERT_TAIL(&reqs, req, next);
@@ -146,6 +150,10 @@ static void virtio_scsi_iothread_handle_cmd(EventNotifier 
*notifier)
 QTAILQ_FOREACH_SAFE(req, &reqs, next, next) {
 virtio_scsi_handle_cmd_req_submit(s, req);
 }
+
+if (plugged) {
+bdrv_aio_io_unplug(ctx);
+}
 }
 
 /* assumes s->ctx held */
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index fdcacfd..6e34a2c 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -540,7 +540,6 @@ bool virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI *s, 
VirtIOSCSIReq *req)
 return false;
 }
 scsi_req_ref(req->sreq);
-blk_io_plug(d->conf.blk);
 return true;
 }
 
@@ -550,7 +549,6 @@ void virtio_scsi_handle_cmd_req_submit(VirtIOSCSI *s, 
VirtIOSCSIReq *req)
 if (scsi_req_enqueue(sreq)) {
 scsi_req_continue(sreq);
 }
-blk_io_unplug(sreq->dev->conf.blk);
 scsi_req_unref(sreq);
 }
 
-- 
1.7.9.5




[Qemu-devel] [PATCH 10/13] block/linux-aio.c: prepare for elastical resource's allocation

2014-11-08 Thread Ming Lei
Because we will support AioContext-wide IO submission as batch,
IO resources should be allocated according to how many
backends there are in the same AioContext.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   38 +++---
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index e3e0532..e219b80 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -43,7 +43,7 @@ struct qemu_laiocb {
  * so in future the IO queue should be per AIO context.
  */
 typedef struct {
-struct iocb *iocbs[MAX_QUEUED_IO];
+struct iocb **iocbs;
 int plugged;
 unsigned int size;
 unsigned int idx;
@@ -66,6 +66,7 @@ struct qemu_laio_state {
 LaioQueue *io_q;
 
 /* I/O completion processing */
+int nr_evt;
 QEMUBH *completion_bh;
 struct io_event *events;
 int event_idx;
@@ -73,6 +74,7 @@ struct qemu_laio_state {
 
 /* All BS in the list shared this 'qemu_laio_state' */
 QLIST_HEAD(, LaioTrackedBs) tracked_bs;
+int nr_bs;
 AioContext *aio_context;
 };
 
@@ -131,7 +133,7 @@ static void qemu_laio_completion_bh(void *opaque)
 if (s->event_idx == s->event_max) {
 do {
 struct timespec ts = { 0 };
-s->event_max = io_getevents(s->ctx, MAX_EVENTS, MAX_EVENTS,
+s->event_max = io_getevents(s->ctx, s->nr_evt, s->nr_evt,
 s->events, &ts);
 } while (s->event_max == -EINTR);
 
@@ -201,11 +203,13 @@ static const AIOCBInfo laio_aiocb_info = {
 .cancel_async   = laio_cancel,
 };
 
-static void ioq_init(LaioQueue *io_q)
+static void ioq_init(LaioQueue *io_q, unsigned size)
 {
-io_q->size = MAX_QUEUED_IO;
+io_q->size = size;
 io_q->idx = 0;
 io_q->plugged = 0;
+
+io_q->iocbs = g_malloc(sizeof(*io_q->iocbs) * size);
 }
 
 static void abort_queue(struct qemu_laio_state *s)
@@ -385,16 +389,34 @@ static int laio_alloc_resources(AioContext *ctx,
 struct qemu_laio_state *s)
 {
 LaioQueue *ioq;
+unsigned nr = s->nr_bs ? : 1;
+
+/* return immediately if resources are allocated already */
+if (likely(s->io_q)) {
+return 0;
+}
+
+if (nr > 10) {
+nr = 10;
+}
+
+while (nr > 0) {
+if (io_setup(MAX_EVENTS * nr, &s->ctx) == 0) {
+break;
+}
+nr--;
+}
 
-if (io_setup(MAX_EVENTS, &s->ctx) != 0) {
+if (nr == 0) {
 return -1;
 }
 
+s->nr_evt = nr * MAX_EVENTS;
 ioq = g_malloc0(sizeof(*s->io_q));
-ioq_init(ioq);
+ioq_init(ioq, MAX_QUEUED_IO * nr);
 ioq->retry = aio_bh_new(ctx, ioq_submit_retry, s);
 
-s->events = g_malloc(sizeof(*s->events) * MAX_EVENTS);
+s->events = g_malloc(sizeof(*s->events) * s->nr_evt);
 s->io_q = ioq;
 
 s->completion_bh = aio_bh_new(ctx, qemu_laio_completion_bh, s);
@@ -464,6 +486,7 @@ void laio_detach_aio_context(void *s_, BlockDriverState *bs,
 }
 }
 
+qs->state->nr_bs--;
 if (!aio_detach_aio_bs(old_context, bs)) {
 /* assign new master aio bs for the aio context */
 if (old_context->master_aio_bs == bs) {
@@ -494,6 +517,7 @@ void laio_attach_aio_context(void *s_, BlockDriverState *bs,
 
 tbs->bs = bs;
 QLIST_INSERT_HEAD(&qs->state->tracked_bs, tbs, list);
+qs->state->nr_bs++;
 }
 
 void *laio_init(void)
-- 
1.7.9.5




[Qemu-devel] [PATCH 11/13] block/linux-aio: reallocate I/O resources when aio attached

2014-11-08 Thread Ming Lei
First, the event notifier and the QEMU BH are associated with an aio
context, so they should be reallocated so that these
handlers run in the currently attached aio context.

Secondly, when a new 'bs' is attached, we need to allocate
more io resources for performance reasons.

This patch only does the reallocation if there isn't
any pending I/O.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   30 ++
 1 file changed, 30 insertions(+)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index e219b80..c5a88e8 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -59,6 +59,7 @@ typedef struct LaioTrackedBs {
 
 /* lifetime: between aio_attach and aio_detach */
 struct qemu_laio_state {
+unsigned long io_pending;
 io_context_t ctx;
 EventNotifier e;
 
@@ -75,6 +76,7 @@ struct qemu_laio_state {
 /* All BS in the list shared this 'qemu_laio_state' */
 QLIST_HEAD(, LaioTrackedBs) tracked_bs;
 int nr_bs;
+bool need_realloc;
 AioContext *aio_context;
 };
 
@@ -82,6 +84,8 @@ typedef struct {
 struct qemu_laio_state *state;
 } QemuLaioState;
 
+static void laio_realloc_resources(struct qemu_laio_state *s);
+
 static inline ssize_t io_event_ret(struct io_event *ev)
 {
 return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
@@ -110,6 +114,8 @@ static void qemu_laio_process_completion(struct 
qemu_laio_state *s,
 }
 }
 laiocb->common.cb(laiocb->common.opaque, ret);
+s->io_pending--;
+laio_realloc_resources(s);
 
 qemu_aio_unref(laiocb);
 }
@@ -193,6 +199,8 @@ static void laio_cancel(BlockAIOCB *blockacb)
 }
 
 laiocb->common.cb(laiocb->common.opaque, laiocb->ret);
+laiocb->aio_state->io_pending--;
+laio_realloc_resources(laiocb->aio_state);
 
 /* check if there are requests in io queue */
 qemu_laio_start_retry(laiocb->aio_state);
@@ -378,6 +386,8 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void 
*aio_ctx, int fd,
 goto out_free_aiocb;
 }
 }
+
+s->io_pending++;
 return &laiocb->common;
 
 out_free_aiocb:
@@ -429,6 +439,10 @@ static void laio_free_resources(struct qemu_laio_state *s)
 {
 LaioQueue *ioq = s->io_q;
 
+if (!ioq) {
+return;
+}
+
 aio_set_event_notifier(s->aio_context, &s->e, NULL);
 qemu_bh_delete(s->completion_bh);
 
@@ -441,6 +455,8 @@ static void laio_free_resources(struct qemu_laio_state *s)
 if (io_destroy(s->ctx) != 0) {
 fprintf(stderr, "%s: destroy AIO context %p failed\n",
 __func__, &s->ctx);
+} else {
+s->ctx = 0;
 }
 }
 
@@ -473,6 +489,17 @@ static void laio_state_free(struct qemu_laio_state *s, 
AioContext *context)
 g_free(s);
 }
 
+static void laio_realloc_resources(struct qemu_laio_state *s)
+{
+if (unlikely(s->need_realloc) && unlikely(!s->io_pending) &&
+!s->io_q->plugged) {
+laio_free_resources(s);
+laio_alloc_resources(s->aio_context, s);
+
+s->need_realloc = false;
+}
+}
+
 void laio_detach_aio_context(void *s_, BlockDriverState *bs,
 AioContext *old_context)
 {
@@ -493,6 +520,7 @@ void laio_detach_aio_context(void *s_, BlockDriverState *bs,
 tbs = QLIST_FIRST(&qs->state->tracked_bs);
 old_context->master_aio_bs = tbs->bs;
 }
+qs->state->need_realloc = true;
 return;
 }
 
@@ -513,11 +541,13 @@ void laio_attach_aio_context(void *s_, BlockDriverState 
*bs,
 qs->state->aio_context = new_context;
 } else {
 qs->state = new_context->opaque;
+qs->state->need_realloc = true;
 }
 
 tbs->bs = bs;
 QLIST_INSERT_HEAD(&qs->state->tracked_bs, tbs, list);
 qs->state->nr_bs++;
+qs->state->aio_context = new_context;
 }
 
 void *laio_init(void)
-- 
1.7.9.5




[Qemu-devel] [PATCH 12/13] block: introduce bdrv_aio_io_plug() and its pair

2014-11-08 Thread Ming Lei
These two APIs are introduced for AioContext-wide
IO submission as batch.

Signed-off-by: Ming Lei 
---
 block.c   |   16 
 include/block/block.h |3 +++
 2 files changed, 19 insertions(+)

diff --git a/block.c b/block.c
index dacd881..0200af0 100644
--- a/block.c
+++ b/block.c
@@ -5901,6 +5901,22 @@ void bdrv_io_unplug(BlockDriverState *bs)
 }
 }
 
+bool bdrv_aio_io_plug(AioContext *aio_ctx)
+{
+if (aio_ctx->master_aio_bs) {
+bdrv_io_plug(aio_ctx->master_aio_bs);
+return true;
+}
+return false;
+}
+
+void bdrv_aio_io_unplug(AioContext *aio_ctx)
+{
+if (aio_ctx->master_aio_bs) {
+bdrv_io_unplug(aio_ctx->master_aio_bs);
+}
+}
+
 void bdrv_flush_io_queue(BlockDriverState *bs)
 {
 BlockDriver *drv = bs->drv;
diff --git a/include/block/block.h b/include/block/block.h
index 13e4537..aae6b66 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -541,6 +541,9 @@ void bdrv_io_plug(BlockDriverState *bs);
 void bdrv_io_unplug(BlockDriverState *bs);
 void bdrv_flush_io_queue(BlockDriverState *bs);
 
+bool bdrv_aio_io_plug(AioContext *aio_ctx);
+void bdrv_aio_io_unplug(AioContext *aio_ctx);
+
 BlockAcctStats *bdrv_get_stats(BlockDriverState *bs);
 
 #endif
-- 
1.7.9.5




[Qemu-devel] [PATCH 09/13] block/linux-aio.c: introduce laio_alloc_resource()

2014-11-08 Thread Ming Lei
This patch introduces laio_alloc_resources() for allocating
resources for linux aio, so that in the following patches we
can allocate IO resources on demand.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   55 ++---
 1 file changed, 35 insertions(+), 20 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 5be8036..e3e0532 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -73,6 +73,7 @@ struct qemu_laio_state {
 
 /* All BS in the list shared this 'qemu_laio_state' */
 QLIST_HEAD(, LaioTrackedBs) tracked_bs;
+AioContext *aio_context;
 };
 
 typedef struct {
@@ -380,20 +381,45 @@ out_free_aiocb:
 return NULL;
 }
 
-static LaioQueue *laio_alloc_ioq(AioContext *ctx, struct qemu_laio_state *s)
+static int laio_alloc_resources(AioContext *ctx,
+struct qemu_laio_state *s)
 {
-LaioQueue *ioq = g_malloc0(sizeof(*ioq));
+LaioQueue *ioq;
 
+if (io_setup(MAX_EVENTS, &s->ctx) != 0) {
+return -1;
+}
+
+ioq = g_malloc0(sizeof(*s->io_q));
 ioq_init(ioq);
 ioq->retry = aio_bh_new(ctx, ioq_submit_retry, s);
-return ioq;
+
+s->events = g_malloc(sizeof(*s->events) * MAX_EVENTS);
+s->io_q = ioq;
+
+s->completion_bh = aio_bh_new(ctx, qemu_laio_completion_bh, s);
+aio_set_event_notifier(ctx, &s->e, qemu_laio_completion_cb);
+
+return 0;
 }
 
-static void laio_free_ioq(struct qemu_laio_state *s, LaioQueue *ioq)
+static void laio_free_resources(struct qemu_laio_state *s)
 {
+LaioQueue *ioq = s->io_q;
+
+aio_set_event_notifier(s->aio_context, &s->e, NULL);
+qemu_bh_delete(s->completion_bh);
+
+g_free(s->events);
+
 qemu_bh_delete(ioq->retry);
 g_free(ioq);
 s->io_q = NULL;
+
+if (io_destroy(s->ctx) != 0) {
+fprintf(stderr, "%s: destroy AIO context %p failed\n",
+__func__, &s->ctx);
+}
 }
 
 static struct qemu_laio_state *laio_state_alloc(AioContext *context)
@@ -405,15 +431,10 @@ static struct qemu_laio_state 
*laio_state_alloc(AioContext *context)
 goto out_free_state;
 }
 
-if (io_setup(MAX_EVENTS, &s->ctx) != 0) {
+if (laio_alloc_resources(context, s) != 0) {
 goto out_close_efd;
 }
 
-s->events = g_malloc(sizeof(*s->events) * MAX_EVENTS);
-s->io_q = laio_alloc_ioq(context, s);
-s->completion_bh = aio_bh_new(context, qemu_laio_completion_bh, s);
-aio_set_event_notifier(context, &s->e, qemu_laio_completion_cb);
-
 return s;
 
 out_close_efd:
@@ -425,17 +446,8 @@ out_free_state:
 
 static void laio_state_free(struct qemu_laio_state *s, AioContext *context)
 {
-aio_set_event_notifier(context, &s->e, NULL);
-qemu_bh_delete(s->completion_bh);
-
-laio_free_ioq(s, s->io_q);
+laio_free_resources(s);
 event_notifier_cleanup(&s->e);
-g_free(s->events);
-
-if (io_destroy(s->ctx) != 0) {
-fprintf(stderr, "%s: destroy AIO context %p failed\n",
-__func__, &s->ctx);
-}
 g_free(s);
 }
 
@@ -473,6 +485,9 @@ void laio_attach_aio_context(void *s_, BlockDriverState *bs,
 
 if (aio_attach_aio_bs(new_context, bs)) {
 new_context->opaque = qs->state = laio_state_alloc(new_context);
+
+/* qemu_laio_state is per AioContext */
+qs->state->aio_context = new_context;
 } else {
 qs->state = new_context->opaque;
 }
-- 
1.7.9.5




[Qemu-devel] [PATCH 08/13] block/linux-aio.c: allocate events dynamically

2014-11-08 Thread Ming Lei
This patch allocates the events array of 'struct qemu_laio_state'
dynamically so that in the following patch we can allocate
resources elastically in the case of AioContext-wide IO submission
as batch.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index c5c7944..5be8036 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -67,7 +67,7 @@ struct qemu_laio_state {
 
 /* I/O completion processing */
 QEMUBH *completion_bh;
-struct io_event events[MAX_EVENTS];
+struct io_event *events;
 int event_idx;
 int event_max;
 
@@ -409,6 +409,7 @@ static struct qemu_laio_state *laio_state_alloc(AioContext 
*context)
 goto out_close_efd;
 }
 
+s->events = g_malloc(sizeof(*s->events) * MAX_EVENTS);
 s->io_q = laio_alloc_ioq(context, s);
 s->completion_bh = aio_bh_new(context, qemu_laio_completion_bh, s);
 aio_set_event_notifier(context, &s->e, qemu_laio_completion_cb);
@@ -429,6 +430,7 @@ static void laio_state_free(struct qemu_laio_state *s, 
AioContext *context)
 
 laio_free_ioq(s, s->io_q);
 event_notifier_cleanup(&s->e);
+g_free(s->events);
 
 if (io_destroy(s->ctx) != 0) {
 fprintf(stderr, "%s: destroy AIO context %p failed\n",
-- 
1.7.9.5




[Qemu-devel] [PATCH 06/13] AioContext: introduce aio_attach_aio_bs() and its pair

2014-11-08 Thread Ming Lei
This patch introduces aio_attach_aio_bs() and its pair
for supporting AioContext-wide IO submission as batch;
one typical use case is multi-LUN SCSI.

aio_attach_aio_bs() will attach one 'bs' which is capable
of IO submission as batch; all I/O submission in
this AioContext will then borrow the io queue from this 'bs',
so that we can maximize IO submission as batch.

aio_detach_aio_bs() will detach the 'bs', and reports when all
'bs' in the AioContext have been detached.

Signed-off-by: Ming Lei 
---
 async.c |1 +
 include/block/aio.h |   27 +++
 2 files changed, 28 insertions(+)

diff --git a/async.c b/async.c
index 6e1b282..0f4b530 100644
--- a/async.c
+++ b/async.c
@@ -305,6 +305,7 @@ AioContext *aio_context_new(Error **errp)
event_notifier_test_and_clear);
 ctx->pollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
 ctx->thread_pool = NULL;
+ctx->bs_refcnt = 0;
 qemu_mutex_init(&ctx->bh_lock);
 rfifolock_init(&ctx->lock, aio_rfifolock_cb, ctx);
 timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
diff --git a/include/block/aio.h b/include/block/aio.h
index 6bf0e04..e34a21f 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -90,6 +90,11 @@ struct AioContext {
 
 /* TimerLists for calling timers - one per clock type */
 QEMUTimerListGroup tlg;
+
+/* IO submit as batch in AioContext wide */
+BlockDriverState *master_aio_bs;
+int bs_refcnt;
+void *opaque;
 };
 
 /* Used internally to synchronize aio_poll against qemu_bh_schedule.  */
@@ -325,4 +330,26 @@ static inline void aio_timer_init(AioContext *ctx,
  */
 int64_t aio_compute_timeout(AioContext *ctx);
 
+static inline bool aio_attach_aio_bs(AioContext *ctx,
+ BlockDriverState *bs)
+{
+if (!ctx->bs_refcnt++) {
+ctx->master_aio_bs = bs;
+return true;
+}
+
+return false;
+}
+
+static inline bool aio_detach_aio_bs(AioContext *ctx,
+ BlockDriverState *bs)
+{
+if (!--ctx->bs_refcnt) {
+ctx->master_aio_bs = NULL;
+return true;
+}
+
+return false;
+}
+
 #endif
-- 
1.7.9.5




[Qemu-devel] [PATCH 05/13] block/linux-aio: pass 'BlockDriverState' to laio_attach_aio_context and its pair

2014-11-08 Thread Ming Lei
This patch introduces a 'BlockDriverState' parameter to
laio_attach_aio_context() and its pair, so that it will be
easier to support AioContext-wide IO submission as batch.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |6 --
 block/raw-aio.h   |6 --
 block/raw-posix.c |4 ++--
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 95cd0dc..17de2e3 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -429,7 +429,8 @@ static void laio_state_free(struct qemu_laio_state *s, 
AioContext *context)
 g_free(s);
 }
 
-void laio_detach_aio_context(void *s_, AioContext *old_context)
+void laio_detach_aio_context(void *s_, BlockDriverState *bs,
+AioContext *old_context)
 {
 QemuLaioState *qs = s_;
 
@@ -437,7 +438,8 @@ void laio_detach_aio_context(void *s_, AioContext 
*old_context)
 qs->state = NULL;
 }
 
-void laio_attach_aio_context(void *s_, AioContext *new_context)
+void laio_attach_aio_context(void *s_, BlockDriverState *bs,
+AioContext *new_context)
 {
 QemuLaioState *qs = s_;
 struct qemu_laio_state *s = laio_state_alloc(new_context);
diff --git a/block/raw-aio.h b/block/raw-aio.h
index 80681ce..61a5b96 100644
--- a/block/raw-aio.h
+++ b/block/raw-aio.h
@@ -38,8 +38,10 @@ void laio_cleanup(void *s);
 BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
 int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
 BlockCompletionFunc *cb, void *opaque, int type);
-void laio_detach_aio_context(void *s, AioContext *old_context);
-void laio_attach_aio_context(void *s, AioContext *new_context);
+void laio_detach_aio_context(void *s, BlockDriverState *bs,
+AioContext *old_context);
+void laio_attach_aio_context(void *s, BlockDriverState *bs,
+AioContext *new_context);
 void laio_io_plug(BlockDriverState *bs, void *aio_ctx);
 int laio_io_unplug(BlockDriverState *bs, void *aio_ctx, bool unplug);
 #endif
diff --git a/block/raw-posix.c b/block/raw-posix.c
index e100ae2..5b7f20b 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -323,7 +323,7 @@ static void raw_detach_aio_context(BlockDriverState *bs)
 BDRVRawState *s = bs->opaque;
 
 if (s->use_aio) {
-laio_detach_aio_context(s->aio_ctx, bdrv_get_aio_context(bs));
+laio_detach_aio_context(s->aio_ctx, bs, bdrv_get_aio_context(bs));
 }
 #endif
 }
@@ -335,7 +335,7 @@ static void raw_attach_aio_context(BlockDriverState *bs,
 BDRVRawState *s = bs->opaque;
 
 if (s->use_aio) {
-laio_attach_aio_context(s->aio_ctx, new_context);
+laio_attach_aio_context(s->aio_ctx, bs, new_context);
 }
 #endif
 }
-- 
1.7.9.5




[Qemu-devel] [PATCH 02/13] block: linux-aio: rename 'ctx' of qemu_laiocb as 'laio_state'

2014-11-08 Thread Ming Lei
So that it can be distinguished from the 'ctx' in qemu_laio_state.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index b9db28e..cf8691e 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -29,7 +29,7 @@
 
 struct qemu_laiocb {
 BlockAIOCB common;
-struct qemu_laio_state *ctx;
+struct qemu_laio_state *aio_state;
 struct iocb iocb;
 ssize_t ret;
 size_t nbytes;
@@ -169,7 +169,7 @@ static void laio_cancel(BlockAIOCB *blockacb)
 if (laiocb->ret != -EINPROGRESS) {
 return;
 }
-ret = io_cancel(laiocb->ctx->ctx, &laiocb->iocb, &event);
+ret = io_cancel(laiocb->aio_state->ctx, &laiocb->iocb, &event);
 laiocb->ret = -ECANCELED;
 if (ret != 0) {
 /* iocb is not cancelled, cb will be called by the event loop later */
@@ -179,7 +179,7 @@ static void laio_cancel(BlockAIOCB *blockacb)
 laiocb->common.cb(laiocb->common.opaque, laiocb->ret);
 
 /* check if there are requests in io queue */
-qemu_laio_start_retry(laiocb->ctx);
+qemu_laio_start_retry(laiocb->aio_state);
 }
 
 static const AIOCBInfo laio_aiocb_info = {
@@ -311,7 +311,7 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void 
*aio_ctx, int fd,
 
 laiocb = qemu_aio_get(&laio_aiocb_info, bs, cb, opaque);
 laiocb->nbytes = nb_sectors * 512;
-laiocb->ctx = s;
+laiocb->aio_state = s;
 laiocb->ret = -EINPROGRESS;
 laiocb->is_read = (type == QEMU_AIO_READ);
 laiocb->qiov = qiov;
-- 
1.7.9.5




[Qemu-devel] [PATCH 07/13] block/linux-aio: support IO submission as batch in AioContext wide

2014-11-08 Thread Ming Lei
This patch supports AioContext-wide IO submission as batch
by sharing one 'struct qemu_laio_state' instance among all linux-aio
backends in the same AioContext.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   36 ++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 17de2e3..c5c7944 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -52,6 +52,11 @@ typedef struct {
 QEMUBH *retry;
 } LaioQueue;
 
+typedef struct LaioTrackedBs {
+BlockDriverState *bs;
+QLIST_ENTRY(LaioTrackedBs) list;
+} LaioTrackedBs;
+
 /* lifetime: between aio_attach and aio_detach */
 struct qemu_laio_state {
 io_context_t ctx;
@@ -65,6 +70,9 @@ struct qemu_laio_state {
 struct io_event events[MAX_EVENTS];
 int event_idx;
 int event_max;
+
+/* All BS in the list shared this 'qemu_laio_state' */
+QLIST_HEAD(, LaioTrackedBs) tracked_bs;
 };
 
 typedef struct {
@@ -433,6 +441,23 @@ void laio_detach_aio_context(void *s_, BlockDriverState 
*bs,
 AioContext *old_context)
 {
 QemuLaioState *qs = s_;
+LaioTrackedBs *tbs, *ntbs;
+
+QLIST_FOREACH_SAFE(tbs, &qs->state->tracked_bs, list, ntbs) {
+if (tbs->bs == bs) {
+QLIST_REMOVE(tbs, list);
+g_free(tbs);
+}
+}
+
+if (!aio_detach_aio_bs(old_context, bs)) {
+/* assign new master aio bs for the aio context */
+if (old_context->master_aio_bs == bs) {
+tbs = QLIST_FIRST(&qs->state->tracked_bs);
+old_context->master_aio_bs = tbs->bs;
+}
+return;
+}
 
 laio_state_free(qs->state, old_context);
 qs->state = NULL;
@@ -442,9 +467,16 @@ void laio_attach_aio_context(void *s_, BlockDriverState 
*bs,
 AioContext *new_context)
 {
 QemuLaioState *qs = s_;
-struct qemu_laio_state *s = laio_state_alloc(new_context);
+LaioTrackedBs *tbs = g_malloc0(sizeof(*tbs));
+
+if (aio_attach_aio_bs(new_context, bs)) {
+new_context->opaque = qs->state = laio_state_alloc(new_context);
+} else {
+qs->state = new_context->opaque;
+}
 
-qs->state = s;
+tbs->bs = bs;
+QLIST_INSERT_HEAD(&qs->state->tracked_bs, tbs, list);
 }
 
 void *laio_init(void)
-- 
1.7.9.5




[Qemu-devel] [PATCH 04/13] block/linux-aio: do more things in laio_state_alloc() and its pair

2014-11-08 Thread Ming Lei
Now the lifetimes of 'completion_bh', the io queue and the io context
are the same, so move their allocation into laio_state_alloc() and
their release into laio_state_free().

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   26 --
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 5fa3c1e..95cd0dc 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -388,7 +388,7 @@ static void laio_free_ioq(struct qemu_laio_state *s, 
LaioQueue *ioq)
 s->io_q = NULL;
 }
 
-static struct qemu_laio_state *laio_state_alloc(void)
+static struct qemu_laio_state *laio_state_alloc(AioContext *context)
 {
 struct qemu_laio_state *s;
 
@@ -401,6 +401,10 @@ static struct qemu_laio_state *laio_state_alloc(void)
 goto out_close_efd;
 }
 
+s->io_q = laio_alloc_ioq(context, s);
+s->completion_bh = aio_bh_new(context, qemu_laio_completion_bh, s);
+aio_set_event_notifier(context, &s->e, qemu_laio_completion_cb);
+
 return s;
 
 out_close_efd:
@@ -410,8 +414,12 @@ out_free_state:
 return NULL;
 }
 
-static void laio_state_free(struct qemu_laio_state *s)
+static void laio_state_free(struct qemu_laio_state *s, AioContext *context)
 {
+aio_set_event_notifier(context, &s->e, NULL);
+qemu_bh_delete(s->completion_bh);
+
+laio_free_ioq(s, s->io_q);
 event_notifier_cleanup(&s->e);
 
 if (io_destroy(s->ctx) != 0) {
@@ -424,25 +432,15 @@ static void laio_state_free(struct qemu_laio_state *s)
 void laio_detach_aio_context(void *s_, AioContext *old_context)
 {
 QemuLaioState *qs = s_;
-struct qemu_laio_state *s = qs->state;
 
-aio_set_event_notifier(old_context, &s->e, NULL);
-qemu_bh_delete(s->completion_bh);
-
-laio_free_ioq(s, s->io_q);
-laio_state_free(s);
+laio_state_free(qs->state, old_context);
 qs->state = NULL;
 }
 
 void laio_attach_aio_context(void *s_, AioContext *new_context)
 {
 QemuLaioState *qs = s_;
-struct qemu_laio_state *s = laio_state_alloc();
-
-s->io_q = laio_alloc_ioq(new_context, s);
-
-s->completion_bh = aio_bh_new(new_context, qemu_laio_completion_bh, s);
-aio_set_event_notifier(new_context, &s->e, qemu_laio_completion_cb);
+struct qemu_laio_state *s = laio_state_alloc(new_context);
 
 qs->state = s;
 }
-- 
1.7.9.5




[Qemu-devel] [PATCH 03/13] block/linux-aio: allocate 'struct qemu_laio_state' dynamically

2014-11-08 Thread Ming Lei
This patch allocates 'struct qemu_laio_state' in aio attach,
and frees it in aio detach, so that in the following patch
we can share a single instance of the structure among multiple
linux-aio backends in the same AioContext.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   80 +++--
 1 file changed, 53 insertions(+), 27 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index cf8691e..5fa3c1e 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -52,6 +52,7 @@ typedef struct {
 QEMUBH *retry;
 } LaioQueue;
 
+/* lifetime: between aio_attach and aio_detach */
 struct qemu_laio_state {
 io_context_t ctx;
 EventNotifier e;
@@ -66,6 +67,10 @@ struct qemu_laio_state {
 int event_max;
 };
 
+typedef struct {
+struct qemu_laio_state *state;
+} QemuLaioState;
+
 static inline ssize_t io_event_ret(struct io_event *ev)
 {
 return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
@@ -277,14 +282,16 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct 
iocb *iocb)
 
 void laio_io_plug(BlockDriverState *bs, void *aio_ctx)
 {
-struct qemu_laio_state *s = aio_ctx;
+QemuLaioState *qs = aio_ctx;
+struct qemu_laio_state *s = qs->state;
 
 s->io_q->plugged++;
 }
 
 int laio_io_unplug(BlockDriverState *bs, void *aio_ctx, bool unplug)
 {
-struct qemu_laio_state *s = aio_ctx;
+QemuLaioState *qs = aio_ctx;
+struct qemu_laio_state *s = qs->state;
 int ret = 0;
 
 assert(s->io_q->plugged > 0 || !unplug);
@@ -304,7 +311,8 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void 
*aio_ctx, int fd,
 int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
 BlockCompletionFunc *cb, void *opaque, int type)
 {
-struct qemu_laio_state *s = aio_ctx;
+QemuLaioState *qs = aio_ctx;
+struct qemu_laio_state *s = qs->state;
 struct qemu_laiocb *laiocb;
 struct iocb *iocbs;
 off_t offset = sector_num * 512;
@@ -380,27 +388,7 @@ static void laio_free_ioq(struct qemu_laio_state *s, 
LaioQueue *ioq)
 s->io_q = NULL;
 }
 
-void laio_detach_aio_context(void *s_, AioContext *old_context)
-{
-struct qemu_laio_state *s = s_;
-
-aio_set_event_notifier(old_context, &s->e, NULL);
-qemu_bh_delete(s->completion_bh);
-
-laio_free_ioq(s, s->io_q);
-}
-
-void laio_attach_aio_context(void *s_, AioContext *new_context)
-{
-struct qemu_laio_state *s = s_;
-
-s->io_q = laio_alloc_ioq(new_context, s);
-
-s->completion_bh = aio_bh_new(new_context, qemu_laio_completion_bh, s);
-aio_set_event_notifier(new_context, &s->e, qemu_laio_completion_cb);
-}
-
-void *laio_init(void)
+static struct qemu_laio_state *laio_state_alloc(void)
 {
 struct qemu_laio_state *s;
 
@@ -422,10 +410,8 @@ out_free_state:
 return NULL;
 }
 
-void laio_cleanup(void *s_)
+static void laio_state_free(struct qemu_laio_state *s)
 {
-struct qemu_laio_state *s = s_;
-
 event_notifier_cleanup(&s->e);
 
 if (io_destroy(s->ctx) != 0) {
@@ -434,3 +420,43 @@ void laio_cleanup(void *s_)
 }
 g_free(s);
 }
+
+void laio_detach_aio_context(void *s_, AioContext *old_context)
+{
+QemuLaioState *qs = s_;
+struct qemu_laio_state *s = qs->state;
+
+aio_set_event_notifier(old_context, &s->e, NULL);
+qemu_bh_delete(s->completion_bh);
+
+laio_free_ioq(s, s->io_q);
+laio_state_free(s);
+qs->state = NULL;
+}
+
+void laio_attach_aio_context(void *s_, AioContext *new_context)
+{
+QemuLaioState *qs = s_;
+struct qemu_laio_state *s = laio_state_alloc();
+
+s->io_q = laio_alloc_ioq(new_context, s);
+
+s->completion_bh = aio_bh_new(new_context, qemu_laio_completion_bh, s);
+aio_set_event_notifier(new_context, &s->e, qemu_laio_completion_cb);
+
+qs->state = s;
+}
+
+void *laio_init(void)
+{
+QemuLaioState *s = g_malloc0(sizeof(*s));
+
+return s;
+}
+
+void laio_cleanup(void *s_)
+{
+QemuLaioState *s = s_;
+
+g_free(s);
+}
-- 
1.7.9.5




[Qemu-devel] [PATCH 01/13] block/linux-aio: allocate io queue dynamically

2014-11-08 Thread Ming Lei
This patch allocates the io queue dynamically so that we
can support an AioContext-wide io queue in the following
patch.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   66 +
 1 file changed, 41 insertions(+), 25 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index b12da25..b9db28e 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -57,7 +57,7 @@ struct qemu_laio_state {
 EventNotifier e;
 
 /* io queue for submit at batch */
-LaioQueue io_q;
+LaioQueue *io_q;
 
 /* I/O completion processing */
 QEMUBH *completion_bh;
@@ -146,8 +146,8 @@ static void qemu_laio_completion_bh(void *opaque)
 
 static void qemu_laio_start_retry(struct qemu_laio_state *s)
 {
-if (s->io_q.idx)
-qemu_bh_schedule(s->io_q.retry);
+if (s->io_q->idx)
+qemu_bh_schedule(s->io_q->retry);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -197,8 +197,8 @@ static void ioq_init(LaioQueue *io_q)
 static void abort_queue(struct qemu_laio_state *s)
 {
 int i;
-for (i = 0; i < s->io_q.idx; i++) {
-struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
+for (i = 0; i < s->io_q->idx; i++) {
+struct qemu_laiocb *laiocb = container_of(s->io_q->iocbs[i],
   struct qemu_laiocb,
   iocb);
 laiocb->ret = -EIO;
@@ -209,14 +209,14 @@ static void abort_queue(struct qemu_laio_state *s)
 static int ioq_submit(struct qemu_laio_state *s, bool enqueue)
 {
 int ret, i = 0;
-int len = s->io_q.idx;
+int len = s->io_q->idx;
 int j = 0;
 
 if (!len) {
 return 0;
 }
 
-ret = io_submit(s->ctx, len, s->io_q.iocbs);
+ret = io_submit(s->ctx, len, s->io_q->iocbs);
 if (ret == -EAGAIN) { /* retry in following completion cb */
 return 0;
 } else if (ret < 0) {
@@ -232,7 +232,7 @@ static int ioq_submit(struct qemu_laio_state *s, bool 
enqueue)
 }
 
 for (i = ret; i < len; i++) {
-s->io_q.iocbs[j++] = s->io_q.iocbs[i];
+s->io_q->iocbs[j++] = s->io_q->iocbs[i];
 }
 
  out:
@@ -240,7 +240,7 @@ static int ioq_submit(struct qemu_laio_state *s, bool 
enqueue)
  * update io queue, for partial completion, retry will be
  * started automatically in following completion cb.
  */
-s->io_q.idx -= ret;
+s->io_q->idx -= ret;
 
 return ret;
 }
@@ -253,22 +253,22 @@ static void ioq_submit_retry(void *opaque)
 
 static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 {
-unsigned int idx = s->io_q.idx;
+unsigned int idx = s->io_q->idx;
 
-if (unlikely(idx == s->io_q.size)) {
+if (unlikely(idx == s->io_q->size)) {
 return -1;
 }
 
-s->io_q.iocbs[idx++] = iocb;
-s->io_q.idx = idx;
+s->io_q->iocbs[idx++] = iocb;
+s->io_q->idx = idx;
 
 /* don't submit until next completion for -EAGAIN of non plug case */
-if (unlikely(!s->io_q.plugged)) {
+if (unlikely(!s->io_q->plugged)) {
 return 0;
 }
 
 /* submit immediately if queue depth is above 2/3 */
-if (idx > s->io_q.size * 2 / 3) {
+if (idx > s->io_q->size * 2 / 3) {
 return ioq_submit(s, true);
 }
 
@@ -279,7 +279,7 @@ void laio_io_plug(BlockDriverState *bs, void *aio_ctx)
 {
 struct qemu_laio_state *s = aio_ctx;
 
-s->io_q.plugged++;
+s->io_q->plugged++;
 }
 
 int laio_io_unplug(BlockDriverState *bs, void *aio_ctx, bool unplug)
@@ -287,13 +287,13 @@ int laio_io_unplug(BlockDriverState *bs, void *aio_ctx, 
bool unplug)
 struct qemu_laio_state *s = aio_ctx;
 int ret = 0;
 
-assert(s->io_q.plugged > 0 || !unplug);
+assert(s->io_q->plugged > 0 || !unplug);
 
-if (unplug && --s->io_q.plugged > 0) {
+if (unplug && --s->io_q->plugged > 0) {
 return 0;
 }
 
-if (s->io_q.idx > 0) {
+if (s->io_q->idx > 0) {
 ret = ioq_submit(s, false);
 }
 
@@ -333,10 +333,10 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void 
*aio_ctx, int fd,
 }
 io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
 
-if (!s->io_q.plugged) {
+if (!s->io_q->plugged) {
 int ret;
 
-if (!s->io_q.idx) {
+if (!s->io_q->idx) {
 ret = io_submit(s->ctx, 1, &iocbs);
 } else {
 ret = -EAGAIN;
@@ -364,20 +364,38 @@ out_free_aiocb:
 return NULL;
 }
 
+static LaioQueue *laio_alloc_ioq(AioContext *ctx, struct qemu_laio_state *s)
+{
+LaioQueue *ioq = g_malloc0(sizeof(*ioq));
+
+ioq_init(ioq);
+ioq->retry = aio_bh_new(ctx, ioq_submit_retry, s);
+return ioq;
+}
+
+static void laio_fre

[Qemu-devel] [PATCH 00/13] linux-aio/virtio-scsi: support AioContext wide IO submission as batch

2014-11-08 Thread Ming Lei
This patchset implements AioContext-wide batch IO submission, and
the idea behind it is very simple:

- linux native aio (io_submit) supports enqueueing read/write requests
to different files

- in one AioContext, I/O requests from the VM can be submitted to different
backends in the host; one typical example is multi-LUN SCSI

This patchset changes 'struct qemu_laio_state' to be per-AioContext,
so that multiple 'bs' can be associated with one single instance of
'struct qemu_laio_state'; AioContext-wide batch IO submission then
becomes easy to implement.
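
The kernel interface that makes this possible can be shown in isolation:
io_submit() happily takes iocbs that target different file descriptors in
one call. A minimal standalone sketch (link with -laio; the null_blk device
paths are just the ones from the test config below, and error handling is
omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb[2], *cbs[2] = { &cb[0], &cb[1] };
    struct io_event ev[2];
    void *buf0, *buf1;

    /* two different backends */
    int fd0 = open("/dev/nullb2", O_RDONLY | O_DIRECT);
    int fd1 = open("/dev/nullb3", O_RDONLY | O_DIRECT);

    posix_memalign(&buf0, 4096, 4096);
    posix_memalign(&buf1, 4096, 4096);
    io_setup(128, &ctx);

    /* one request per backend ... */
    io_prep_pread(&cb[0], fd0, buf0, 4096, 0);
    io_prep_pread(&cb[1], fd1, buf1, 4096, 0);

    /* ... submitted together: a single syscall covers both LUNs */
    io_submit(ctx, 2, cbs);
    io_getevents(ctx, 2, 2, ev, NULL);
    io_destroy(ctx);
    return 0;
}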

One simple test on my laptop shows ~20% throughput improvement
on randread from the VM (AioContext-wide IO batching vs. no batching)
with the below config:

-drive id=drive_scsi1-0-0-0,if=none,format=raw,cache=none,aio=native,file=/dev/nullb2 \
-drive id=drive_scsi1-0-0-1,if=none,format=raw,cache=none,aio=native,file=/dev/nullb3 \
-device virtio-scsi-pci,num_queues=4,id=scsi1,addr=07,iothread=iothread0 \
-device scsi-disk,bus=scsi1.0,channel=0,scsi-id=1,lun=0,drive=drive_scsi1-0-0-0,id=scsi1-0-0-0 \
-device scsi-disk,bus=scsi1.0,channel=0,scsi-id=1,lun=1,drive=drive_scsi1-0-0-1,id=scsi1-0-0-1 \

BTW, maybe more of a boost can be obtained, since ~33K write() system
calls per second can be observed while this test case is running, and
that might be a recent regression (BH?).

This patchset can be found on below tree too:

git://kernel.ubuntu.com/ming/qemu.git aio-io-batch.2

and these patches depend on the "linux-aio: fix batch submission" patches
in the link below:

http://marc.info/?l=qemu-devel&m=141528663106557&w=2

Any comments and suggestions are welcome.

 async.c |1 +
 block.c |   16 +++
 block/linux-aio.c   |  251 ++-
 block/raw-aio.h |6 +-
 block/raw-posix.c   |4 +-
 hw/scsi/virtio-scsi-dataplane.c |8 ++
 hw/scsi/virtio-scsi.c   |2 -
 include/block/aio.h |   27 +
 include/block/block.h   |3 +
 9 files changed, 259 insertions(+), 59 deletions(-)

Thanks,
Ming Lei




[Qemu-devel] [PATCH v3 1/3] linux-aio: fix submit aio as a batch

2014-11-06 Thread Ming Lei
In the enqueue path, we can't complete requests, otherwise
"Co-routine re-entered recursively" may be triggered, so this
patch fixes the issue with the following ideas:

- for -EAGAIN or partial completion, retry the submission by
scheduling a BH in the following completion cb
- for partial completion, also update the io queue
- for other failures, return the failure if in the enqueue path,
otherwise abort all queued I/O

Signed-off-by: Ming Lei 
---
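A note for reviewers: the core of the fix is easier to see in isolation.
The sketch below condenses ioq_submit() from this patch (names simplified,
QEMU plumbing dropped). io_submit() consumes iocbs in order, so a short
return of ret means the first ret requests are in flight and the tail must
be compacted and retried. On the enqueue side this is triggered once the
queue passes 2/3 of its capacity, i.e. from 86 pending iocbs on with
MAX_QUEUED_IO at 128.

#include <libaio.h>

static int ioq_submit_sketch(io_context_t ctx, struct iocb **iocbs,
                             unsigned int *idx)
{
    int len = *idx, i, j = 0;
    int ret = io_submit(ctx, len, iocbs);

    if (ret == -EAGAIN) {
        return 0;               /* keep the whole queue; retry later */
    }
    if (ret < 0) {
        return ret;             /* enqueue path reports; else abort all */
    }
    for (i = ret; i < len; i++) {
        iocbs[j++] = iocbs[i];  /* shift unsubmitted iocbs to the head */
    }
    *idx = len - ret;           /* queue now holds only the remainder */
    return ret;
}
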
 block/linux-aio.c |  101 +
 1 file changed, 79 insertions(+), 22 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index d92513b..f66e8ad 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -38,11 +38,19 @@ struct qemu_laiocb {
 QLIST_ENTRY(qemu_laiocb) node;
 };
 
+/*
+ * TODO: support batching I/O from multiple bs in the same
+ * AIO context; one important use case is multi-lun scsi,
+ * so in future the IO queue should be per AIO context.
+ */
 typedef struct {
 struct iocb *iocbs[MAX_QUEUED_IO];
 int plugged;
 unsigned int size;
 unsigned int idx;
+
+/* handle -EAGAIN and partial completion */
+QEMUBH *retry;
 } LaioQueue;
 
 struct qemu_laio_state {
@@ -137,6 +145,12 @@ static void qemu_laio_completion_bh(void *opaque)
 }
 }
 
+static void qemu_laio_start_retry(struct qemu_laio_state *s)
+{
+if (s->io_q.idx)
+qemu_bh_schedule(s->io_q.retry);
+}
+
 static void qemu_laio_completion_cb(EventNotifier *e)
 {
 struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);
@@ -144,6 +158,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
 if (event_notifier_test_and_clear(&s->e)) {
 qemu_bh_schedule(s->completion_bh);
 }
+qemu_laio_start_retry(s);
 }
 
 static void laio_cancel(BlockAIOCB *blockacb)
@@ -163,6 +178,9 @@ static void laio_cancel(BlockAIOCB *blockacb)
 }
 
 laiocb->common.cb(laiocb->common.opaque, laiocb->ret);
+
+/* check if there are requests in io queue */
+qemu_laio_start_retry(laiocb->ctx);
 }
 
 static const AIOCBInfo laio_aiocb_info = {
@@ -177,45 +195,80 @@ static void ioq_init(LaioQueue *io_q)
 io_q->plugged = 0;
 }
 
-static int ioq_submit(struct qemu_laio_state *s)
+static void abort_queue(struct qemu_laio_state *s)
+{
+int i;
+for (i = 0; i < s->io_q.idx; i++) {
+struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
+  struct qemu_laiocb,
+  iocb);
+laiocb->ret = -EIO;
+qemu_laio_process_completion(s, laiocb);
+}
+}
+
+static int ioq_submit(struct qemu_laio_state *s, bool enqueue)
 {
 int ret, i = 0;
 int len = s->io_q.idx;
+int j = 0;
 
-do {
-ret = io_submit(s->ctx, len, s->io_q.iocbs);
-} while (i++ < 3 && ret == -EAGAIN);
+if (!len) {
+return 0;
+}
 
-/* empty io queue */
-s->io_q.idx = 0;
+ret = io_submit(s->ctx, len, s->io_q.iocbs);
+if (ret == -EAGAIN) { /* retry in following completion cb */
+return 0;
+} else if (ret < 0) {
+if (enqueue) {
+return ret;
+}
 
-if (ret < 0) {
-i = 0;
-} else {
-i = ret;
+/* in non-queue path, all IOs have to be completed */
+abort_queue(s);
+ret = len;
+} else if (ret == 0) {
+goto out;
 }
 
-for (; i < len; i++) {
-struct qemu_laiocb *laiocb =
-container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
-
-laiocb->ret = (ret < 0) ? ret : -EIO;
-qemu_laio_process_completion(s, laiocb);
+for (i = ret; i < len; i++) {
+s->io_q.iocbs[j++] = s->io_q.iocbs[i];
 }
+
+ out:
+/*
+ * update io queue, for partial completion, retry will be
+ * started automatically in following completion cb.
+ */
+s->io_q.idx -= ret;
+
 return ret;
 }
 
-static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
+static void ioq_submit_retry(void *opaque)
+{
+struct qemu_laio_state *s = opaque;
+ioq_submit(s, false);
+}
+
+static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 {
 unsigned int idx = s->io_q.idx;
 
+if (unlikely(idx == s->io_q.size)) {
+return -1;
+}
+
 s->io_q.iocbs[idx++] = iocb;
 s->io_q.idx = idx;
 
-/* submit immediately if queue is full */
-if (idx == s->io_q.size) {
-ioq_submit(s);
+/* submit immediately if queue depth is above 2/3 */
+if (idx > s->io_q.size * 2 / 3) {
+return ioq_submit(s, true);
 }
+
+return 0;
 }
 
 void laio_io_plug(BlockDriverState *bs, void *aio_ctx)
@@ -237,7 +290,7 @@ int laio_io_unplug(BlockDriverState *bs, void *aio_ctx, 
bool unplug)
 }
 
 if (s->io_q.idx >

[Qemu-devel] [PATCH v3 2/3] linux-aio: handling -EAGAIN for !s->io_q.plugged case

2014-11-06 Thread Ming Lei
Previously -EAGAIN was simply ignored in the !s->io_q.plugged case,
and it could easily cause -EIO in the VM, such as with an NVMe device.

This patch handles -EAGAIN via the io queue for the !s->io_q.plugged case,
and the request will be retried in a following aio completion cb.

Suggested-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
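A note for reviewers: the non-plugged control flow in miniature (assumed
names, not the exact QEMU code; enqueue_for_retry() is a hypothetical
stand-in for ioq_enqueue() plus the retry BH). Submit directly only when
nothing is queued; otherwise, or on -EAGAIN, the request joins the queue
and is resubmitted from the next aio completion callback.

#include <libaio.h>
#include <stdbool.h>

static int enqueue_for_retry(struct iocb *iocb)
{
    (void)iocb;             /* a real version stores it in the io queue */
    return 0;
}

static int submit_or_queue(io_context_t ctx, struct iocb *iocb,
                           bool queue_empty)
{
    int ret = queue_empty ? io_submit(ctx, 1, &iocb) : -EAGAIN;

    if (ret == -EAGAIN) {
        return enqueue_for_retry(iocb);
    }
    return ret < 0 ? ret : 0;
}
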
 block/linux-aio.c |   22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index f66e8ad..f5ca41d 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -263,6 +263,11 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct 
iocb *iocb)
 s->io_q.iocbs[idx++] = iocb;
 s->io_q.idx = idx;
 
+/* don't submit until next completion for -EAGAIN of non plug case */
+if (unlikely(!s->io_q.plugged)) {
+return 0;
+}
+
 /* submit immediately if queue depth is above 2/3 */
 if (idx > s->io_q.size * 2 / 3) {
 return ioq_submit(s, true);
@@ -330,10 +335,25 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void 
*aio_ctx, int fd,
 io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
 
 if (!s->io_q.plugged) {
-if (io_submit(s->ctx, 1, &iocbs) < 0) {
+int ret;
+
+if (!s->io_q.idx) {
+ret = io_submit(s->ctx, 1, &iocbs);
+} else {
+ret = -EAGAIN;
+}
+/*
+ * Switch to queue mode until -EAGAIN is handled, we suppose
+ * there is always uncompleted I/O, so try to enqueue it first,
+ * and will be submitted again in following aio completion cb.
+ */
+if (ret == -EAGAIN) {
+goto enqueue;
+} else if (ret < 0) {
 goto out_free_aiocb;
 }
 } else {
+ enqueue:
 if (ioq_enqueue(s, iocbs) < 0) {
 goto out_free_aiocb;
 }
-- 
1.7.9.5




[Qemu-devel] [PATCH v3 0/3] linux-aio: fix batch submission

2014-11-06 Thread Ming Lei
The 1st patch fixes batch submission.

The 2nd one fixes -EAGAIN for the non-batch case.

The 3rd one is a cleanup.

This patchset is split from the previous patchset (dataplane: optimization
and multi virtqueue support), as suggested by Stefan.

v3:
- rebase on QEMU master
v2:
- code style fix and commit log fix as suggested by Benoît Canet
v1:
- rebase on latest QEMU master

 block/linux-aio.c |  124 ++---
 1 file changed, 100 insertions(+), 24 deletions(-)

Thanks,




[Qemu-devel] [PATCH v3 3/3] linux-aio: remove 'node' from 'struct qemu_laiocb'

2014-11-06 Thread Ming Lei
No one uses the 'node' field any more, so remove it
from 'struct qemu_laiocb'; this saves 16 bytes (a QLIST_ENTRY
is two pointers) in the struct on a 64-bit arch.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index f5ca41d..b12da25 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -35,7 +35,6 @@ struct qemu_laiocb {
 size_t nbytes;
 QEMUIOVector *qiov;
 bool is_read;
-QLIST_ENTRY(qemu_laiocb) node;
 };
 
 /*
-- 
1.7.9.5




Re: [Qemu-devel] [PATCH v2 0/4] linux-aio: fix batch submission

2014-10-15 Thread Ming Lei
Hi Stefan and Guys,

On Thu, Sep 4, 2014 at 6:27 PM, Ming Lei  wrote:
> The 1st patch fixes batch submission.
>
> The 2nd one fixes -EAGAIN for non-batch case.
>
> The 3rd one is a cleanup.
>
> The 4th one increase max event to 256 for supporting the comming
> multi virt-queue.
>
> This patchset is splitted from previous patchset(dataplane: optimization
> and multi virtqueue support), as suggested by Stefan.
>
> These patches have been running well in my box for weeks, and hope
> they can be merged soon, and I have some patches which do depend them.
>
> V2:
> - code style fix and commit log fix as suggested by Benoît Canet
>
> V1:
> - rebase on latest QEMU master
>
>  block/linux-aio.c |  131 
> +
>  1 file changed, 103 insertions(+), 28 deletions(-)

Could you take a look at these patches, or just the first two/three?

With this fix merged, work can start on extending batch I/O
submission to the multi-LUN SCSI case, a kind of optimization
that vhost-scsi can't do at all.

Also the multi-queue patches depend on these patches too.

The 1st one has a one-line conflict with io_cancel() against QEMU
master, so please let me know if you need me to resend, or if you
have other comments.

Thanks,
--
Ming Lei



Re: [Qemu-devel] [PATCH v2 0/4] linux-aio: fix batch submission

2014-09-09 Thread Ming Lei
Hi Paolo, Stefan and Kevin,

On Fri, Sep 5, 2014 at 12:27 AM, Ming Lei  wrote:
> The 1st patch fixes batch submission.
>
> The 2nd one fixes -EAGAIN for non-batch case.
>
> The 3rd one is a cleanup.
>
> The 4th one increase max event to 256 for supporting the comming
> multi virt-queue.
>
> This patchset is splitted from previous patchset(dataplane: optimization
> and multi virtqueue support), as suggested by Stefan.
>
> These patches have been running well in my box for weeks, and hope
> they can be merged soon, and I have some patches which do depend them.
>
> V2:
> - code style fix and commit log fix as suggested by Benoît Canet
>
> V1:
> - rebase on latest QEMU master
>

Could you take a look at this patchset?  It has been blocked for weeks.


Thanks,
--
Ming Lei



[Qemu-devel] [PATCH v2 4/4] linux-aio: increase max event to 256

2014-09-04 Thread Ming Lei
This patch increases the max events to 256 for the coming
virtio-blk multi virtqueue support.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index f45a142..5d565ad 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -23,7 +23,7 @@
  *  than this we will get EAGAIN from io_submit which is communicated to
  *  the guest as an I/O error.
  */
-#define MAX_EVENTS 128
+#define MAX_EVENTS 256
 
 #define MAX_QUEUED_IO  128
 
-- 
1.7.9.5




[Qemu-devel] [PATCH v2 1/4] linux-aio: fix submit aio as a batch

2014-09-04 Thread Ming Lei
In the enqueue path, we can't complete requests, otherwise
"Co-routine re-entered recursively" may be triggered, so this
patch fixes the issue with the following ideas:

- for -EAGAIN or partial completion, retry the submission by
scheduling a BH in the following completion cb
- for partial completion, also update the io queue
- for other failures, return the failure if in the enqueue path,
otherwise abort all queued I/O

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |  106 -
 1 file changed, 81 insertions(+), 25 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 9aca758..a06576d 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -38,11 +38,19 @@ struct qemu_laiocb {
 QLIST_ENTRY(qemu_laiocb) node;
 };
 
-typedef struct {
+/*
+ * TODO: support batching I/O from multiple bs in the same
+ * AIO context; one important use case is multi-lun scsi,
+ * so in future the IO queue should be per AIO context.
+ */
+typedef struct LaioQueue {
 struct iocb *iocbs[MAX_QUEUED_IO];
 int plugged;
-unsigned int size;
-unsigned int idx;
+uint32 size;
+uint32 idx;
+
+/* handle -EAGAIN and partial completion */
+QEMUBH *retry;
 } LaioQueue;
 
 struct qemu_laio_state {
@@ -138,6 +146,13 @@ static void qemu_laio_completion_bh(void *opaque)
 }
 }
 
+static void qemu_laio_start_retry(struct qemu_laio_state *s)
+{
+if (s->io_q.idx) {
+qemu_bh_schedule(s->io_q.retry);
+}
+}
+
 static void qemu_laio_completion_cb(EventNotifier *e)
 {
 struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);
@@ -145,6 +160,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
 if (event_notifier_test_and_clear(&s->e)) {
 qemu_bh_schedule(s->completion_bh);
 }
+qemu_laio_start_retry(s);
 }
 
 static void laio_cancel(BlockDriverAIOCB *blockacb)
@@ -164,6 +180,7 @@ static void laio_cancel(BlockDriverAIOCB *blockacb)
 ret = io_cancel(laiocb->ctx->ctx, &laiocb->iocb, &event);
 if (ret == 0) {
 laiocb->ret = -ECANCELED;
+qemu_laio_start_retry(laiocb->ctx);
 return;
 }
 
@@ -191,45 +208,80 @@ static void ioq_init(LaioQueue *io_q)
 io_q->plugged = 0;
 }
 
-static int ioq_submit(struct qemu_laio_state *s)
+static void abort_queue(struct qemu_laio_state *s)
+{
+int i;
+for (i = 0; i < s->io_q.idx; i++) {
+struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
+  struct qemu_laiocb,
+  iocb);
+laiocb->ret = -EIO;
+qemu_laio_process_completion(s, laiocb);
+}
+}
+
+static int ioq_submit(struct qemu_laio_state *s, bool enqueue)
 {
 int ret, i = 0;
 int len = s->io_q.idx;
+int j = 0;
 
-do {
-ret = io_submit(s->ctx, len, s->io_q.iocbs);
-} while (i++ < 3 && ret == -EAGAIN);
+if (!len) {
+return 0;
+}
 
-/* empty io queue */
-s->io_q.idx = 0;
+ret = io_submit(s->ctx, len, s->io_q.iocbs);
+if (ret == -EAGAIN) { /* retry in following completion cb */
+return 0;
+} else if (ret < 0) {
+if (enqueue) {
+return ret;
+}
 
-if (ret < 0) {
-i = 0;
-} else {
-i = ret;
+/* in non-queue path, all IOs have to be completed */
+abort_queue(s);
+ret = len;
+} else if (ret == 0) {
+goto out;
 }
 
-for (; i < len; i++) {
-struct qemu_laiocb *laiocb =
-container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
-
-laiocb->ret = (ret < 0) ? ret : -EIO;
-qemu_laio_process_completion(s, laiocb);
+for (i = ret; i < len; i++) {
+s->io_q.iocbs[j++] = s->io_q.iocbs[i];
 }
+
+ out:
+/*
+ * update io queue, for partial completion, retry will be
+ * started automatically in following completion cb.
+ */
+s->io_q.idx -= ret;
+
 return ret;
 }
 
-static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
+static void ioq_submit_retry(void *opaque)
+{
+struct qemu_laio_state *s = opaque;
+ioq_submit(s, false);
+}
+
+static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 {
 unsigned int idx = s->io_q.idx;
 
+if (unlikely(idx == s->io_q.size)) {
+return -1;
+}
+
 s->io_q.iocbs[idx++] = iocb;
 s->io_q.idx = idx;
 
-/* submit immediately if queue is full */
-if (idx == s->io_q.size) {
-ioq_submit(s);
+/* submit immediately if queue depth is above 2/3 */
+if (idx > s->io_q.size * 2 / 3) {
+return ioq_submit(s, true);
 }
+
+return 0;
 }
 
 void laio_io_plug(BlockDriverState *bs, void *aio_ctx)
@@ -251,7 +303,7 @@ int laio_io_unplug(Blo

[Qemu-devel] [PATCH v2 3/4] linux-aio: remove 'node' from 'struct qemu_laiocb'

2014-09-04 Thread Ming Lei
No one uses the 'node' field any more, so remove it
from 'struct qemu_laiocb'; this saves 16 bytes (a QLIST_ENTRY
is two pointers) in the struct on a 64-bit arch.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 20a87ec..f45a142 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -35,7 +35,6 @@ struct qemu_laiocb {
 size_t nbytes;
 QEMUIOVector *qiov;
 bool is_read;
-QLIST_ENTRY(qemu_laiocb) node;
 };
 
 /*
-- 
1.7.9.5




[Qemu-devel] [PATCH v2 2/4] linux-aio: handling -EAGAIN for !s->io_q.plugged case

2014-09-04 Thread Ming Lei
Previously -EAGAIN was simply ignored in the !s->io_q.plugged case,
and it could easily cause -EIO in the VM, such as with an NVMe device.

This patch handles -EAGAIN via the io queue for the !s->io_q.plugged case,
and the request will be retried in a following aio completion cb.

Suggested-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index a06576d..20a87ec 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -276,6 +276,11 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct 
iocb *iocb)
 s->io_q.iocbs[idx++] = iocb;
 s->io_q.idx = idx;
 
+/* don't submit until next completion for -EAGAIN of non plug case */
+if (unlikely(!s->io_q.plugged)) {
+return 0;
+}
+
 /* submit immediately if queue depth is above 2/3 */
 if (idx > s->io_q.size * 2 / 3) {
 return ioq_submit(s, true);
@@ -343,10 +348,25 @@ BlockDriverAIOCB *laio_submit(BlockDriverState *bs, void 
*aio_ctx, int fd,
 io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
 
 if (!s->io_q.plugged) {
-if (io_submit(s->ctx, 1, &iocbs) < 0) {
+int ret;
+
+if (!s->io_q.idx) {
+ret = io_submit(s->ctx, 1, &iocbs);
+} else {
+ret = -EAGAIN;
+}
+/*
+ * Switch to queue mode until -EAGAIN is handled, we suppose
+ * there is always uncompleted I/O, so try to enqueue it first,
+ * and will be submitted again in following aio completion cb.
+ */
+if (ret == -EAGAIN) {
+goto enqueue;
+} else if (ret < 0) {
 goto out_free_aiocb;
 }
 } else {
+ enqueue:
 if (ioq_enqueue(s, iocbs) < 0) {
 goto out_free_aiocb;
 }
-- 
1.7.9.5




[Qemu-devel] [PATCH v2 0/4] linux-aio: fix batch submission

2014-09-04 Thread Ming Lei
The 1st patch fixes batch submission.

The 2nd one fixes -EAGAIN for the non-batch case.

The 3rd one is a cleanup.

The 4th one increases the max events to 256 to support the coming
multi virtqueue.

This patchset is split from the previous patchset (dataplane: optimization
and multi virtqueue support), as suggested by Stefan.

These patches have been running well on my box for weeks, and I hope
they can be merged soon; I have some patches which depend on them.

V2:
- code style fix and commit log fix as suggested by Benoît Canet

V1:
- rebase on latest QEMU master

 block/linux-aio.c |  131 +
 1 file changed, 103 insertions(+), 28 deletions(-)


Thanks,
--
Ming Lei





Re: [Qemu-devel] [PATCH 1/4] linux-aio: fix submit aio as a batch

2014-09-04 Thread Ming Lei
On Thu, Sep 4, 2014 at 10:59 PM, Benoît Canet  wrote:
> The Thursday 14 Aug 2014 à 17:41:41 (+0800), Ming Lei wrote :
>> In the enqueue path, we can't complete request, otherwise
>> "Co-routine re-entered recursively" may be caused, so this
>> patch fixes the issue with below ideas:
>
> s/with below ideas/with the following ideas/g
>
>>
>>   - for -EAGAIN or partial completion, retry the submision by
>
> s/submision/submission/
>
>>   schedule an BH in following completion cb
>
> s/schedule an/sheduling a/
>
>>   - for part of completion, also update the io queue
>>   - for other failure, return the failure if in enqueue path,
>>   otherwise, abort all queued I/O
>>
>> Signed-off-by: Ming Lei 
>> ---
>>  block/linux-aio.c |   99 
>> +
>>  1 file changed, 77 insertions(+), 22 deletions(-)
>>
>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>> index 7ac7e8c..4cdf507 100644
>> --- a/block/linux-aio.c
>> +++ b/block/linux-aio.c
>> @@ -38,11 +38,19 @@ struct qemu_laiocb {
>>  QLIST_ENTRY(qemu_laiocb) node;
>>  };
>>
>> +/*
>> + * TODO: support to batch I/O from multiple bs in one same
>> + * AIO context, one important use case is multi-lun scsi,
>> + * so in future the IO queue should be per AIO context.
>> + */
>>  typedef struct {
>
> In QEMU we typically write the name twice in these kind of declarations:
>
> typedef struct LaioQueue {
> ... stuff ...
> } LaioQueue;
>
>>  struct iocb *iocbs[MAX_QUEUED_IO];
>>  int plugged;
>
> Are plugged values either 0 and 1 ?
> If so it should be "bool plugged;"

It is actually a reference counter.
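
To spell out the counter semantics (a sketch with assumed names; the real
functions are laio_io_plug()/laio_io_unplug() in this file): plugs nest,
requests only queue up while the count is positive, and the unplug that
drops it to zero flushes the queue.

struct plug_sketch {
    int plugged;
    int queued;
};

static void plug(struct plug_sketch *q)
{
    q->plugged++;                      /* nestable, hence a counter */
}

static int unplug(struct plug_sketch *q)
{
    if (--q->plugged == 0 && q->queued > 0) {
        /* submit the queued iocbs here */
        return q->queued;
    }
    return 0;
}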

>
>>  unsigned int size;
>>  unsigned int idx;
>
> See:
>
> benoit@Laure:~/code/qemu$ git grep "unsigned int"|wc
>2283   14038  154201
> benoit@Laure:~/code/qemu$ git grep "uint32"|wc
>   12535   63129  810822
>
> Maybe you could use the most popular type.
>
>> +
>> +/* handle -EAGAIN and partial completion */
>> +QEMUBH *retry;
>>  } LaioQueue;
>>
>>  struct qemu_laio_state {
>> @@ -86,6 +94,12 @@ static void qemu_laio_process_completion(struct 
>> qemu_laio_state *s,
>>  qemu_aio_release(laiocb);
>>  }
>>
>> +static void qemu_laio_start_retry(struct qemu_laio_state *s)
>> +{
>
>
>> +if (s->io_q.idx)
>> +qemu_bh_schedule(s->io_q.retry);
>
> In QEMU this test is writen like this:
>
> if (s->io_q.idx) {
> qemu_bh_schedule(s->io_q.retry);
> }
>
> I suggest you ran ./scripts/checkpatch.pl on your series before submitting it.

I don't know why the below pre-commit hook didn't complain about that:

[tom@qemu]$cat .git/hooks/pre-commit
#!/bin/bash
exec git diff --cached | scripts/checkpatch.pl --no-signoff -q -

>
>
>> +}
>> +
>>  static void qemu_laio_completion_cb(EventNotifier *e)
>>  {
>>  struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);
>> @@ -108,6 +122,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
>>  qemu_laio_process_completion(s, laiocb);
>>  }
>>  }
>> +qemu_laio_start_retry(s);
>>  }
>>
>>  static void laio_cancel(BlockDriverAIOCB *blockacb)
>> @@ -127,6 +142,7 @@ static void laio_cancel(BlockDriverAIOCB *blockacb)
>>  ret = io_cancel(laiocb->ctx->ctx, &laiocb->iocb, &event);
>>  if (ret == 0) {
>>  laiocb->ret = -ECANCELED;
>> +qemu_laio_start_retry(laiocb->ctx);
>>  return;
>>  }
>>
>> @@ -154,45 +170,80 @@ static void ioq_init(LaioQueue *io_q)
>>  io_q->plugged = 0;
>>  }
>>
>> -static int ioq_submit(struct qemu_laio_state *s)
>> +static void abort_queue(struct qemu_laio_state *s)
>> +{
>> +int i;
>> +for (i = 0; i < s->io_q.idx; i++) {
>> +struct qemu_laiocb *laiocb = container_of(s->io_q.iocbs[i],
>> +  struct qemu_laiocb,
>> +  iocb);
>> +laiocb->ret = -EIO;
>> +qemu_laio_process_completion(s, laiocb);
>> +}
>> +}
>> +
>> +static int ioq_submit(struct qemu_laio_state *s, bool enqueue)
>>  {
>>  int ret, i = 0;
>>  int len = s->io_q.idx;
>> +int j = 0;
>>
>> -do {
>>

[Qemu-devel] [PATCH v1 2/4] linux-aio: handling -EAGAIN for !s->io_q.plugged case

2014-09-03 Thread Ming Lei
Previously -EAGAIN was simply ignored in the !s->io_q.plugged case,
and it could easily cause -EIO in the VM, such as with an NVMe device.

This patch handles -EAGAIN via the io queue for the !s->io_q.plugged case,
and the request will be retried in a following aio completion cb.

Suggested-by: Paolo Bonzini 
Signed-off-by: Ming Lei 
---
 block/linux-aio.c |   22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index ff66d1e..a979331 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -275,6 +275,11 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct 
iocb *iocb)
 s->io_q.iocbs[idx++] = iocb;
 s->io_q.idx = idx;
 
+/* don't submit until next completion for -EAGAIN of non plug case */
+if (unlikely(!s->io_q.plugged)) {
+return 0;
+}
+
 /* submit immediately if queue depth is above 2/3 */
 if (idx > s->io_q.size * 2 / 3) {
 return ioq_submit(s, true);
@@ -342,10 +347,25 @@ BlockDriverAIOCB *laio_submit(BlockDriverState *bs, void 
*aio_ctx, int fd,
 io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
 
 if (!s->io_q.plugged) {
-if (io_submit(s->ctx, 1, &iocbs) < 0) {
+int ret;
+
+if (!s->io_q.idx) {
+ret = io_submit(s->ctx, 1, &iocbs);
+} else {
+ret = -EAGAIN;
+}
+/*
+ * Switch to queue mode until -EAGAIN is handled, we suppose
+ * there is always uncompleted I/O, so try to enqueue it first,
+ * and will be submitted again in following aio completion cb.
+ */
+if (ret == -EAGAIN) {
+goto enqueue;
+} else if (ret < 0) {
 goto out_free_aiocb;
 }
 } else {
+ enqueue:
 if (ioq_enqueue(s, iocbs) < 0) {
 goto out_free_aiocb;
 }
-- 
1.7.9.5




[Qemu-devel] [PATCH v1 4/4] linux-aio: increase max event to 256

2014-09-03 Thread Ming Lei
This patch increases the max events to 256 for the coming
virtio-blk multi virtqueue support.

Signed-off-by: Ming Lei 
---
 block/linux-aio.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index ee125bc..4536106 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -23,7 +23,7 @@
  *  than this we will get EAGAIN from io_submit which is communicated to
  *  the guest as an I/O error.
  */
-#define MAX_EVENTS 128
+#define MAX_EVENTS 256
 
 #define MAX_QUEUED_IO  128
 
-- 
1.7.9.5



