date:20160608

[Qemu-block] [PATCH v2] blockdev: clarify error on attempt to open locked tray

2016-06-08 Thread Colin Lord

When opening a device with a locked tray, gives an error explaining the
device tray is locked and that the user should wait and try again. This
is less confusing than the previous error, which simply stated that the
tray was locked.

Signed-off-by: Colin Lord 
---
Reworded commit message to hopefully explain things a little better.
As before this is based off my previously submitted patch v3.
 blockdev.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 7dd14b9..8a045d9 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2544,6 +2544,7 @@ void qmp_blockdev_change_medium(const char *device, const 
char *filename,
 BlockBackend *blk;
 BlockDriverState *medium_bs = NULL;
 int bdrv_flags;
+int rc;
 QDict *options = NULL;
 Error *err = NULL;
 
@@ -2598,11 +2599,13 @@ void qmp_blockdev_change_medium(const char *device, 
const char *filename,
 goto fail;
 }
 
-qmp_blockdev_open_tray(device, false, false, );
-if (err) {
+rc = do_open_tray(device, false, );
+if (rc && rc != -ENOSYS) {
 error_propagate(errp, err);
 goto fail;
 }
+error_free(err);
+err = NULL;
 
 qmp_x_blockdev_remove_medium(device, );
 if (err) {
-- 
2.5.5

Re: [Qemu-block] [Qemu-devel] [PULL 00/31] Block layer patches

2016-06-08 Thread Peter Maydell

On 8 June 2016 at 10:16, Kevin Wolf  wrote:
> The following changes since commit 6ed5546fa7bf12c5b87ef76bafb86e1d77ed6e85:
>
>   Merge remote-tracking branch 
> 'remotes/mjt/tags/pull-trivial-patches-2016-06-07' into staging (2016-06-07 
> 16:34:45 +0100)
>
> are available in the git repository at:
>
>
>   git://repo.or.cz/qemu/kevin.git tags/for-upstream
>
> for you to fetch changes up to 55d539c8f75abab70301a43d8c94b976f6ddc358:
>
>   qemu-img bench: Add --flush-interval (2016-06-08 10:21:09 +0200)
>
> 
> Block layer patches
>

Applied, thanks.

-- PMM

Re: [Qemu-block] [PATCH v2 2/3] block/mirror: Fix target backing BDS

2016-06-08 Thread Nir Soffer

On Wed, Jun 8, 2016 at 12:32 PM, Kevin Wolf  wrote:
> Am 06.06.2016 um 16:42 hat Max Reitz geschrieben:
>> Currently, we are trying to move the backing BDS from the source to the
>> target in bdrv_replace_in_backing_chain() which is called from
>> mirror_exit(). However, mirror_complete() already tries to open the
>> target's backing chain with a call to bdrv_open_backing_file().
>>
>> First, we should only set the target's backing BDS once. Second, the
>> mirroring block job has a better idea of what to set it to than the
>> generic code in bdrv_replace_in_backing_chain() (in fact, the latter's
>> conditions on when to move the backing BDS from source to target are not
>> really correct).
>>
>> Therefore, remove that code from bdrv_replace_in_backing_chain() and
>> leave it to mirror_complete().
>>
>> However, mirror_complete() in turn pursues a questionable strategy by
>> employing bdrv_open_backing_file(): On the one hand, because this may
>> open the wrong backing file with drive-mirror in "existing" mode, or
>> because it will not override a possibly wrong backing file in the
>> blockdev-mirror case.
>>
>> On the other hand, we want to reuse the existing backing chain of the
>> source instead of opening everything anew, because the latter results in
>> having multiple BDSs for a single physical file and thus potentially
>> concurrent access which we should try to avoid.
>
> Careful, this "wrong" backing file might actually be intended!
>
> Consider a case where you want to move an image with its whole backing
> chain to different storage. In that case, you would copy all of the
> backing files (cp is good enough, they are read-only), create the
> destination image which already points at the copied backing chain, and
> then mirror in "existing" mode.
>
> The intention is obviously that after the job completion the new backing
> chain is used and not the old one.
>
> I know that such cases were discussed when mirroring was introduced, I'm
> not sure whether it's actually used. We need some input there:
>
> Eric, can you tell us whether libvirt makes use of such a setup?
>
> Nir, I'm not sure who is the right person in oVirt these days, but do
> you either know yourself whether oVirt requires this to work, or do you
> know who else would know?

I'm the right person, thanks for keeping me in the loop.

What you describe is how we migrate a disk from one storage to another:

1. Create a vm snapshot
2. Create a volume on the destination storage for the snapshot
3. Start mirroring from the source snapshot to the destination snapshot
using libvirt virDomainBlockCopy:
https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockCopy
4. Copy the reset of the chain from source to destination using qemu-img convert
5. Pivot to the new chain using libvirt virDomainBlockJobAbort
https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockJobAbort
6. Remove the old chain

source and target can be files or block device, and we plan to support also
rbd and gluster volumes as target, maybe also as source.

Nir

>
>> Thus, instead of invoking bdrv_open_backing_file(), just set the correct
>> backing BDS directly via bdrv_set_backing_hd(). Also, do so only when
>> mirror_complete() is certain to succeed.
>>
>> In contrast to what bdrv_replace_in_backing_chain() did so far, we do
>> not need to drop the source's backing file.
>>
>> Signed-off-by: Max Reitz 
>
> Leaving the actual code review for later when we have decided what
> semantics we even want.
>
> Kevin

Re: [Qemu-block] [PATCH 6/6] block: Don't enforce 512 byte minimum alignment

2016-06-08 Thread Eric Blake

On 06/08/2016 08:10 AM, Kevin Wolf wrote:
> If block drivers say that they can do an alignment < 512 bytes, let's
> just suppose they mean it. raw-posix used to be an offender with respect
> to this, but it can actually deal with byte-aligned requests now.
> 
> The default is still 512 bytes for any drivers that only implement
> sector-based interfaces, but it is 1 now for drivers that implement
> .bdrv_co_preadv.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block.c| 2 +-
>  block/io.c | 8 +++-
>  2 files changed, 4 insertions(+), 6 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [PATCH 5/6] raw-posix: Implement .bdrv_co_preadv/pwritev

2016-06-08 Thread Eric Blake

On 06/08/2016 08:10 AM, Kevin Wolf wrote:
> The raw-posix block driver actually supports byte-aligned requests now
> on non-O_DIRECT images, like it already (and previously incorrectly)
> claimed in bs->request_alignment.
> 
> For some block drivers this means that a RMW cycle can be avoided when
> they write sub-sector metadata e.g. for cluster allocation.

[well, there's still probably a RMW going on, but it's being done by the
kernel, rather than qemu - and choice of caching may let the kernel
optimize things... not worth cluttering the commit message with this,
though]

> 
> Signed-off-by: Kevin Wolf 
> ---

> +++ b/block/linux-aio.c
> @@ -272,14 +272,12 @@ static int laio_do_submit(int fd, struct qemu_laiocb 
> *laiocb, off_t offset,
>  }
>  
>  int laio_submit_co(BlockDriverState *bs, LinuxAioState *s, int fd,
> -int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, int type)
> +   uint64_t offset, QEMUIOVector *qiov, int type)
>  {
> -off_t offset = sector_num * 512;
>  int ret;
> -
>  struct qemu_laiocb laiocb = {
>  .co = qemu_coroutine_self(),
> -.nbytes = nb_sectors * 512,
> +.nbytes = qiov->size,

So for this interface, we require non-NULL qiov and no duplicated
length; I guess it isn't used for write_zeroes.  We may still want to do
some consistency sweep to decide what level of NULL-ness we want for
representing write_zeroes, rather than ad hoc decisions at each layer of
the call stack, but that's a task for another day.

> @@ -1344,26 +1344,27 @@ static int coroutine_fn raw_co_rw(BlockDriverState 
> *bs, int64_t sector_num,
>  type |= QEMU_AIO_MISALIGNED;
>  #ifdef CONFIG_LINUX_AIO
>  } else if (s->use_aio) {
> -return laio_submit_co(bs, s->aio_ctx, s->fd, sector_num, qiov,
> -   nb_sectors, type);
> +assert(qiov->size == bytes);

Worth hoisting the assertion outside of the #ifdef?...

> +return laio_submit_co(bs, s->aio_ctx, s->fd, offset, qiov, type);
>  #endif
>  }
>  }
>  
> -return paio_submit_co(bs, s->fd, sector_num * BDRV_SECTOR_SIZE, qiov,
> -  nb_sectors * BDRV_SECTOR_SIZE, type);
> +return paio_submit_co(bs, s->fd, offset, qiov, bytes, type);

...then again, paio_submit_co() also does the assert - and this is more
evidence of our inconsistency on whether we duplicate a separate bytes
parameter or reuse qiov->size.

>  
> -static int coroutine_fn raw_co_readv(BlockDriverState *bs, int64_t 
> sector_num,
> - int nb_sectors, QEMUIOVector *qiov)
> +static int coroutine_fn raw_co_preadv(BlockDriverState *bs, uint64_t offset,
> +  uint64_t bytes, QEMUIOVector *qiov,
> +  int flags)
>  {
> -return raw_co_rw(bs, sector_num, nb_sectors, qiov, QEMU_AIO_READ);
> +return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_READ);

We ignore flags, but that's not a change in semantics.  (Maybe someday
we need .supported_read_flags)

>  }
>  
> -static int coroutine_fn raw_co_writev(BlockDriverState *bs, int64_t 
> sector_num,
> -  int nb_sectors, QEMUIOVector *qiov)
> +static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, uint64_t offset,
> +   uint64_t bytes, QEMUIOVector *qiov,
> +   int flags)
>  {
> -return raw_co_rw(bs, sector_num, nb_sectors, qiov, QEMU_AIO_WRITE);
> +return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);

And here, we could assert(!flags) (since we intentionally don't set
.supported_write_flags) - but I won't insist.

None of my comments require a code change, other than a possible added
assertion, so:

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] coroutines: block: Co-routine re-entered recursively when migrating disk with iothreads

2016-06-08 Thread Stefan Hajnoczi

On Tue, Jun 7, 2016 at 1:42 PM, Jason J. Herne
 wrote:
> On 06/06/2016 10:44 PM, Fam Zheng wrote:
>>
>> On Mon, 06/06 14:55, Jason J. Herne wrote:

 I'll see if I can reproduce it here.

 Fam

>>>
>>> Hi Fam,
>>> Have you had any luck reproducing this?
>>
>>
>> No I cannot reproduce so far.
>>
>
> I can hit the problem 100% of the time. Is there any info I can provide to
> help? Either with debugging or reproducing?

I have a related scenario that is reproducible on qemu.git/master:

Launch an x86 guest with virtio-blk-pci,iothread= and perform
drive_mirror to NBD.  Now live migrate (without block migration
because you already mirrored) and the source QEMU will eventually
abort.

The backtrace shows the coroutine is running in the IOThread.  It
should be doing that since virtio-blk.c stops dataplane when live
migration/savevm begins.

What happens is that the mirror block job's coroutine continues to run
in the IOThread but the BlockDriverState's AioContext has switched
back to the main loop.

blk_add_aio_context_notifier() calls are missing for block jobs.  They
need to attach/detach when the AioContext changes.

This needs to be fixed.  I believe Paolo has a patch to continue using
dataplane during savevm but that's a workaround rather than a solution
to this problem.

Stefan

Re: [Qemu-block] [PATCH 4/6] raw-posix: Switch to bdrv_co_* interfaces

2016-06-08 Thread Eric Blake

On 06/08/2016 08:10 AM, Kevin Wolf wrote:
> In order to use the modern byte-based .bdrv_co_preadv/pwritev()
> interface, this patch switches raw-posix to coroutine-based interfaces
> as a first step. In terms of semantics and performance, it doesn't make
> a difference with the existing code whether we go from a coroutine to a
> callback-based interface already in block/io.c or only in linux-aio.c
> 
> As there have been concerns in the past that this change may be a step
> in the wrong direction with respect to a possible AIO fast path, the
> old callback-based interface for linux-aio is left around and can be
> reactivated when a fast path (e.g. directly from virtio-blk dataplane,
> bypassing the whole block layer) is implemented.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/linux-aio.c | 85 
> +--
>  block/raw-aio.h   |  2 ++
>  block/raw-posix.c | 59 ++
>  3 files changed, 92 insertions(+), 54 deletions(-)
> 

> +
> +int laio_submit_co(BlockDriverState *bs, LinuxAioState *s, int fd,
> +int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, int type)
> +{
> +off_t offset = sector_num * 512;

BDRV_SECTOR_SIZE instead of a magic number?

> +int ret;
> +
> +struct qemu_laiocb laiocb = {
> +.co = qemu_coroutine_self(),
> +.nbytes = nb_sectors * 512,

and again


> +
> +BlockAIOCB *laio_submit(BlockDriverState *bs, LinuxAioState *s, int fd,
> +int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
> +BlockCompletionFunc *cb, void *opaque, int type)
> +{
> +struct qemu_laiocb *laiocb;
> +off_t offset = sector_num * 512;
> +int ret;
> +
> +laiocb = qemu_aio_get(_aiocb_info, bs, cb, opaque);
> +laiocb->nbytes = nb_sectors * 512;

and again


> @@ -1345,14 +1344,26 @@ static BlockAIOCB *raw_aio_submit(BlockDriverState 
> *bs,
>  type |= QEMU_AIO_MISALIGNED;
>  #ifdef CONFIG_LINUX_AIO
>  } else if (s->use_aio) {
> -return laio_submit(bs, s->aio_ctx, s->fd, sector_num, qiov,
> -   nb_sectors, cb, opaque, type);
> +return laio_submit_co(bs, s->aio_ctx, s->fd, sector_num, qiov,
> +   nb_sectors, type);

Indentation is now off


> @@ -1957,8 +1952,8 @@ BlockDriver bdrv_file = {
>  .bdrv_co_get_block_status = raw_co_get_block_status,
>  .bdrv_co_pwrite_zeroes = raw_co_pwrite_zeroes,
>  
> -.bdrv_aio_readv = raw_aio_readv,
> -.bdrv_aio_writev = raw_aio_writev,
> +.bdrv_co_readv  = raw_co_readv,
> +.bdrv_co_writev = raw_co_writev,
>  .bdrv_aio_flush = raw_aio_flush,

Why bother with indented alignment of '=' when none of the neighbors do?

Findings are minor enough, so up to you whether to fix them or just mark
this:

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] [PATCH v5 00/11] Provide a QOM-based authorization API

2016-06-08 Thread Markus Armbruster

"Daniel P. Berrange"  writes:

> Ping, does anyone have any feedback on these first 3 patches
> in particular. If we get agreement on those and get them merged,
> then we can feed the rest of this series to the individual
> subsystem maintainers that are affected, and also unblock
> Max's patch series.

I'll try to get to them this week.

Re: [Qemu-block] [PATCH v2 2/3] block/mirror: Fix target backing BDS

2016-06-08 Thread Max Reitz

On 08.06.2016 13:28, Paolo Bonzini wrote:
> 
> 
> - Original Message -
>> From: "Kevin Wolf" 
>> To: "Max Reitz" 
>> Cc: qemu-block@nongnu.org, qemu-de...@nongnu.org, "Fam Zheng" 
>> , nsof...@redhat.com,
>> ebl...@redhat.com, pbonz...@redhat.com
>> Sent: Wednesday, June 8, 2016 11:32:29 AM
>> Subject: Re: [PATCH v2 2/3] block/mirror: Fix target backing BDS
>>
>> Am 06.06.2016 um 16:42 hat Max Reitz geschrieben:
>>> Currently, we are trying to move the backing BDS from the source to the
>>> target in bdrv_replace_in_backing_chain() which is called from
>>> mirror_exit(). However, mirror_complete() already tries to open the
>>> target's backing chain with a call to bdrv_open_backing_file().
>>>
>>> First, we should only set the target's backing BDS once. Second, the
>>> mirroring block job has a better idea of what to set it to than the
>>> generic code in bdrv_replace_in_backing_chain() (in fact, the latter's
>>> conditions on when to move the backing BDS from source to target are not
>>> really correct).
>>>
>>> Therefore, remove that code from bdrv_replace_in_backing_chain() and
>>> leave it to mirror_complete().
>>>
>>> However, mirror_complete() in turn pursues a questionable strategy by
>>> employing bdrv_open_backing_file(): On the one hand, because this may
>>> open the wrong backing file with drive-mirror in "existing" mode, or
>>> because it will not override a possibly wrong backing file in the
>>> blockdev-mirror case.
>>
>> Careful, this "wrong" backing file might actually be intended!
>>
>> Consider a case where you want to move an image with its whole backing
>> chain to different storage. In that case, you would copy all of the
>> backing files (cp is good enough, they are read-only), create the
>> destination image which already points at the copied backing chain, and
>> then mirror in "existing" mode.
>>
>> The intention is obviously that after the job completion the new backing
>> chain is used and not the old one.
> 
> Yes, this is the intention and it should not be changed.  In addition
> to what Kevin said, you can use drive-mirror to collapse the image to a
> single file; in this case, QEMU should not be using the backing files of
> the source.

That is an issue that we have right now. If you do drive-mirror in
absolute-paths mode with sync=full, the target will have the backing
chain of the source. This is something that this patch fixes.

In fact, I think if you do drive-mirror in existing mode or
blockdev-mirror and the target image does not have a backing file
(whatever sync mode you have used), the same will happen.

Max

> bdrv_open_backing_file() is used because what we want to do is to
> "undo" the BDRV_O_NO_BACKING flag used by qmp_drive_mirror.
> 
> If the contents change under the guest feet, it's the layers above
> QEMU that have screwed up.
> 
> Paolo
> 




signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [PATCH 3/6] block: Prepare bdrv_aligned_pwritev() for byte-aligned requests

2016-06-08 Thread Eric Blake

On 06/08/2016 08:10 AM, Kevin Wolf wrote:
> Signed-off-by: Kevin Wolf 
> ---
>  block/io.c | 13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 2fd88cb..4af9c59 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1241,11 +1241,9 @@ static int coroutine_fn 
> bdrv_aligned_pwritev(BlockDriverState *bs,
>  bool waited;
>  int ret;
>  
> -int64_t sector_num = offset >> BDRV_SECTOR_BITS;
> -unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
> +int64_t start_sector = offset >> BDRV_SECTOR_BITS;
> +int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
>  
> -assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> -assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);

I wonder if we should add an 'unsigned int align' parameter to this
function, to mirror bdrv_aligned_preadv() - that way, we can assert that
any divisions we do based on 'align' are indeed aligned. If we do that,
then just as in the previous patch, I think we would want to keep
'assert((offset & (align - 1)) == 0)'.  But at the moment, I don't see
any divisions based on align, so I think you are okay.

>  if (ret >= 0) {
> -bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
> +bs->total_sectors = MAX(bs->total_sectors, end_sector);

Someday, we may want bs->total_bytes instead of total_sectors, but
unrelated to this patch.

Looks clean:
Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org

signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [PATCH] block: Don't emulate natively supported pwritev flags

2016-06-08 Thread Stefan Hajnoczi

On Tue, Jun 07, 2016 at 03:56:37PM +0200, Kevin Wolf wrote:
> Drivers that implement .bdrv_co_pwritev() get the flags passed as an
> argument to said function, but we also unconditionally emulate the flags
> anyway. We shouldn't do that.
> 
> Fix this by clearing all flags that the driver supports natively after
> it returns from .bdrv_co_pwritev().
> 
> Fixes: 4df863f3 ('block: Make supported_write_flags a per-bds property')
> Signed-off-by: Kevin Wolf 
> ---
>  block/io.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature

[Qemu-block] [PATCH 3/6] block: Prepare bdrv_aligned_pwritev() for byte-aligned requests

2016-06-08 Thread Kevin Wolf

Signed-off-by: Kevin Wolf 
---
 block/io.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/block/io.c b/block/io.c
index 2fd88cb..4af9c59 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1241,11 +1241,9 @@ static int coroutine_fn 
bdrv_aligned_pwritev(BlockDriverState *bs,
 bool waited;
 int ret;
 
-int64_t sector_num = offset >> BDRV_SECTOR_BITS;
-unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
+int64_t start_sector = offset >> BDRV_SECTOR_BITS;
+int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
 
-assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
-assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 assert(!qiov || bytes == qiov->size);
 assert((bs->open_flags & BDRV_O_NO_IO) == 0);
 
@@ -1269,22 +1267,21 @@ static int coroutine_fn 
bdrv_aligned_pwritev(BlockDriverState *bs,
 /* Do nothing, write notifier decided to fail this request */
 } else if (flags & BDRV_REQ_ZERO_WRITE) {
 bdrv_debug_event(bs, BLKDBG_PWRITEV_ZERO);
-ret = bdrv_co_do_pwrite_zeroes(bs, sector_num << BDRV_SECTOR_BITS,
-   nb_sectors << BDRV_SECTOR_BITS, flags);
+ret = bdrv_co_do_pwrite_zeroes(bs, offset, bytes, flags);
 } else {
 bdrv_debug_event(bs, BLKDBG_PWRITEV);
 ret = bdrv_driver_pwritev(bs, offset, bytes, qiov, flags);
 }
 bdrv_debug_event(bs, BLKDBG_PWRITEV_DONE);
 
-bdrv_set_dirty(bs, sector_num, nb_sectors);
+bdrv_set_dirty(bs, start_sector, end_sector - start_sector);
 
 if (bs->wr_highest_offset < offset + bytes) {
 bs->wr_highest_offset = offset + bytes;
 }
 
 if (ret >= 0) {
-bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
+bs->total_sectors = MAX(bs->total_sectors, end_sector);
 }
 
 return ret;
-- 
1.8.3.1

[Qemu-block] [PATCH 2/6] block: Prepare bdrv_aligned_preadv() for byte-aligned requests

2016-06-08 Thread Kevin Wolf

This patch makes bdrv_aligned_preadv() ready to accept byte-aligned
requests. Note that this doesn't mean that such requests are actually
made. The caller still ensures that all requests are aligned to at least
512 bytes.

Signed-off-by: Kevin Wolf 
---
 block/io.c | 41 +
 1 file changed, 17 insertions(+), 24 deletions(-)

diff --git a/block/io.c b/block/io.c
index 3b34f20..2fd88cb 100644
--- a/block/io.c
+++ b/block/io.c
@@ -959,11 +959,6 @@ static int coroutine_fn 
bdrv_aligned_preadv(BlockDriverState *bs,
 {
 int ret;
 
-int64_t sector_num = offset >> BDRV_SECTOR_BITS;
-unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
-
-assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
-assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
 assert(!qiov || bytes == qiov->size);
 assert((bs->open_flags & BDRV_O_NO_IO) == 0);
 
@@ -982,9 +977,12 @@ static int coroutine_fn 
bdrv_aligned_preadv(BlockDriverState *bs,
 }
 
 if (flags & BDRV_REQ_COPY_ON_READ) {
+int64_t start_sector = offset >> BDRV_SECTOR_BITS;
+int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
+unsigned int nb_sectors = end_sector - start_sector;
 int pnum;
 
-ret = bdrv_is_allocated(bs, sector_num, nb_sectors, );
+ret = bdrv_is_allocated(bs, start_sector, nb_sectors, );
 if (ret < 0) {
 goto out;
 }
@@ -1000,28 +998,24 @@ static int coroutine_fn 
bdrv_aligned_preadv(BlockDriverState *bs,
 ret = bdrv_driver_preadv(bs, offset, bytes, qiov, 0);
 } else {
 /* Read zeros after EOF */
-int64_t total_sectors, max_nb_sectors;
+int64_t total_bytes, max_bytes;
 
-total_sectors = bdrv_nb_sectors(bs);
-if (total_sectors < 0) {
-ret = total_sectors;
+total_bytes = bdrv_getlength(bs);
+if (total_bytes < 0) {
+ret = total_bytes;
 goto out;
 }
 
-max_nb_sectors = ROUND_UP(MAX(0, total_sectors - sector_num),
-  align >> BDRV_SECTOR_BITS);
-if (nb_sectors < max_nb_sectors) {
+max_bytes = ROUND_UP(MAX(0, total_bytes - offset), align);
+if (bytes < max_bytes) {
 ret = bdrv_driver_preadv(bs, offset, bytes, qiov, 0);
-} else if (max_nb_sectors > 0) {
+} else if (max_bytes > 0) {
 QEMUIOVector local_qiov;
 
 qemu_iovec_init(_qiov, qiov->niov);
-qemu_iovec_concat(_qiov, qiov, 0,
-  max_nb_sectors * BDRV_SECTOR_SIZE);
+qemu_iovec_concat(_qiov, qiov, 0, max_bytes);
 
-ret = bdrv_driver_preadv(bs, offset,
- max_nb_sectors * BDRV_SECTOR_SIZE,
- _qiov, 0);
+ret = bdrv_driver_preadv(bs, offset, max_bytes, _qiov, 0);
 
 qemu_iovec_destroy(_qiov);
 } else {
@@ -1029,11 +1023,10 @@ static int coroutine_fn 
bdrv_aligned_preadv(BlockDriverState *bs,
 }
 
 /* Reading beyond end of file is supposed to produce zeroes */
-if (ret == 0 && total_sectors < sector_num + nb_sectors) {
-uint64_t offset = MAX(0, total_sectors - sector_num);
-uint64_t bytes = (sector_num + nb_sectors - offset) *
-  BDRV_SECTOR_SIZE;
-qemu_iovec_memset(qiov, offset * BDRV_SECTOR_SIZE, 0, bytes);
+if (ret == 0 && total_bytes < offset + bytes) {
+uint64_t zero_offset = MAX(0, total_bytes - offset);
+uint64_t zero_bytes = offset + bytes - zero_offset;
+qemu_iovec_memset(qiov, zero_offset, 0, zero_bytes);
 }
 }
 
-- 
1.8.3.1

Re: [Qemu-block] [PATCH v2 2/3] block/mirror: Fix target backing BDS

2016-06-08 Thread Max Reitz

On 08.06.2016 16:40, Max Reitz wrote:
> On 08.06.2016 13:28, Paolo Bonzini wrote:
>>
>>
>> - Original Message -
>>> From: "Kevin Wolf" 
>>> To: "Max Reitz" 
>>> Cc: qemu-block@nongnu.org, qemu-de...@nongnu.org, "Fam Zheng" 
>>> , nsof...@redhat.com,
>>> ebl...@redhat.com, pbonz...@redhat.com
>>> Sent: Wednesday, June 8, 2016 11:32:29 AM
>>> Subject: Re: [PATCH v2 2/3] block/mirror: Fix target backing BDS
>>>
>>> Am 06.06.2016 um 16:42 hat Max Reitz geschrieben:
 Currently, we are trying to move the backing BDS from the source to the
 target in bdrv_replace_in_backing_chain() which is called from
 mirror_exit(). However, mirror_complete() already tries to open the
 target's backing chain with a call to bdrv_open_backing_file().

 First, we should only set the target's backing BDS once. Second, the
 mirroring block job has a better idea of what to set it to than the
 generic code in bdrv_replace_in_backing_chain() (in fact, the latter's
 conditions on when to move the backing BDS from source to target are not
 really correct).

 Therefore, remove that code from bdrv_replace_in_backing_chain() and
 leave it to mirror_complete().

 However, mirror_complete() in turn pursues a questionable strategy by
 employing bdrv_open_backing_file(): On the one hand, because this may
 open the wrong backing file with drive-mirror in "existing" mode, or
 because it will not override a possibly wrong backing file in the
 blockdev-mirror case.
>>>
>>> Careful, this "wrong" backing file might actually be intended!
>>>
>>> Consider a case where you want to move an image with its whole backing
>>> chain to different storage. In that case, you would copy all of the
>>> backing files (cp is good enough, they are read-only), create the
>>> destination image which already points at the copied backing chain, and
>>> then mirror in "existing" mode.
>>>
>>> The intention is obviously that after the job completion the new backing
>>> chain is used and not the old one.
>>
>> Yes, this is the intention and it should not be changed.  In addition
>> to what Kevin said, you can use drive-mirror to collapse the image to a
>> single file; in this case, QEMU should not be using the backing files of
>> the source.
> 
> That is an issue that we have right now. If you do drive-mirror in
> absolute-paths mode with sync=full, the target will have the backing
> chain of the source. This is something that this patch fixes.

As a clarification: I mean the backing chain inside QEMU (in the BDS
graph), not the on-disk backing chain, i.e. how the physical image files
link to each other.

Max

> In fact, I think if you do drive-mirror in existing mode or
> blockdev-mirror and the target image does not have a backing file
> (whatever sync mode you have used), the same will happen.
> 
> Max
> 
>> bdrv_open_backing_file() is used because what we want to do is to
>> "undo" the BDRV_O_NO_BACKING flag used by qmp_drive_mirror.
>>
>> If the contents change under the guest feet, it's the layers above
>> QEMU that have screwed up.
>>
>> Paolo
>>
> 
> 




signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [PATCH 2/6] block: Prepare bdrv_aligned_preadv() for byte-aligned requests

2016-06-08 Thread Eric Blake

On 06/08/2016 08:10 AM, Kevin Wolf wrote:
> This patch makes bdrv_aligned_preadv() ready to accept byte-aligned
> requests. Note that this doesn't mean that such requests are actually
> made. The caller still ensures that all requests are aligned to at least
> 512 bytes.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/io.c | 41 +
>  1 file changed, 17 insertions(+), 24 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 3b34f20..2fd88cb 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -959,11 +959,6 @@ static int coroutine_fn 
> bdrv_aligned_preadv(BlockDriverState *bs,
>  {
>  int ret;
>  
> -int64_t sector_num = offset >> BDRV_SECTOR_BITS;
> -unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
> -
> -assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> -assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);

My recent patch proposal conflicts here; I can drop (or rebase) mine,
but I think you want to keep the asserts, after modifying them to:

assert((offset & (align - 1)) == 0);
assert((bytes & (align - 1)) == 0);

and maybe add 'assert(is_power_of_2(align)); for good measure.


>  
> -max_nb_sectors = ROUND_UP(MAX(0, total_sectors - sector_num),
> -  align >> BDRV_SECTOR_BITS);
> -if (nb_sectors < max_nb_sectors) {
> +max_bytes = ROUND_UP(MAX(0, total_bytes - offset), align);
> +if (bytes < max_bytes) {

Another one of my posted patches switches this to <= to avoid a harmless
off-by-one, also something I can rebase.

Whether you add the assertions, or leave it for me to (re-)add them in
my rebase,

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [PATCH 1/6] block: Byte-based bdrv_co_do_copy_on_readv()

2016-06-08 Thread Eric Blake

On 06/08/2016 08:10 AM, Kevin Wolf wrote:
> In a first step to convert the common I/O path to work on bytes rather
> than sectors, this converts the copy-on-read logic that is used by
> bdrv_aligned_preadv().
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/io.c| 63 
> +++
>  block/mirror.c| 10 
>  include/block/block.h | 10 +---
>  3 files changed, 51 insertions(+), 32 deletions(-)
> 

> @@ -873,21 +893,20 @@ static int coroutine_fn 
> bdrv_co_do_copy_on_readv(BlockDriverState *bs,
>  BlockDriver *drv = bs->drv;
>  struct iovec iov;
>  QEMUIOVector bounce_qiov;
> -int64_t cluster_sector_num;
> -int cluster_nb_sectors;
> +int64_t cluster_offset;
> +unsigned int cluster_bytes;
>  size_t skip_bytes;
>  int ret;
>  
>  /* Cover entire cluster so no additional backing file I/O is required 
> when
>   * allocating cluster in the image file.
>   */
> -bdrv_round_to_clusters(bs, sector_num, nb_sectors,
> -   _sector_num, _nb_sectors);
> +bdrv_round_to_clusters(bs, offset, bytes, _offset, 
> _bytes);
>  
> -trace_bdrv_co_do_copy_on_readv(bs, sector_num, nb_sectors,
> -   cluster_sector_num, cluster_nb_sectors);
> +trace_bdrv_co_do_copy_on_readv(bs, offset, bytes,
> +   cluster_offset, cluster_bytes);

Missing patch to trace-events to advertise new semantics.

With that fixed,
Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

[Qemu-block] [PATCH 4/6] raw-posix: Switch to bdrv_co_* interfaces

2016-06-08 Thread Kevin Wolf

In order to use the modern byte-based .bdrv_co_preadv/pwritev()
interface, this patch switches raw-posix to coroutine-based interfaces
as a first step. In terms of semantics and performance, it doesn't make
a difference with the existing code whether we go from a coroutine to a
callback-based interface already in block/io.c or only in linux-aio.c

As there have been concerns in the past that this change may be a step
in the wrong direction with respect to a possible AIO fast path, the
old callback-based interface for linux-aio is left around and can be
reactivated when a fast path (e.g. directly from virtio-blk dataplane,
bypassing the whole block layer) is implemented.

Signed-off-by: Kevin Wolf 
---
 block/linux-aio.c | 85 +--
 block/raw-aio.h   |  2 ++
 block/raw-posix.c | 59 ++
 3 files changed, 92 insertions(+), 54 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 294b2bf..1a56543 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -13,6 +13,7 @@
 #include "qemu/queue.h"
 #include "block/raw-aio.h"
 #include "qemu/event_notifier.h"
+#include "qemu/coroutine.h"
 
 #include 
 
@@ -30,6 +31,7 @@
 
 struct qemu_laiocb {
 BlockAIOCB common;
+Coroutine *co;
 LinuxAioState *ctx;
 struct iocb iocb;
 ssize_t ret;
@@ -88,9 +90,14 @@ static void qemu_laio_process_completion(struct qemu_laiocb 
*laiocb)
 }
 }
 }
-laiocb->common.cb(laiocb->common.opaque, ret);
 
-qemu_aio_unref(laiocb);
+laiocb->ret = ret;
+if (laiocb->co) {
+qemu_coroutine_enter(laiocb->co, NULL);
+} else {
+laiocb->common.cb(laiocb->common.opaque, ret);
+qemu_aio_unref(laiocb);
+}
 }
 
 /* The completion BH fetches completed I/O requests and invokes their
@@ -232,22 +239,12 @@ void laio_io_unplug(BlockDriverState *bs, LinuxAioState 
*s)
 }
 }
 
-BlockAIOCB *laio_submit(BlockDriverState *bs, LinuxAioState *s, int fd,
-int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
-BlockCompletionFunc *cb, void *opaque, int type)
+static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
+  int type)
 {
-struct qemu_laiocb *laiocb;
-struct iocb *iocbs;
-off_t offset = sector_num * 512;
-
-laiocb = qemu_aio_get(_aiocb_info, bs, cb, opaque);
-laiocb->nbytes = nb_sectors * 512;
-laiocb->ctx = s;
-laiocb->ret = -EINPROGRESS;
-laiocb->is_read = (type == QEMU_AIO_READ);
-laiocb->qiov = qiov;
-
-iocbs = >iocb;
+LinuxAioState *s = laiocb->ctx;
+struct iocb *iocbs = >iocb;
+QEMUIOVector *qiov = laiocb->qiov;
 
 switch (type) {
 case QEMU_AIO_WRITE:
@@ -260,7 +257,7 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, LinuxAioState 
*s, int fd,
 default:
 fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
 __func__, type);
-goto out_free_aiocb;
+return -EIO;
 }
 io_set_eventfd(>iocb, event_notifier_get_fd(>e));
 
@@ -270,11 +267,55 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, 
LinuxAioState *s, int fd,
 (!s->io_q.plugged || s->io_q.n >= MAX_QUEUED_IO)) {
 ioq_submit(s);
 }
-return >common;
 
-out_free_aiocb:
-qemu_aio_unref(laiocb);
-return NULL;
+return 0;
+}
+
+int laio_submit_co(BlockDriverState *bs, LinuxAioState *s, int fd,
+int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, int type)
+{
+off_t offset = sector_num * 512;
+int ret;
+
+struct qemu_laiocb laiocb = {
+.co = qemu_coroutine_self(),
+.nbytes = nb_sectors * 512,
+.ctx= s,
+.is_read= (type == QEMU_AIO_READ),
+.qiov   = qiov,
+};
+
+ret = laio_do_submit(fd, , offset, type);
+if (ret < 0) {
+return ret;
+}
+
+qemu_coroutine_yield();
+return laiocb.ret;
+}
+
+BlockAIOCB *laio_submit(BlockDriverState *bs, LinuxAioState *s, int fd,
+int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
+BlockCompletionFunc *cb, void *opaque, int type)
+{
+struct qemu_laiocb *laiocb;
+off_t offset = sector_num * 512;
+int ret;
+
+laiocb = qemu_aio_get(_aiocb_info, bs, cb, opaque);
+laiocb->nbytes = nb_sectors * 512;
+laiocb->ctx = s;
+laiocb->ret = -EINPROGRESS;
+laiocb->is_read = (type == QEMU_AIO_READ);
+laiocb->qiov = qiov;
+
+ret = laio_do_submit(fd, laiocb, offset, type);
+if (ret < 0) {
+qemu_aio_unref(laiocb);
+return NULL;
+}
+
+return >common;
 }
 
 void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context)
diff --git a/block/raw-aio.h b/block/raw-aio.h
index 714714e..1037502 100644
--- a/block/raw-aio.h
+++ b/block/raw-aio.h
@@ -38,6 +38,8 @@
 typedef struct LinuxAioState LinuxAioState;
 LinuxAioState *laio_init(void);
 void laio_cleanup(LinuxAioState

[Qemu-block] [PATCH 5/6] raw-posix: Implement .bdrv_co_preadv/pwritev

2016-06-08 Thread Kevin Wolf

The raw-posix block driver actually supports byte-aligned requests now
on non-O_DIRECT images, like it already (and previously incorrectly)
claimed in bs->request_alignment.

For some block drivers this means that a RMW cycle can be avoided when
they write sub-sector metadata e.g. for cluster allocation.

Signed-off-by: Kevin Wolf 
---
 block/linux-aio.c |  6 ++
 block/raw-aio.h   |  2 +-
 block/raw-posix.c | 42 ++
 3 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 1a56543..8dc34db 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -272,14 +272,12 @@ static int laio_do_submit(int fd, struct qemu_laiocb 
*laiocb, off_t offset,
 }
 
 int laio_submit_co(BlockDriverState *bs, LinuxAioState *s, int fd,
-int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, int type)
+   uint64_t offset, QEMUIOVector *qiov, int type)
 {
-off_t offset = sector_num * 512;
 int ret;
-
 struct qemu_laiocb laiocb = {
 .co = qemu_coroutine_self(),
-.nbytes = nb_sectors * 512,
+.nbytes = qiov->size,
 .ctx= s,
 .is_read= (type == QEMU_AIO_READ),
 .qiov   = qiov,
diff --git a/block/raw-aio.h b/block/raw-aio.h
index 1037502..3f5b8bb 100644
--- a/block/raw-aio.h
+++ b/block/raw-aio.h
@@ -39,7 +39,7 @@ typedef struct LinuxAioState LinuxAioState;
 LinuxAioState *laio_init(void);
 void laio_cleanup(LinuxAioState *s);
 int laio_submit_co(BlockDriverState *bs, LinuxAioState *s, int fd,
-int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, int type);
+   uint64_t offset, QEMUIOVector *qiov, int type);
 BlockAIOCB *laio_submit(BlockDriverState *bs, LinuxAioState *s, int fd,
 int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
 BlockCompletionFunc *cb, void *opaque, int type);
diff --git a/block/raw-posix.c b/block/raw-posix.c
index af7f69f..0db7876 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -1325,8 +1325,8 @@ static BlockAIOCB *paio_submit(BlockDriverState *bs, int 
fd,
 return thread_pool_submit_aio(pool, aio_worker, acb, cb, opaque);
 }
 
-static int coroutine_fn raw_co_rw(BlockDriverState *bs, int64_t sector_num,
-  int nb_sectors, QEMUIOVector *qiov, int type)
+static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
+   uint64_t bytes, QEMUIOVector *qiov, int 
type)
 {
 BDRVRawState *s = bs->opaque;
 
@@ -1344,26 +1344,27 @@ static int coroutine_fn raw_co_rw(BlockDriverState *bs, 
int64_t sector_num,
 type |= QEMU_AIO_MISALIGNED;
 #ifdef CONFIG_LINUX_AIO
 } else if (s->use_aio) {
-return laio_submit_co(bs, s->aio_ctx, s->fd, sector_num, qiov,
-   nb_sectors, type);
+assert(qiov->size == bytes);
+return laio_submit_co(bs, s->aio_ctx, s->fd, offset, qiov, type);
 #endif
 }
 }
 
-return paio_submit_co(bs, s->fd, sector_num * BDRV_SECTOR_SIZE, qiov,
-  nb_sectors * BDRV_SECTOR_SIZE, type);
+return paio_submit_co(bs, s->fd, offset, qiov, bytes, type);
 }
 
-static int coroutine_fn raw_co_readv(BlockDriverState *bs, int64_t sector_num,
- int nb_sectors, QEMUIOVector *qiov)
+static int coroutine_fn raw_co_preadv(BlockDriverState *bs, uint64_t offset,
+  uint64_t bytes, QEMUIOVector *qiov,
+  int flags)
 {
-return raw_co_rw(bs, sector_num, nb_sectors, qiov, QEMU_AIO_READ);
+return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_READ);
 }
 
-static int coroutine_fn raw_co_writev(BlockDriverState *bs, int64_t sector_num,
-  int nb_sectors, QEMUIOVector *qiov)
+static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, uint64_t offset,
+   uint64_t bytes, QEMUIOVector *qiov,
+   int flags)
 {
-return raw_co_rw(bs, sector_num, nb_sectors, qiov, QEMU_AIO_WRITE);
+return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
 }
 
 static void raw_aio_plug(BlockDriverState *bs)
@@ -1952,8 +1953,8 @@ BlockDriver bdrv_file = {
 .bdrv_co_get_block_status = raw_co_get_block_status,
 .bdrv_co_pwrite_zeroes = raw_co_pwrite_zeroes,
 
-.bdrv_co_readv  = raw_co_readv,
-.bdrv_co_writev = raw_co_writev,
+.bdrv_co_preadv = raw_co_preadv,
+.bdrv_co_pwritev= raw_co_pwritev,
 .bdrv_aio_flush = raw_aio_flush,
 .bdrv_aio_discard = raw_aio_discard,
 .bdrv_refresh_limits = raw_refresh_limits,
@@ -2400,8 +2401,8 @@ static BlockDriver bdrv_host_device = {
 .create_opts = _create_opts,
 .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes,
 
-.bdrv_co_readv

[Qemu-block] [PATCH 0/6] block: Enable byte granularity I/O

2016-06-08 Thread Kevin Wolf

Previous series have already converted some block drivers to byte-based rather
than sector-based interfaces. However, the common I/O path as well as raw-posix
still enforced a minimum alignment of 512 bytes because some sector-based logic
was involved.

This patch series removes these limitations and a sub-sector request actually
ends up as a sub-sector syscall now.

Kevin Wolf (6):
  block: Byte-based bdrv_co_do_copy_on_readv()
  block: Prepare bdrv_aligned_preadv() for byte-aligned requests
  block: Prepare bdrv_aligned_pwritev() for byte-aligned requests
  raw-posix: Switch to bdrv_co_* interfaces
  raw-posix: Implement .bdrv_co_preadv/pwritev
  block: Don't enforce 512 byte minimum alignment

 block.c   |   2 +-
 block/io.c| 125 +-
 block/linux-aio.c |  83 -
 block/mirror.c|  10 ++--
 block/raw-aio.h   |   2 +
 block/raw-posix.c |  61 
 include/block/block.h |  10 ++--
 7 files changed, 169 insertions(+), 124 deletions(-)

-- 
1.8.3.1

Re: [Qemu-block] [PATCH v20 07/10] Introduce new APIs to do replication operation

2016-06-08 Thread Eric Blake

On 06/07/2016 07:11 PM, Changlong Xie wrote:
> Signed-off-by: Wen Congyang 
> Signed-off-by: zhanghailiang 
> Signed-off-by: Gonglei 
> Signed-off-by: Changlong Xie 

No mention of the API names in the commit message?  Grepping 'git log'
is easier if there is something useful to grep for.

> ---
>  Makefile.objs|   1 +
>  qapi/block-core.json |  13 
>  replication.c| 105 ++
>  replication.h| 176 
> +++
>  4 files changed, 295 insertions(+)
>  create mode 100644 replication.c
>  create mode 100644 replication.h
> 

> +++ b/qapi/block-core.json
> @@ -2032,6 +2032,19 @@
>  '*read-pattern': 'QuorumReadPattern' } }
>  
>  ##
> +# @ReplicationMode
> +#
> +# An enumeration of replication modes.
> +#
> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
> +#
> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
> +#
> +# Since: 2.7
> +##
> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
> +

This part is fine, from an interface point of view. However, I have not
closely reviewed the rest of the patch or series.  That said, here's
some quick things that caught my eye.


> +++ b/replication.c
> @@ -0,0 +1,105 @@
> +/*
> + * Replication filter
> + *
> + * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2016 Intel Corporation
> + * Copyright (c) 2016 FUJITSU LIMITED
> + *
> + * Author:
> + *   Changlong Xie 
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "replication.h"

All new .c files must include "qemu/osdep.h" first.


> +++ b/replication.h
> @@ -0,0 +1,176 @@
> +/*
> + * Replication filter
> + *
> + * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2016 Intel Corporation
> + * Copyright (c) 2016 FUJITSU LIMITED
> + *
> + * Author:
> + *   Changlong Xie 
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef REPLICATION_H
> +#define REPLICATION_H
> +
> +#include "qemu/osdep.h"

And .h files must NOT include osdep.h.

> +#include "qapi/error.h"

Do you really need the full error.h, or is typedefs.h enough to get the
forward declaration of Error?

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [PATCH v20 00/10] Block replication for continuous checkpoints

2016-06-08 Thread Eric Blake

On 06/07/2016 07:11 PM, Changlong Xie wrote:
> Block replication is a very important feature which is used for
> continuous checkpoints(for example: COLO).
> 

Side note: Including qemu-trivial in CC: on a patch series at v20 feels
wrong.  Obviously it is not trivial to be ten patches with this much churn.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org

signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [PATCH v5 02/11] qapi: allow QmpInputVisitor to auto-cast types

2016-06-08 Thread Paolo Bonzini



On 02/06/2016 18:46, Daniel P. Berrange wrote:
> Currently the QmpInputVisitor assumes that all scalar
> values are directly represented as their final types.
> ie it assumes an 'int' is using QInt, and a 'bool' is
> using QBool.
> 
> This extends it so that QString is optionally permitted
> for any of the non-string scalar types. This behaviour
> is turned on by requesting the 'autocast' flag in the
> constructor.
> 
> This makes it possible to use QmpInputVisitor with a
> QDict produced from QemuOpts, where everything is in
> string format.

Perhaps this should instead be a separate QmpStringInputVisitor visitor
that _only_ accepts strings?  You can reuse most of the QmpInputVisitor
by putting it in the same file, because the struct and list visitors are
compatible.

I wonder if OptsVisitor could also be replaced by qemu_opts_to_qdict, an
optional crumple step and the QmpStringInputVisitor (self-answer:
probably not because of the magic integer range parsing).

Paolo

> Signed-off-by: Daniel P. Berrange 
> ---
>  docs/qapi-code-gen.txt |   2 +-
>  include/qapi/qmp-input-visitor.h   |   6 +-
>  qapi/opts-visitor.c|   1 +
>  qapi/qmp-input-visitor.c   |  89 ++--
>  qmp.c  |   2 +-
>  qom/qom-qobject.c  |   2 +-
>  replay/replay-input.c  |   2 +-
>  scripts/qapi-commands.py   |   2 +-
>  tests/check-qnull.c|   2 +-
>  tests/test-qmp-commands.c  |   2 +-
>  tests/test-qmp-input-strict.c  |   2 +-
>  tests/test-qmp-input-visitor.c | 115 
> -
>  tests/test-visitor-serialization.c |   2 +-
>  util/qemu-sockets.c|   2 +-
>  14 files changed, 201 insertions(+), 30 deletions(-)
> 
> diff --git a/docs/qapi-code-gen.txt b/docs/qapi-code-gen.txt
> index d7d6987..e21773e 100644
> --- a/docs/qapi-code-gen.txt
> +++ b/docs/qapi-code-gen.txt
> @@ -1008,7 +1008,7 @@ Example:
>  {
>  Error *err = NULL;
>  UserDefOne *retval;
> -QmpInputVisitor *qiv = qmp_input_visitor_new(QOBJECT(args), true);
> +QmpInputVisitor *qiv = qmp_input_visitor_new(QOBJECT(args), true, 
> false);
>  QapiDeallocVisitor *qdv;
>  Visitor *v;
>  UserDefOneList *arg1 = NULL;
> diff --git a/include/qapi/qmp-input-visitor.h 
> b/include/qapi/qmp-input-visitor.h
> index b0624d8..d35a053 100644
> --- a/include/qapi/qmp-input-visitor.h
> +++ b/include/qapi/qmp-input-visitor.h
> @@ -24,8 +24,12 @@ typedef struct QmpInputVisitor QmpInputVisitor;
>   *
>   * Set @strict to reject a parse that doesn't consume all keys of a
>   * dictionary; otherwise excess input is ignored.
> + * Set @autocast to automatically convert string values into more
> + * specific types (numbers, bools, etc)
>   */
> -QmpInputVisitor *qmp_input_visitor_new(QObject *obj, bool strict);
> +QmpInputVisitor *qmp_input_visitor_new(QObject *obj,
> +   bool strict,
> +   bool autocast);
>  
>  void qmp_input_visitor_cleanup(QmpInputVisitor *v);
>  
> diff --git a/qapi/opts-visitor.c b/qapi/opts-visitor.c
> index 4cf1cf8..00e4b1a 100644
> --- a/qapi/opts-visitor.c
> +++ b/qapi/opts-visitor.c
> @@ -347,6 +347,7 @@ opts_type_bool(Visitor *v, const char *name, bool *obj, 
> Error **errp)
>  }
>  
>  if (opt->str) {
> +/* Keep these values in sync with same code in qmp-input-visitor.c */
>  if (strcmp(opt->str, "on") == 0 ||
>  strcmp(opt->str, "yes") == 0 ||
>  strcmp(opt->str, "y") == 0) {
> diff --git a/qapi/qmp-input-visitor.c b/qapi/qmp-input-visitor.c
> index aea90a1..92023b1 100644
> --- a/qapi/qmp-input-visitor.c
> +++ b/qapi/qmp-input-visitor.c
> @@ -20,6 +20,7 @@
>  #include "qemu-common.h"
>  #include "qapi/qmp/types.h"
>  #include "qapi/qmp/qerror.h"
> +#include "qemu/cutils.h"
>  
>  #define QIV_STACK_SIZE 1024
>  
> @@ -45,6 +46,7 @@ struct QmpInputVisitor
>  
>  /* True to reject parse in visit_end_struct() if unvisited keys remain. 
> */
>  bool strict;
> +bool autocast;
>  };
>  
>  static QmpInputVisitor *to_qiv(Visitor *v)
> @@ -254,15 +256,25 @@ static void qmp_input_type_int64(Visitor *v, const char 
> *name, int64_t *obj,
>   Error **errp)
>  {
>  QmpInputVisitor *qiv = to_qiv(v);
> -QInt *qint = qobject_to_qint(qmp_input_get_object(qiv, name, true));
> +QObject *qobj = qmp_input_get_object(qiv, name, true);
> +QInt *qint;
> +QString *qstr;
>  
> -if (!qint) {
> -error_setg(errp, QERR_INVALID_PARAMETER_TYPE, name ? name : "null",
> -   "integer");
> +qint = qobject_to_qint(qobj);
> +if (qint) {
> +*obj = qint_get_int(qint);
>  return;
>  }
>  
> -*obj = qint_get_int(qint);
> +qstr = qobject_to_qstring(qobj);
> +if (qstr && qstr->string &&

Re: [Qemu-block] [PATCH v5 00/11] Provide a QOM-based authorization API

2016-06-08 Thread Daniel P. Berrange

On Thu, Jun 02, 2016 at 05:46:16PM +0100, Daniel P. Berrange wrote:
> This is a followup of previously posted work in 2.6 cycle:
> 
>  v1: https://lists.gnu.org/archive/html/qemu-devel/2016-02/msg04618.html
>  v2: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg01454.html
>  v3: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg02498.html
>  v4: https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg01661.html
> 
> Many years ago I was responsible for adding the 'qemu_acl' type
> and associated HMP commands. Looking back at it now, it is quite
> a poor facility with a couple of bad limitations. First, the
> responsibility for creating the ACLs was left with the QEMU network
> service (VNC server was only thing ever doing it). This meant you
> could not share ACLs across multiple services. Second, there was
> no way to populate ACLs on the command line, you had no choice but
> to use the HMP commands. Third, the API was hardcoded around the
> idea of an in-QEMU implementation, leaving no scope for plugging
> in alternative implementations backed by, for example, LDAP or PAM.
> 
> This series introduces a much better authorization API design
> to QEMU that addresses all these problems, and maintains back
> compatibility. It of course is based on the QOM framework, so
> that immediately gives us ability to create objects via the
> CLI, HMP or QMP. There is an abstract base clss "QAuthZ" which
> defines the basic API for QEMU network services to use, and a
> specific implementation "QAuthZ" simple which replicates the
> functionality of 'qemu_acl'. It is thus possible to add other
> impls, without changing any other part of QEMU in the future.
> Finally, the user is responsible for creating the ACL objects,
> so they can have one ACL associated with all their TLS enabled
> network services.
> 
> There was only one small problem with this, specifically the
> -object CLI arg and HMP 'object_add' command had no way to let
> the user specify non-scalar properties for objects. eg if an
> object had a property which is a list of structs, you are out
> of luck if you want to create it without using QMP.
> 
> Thus the first three patches do some work around QAPI / QOM
> to make it possible to specify non-scalar properties with
> the -object CLI arg and HMP 'object_add' command. See the
> respective patches for illustration of the syntax used. Some
> of Max's recent block patches also depend on the qdict_crumple
> method in patch 1.

Ping, does anyone have any feedback on these first 3 patches
in particular. If we get agreement on those and get them merged,
then we can feed the rest of this series to the individual
subsystem maintainers that are affected, and also unblock
Max's patch series.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|

Re: [Qemu-block] [PATCH v2 2/3] block/mirror: Fix target backing BDS

2016-06-08 Thread Kevin Wolf

Am 08.06.2016 um 13:28 hat Paolo Bonzini geschrieben:
> 
> 
> - Original Message -
> > From: "Kevin Wolf" 
> > To: "Max Reitz" 
> > Cc: qemu-block@nongnu.org, qemu-de...@nongnu.org, "Fam Zheng" 
> > , nsof...@redhat.com,
> > ebl...@redhat.com, pbonz...@redhat.com
> > Sent: Wednesday, June 8, 2016 11:32:29 AM
> > Subject: Re: [PATCH v2 2/3] block/mirror: Fix target backing BDS
> > 
> > Am 06.06.2016 um 16:42 hat Max Reitz geschrieben:
> > > Currently, we are trying to move the backing BDS from the source to the
> > > target in bdrv_replace_in_backing_chain() which is called from
> > > mirror_exit(). However, mirror_complete() already tries to open the
> > > target's backing chain with a call to bdrv_open_backing_file().
> > > 
> > > First, we should only set the target's backing BDS once. Second, the
> > > mirroring block job has a better idea of what to set it to than the
> > > generic code in bdrv_replace_in_backing_chain() (in fact, the latter's
> > > conditions on when to move the backing BDS from source to target are not
> > > really correct).
> > > 
> > > Therefore, remove that code from bdrv_replace_in_backing_chain() and
> > > leave it to mirror_complete().
> > > 
> > > However, mirror_complete() in turn pursues a questionable strategy by
> > > employing bdrv_open_backing_file(): On the one hand, because this may
> > > open the wrong backing file with drive-mirror in "existing" mode, or
> > > because it will not override a possibly wrong backing file in the
> > > blockdev-mirror case.
> > 
> > Careful, this "wrong" backing file might actually be intended!
> > 
> > Consider a case where you want to move an image with its whole backing
> > chain to different storage. In that case, you would copy all of the
> > backing files (cp is good enough, they are read-only), create the
> > destination image which already points at the copied backing chain, and
> > then mirror in "existing" mode.
> > 
> > The intention is obviously that after the job completion the new backing
> > chain is used and not the old one.
> 
> Yes, this is the intention and it should not be changed.  In addition
> to what Kevin said, you can use drive-mirror to collapse the image to a
> single file; in this case, QEMU should not be using the backing files of
> the source.
> 
> bdrv_open_backing_file() is used because what we want to do is to
> "undo" the BDRV_O_NO_BACKING flag used by qmp_drive_mirror.
> 
> If the contents change under the guest feet, it's the layers above
> QEMU that have screwed up.

We should probably have test cases for both scenarios. They would make
it obvious that changing this behaviour is not okay. Actually, I'm
surprised that our existing cases don't seem to cover this.

Kevin

Re: [Qemu-block] [PATCH v2 2/3] block/mirror: Fix target backing BDS

2016-06-08 Thread Paolo Bonzini



- Original Message -
> From: "Kevin Wolf" 
> To: "Max Reitz" 
> Cc: qemu-block@nongnu.org, qemu-de...@nongnu.org, "Fam Zheng" 
> , nsof...@redhat.com,
> ebl...@redhat.com, pbonz...@redhat.com
> Sent: Wednesday, June 8, 2016 11:32:29 AM
> Subject: Re: [PATCH v2 2/3] block/mirror: Fix target backing BDS
> 
> Am 06.06.2016 um 16:42 hat Max Reitz geschrieben:
> > Currently, we are trying to move the backing BDS from the source to the
> > target in bdrv_replace_in_backing_chain() which is called from
> > mirror_exit(). However, mirror_complete() already tries to open the
> > target's backing chain with a call to bdrv_open_backing_file().
> > 
> > First, we should only set the target's backing BDS once. Second, the
> > mirroring block job has a better idea of what to set it to than the
> > generic code in bdrv_replace_in_backing_chain() (in fact, the latter's
> > conditions on when to move the backing BDS from source to target are not
> > really correct).
> > 
> > Therefore, remove that code from bdrv_replace_in_backing_chain() and
> > leave it to mirror_complete().
> > 
> > However, mirror_complete() in turn pursues a questionable strategy by
> > employing bdrv_open_backing_file(): On the one hand, because this may
> > open the wrong backing file with drive-mirror in "existing" mode, or
> > because it will not override a possibly wrong backing file in the
> > blockdev-mirror case.
> 
> Careful, this "wrong" backing file might actually be intended!
> 
> Consider a case where you want to move an image with its whole backing
> chain to different storage. In that case, you would copy all of the
> backing files (cp is good enough, they are read-only), create the
> destination image which already points at the copied backing chain, and
> then mirror in "existing" mode.
> 
> The intention is obviously that after the job completion the new backing
> chain is used and not the old one.

Yes, this is the intention and it should not be changed.  In addition
to what Kevin said, you can use drive-mirror to collapse the image to a
single file; in this case, QEMU should not be using the backing files of
the source.

bdrv_open_backing_file() is used because what we want to do is to
"undo" the BDRV_O_NO_BACKING flag used by qmp_drive_mirror.

If the contents change under the guest feet, it's the layers above
QEMU that have screwed up.

Paolo

[Qemu-block] [PULL 26/31] block: Don't emulate natively supported pwritev flags

2016-06-08 Thread Kevin Wolf

Drivers that implement .bdrv_co_pwritev() get the flags passed as an
argument to said function, but we also unconditionally emulate the flags
anyway. We shouldn't do that.

Fix this by clearing all flags that the driver supports natively after
it returns from .bdrv_co_pwritev().

Fixes: 4df863f3 ('block: Make supported_write_flags a per-bds property')
Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
---
 block/io.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index 9dc265b..fb99a71 100644
--- a/block/io.c
+++ b/block/io.c
@@ -816,7 +816,9 @@ static int coroutine_fn 
bdrv_driver_pwritev(BlockDriverState *bs,
 int ret;
 
 if (drv->bdrv_co_pwritev) {
-ret = drv->bdrv_co_pwritev(bs, offset, bytes, qiov, flags);
+ret = drv->bdrv_co_pwritev(bs, offset, bytes, qiov,
+   flags & bs->supported_write_flags);
+flags &= ~bs->supported_write_flags;
 goto emulate_flags;
 }
 
-- 
1.8.3.1

[Qemu-block] [PULL 30/31] qemu-img bench: Implement -S (step size)

2016-06-08 Thread Kevin Wolf

With this new option, qemu-img bench can be told to advance the current
offset after each request by a different value than the buffer size.
This is useful for controlling the conditions for cluster allocation in
image formats (e.g. qcow2 cluster allocation with COW in front of the
request, or COW areas that aren't overwritten immediately).

Signed-off-by: Kevin Wolf 
Reviewed-by: Denis V. Lunev 
Reviewed-by: Stefan Hajnoczi 
---
 qemu-img-cmds.hx |  4 ++--
 qemu-img.c   | 25 +
 qemu-img.texi|  6 --
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 117d0f9..05a2991 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -10,9 +10,9 @@ STEXI
 ETEXI
 
 DEF("bench", img_bench,
-"bench [-c count] [-d depth] [-f fmt] [-n] [-o offset] [--pattern=pattern] 
[-q] [-s buffer_size] [-t cache] [-w] filename")
+"bench [-c count] [-d depth] [-f fmt] [-n] [-o offset] [--pattern=pattern] 
[-q] [-s buffer_size] [-S step_size] [-t cache] [-w] filename")
 STEXI
-@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] [-n] [-o 
@var{offset}] [--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-t 
@var{cache}] [-w] @var{filename}
+@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] [-n] [-o 
@var{offset}] [--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-S 
@var{step_size}] [-t @var{cache}] [-w] @var{filename}
 ETEXI
 
 DEF("check", img_check,
diff --git a/qemu-img.c b/qemu-img.c
index 480ef8d..c5e2638 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3465,6 +3465,7 @@ typedef struct BenchData {
 uint64_t image_size;
 bool write;
 int bufsize;
+int step;
 int nrreq;
 int n;
 uint8_t *buf;
@@ -3501,7 +3502,7 @@ static void bench_cb(void *opaque, int ret)
 exit(EXIT_FAILURE);
 }
 b->in_flight++;
-b->offset += b->bufsize;
+b->offset += b->step;
 b->offset %= b->image_size;
 }
 }
@@ -3518,6 +3519,7 @@ static int img_bench(int argc, char **argv)
 int64_t offset = 0;
 size_t bufsize = 4096;
 int pattern = 0;
+size_t step = 0;
 int64_t image_size;
 BlockBackend *blk = NULL;
 BenchData data = {};
@@ -3533,7 +3535,7 @@ static int img_bench(int argc, char **argv)
 {"pattern", required_argument, 0, OPTION_PATTERN},
 {0, 0, 0, 0}
 };
-c = getopt_long(argc, argv, "hc:d:f:no:qs:t:w", long_options, NULL);
+c = getopt_long(argc, argv, "hc:d:f:no:qs:S:t:w", long_options, NULL);
 if (c == -1) {
 break;
 }
@@ -3601,6 +3603,20 @@ static int img_bench(int argc, char **argv)
 bufsize = sval;
 break;
 }
+case 'S':
+{
+int64_t sval;
+char *end;
+
+sval = qemu_strtosz_suffix(optarg, , QEMU_STRTOSZ_DEFSUFFIX_B);
+if (sval < 0 || sval > INT_MAX || *end) {
+error_report("Invalid step size specified");
+return 1;
+}
+
+step = sval;
+break;
+}
 case 't':
 ret = bdrv_parse_cache_mode(optarg, , );
 if (ret < 0) {
@@ -3651,15 +3667,16 @@ static int img_bench(int argc, char **argv)
 .blk= blk,
 .image_size = image_size,
 .bufsize= bufsize,
+.step   = step ?: bufsize,
 .nrreq  = depth,
 .n  = count,
 .offset = offset,
 .write  = is_write,
 };
 printf("Sending %d %s requests, %d bytes each, %d in parallel "
-   "(starting at offset %" PRId64 ")\n",
+   "(starting at offset %" PRId64 ", step size %d)\n",
data.n, data.write ? "write" : "read", data.bufsize, data.nrreq,
-   data.offset);
+   data.offset, data.step);
 
 data.buf = blk_blockalign(blk, data.nrreq * data.bufsize);
 memset(data.buf, pattern, data.nrreq * data.bufsize);
diff --git a/qemu-img.texi b/qemu-img.texi
index 9bffad2..ccc0b51 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -131,14 +131,16 @@ Skip the creation of the target volume
 Command description:
 
 @table @option
-@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] [-n] [-o 
@var{offset}] [--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-t 
@var{cache}] [-w] @var{filename}
+@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] [-n] [-o 
@var{offset}] [--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-S 
@var{step_size}] [-t @var{cache}] [-w] @var{filename}
 
 Run a simple sequential I/O benchmark on the specified image. If @code{-w} is
 specified, a write test is performed, otherwise a read test is performed.
 
 A total number of @var{count} I/O requests is performed, each @var{buffer_size}
 bytes in size, and with @var{depth} requests in parallel. The first request
-starts at the position given

[Qemu-block] [PULL 23/31] qcow2: avoid extra flushes in qcow2

2016-06-08 Thread Kevin Wolf

From: "Denis V. Lunev" 

The problem with excessive flushing was found by a couple of performance
tests:
  - parallel directory tree creation (from 2 processes)
  - 32 cached writes + fsync at the end in a loop

For the first one results improved from 2.6 loops/sec to 3.5 loops/sec.
Each loop creates 10^3 directories with 10 files in each.

For the second one results improved from ~600 fsync/sec to ~1100
fsync/sec. Though, it was run on SSD so it probably won't show such
performance gain on rotational media.

qcow2_cache_flush() calls bdrv_flush() unconditionally after writing
cache entries of a particular cache. This can lead to as many as
2 additional fdatasyncs inside bdrv_flush.

We can simply skip all fdatasync calls inside qcow2_co_flush_to_os
as bdrv_flush for sure will do the job. These flushes are necessary to
keep the right order of writes to the different caches. Though this is
not necessary in the current code base as this ordering is ensured through
the flush in qcow2_cache_flush_dependency().

Signed-off-by: Denis V. Lunev 
CC: Pavel Borzenkov 
CC: Kevin Wolf 
CC: Max Reitz 
Signed-off-by: Kevin Wolf 
---
 block/qcow2-cache.c | 11 +--
 block/qcow2.c   |  4 ++--
 block/qcow2.h   |  1 +
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 0fe8eda..208a060 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -226,7 +226,7 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, 
Qcow2Cache *c, int i)
 return 0;
 }
 
-int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c)
+int qcow2_cache_write(BlockDriverState *bs, Qcow2Cache *c)
 {
 BDRVQcow2State *s = bs->opaque;
 int result = 0;
@@ -242,8 +242,15 @@ int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c)
 }
 }
 
+return result;
+}
+
+int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c)
+{
+int result = qcow2_cache_write(bs, c);
+
 if (result == 0) {
-ret = bdrv_flush(bs->file->bs);
+int ret = bdrv_flush(bs->file->bs);
 if (ret < 0) {
 result = ret;
 }
diff --git a/block/qcow2.c b/block/qcow2.c
index 5e26c3d..6f5fb81 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2813,14 +2813,14 @@ static coroutine_fn int 
qcow2_co_flush_to_os(BlockDriverState *bs)
 int ret;
 
 qemu_co_mutex_lock(>lock);
-ret = qcow2_cache_flush(bs, s->l2_table_cache);
+ret = qcow2_cache_write(bs, s->l2_table_cache);
 if (ret < 0) {
 qemu_co_mutex_unlock(>lock);
 return ret;
 }
 
 if (qcow2_need_accurate_refcounts(s)) {
-ret = qcow2_cache_flush(bs, s->refcount_block_cache);
+ret = qcow2_cache_write(bs, s->refcount_block_cache);
 if (ret < 0) {
 qemu_co_mutex_unlock(>lock);
 return ret;
diff --git a/block/qcow2.h b/block/qcow2.h
index a063a3c..7db9795 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -583,6 +583,7 @@ int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache 
*c);
 void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
  void *table);
 int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c);
+int qcow2_cache_write(BlockDriverState *bs, Qcow2Cache *c);
 int qcow2_cache_set_dependency(BlockDriverState *bs, Qcow2Cache *c,
 Qcow2Cache *dependency);
 void qcow2_cache_depends_on_flush(Qcow2Cache *c);
-- 
1.8.3.1

[Qemu-block] [PULL 29/31] qemu-img bench: Make start offset configurable

2016-06-08 Thread Kevin Wolf

This patch adds an option the specify the offset of the first request
made by qemu-img bench. This allows to benchmark misaligned requests.

Signed-off-by: Kevin Wolf 
Reviewed-by: Denis V. Lunev 
Reviewed-by: Stefan Hajnoczi 
---
 qemu-img-cmds.hx |  4 ++--
 qemu-img.c   | 23 ---
 qemu-img.texi|  5 +++--
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index baca85e..117d0f9 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -10,9 +10,9 @@ STEXI
 ETEXI
 
 DEF("bench", img_bench,
-"bench [-c count] [-d depth] [-f fmt] [-n] [--pattern=pattern] [-q] [-s 
buffer_size] [-t cache] [-w] filename")
+"bench [-c count] [-d depth] [-f fmt] [-n] [-o offset] [--pattern=pattern] 
[-q] [-s buffer_size] [-t cache] [-w] filename")
 STEXI
-@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] [-n] 
[--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-t @var{cache}] [-w] 
@var{filename}
+@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] [-n] [-o 
@var{offset}] [--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-t 
@var{cache}] [-w] @var{filename}
 ETEXI
 
 DEF("check", img_check,
diff --git a/qemu-img.c b/qemu-img.c
index 85d1353..480ef8d 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3515,6 +3515,7 @@ static int img_bench(int argc, char **argv)
 bool is_write = false;
 int count = 75000;
 int depth = 64;
+int64_t offset = 0;
 size_t bufsize = 4096;
 int pattern = 0;
 int64_t image_size;
@@ -3532,7 +3533,7 @@ static int img_bench(int argc, char **argv)
 {"pattern", required_argument, 0, OPTION_PATTERN},
 {0, 0, 0, 0}
 };
-c = getopt_long(argc, argv, "hc:d:f:nqs:t:w", long_options, NULL);
+c = getopt_long(argc, argv, "hc:d:f:no:qs:t:w", long_options, NULL);
 if (c == -1) {
 break;
 }
@@ -3570,6 +3571,19 @@ static int img_bench(int argc, char **argv)
 case 'n':
 flags |= BDRV_O_NATIVE_AIO;
 break;
+case 'o':
+{
+char *end;
+errno = 0;
+offset = qemu_strtosz_suffix(optarg, ,
+ QEMU_STRTOSZ_DEFSUFFIX_B);
+if (offset < 0|| *end) {
+error_report("Invalid offset specified");
+return 1;
+}
+break;
+}
+break;
 case 'q':
 quiet = true;
 break;
@@ -3639,10 +3653,13 @@ static int img_bench(int argc, char **argv)
 .bufsize= bufsize,
 .nrreq  = depth,
 .n  = count,
+.offset = offset,
 .write  = is_write,
 };
-printf("Sending %d %s requests, %d bytes each, %d in parallel\n",
-   data.n, data.write ? "write" : "read", data.bufsize, data.nrreq);
+printf("Sending %d %s requests, %d bytes each, %d in parallel "
+   "(starting at offset %" PRId64 ")\n",
+   data.n, data.write ? "write" : "read", data.bufsize, data.nrreq,
+   data.offset);
 
 data.buf = blk_blockalign(blk, data.nrreq * data.bufsize);
 memset(data.buf, pattern, data.nrreq * data.bufsize);
diff --git a/qemu-img.texi b/qemu-img.texi
index c477fbf..9bffad2 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -131,13 +131,14 @@ Skip the creation of the target volume
 Command description:
 
 @table @option
-@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] [-n] 
[--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-t @var{cache}] [-w] 
@var{filename}
+@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] [-n] [-o 
@var{offset}] [--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-t 
@var{cache}] [-w] @var{filename}
 
 Run a simple sequential I/O benchmark on the specified image. If @code{-w} is
 specified, a write test is performed, otherwise a read test is performed.
 
 A total number of @var{count} I/O requests is performed, each @var{buffer_size}
-bytes in size, and with @var{depth} requests in parallel.
+bytes in size, and with @var{depth} requests in parallel. The first request
+starts at the position given by @var{offset}.
 
 If @code{-n} is specified, the native AIO backend is used if possible. On
 Linux, this option only works if @code{-t none} or @code{-t directsync} is
-- 
1.8.3.1

[Qemu-block] [PULL 14/31] qed: Convert to bdrv_co_pwrite_zeroes()

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Another step on our continuing quest to switch to byte-based
interfaces.

Kill an abuse of the comma operator while at it (fortunately,
the semantics were still right).  Also, the test for requests
not aligned to clusters should be applied always, not just
when a backing file is present.

Signed-off-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/qed.c | 33 +++--
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index 113b8e8..1206806 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1418,7 +1418,7 @@ typedef struct {
 bool done;
 } QEDWriteZeroesCB;
 
-static void coroutine_fn qed_co_write_zeroes_cb(void *opaque, int ret)
+static void coroutine_fn qed_co_pwrite_zeroes_cb(void *opaque, int ret)
 {
 QEDWriteZeroesCB *cb = opaque;
 
@@ -1429,10 +1429,10 @@ static void coroutine_fn qed_co_write_zeroes_cb(void 
*opaque, int ret)
 }
 }
 
-static int coroutine_fn bdrv_qed_co_write_zeroes(BlockDriverState *bs,
- int64_t sector_num,
- int nb_sectors,
- BdrvRequestFlags flags)
+static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
+  int64_t offset,
+  int count,
+  BdrvRequestFlags flags)
 {
 BlockAIOCB *blockacb;
 BDRVQEDState *s = bs->opaque;
@@ -1440,25 +1440,22 @@ static int coroutine_fn 
bdrv_qed_co_write_zeroes(BlockDriverState *bs,
 QEMUIOVector qiov;
 struct iovec iov;
 
-/* Refuse if there are untouched backing file sectors */
-if (bs->backing) {
-if (qed_offset_into_cluster(s, sector_num * BDRV_SECTOR_SIZE) != 0) {
-return -ENOTSUP;
-}
-if (qed_offset_into_cluster(s, nb_sectors * BDRV_SECTOR_SIZE) != 0) {
-return -ENOTSUP;
-}
+/* Fall back if the request is not aligned */
+if (qed_offset_into_cluster(s, offset) ||
+qed_offset_into_cluster(s, count)) {
+return -ENOTSUP;
 }
 
 /* Zero writes start without an I/O buffer.  If a buffer becomes necessary
  * then it will be allocated during request processing.
  */
-iov.iov_base = NULL,
-iov.iov_len  = nb_sectors * BDRV_SECTOR_SIZE,
+iov.iov_base = NULL;
+iov.iov_len = count;
 
 qemu_iovec_init_external(, , 1);
-blockacb = qed_aio_setup(bs, sector_num, , nb_sectors,
- qed_co_write_zeroes_cb, ,
+blockacb = qed_aio_setup(bs, offset >> BDRV_SECTOR_BITS, ,
+ count >> BDRV_SECTOR_BITS,
+ qed_co_pwrite_zeroes_cb, ,
  QED_AIOCB_WRITE | QED_AIOCB_ZERO);
 if (!blockacb) {
 return -EIO;
@@ -1663,7 +1660,7 @@ static BlockDriver bdrv_qed = {
 .bdrv_co_get_block_status = bdrv_qed_co_get_block_status,
 .bdrv_aio_readv   = bdrv_qed_aio_readv,
 .bdrv_aio_writev  = bdrv_qed_aio_writev,
-.bdrv_co_write_zeroes = bdrv_qed_co_write_zeroes,
+.bdrv_co_pwrite_zeroes= bdrv_qed_co_pwrite_zeroes,
 .bdrv_truncate= bdrv_qed_truncate,
 .bdrv_getlength   = bdrv_qed_getlength,
 .bdrv_get_info= bdrv_qed_get_info,
-- 
1.8.3.1

[Qemu-block] [PULL 27/31] qemu-img bench

2016-06-08 Thread Kevin Wolf

This adds a qemu-img command that allows doing some simple benchmarks
for the block layer without involving guest devices and a real VM.

For the start, this implements only a test of sequential reads.

Signed-off-by: Kevin Wolf 
Reviewed-by: Denis V. Lunev 
Reviewed-by: Stefan Hajnoczi 
---
 qemu-img-cmds.hx |   6 ++
 qemu-img.c   | 190 +++
 qemu-img.texi|  10 +++
 3 files changed, 206 insertions(+)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index e7cded6..f3bd546 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -9,6 +9,12 @@ STEXI
 @table @option
 ETEXI
 
+DEF("bench", img_bench,
+"bench [-c count] [-d depth] [-f fmt] [-n] [-q] [-s buffer_size] [-t 
cache] filename")
+STEXI
+@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] [-n] [-q] [-s 
@var{buffer_size}] [-t @var{cache}] @var{filename}
+ETEXI
+
 DEF("check", img_check,
 "check [-q] [--object objectdef] [--image-opts] [-f fmt] [--output=ofmt] 
[-r [leaks | all]] [-T src_cache] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index 4b56ad3..d471d10 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3459,6 +3459,196 @@ out_no_progress:
 return 0;
 }
 
+typedef struct BenchData {
+BlockBackend *blk;
+uint64_t image_size;
+int bufsize;
+int nrreq;
+int n;
+uint8_t *buf;
+QEMUIOVector *qiov;
+
+int in_flight;
+uint64_t offset;
+} BenchData;
+
+static void bench_cb(void *opaque, int ret)
+{
+BenchData *b = opaque;
+BlockAIOCB *acb;
+
+if (ret < 0) {
+error_report("Failed request: %s\n", strerror(-ret));
+exit(EXIT_FAILURE);
+}
+if (b->in_flight > 0) {
+b->n--;
+b->in_flight--;
+}
+
+while (b->n > b->in_flight && b->in_flight < b->nrreq) {
+acb = blk_aio_preadv(b->blk, b->offset, b->qiov, 0,
+ bench_cb, b);
+if (!acb) {
+error_report("Failed to issue request");
+exit(EXIT_FAILURE);
+}
+b->in_flight++;
+b->offset += b->bufsize;
+b->offset %= b->image_size;
+}
+}
+
+static int img_bench(int argc, char **argv)
+{
+int c, ret = 0;
+const char *fmt = NULL, *filename;
+bool quiet = false;
+bool image_opts = false;
+int count = 75000;
+int depth = 64;
+size_t bufsize = 4096;
+int64_t image_size;
+BlockBackend *blk = NULL;
+BenchData data = {};
+int flags = 0;
+bool writethrough;
+struct timeval t1, t2;
+int i;
+
+for (;;) {
+static const struct option long_options[] = {
+{"help", no_argument, 0, 'h'},
+{"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
+{0, 0, 0, 0}
+};
+c = getopt_long(argc, argv, "hc:d:f:nqs:t:", long_options, NULL);
+if (c == -1) {
+break;
+}
+
+switch (c) {
+case 'h':
+case '?':
+help();
+break;
+case 'c':
+{
+char *end;
+errno = 0;
+count = strtoul(optarg, , 0);
+if (errno || *end || count > INT_MAX) {
+error_report("Invalid request count specified");
+return 1;
+}
+break;
+}
+case 'd':
+{
+char *end;
+errno = 0;
+depth = strtoul(optarg, , 0);
+if (errno || *end || depth > INT_MAX) {
+error_report("Invalid queue depth specified");
+return 1;
+}
+break;
+}
+case 'f':
+fmt = optarg;
+break;
+case 'n':
+flags |= BDRV_O_NATIVE_AIO;
+break;
+case 'q':
+quiet = true;
+break;
+case 's':
+{
+int64_t sval;
+char *end;
+
+sval = qemu_strtosz_suffix(optarg, , QEMU_STRTOSZ_DEFSUFFIX_B);
+if (sval < 0 || sval > INT_MAX || *end) {
+error_report("Invalid buffer size specified");
+return 1;
+}
+
+bufsize = sval;
+break;
+}
+case 't':
+ret = bdrv_parse_cache_mode(optarg, , );
+if (ret < 0) {
+error_report("Invalid cache mode");
+ret = -1;
+goto out;
+}
+break;
+case OPTION_IMAGE_OPTS:
+image_opts = true;
+break;
+}
+}
+
+if (optind != argc - 1) {
+error_exit("Expecting one image file name");
+}
+filename = argv[argc - 1];
+
+blk = img_open(image_opts, filename, fmt, flags, writethrough, quiet);
+if (!blk) {
+ret = -1;
+goto out;
+}
+
+image_size = blk_getlength(blk);
+if (image_size < 0) {
+ret = image_size;
+goto out;
+

[Qemu-block] [PULL 25/31] blockdev: clean up error handling in do_open_tray

2016-06-08 Thread Kevin Wolf

From: Colin Lord 

Returns negative error codes and accompanying error messages in cases where
the device has no tray or the tray is locked and isn't forced open. This
extra information should result in better flexibility in functions that
call do_open_tray.

Suggested by: Markus Armbruster 
Signed-off-by: Colin Lord 
Signed-off-by: Kevin Wolf 
---
 blockdev.c | 46 --
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 6ccb8e1..7fd515a 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -56,6 +56,8 @@
 static QTAILQ_HEAD(, BlockDriverState) monitor_bdrv_states =
 QTAILQ_HEAD_INITIALIZER(monitor_bdrv_states);
 
+static int do_open_tray(const char *device, bool force, Error **errp);
+
 static const char *const if_name[IF_COUNT] = {
 [IF_NONE] = "none",
 [IF_IDE] = "ide",
@@ -2274,8 +2276,6 @@ exit:
 block_job_txn_unref(block_job_txn);
 }
 
-static int do_open_tray(const char *device, bool force, Error **errp);
-
 void qmp_eject(const char *device, bool has_force, bool force, Error **errp)
 {
 Error *local_err = NULL;
@@ -2286,16 +2286,11 @@ void qmp_eject(const char *device, bool has_force, bool 
force, Error **errp)
 }
 
 rc = do_open_tray(device, force, _err);
-if (local_err) {
+if (rc && rc != -ENOSYS) {
 error_propagate(errp, local_err);
 return;
 }
-
-if (rc == EINPROGRESS) {
-error_setg(errp, "Device '%s' is locked and force was not specified, "
-   "wait for tray to open and try again", device);
-return;
-}
+error_free(local_err);
 
 qmp_x_blockdev_remove_medium(device, errp);
 }
@@ -2324,11 +2319,16 @@ void qmp_block_passwd(bool has_device, const char 
*device,
 aio_context_release(aio_context);
 }
 
-/**
- * returns -errno on fatal error, +errno for non-fatal situations.
- * errp will always be set when the return code is negative.
- * May return +ENOSYS if the device has no tray,
- * or +EINPROGRESS if the tray is locked and the guest has been notified.
+/*
+ * Attempt to open the tray of @device.
+ * If @force, ignore its tray lock.
+ * Else, if the tray is locked, don't open it, but ask the guest to open it.
+ * On error, store an error through @errp and return -errno.
+ * If @device does not exist, return -ENODEV.
+ * If it has no removable media, return -ENOTSUP.
+ * If it has no tray, return -ENOSYS.
+ * If the guest was asked to open the tray, return -EINPROGRESS.
+ * Else, return 0.
  */
 static int do_open_tray(const char *device, bool force, Error **errp)
 {
@@ -2348,8 +2348,8 @@ static int do_open_tray(const char *device, bool force, 
Error **errp)
 }
 
 if (!blk_dev_has_tray(blk)) {
-/* Ignore this command on tray-less devices */
-return ENOSYS;
+error_setg(errp, "Device '%s' does not have a tray", device);
+return -ENOSYS;
 }
 
 if (blk_dev_is_tray_open(blk)) {
@@ -2366,7 +2366,9 @@ static int do_open_tray(const char *device, bool force, 
Error **errp)
 }
 
 if (locked && !force) {
-return EINPROGRESS;
+error_setg(errp, "Device '%s' is locked and force was not specified, "
+   "wait for tray to open and try again", device);
+return -EINPROGRESS;
 }
 
 return 0;
@@ -2375,10 +2377,18 @@ static int do_open_tray(const char *device, bool force, 
Error **errp)
 void qmp_blockdev_open_tray(const char *device, bool has_force, bool force,
 Error **errp)
 {
+Error *local_err = NULL;
+int rc;
+
 if (!has_force) {
 force = false;
 }
-do_open_tray(device, force, errp);
+rc = do_open_tray(device, force, _err);
+if (rc && rc != -ENOSYS && rc != -EINPROGRESS) {
+error_propagate(errp, local_err);
+return;
+}
+error_free(local_err);
 }
 
 void qmp_blockdev_close_tray(const char *device, Error **errp)
-- 
1.8.3.1

[Qemu-block] [PULL 12/31] blkreplay: Convert to bdrv_co_pwrite_zeroes()

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Another step on our continuing quest to switch to byte-based
interfaces.

Signed-off-by: Eric Blake 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/blkreplay.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/block/blkreplay.c b/block/blkreplay.c
index 1a721ad..525c2d5 100755
--- a/block/blkreplay.c
+++ b/block/blkreplay.c
@@ -103,13 +103,11 @@ static int coroutine_fn 
blkreplay_co_writev(BlockDriverState *bs,
 return ret;
 }
 
-static int coroutine_fn blkreplay_co_write_zeroes(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
+static int coroutine_fn blkreplay_co_pwrite_zeroes(BlockDriverState *bs,
+int64_t offset, int count, BdrvRequestFlags flags)
 {
 uint64_t reqid = request_id++;
-int ret = bdrv_co_pwrite_zeroes(bs->file->bs,
-sector_num << BDRV_SECTOR_BITS,
-nb_sectors << BDRV_SECTOR_BITS, flags);
+int ret = bdrv_co_pwrite_zeroes(bs->file->bs, offset, count, flags);
 block_request_create(reqid, bs, qemu_coroutine_self());
 qemu_coroutine_yield();
 
@@ -149,7 +147,7 @@ static BlockDriver bdrv_blkreplay = {
 .bdrv_co_readv  = blkreplay_co_readv,
 .bdrv_co_writev = blkreplay_co_writev,
 
-.bdrv_co_write_zeroes   = blkreplay_co_write_zeroes,
+.bdrv_co_pwrite_zeroes  = blkreplay_co_pwrite_zeroes,
 .bdrv_co_discard= blkreplay_co_discard,
 .bdrv_co_flush  = blkreplay_co_flush,
 };
-- 
1.8.3.1

[Qemu-block] [PULL 18/31] block: Kill bdrv_co_write_zeroes()

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Now that all drivers have been converted to a byte interface,
we no longer need a sector interface.

Signed-off-by: Eric Blake 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/io.c| 15 ++-
 include/block/block_int.h |  2 --
 2 files changed, 2 insertions(+), 15 deletions(-)

diff --git a/block/io.c b/block/io.c
index a426094..9dc265b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -901,7 +901,7 @@ static int coroutine_fn 
bdrv_co_do_copy_on_readv(BlockDriverState *bs,
 goto err;
 }
 
-if ((drv->bdrv_co_write_zeroes || drv->bdrv_co_pwrite_zeroes) &&
+if (drv->bdrv_co_pwrite_zeroes &&
 buffer_is_zero(bounce_buffer, iov.iov_len)) {
 ret = bdrv_co_do_pwrite_zeroes(bs,
cluster_sector_num * BDRV_SECTOR_SIZE,
@@ -1170,16 +1170,6 @@ static int coroutine_fn 
bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 !(bs->supported_zero_flags & BDRV_REQ_FUA)) {
 need_flush = true;
 }
-} else if (drv->bdrv_co_write_zeroes) {
-assert(offset % BDRV_SECTOR_SIZE == 0);
-assert(count % BDRV_SECTOR_SIZE == 0);
-ret = drv->bdrv_co_write_zeroes(bs, offset >> BDRV_SECTOR_BITS,
-num >> BDRV_SECTOR_BITS,
-flags & bs->supported_zero_flags);
-if (ret != -ENOTSUP && (flags & BDRV_REQ_FUA) &&
-!(bs->supported_zero_flags & BDRV_REQ_FUA)) {
-need_flush = true;
-}
 } else {
 assert(!bs->supported_zero_flags);
 }
@@ -1259,8 +1249,7 @@ static int coroutine_fn 
bdrv_aligned_pwritev(BlockDriverState *bs,
 ret = notifier_with_return_list_notify(>before_write_notifiers, req);
 
 if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
-!(flags & BDRV_REQ_ZERO_WRITE) &&
-(drv->bdrv_co_pwrite_zeroes || drv->bdrv_co_write_zeroes) &&
+!(flags & BDRV_REQ_ZERO_WRITE) && drv->bdrv_co_pwrite_zeroes &&
 qemu_iovec_is_zero(qiov)) {
 flags |= BDRV_REQ_ZERO_WRITE;
 if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 1dfdf92..8a4963c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -163,8 +163,6 @@ struct BlockDriver {
  * function pointer may be NULL or return -ENOSUP and .bdrv_co_writev()
  * will be called instead.
  */
-int coroutine_fn (*bdrv_co_write_zeroes)(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, BdrvRequestFlags flags);
 int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs,
 int64_t offset, int count, BdrvRequestFlags flags);
 int coroutine_fn (*bdrv_co_discard)(BlockDriverState *bs,
-- 
1.8.3.1

[Qemu-block] [PULL 13/31] gluster: Convert to bdrv_co_pwrite_zeroes()

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Another step on our continuing quest to switch to byte-based
interfaces.

Signed-off-by: Eric Blake 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/gluster.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index a8aaacf..d361d8e 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -454,14 +454,12 @@ static void qemu_gluster_reopen_abort(BDRVReopenState 
*state)
 }
 
 #ifdef CONFIG_GLUSTERFS_ZEROFILL
-static coroutine_fn int qemu_gluster_co_write_zeroes(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
+static coroutine_fn int qemu_gluster_co_pwrite_zeroes(BlockDriverState *bs,
+int64_t offset, int size, BdrvRequestFlags flags)
 {
 int ret;
 GlusterAIOCB acb;
 BDRVGlusterState *s = bs->opaque;
-off_t size = nb_sectors * BDRV_SECTOR_SIZE;
-off_t offset = sector_num * BDRV_SECTOR_SIZE;
 
 acb.size = size;
 acb.ret = 0;
@@ -769,7 +767,7 @@ static BlockDriver bdrv_gluster = {
 .bdrv_co_discard  = qemu_gluster_co_discard,
 #endif
 #ifdef CONFIG_GLUSTERFS_ZEROFILL
-.bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes,
+.bdrv_co_pwrite_zeroes= qemu_gluster_co_pwrite_zeroes,
 #endif
 .create_opts  = _gluster_create_opts,
 };
@@ -796,7 +794,7 @@ static BlockDriver bdrv_gluster_tcp = {
 .bdrv_co_discard  = qemu_gluster_co_discard,
 #endif
 #ifdef CONFIG_GLUSTERFS_ZEROFILL
-.bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes,
+.bdrv_co_pwrite_zeroes= qemu_gluster_co_pwrite_zeroes,
 #endif
 .create_opts  = _gluster_create_opts,
 };
@@ -823,7 +821,7 @@ static BlockDriver bdrv_gluster_unix = {
 .bdrv_co_discard  = qemu_gluster_co_discard,
 #endif
 #ifdef CONFIG_GLUSTERFS_ZEROFILL
-.bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes,
+.bdrv_co_pwrite_zeroes= qemu_gluster_co_pwrite_zeroes,
 #endif
 .create_opts  = _gluster_create_opts,
 };
@@ -850,7 +848,7 @@ static BlockDriver bdrv_gluster_rdma = {
 .bdrv_co_discard  = qemu_gluster_co_discard,
 #endif
 #ifdef CONFIG_GLUSTERFS_ZEROFILL
-.bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes,
+.bdrv_co_pwrite_zeroes= qemu_gluster_co_pwrite_zeroes,
 #endif
 .create_opts  = _gluster_create_opts,
 };
-- 
1.8.3.1

[Qemu-block] [PULL 11/31] qcow2: Convert to bdrv_co_pwrite_zeroes()

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Another step on our continuing quest to switch to byte-based
interfaces.

Signed-off-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/qcow2.c | 37 +++--
 trace-events  |  4 ++--
 2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index cc59efc..5e26c3d 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2421,38 +2421,39 @@ static bool is_zero_sectors(BlockDriverState *bs, 
int64_t start,
 return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
 }
 
-static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
+static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
+int64_t offset, int count, BdrvRequestFlags flags)
 {
 int ret;
 BDRVQcow2State *s = bs->opaque;
 
-uint32_t head = sector_num % s->cluster_sectors;
-uint32_t tail = (sector_num + nb_sectors) % s->cluster_sectors;
+uint32_t head = offset % s->cluster_size;
+uint32_t tail = (offset + count) % s->cluster_size;
 
-trace_qcow2_write_zeroes_start_req(qemu_coroutine_self(), sector_num,
-   nb_sectors);
+trace_qcow2_pwrite_zeroes_start_req(qemu_coroutine_self(), offset, count);
 
 if (head || tail) {
-int64_t cl_start = sector_num - head;
+int64_t cl_start = (offset - head) >> BDRV_SECTOR_BITS;
 uint64_t off;
 int nr;
 
-assert(cl_start + s->cluster_sectors >= sector_num + nb_sectors);
+assert(head + count <= s->cluster_size);
 
 /* check whether remainder of cluster already reads as zero */
-if (!(is_zero_sectors(bs, cl_start, head) &&
-  is_zero_sectors(bs, sector_num + nb_sectors,
-  -tail & (s->cluster_sectors - 1 {
+if (!(is_zero_sectors(bs, cl_start,
+  DIV_ROUND_UP(head, BDRV_SECTOR_SIZE)) &&
+  is_zero_sectors(bs, (offset + count) >> BDRV_SECTOR_BITS,
+  DIV_ROUND_UP(-tail & (s->cluster_size - 1),
+   BDRV_SECTOR_SIZE {
 return -ENOTSUP;
 }
 
 qemu_co_mutex_lock(>lock);
 /* We can have new write after previous check */
-sector_num = cl_start;
-nb_sectors = nr = s->cluster_sectors;
-ret = qcow2_get_cluster_offset(bs, cl_start << BDRV_SECTOR_BITS,
-   , );
+offset = cl_start << BDRV_SECTOR_BITS;
+count = s->cluster_size;
+nr = s->cluster_sectors;
+ret = qcow2_get_cluster_offset(bs, offset, , );
 if (ret != QCOW2_CLUSTER_UNALLOCATED && ret != QCOW2_CLUSTER_ZERO) {
 qemu_co_mutex_unlock(>lock);
 return -ENOTSUP;
@@ -2461,10 +2462,10 @@ static coroutine_fn int 
qcow2_co_write_zeroes(BlockDriverState *bs,
 qemu_co_mutex_lock(>lock);
 }
 
-trace_qcow2_write_zeroes(qemu_coroutine_self(), sector_num, nb_sectors);
+trace_qcow2_pwrite_zeroes(qemu_coroutine_self(), offset, count);
 
 /* Whatever is left can use real zero clusters */
-ret = qcow2_zero_clusters(bs, sector_num << BDRV_SECTOR_BITS, nb_sectors);
+ret = qcow2_zero_clusters(bs, offset, count >> BDRV_SECTOR_BITS);
 qemu_co_mutex_unlock(>lock);
 
 return ret;
@@ -3371,7 +3372,7 @@ BlockDriver bdrv_qcow2 = {
 .bdrv_co_writev = qcow2_co_writev,
 .bdrv_co_flush_to_os= qcow2_co_flush_to_os,
 
-.bdrv_co_write_zeroes   = qcow2_co_write_zeroes,
+.bdrv_co_pwrite_zeroes  = qcow2_co_pwrite_zeroes,
 .bdrv_co_discard= qcow2_co_discard,
 .bdrv_truncate  = qcow2_truncate,
 .bdrv_write_compressed  = qcow2_write_compressed,
diff --git a/trace-events b/trace-events
index cfb1886..b1ca05b 100644
--- a/trace-events
+++ b/trace-events
@@ -611,8 +611,8 @@ qcow2_writev_done_req(void *co, int ret) "co %p ret %d"
 qcow2_writev_start_part(void *co) "co %p"
 qcow2_writev_done_part(void *co, int cur_nr_sectors) "co %p cur_nr_sectors %d"
 qcow2_writev_data(void *co, uint64_t offset) "co %p offset %" PRIx64
-qcow2_write_zeroes_start_req(void *co, int64_t sector, int nb_sectors) "co %p 
sector %" PRIx64 " nb_sectors %d"
-qcow2_write_zeroes(void *co, int64_t sector, int nb_sectors) "co %p sector %" 
PRIx64 " nb_sectors %d"
+qcow2_pwrite_zeroes_start_req(void *co, int64_t offset, int count) "co %p 
offset %" PRIx64 " count %d"
+qcow2_pwrite_zeroes(void *co, int64_t offset, int count) "co %p offset %" 
PRIx64 " count %d"
 
 # block/qcow2-cluster.c
 qcow2_alloc_clusters_offset(void *co, uint64_t offset, int num) "co %p offset 
%" PRIx64 " num %d"
-- 
1.8.3.1

[Qemu-block] [PULL 05/31] qcow2: Catch more unaligned write_zero into zero cluster

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

is_zero_cluster() and is_zero_cluster_top_locked() are used only
by qcow2_co_write_zeroes().  The former is too broad (we don't
care if the sectors we are about to overwrite are non-zero, only
that all other sectors in the cluster are zero), so it needs to
be called up to twice but with smaller limits - rename it along
with adding the neeeded parameter.  The latter can be inlined for
more compact code.

The testsuite change shows that we now have a sparser top file
when an unaligned write_zeroes overwrites the only portion of
the backing file with data.

Based on a patch proposal by Denis V. Lunev.

CC: Denis V. Lunev 
Signed-off-by: Eric Blake 
Reviewed-by: Denis V. Lunev 
Signed-off-by: Kevin Wolf 
---
 block/qcow2.c  | 47 +++---
 tests/qemu-iotests/154.out |  8 
 2 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 105fd5e..ecac399 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2406,26 +2406,19 @@ finish:
 }
 
 
-static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
+static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
+uint32_t count)
 {
-BDRVQcow2State *s = bs->opaque;
 int nr;
 BlockDriverState *file;
-int64_t res = bdrv_get_block_status_above(bs, NULL, start,
-  s->cluster_sectors, , );
-return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == s->cluster_sectors;
-}
-
-static bool is_zero_cluster_top_locked(BlockDriverState *bs, int64_t start)
-{
-BDRVQcow2State *s = bs->opaque;
-int nr = s->cluster_sectors;
-uint64_t off;
-int ret;
+int64_t res;
 
-ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS, , );
-assert(nr == s->cluster_sectors);
-return ret == QCOW2_CLUSTER_UNALLOCATED || ret == QCOW2_CLUSTER_ZERO;
+if (!count) {
+return true;
+}
+res = bdrv_get_block_status_above(bs, NULL, start, count,
+  , );
+return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
 }
 
 static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
@@ -2434,27 +2427,33 @@ static coroutine_fn int 
qcow2_co_write_zeroes(BlockDriverState *bs,
 int ret;
 BDRVQcow2State *s = bs->opaque;
 
-int head = sector_num % s->cluster_sectors;
-int tail = (sector_num + nb_sectors) % s->cluster_sectors;
+uint32_t head = sector_num % s->cluster_sectors;
+uint32_t tail = (sector_num + nb_sectors) % s->cluster_sectors;
 
 trace_qcow2_write_zeroes_start_req(qemu_coroutine_self(), sector_num,
nb_sectors);
 
-if (head != 0 || tail != 0) {
+if (head || tail) {
 int64_t cl_start = sector_num - head;
+uint64_t off;
+int nr;
 
 assert(cl_start + s->cluster_sectors >= sector_num + nb_sectors);
 
-sector_num = cl_start;
-nb_sectors = s->cluster_sectors;
-
-if (!is_zero_cluster(bs, sector_num)) {
+/* check whether remainder of cluster already reads as zero */
+if (!(is_zero_sectors(bs, cl_start, head) &&
+  is_zero_sectors(bs, sector_num + nb_sectors,
+  -tail & (s->cluster_sectors - 1 {
 return -ENOTSUP;
 }
 
 qemu_co_mutex_lock(>lock);
 /* We can have new write after previous check */
-if (!is_zero_cluster_top_locked(bs, sector_num)) {
+sector_num = cl_start;
+nb_sectors = nr = s->cluster_sectors;
+ret = qcow2_get_cluster_offset(bs, cl_start << BDRV_SECTOR_BITS,
+   , );
+if (ret != QCOW2_CLUSTER_UNALLOCATED && ret != QCOW2_CLUSTER_ZERO) {
 qemu_co_mutex_unlock(>lock);
 return -ENOTSUP;
 }
diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
index 531fd8c..da9eabd 100644
--- a/tests/qemu-iotests/154.out
+++ b/tests/qemu-iotests/154.out
@@ -102,13 +102,13 @@ wrote 2048/2048 bytes at offset 29696
 read 4096/4096 bytes at offset 28672
 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 [{ "start": 0, "length": 4096, "depth": 1, "zero": true, "data": false},
-{ "start": 4096, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 20480},
+{ "start": 4096, "length": 4096, "depth": 0, "zero": true, "data": false},
 { "start": 8192, "length": 4096, "depth": 1, "zero": true, "data": false},
-{ "start": 12288, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 24576},
+{ "start": 12288, "length": 4096, "depth": 0, "zero": true, "data": false},
 { "start": 16384, "length": 4096, "depth": 1, "zero": true, "data": false},
-{ "start": 20480, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 28672},
+{ "start": 20480, "length":

[Qemu-block] [PULL 03/31] qcow2: add tracepoints for qcow2_co_write_zeroes

2016-06-08 Thread Kevin Wolf

From: "Denis V. Lunev" 

This patch follows guidelines of all other tracepoints in qcow2, like ones
in qcow2_co_writev. I think that they should dump values in the same
quantities or be changed all together.

Signed-off-by: Denis V. Lunev 
CC: Eric Blake 
CC: Kevin Wolf 
Message-Id: <1463476543-3087-4-git-send-email-...@openvz.org>
[eblake: typo fix in commit message]
Signed-off-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/qcow2.c | 5 +
 trace-events  | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 2f73201..105fd5e 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2437,6 +2437,9 @@ static coroutine_fn int 
qcow2_co_write_zeroes(BlockDriverState *bs,
 int head = sector_num % s->cluster_sectors;
 int tail = (sector_num + nb_sectors) % s->cluster_sectors;
 
+trace_qcow2_write_zeroes_start_req(qemu_coroutine_self(), sector_num,
+   nb_sectors);
+
 if (head != 0 || tail != 0) {
 int64_t cl_start = sector_num - head;
 
@@ -2459,6 +2462,8 @@ static coroutine_fn int 
qcow2_co_write_zeroes(BlockDriverState *bs,
 qemu_co_mutex_lock(>lock);
 }
 
+trace_qcow2_write_zeroes(qemu_coroutine_self(), sector_num, nb_sectors);
+
 /* Whatever is left can use real zero clusters */
 ret = qcow2_zero_clusters(bs, sector_num << BDRV_SECTOR_BITS, nb_sectors);
 qemu_co_mutex_unlock(>lock);
diff --git a/trace-events b/trace-events
index c55d708..5f720d0 100644
--- a/trace-events
+++ b/trace-events
@@ -611,6 +611,8 @@ qcow2_writev_done_req(void *co, int ret) "co %p ret %d"
 qcow2_writev_start_part(void *co) "co %p"
 qcow2_writev_done_part(void *co, int cur_nr_sectors) "co %p cur_nr_sectors %d"
 qcow2_writev_data(void *co, uint64_t offset) "co %p offset %" PRIx64
+qcow2_write_zeroes_start_req(void *co, int64_t sector, int nb_sectors) "co %p 
sector %" PRIx64 " nb_sectors %d"
+qcow2_write_zeroes(void *co, int64_t sector, int nb_sectors) "co %p sector %" 
PRIx64 " nb_sectors %d"
 
 # block/qcow2-cluster.c
 qcow2_alloc_clusters_offset(void *co, uint64_t offset, int num) "co %p offset 
%" PRIx64 " num %d"
-- 
1.8.3.1

[Qemu-block] [PULL 04/31] qemu-iotests: Test one more spot for optimizing write_zeroes

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Add another test to 154, showing that we currently allocate a
data cluster in the top layer if any sector of the backing file
was allocated.  The next patch will optimize this case.

Signed-off-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/154 | 40 
 tests/qemu-iotests/154.out | 37 +
 2 files changed, 77 insertions(+)

diff --git a/tests/qemu-iotests/154 b/tests/qemu-iotests/154
index 23f1b3a..5905c55 100755
--- a/tests/qemu-iotests/154
+++ b/tests/qemu-iotests/154
@@ -115,6 +115,46 @@ $QEMU_IO -c "read -P 0 40k 3k" "$TEST_IMG" | 
_filter_qemu_io
 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
 
 echo
+echo == write_zeroes covers non-zero data ==
+
+CLUSTER_SIZE=512 TEST_IMG="$TEST_IMG.base" _make_test_img $size
+_make_test_img -b "$TEST_IMG.base"
+
+# non-zero data at front of request
+# Backing file: -- XX -- --
+# Active layer: -- 00 00 --
+
+$QEMU_IO -c "write -P 0x11 5k 1k" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c "write -z 5k 2k" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 4k 4k" "$TEST_IMG" | _filter_qemu_io
+
+# non-zero data at end of request
+# Backing file: -- -- XX --
+# Active layer: -- 00 00 --
+
+$QEMU_IO -c "write -P 0x11 14k 1k" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c "write -z 13k 2k" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 12k 4k" "$TEST_IMG" | _filter_qemu_io
+
+# non-zero data matches size of request
+# Backing file: -- XX XX --
+# Active layer: -- 00 00 --
+
+$QEMU_IO -c "write -P 0x11 21k 2k" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c "write -z 21k 2k" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 20k 4k" "$TEST_IMG" | _filter_qemu_io
+
+# non-zero data smaller than request
+# Backing file: -- -X X- --
+# Active layer: -- 00 00 --
+
+$QEMU_IO -c "write -P 0x11 30208 1k" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c "write -z 29k 2k" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 28k 4k" "$TEST_IMG" | _filter_qemu_io
+
+$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+
+echo
 echo == spanning two clusters, non-zero before request ==
 
 CLUSTER_SIZE=512 TEST_IMG="$TEST_IMG.base" _make_test_img $size
diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
index b9d27c5..531fd8c 100644
--- a/tests/qemu-iotests/154.out
+++ b/tests/qemu-iotests/154.out
@@ -74,6 +74,43 @@ read 3072/3072 bytes at offset 40960
 { "start": 40960, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 24576},
 { "start": 45056, "length": 134172672, "depth": 1, "zero": true, "data": 
false}]
 
+== write_zeroes covers non-zero data ==
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134217728
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 
backing_file=TEST_DIR/t.IMGFMT.base
+wrote 1024/1024 bytes at offset 5120
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 5120
+2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 4096
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 1024/1024 bytes at offset 14336
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 13312
+2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 12288
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 21504
+2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 21504
+2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 20480
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 1024/1024 bytes at offset 30208
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 29696
+2 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 28672
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+[{ "start": 0, "length": 4096, "depth": 1, "zero": true, "data": false},
+{ "start": 4096, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 20480},
+{ "start": 8192, "length": 4096, "depth": 1, "zero": true, "data": false},
+{ "start": 12288, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 24576},
+{ "start": 16384, "length": 4096, "depth": 1, "zero": true, "data": false},
+{ "start": 20480, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 28672},
+{ "start": 24576, "length": 4096, "depth": 1, "zero": true, "data": false},
+{ "start": 28672, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 32768},
+{ "start": 32768, "length": 134184960, "depth": 1, "zero": true, "data": 
false}]
+
 == spanning two clusters, non-zero before request ==
 Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134217728
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT

[Qemu-block] [PULL 16/31] raw_bsd: Convert to bdrv_co_pwrite_zeroes()

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Another step on our continuing quest to switch to byte-based
interfaces.

Signed-off-by: Eric Blake 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/raw_bsd.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/raw_bsd.c b/block/raw_bsd.c
index d9adf90..b1d5237 100644
--- a/block/raw_bsd.c
+++ b/block/raw_bsd.c
@@ -127,12 +127,11 @@ static int64_t coroutine_fn 
raw_co_get_block_status(BlockDriverState *bs,
(sector_num << BDRV_SECTOR_BITS);
 }
 
-static int coroutine_fn raw_co_write_zeroes(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors,
-BdrvRequestFlags flags)
+static int coroutine_fn raw_co_pwrite_zeroes(BlockDriverState *bs,
+ int64_t offset, int count,
+ BdrvRequestFlags flags)
 {
-return bdrv_co_pwrite_zeroes(bs->file->bs, sector_num << BDRV_SECTOR_BITS,
- nb_sectors << BDRV_SECTOR_BITS, flags);
+return bdrv_co_pwrite_zeroes(bs->file->bs, offset, count, flags);
 }
 
 static int coroutine_fn raw_co_discard(BlockDriverState *bs,
@@ -253,7 +252,7 @@ BlockDriver bdrv_raw = {
 .bdrv_create  = _create,
 .bdrv_co_readv= _co_readv,
 .bdrv_co_writev_flags = _co_writev_flags,
-.bdrv_co_write_zeroes = _co_write_zeroes,
+.bdrv_co_pwrite_zeroes = _co_pwrite_zeroes,
 .bdrv_co_discard  = _co_discard,
 .bdrv_co_get_block_status = _co_get_block_status,
 .bdrv_truncate= _truncate,
-- 
1.8.3.1

[Qemu-block] [PULL 08/31] block: Add .bdrv_co_pwrite_zeroes()

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Update bdrv_co_do_write_zeroes() to be byte-based, and select
between the new byte-based bdrv_co_pwrite_zeroes() or the old
bdrv_co_write_zeroes().  The next patches will convert drivers,
then remove the old interface.

Signed-off-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/io.c| 78 ++-
 include/block/block_int.h |  4 ++-
 2 files changed, 46 insertions(+), 36 deletions(-)

diff --git a/block/io.c b/block/io.c
index 1eda1b2..91a2e23 100644
--- a/block/io.c
+++ b/block/io.c
@@ -42,8 +42,8 @@ static BlockAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
  void *opaque,
  bool is_write);
 static void coroutine_fn bdrv_co_do_rw(void *opaque);
-static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, BdrvRequestFlags flags);
+static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
+int64_t offset, int count, BdrvRequestFlags flags);
 
 static void bdrv_parent_drained_begin(BlockDriverState *bs)
 {
@@ -893,10 +893,12 @@ static int coroutine_fn 
bdrv_co_do_copy_on_readv(BlockDriverState *bs,
 goto err;
 }
 
-if (drv->bdrv_co_write_zeroes &&
+if ((drv->bdrv_co_write_zeroes || drv->bdrv_co_pwrite_zeroes) &&
 buffer_is_zero(bounce_buffer, iov.iov_len)) {
-ret = bdrv_co_do_write_zeroes(bs, cluster_sector_num,
-  cluster_nb_sectors, 0);
+ret = bdrv_co_do_pwrite_zeroes(bs,
+   cluster_sector_num * BDRV_SECTOR_SIZE,
+   cluster_nb_sectors * BDRV_SECTOR_SIZE,
+   0);
 } else {
 /* This does not change the data on the disk, it is not necessary
  * to flush even in cache=writethrough mode.
@@ -1110,8 +1112,8 @@ int coroutine_fn bdrv_co_readv(BlockDriverState *bs, 
int64_t sector_num,
 
 #define MAX_WRITE_ZEROES_BOUNCE_BUFFER 32768
 
-static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
+static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
+int64_t offset, int count, BdrvRequestFlags flags)
 {
 BlockDriver *drv = bs->drv;
 QEMUIOVector qiov;
@@ -1122,20 +1124,16 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
 int tail = 0;
 
 int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
-int write_zeroes_sector_align =
-bs->bl.pwrite_zeroes_alignment >> BDRV_SECTOR_BITS;
+int alignment = MAX(bs->bl.pwrite_zeroes_alignment ?: 1,
+bs->request_alignment);
 
-max_write_zeroes >>= BDRV_SECTOR_BITS;
-if (write_zeroes_sector_align) {
-assert(is_power_of_2(bs->bl.pwrite_zeroes_alignment));
-head = sector_num & (write_zeroes_sector_align - 1);
-tail = (sector_num + nb_sectors) & (write_zeroes_sector_align - 1);
-max_write_zeroes &= ~(write_zeroes_sector_align - 1);
-}
+assert(is_power_of_2(alignment));
+head = offset & (alignment - 1);
+tail = (offset + count) & (alignment - 1);
+max_write_zeroes &= ~(alignment - 1);
 
-assert(nb_sectors <= BDRV_REQUEST_MAX_SECTORS);
-while (nb_sectors > 0 && !ret) {
-int num = nb_sectors;
+while (count > 0 && !ret) {
+int num = count;
 
 /* Align request.  Block drivers can expect the "bulk" of the request
  * to be aligned, and that unaligned requests do not cross cluster
@@ -1143,9 +1141,9 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
  */
 if (head) {
 /* Make a small request up to the first aligned sector.  */
-num = MIN(nb_sectors, write_zeroes_sector_align - head);
+num = MIN(count, alignment - head);
 head = 0;
-} else if (tail && num > write_zeroes_sector_align) {
+} else if (tail && num > alignment) {
 /* Shorten the request to the last aligned sector.  */
 num -= tail;
 }
@@ -1157,8 +1155,18 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
 
 ret = -ENOTSUP;
 /* First try the efficient write zeroes operation */
-if (drv->bdrv_co_write_zeroes) {
-ret = drv->bdrv_co_write_zeroes(bs, sector_num, num,
+if (drv->bdrv_co_pwrite_zeroes) {
+ret = drv->bdrv_co_pwrite_zeroes(bs, offset, num,
+ flags & bs->supported_zero_flags);
+if (ret != -ENOTSUP && (flags & BDRV_REQ_FUA) &&
+!(bs->supported_zero_flags & BDRV_REQ_FUA)) {
+need_flush = true;
+}
+} else if

[Qemu-block] [PULL 17/31] vmdk: Convert to bdrv_co_pwrite_zeroes()

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Another step on our continuing quest to switch to byte-based
interfaces.

Signed-off-by: Eric Blake 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/vmdk.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index 5a01e16..ee09423 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -1703,15 +1703,13 @@ static int vmdk_write_compressed(BlockDriverState *bs,
 }
 }
 
-static int coroutine_fn vmdk_co_write_zeroes(BlockDriverState *bs,
- int64_t sector_num,
- int nb_sectors,
- BdrvRequestFlags flags)
+static int coroutine_fn vmdk_co_pwrite_zeroes(BlockDriverState *bs,
+  int64_t offset,
+  int bytes,
+  BdrvRequestFlags flags)
 {
 int ret;
 BDRVVmdkState *s = bs->opaque;
-uint64_t offset = sector_num * BDRV_SECTOR_SIZE;
-uint64_t bytes = nb_sectors * BDRV_SECTOR_SIZE;
 
 qemu_co_mutex_lock(>lock);
 /* write zeroes could fail if sectors not aligned to cluster, test it with
@@ -2402,7 +2400,7 @@ static BlockDriver bdrv_vmdk = {
 .bdrv_co_preadv   = vmdk_co_preadv,
 .bdrv_co_pwritev  = vmdk_co_pwritev,
 .bdrv_write_compressed= vmdk_write_compressed,
-.bdrv_co_write_zeroes = vmdk_co_write_zeroes,
+.bdrv_co_pwrite_zeroes= vmdk_co_pwrite_zeroes,
 .bdrv_close   = vmdk_close,
 .bdrv_create  = vmdk_create,
 .bdrv_co_flush_to_disk= vmdk_co_flush,
-- 
1.8.3.1

[Qemu-block] [PULL 09/31] block: Switch bdrv_write_zeroes() to byte interface

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Rename to bdrv_pwrite_zeroes() to let the compiler ensure we
cater to the updated semantics.  Do the same for bdrv_co_write_zeroes().

Signed-off-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/blkreplay.c  |  4 +++-
 block/io.c | 34 +-
 block/parallels.c  |  4 +++-
 block/qcow2-cluster.c  |  3 +--
 block/qcow2.c  |  9 -
 block/raw_bsd.c|  3 ++-
 include/block/block.h  | 10 +-
 migration/block.c  |  5 +++--
 tests/qemu-iotests/034 |  2 +-
 tests/qemu-iotests/154 |  2 +-
 trace-events   |  2 +-
 11 files changed, 45 insertions(+), 33 deletions(-)

diff --git a/block/blkreplay.c b/block/blkreplay.c
index 42f1813..1a721ad 100755
--- a/block/blkreplay.c
+++ b/block/blkreplay.c
@@ -107,7 +107,9 @@ static int coroutine_fn 
blkreplay_co_write_zeroes(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
 {
 uint64_t reqid = request_id++;
-int ret = bdrv_co_write_zeroes(bs->file->bs, sector_num, nb_sectors, 
flags);
+int ret = bdrv_co_pwrite_zeroes(bs->file->bs,
+sector_num << BDRV_SECTOR_BITS,
+nb_sectors << BDRV_SECTOR_BITS, flags);
 block_request_create(reqid, bs, qemu_coroutine_self());
 qemu_coroutine_yield();
 
diff --git a/block/io.c b/block/io.c
index 91a2e23..a426094 100644
--- a/block/io.c
+++ b/block/io.c
@@ -620,18 +620,25 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
 return bdrv_rw_co(bs, sector_num, (uint8_t *)buf, nb_sectors, true, 0);
 }
 
-int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num,
-  int nb_sectors, BdrvRequestFlags flags)
+int bdrv_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+   int count, BdrvRequestFlags flags)
 {
-return bdrv_rw_co(bs, sector_num, NULL, nb_sectors, true,
-  BDRV_REQ_ZERO_WRITE | flags);
+QEMUIOVector qiov;
+struct iovec iov = {
+.iov_base = NULL,
+.iov_len = count,
+};
+
+qemu_iovec_init_external(, , 1);
+return bdrv_prwv_co(bs, offset, , true,
+BDRV_REQ_ZERO_WRITE | flags);
 }
 
 /*
- * Completely zero out a block device with the help of bdrv_write_zeroes.
+ * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
  * The operation is sped up by checking the block status and only writing
  * zeroes to the device if they currently do not return zeroes. Optional
- * flags are passed through to bdrv_write_zeroes (e.g. BDRV_REQ_MAY_UNMAP,
+ * flags are passed through to bdrv_pwrite_zeroes (e.g. BDRV_REQ_MAY_UNMAP,
  * BDRV_REQ_FUA).
  *
  * Returns < 0 on error, 0 on success. For error codes see bdrv_write().
@@ -662,7 +669,8 @@ int bdrv_make_zero(BlockDriverState *bs, BdrvRequestFlags 
flags)
 sector_num += n;
 continue;
 }
-ret = bdrv_write_zeroes(bs, sector_num, n, flags);
+ret = bdrv_pwrite_zeroes(bs, sector_num << BDRV_SECTOR_BITS,
+ n << BDRV_SECTOR_BITS, flags);
 if (ret < 0) {
 error_report("error writing zeroes at sector %" PRId64 ": %s",
  sector_num, strerror(-ret));
@@ -1526,18 +1534,18 @@ int coroutine_fn bdrv_co_writev(BlockDriverState *bs, 
int64_t sector_num,
 return bdrv_co_do_writev(bs, sector_num, nb_sectors, qiov, 0);
 }
 
-int coroutine_fn bdrv_co_write_zeroes(BlockDriverState *bs,
-  int64_t sector_num, int nb_sectors,
-  BdrvRequestFlags flags)
+int coroutine_fn bdrv_co_pwrite_zeroes(BlockDriverState *bs,
+   int64_t offset, int count,
+   BdrvRequestFlags flags)
 {
-trace_bdrv_co_write_zeroes(bs, sector_num, nb_sectors, flags);
+trace_bdrv_co_pwrite_zeroes(bs, offset, count, flags);
 
 if (!(bs->open_flags & BDRV_O_UNMAP)) {
 flags &= ~BDRV_REQ_MAY_UNMAP;
 }
 
-return bdrv_co_do_writev(bs, sector_num, nb_sectors, NULL,
- BDRV_REQ_ZERO_WRITE | flags);
+return bdrv_co_pwritev(bs, offset, count, NULL,
+   BDRV_REQ_ZERO_WRITE | flags);
 }
 
 typedef struct BdrvCoGetBlockStatusData {
diff --git a/block/parallels.c b/block/parallels.c
index b9b5c6d..d6a1a61 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -210,7 +210,9 @@ static int64_t allocate_clusters(BlockDriverState *bs, 
int64_t sector_num,
 int ret;
 space += s->prealloc_size;
 if (s->prealloc_mode == PRL_PREALLOC_MODE_FALLOCATE) {
-ret = bdrv_write_zeroes(bs->file->bs, s->data_end, space, 0);
+ret = bdrv_pwrite_zeroes(bs->file->bs,
+ s->data_end << BDRV_SECTOR_BITS,
+

[Qemu-block] [PULL 10/31] iscsi: Convert to bdrv_co_pwrite_zeroes()

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Another step on our continuing quest to switch to byte-based
interfaces.

As this is the first byte-based iscsi interface, convert
is_request_lun_aligned() into two versions, one for sectors
and one for bytes.  Also, change from outright -EINVAL failure
on an unaligned request, to instead failing with -ENOTSUP to
trigger a read-modify-write fallback, particularly since the
block layer should be honoring bs->request_alignment to avoid
-EINVAL on read/write requests.

Signed-off-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/iscsi.c | 58 ++
 1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 52ea9d7..7e78ade 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -401,18 +401,26 @@ static int64_t sector_qemu2lun(int64_t sector, IscsiLun 
*iscsilun)
 return sector * BDRV_SECTOR_SIZE / iscsilun->block_size;
 }
 
-static bool is_request_lun_aligned(int64_t sector_num, int nb_sectors,
-  IscsiLun *iscsilun)
+static bool is_byte_request_lun_aligned(int64_t offset, int count,
+IscsiLun *iscsilun)
 {
-if ((sector_num * BDRV_SECTOR_SIZE) % iscsilun->block_size ||
-(nb_sectors * BDRV_SECTOR_SIZE) % iscsilun->block_size) {
-error_report("iSCSI misaligned request: "
- "iscsilun->block_size %u, sector_num %" PRIi64
- ", nb_sectors %d",
- iscsilun->block_size, sector_num, nb_sectors);
-return 0;
-}
-return 1;
+if (offset % iscsilun->block_size || count % iscsilun->block_size) {
+error_report("iSCSI misaligned request: "
+ "iscsilun->block_size %u, offset %" PRIi64
+ ", count %d",
+ iscsilun->block_size, offset, count);
+return false;
+}
+return true;
+}
+
+static bool is_sector_request_lun_aligned(int64_t sector_num, int nb_sectors,
+  IscsiLun *iscsilun)
+{
+assert(nb_sectors < BDRV_REQUEST_MAX_SECTORS);
+return is_byte_request_lun_aligned(sector_num << BDRV_SECTOR_BITS,
+   nb_sectors << BDRV_SECTOR_BITS,
+   iscsilun);
 }
 
 static unsigned long *iscsi_allocationmap_init(IscsiLun *iscsilun)
@@ -461,7 +469,7 @@ iscsi_co_writev_flags(BlockDriverState *bs, int64_t 
sector_num, int nb_sectors,
 if (fua) {
 assert(iscsilun->dpofua);
 }
-if (!is_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
+if (!is_sector_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
 return -EINVAL;
 }
 
@@ -541,7 +549,7 @@ static int64_t coroutine_fn 
iscsi_co_get_block_status(BlockDriverState *bs,
 
 iscsi_co_init_iscsitask(iscsilun, );
 
-if (!is_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
+if (!is_sector_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
 ret = -EINVAL;
 goto out;
 }
@@ -638,7 +646,7 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs,
 uint64_t lba;
 uint32_t num_sectors;
 
-if (!is_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
+if (!is_sector_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
 return -EINVAL;
 }
 
@@ -926,7 +934,7 @@ coroutine_fn iscsi_co_discard(BlockDriverState *bs, int64_t 
sector_num,
 struct IscsiTask iTask;
 struct unmap_list list;
 
-if (!is_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
+if (!is_sector_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
 return -EINVAL;
 }
 
@@ -977,8 +985,8 @@ retry:
 }
 
 static int
-coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
-   int nb_sectors, BdrvRequestFlags flags)
+coroutine_fn iscsi_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+int count, BdrvRequestFlags flags)
 {
 IscsiLun *iscsilun = bs->opaque;
 struct IscsiTask iTask;
@@ -986,8 +994,8 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, 
int64_t sector_num,
 uint32_t nb_blocks;
 bool use_16_for_ws = iscsilun->use_16_for_rw;
 
-if (!is_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
-return -EINVAL;
+if (!is_byte_request_lun_aligned(offset, count, iscsilun)) {
+return -ENOTSUP;
 }
 
 if (flags & BDRV_REQ_MAY_UNMAP) {
@@ -1008,8 +1016,8 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, 
int64_t sector_num,
 return -ENOTSUP;
 }
 
-lba = sector_qemu2lun(sector_num, iscsilun);
-nb_blocks = sector_qemu2lun(nb_sectors, iscsilun);
+lba = offset / iscsilun->block_size;
+nb_blocks = count / iscsilun->block_size;
 
 if

[Qemu-block] [PULL 01/31] block: split write_zeroes always

2016-06-08 Thread Kevin Wolf

From: "Denis V. Lunev" 

We should split requests even if they are less than write_zeroes_alignment.
For example we can have the following request:
  offset 62k
  size   4k
  write_zeroes_alignment 64k
The original code sent 1 request covering 2 qcow2 clusters, and resulted
in both clusters being allocated. But by splitting the request, we can
cater to the case where one of the two clusters can be zeroed as a
whole, for only 1 cluster allocated after the operation.

Signed-off-by: Denis V. Lunev 
CC: Eric Blake 
CC: Kevin Wolf 
Message-Id: <1463476543-3087-2-git-send-email-...@openvz.org>

[eblake: Avoid exceeding nb_sectors, hoist alignment checks out of
loop, and update testsuite to show that patch works]

Signed-off-by: Eric Blake 
Reviewed-by: Fam Zheng 
Signed-off-by: Kevin Wolf 
---
 block/io.c | 30 +-
 tests/qemu-iotests/154.out | 18 --
 2 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/block/io.c b/block/io.c
index 6070e77..c73be55 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1118,28 +1118,32 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
 struct iovec iov = {0};
 int ret = 0;
 bool need_flush = false;
+int head = 0;
+int tail = 0;
 
 int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_write_zeroes,
 BDRV_REQUEST_MAX_SECTORS);
+if (bs->bl.write_zeroes_alignment) {
+assert(is_power_of_2(bs->bl.write_zeroes_alignment));
+head = sector_num & (bs->bl.write_zeroes_alignment - 1);
+tail = (sector_num + nb_sectors) & (bs->bl.write_zeroes_alignment - 1);
+max_write_zeroes &= ~(bs->bl.write_zeroes_alignment - 1);
+}
 
 while (nb_sectors > 0 && !ret) {
 int num = nb_sectors;
 
 /* Align request.  Block drivers can expect the "bulk" of the request
- * to be aligned.
+ * to be aligned, and that unaligned requests do not cross cluster
+ * boundaries.
  */
-if (bs->bl.write_zeroes_alignment
-&& num > bs->bl.write_zeroes_alignment) {
-if (sector_num % bs->bl.write_zeroes_alignment != 0) {
-/* Make a small request up to the first aligned sector.  */
-num = bs->bl.write_zeroes_alignment;
-num -= sector_num % bs->bl.write_zeroes_alignment;
-} else if ((sector_num + num) % bs->bl.write_zeroes_alignment != 
0) {
-/* Shorten the request to the last aligned sector.  num cannot
- * underflow because num > bs->bl.write_zeroes_alignment.
- */
-num -= (sector_num + num) % bs->bl.write_zeroes_alignment;
-}
+if (head) {
+/* Make a small request up to the first aligned sector.  */
+num = MIN(nb_sectors, bs->bl.write_zeroes_alignment - head);
+head = 0;
+} else if (tail && num > bs->bl.write_zeroes_alignment) {
+/* Shorten the request to the last aligned sector.  */
+num -= tail;
 }
 
 /* limit request size */
diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
index 8946b73..b9d27c5 100644
--- a/tests/qemu-iotests/154.out
+++ b/tests/qemu-iotests/154.out
@@ -106,11 +106,14 @@ read 1024/1024 bytes at offset 67584
 read 5120/5120 bytes at offset 68608
 5 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 [{ "start": 0, "length": 32768, "depth": 1, "zero": true, "data": false},
-{ "start": 32768, "length": 8192, "depth": 0, "zero": false, "data": true, 
"offset": 20480},
+{ "start": 32768, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 20480},
+{ "start": 36864, "length": 4096, "depth": 0, "zero": true, "data": false},
 { "start": 40960, "length": 8192, "depth": 1, "zero": true, "data": false},
-{ "start": 49152, "length": 8192, "depth": 0, "zero": false, "data": true, 
"offset": 28672},
+{ "start": 49152, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 24576},
+{ "start": 53248, "length": 4096, "depth": 0, "zero": true, "data": false},
 { "start": 57344, "length": 8192, "depth": 1, "zero": true, "data": false},
-{ "start": 65536, "length": 8192, "depth": 0, "zero": false, "data": true, 
"offset": 36864},
+{ "start": 65536, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 28672},
+{ "start": 69632, "length": 4096, "depth": 0, "zero": true, "data": false},
 { "start": 73728, "length": 134144000, "depth": 1, "zero": true, "data": 
false}]
 
 == spanning two clusters, non-zero after request ==
@@ -145,11 +148,14 @@ read 7168/7168 bytes at offset 65536
 read 1024/1024 bytes at offset 72704
 1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 [{ "start": 0, "length": 32768, "depth": 1, "zero": true, "data": false},
-{

[Qemu-block] [PULL 07/31] block: Track write zero limits in bytes

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

Another step towards removing sector-based interfaces: convert
the maximum write and minimum alignment values from sectors to
bytes.  Rename the variables to let the compiler check that all
users are converted to the new semantics.

The maximum remains an int as long as BDRV_REQUEST_MAX_SECTORS
is constrained by INT_MAX (this means that we can't even
support a 2G write_zeroes, but just under it) - changing
operation lengths to unsigned or to 64-bits is a much bigger
audit, and debatable if we even want to do it (since at the
core, a 32-bit platform will still have ssize_t as its
underlying limit on write()).

Meanwhile, alignment is changed to 'uint32_t', since it makes no
sense to have an alignment larger than the maximum write, and
less painful to use an unsigned type with well-defined behavior
in bit operations than to have to worry about what happens if
a driver mistakenly supplies a negative alignment.

Add an assert that no one was trying to use sectors to get a
write zeroes larger than 2G, and therefore that a later conversion
to bytes won't be impacted by keeping the limit at 32 bits.

Signed-off-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/io.c| 22 +-
 block/iscsi.c | 13 ++---
 block/qcow2.c |  2 +-
 block/qed.c   |  2 +-
 block/vmdk.c  |  6 +++---
 include/block/block_int.h | 10 ++
 6 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/block/io.c b/block/io.c
index c73be55..1eda1b2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1121,15 +1121,19 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
 int head = 0;
 int tail = 0;
 
-int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_write_zeroes,
-BDRV_REQUEST_MAX_SECTORS);
-if (bs->bl.write_zeroes_alignment) {
-assert(is_power_of_2(bs->bl.write_zeroes_alignment));
-head = sector_num & (bs->bl.write_zeroes_alignment - 1);
-tail = (sector_num + nb_sectors) & (bs->bl.write_zeroes_alignment - 1);
-max_write_zeroes &= ~(bs->bl.write_zeroes_alignment - 1);
+int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
+int write_zeroes_sector_align =
+bs->bl.pwrite_zeroes_alignment >> BDRV_SECTOR_BITS;
+
+max_write_zeroes >>= BDRV_SECTOR_BITS;
+if (write_zeroes_sector_align) {
+assert(is_power_of_2(bs->bl.pwrite_zeroes_alignment));
+head = sector_num & (write_zeroes_sector_align - 1);
+tail = (sector_num + nb_sectors) & (write_zeroes_sector_align - 1);
+max_write_zeroes &= ~(write_zeroes_sector_align - 1);
 }
 
+assert(nb_sectors <= BDRV_REQUEST_MAX_SECTORS);
 while (nb_sectors > 0 && !ret) {
 int num = nb_sectors;
 
@@ -1139,9 +1143,9 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
  */
 if (head) {
 /* Make a small request up to the first aligned sector.  */
-num = MIN(nb_sectors, bs->bl.write_zeroes_alignment - head);
+num = MIN(nb_sectors, write_zeroes_sector_align - head);
 head = 0;
-} else if (tail && num > bs->bl.write_zeroes_alignment) {
+} else if (tail && num > write_zeroes_sector_align) {
 /* Shorten the request to the last aligned sector.  */
 num -= tail;
 }
diff --git a/block/iscsi.c b/block/iscsi.c
index 94f9974..52ea9d7 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1715,16 +1715,15 @@ static void iscsi_refresh_limits(BlockDriverState *bs, 
Error **errp)
 bs->bl.discard_alignment = iscsilun->block_size >> BDRV_SECTOR_BITS;
 }
 
-if (iscsilun->bl.max_ws_len < 0x) {
-bs->bl.max_write_zeroes =
-sector_limits_lun2qemu(iscsilun->bl.max_ws_len, iscsilun);
+if (iscsilun->bl.max_ws_len < 0x / iscsilun->block_size) {
+bs->bl.max_pwrite_zeroes =
+iscsilun->bl.max_ws_len * iscsilun->block_size;
 }
 if (iscsilun->lbp.lbpws) {
-bs->bl.write_zeroes_alignment =
-sector_limits_lun2qemu(iscsilun->bl.opt_unmap_gran, iscsilun);
+bs->bl.pwrite_zeroes_alignment =
+iscsilun->bl.opt_unmap_gran * iscsilun->block_size;
 } else {
-bs->bl.write_zeroes_alignment =
-iscsilun->block_size >> BDRV_SECTOR_BITS;
+bs->bl.pwrite_zeroes_alignment = iscsilun->block_size;
 }
 bs->bl.opt_transfer_length =
 sector_limits_lun2qemu(iscsilun->bl.opt_xfer_len, iscsilun);
diff --git a/block/qcow2.c b/block/qcow2.c
index ecac399..a6ea6cb 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1193,7 +1193,7 @@ static void qcow2_refresh_limits(BlockDriverState *bs, 
Error **errp)
 {
 BDRVQcow2State *s = bs->opaque;
 
-bs->bl.write_zeroes_alignment = s->cluster_sectors;
+

[Qemu-block] [PULL 06/31] iscsi: Use block size as minimum zero/discard alignment

2016-06-08 Thread Kevin Wolf

From: Eric Blake 

If hardware does not advertise a minimum zero/discard
alignment, we still want to guarantee that the block layer
will align requests to our blocks, rather than the arbitrary
512-byte BDRV sector size.

Signed-off-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/iscsi.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/block/iscsi.c b/block/iscsi.c
index e7d5f7b..94f9974 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1711,6 +1711,8 @@ static void iscsi_refresh_limits(BlockDriverState *bs, 
Error **errp)
 }
 bs->bl.discard_alignment =
 sector_limits_lun2qemu(iscsilun->bl.opt_unmap_gran, iscsilun);
+} else {
+bs->bl.discard_alignment = iscsilun->block_size >> BDRV_SECTOR_BITS;
 }
 
 if (iscsilun->bl.max_ws_len < 0x) {
@@ -1720,6 +1722,9 @@ static void iscsi_refresh_limits(BlockDriverState *bs, 
Error **errp)
 if (iscsilun->lbp.lbpws) {
 bs->bl.write_zeroes_alignment =
 sector_limits_lun2qemu(iscsilun->bl.opt_unmap_gran, iscsilun);
+} else {
+bs->bl.write_zeroes_alignment =
+iscsilun->block_size >> BDRV_SECTOR_BITS;
 }
 bs->bl.opt_transfer_length =
 sector_limits_lun2qemu(iscsilun->bl.opt_xfer_len, iscsilun);
-- 
1.8.3.1

[Qemu-block] [PULL 02/31] qcow2: simplify logic in qcow2_co_write_zeroes

2016-06-08 Thread Kevin Wolf

From: "Denis V. Lunev" 

Unaligned requests will occupy only one cluster. This is true since the
previous commit. Simplify the code taking this consideration into
account.

In other words, the caller is now buggy if it ever passes us an unaligned
request that crosses cluster boundaries (the only requests that can cross
boundaries will be aligned).

There are no other changes so far.

Signed-off-by: Denis V. Lunev 
Reviewed-by: Eric Blake 
CC: Eric Blake 
CC: Kevin Wolf 
Message-Id: <1463476543-3087-3-git-send-email-...@openvz.org>
Signed-off-by: Kevin Wolf 
---
 block/qcow2.c | 23 +--
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index c9306a7..2f73201 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2438,33 +2438,20 @@ static coroutine_fn int 
qcow2_co_write_zeroes(BlockDriverState *bs,
 int tail = (sector_num + nb_sectors) % s->cluster_sectors;
 
 if (head != 0 || tail != 0) {
-int64_t cl_end = -1;
+int64_t cl_start = sector_num - head;
 
-sector_num -= head;
-nb_sectors += head;
+assert(cl_start + s->cluster_sectors >= sector_num + nb_sectors);
 
-if (tail != 0) {
-nb_sectors += s->cluster_sectors - tail;
-}
+sector_num = cl_start;
+nb_sectors = s->cluster_sectors;
 
 if (!is_zero_cluster(bs, sector_num)) {
 return -ENOTSUP;
 }
 
-if (nb_sectors > s->cluster_sectors) {
-/* Technically the request can cover 2 clusters, f.e. 4k write
-   at s->cluster_sectors - 2k offset. One of these cluster can
-   be zeroed, one unallocated */
-cl_end = sector_num + nb_sectors - s->cluster_sectors;
-if (!is_zero_cluster(bs, cl_end)) {
-return -ENOTSUP;
-}
-}
-
 qemu_co_mutex_lock(>lock);
 /* We can have new write after previous check */
-if (!is_zero_cluster_top_locked(bs, sector_num) ||
-(cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
+if (!is_zero_cluster_top_locked(bs, sector_num)) {
 qemu_co_mutex_unlock(>lock);
 return -ENOTSUP;
 }
-- 
1.8.3.1

Re: [Qemu-block] [PATCH] Report error when opening device with locked tray

2016-06-08 Thread Kevin Wolf

Am 07.06.2016 um 22:00 hat John Snow geschrieben:
> 
> 
> On 06/07/2016 06:28 AM, Kevin Wolf wrote:
> > Am 06.06.2016 um 21:40 hat Colin Lord geschrieben:
> >> This commit causes qmp_blockdev_change_medium to report an error if an
> >> attempt is made to open a device with a locked tray.
> > 
> > The old behaviour is that the command seemingly succeeds, but the medium
> > isn't actually changed. Correct?
> > 
> 
> Close. Old "change" command also fails, but with a confusing error.
> 
> > Should this be mentioned in the commit message? You just describe what
> > you change, but not why.
> > 
> 
> Old behavior:
> 
> - Change uses qmp_blockdev_open_tray, which "succeeds."
> - Change then tries to use qmp_x_blockdev_remove_medium, but receives
> potentially confusing error "Tray is locked."
> - Moments later, the tray is likely now open.

Ah, yes, makes sense. This and that we want to have a less confusing
error message is the part that is missing in the commit message.

Kevin

> New behavior:
> 
> - Change uses do_open_tray, which returns -EINPROGRESS.
> - Change can propagate this error upwards without attempting to remove
> the medium.
> - User gets "Device  is locked and force was not specified, wait
> for tray to open and try again" error.
> 
> Why: "The new error tries to inform the user that there is an action
> pending and that the command, if run again, may succeed."
> 
> >> Signed-off-by: Colin Lord 
> >> This is based off my previous patch regarding the do_open_tray function
> >> (currently at v3). Probably should have been submitted as a patch set
> >> but I wasn't thinking that far ahead when I submitted the first patch.
> >> ---
> >>  blockdev.c | 7 +--
> >>  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > Yes, would probably have made sense as a series, but as long as it's
> > only two patches, it's not really a problem.
> > 
> > Please make sure to put such comments below the "---" line, though, i.e.
> > comments that make sense for the review, but not as part of the commit
> > log. Then git-am automatically removes that part from the commit message
> > while applying the patch. I did it manually for this one now.
> > 
> > Kevin
> > 
> 
> -- 
> —js

49 matches

Mail list logo