[Qemu-block] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer

2016-11-08 Thread Eric Blake
Commit 443668ca rewrote the write_zeroes logic to guarantee that
an unaligned request never crosses a cluster boundary.  But
in the rewrite, the new code assumed that at most one iteration
would be needed to get to an alignment boundary.

However, it is easy to trigger an assertion failure: the Linux
kernel limits loopback devices to advertise a max_transfer of
only 64k.  Any operation that has to fall back to writes rather
than more efficient zeroing must obey max_transfer during that
fallback, which means an unaligned head may require multiple
iterations of the write fallback before reaching an aligned
boundary when a format with clusters larger than 64k is layered
atop file access to a loopback device.

Test case:

$ qemu-img create -f qcow2 -o cluster_size=1M file 10M
$ losetup /dev/loop2 /path/to/file
$ qemu-io -f qcow2 /dev/loop2
qemu-io> w 7m 1k
qemu-io> w -z 8003584 2093056
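(Spelling out the numbers in that example: the 1M cluster size makes the
zero alignment 1048576 bytes, while the loop device caps max_transfer at
65536.  The request starts at 8003584, so head = 8003584 % 1048576 = 663552,
leaving 1048576 - 663552 = 385024 unaligned bytes before the next cluster
boundary -- roughly six 64k write-fallback iterations, where the old code
assumed a single iteration would reach alignment.)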

In fairness to Denis (as the original listed author of the culprit
commit), the faulty logic for at most one iteration is probably all
my fault in reworking his idea.  But the solution is to restore what
was in place prior to that commit: when dealing with an unaligned
head or tail, iterate as many times as necessary while fragmenting
the operation at max_transfer boundaries.

CC: qemu-sta...@nongnu.org
CC: Ed Swierk 
CC: Denis V. Lunev 
Signed-off-by: Eric Blake 
---
 block/io.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index aa532a5..085ac34 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1214,6 +1214,8 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
     int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
                         bs->bl.request_alignment);
+    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+                                    MAX_WRITE_ZEROES_BOUNCE_BUFFER);
 
     assert(alignment % bs->bl.request_alignment == 0);
     head = offset % alignment;
@@ -1229,9 +1231,12 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
          * boundaries.
          */
         if (head) {
-            /* Make a small request up to the first aligned sector.  */
-            num = MIN(count, alignment - head);
-            head = 0;
+            /* Make a small request up to the first aligned sector. For
+             * convenience, limit this request to max_transfer even if
+             * we don't need to fall back to writes.  */
+            num = MIN(MIN(count, max_transfer), alignment - head);
+            head = (head + num) % alignment;
+            assert(num < max_write_zeroes);
         } else if (tail && num > alignment) {
             /* Shorten the request to the last aligned sector.  */
             num -= tail;
@@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 
         if (ret == -ENOTSUP) {
             /* Fall back to bounce buffer if write zeroes is unsupported */
-            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
-                                            MAX_WRITE_ZEROES_BOUNCE_BUFFER);
             BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
 
             if ((flags & BDRV_REQ_FUA) &&
-- 
2.7.4




Re: [Qemu-block] [Qemu-devel] [PATCH v3 5/6] blockjob: refactor backup_start as backup_job_create

2016-11-08 Thread Jeff Cody
On Tue, Nov 08, 2016 at 10:24:50AM -0500, John Snow wrote:
> 
> 
> On 11/08/2016 04:11 AM, Kevin Wolf wrote:
> >Am 08.11.2016 um 06:41 hat John Snow geschrieben:
> >>On 11/03/2016 09:17 AM, Kevin Wolf wrote:
> >>>Am 02.11.2016 um 18:50 hat John Snow geschrieben:
> Refactor backup_start as backup_job_create, which only creates the job,
> but does not automatically start it. The old interface, 'backup_start',
> is not kept in favor of limiting the number of nearly-identical interfaces
> that would have to be edited to keep up with QAPI changes in the future.
> 
> Callers that wish to synchronously start the backup_block_job can
> instead just call block_job_start immediately after calling
> backup_job_create.
> 
> Transactions are updated to use the new interface, calling block_job_start
> only during the .commit phase, which helps prevent race conditions where
> jobs may finish before we even finish building the transaction. This may
> happen, for instance, during empty block backup jobs.
> 
> Reported-by: Vladimir Sementsov-Ogievskiy 
> Signed-off-by: John Snow 
> >>>
> +static void drive_backup_commit(BlkActionState *common)
> +{
> +DriveBackupState *state = DO_UPCAST(DriveBackupState, common, 
> common);
> +if (state->job) {
> +block_job_start(state->job);
> +}
> }
> >>>
> >>>How could state->job ever be NULL?
> >>>
> >>
> >>Mechanical thinking. It can't. (I definitely didn't copy paste from
> >>the .abort routines. Definitely.)
> >>
> >>>Same question for abort, and for blockdev_backup_commit/abort.
> >>>
> >>
> >>Abort ... we may not have created the job successfully. Abort gets
> >>called whether or not we made it to or through the matching
> >>.prepare.
> >
> >Ah, yes, I always forget about this. It's so counterintuitive (and
> >bdrv_reopen() actually works differently, it only aborts entries that
> >have successfully been prepared).
> >
> >Is there a good reason why qmp_transaction() works this way, especially
> >since we have a separate .clean function?
> >
> >Kevin
> >
> 
> We just don't track which actions have succeeded or not, so we loop through
> all actions on each phase regardless.
> 
> I could add a little state enumeration (or boolean) to each action and I
> could adjust abort to only run on actions that either completed or failed,
> but in this case I think it still wouldn't change the text for .abort,
> because an action may fail before it got to creating the job, for instance.
> 

As far as this part goes, couldn't we just do it without any flags, by not
inserting the state into the snap_bdrv_states list unless it was successful
(assuming _prepare cleans up itself on failure)?  E.g.:
 
-        QSIMPLEQ_INSERT_TAIL(&snap_bdrv_states, state, entry);
 
         state->ops->prepare(state, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
+            g_free(state);
             goto delete_and_fail;
         }
+        QSIMPLEQ_INSERT_TAIL(&snap_bdrv_states, state, entry);
     }

> Unless you'd propose undoing .prepare IN .prepare in failure cases, but why
> write abort code twice? I don't mind it living in .abort, personally.
>

Doing it the above way would indeed require prepare functions to clean up
after themselves on failure.
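Roughly, and with made-up names rather than the real drive_backup code, a
.prepare following that convention would look like:

static void example_prepare(BlkActionState *common, Error **errp)
{
    ExampleState *state = DO_UPCAST(ExampleState, common, common);
    Error *local_err = NULL;

    state->foo = example_acquire_foo();
    state->bar = example_acquire_bar(&local_err);
    if (local_err) {
        /* Undo our own partial work before reporting failure, so that
         * .abort never sees a half-prepared action. */
        example_release_foo(state->foo);
        state->foo = NULL;
        error_propagate(errp, local_err);
    }
}

where ExampleState and the example_* helpers are purely illustrative.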

The bdrv_reopen() model does it this way, and I think it makes sense.  With
most APIs, on failure you wouldn't have a way of knowing what has or has not
been done, so it leaves everything in a clean state.  I think this is a good
model to follow.

It is also what most QEMU block interfaces currently do, iirc (.bdrv_open,
etc.) - if it fails, it is assumed that it frees all resources it allocated.

I guess it doesn't have to be done this way, and the complexity can just be
pushed into the _abort() function.  After all, with these transactional
models, there exists an abort function, which differentiates it from most
other APIs.  But the downside is that we have different ways of handling
essentially the same sort of transactional model in the block layer (between
bdrv_reopen and qmp_transaction), and it trips up reviewers / authors.  

(I don't think changing how qmp_transaction handles this is something that
needs to be handled in this series - but it would be nice in the future
sometime).

Jeff



Re: [Qemu-block] [Qemu-devel] BdrvDirtyBitmap and bdrv_snapshot_goto

2016-11-08 Thread Vladimir Sementsov-Ogievskiy

08.11.2016 15:18, Kevin Wolf wrote:

Am 08.11.2016 um 12:08 hat Vladimir Sementsov-Ogievskiy geschrieben:

08.11.2016 14:05, Kevin Wolf wrote:

Am 07.11.2016 um 17:10 hat Max Reitz geschrieben:

On 07.11.2016 16:24, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

As I can see, in bdrv_snapshot_goto, existing dirty bitmaps are not
handled.. Is it ok? Should not they be filled with ones or something
like this?

Filling them with ones makes sense to me. I guess nobody noticed because
nobody was crazy enough to use block jobs alongside loadvm...

What's the use case in which ones make sense?

It rather seems to me that an active dirty bitmap and snapshot switching
should exclude each other because the bitmap becomes meaningless by the
switch. And chances are that after switching a snapshot you don't want
to "incrementally" backup everything, but that you should access a
different backup.

In other words, dirty bitmaps should be deleted on snapshot switch?
All? Or only named?

As Max said, we should probably integrate bitmaps with snapshots. After
reloading the old state, the bitmap becomes valid again, so throwing it
away in the active state seems only right if we included it in the
snapshot and can bring it back.


If we choose this way, it should first be done for BdrvDirtyBitmaps 
without any persistence. And it is not as simple as just dropping dirty 
bitmaps or filling them with ones. The current behavior is definitely 
wrong: if the user creates an incremental backup after a snapshot switch, 
that incremental backup will be incorrect. I think it should be fixed now 
in a simpler way (effectively, "for now, incremental backup is incompatible 
with snapshot switch"), and in the future, if we really need it, make them 
work together.


Also, I think that filling with ones is safer and more natural. It really 
describes what happens (at the cost of some extra dirty bits). A simple 
improvement: instead of filling with ones, use new_dirty_bitmap_state = 
old_dirty_bitmap_state | old_allocated_mask | new_allocated_mask, where 
each allocated mask is a bitmap of the same granularity showing which 
ranges are allocated in the image.
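As a toy illustration of that rule (made-up names, not the HBitmap API; each
bit stands for one granularity-sized chunk of the image):

static void merge_bitmaps_after_switch(uint64_t *dirty,
                                       const uint64_t *old_allocated,
                                       const uint64_t *new_allocated,
                                       size_t nwords)
{
    size_t i;

    for (i = 0; i < nwords; i++) {
        /* new dirty = old dirty | old allocated | new allocated */
        dirty[i] |= old_allocated[i] | new_allocated[i];
    }
}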




Kevin



--
Best regards,
Vladimir




Re: [Qemu-block] [PATCH] Added iopmem device emulation

2016-11-08 Thread Logan Gunthorpe
Hey,

On 08/11/16 08:58 AM, Stefan Hajnoczi wrote:
> My concern with the current implementation is that a PCI MMIO access
> invokes a synchronous blk_*() call.  That can pause vcpu execution while
> I/O is happening and therefore leads to unresponsive guests.  QEMU's
> monitor interface is also blocked during blk_*() making it impossible to
> troubleshoot QEMU if it gets stuck due to a slow/hung I/O operation.
> 
> Device models need to use blk_aio_*() so that control is returned while
> I/O is running.  There are a few legacy devices left that use
> synchronous I/O but new devices should not use this approach.

That's fair. I wasn't aware of this and I must have copied a legacy
device. We can certainly make the change in our patch.
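For the record, a rough sketch of the asynchronous pattern Stefan describes,
using the byte-based BlockBackend AIO helpers; the IopmemRequest struct and
the function names below are made up for illustration, not taken from the patch:

typedef struct IopmemRequest {
    void *buf;
    struct iovec iov;
    QEMUIOVector qiov;
} IopmemRequest;

static void iopmem_read_complete(void *opaque, int ret)
{
    IopmemRequest *req = opaque;

    /* Runs later from the event loop: hand the data to the guest, raise an
     * interrupt, etc.  The vcpu was never blocked while the I/O ran. */
    g_free(req->buf);
    g_free(req);
}

static void iopmem_start_read(BlockBackend *blk, uint64_t offset, size_t len)
{
    IopmemRequest *req = g_new0(IopmemRequest, 1);

    req->buf = g_malloc(len);
    req->iov = (struct iovec){ .iov_base = req->buf, .iov_len = len };
    qemu_iovec_init_external(&req->qiov, &req->iov, 1);

    /* Returns immediately; the callback runs when the request completes,
     * instead of a synchronous blk_pread() stalling the vcpu. */
    blk_aio_preadv(blk, offset, &req->qiov, 0, iopmem_read_complete, req);
}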

> Regarding the hardware design, I think the PCI BAR approach to nvdimm is
> inefficient for virtualization because each memory load/store requires a
> guest<->host transition (vmexit + vmenter).  A DMA approach (i.e.
> message passing or descriptor rings) is more efficient because it
> requires fewer vmexits.
> 
> On real hardware the performance characteristics are different so it
> depends what your target market is.

The performance of the virtual device is completely unimportant. This
isn't something I'd expect anyone to use except to test drivers. On real
hardware, with real applications, DMA would almost certainly be used --
but it would be the DMA engine in another device. eg. an IB NIC would
DMA from the PCI BAR of the iopmem device. This completely bypasses the
CPU so there would be no load/stores to be concerned about.

Thanks,

Logan



Re: [Qemu-block] [PATCH 4/4] block: Cater to iscsi with non-power-of-2 discard

2016-11-08 Thread Eric Blake
On 11/08/2016 05:03 AM, Peter Lieven wrote:
> Am 25.10.2016 um 18:12 schrieb Eric Blake:
>> On 10/25/2016 09:36 AM, Paolo Bonzini wrote:
>>>
>>> On 25/10/2016 16:35, Eric Blake wrote:
 So your argument is that we should always pass down every unaligned
 less-than-optimum discard request all the way to the hardware, rather
 than dropping it higher in the stack, even though discard requests are
 already advisory, in order to leave the hardware as the ultimate
 decision on whether to ignore the unaligned request?
>>> Yes, I agree with Peter as to this.
>> Okay, I'll work on patches. I think it counts as bug fix, so appropriate
>> even if I miss soft freeze (I'd still like to get NBD write zero support
>> into 2.8, since it already missed 2.7, but that one is still awaiting
>> review with not much time left).
>>
> 
> Hi Eric,
> 
> have you had time to look at this?
> If you need help, let me know.

Still on my list. I'm not forgetting it, and it does count as a bug fix
so it is safe for inclusion, although I'm trying to get it in before
this week is out.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-block] [PATCH for-2.8] hbitmap: Fix the serialization granularity's type

2016-11-08 Thread Stefan Hajnoczi
On Mon, Nov 07, 2016 at 05:39:21PM +0100, Max Reitz wrote:
> This function returns a uint64_t, so it should not truncate its result
> by performing a plain int calculation.
> 
> Signed-off-by: Max Reitz 
> ---
>  util/hbitmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/util/hbitmap.c b/util/hbitmap.c
> index 5d1a21c..c57be76 100644
> --- a/util/hbitmap.c
> +++ b/util/hbitmap.c
> @@ -401,7 +401,7 @@ uint64_t hbitmap_serialization_granularity(const HBitmap *hb)
>  {
>  /* Require at least 64 bit granularity to be safe on both 64 bit and 32 
> bit
>   * hosts. */
> -return 64 << hb->granularity;
> +return UINT64_C(64) << hb->granularity;
>  }

Another instance that should be fixed:

  uint64_t start = QEMU_ALIGN_UP(num_elements, 1 << hb->granularity);
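Presumably the analogous fix there, mirroring the hunk above (untested):

  uint64_t start = QEMU_ALIGN_UP(num_elements, UINT64_C(1) << hb->granularity);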




Re: [Qemu-block] [PATCH] Added iopmem device emulation

2016-11-08 Thread Stefan Hajnoczi
On Mon, Nov 07, 2016 at 10:09:29AM -0700, Logan Gunthorpe wrote:
> On 07/11/16 03:28 AM, Stefan Hajnoczi wrote:
> > It may be too early to merge this code into qemu.git if there is no
> > hardware spec and this is a prototype device that is subject to change.
> 
> Fair enough, though the interface is so simple I don't know what could
> possibly change.

My concern with the current implementation is that a PCI MMIO access
invokes a synchronous blk_*() call.  That can pause vcpu execution while
I/O is happening and therefore leads to unresponsive guests.  QEMU's
monitor interface is also blocked during blk_*() making it impossible to
troubleshoot QEMU if it gets stuck due to a slow/hung I/O operation.

Device models need to use blk_aio_*() so that control is returned while
I/O is running.  There are a few legacy devices left that use
synchronous I/O but new devices should not use this approach.

Regarding the hardware design, I think the PCI BAR approach to nvdimm is
inefficient for virtualization because each memory load/store requires a
guest<->host transition (vmexit + vmenter).  A DMA approach (i.e.
message passing or descriptor rings) is more efficient because it
requires fewer vmexits.

On real hardware the performance characteristics are different so it
depends what your target market is.

> > I'm wondering if there is a way to test or use this device if you are
> > not releasing specs and code that drives the device.
> > 
> > Have you submitted patches to enable this device in Linux, DPDK, or any
> > other project?
> 
> Yes, you can find patches to the Linux Kernel that were submitted to a
> couple mailing lists at the same time as the QEMU patch:
> 
> http://www.mail-archive.com/linux-nvdimm@lists.01.org/msg01426.html
> 
> There's been a discussion as to how best to expose these devices to user
> space and we may take a different approach in v2. But there has been no
> indication that the PCI interface would need to change at all.

Thanks, I'll check out the discussion!

Stefan




Re: [Qemu-block] [Qemu-devel] [PATCH v3 5/6] blockjob: refactor backup_start as backup_job_create

2016-11-08 Thread John Snow



On 11/08/2016 04:11 AM, Kevin Wolf wrote:

Am 08.11.2016 um 06:41 hat John Snow geschrieben:

On 11/03/2016 09:17 AM, Kevin Wolf wrote:

Am 02.11.2016 um 18:50 hat John Snow geschrieben:

Refactor backup_start as backup_job_create, which only creates the job,
but does not automatically start it. The old interface, 'backup_start',
is not kept in favor of limiting the number of nearly-identical interfaces
that would have to be edited to keep up with QAPI changes in the future.

Callers that wish to synchronously start the backup_block_job can
instead just call block_job_start immediately after calling
backup_job_create.

Transactions are updated to use the new interface, calling block_job_start
only during the .commit phase, which helps prevent race conditions where
jobs may finish before we even finish building the transaction. This may
happen, for instance, during empty block backup jobs.

Reported-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: John Snow 



+static void drive_backup_commit(BlkActionState *common)
+{
+    DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
+    if (state->job) {
+        block_job_start(state->job);
+    }
 }


How could state->job ever be NULL?



Mechanical thinking. It can't. (I definitely didn't copy paste from
the .abort routines. Definitely.)


Same question for abort, and for blockdev_backup_commit/abort.



Abort ... we may not have created the job successfully. Abort gets
called whether or not we made it to or through the matching
.prepare.


Ah, yes, I always forget about this. It's so counterintuitive (and
bdrv_reopen() actually works differently, it only aborts entries that
have successfully been prepared).

Is there a good reason why qmp_transaction() works this way, especially
since we have a separate .clean function?

Kevin



We just don't track which actions have succeeded or not, so we loop 
through all actions on each phase regardless.


I could add a little state enumeration (or boolean) to each action and I 
could adjust abort to only run on actions that either completed or 
failed, but in this case I think it still wouldn't change the text for 
.abort, because an action may fail before it got to creating the job, 
for instance.


Unless you'd propose undoing .prepare IN .prepare in failure cases, but 
why write abort code twice? I don't mind it living in .abort, personally.


--js



Re: [Qemu-block] [PATCH 1/2] aio-posix: avoid NULL pointer dereference in aio_epoll_update

2016-11-08 Thread Fam Zheng
On Tue, 11/08 14:55, Paolo Bonzini wrote:
> aio_epoll_update dereferences parameter "node", but it could have been NULL
> if deleting an fd handler that was not registered in the first place.
> 
> Signed-off-by: Paolo Bonzini 
> ---
> Remove unnecessary assignment to node->pfd.revents.
> 
>  aio-posix.c | 32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/aio-posix.c b/aio-posix.c
> index 4ef34dd..ec908f7 100644
> --- a/aio-posix.c
> +++ b/aio-posix.c
> @@ -217,21 +217,23 @@ void aio_set_fd_handler(AioContext *ctx,
>  
>  /* Are we deleting the fd handler? */
>  if (!io_read && !io_write) {
> -if (node) {
> -g_source_remove_poll(&ctx->source, &node->pfd);
> -
> -/* If the lock is held, just mark the node as deleted */
> -if (ctx->walking_handlers) {
> -node->deleted = 1;
> -node->pfd.revents = 0;
> -} else {
> -/* Otherwise, delete it for real.  We can't just mark it as
> - * deleted because deleted nodes are only cleaned up after
> - * releasing the walking_handlers lock.
> - */
> -QLIST_REMOVE(node, node);
> -deleted = true;
> -}
> +if (node == NULL) {
> +return;
> +}
> +
> +g_source_remove_poll(&ctx->source, &node->pfd);
> +
> +/* If the lock is held, just mark the node as deleted */
> +if (ctx->walking_handlers) {
> +node->deleted = 1;
> +node->pfd.revents = 0;
> +} else {
> +/* Otherwise, delete it for real.  We can't just mark it as
> + * deleted because deleted nodes are only cleaned up after
> + * releasing the walking_handlers lock.
> + */
> +QLIST_REMOVE(node, node);
> +deleted = true;
>  }
>  } else {
>  if (node == NULL) {
> -- 
> 2.7.4
> 
> 

Reviewed-by: Fam Zheng 



Re: [Qemu-block] [Qemu-devel] [PATCH] MAINTAINERS: Add Fam and Jsnow for Bitmap support

2016-11-08 Thread Fam Zheng
On Tue, 11/08 12:57, Thomas Huth wrote:
> On 07.11.2016 17:40, Max Reitz wrote:
> > On 04.08.2016 20:18, John Snow wrote:
> >> These files are currently unmaintained.
> >>
> >> I'm proposing that Fam and I co-maintain them; under the model that
> >> whomever between us isn't authoring a given series will be responsible
> >> for reviewing it.
> >>
> >> Signed-off-by: John Snow 
> >> ---
> >>  MAINTAINERS | 14 ++++++++++++++
> >>  1 file changed, 14 insertions(+)
> > 
> > Ping, anyone?
> 
> I'm currently gathering a set of my patches that updates the MAINTAINERS
> file - and Paolo asked me to send a PULL request for that one, so if you
> like, I can also include this patch there.

Please do. Thanks!

Fam



Re: [Qemu-block] [Qemu-devel] qemu-img create doesn't always replace the existing file

2016-11-08 Thread Richard W.M. Jones
On Tue, Nov 08, 2016 at 03:05:24PM +0100, Kevin Wolf wrote:
> [ Cc: qemu-block ]
> 
> Am 08.11.2016 um 11:58 hat Richard W.M. Jones geschrieben:
> > When using 'qemu-img create', if the file being created already
> > exists, then qemu-img tries to read it first.  This has some
> > unexpected effects:
> > 
> > 
> > $ rm test.qcow2 
> > $ qemu-img create -f qcow2 -o compat=1.1,preallocation=off test.qcow2 1G
> > Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1 
> > encryption=off cluster_size=65536 preallocation=off lazy_refcounts=off 
> > refcount_bits=16
> > $ du -sh test.qcow2 
> > 196K test.qcow2
> > 
> > 
> > $ rm test.qcow2 
> > $ qemu-img create -f qcow2 -o compat=1.1,preallocation=falloc test.qcow2 1G
> > Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1 
> > encryption=off cluster_size=65536 preallocation=falloc lazy_refcounts=off 
> > refcount_bits=16
> > $ qemu-img create -f qcow2 -o compat=1.1,preallocation=off test.qcow2 1G
> > Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1 
> > encryption=off cluster_size=65536 preallocation=off lazy_refcounts=off 
> > refcount_bits=16
> > $ du -sh test.qcow2 
> > 256K test.qcow2    # would expect this to be the same as above
> 
> For me it's actually even more:
> 
> $ du -h /tmp/test.qcow2 
> 448K    /tmp/test.qcow2
> 
> However...
> 
> $ ls -lh /tmp/test.qcow2 
> -rw-r--r--. 1 kwolf kwolf 193K  8. Nov 15:00 /tmp/test.qcow2
> 
> So qemu-img can't be at fault, the file has the same size as always.
> 
> Are you using XFS? In my case I would have guessed that it's probably
> some preallocation thing that XFS does internally. We've seen this
> before that 'du' shows (sometimes by far) larger values than the file
> size on XFS. That space is reclaimed later, though.

Yes I am, and indeed this looks like a filesystem artifact and not
a problem with qemu-img.

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org



Re: [Qemu-block] [Qemu-devel] qemu-img create doesn't always replace the existing file

2016-11-08 Thread Kevin Wolf
[ Cc: qemu-block ]

Am 08.11.2016 um 11:58 hat Richard W.M. Jones geschrieben:
> When using 'qemu-img create', if the file being created already
> exists, then qemu-img tries to read it first.  This has some
> unexpected effects:
> 
> 
> $ rm test.qcow2 
> $ qemu-img create -f qcow2 -o compat=1.1,preallocation=off test.qcow2 1G
> Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1 encryption=off 
> cluster_size=65536 preallocation=off lazy_refcounts=off refcount_bits=16
> $ du -sh test.qcow2 
> 196K test.qcow2
> 
> 
> $ rm test.qcow2 
> $ qemu-img create -f qcow2 -o compat=1.1,preallocation=falloc test.qcow2 1G
> Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1 encryption=off 
> cluster_size=65536 preallocation=falloc lazy_refcounts=off refcount_bits=16
> $ qemu-img create -f qcow2 -o compat=1.1,preallocation=off test.qcow2 1G
> Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1 encryption=off 
> cluster_size=65536 preallocation=off lazy_refcounts=off refcount_bits=16
> $ du -sh test.qcow2 
> 256K test.qcow2    # would expect this to be the same as above

For me it's actually even more:

$ du -h /tmp/test.qcow2 
448K    /tmp/test.qcow2

However...

$ ls -lh /tmp/test.qcow2 
-rw-r--r--. 1 kwolf kwolf 193K  8. Nov 15:00 /tmp/test.qcow2

So qemu-img can't be at fault, the file has the same size as always.

Are you using XFS? In my case I would have guessed that it's probably
some preallocation thing that XFS does internally. We've seen this
before that 'du' shows (sometimes by far) larger values than the file
size on XFS. That space is reclaimed later, though.

Kevin



[Qemu-block] [PATCH for-2.8 v2 0/2] aio-posix: epoll cleanups

2016-11-08 Thread Paolo Bonzini
The first fixes a NULL-pointer dereference that was reported by
Coverity (so definitely for 2.8).  The second is a small simplification.

Paolo Bonzini (2):
  aio-posix: avoid NULL pointer dereference in aio_epoll_update
  aio-posix: simplify aio_epoll_update

 aio-posix.c | 55 +++++++++++++++++++++++++------------------------------
 1 file changed, 25 insertions(+), 30 deletions(-)

-- 
2.7.4




[Qemu-block] [PATCH 1/2] aio-posix: avoid NULL pointer dereference in aio_epoll_update

2016-11-08 Thread Paolo Bonzini
aio_epoll_update dereferences parameter "node", but it could have been NULL
if deleting an fd handler that was not registered in the first place.

Signed-off-by: Paolo Bonzini 
---
Remove unnecessary assignment to node->pfd.revents.

 aio-posix.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 4ef34dd..ec908f7 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -217,21 +217,23 @@ void aio_set_fd_handler(AioContext *ctx,
 
 /* Are we deleting the fd handler? */
 if (!io_read && !io_write) {
-if (node) {
-g_source_remove_poll(&ctx->source, &node->pfd);
-
-/* If the lock is held, just mark the node as deleted */
-if (ctx->walking_handlers) {
-node->deleted = 1;
-node->pfd.revents = 0;
-} else {
-/* Otherwise, delete it for real.  We can't just mark it as
- * deleted because deleted nodes are only cleaned up after
- * releasing the walking_handlers lock.
- */
-QLIST_REMOVE(node, node);
-deleted = true;
-}
+if (node == NULL) {
+return;
+}
+
+g_source_remove_poll(&ctx->source, &node->pfd);
+
+/* If the lock is held, just mark the node as deleted */
+if (ctx->walking_handlers) {
+node->deleted = 1;
+node->pfd.revents = 0;
+} else {
+/* Otherwise, delete it for real.  We can't just mark it as
+ * deleted because deleted nodes are only cleaned up after
+ * releasing the walking_handlers lock.
+ */
+QLIST_REMOVE(node, node);
+deleted = true;
 }
 } else {
 if (node == NULL) {
-- 
2.7.4





[Qemu-block] [PATCH 2/2] aio-posix: simplify aio_epoll_update

2016-11-08 Thread Paolo Bonzini
Extract common code out of the "if".

Reviewed-by: Fam Zheng 
Signed-off-by: Paolo Bonzini 
---
 aio-posix.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index ec908f7..d54553d 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -81,29 +81,22 @@ static void aio_epoll_update(AioContext *ctx, AioHandler *node, bool is_new)
 {
 struct epoll_event event;
 int r;
+int ctl;
 
 if (!ctx->epoll_enabled) {
 return;
 }
 if (!node->pfd.events) {
-r = epoll_ctl(ctx->epollfd, EPOLL_CTL_DEL, node->pfd.fd, &event);
-if (r) {
-aio_epoll_disable(ctx);
-}
+ctl = EPOLL_CTL_DEL;
 } else {
 event.data.ptr = node;
 event.events = epoll_events_from_pfd(node->pfd.events);
-if (is_new) {
-r = epoll_ctl(ctx->epollfd, EPOLL_CTL_ADD, node->pfd.fd, &event);
-if (r) {
-aio_epoll_disable(ctx);
-}
-} else {
-r = epoll_ctl(ctx->epollfd, EPOLL_CTL_MOD, node->pfd.fd, &event);
-if (r) {
-aio_epoll_disable(ctx);
-}
-}
+ctl = is_new ? EPOLL_CTL_ADD : EPOLL_CTL_MOD;
+}
+
+r = epoll_ctl(ctx->epollfd, ctl, node->pfd.fd, &event);
+if (r) {
+aio_epoll_disable(ctx);
 }
 }
 
-- 
2.7.4




Re: [Qemu-block] [Qemu-devel] BdrvDirtyBitmap and bdrv_snapshot_goto

2016-11-08 Thread Kevin Wolf
Am 08.11.2016 um 12:08 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 08.11.2016 14:05, Kevin Wolf wrote:
> >Am 07.11.2016 um 17:10 hat Max Reitz geschrieben:
> >>On 07.11.2016 16:24, Vladimir Sementsov-Ogievskiy wrote:
> >>>Hi all!
> >>>
> >>>As I can see, in bdrv_snapshot_goto, existing dirty bitmaps are not
> >>>handled.. Is it ok? Should not they be filled with ones or something
> >>>like this?
> >>Filling them with ones makes sense to me. I guess nobody noticed because
> >>nobody was crazy enough to use block jobs alongside loadvm...
> >What's the use case in which ones make sense?
> >
> >It rather seems to me that an active dirty bitmap and snapshot switching
> >should exclude each other because the bitmap becomes meaningless by the
> >switch. And chances are that after switching a snapshot you don't want
> >to "incrementally" backup everything, but that you should access a
> >different backup.
> 
> In other words, dirty bitmaps should be deleted on snapshot switch?
> All? Or only named?

As Max said, we should probably integrate bitmaps with snapshots. After
reloading the old state, the bitmap becomes valid again, so throwing it
away in the active state seems only right if we included it in the
snapshot and can bring it back.

Kevin



Re: [Qemu-block] [Qemu-devel] [PATCH] MAINTAINERS: Add Fam and Jsnow for Bitmap support

2016-11-08 Thread Thomas Huth
On 07.11.2016 17:40, Max Reitz wrote:
> On 04.08.2016 20:18, John Snow wrote:
>> These files are currently unmaintained.
>>
>> I'm proposing that Fam and I co-maintain them; under the model that
>> whomever between us isn't authoring a given series will be responsible
>> for reviewing it.
>>
>> Signed-off-by: John Snow 
>> ---
>>  MAINTAINERS | 14 ++
>>  1 file changed, 14 insertions(+)
> 
> Ping, anyone?

I'm currently gathering a set of my patches that updates the MAINTAINERS
file - and Paolo asked me to send a PULL request for that one, so if you
like, I can also include this patch there.

 Thomas






Re: [Qemu-block] [Qemu-devel] BdrvDirtyBitmap and bdrv_snapshot_goto

2016-11-08 Thread Vladimir Sementsov-Ogievskiy

08.11.2016 14:05, Kevin Wolf wrote:

Am 07.11.2016 um 17:10 hat Max Reitz geschrieben:

On 07.11.2016 16:24, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

As I can see, in bdrv_snapshot_goto, existing dirty bitmaps are not
handled.. Is it ok? Should not they be filled with ones or something
like this?

Filling them with ones makes sense to me. I guess nobody noticed because
nobody was crazy enough to use block jobs alongside loadvm...

What's the use case in which ones make sense?

It rather seems to me that an active dirty bitmap and snapshot switching
should exclude each other because the bitmap becomes meaningless by the
switch. And chances are that after switching a snapshot you don't want
to "incrementally" backup everything, but that you should access a
different backup.

Kevin


In other words, dirty bitmaps should be deleted on snapshot switch? All? 
Or only named?



--
Best regards,
Vladimir




Re: [Qemu-block] [Qemu-devel] BdrvDirtyBitmap and bdrv_snapshot_goto

2016-11-08 Thread Kevin Wolf
Am 07.11.2016 um 17:10 hat Max Reitz geschrieben:
> On 07.11.2016 16:24, Vladimir Sementsov-Ogievskiy wrote:
> > Hi all!
> > 
> > As I can see, in bdrv_snapshot_goto, existing dirty bitmaps are not
> > handled.. Is it ok? Should not they be filled with ones or something
> > like this?
> 
> Filling them with ones makes sense to me. I guess nobody noticed because
> nobody was crazy enough to use block jobs alongside loadvm...

What's the use case in which ones make sense?

It rather seems to me that an active dirty bitmap and snapshot switching
should exclude each other because the bitmap becomes meaningless by the
switch. And chances are that after switching a snapshot you don't want
to "incrementally" backup everything, but that you should access a
different backup.

Kevin




Re: [Qemu-block] [PATCH 4/4] block: Cater to iscsi with non-power-of-2 discard

2016-11-08 Thread Peter Lieven

Am 25.10.2016 um 18:12 schrieb Eric Blake:

On 10/25/2016 09:36 AM, Paolo Bonzini wrote:


On 25/10/2016 16:35, Eric Blake wrote:

So your argument is that we should always pass down every unaligned
less-than-optimum discard request all the way to the hardware, rather
than dropping it higher in the stack, even though discard requests are
already advisory, in order to leave the hardware as the ultimate
decision on whether to ignore the unaligned request?

Yes, I agree with Peter as to this.

Okay, I'll work on patches. I think it counts as bug fix, so appropriate
even if I miss soft freeze (I'd still like to get NBD write zero support
into 2.8, since it already missed 2.7, but that one is still awaiting
review with not much time left).



Hi Eric,

have you had time to look at this?
If you need help, let me know.

Peter



Re: [Qemu-block] [Qemu-devel] BdrvDirtyBitmap and bdrv_snapshot_goto

2016-11-08 Thread Vladimir Sementsov-Ogievskiy

07.11.2016 19:10, Max Reitz wrote:

On 07.11.2016 16:24, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

As I can see, in bdrv_snapshot_goto, existing dirty bitmaps are not
handled.. Is it ok? Should not they be filled with ones or something
like this?

Filling them with ones makes sense to me. I guess nobody noticed because
nobody was crazy enough to use block jobs alongside loadvm...


Using block jobs is not necessary - we just have to maintain our dirty 
bitmap while qemu works, regardless of block jobs.





Also, when we have persistent bitmaps in qcow2, how should they be
handled on snapshot switching?

Good question. Since persistent bitmaps are not bound to snapshots, I'd
fill them with ones for now, too.

It would probably make sense to bind bitmaps to snapshots, though. This
could be achieved by adding a bitmap directory pointer to each snapshot
table entry. When switching snapshots, software (i.e. qemu) could then
either:

(1) Fill the bitmaps with ones, thus treating them as "global" bitmaps.

(2) Save the current bitmap directory in the old snapshot and put the
one from the snapshot that is being switched to into the image header,
thus treating them as bound to the snapshot.

Of course, this could be a bitmap-specific property.

Max




--
Best regards,
Vladimir




Re: [Qemu-block] [Qemu-devel] [PATCH v3 4/6] blockjob: add block_job_start

2016-11-08 Thread Kevin Wolf
Am 08.11.2016 um 03:05 hat Jeff Cody geschrieben:
> On Mon, Nov 07, 2016 at 09:02:14PM -0500, John Snow wrote:
> > On 11/03/2016 08:17 AM, Kevin Wolf wrote:
> > >Am 02.11.2016 um 18:50 hat John Snow geschrieben:
> > >>+void block_job_start(BlockJob *job)
> > >>+{
> > >>+assert(job && !block_job_started(job) && job->paused &&
> > >>+   !job->busy && job->driver->start);
> > >>+job->paused = false;
> > >>+job->busy = true;
> > >>+job->co = qemu_coroutine_create(job->driver->start, job);
> > >>+qemu_coroutine_enter(job->co);
> > >>+}
> > >
> > >We allow the user to pause a job while it's not started yet. You
> > >classified this as "harmless". But if we accept this, can we really
> > >unconditionally enter the coroutine even if the job has been paused?
> > >Can't a user expect that a job remains in paused state when they
> > >explicitly requested a pause and the job was already internally paused,
> > >like in this case by block_job_create()?
> > >
> > 
> > What will end up happening is that we'll enter the job, and then it'll pause
> > immediately upon entrance. Is that a problem?
> > 
> > If the jobs themselves are not checking their pause state fastidiously, it
> > could be (but block/backup does -- after it creates a write notifier.)
> > 
> > Do we want a stronger guarantee here?
> > 
> > Naively I think it's OK as-is, but I could add a stronger boolean in that
> > lets us know if it's okay to start or not, and we could delay the actual
> > creation and start until the 'resume' comes in if you'd like.
> > 
> > I'd like to avoid the complexity if we can help it, but perhaps I'm not
> > thinking carefully enough about the existing edge cases.
> > 
> 
> Is there any reason we can't just use job->pause_count here?  When the job
> is created, set job->paused = true, and job->pause_count = 1.  In the
> block_job_start(), check the pause_count prior to qemu_coroutine_enter():
> 
> void block_job_start(BlockJob *job)
> {
>     assert(job && !block_job_started(job) && job->paused &&
>            !job->busy && job->driver->start);
>     job->co = qemu_coroutine_create(job->driver->start, job);
>     job->paused = --job->pause_count > 0;
>     if (!job->paused) {
>         job->busy = true;
>         qemu_coroutine_enter(job->co);
>     }
> }

Yes, something like this is what I had in mind.

> > >The same probably also applies to the internal job pausing during
> > >bdrv_drain_all_begin/end, though as you know there is a larger problem
> > >with starting jobs under drain_all anyway. For now, we just need to keep
> > >in mind that we can neither create nor start a job in such sections.
> > >
> > 
> > Yeah, there are deeper problems there. As long as the existing critical
> > sections don't allow us to create jobs (started or not) I think we're
> > probably already OK.

My point here was that we would like the get rid of that restriction
eventually, and if we add more and more things that depend on the
restriction, getting rid of it will only become harder.

But with the above code, I think this specific problem is solved.

Kevin



Re: [Qemu-block] [Qemu-devel] [PATCH v3 5/6] blockjob: refactor backup_start as backup_job_create

2016-11-08 Thread Kevin Wolf
Am 08.11.2016 um 06:41 hat John Snow geschrieben:
> On 11/03/2016 09:17 AM, Kevin Wolf wrote:
> >Am 02.11.2016 um 18:50 hat John Snow geschrieben:
> >>Refactor backup_start as backup_job_create, which only creates the job,
> >>but does not automatically start it. The old interface, 'backup_start',
> >>is not kept in favor of limiting the number of nearly-identical interfaces
> >>that would have to be edited to keep up with QAPI changes in the future.
> >>
> >>Callers that wish to synchronously start the backup_block_job can
> >>instead just call block_job_start immediately after calling
> >>backup_job_create.
> >>
> >>Transactions are updated to use the new interface, calling block_job_start
> >>only during the .commit phase, which helps prevent race conditions where
> >>jobs may finish before we even finish building the transaction. This may
> >>happen, for instance, during empty block backup jobs.
> >>
> >>Reported-by: Vladimir Sementsov-Ogievskiy 
> >>Signed-off-by: John Snow 
> >
> >>+static void drive_backup_commit(BlkActionState *common)
> >>+{
> >>+DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
> >>+if (state->job) {
> >>+block_job_start(state->job);
> >>+}
> >> }
> >
> >How could state->job ever be NULL?
> >
> 
> Mechanical thinking. It can't. (I definitely didn't copy paste from
> the .abort routines. Definitely.)
> 
> >Same question for abort, and for blockdev_backup_commit/abort.
> >
> 
> Abort ... we may not have created the job successfully. Abort gets
> called whether or not we made it to or through the matching
> .prepare.

Ah, yes, I always forget about this. It's so counterintuitive (and
bdrv_reopen() actually works differently, it only aborts entries that
have successfully been prepared).

Is there a good reason why qmp_transaction() works this way, especially
since we have a separate .clean function?

Kevin