Re: [Qemu-block] [Qemu-devel] Meeting notes on -blockdev, dynamic backend reconfiguration

2016-12-06 Thread Fam Zheng
On Mon, 12/05 13:03, Markus Armbruster wrote:
> == Basic dynamic reconfiguration operation ==
> 
> The basic operation is "replace child".
> 
> Beware of race conditions.  Consider:
> 
>   BB
>   |
> mirror-filter
>   |
>  BDS
> 
> Add a throttle filter under BB while the mirror job is running.  First
> step, create the filter:
> 
>   BBthrottle-filter
>   | /
> mirror-filter
>   |
>  BDS
> 
> Second step, replace child of BB by the new filter:
> 
>   BB
>   |
>throttle-filter
>   |
> mirror-filter
>   |
>  BDS
> 
> But: if mirror-filter goes away between the two steps, the replace
> brings it right back!
> 
> To guard against such races, we need to specify both ends of the edge
> being replaced, i.e. parent, child name, actual child.  Then the replace
> step fails if the mirror-filter has gone away.  We can either fail the
> whole operation, or start over.
> 
> Alternatively, transactions, but that feels much more complex.
> 

Isn't it easy to make creating throttle-filter and replacing child happen in the
same critical section of BQL, without any coroutine yield?  If so I think there
is no race to worry about, mirror-filter should go away only after a QMP 
command.

Fam



Re: [Qemu-block] [PATCH v3 2/5] blkdebug: Add pass-through write_zero and discard support

2016-12-06 Thread Eric Blake
On 12/06/2016 04:23 PM, Max Reitz wrote:
> On 06.12.2016 23:18, Eric Blake wrote:
>> On 12/06/2016 04:14 PM, Max Reitz wrote:
>>> On 06.12.2016 23:12, Eric Blake wrote:
 On 12/06/2016 04:00 PM, Max Reitz wrote:

>> Tested by setting up an NBD server with export 'foo', then invoking:
>> $ ./qemu-io
>> qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo

 By the way, I'd LOVE to know if there is a way to write a qemu-io
 command line that would do this connection automatically (so that I can
 batch commands up front and benefit from the shell's history) rather
 than having to issue an 'open' after the fact.  I tried various
 incantations with --object and --image-opts, but got stumped.
>>>
>>> Can't you just do qemu-io -c 'open'?
>>
>> I suppose that would get command-line history.  But I still want
>> interactive mode.  The moment you use -c, ALL commands get run
>> back-to-back without stopping, so I'd have to add additional -c
>> 'read'/'write' commands up front. I like interactive mode (open
>> pre-connected, now let me explore the image at will).
> 
> Well, the usual --image-opts version would be:
> 
> ./qemu-io --image-opts driver=blkdebug,image.driver=nbd,\
> image.host=localhost,image.export=foo

Thanks, that appears to do the trick! I think I was getting confused by
trying 'file.driver' instead of 'image.driver', or maybe it was because
I was trying 'image.align' to set the blkdebug alignment where just
plain 'align' works once you are in --image-opts mode, or some such
problem on my end.  [Maybe I shouldn't be testing patches late at night,
either...]

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 5/5] tests: Add coverage for recent block geometry fixes

2016-12-06 Thread Max Reitz
On 02.12.2016 20:22, Eric Blake wrote:
> Use blkdebug's new geometry constraints to emulate setups that
> have caused recent regression fixes: write zeroes asserting
> when running through a loopback block device with max-transfer
> smaller than cluster size, and discard rounding away portions
> of requests not aligned to preferred boundaries.  Also, add
> coverage that the block layer is honoring max transfer limits.
> 
> For now, a single iotest performs all actions, with the idea
> that we can add future blkdebug constraint test cases in the
> same file; but it can be split into multiple iotests if we find
> reason to run one portion of the test in more setups than what
> are possible in the other.
> 
> Signed-off-by: Eric Blake 
> 
> ---
> v3: make comments tied more to test at hand, rather than the
> particular hardware that led to the earlier patches being tested
> v2: new patch
> ---
>  tests/qemu-iotests/173 | 115 
> +
>  tests/qemu-iotests/173.out |  49 +++
>  tests/qemu-iotests/group   |   1 +
>  3 files changed, 165 insertions(+)
>  create mode 100755 tests/qemu-iotests/173
>  create mode 100644 tests/qemu-iotests/173.out

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 4/5] blkdebug: Add ability to override unmap geometries

2016-12-06 Thread Max Reitz
On 02.12.2016 20:22, Eric Blake wrote:
> Make it easier to simulate various unusual hardware setups (for
> example, recent commits 3482b9b and b8d0a98 affect the Dell
> Equallogic iSCSI with its 15M preferred and maximum unmap and
> write zero sizing, or b2f95fe deals with the Linux loopback
> block device having a max_transfer of 64k), by allowing blkdebug
> to wrap any other device with further restrictions on various
> alignments.
> 
> Signed-off-by: Eric Blake 
> 
> ---
> v3: improve legibility of bounds checking, improve docs
> v2: new patch
> ---
>  qapi/block-core.json | 27 ++-
>  block/blkdebug.c | 96 
> ++--
>  2 files changed, 120 insertions(+), 3 deletions(-)

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 2/5] blkdebug: Add pass-through write_zero and discard support

2016-12-06 Thread Max Reitz
On 06.12.2016 23:18, Eric Blake wrote:
> On 12/06/2016 04:14 PM, Max Reitz wrote:
>> On 06.12.2016 23:12, Eric Blake wrote:
>>> On 12/06/2016 04:00 PM, Max Reitz wrote:
>>>
> Tested by setting up an NBD server with export 'foo', then invoking:
> $ ./qemu-io
> qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo
>>>
>>> By the way, I'd LOVE to know if there is a way to write a qemu-io
>>> command line that would do this connection automatically (so that I can
>>> batch commands up front and benefit from the shell's history) rather
>>> than having to issue an 'open' after the fact.  I tried various
>>> incantations with --object and --image-opts, but got stumped.
>>
>> Can't you just do qemu-io -c 'open'?
> 
> I suppose that would get command-line history.  But I still want
> interactive mode.  The moment you use -c, ALL commands get run
> back-to-back without stopping, so I'd have to add additional -c
> 'read'/'write' commands up front. I like interactive mode (open
> pre-connected, now let me explore the image at will).

Well, the usual --image-opts version would be:

./qemu-io --image-opts driver=blkdebug,image.driver=nbd,\
image.host=localhost,image.export=foo

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 2/5] blkdebug: Add pass-through write_zero and discard support

2016-12-06 Thread Eric Blake
On 12/06/2016 04:14 PM, Max Reitz wrote:
> On 06.12.2016 23:12, Eric Blake wrote:
>> On 12/06/2016 04:00 PM, Max Reitz wrote:
>>
 Tested by setting up an NBD server with export 'foo', then invoking:
 $ ./qemu-io
 qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo
>>
>> By the way, I'd LOVE to know if there is a way to write a qemu-io
>> command line that would do this connection automatically (so that I can
>> batch commands up front and benefit from the shell's history) rather
>> than having to issue an 'open' after the fact.  I tried various
>> incantations with --object and --image-opts, but got stumped.
> 
> Can't you just do qemu-io -c 'open'?

I suppose that would get command-line history.  But I still want
interactive mode.  The moment you use -c, ALL commands get run
back-to-back without stopping, so I'd have to add additional -c
'read'/'write' commands up front. I like interactive mode (open
pre-connected, now let me explore the image at will).

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 3/5] blkdebug: Simplify override logic

2016-12-06 Thread Eric Blake
On 12/06/2016 04:10 PM, Max Reitz wrote:
> On 02.12.2016 20:22, Eric Blake wrote:
>> Rather than store into a local variable, the copy to the struct
>> if the value is valid, then reporting errors otherwise,
> 
> It is rather difficult to part this sentence starting from "the copy".

That's because I meant to type "then copy".  (What a difference an 'n'
makes - I hope it does't imply that the batteries i my wireless keyboard
are o the declie agai, or a sig of ay other more serious problem)

> 
>> it is
>> simpler to just store into the struct and report errors if the
>> value is invalid.  This however requires that the struct store
>> a 64-bit number, rather than a narrower type.
>>
>> Signed-off-by: Eric Blake 
>>
>> ---
>> v3: new patch
>> ---
>>  block/blkdebug.c | 11 ---
>>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> Anyway, thanks! If you can explain to me how to parse the commit message
> or make it easier to read:
> 
> Reviewed-by: Max Reitz 
> 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 2/5] blkdebug: Add pass-through write_zero and discard support

2016-12-06 Thread Max Reitz
On 06.12.2016 23:12, Eric Blake wrote:
> On 12/06/2016 04:00 PM, Max Reitz wrote:
> 
>>> Tested by setting up an NBD server with export 'foo', then invoking:
>>> $ ./qemu-io
>>> qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo
> 
> By the way, I'd LOVE to know if there is a way to write a qemu-io
> command line that would do this connection automatically (so that I can
> batch commands up front and benefit from the shell's history) rather
> than having to issue an 'open' after the fact.  I tried various
> incantations with --object and --image-opts, but got stumped.

Can't you just do qemu-io -c 'open'?

Max



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 2/5] blkdebug: Add pass-through write_zero and discard support

2016-12-06 Thread Eric Blake
On 12/06/2016 04:00 PM, Max Reitz wrote:

>> Tested by setting up an NBD server with export 'foo', then invoking:
>> $ ./qemu-io
>> qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo

By the way, I'd LOVE to know if there is a way to write a qemu-io
command line that would do this connection automatically (so that I can
batch commands up front and benefit from the shell's history) rather
than having to issue an 'open' after the fact.  I tried various
incantations with --object and --image-opts, but got stumped.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 3/5] blkdebug: Simplify override logic

2016-12-06 Thread Max Reitz
On 02.12.2016 20:22, Eric Blake wrote:
> Rather than store into a local variable, the copy to the struct
> if the value is valid, then reporting errors otherwise,

It is rather difficult to part this sentence starting from "the copy".

> it is
> simpler to just store into the struct and report errors if the
> value is invalid.  This however requires that the struct store
> a 64-bit number, rather than a narrower type.
> 
> Signed-off-by: Eric Blake 
> 
> ---
> v3: new patch
> ---
>  block/blkdebug.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)

Anyway, thanks! If you can explain to me how to parse the commit message
or make it easier to read:

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 2/5] blkdebug: Add pass-through write_zero and discard support

2016-12-06 Thread Max Reitz
On 02.12.2016 20:22, Eric Blake wrote:
> In order to test the effects of artificial geometry constraints
> on operations like write zero or discard, we first need blkdebug
> to manage these actions.  It also allows us to inject errors on
> those operations, just like we can for read/write/flush.
> 
> We can also test the contract promised by the block layer; namely,
> if a device has specified limits on alignment or maximum size,
> then those limits must be obeyed (for now, the blkdebug driver
> merely inherits limits from whatever it is wrapping, but the next
> patch will further enhance it to allow specific limit overrides).
> 
> This patch intentionally refuses to service requests smaller than
> the requested alignments; this is because an upcoming patch adds
> a qemu-iotest to prove that the block layer is correctly handling
> fragmentation, but the test only works if there is a way to tell
> the difference at artificial alignment boundaries when blkdebug is
> using a larger-than-default alignment.  If we let the blkdebug
> layer always defer to the underlying layer, which potentially has
> a smaller granularity, the iotest will be thwarted.
> 
> Tested by setting up an NBD server with export 'foo', then invoking:
> $ ./qemu-io
> qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo
> qemu-io> d 0 15M
> qemu-io> w -z 0 15M
> 
> Pre-patch, the server never sees the discard (it was silently
> eaten by the block layer); post-patch it is passed across the
> wire.  Likewise, pre-patch the write is always passed with
> NBD_WRITE (with 15M of zeroes on the wire), while post-patch
> it can utilize NBD_WRITE_ZEROES (for less traffic).
> 
> Signed-off-by: Eric Blake 
> 
> ---
> v3: rebase to byte-based read/write, improve docs on why no
> partial write zero passthrough
> v2: new patch
> ---
>  block/blkdebug.c | 81 
> 
>  1 file changed, 81 insertions(+)

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH] qcow2: Discard/zero clusters by byte count

2016-12-06 Thread Eric Blake
On 12/06/2016 03:31 PM, Max Reitz wrote:

>>>
>>> The only alternative I can come up with would be "qcow2_write_zeroes";
>>> that at least solves the first issue I have with this, but not the
>>> second one...
>>
>> Maybe qcow2_cluster_zeroize() and qcow2_cluster_discard()?
> 
> I think qcow2_discard() is fine (it works with any alignment (even
> though it will only discard whole clusters)

Except that I think with my recent fix to qcow2_make_empty(), we can now
assert that all callers actually pass cluster-aligned values.  Hence my
other comment that maybe we want to (as a prereq patch) actually assert
that callers are aligned, at which point keeping cluster in the name is
once again worthwhile.

> and "discard" is mainly used
> as a verb, so at least I'm not confused there). I like
> qcow2_cluster_zeroize() if nothing else then for the "zeroize" alone. :-)

Okay, then I have a good idea of what name to use.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH] qcow2: Discard/zero clusters by byte count

2016-12-06 Thread Max Reitz
On 06.12.2016 22:26, Eric Blake wrote:
> On 12/06/2016 03:01 PM, Max Reitz wrote:
>> On 03.12.2016 19:19, Eric Blake wrote:
>>> Passing a byte offset, but sector count, when we ultimately
>>> want to operate on cluster granularity, is madness.  Clean up
>>> the interfaces to take byte offset and count.  Rename
>>> qcow2_discard_clusters() and qcow2_zero_clusters() to the
>>> shorter qcow2_discard() and qcow2_zero() to make sure backports
>>> don't get confused by changed units.
>>>
>>> Signed-off-by: Eric Blake 
>>> ---
>>>
>>> 2.9 material, depends on 'Don't strand clusters near 2G intervals
>>> during commit'
>>>
> 
>>> @@ -1595,20 +1593,24 @@ static int zero_single_l2(BlockDriverState *bs, 
>>> uint64_t offset,
>>>  return nb_clusters;
>>>  }
>>>
>>> -int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int 
>>> nb_sectors,
>>> -int flags)
>>> +int qcow2_zero(BlockDriverState *bs, uint64_t offset, uint64_t count,
>>> +   int flags)
>>
>> Hmm, I personally liked qcow2_zero_clusters() better. qcow2_zero()
>> doesn't really express that it means the verb "to zero".
>>
>> Also, while you are making a good point why the function should be
>> renamed, qcow2_zero_clusters() was much more accurate because offset and
>> count are expected to be cluster-aligned.
>>
>> The only alternative I can come up with would be "qcow2_write_zeroes";
>> that at least solves the first issue I have with this, but not the
>> second one...
> 
> Maybe qcow2_cluster_zeroize() and qcow2_cluster_discard()?

I think qcow2_discard() is fine (it works with any alignment (even
though it will only discard whole clusters) and "discard" is mainly used
as a verb, so at least I'm not confused there). I like
qcow2_cluster_zeroize() if nothing else then for the "zeroize" alone. :-)

Max

>That gets the
> benefit of the rename (to force all callers to use the right semantics),
> while still being legible as an object-verb naming: the action
> ('discard' or 'zeroize') is performed on 'qcow2 cluster' objects.



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH] qcow2: Discard/zero clusters by byte count

2016-12-06 Thread Eric Blake
On 12/06/2016 03:01 PM, Max Reitz wrote:
> On 03.12.2016 19:19, Eric Blake wrote:
>> Passing a byte offset, but sector count, when we ultimately
>> want to operate on cluster granularity, is madness.  Clean up
>> the interfaces to take byte offset and count.  Rename
>> qcow2_discard_clusters() and qcow2_zero_clusters() to the
>> shorter qcow2_discard() and qcow2_zero() to make sure backports
>> don't get confused by changed units.
>>
>> Signed-off-by: Eric Blake 
>> ---
>>
>> 2.9 material, depends on 'Don't strand clusters near 2G intervals
>> during commit'
>>

>> @@ -1595,20 +1593,24 @@ static int zero_single_l2(BlockDriverState *bs, 
>> uint64_t offset,
>>  return nb_clusters;
>>  }
>>
>> -int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int 
>> nb_sectors,
>> -int flags)
>> +int qcow2_zero(BlockDriverState *bs, uint64_t offset, uint64_t count,
>> +   int flags)
> 
> Hmm, I personally liked qcow2_zero_clusters() better. qcow2_zero()
> doesn't really express that it means the verb "to zero".
> 
> Also, while you are making a good point why the function should be
> renamed, qcow2_zero_clusters() was much more accurate because offset and
> count are expected to be cluster-aligned.
> 
> The only alternative I can come up with would be "qcow2_write_zeroes";
> that at least solves the first issue I have with this, but not the
> second one...

Maybe qcow2_cluster_zeroize() and qcow2_cluster_discard()? That gets the
benefit of the rename (to force all callers to use the right semantics),
while still being legible as an object-verb naming: the action
('discard' or 'zeroize') is performed on 'qcow2 cluster' objects.

> 
>>  {
>>  BDRVQcow2State *s = bs->opaque;
>>  uint64_t nb_clusters;
>>  int ret;
>>
>> +/* Block layer guarantees cluster alignment */
> 
> Hm, it's rather qcow2_co_pwrite_zeroes(), isn't it? The block layer will
> split unaligned requests into head, body and tail and it will still
> submit head and tail (though separately).

Hmm, good point.  I'll have to come up with some way to reword that.

> 
> As far as I can see, it's qcow2_co_pwrite_zeroes() which then guarantees
> that only the aligned part gets through to qcow2_zero().
> 
> 
> The patch looks good apart from these nit picks, though.
> 
> Max
> 
>> +assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
>> +assert(QEMU_IS_ALIGNED(count, s->cluster_size));

And since I'm adding assertions that the zero operation is never
attempted on unaligned parts, maybe I should also add asserts that
discards are never unaligned, perhaps as a prereq patch.

I'll wait a bit and see if anyone else has better naming ideas for the
functions, before I try to send a v2.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 1/5] blkdebug: Sanity check block layer guarantees

2016-12-06 Thread Max Reitz
On 02.12.2016 20:22, Eric Blake wrote:
> Commits 04ed95f4 and 1a62d0ac updated the block layer to auto-fragment
> any I/O to fit within device boundaries. Additionally, when using a
> minimum alignment of 4k, we want to ensure the block layer does proper
> read-modify-write rather than requesting I/O on a slice of a sector.
> Let's enforce that the contract is obeyed when using blkdebug.  For
> now, blkdebug only allows alignment overrides, and just inherits other
> limits from whatever device it is wrapping, but a future patch will
> further enhance things.
> 
> Signed-off-by: Eric Blake 
> 
> ---
> v3: rebase to byte-based interfaces
> v2: new patch
> ---
>  block/blkdebug.c | 14 ++
>  1 file changed, 14 insertions(+)

Reviewed-by: Max Reitz 



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH] qcow2: Discard/zero clusters by byte count

2016-12-06 Thread Max Reitz
On 03.12.2016 19:19, Eric Blake wrote:
> Passing a byte offset, but sector count, when we ultimately
> want to operate on cluster granularity, is madness.  Clean up
> the interfaces to take byte offset and count.  Rename
> qcow2_discard_clusters() and qcow2_zero_clusters() to the
> shorter qcow2_discard() and qcow2_zero() to make sure backports
> don't get confused by changed units.
> 
> Signed-off-by: Eric Blake 
> ---
> 
> 2.9 material, depends on 'Don't strand clusters near 2G intervals
> during commit'
> 
>  block/qcow2.h  |  8 
>  block/qcow2-cluster.c  | 20 +++-
>  block/qcow2-snapshot.c |  7 +++
>  block/qcow2.c  | 22 ++
>  4 files changed, 28 insertions(+), 29 deletions(-)
> 
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 1823414..a0d169b 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -543,10 +543,10 @@ uint64_t 
> qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
>   int compressed_size);
> 
>  int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m);
> -int qcow2_discard_clusters(BlockDriverState *bs, uint64_t offset,
> -int nb_sectors, enum qcow2_discard_type type, bool full_discard);
> -int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int 
> nb_sectors,
> -int flags);
> +int qcow2_discard(BlockDriverState *bs, uint64_t offset, uint64_t count,
> +  enum qcow2_discard_type type, bool full_discard);
> +int qcow2_zero(BlockDriverState *bs, uint64_t offset, uint64_t count,
> +   int flags);
> 
>  int qcow2_expand_zero_clusters(BlockDriverState *bs,
> BlockDriverAmendStatusCB *status_cb,
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 928c1e2..3ee0815 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1511,19 +1511,17 @@ static int discard_single_l2(BlockDriverState *bs, 
> uint64_t offset,
>  return nb_clusters;
>  }
> 
> -int qcow2_discard_clusters(BlockDriverState *bs, uint64_t offset,
> -int nb_sectors, enum qcow2_discard_type type, bool full_discard)
> +int qcow2_discard(BlockDriverState *bs, uint64_t offset, uint64_t count,
> +  enum qcow2_discard_type type, bool full_discard)
>  {
>  BDRVQcow2State *s = bs->opaque;
>  uint64_t end_offset;
>  uint64_t nb_clusters;
>  int ret;
> 
> -end_offset = offset + (nb_sectors << BDRV_SECTOR_BITS);
> -
> -/* Round start up and end down */
> +/* Round start up and end down to cluster boundary */
> +end_offset = start_of_cluster(s, offset + count);
>  offset = align_offset(offset, s->cluster_size);
> -end_offset = start_of_cluster(s, end_offset);
> 
>  if (offset > end_offset) {
>  return 0;
> @@ -1595,20 +1593,24 @@ static int zero_single_l2(BlockDriverState *bs, 
> uint64_t offset,
>  return nb_clusters;
>  }
> 
> -int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int 
> nb_sectors,
> -int flags)
> +int qcow2_zero(BlockDriverState *bs, uint64_t offset, uint64_t count,
> +   int flags)

Hmm, I personally liked qcow2_zero_clusters() better. qcow2_zero()
doesn't really express that it means the verb "to zero".

Also, while you are making a good point why the function should be
renamed, qcow2_zero_clusters() was much more accurate because offset and
count are expected to be cluster-aligned.

The only alternative I can come up with would be "qcow2_write_zeroes";
that at least solves the first issue I have with this, but not the
second one...

>  {
>  BDRVQcow2State *s = bs->opaque;
>  uint64_t nb_clusters;
>  int ret;
> 
> +/* Block layer guarantees cluster alignment */

Hm, it's rather qcow2_co_pwrite_zeroes(), isn't it? The block layer will
split unaligned requests into head, body and tail and it will still
submit head and tail (though separately).

As far as I can see, it's qcow2_co_pwrite_zeroes() which then guarantees
that only the aligned part gets through to qcow2_zero().


The patch looks good apart from these nit picks, though.

Max

> +assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
> +assert(QEMU_IS_ALIGNED(count, s->cluster_size));
> +
>  /* The zero flag is only supported by version 3 and newer */
>  if (s->qcow_version < 3) {
>  return -ENOTSUP;
>  }
> 
>  /* Each L2 table is handled by its own loop iteration */
> -nb_clusters = size_to_clusters(s, nb_sectors << BDRV_SECTOR_BITS);
> +nb_clusters = size_to_clusters(s, count);
> 
>  s->cache_discards = true;




signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH for-2.8] qapi: Document introduction of gluster's 'debug' option

2016-12-06 Thread Stefan Hajnoczi
On Tue, Dec 06, 2016 at 12:20:20PM -0600, Eric Blake wrote:
> We intentionally renamed 'debug-level' to 'debug' in the QMP
> schema for 'blockdev-add' related to gluster, in order to
> match the command line (commit 1a417e46).  However, since
> 'debug-level' was visible in 2.7, that means that we should
> document that 'debug' was not available until 2.8.
> 
> The change was intentional because 'blockdev-add' itself
> underwent incompatible changes (such as commit 0153d2f) for
> the same release; our intent is that after 2.8, these
> interfaces will now be stable.  [In hindsight, we should have
> used the name x-blockdev-add when we first introduced it]
> 
> Signed-off-by: Eric Blake 
> ---
>  qapi/block-core.json | 1 +
>  1 file changed, 1 insertion(+)

Thanks, applied to my staging tree:
https://github.com/stefanha/qemu/commits/staging

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-block] [PATCH for-2.8] qapi: Document DEVICE_TRAY_MOVED addition

2016-12-06 Thread Stefan Hajnoczi
On Tue, Dec 06, 2016 at 10:03:45AM -0600, Eric Blake wrote:
> Commit 2d76e72 failed to add a versioning tag to 'id'.
> 
> I audited all qapi*.json files from v2.7.0 to the current
> state of the tree, and didn't find any other additions where
> we failed to use a version tag.
> 
> Signed-off-by: Eric Blake 
> ---
> 
> Doc-only, so safe for 2.8. But we're also close enough to -rc3
> that I'm fine if it has to go in later via -stable.
> 
>  qapi/block.json | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks, applied to my staging tree:
https://github.com/stefanha/qemu/commits/staging

Stefan


signature.asc
Description: PGP signature


[Qemu-block] [PATCH for-2.8] qapi: Document introduction of gluster's 'debug' option

2016-12-06 Thread Eric Blake
We intentionally renamed 'debug-level' to 'debug' in the QMP
schema for 'blockdev-add' related to gluster, in order to
match the command line (commit 1a417e46).  However, since
'debug-level' was visible in 2.7, that means that we should
document that 'debug' was not available until 2.8.

The change was intentional because 'blockdev-add' itself
underwent incompatible changes (such as commit 0153d2f) for
the same release; our intent is that after 2.8, these
interfaces will now be stable.  [In hindsight, we should have
used the name x-blockdev-add when we first introduced it]

Signed-off-by: Eric Blake 
---
 qapi/block-core.json | 1 +
 1 file changed, 1 insertion(+)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index a298e76..6b42216 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2196,6 +2196,7 @@
 # @server:  gluster servers description
 #
 # @debug:   #optional libgfapi log level (default '4' which is Error)
+#   (Since 2.8)
 #
 # @logfile: #optional libgfapi log file (default /dev/stderr) (Since 2.8)
 #
-- 
2.9.3




Re: [Qemu-block] [Qemu-devel] [PULL] Block layer patches for 2.8.0-rc3

2016-12-06 Thread Stefan Hajnoczi
On Tue, Dec 06, 2016 at 04:16:57PM +0100, Kevin Wolf wrote:
> The following changes since commit bc66cedb4141fb7588f2462c74310d8fb5dd4cf1:
> 
>   Merge remote-tracking branch 'yongbok/tags/mips-20161204' into staging 
> (2016-12-05 10:56:45 +)
> 
> are available in the git repository at:
> 
> 
>   git://repo.or.cz/qemu/kevin.git tags/for-upstream
> 
> for you to fetch changes up to a3e1505daec31ef56f0489f8c8fff1b8e4ca92bd:
> 
>   qcow2: Don't strand clusters near 2G intervals during commit (2016-12-06 
> 15:37:02 +0100)
> 
> 
> Block layer patches for 2.8.0-rc3
> 
> 
> Eric Blake (1):
>   qcow2: Don't strand clusters near 2G intervals during commit
> 
>  block/qcow2.c  |   3 +-
>  tests/qemu-iotests/097 |  41 +---
>  tests/qemu-iotests/097.out | 249 
> +
>  3 files changed, 210 insertions(+), 83 deletions(-)
> 

Thanks, applied to my staging tree:
https://github.com/stefanha/qemu/commits/staging

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-block] [Qemu-devel] [PATCH for-2.8] qapi: Document DEVICE_TRAY_MOVED addition

2016-12-06 Thread Marc-André Lureau
On Tue, Dec 6, 2016 at 7:04 PM Eric Blake  wrote:

> Commit 2d76e72 failed to add a versioning tag to 'id'.
>
> I audited all qapi*.json files from v2.7.0 to the current
> state of the tree, and didn't find any other additions where
> we failed to use a version tag.
>
> Signed-off-by: Eric Blake 
>


Reviewed-by: Marc-André Lureau 

---
>
> Doc-only, so safe for 2.8. But we're also close enough to -rc3
> that I'm fine if it has to go in later via -stable.
>
>  qapi/block.json | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/qapi/block.json b/qapi/block.json
> index 937df05..8e9f590 100644
> --- a/qapi/block.json
> +++ b/qapi/block.json
> @@ -199,7 +199,7 @@
>  #  reasons, but it can be empty ("") if the image does not
>  #  have a device name associated.
>  #
> -# @id: The name or QOM path of the guest device
> +# @id: The name or QOM path of the guest device (since 2.8)
>  #
>  # @tray-open: true if the tray has been opened or false if it has been
> closed
>  #
> --
> 2.9.3
>
>
> --
Marc-André Lureau


[Qemu-block] [PATCH for-2.8] qapi: Document DEVICE_TRAY_MOVED addition

2016-12-06 Thread Eric Blake
Commit 2d76e72 failed to add a versioning tag to 'id'.

I audited all qapi*.json files from v2.7.0 to the current
state of the tree, and didn't find any other additions where
we failed to use a version tag.

Signed-off-by: Eric Blake 
---

Doc-only, so safe for 2.8. But we're also close enough to -rc3
that I'm fine if it has to go in later via -stable.

 qapi/block.json | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qapi/block.json b/qapi/block.json
index 937df05..8e9f590 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -199,7 +199,7 @@
 #  reasons, but it can be empty ("") if the image does not
 #  have a device name associated.
 #
-# @id: The name or QOM path of the guest device
+# @id: The name or QOM path of the guest device (since 2.8)
 #
 # @tray-open: true if the tray has been opened or false if it has been closed
 #
-- 
2.9.3




Re: [Qemu-block] [Qemu-devel] [PULL for-2.8 0/3] Block patches for -rc3

2016-12-06 Thread Eric Blake
On 12/06/2016 09:11 AM, Eric Blake wrote:

>> BlockdevOptionsGluster.debug(-level) does not have "Added in 2.8" so I
>> had to dig through git-blame(1) to verify that it was indeed added in
>> the current release cycle.
> 
> Then that implies we should add yet one more patch that adds the
> appropriate versioning information to all the gluster fields added for
> 2.8.  My reviewed-by was given on the assumption that debug was in 2.7
> and that this was a break from 2.7 behavior, but that we already KNOW
> we're breaking blockdev-add between 2.7 and 2.8; while your argument is
> that there is no backwards incompatibility because it was not in 2.7 to
> begin with.  I think both reasons are indeed acceptable, but it also
> means that my reason was flawed because of the incomplete documentation.

I checked again.  'git diff v2.7.0..origin qapi/*.json' shows:

 ##
-# @BlockdevOptionsGluster
+# @BlockdevOptionsGluster:
 #
 # Driver specific block device options for Gluster
 #
@@ -2140,7 +2195,9 @@
 #
 # @server:  gluster servers description
 #
-# @debug-level: #optional libgfapi log level (default '4' which is Error)
+# @debug:   #optional libgfapi log level (default '4' which is Error)
+#
+# @logfile: #optional libgfapi log file (default /dev/stderr)
(Since 2.8)
 #
 # Since: 2.7
 ##
@@ -2148,25 +2205,163 @@
   'data': { 'volume': 'str',
 'path': 'str',
 'server': ['GlusterServer'],
-'*debug-level': 'int' } }
+'*debug': 'int',
+'*logfile': 'str' } }

So I was right after all - this IS a backwards-incompatible change (and
NOT something that was introduced only in 2.8) - but I still argue that
the change is appropriate NOW (but not later) because blockdev-add is
the only client of this type in QMP, and that command changed
backwards-incompatibly at the same time.

> 
>>
>> In the future please make sure all QAPI changes are marked by version.
> 
> Indeed, and I try to flag it in my reviews as often as I notice it.
> 
>> If there tricky changes you can include a statement showing you are
>> aware of QAPI backwards compatibility ("These new options were added in
>> the 2.8 release cycle and can therefore still be changed without
>> breaking backward compatibility").  This will make me confident that
>> you've checked the QAPI changes.
>>
>> Thanks, applied to my staging tree:
>> https://github.com/stefanha/qemu/commits/staging

In my audit of all differences between 2.7 and current state of qapi
.json files, I only found one addition that was lacking versioning
information:

event DEVICE_TRAY_MOVED gained 'id' parameter

I'll submit a patch for that shortly.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


[Qemu-block] [PULL] Block layer patches for 2.8.0-rc3

2016-12-06 Thread Kevin Wolf
The following changes since commit bc66cedb4141fb7588f2462c74310d8fb5dd4cf1:

  Merge remote-tracking branch 'yongbok/tags/mips-20161204' into staging 
(2016-12-05 10:56:45 +)

are available in the git repository at:


  git://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to a3e1505daec31ef56f0489f8c8fff1b8e4ca92bd:

  qcow2: Don't strand clusters near 2G intervals during commit (2016-12-06 
15:37:02 +0100)


Block layer patches for 2.8.0-rc3


Eric Blake (1):
  qcow2: Don't strand clusters near 2G intervals during commit

 block/qcow2.c  |   3 +-
 tests/qemu-iotests/097 |  41 +---
 tests/qemu-iotests/097.out | 249 +
 3 files changed, 210 insertions(+), 83 deletions(-)



[Qemu-block] [PULL] qcow2: Don't strand clusters near 2G intervals during commit

2016-12-06 Thread Kevin Wolf
From: Eric Blake 

The qcow2_make_empty() function is reached during 'qemu-img commit',
in order to clear out ALL clusters of an image.  However, if the
image cannot use the fast code path (true if the image is format
0.10, or if the image contains a snapshot), the cluster size is
larger than 512, and the image is larger than 2G in size, then our
choice of sector_step causes problems.  Since it is not cluster
aligned, but qcow2_discard_clusters() silently ignores an unaligned
head or tail, we are leaving clusters allocated.

Enhance the testsuite to expose the flaw, and patch the problem by
ensuring our step size is aligned.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block/qcow2.c  |   3 +-
 tests/qemu-iotests/097 |  41 +---
 tests/qemu-iotests/097.out | 249 +
 3 files changed, 210 insertions(+), 83 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index ed9e0f3..96fb8a8 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2808,7 +2808,8 @@ static int qcow2_make_empty(BlockDriverState *bs)
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t start_sector;
-int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
+int sector_step = (QEMU_ALIGN_DOWN(INT_MAX, s->cluster_size) /
+   BDRV_SECTOR_SIZE);
 int l1_clusters, ret = 0;
 
 l1_clusters = DIV_ROUND_UP(s->l1_size, s->cluster_size / sizeof(uint64_t));
diff --git a/tests/qemu-iotests/097 b/tests/qemu-iotests/097
index 01d8dd0..4c33e80 100755
--- a/tests/qemu-iotests/097
+++ b/tests/qemu-iotests/097
@@ -46,7 +46,7 @@ _supported_proto file
 _supported_os Linux
 
 
-# Four passes:
+# Four main passes:
 #  0: Two-layer backing chain, commit to upper backing file (implicitly)
 # (in this case, the top image will be emptied)
 #  1: Two-layer backing chain, commit to upper backing file (explicitly)
@@ -56,22 +56,30 @@ _supported_os Linux
 #  3: Two-layer backing chain, commit to lower backing file
 # (in this case, the top image will implicitly stay unchanged)
 #
+# Each pass is run twice, since qcow2 has different code paths for cleaning
+# an image depending on whether it has a snapshot.
+#
 # 020 already tests committing, so this only tests whether image chains are
 # working properly and that all images above the base are emptied; therefore,
-# no complicated patterns are necessary
+# no complicated patterns are necessary.  Check near the 2G mark, as qcow2
+# has been buggy at that boundary in the past.
 for i in 0 1 2 3; do
+for j in 0 1; do
 
 echo
-echo "=== Test pass $i ==="
+echo "=== Test pass $i.$j ==="
 echo
 
-TEST_IMG="$TEST_IMG.base" _make_test_img 64M
-TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 64M
-_make_test_img -b "$TEST_IMG.itmd" 64M
+TEST_IMG="$TEST_IMG.base" _make_test_img 2100M
+TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 2100M
+_make_test_img -b "$TEST_IMG.itmd" 2100M
+if [ $j -eq 0 ]; then
+$QEMU_IMG snapshot -c snap "$TEST_IMG"
+fi
 
-$QEMU_IO -c 'write -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io
-$QEMU_IO -c 'write -P 2 64k 128k' "$TEST_IMG.itmd" | _filter_qemu_io
-$QEMU_IO -c 'write -P 3 128k 64k' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c 'write -P 1 0x7ffd 192k' "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c 'write -P 2 0x7ffe 128k' "$TEST_IMG.itmd" | _filter_qemu_io
+$QEMU_IO -c 'write -P 3 0x7fff 64k' "$TEST_IMG" | _filter_qemu_io
 
 if [ $i -lt 3 ]; then
 if [ $i == 0 ]; then
@@ -88,12 +96,12 @@ if [ $i -lt 3 ]; then
 fi
 
 # Bottom should be unchanged
-$QEMU_IO -c 'read -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c 'read -P 1 0x7ffd 192k' "$TEST_IMG.base" | _filter_qemu_io
 
 # Intermediate should contain changes from top
-$QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.itmd" | _filter_qemu_io
-$QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.itmd" | _filter_qemu_io
-$QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+$QEMU_IO -c 'read -P 1 0x7ffd 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+$QEMU_IO -c 'read -P 2 0x7ffe 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+$QEMU_IO -c 'read -P 3 0x7fff 64k' "$TEST_IMG.itmd" | _filter_qemu_io
 
 # And in pass 0, the top image should be empty, whereas in both other 
passes
 # it should be unchanged (which is both checked by qemu-img map)
@@ -101,9 +109,9 @@ else
 $QEMU_IMG commit -b "$TEST_IMG.base" "$TEST_IMG"
 
 # Bottom should contain all changes
-$QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.base" | _filter_qemu_io
-$QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.base" | _filter_qemu_io
-$QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c 'read -P 1 0x7ffd 64k' "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c 'read -P 2 0x7ffe 64k' 

Re: [Qemu-block] [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-12-06 Thread Dr. David Alan Gilbert
* Kevin Wolf (kw...@redhat.com) wrote:
> Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:
> > commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case
> > which migration aborted QEMU because it didn't regain the control
> > of images while some errors happened.
> > 
> > Actually, we have another case in that error path to abort QEMU
> > because of the same reason:
> > migration_thread()
> > migration_completion()
> >bdrv_inactivate_all() > inactivate images
> >qemu_savevm_state_complete_precopy()
> >socket_writev_buffer() > error because destination 
> > fails
> >  qemu_fflush() ---> set error on migration 
> > stream
> >qemu_mutex_unlock_iothread() --> unlock
> > qmp_migrate_cancel() -> user cancelled migration
> > migrate_set_state() --> set migrate CANCELLING
> 
> Important to note here: qmp_migrate_cancel() is executed by a concurrent
> thread, it doesn't depend on any code paths in migration_completion().
> 
> > migration_completion() -> go on to fail_invalidate
> > if (s->state == MIGRATION_STATUS_ACTIVE) -> Jump this branch
> > migration_thread() ---> break migration loop
> >   vm_start() -> restart guest with inactive
> > images
> > We failed to regain the control of images because we only regain it
> > while the migration state is "active", but here users cancelled the 
> > migration
> > when they found some errors happened (for example, libvirtd daemon is 
> > shutdown
> > in destination unexpectedly).
> > 
> > Signed-off-by: zhanghailiang 
> > ---
> >  migration/migration.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index f498ab8..0c1ee6d 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1752,7 +1752,8 @@ fail_invalidate:
> >  /* If not doing postcopy, vm_start() will be called: let's regain
> >   * control on images.
> >   */
> > -if (s->state == MIGRATION_STATUS_ACTIVE) {
> 
> This if condition tries to check whether we ran the code path that
> called bdrv_inactivate_all(), so that we only try to reactivate images
> it if we really inactivated them first.
> 
> The problem with it is that it ignores a possible concurrent
> modification of s->state.
> 
> > +if (s->state == MIGRATION_STATUS_ACTIVE ||
> > +s->state == MIGRATION_STATUS_CANCELLING) {
> 
> This adds another state that we could end up with with a concurrent
> modification, so that even in this case we undo the inactivation.
> 
> However, it is no longer limited to the cases where we inactivated the
> image. It also applies to other code paths (like the postcopy one) where
> we didn't inactivate images.
> 
> What saves the patch is that bdrv_invalidate_cache() is a no-op for
> block devices that aren't inactivated, so calling it more often than
> necessary is okay.
> 
> But then, if we're going to rely on this, it would be much better to
> just remove the if altogether. I can't say whether there are any other
> possible values of s->state that we should consider, and by removing the
> if we would be guaranteed to catch all of them.
> 
> If we don't want to rely on it, just keep a local bool that remembers
> whether we inactivated images and check that here.
> 
> >  Error *local_err = NULL;
> >  
> >  bdrv_invalidate_cache_all(_err);
> 
> So in summary, this is a horrible patch because it checks the wrong
> thing, and for I can't really say if it covers everything it needs to
> cover, but arguably it happens to correctly fix the outcome of a
> previously failing case.
> 
> Normally I would reject such a patch and require a clean solution, but
> then we're on the day of -rc3, so if you can't send v2 right away, we
> might not have the time for it.
> 
> Tough call...

Hmm, this case is messy; I created this function having split it out
of the main loop a couple of years back but it did get more messy
with more s->state checks; as far as I can tell it's always
done the transition to COMPLETED at the end well after the locked
section, so there's always been that chance that cancellation sneaks
in just before or just after the locked section.

Some of the bad cases that can happen:
   a) A cancel sneaks in after the ACTIVE check but before or after
 the locked section;  should we reactivate the disks? Well that
 depends on whether the destination actually got the full migration
 stream - we don't know!
If the destination actually starts running we must not reactivate
the disk on the source even if the CPU is stopped.

   b) If the bdrv_inactive_all fails for one device, but the others 
  are fine, we go down the 

Re: [Qemu-block] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-06 Thread Eric Blake
On 12/06/2016 03:25 AM, Kevin Wolf wrote:
> Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
>> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
>> team discovered that it is useful if a server can advertise
>> whether an export is in a known-all-zeroes state at the time
>> the client connects.
> 
> Does a server usually have the information to set this flag, other than
> querying the block status of all blocks at startup? If so, the client
> could just query this by itself.

Well, only if the client can query information at all (we don't have the
documentation finished for extent queries, let alone a reference
implementation).

> 
> The patch that was originally sent to qemu-devel just forwarded qemu's
> .bdrv_has_zero_init() call to the server. However, what this function
> returns is not a known-all-zeroes state on open, but just a
> known-all-zeroes state immediately after bdrv_create(), i.e. creating a
> new image. Then it becomes information that is easy to get and doesn't
> involve querying all blocks (e.g. true for COW image formats, true for
> raw on regular files, false for raw on block devices).

Just because the NBD spec describes the bit does NOT require that
servers HAVE to set the bit on all images that are all zeroes.  It is
perfectly compliant if the server never advertises the bit.  That said,
I think there are cases where qemu can easily advertise the bit.

I _do_ agree that it is NOT as trivial as the qemu server just
forwarding the value of .bdrv_has_zero_init() - the server HAS to prove
that no data has been written to the image.  But for a qcow2 image just
created with qemu-img, it is a fairly easy proof: If the L1 table has
all-zero entries, then the image has not been written to yet.  Reading
the L1 table for all-zeroes is only a single cluster read, which is MUCH
faster than crawling the entire image for extent status.  And for
regular files, a single lseek(SEEK_DATA) is sufficient to see if the
entire image is currently sparse.

Note that I only proposed the NBD implementation - it still remains to
be coded into the qemu code for the client to make use of the bit
(fairly easy: if the bit is set, the client can make its own
.bdrv_has_zero_init() return true), as well as for the server to set the
bit (harder: the server has to check .bdrv_has_zero_init() of the
wrapped image, but also has to prove the image is still unwritten).
Maybe this means that qemu's block layer wants to add a new
.bdrv_has_been_written() [or whatever name] to better abstract the proof
across drivers.  But those patches would be qemu 2.9 material, and do
not need to further cc the NBD list.

> 
> This is useful for 'qemu-img convert', which creates an image and then
> writes the whole contents, but I'm not sure if this property is
> applicable for NBD, which I think doesn't even have a create operation.

Another option on the NBD server side is to create a server option -
when firing up a server to serve a particular file as an export, the
user can explicitly tell the server to advertise the bit because the
user has side knowledge that the file was just created (and then the
burden of misbehavior is on the user if they mistakenly request the
advertisement when it is not true).

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [RFC PATCH] glusterfs: allow partial reads

2016-12-06 Thread Kevin Wolf
Am 06.12.2016 um 16:02 hat Eric Blake geschrieben:
> On 12/06/2016 03:59 AM, Kevin Wolf wrote:
> 
>  I'm not sure what the original rationale was to treat both partial
>  reads as well as well as writes as I/O error. (Seems to have happened
>  from original glusterfs v1 to v2 series with a note but no reasoning
>  for the read side as far as I could see.)
>  The general direction lately seems to be to move away from sector
>  based block APIs. Also eg. the NFS code allows partial reads. (It
>  does, however, have an old patch (c2eb918e3) dedicated to aligning
>  sizes to 512 byte boundaries for file creation for compatibility to
>  other parts of qemu like qcow2. This already happens in glusterfs,
>  though, but if you move a file from a different storage over to
>  glusterfs you may end up with a qcow2 file with eg. the L1 table in
>  the last 80 bytes of the file aligned to _begin_ at a 512 boundary,
>  but not _end_ at one.)
> > 
> > Hm, does this really happen? I always thought that the file size of
> > qcow2 images is aligned to the cluster size. If it isn't, maybe we
> > should fix that.
> 
> Yes, it absolutely happens all the time!  In fact, it is often the case
> that our images are not even sector-aligned, let alone cluster-aligned:
> 
> $ qemu-img create -f qcow2 file 1M
> Formatting 'file', fmt=qcow2 size=1048576 encryption=off
> cluster_size=65536 lazy_refcounts=off refcount_bits=16
> $ ls -l file
> -rw-r--r--. 1 eblake eblake 196616 Dec  6 08:58 file
> $ echo $((384*512))
> 196608
> $ echo $((385*512))
> 197120
> 
> 196616 is a fraction more than 384 sectors.
> 
> This is because the qcow2 format explicitly requires that if the L1 or
> L2 table is at the end of the file (which is what happens by default in
> qemu-img create), any entries not physically present in the table
> (because the file ends early) are read as 0.
> 
> >>> Would it be better to switch to byte-based interfaces rather than
> >>> continue to force gluster interaction in 512-byte sector chunks,
> >>> since gluster can obviously store files that are not 512-aligned?
> > 
> > The gluster I/O functions are byte-based anyway, and the driver already
> > implements .bdrv_co_readv, so going to .bdrv_co_preadv should be
> > trivial. Probably the best solution here indeed.
> > 
> 
> Are we still hoping to fix this bug (the gluster behavior on unaligned
> files, not the larger [non-?]bug of qemu-img create giving unaligned
> images in the first place) for 2.8, or are we pushing our luck here,
> where the correct patch should be 2.9 material and cc qemu-stable?

I think we're too late for 2.8.

Kevin


pgpUuN4Y8ownI.pgp
Description: PGP signature


Re: [Qemu-block] [Qemu-devel] [PULL for-2.8 0/3] Block patches for -rc3

2016-12-06 Thread Eric Blake
On 12/06/2016 04:04 AM, Stefan Hajnoczi wrote:

>>
>> Prasanna Kumar Kalever (3):
>>   block/gluster: fix QMP to match debug option
>>   block/nfs: fix QMP to match debug option
>>   qemu-doc: update gluster protocol usage guide
>>
>>  block/gluster.c  | 38 -
>>  block/nfs.c  |  4 ++--
>>  qapi/block-core.json |  8 +++
>>  qemu-doc.texi| 59 
>> +++-
>>  qemu-options.hx  | 25 --
>>  5 files changed, 93 insertions(+), 41 deletions(-)
> 
> BlockdevOptionsGluster.debug(-level) does not have "Added in 2.8" so I
> had to dig through git-blame(1) to verify that it was indeed added in
> the current release cycle.

Then that implies we should add yet one more patch that adds the
appropriate versioning information to all the gluster fields added for
2.8.  My reviewed-by was given on the assumption that debug was in 2.7
and that this was a break from 2.7 behavior, but that we already KNOW
we're breaking blockdev-add between 2.7 and 2.8; while your argument is
that there is no backwards incompatibility because it was not in 2.7 to
begin with.  I think both reasons are indeed acceptable, but it also
means that my reason was flawed because of the incomplete documentation.

> 
> In the future please make sure all QAPI changes are marked by version.

Indeed, and I try to flag it in my reviews as often as I notice it.

> If there tricky changes you can include a statement showing you are
> aware of QAPI backwards compatibility ("These new options were added in
> the 2.8 release cycle and can therefore still be changed without
> breaking backward compatibility").  This will make me confident that
> you've checked the QAPI changes.
> 
> Thanks, applied to my staging tree:
> https://github.com/stefanha/qemu/commits/staging
> 
> Stefan
> 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [RFC PATCH] glusterfs: allow partial reads

2016-12-06 Thread Eric Blake
On 12/06/2016 03:59 AM, Kevin Wolf wrote:

 I'm not sure what the original rationale was to treat both partial
 reads as well as well as writes as I/O error. (Seems to have happened
 from original glusterfs v1 to v2 series with a note but no reasoning
 for the read side as far as I could see.)
 The general direction lately seems to be to move away from sector
 based block APIs. Also eg. the NFS code allows partial reads. (It
 does, however, have an old patch (c2eb918e3) dedicated to aligning
 sizes to 512 byte boundaries for file creation for compatibility to
 other parts of qemu like qcow2. This already happens in glusterfs,
 though, but if you move a file from a different storage over to
 glusterfs you may end up with a qcow2 file with eg. the L1 table in
 the last 80 bytes of the file aligned to _begin_ at a 512 boundary,
 but not _end_ at one.)
> 
> Hm, does this really happen? I always thought that the file size of
> qcow2 images is aligned to the cluster size. If it isn't, maybe we
> should fix that.

Yes, it absolutely happens all the time!  In fact, it is often the case
that our images are not even sector-aligned, let alone cluster-aligned:

$ qemu-img create -f qcow2 file 1M
Formatting 'file', fmt=qcow2 size=1048576 encryption=off
cluster_size=65536 lazy_refcounts=off refcount_bits=16
$ ls -l file
-rw-r--r--. 1 eblake eblake 196616 Dec  6 08:58 file
$ echo $((384*512))
196608
$ echo $((385*512))
197120

196616 is a fraction more than 384 sectors.

This is because the qcow2 format explicitly requires that if the L1 or
L2 table is at the end of the file (which is what happens by default in
qemu-img create), any entries not physically present in the table
(because the file ends early) are read as 0.

>>> Would it be better to switch to byte-based interfaces rather than
>>> continue to force gluster interaction in 512-byte sector chunks,
>>> since gluster can obviously store files that are not 512-aligned?
> 
> The gluster I/O functions are byte-based anyway, and the driver already
> implements .bdrv_co_readv, so going to .bdrv_co_preadv should be
> trivial. Probably the best solution here indeed.
> 

Are we still hoping to fix this bug (the gluster behavior on unaligned
files, not the larger [non-?]bug of qemu-img create giving unaligned
images in the first place) for 2.8, or are we pushing our luck here,
where the correct patch should be 2.9 material and cc qemu-stable?

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [PATCH for-2.8 v2] qcow2: Don't strand clusters near 2G intervals during commit

2016-12-06 Thread Eric Blake
On 12/05/2016 01:52 PM, John Snow wrote:
> 
> 
> On 12/05/2016 10:49 AM, Eric Blake wrote:
>> The qcow2_make_empty() function is reached during 'qemu-img commit',
>> in order to clear out ALL clusters of an image.  However, if the
>> image cannot use the fast code path (true if the image is format
>> 0.10, or if the image contains a snapshot), the cluster size is
>> larger than 512, and the image is larger than 2G in size, then our
>> choice of sector_step causes problems.  Since it is not cluster
>> aligned, but qcow2_discard_clusters() silently ignores an unaligned
>> head or tail, we are leaving clusters allocated.
>>
>> Enhance the testsuite to expose the flaw, and patch the problem by
>> ensuring our step size is aligned.

>> +++ b/tests/qemu-iotests/097
>> @@ -46,7 +46,7 @@ _supported_proto file
>>  _supported_os Linux
>>
>>
>> -# Four passes:
>> +# Four main passes:
>>  #  0: Two-layer backing chain, commit to upper backing file (implicitly)
>>  # (in this case, the top image will be emptied)
>>  #  1: Two-layer backing chain, commit to upper backing file (explicitly)
>> @@ -56,22 +56,30 @@ _supported_os Linux
>>  #  3: Two-layer backing chain, commit to lower backing file
>>  # (in this case, the top image will implicitly stay unchanged)
>>  #
>> +# Each pass is run twice, since qcow2 has different code paths for cleaning
>> +# an image depending on whether it has a snapshot.
>> +#
>>  # 020 already tests committing, so this only tests whether image chains are
>>  # working properly and that all images above the base are emptied; 
>> therefore,
>> -# no complicated patterns are necessary
>> +# no complicated patterns are necessary.  Check near the 2G mark, as qcow2
>> +# has been buggy at that boundary in the past.
>>  for i in 0 1 2 3; do
>> +for j in 0 1; do
>>
>>  echo
>> -echo "=== Test pass $i ==="
>> +echo "=== Test pass $i.$j ==="

> 
> I skimmed the test changes, but it looks good.

By the way, it was test 0.0 that fails without the change to qcow2.c
(the changes added the x.0 passes; the pre-existing x.1 passes all
succeed because they use the fast path, and passes 1-3.y succeed because
they aren't clearing all clusters).

Maybe I could have mentioned that in the commit message, and/or split
this into two patches to make the test changes have an easier diff (one
to shift the portion of the test image being written from 0-0x3 out
to 0x7ffd-0x8000, the other to double the number of passes). But
at this point, it's probably not worth a respin if we're going to get it
in 2.8; but I'm okay if the maintainer wants to add the above paragraph
into the commit message.

> 
> Reviewed-by: John Snow 
> 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH for-2.8 v2] qcow2: Don't strand clusters near 2G intervals during commit

2016-12-06 Thread Kevin Wolf
Am 05.12.2016 um 16:49 hat Eric Blake geschrieben:
> The qcow2_make_empty() function is reached during 'qemu-img commit',
> in order to clear out ALL clusters of an image.  However, if the
> image cannot use the fast code path (true if the image is format
> 0.10, or if the image contains a snapshot), the cluster size is
> larger than 512, and the image is larger than 2G in size, then our
> choice of sector_step causes problems.  Since it is not cluster
> aligned, but qcow2_discard_clusters() silently ignores an unaligned
> head or tail, we are leaving clusters allocated.
> 
> Enhance the testsuite to expose the flaw, and patch the problem by
> ensuring our step size is aligned.
> 
> Signed-off-by: Eric Blake 

Thanks, applied to the block branch.

Kevin



Re: [Qemu-block] [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-12-06 Thread Stefan Hajnoczi
On Tue, Dec 6, 2016 at 1:42 PM, Kevin Wolf  wrote:
> Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:
>>  Error *local_err = NULL;
>>
>>  bdrv_invalidate_cache_all(_err);
>
> So in summary, this is a horrible patch because it checks the wrong
> thing, and for I can't really say if it covers everything it needs to
> cover, but arguably it happens to correctly fix the outcome of a
> previously failing case.
>
> Normally I would reject such a patch and require a clean solution, but
> then we're on the day of -rc3, so if you can't send v2 right away, we
> might not have the time for it.
>
> Tough call...

This bug is not a release blocker because it was already in QEMU 2.7
and not a regression.

Let's take time to understand this fully.  There could be related bugs
that can be fixed if this area of QEMU is scrutinized.  We can merge
fixes in -stable when consensus is reached.

Stefan



Re: [Qemu-block] [Qemu-devel] [PATCH] migration: re-active images when migration fails to complete

2016-12-06 Thread Kevin Wolf
Am 19.11.2016 um 12:43 hat zhanghailiang geschrieben:
> commit fe904ea8242cbae2d7e69c052c754b8f5f1ba1d6 fixed a case
> which migration aborted QEMU because it didn't regain the control
> of images while some errors happened.
> 
> Actually, we have another case in that error path to abort QEMU
> because of the same reason:
> migration_thread()
> migration_completion()
>bdrv_inactivate_all() > inactivate images
>qemu_savevm_state_complete_precopy()
>socket_writev_buffer() > error because destination 
> fails
>  qemu_fflush() ---> set error on migration stream
>qemu_mutex_unlock_iothread() --> unlock
> qmp_migrate_cancel() -> user cancelled migration
> migrate_set_state() --> set migrate CANCELLING

Important to note here: qmp_migrate_cancel() is executed by a concurrent
thread, it doesn't depend on any code paths in migration_completion().

> migration_completion() -> go on to fail_invalidate
> if (s->state == MIGRATION_STATUS_ACTIVE) -> Jump this branch
> migration_thread() ---> break migration loop
>   vm_start() -> restart guest with inactive
> images
> We failed to regain the control of images because we only regain it
> while the migration state is "active", but here users cancelled the migration
> when they found some errors happened (for example, libvirtd daemon is shutdown
> in destination unexpectedly).
> 
> Signed-off-by: zhanghailiang 
> ---
>  migration/migration.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index f498ab8..0c1ee6d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1752,7 +1752,8 @@ fail_invalidate:
>  /* If not doing postcopy, vm_start() will be called: let's regain
>   * control on images.
>   */
> -if (s->state == MIGRATION_STATUS_ACTIVE) {

This if condition tries to check whether we ran the code path that
called bdrv_inactivate_all(), so that we only try to reactivate images
it if we really inactivated them first.

The problem with it is that it ignores a possible concurrent
modification of s->state.

> +if (s->state == MIGRATION_STATUS_ACTIVE ||
> +s->state == MIGRATION_STATUS_CANCELLING) {

This adds another state that we could end up with with a concurrent
modification, so that even in this case we undo the inactivation.

However, it is no longer limited to the cases where we inactivated the
image. It also applies to other code paths (like the postcopy one) where
we didn't inactivate images.

What saves the patch is that bdrv_invalidate_cache() is a no-op for
block devices that aren't inactivated, so calling it more often than
necessary is okay.

But then, if we're going to rely on this, it would be much better to
just remove the if altogether. I can't say whether there are any other
possible values of s->state that we should consider, and by removing the
if we would be guaranteed to catch all of them.

If we don't want to rely on it, just keep a local bool that remembers
whether we inactivated images and check that here.

>  Error *local_err = NULL;
>  
>  bdrv_invalidate_cache_all(_err);

So in summary, this is a horrible patch because it checks the wrong
thing, and for I can't really say if it covers everything it needs to
cover, but arguably it happens to correctly fix the outcome of a
previously failing case.

Normally I would reject such a patch and require a clean solution, but
then we're on the day of -rc3, so if you can't send v2 right away, we
might not have the time for it.

Tough call...

Kevin



[Qemu-block] [PATCH] qemu-img: fix in-flight count for qemu-img bench

2016-12-06 Thread Paolo Bonzini
With aio=native (qemu-img bench -n) one or more requests can be completed
when a new request is submitted.  This in turn can cause bench_cb to
recurse before b->in_flight is updated.  The blk_aio_pwritev coroutines
are never freed, and qemu-img aborts.

Signed-off-by: Paolo Bonzini 
---
 qemu-img.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 6949b73..607dbe5 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3559,6 +3559,9 @@ static void bench_cb(void *opaque, int ret)
 }
 
 while (b->n > b->in_flight && b->in_flight < b->nrreq) {
+b->in_flight++;
+b->offset += b->step;
+b->offset %= b->image_size;
 if (b->write) {
 acb = blk_aio_pwritev(b->blk, b->offset, b->qiov, 0,
   bench_cb, b);
@@ -3570,9 +3573,6 @@ static void bench_cb(void *opaque, int ret)
 error_report("Failed to issue request");
 exit(EXIT_FAILURE);
 }
-b->in_flight++;
-b->offset += b->step;
-b->offset %= b->image_size;
 }
 }
 
-- 
2.9.3




Re: [Qemu-block] Meeting notes on -blockdev, dynamic backend reconfiguration

2016-12-06 Thread Stefan Hajnoczi
On Mon, Dec 05, 2016 at 01:03:50PM +0100, Markus Armbruster wrote:
> I recently met Kevin, and we discussed two block layer topics in some
> depth.

Thanks for sharing this.  Both topics were a good read and the direction
you are heading in looks good.

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-block] [Nbd] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-06 Thread Alex Bligh

> On 6 Dec 2016, at 08:46, Alex Bligh  wrote:
> 
> I would support this.
> 
> In fact the patch is sufficiently simple I think I'd merge this
> into extension-write-zeroes then merge that into master.

Hence:

Reviewed-By: Alex Bligh 

-- 
Alex Bligh







Re: [Qemu-block] [Nbd] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-06 Thread Alex Bligh

> On 6 Dec 2016, at 09:25, Kevin Wolf  wrote:
> 
> Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
>> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
>> team discovered that it is useful if a server can advertise
>> whether an export is in a known-all-zeroes state at the time
>> the client connects.
> 
> Does a server usually have the information to set this flag, other than
> querying the block status of all blocks at startup? 

The server may have other ways of knowing this, for instance
that it has just created the file (*), or that it stat'd the file
before opening it (not unlikely) and noticed it had 0 allocated
size. The latter I suspect would be trivial to implement in nbd-server

(*) = e.g. I had one application where nbd use the export path
to signify it wanted to open a temporary file, the path consisting
of a UUID and an encoded length. If the file was not present already
it created it with ftruncate(). That could trivially have used this.

> If so, the client could just query this by itself.

Well there's no currently mainlined extension to do that, but yes
it could. On the other hand I see no issue passing complete
zero status back to the client if it's so obvious from a stat().

-- 
Alex Bligh







Re: [Qemu-block] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-06 Thread Stefan Hajnoczi
On Mon, Dec 05, 2016 at 05:42:35PM -0600, Eric Blake wrote:
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.
> 
> Signed-off-by: Eric Blake 
> ---
>  doc/proto.md | 5 +
>  1 file changed, 5 insertions(+)
> 
> This replaces the following qemu patch attempt:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
> which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
> semantics in this proposal should be much better.
> 
> Patch is to the merge of the master branch and the
> extension-write-zeroes branch.  By the way, qemu 2.8 is due
> to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
> so maybe it is time to consider promoting the extension-write-zeroes
> branch into master.

Useful, thanks!

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [Qemu-block] [PULL for-2.8 0/3] Block patches for -rc3

2016-12-06 Thread Stefan Hajnoczi
On Mon, Dec 05, 2016 at 04:31:58PM -0500, Jeff Cody wrote:
> The following changes since commit bc66cedb4141fb7588f2462c74310d8fb5dd4cf1:
> 
>   Merge remote-tracking branch 'yongbok/tags/mips-20161204' into staging 
> (2016-12-05 10:56:45 +)
> 
> are available in the git repository at:
> 
>   https://github.com/codyprime/qemu-kvm-jtc.git tags/block-pull-request
> 
> for you to fetch changes up to 76b5550f709b975a7b04fb4c887f300b7bb731c2:
> 
>   qemu-doc: update gluster protocol usage guide (2016-12-05 16:30:29 -0500)
> 
> 
> Gluster block patches for -rc3
> 
> 
> Prasanna Kumar Kalever (3):
>   block/gluster: fix QMP to match debug option
>   block/nfs: fix QMP to match debug option
>   qemu-doc: update gluster protocol usage guide
> 
>  block/gluster.c  | 38 -
>  block/nfs.c  |  4 ++--
>  qapi/block-core.json |  8 +++
>  qemu-doc.texi| 59 
> +++-
>  qemu-options.hx  | 25 --
>  5 files changed, 93 insertions(+), 41 deletions(-)

BlockdevOptionsGluster.debug(-level) does not have "Added in 2.8" so I
had to dig through git-blame(1) to verify that it was indeed added in
the current release cycle.

In the future please make sure all QAPI changes are marked by version.
If there tricky changes you can include a statement showing you are
aware of QAPI backwards compatibility ("These new options were added in
the 2.8 release cycle and can therefore still be changed without
breaking backward compatibility").  This will make me confident that
you've checked the QAPI changes.

Thanks, applied to my staging tree:
https://github.com/stefanha/qemu/commits/staging

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-block] [Qemu-devel] [RFC PATCH] glusterfs: allow partial reads

2016-12-06 Thread Kevin Wolf
Am 05.12.2016 um 09:26 hat Wolfgang Bumiller geschrieben:
> On Fri, Dec 02, 2016 at 01:13:28PM -0600, Eric Blake wrote:
> > On 12/01/2016 04:59 AM, Wolfgang Bumiller wrote:
> > > Fixes #1644754.
> > > 
> > > Signed-off-by: Wolfgang Bumiller 
> > > ---
> > > I'm not sure what the original rationale was to treat both partial
> > > reads as well as well as writes as I/O error. (Seems to have happened
> > > from original glusterfs v1 to v2 series with a note but no reasoning
> > > for the read side as far as I could see.)
> > > The general direction lately seems to be to move away from sector
> > > based block APIs. Also eg. the NFS code allows partial reads. (It
> > > does, however, have an old patch (c2eb918e3) dedicated to aligning
> > > sizes to 512 byte boundaries for file creation for compatibility to
> > > other parts of qemu like qcow2. This already happens in glusterfs,
> > > though, but if you move a file from a different storage over to
> > > glusterfs you may end up with a qcow2 file with eg. the L1 table in
> > > the last 80 bytes of the file aligned to _begin_ at a 512 boundary,
> > > but not _end_ at one.)

Hm, does this really happen? I always thought that the file size of
qcow2 images is aligned to the cluster size. If it isn't, maybe we
should fix that.

> > >  block/gluster.c | 10 +-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/gluster.c b/block/gluster.c
> > > index 891c13b..3db0bf8 100644
> > > --- a/block/gluster.c
> > > +++ b/block/gluster.c
> > > @@ -41,6 +41,7 @@ typedef struct GlusterAIOCB {
> > >  int ret;
> > >  Coroutine *coroutine;
> > >  AioContext *aio_context;
> > > +bool is_write;
> > >  } GlusterAIOCB;
> > >  
> > >  typedef struct BDRVGlusterState {
> > > @@ -716,8 +717,10 @@ static void gluster_finish_aiocb(struct glfs_fd *fd, 
> > > ssize_t ret, void *arg)
> > >  acb->ret = 0; /* Success */
> > >  } else if (ret < 0) {
> > >  acb->ret = -errno; /* Read/Write failed */
> > > +} else if (acb->is_write) {
> > > +acb->ret = -EIO; /* Partial write - fail it */
> > >  } else {
> > > -acb->ret = -EIO; /* Partial read/write - fail it */
> > > +acb->ret = 0; /* Success */
> > 
> > Does this properly guarantee that the portion beyond EOF reads as zero?
> 
> I'd argue this wasn't necessarily the case before either, considering
> the first check starts with `!ret`:
> 
> if (!ret || ret == acb->size) {
> acb->ret = 0; /* Success */
> 
> A read right at EOF would return 0 and be treated as success there, no?

Yes, this is a bug.

I guess this was the lazy way that "usually" works both for
reads/writes, which return a positive number of bytes, and for things
like flush which return 0 on success. But the callback really needs to
distinguish these cases and apply different checks.

> Iow. it wouldn't zero out the destination buffer as far as I can see.
> Come to think of it, I'm not too fond of this part of the check for the
> write case either.

raw-posix treats short reads as success, too, but it zeroes out the
missing part. Note that it also loops after a short read and only if it
reads 0 bytes then, it returns success. If an error is returned after
the short read, the whole function returns an error. Is this necessary
for gluster, too?

> > Would it be better to switch to byte-based interfaces rather than
> > continue to force gluster interaction in 512-byte sector chunks,
> > since gluster can obviously store files that are not 512-aligned?

The gluster I/O functions are byte-based anyway, and the driver already
implements .bdrv_co_readv, so going to .bdrv_co_preadv should be
trivial. Probably the best solution here indeed.

Kevin



Re: [Qemu-block] [PATCH for-2.8 v2] qcow2: Don't strand clusters near 2G intervals during commit

2016-12-06 Thread Stefan Hajnoczi
On Mon, Dec 05, 2016 at 09:49:34AM -0600, Eric Blake wrote:
> The qcow2_make_empty() function is reached during 'qemu-img commit',
> in order to clear out ALL clusters of an image.  However, if the
> image cannot use the fast code path (true if the image is format
> 0.10, or if the image contains a snapshot), the cluster size is
> larger than 512, and the image is larger than 2G in size, then our
> choice of sector_step causes problems.  Since it is not cluster
> aligned, but qcow2_discard_clusters() silently ignores an unaligned
> head or tail, we are leaving clusters allocated.
> 
> Enhance the testsuite to expose the flaw, and patch the problem by
> ensuring our step size is aligned.
> 
> Signed-off-by: Eric Blake 
> 
> ---
> v2: perform rounding correctly
> ---
>  block/qcow2.c  |   3 +-
>  tests/qemu-iotests/097 |  41 +---
>  tests/qemu-iotests/097.out | 249 
> +
>  3 files changed, 210 insertions(+), 83 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [Qemu-block] [PATCH v5 7/9] block: don't make snapshots for filters

2016-12-06 Thread Pavel Dovgalyuk
> From: Kevin Wolf [mailto:kw...@redhat.com]
> Am 05.12.2016 um 12:49 hat Pavel Dovgalyuk geschrieben:
> > > From: Kevin Wolf [mailto:kw...@redhat.com]
> > > Am 05.12.2016 um 08:43 hat Pavel Dovgalyuk geschrieben:
> 
> > Record/replay without this option uses '-snapshot' to preserve
> > the state of the disk images.
> >
> > > Anyway, it seems that doing things manually is the safe way as long as
> > > we don't know the final solution, so I think I agree.
> > >
> > > For a slightly more convenient way, one of the problems to solve seems
> > > to be that snapshot=on always affects the top level node and you can't
> > > create a temporary snapshot in the middle of the chain. Perhaps we
> > > should introduce a 'temporary-overlay' driver or something like that, so
> > > that you could specify things like this:
> > >
> > > -drive if=none,driver=file,filename=test.img,id=orig
> > > -drive if=none,driver=temporary-overlay,file=orig,id=snap
> > > -drive if=none,driver=blkreplay,image=snap
> >
> > This seems reasonable for manual way.
> 
> Maybe another, easier to implement option could be something like this:
> 
> -drive 
> if=none,driver=file,filename=test.img,snapshot=on,overlay.node-name=snap
> -drive if=none,driver=blkreplay,image=snap
> 
> It would require that we implement support for overlay.* options like we
> already support backing.* options. Allowing to specify options for the
> overlay node is probably nice to have anyway.
> 
> However, there could be a few tricky parts there. For example, what
> happens if someone uses overlay.backing=something-else? Perhaps
> completely disallowing backing and backing.* for overlays would already
> solve this.
> 
> > > Which makes me wonder... Is blkreplay usable without the temporary
> > > snapshot or is this pretty much a requirement?
> >
> > It's not a requirement. But to make replay deterministic we have to
> > start with the same image every time. As I know, this may be achieved by:
> > 1. Restoring original disk image manually
> > 2. Using vm snapshot to start execution from
> > 3. Using -snapshot option
> > 4. Not using disks at all
> >
> > > Because if it has to be
> > > there, the next step could be that blkreplay creates temporary-overlay
> > > internally in its .bdrv_open().
> >
> > Here is your answer about such an approach :)
> > https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg04687.html
> 
> Right, and unfortunately these are still good points.
> 
> Especially the part where you allowed to give the overlay filename
> really needs to work the way it does now with the 'image' option. We
> might not need to be that strict with temporary overlays, restricting to
> qcow2 with default options could be acceptable there - but whatever I
> think of to support both cases results in something that isn't really
> easier than the manual way that we figured out above.

Can we stop on the following?
1. Don't create any overlays automatically when user wants to save/restore VM 
state
2. In the opposite case create snapshots, but do not use -snapshot option.
   Snapshots will be created by the blkreplay as in the link specified.

Pavel Dovgalyuk




Re: [Qemu-block] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-06 Thread Kevin Wolf
Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.

Does a server usually have the information to set this flag, other than
querying the block status of all blocks at startup? If so, the client
could just query this by itself.

The patch that was originally sent to qemu-devel just forwarded qemu's
.bdrv_has_zero_init() call to the server. However, what this function
returns is not a known-all-zeroes state on open, but just a
known-all-zeroes state immediately after bdrv_create(), i.e. creating a
new image. Then it becomes information that is easy to get and doesn't
involve querying all blocks (e.g. true for COW image formats, true for
raw on regular files, false for raw on block devices).

This is useful for 'qemu-img convert', which creates an image and then
writes the whole contents, but I'm not sure if this property is
applicable for NBD, which I think doesn't even have a create operation.

Kevin



Re: [Qemu-block] [Nbd] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-06 Thread Alex Bligh

> On 5 Dec 2016, at 23:42, Eric Blake  wrote:
> 
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.

I think this is good to go, and ...

> Signed-off-by: Eric Blake 
> ---
> doc/proto.md | 5 +
> 1 file changed, 5 insertions(+)
> 
> This replaces the following qemu patch attempt:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
> which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
> semantics in this proposal should be much better.
> 
> Patch is to the merge of the master branch and the
> extension-write-zeroes branch.  By the way, qemu 2.8 is due
> to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
> so maybe it is time to consider promoting the extension-write-zeroes
> branch into master.

I would support this.

In fact the patch is sufficiently simple I think I'd merge this
into extension-write-zeroes then merge that into master.

Wouter?

Alex

> diff --git a/doc/proto.md b/doc/proto.md
> index afe71fc..7e4ec7f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -697,6 +697,11 @@ The field has the following format:
>   the export.
> - bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
>   `BLOCK_STATUS` 
> [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
> +- bit 10, `NBD_FLAG_INIT_ZEROES`: Indicates that the server guarantees
> +  that at the time transmission phase begins, all offsets within the
> +  export read as zero bytes.  Clients MAY use this information to
> +  avoid writing to sections of the export that should still read as
> +  zero after the client is done writing.
> 
> Clients SHOULD ignore unknown flags.
> 
> -- 
> 2.9.3
> 
> 
> --
> Developer Access Program for Intel Xeon Phi Processors
> Access to Intel Xeon Phi processor-based developer platforms.
> With one year of Intel Parallel Studio XE.
> Training and support from Colfax.
> Order your platform today.http://sdm.link/xeonphi
> ___
> Nbd-general mailing list
> nbd-gene...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general
> 

-- 
Alex Bligh