Re: [Qemu-block] [PATCH 2/2] nbd/server: add assert to nbd_negotiate_handle_info

2017-11-02 Thread Vladimir Sementsov-Ogievskiy

02.11.2017 21:06, Eric Blake wrote:

On 11/01/2017 10:42 AM, Vladimir Sementsov-Ogievskiy wrote:

Add an assert here to make last length assignment meaningful and
following return without tail dropping obvious.

Not quite sure I followed your explanation, which means it's difficult
for me to propose an alternative wording.  Maybe:

Add an assert here to make it obvious that the prior loop consumed the
rest of the input, and that all further code in the function is focused
on output.

On the other hand, if you are okay with it, I wouldn't mind squashing
the first and second patches into one (as the first patch is then easier
to read when it is obvious that we used the wrong length variable).  But
until I get your feedback (on either squashing the two patches or
tweaking the wording), I'm just placing your patches as-is on my NBD
queue for inclusion prior to rc0.



In the commit message I mean two things:
1. the assignment to length in the previous loop is useless without an assert
2. at first I thought that the following return statement was a bug because it
doesn't drop the payload tail, and only then did I notice the previous
"if (requests != length / sizeof(request))".

With an assert the correctness of the code is more evident.

However, you may reword or squash it as you want; either way is fine with me.
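
For readers following the thread, here is a minimal, self-contained model of
the logic in question -- not qemu's actual nbd/server.c, just a sketch of why
the loop's last update of length is only meaningful together with the assert:

    /* Schematic model only; byte order and error handling simplified. */
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static int handle_info(const uint8_t *buf, size_t length)
    {
        uint16_t requests, request;

        if (length < sizeof(requests)) {
            return -1;                              /* short packet */
        }
        memcpy(&requests, buf, sizeof(requests));   /* host order; model only */
        buf += sizeof(requests);
        length -= sizeof(requests);

        /* The earlier check Vladimir refers to: sizes must match exactly. */
        if (requests != length / sizeof(request)) {
            return -1;
        }
        for (uint16_t i = 0; i < requests; i++) {
            memcpy(&request, buf, sizeof(request));
            buf += sizeof(request);
            length -= sizeof(request);              /* the "last assignment" */
        }
        /* The assert in question: the loop consumed all input, so the
         * return path below needs no tail-dropping. */
        assert(length == 0);
        return 0;
    }

    int main(void)
    {
        /* requests = 2, followed by two 2-byte request codes */
        uint16_t msg[3] = { 2, 1, 3 };
        printf("%d\n", handle_info((const uint8_t *)msg, sizeof(msg)));
        return 0;
    }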




Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  nbd/server.c | 1 +
  1 file changed, 1 insertion(+)

Reviewed-by: Eric Blake 




--
Best regards,
Vladimir




Re: [Qemu-block] ping Re: [PATCH v2 0/5] backup improvements part 1

2017-11-02 Thread Vladimir Sementsov-Ogievskiy

02.11.2017 23:52, John Snow wrote:


On 10/31/2017 06:26 AM, Vladimir Sementsov-Ogievskiy wrote:

Hey, what about this?


Sorry Vladimir, unexpected leave of absence from work. It's at the top
of my pile.

I invite others to come take a look, too.


No problems, thank you.

--
Best regards,
Vladimir




Re: [Qemu-block] [Qemu-devel] [PATCH v6 00/24] make bdrv_get_block_status byte-based

2017-11-02 Thread Eric Blake
On 11/02/2017 04:01 PM, John Snow wrote:

>>> The overall conversion currently looks like:
>>> part 1: bdrv_is_allocated (merged, commit 51b0a488)
>>> part 2: dirty-bitmap (merged, commit ca759622)
>>> part 3: bdrv_get_block_status (this series, v5 at [1])
>>> part 4: .bdrv_co_block_status (v3 is posted [2], mostly reviewed, but needs rebase)
>>
>> Thanks, applied to the block branch.
>>
>> Kevin
>>
> 
> Thanks;
> 
> Eric: does this just leave part four?

Yes; the latest version of part four is v4:
https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg02940.html

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH v6 16/25] block: Add 'base-directory' BDS option

2017-11-02 Thread Eric Blake
On 11/02/2017 11:07 AM, Max Reitz wrote:
> On 2017-11-02 15:40, Alberto Garcia wrote:
>> On Fri 29 Sep 2017 06:53:38 PM CEST, Max Reitz wrote:
>>> Using this option, one can directly override what bdrv_dirname() will
>>> return. This is useful if one uses e.g. qcow2 on top of quorum (with
>>> only protocol BDSs under the quorum BDS) and wants to be able to use
>>> relative backing filenames.
>>>
>>> Signed-off-by: Max Reitz 
>>
>> Who would be using this option in practice? I understand that management
>> tools should always be using absolute filenames, and this API doesn't
>> seem so convenient for humans.
> 
> Hmmm.  I seem to recall Eric liked it a couple of years ago when he was
> still more on the libvirt side of things. :-)

Ideally, libvirt should be using node names rather than filenames
(whether absolute or relative); but I still think it could be something
that turns out to be useful.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] RFC: use case for adding QMP, block jobs & multiple exports to qemu-nbd ?

2017-11-02 Thread Max Reitz
On 2017-11-02 13:02, Daniel P. Berrange wrote:
[...]
> One alternative approach to doing this would be to suggest that we should
> instead just spawn qemu-system-x86_64 with '--machine none' and use that
> as a replacement for qemu-nbd, since it already has a built-in NBD server
> which can do many exports at once and arbitrary block jobs.

As far as I know, we had wanted to add QMP support to qemu-nbd maybe one
or two years ago, but nobody ever did it.

I've had some discussions about this with Markus and Kevin at KVM Forum.
 They appeared to strongly prefer this approach.  I agree with them that
design-wise, a qemu with no machine at all (and not even -M none) and
early QMP is the way we want to go anyway, and then this would be the
correct tool to use.

> I'm concerned that this could end up being a game of whack-a-mole
> though, constantly trying to cut out/down all the bits of system emulation
> in the machine emulators to get its resource overhead to match the low
> overhead of standalone qemu-nbd.

However, I personally share your concern.  In particular, I think that
getting to a point where we can have no machine at all and early QMP
will take much longer than just adding QMP to qemu-nbd -- or adding a
qmp command to qemu-img (because you can add NBD exports through QMP, so
qemu-nbd's hardcoded focus on NBD exports seems kind of weird then)[1].

I'm very much torn here.  There are two approaches: Stripping fat qemu
down, or fattening lean qemu-img (?) up.  The latter is very simple.
The former is what we want anyway.

Markus says it's not too hard to strip down qemu.  If that is true,
there is no point in fattening qemu-img now.  I personally am not
convinced at all, but he knows the state of that project much better
than me, so I cannot reasonably disagree.

So my mail is more of a CC to Markus and Kevin -- but I think both are
on PTO right now.

I guess the main question is: If someone were to introduce a qemu-img
QMP subcommand -- would it be accepted? :-)
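
Purely as a hypothetical illustration of that idea (no such subcommand exists
today, and the flag below is invented for the sake of the question):

    # hypothetical invocation -- qemu-img currently has no 'qmp' subcommand
    qemu-img qmp --socket /run/qemu-img-qmp.sock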

Max


[1] Also, adding QMP should trivially add block jobs and multiple
exports to whatever tool we are talking about (in fact, qemu-img already
does perform the mirror block job for committing).
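
(For reference, the synchronous interface to that machinery today; the file
name is just an example:

    qemu-img commit -f qcow2 vm-a-disk1.qcow2

which merges the overlay's data down into its backing file.)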





Re: [Qemu-block] [Qemu-devel] [PATCH v9 0/8] Optimize VMDK I/O by allocating multiple clusters

2017-11-02 Thread John Snow


On 10/20/2017 02:53 AM, Ashijeet Acharya wrote:
> 
> On Fri, Oct 20, 2017 at 11:58, Fam Zheng wrote:
> 
> On Mon, 10/09 22:12, Fam Zheng wrote:
> > On Mon, 10/09 18:29, Ashijeet Acharya wrote:
> > > Optimization test results:
> > >
> > > This patch series improves 128 KB sequential write performance to an
> > > empty VMDK file by 54%
> > >
> > > Benchmark command: ./qemu-img bench -w -c 1024 -s 128K -d 1 -t
> none -f
> > > vmdk test.vmdk
> > >
> > > Changes in v9:
> > > - rebase the series
> >
> > Thanks, looks good to me, applied:
> >
> > https://github.com/famz/qemu/tree/staging
> 
> Ashijeet: I've been testing my branch and it seems installing
> Fedora/CentOS to a VMDK image is broken with your patches applied. Both
> guest and QEMU are responsive, but the installation of packages stops
> making progress at some point:
> 
> Installing rootfiles.noarch (317/318)
> Installing langpacks-en.noarch (318/318)
> Performing post-installation setup tasks
> Configuring fedora-release.noarch
> Configuring filesystem.x86_64
> Configuring GeoIP-GeoLite-data.noarch
> Configuring python3.x86_64
> Configuring fedora-logos.x86_64
> Configuring kernel-core.x86_64
> 
> # hang here
> 
> Can you reproduce this on your machine?
> 
> My command line is something like this:
> 
> qemu-system-x86_64 -enable-kvm -cpu host -m 1G -qmp
> unix:/home/fam/.q/qemu-8DOC9EF4/qmp,server,nowait -name 8DOC9EF4
> -netdev user,id=vnet,hostfwd=:0.0.0.0:10022-:22
> -device virtio-net-pci,netdev=vnet -drive
> file=/var/tmp/test2.vmdk,if=none,id=drive-1,cache=none,aio=native
> -device virtio-blk-pci,drive=drive-1 -cdrom
> /stor/iso/CentOS-6.9-x86_64-minimal.iso -pidfile
> /home/fam/.q/qemu-8DOC9EF4/pid
> 
> qemu.git master doesn't have this problem. So I'll drop this series
> from the pull request until it is resolved.
> 
> 
> Fam: Alright, I will look into this but I cannot give you a deadline,
> unfortunately. I will try my best to resolve this as soon as possible, though.
> 
> Ashijeet
> 
> 
> 
> Fam
> 

Do we need to temporarily roll this back for the 2.11 release if it
cannot be addressed in time?



Re: [Qemu-block] [Qemu-devel] [PATCH v6 00/24] make bdrv_get_block_status byte-based

2017-11-02 Thread John Snow


On 10/20/2017 12:45 PM, Kevin Wolf wrote:
> Am 12.10.2017 um 05:46 hat Eric Blake geschrieben:
>> There are patches floating around to add NBD_CMD_BLOCK_STATUS,
>> but NBD wants to report status on byte granularity (even if the
>> reporting will probably be naturally aligned to sectors or even
>> much higher levels).  I've therefore started the task of
>> converting our block status code to report at a byte granularity
>> rather than sectors.
>>
>> Now that 2.11 is open, I'm rebasing/reposting the remaining patches.
>>
>> The overall conversion currently looks like:
>> part 1: bdrv_is_allocated (merged, commit 51b0a488)
>> part 2: dirty-bitmap (merged, commit ca759622)
>> part 3: bdrv_get_block_status (this series, v5 at [1])
>> part 4: .bdrv_co_block_status (v3 is posted [2], mostly reviewed, but needs rebase)
> 
> Thanks, applied to the block branch.
> 
> Kevin
> 

Thanks;

Eric: does this just leave part four?



Re: [Qemu-block] [Qemu-devel] [PATCH] hw/ide/ahci: Move allwinner code into a separate file

2017-11-02 Thread John Snow


On 10/23/2017 02:48 PM, Thomas Huth wrote:
> The allwinner code is only needed for the allwinner board (for which
> we also have a separate CONFIG_ALLWINNER_A10 config switch), so it
> does not make sense that we compile this for all the other boards
> that need AHCI, too. Let's move it to a separate file that is only
> compiled when CONFIG_ALLWINNER_A10 is set.
> 
> Signed-off-by: Thomas Huth 

Belated, as it's already merged to master, but:

Thanks, applied to my IDE tree:

https://github.com/jnsnow/qemu/commits/ide
https://github.com/jnsnow/qemu.git

--js



Re: [Qemu-block] [Qemu-devel] [PATCH] block/snapshot: dirty all dirty bitmaps on snapshot-switch

2017-11-02 Thread John Snow


On 10/23/2017 05:29 AM, Vladimir Sementsov-Ogievskiy wrote:
> Snapshot-switch actually changes the active state of the disk, so it should
> be reflected in the dirty bitmaps. Otherwise the next incremental backup using
> these bitmaps will be invalid.
> 

Good call. I knew that snapshots weren't compatible, but this makes it
more obvious and less error-prone to an end user.

> Signed-off-by: Vladimir Sementsov-Ogievskiy 

Reviewed-by: John Snow 

> ---
>  block/snapshot.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/block/snapshot.c b/block/snapshot.c
> index a46564e7b7..1d5ab5f90f 100644
> --- a/block/snapshot.c
> +++ b/block/snapshot.c
> @@ -181,10 +181,24 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
>  {
>  BlockDriver *drv = bs->drv;
>  int ret, open_ret;
> +int64_t len;
>  
>  if (!drv) {
>  return -ENOMEDIUM;
>  }
> +
> +len = bdrv_getlength(bs);
> +if (len < 0) {
> +return len;
> +}
> +/* We should set all bits in all enabled dirty bitmaps, because dirty
> + * bitmaps reflect active state of disk and snapshot switch operation
> + * actually dirties active state.
> + * TODO: It may make sense not to set all bits but analyze block status of
> + * current state and destination snapshot and do not set bits corresponding
> + * to both-zero or both-unallocated areas. */
> +bdrv_set_dirty(bs, 0, len);
> +
>  if (drv->bdrv_snapshot_goto) {
>  return drv->bdrv_snapshot_goto(bs, snapshot_id);
>  }
> 




Re: [Qemu-block] ping Re: [PATCH v2 0/5] backup improvements part 1

2017-11-02 Thread John Snow


On 10/31/2017 06:26 AM, Vladimir Sementsov-Ogievskiy wrote:
> Hey, what about this?
> 

Sorry Vladimir, unexpected leave of absence from work. It's at the top
of my pile.

I invite others to come take a look, too.



Re: [Qemu-block] RFC: use case for adding QMP, block jobs & multiple exports to qemu-nbd ?

2017-11-02 Thread Paolo Bonzini
On 02/11/2017 13:02, Daniel P. Berrange wrote:
> 
> After all that long background explanation, what I'm wondering is whether
> there is any interest / desire to extend qemu-nbd to have more advanced
> featureset than simply exporting a single disk image which must be listed
> at startup time.
> 
>  - Ability to start qemu-nbd up with no initial disk image connected
>  - Option to have a QMP interface to control qemu-nbd
>  - Commands to add / remove individual disk image exports
>  - Commands for doing the drive-mirror / backing file pivot
> 
> It feels like this wouldn't require significant new functionality in either
> QMP or the block layer. It ought to be mostly a case of taking existing QMP
> code and wiring it up in qemu-nbd, and only exposing a whitelisted subset
> of existing QMP commands related to block backends.
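
For reference, the existing QMP commands such a whitelist would presumably
start from -- nbd-server-start, nbd-server-add and drive-mirror -- look like
this today (socket path and device name below are illustrative):

    { "execute": "nbd-server-start",
      "arguments": { "addr": { "type": "unix",
                               "data": { "path": "/run/nbd-export.sock" } } } }
    { "execute": "nbd-server-add",
      "arguments": { "device": "drive0", "writable": false } }
    { "execute": "drive-mirror",
      "arguments": { "device": "drive0", "target": "cache-disk1.qcow2",
                     "sync": "full", "mode": "existing" } }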

I think adding a QMP interface is a good idea; if you're using Unix
sockets I don't see much benefit in using multiple disk image exports
from a single qemu-nbd instance, but maybe you aren't?

At this point it does indeed feel a lot like --machine none.  Perhaps we
should just have a new binary /usr/bin/qemu-noguest instead of
augmenting qemu-nbd.

Would you also add rerror/werror support to qemu-nbd at the same time?

Paolo



Re: [Qemu-block] [PATCH 2/2] nbd/server: add assert to nbd_negotiate_handle_info

2017-11-02 Thread Eric Blake
On 11/01/2017 10:42 AM, Vladimir Sementsov-Ogievskiy wrote:
> Add an assert here to make last length assignment meaningful and
> following return without tail dropping obvious.

Not quite sure I followed your explanation, which means it's difficult
for me to propose an alternative wording.  Maybe:

Add an assert here to make it obvious that the prior loop consumed the
rest of the input, and that all further code in the function is focused
on output.

On the other hand, if you are okay with it, I wouldn't mind squashing
the first and second patches into one (as the first patch is then easier
to read when it is obvious that we used the wrong length variable).  But
until I get your feedback (on either squashing the two patches or
tweaking the wording), I'm just placing your patches as-is on my NBD
queue for inclusion prior to rc0.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  nbd/server.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH 1/2] nbd/server: fix nbd_negotiate_handle_info

2017-11-02 Thread Eric Blake
On 11/01/2017 10:42 AM, Vladimir Sementsov-Ogievskiy wrote:

It's best to send a 0/2 cover letter for a series, even when both
patches are small, as that helps automation tools.

> namelen should be here, lenght is unrelated.

s/lenght/length/

Broken since its introduction in commit f37708f6; hence adding qemu-stable in cc.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  nbd/server.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Eric Blake 

> 
> diff --git a/nbd/server.c b/nbd/server.c
> index 70b40ed27e..7fcec0af7e 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -433,7 +433,7 @@ static int nbd_negotiate_handle_info(NBDClient *client, uint32_t length,
>  
>  /* Don't bother sending NBD_INFO_NAME unless client requested it */
>  if (sendname) {
> -rc = nbd_negotiate_send_info(client, opt, NBD_INFO_NAME, length, name,
> +rc = nbd_negotiate_send_info(client, opt, NBD_INFO_NAME, namelen, name,

Interestingly enough, length == 0 at this point, so we would always
report that the export name is '' (aka the default export), without
actually being a protocol violation.  Doesn't hurt qemu as a client,
since we don't ask for NBD_INFO_NAME, but may break other NBD client
implementations, if they then use NBD_OPT_GO on the '' name expecting it
to resolve to the same non-empty name they just queried on NBD_OPT_INFO.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] RFC: use case for adding QMP, block jobs & multiple exports to qemu-nbd ?

2017-11-02 Thread Eric Blake
On 11/02/2017 12:04 PM, Daniel P. Berrange wrote:

> vm-a-disk1.qcow2 open - it's just a regular backing file setup.
> 
>>
>>> |  (format=qcow2, proto=file)
>>>   |
>>>   +-  vm-a-disk1.qcow2  (qemu-system-XXX)
>>>
>>> The problem is that many VMs want to use cache-disk1.qcow2 as
>>> their disk's backing file, and only one process is permitted to be
>>> writing to a disk backing file at any time.
>>
>> Can you explain a bit more about how many VMs are trying to write to
>> the same backing file 'cache-disk1.qcow2'?  I'd assume it's
>> just the "immutable" local backing store (once the previous 'mirror' job
>> is completed), based on which Nova creates a qcow2 overlay for each
>> instance it boots.
> 
> An arbitrary number of vm-*-disk1.qcow2 files could exist, all using
> the same cache-disk1.qcow2 image. It's only limited by how many VMs
> you can fit on the host. By definition you can only ever have a single
> process writing to a qcow2 file though, otherwise corruption will quickly
> follow.

So if I'm following, your argument is that the local qemu-nbd process is
the only one writing to the file, while all other overlays are backed by
the NBD process; and then as any one of the VMs reads, the qemu-nbd
process pulls those sectors into the local storage as a result.

> 
>> When I pointed this e-mail of yours to Matt Booth on Freenode Nova IRC
>> channel, he said the intermediate image (cache-disk1.qcow2) is a COR
>> (Copy-On-Read).  I realize what COR is -- every time you read a cluster
>> from the backing file, you write that locally, to avoid reading it
>> again.
> 
> qcow2 doesn't give you COR, only COW. So every read request would have a miss
> in cache-disk1.qcow2 and thus have to be fetched from master-disk1.qcow2. The
> use of drive-mirror to pull master-disk1.qcow2 contents into cache-disk1.qcow2
> makes up for the lack of COR by populating cache-disk1.qcow2 in the
> background.

Ah, but qcow2 (or more precisely, any protocol qemu BDS) DOES have
copy-on-read, built in to the block layer.  See qemu-iotest 197 for an
example of it in use.  If we use COR correctly, then every initial read
request will miss in the cache, but the COR will populate the cache
without having to have a background drive-mirror.  A background
drive-mirror may still be useful to populate the cache faster, but COR
populates the parts you want now regardless of how fast the background
task is running.
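
To illustrate the block-layer copy-on-read Eric mentions, this is how it is
enabled from the command line today (the file name follows the thread's
diagrams and is only an example; in Dan's topology the option would go on
whichever process opens cache-disk1.qcow2 read-write):

    qemu-system-x86_64 \
        -drive file=vm-a-disk1.qcow2,format=qcow2,if=virtio,copy-on-read=on

With copy-on-read=on, clusters read from the backing chain are written into
the image the option is set on, so subsequent reads are served locally.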

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH 4/4] qcow2: Add iotest for an empty refcount table

2017-11-02 Thread Max Reitz
On 2017-11-01 16:42, Alberto Garcia wrote:
> This patch adds a simple iotest in which we try to write to an image
> with an empty refcount table (i.e. with all entries set to 0).
> 
> This scenario was already handled by the existing consistency checks,
> but we add an explicit test case for completeness.
> 
> Signed-off-by: Alberto Garcia 
> ---
>  tests/qemu-iotests/060 | 7 +++
>  tests/qemu-iotests/060.out | 6 ++
>  2 files changed, 13 insertions(+)

Reviewed-by: Max Reitz 





Re: [Qemu-block] [PATCH 3/4] qcow2: Don't open images with header.refcount_table_clusters == 0

2017-11-02 Thread Max Reitz
On 2017-11-01 16:42, Alberto Garcia wrote:
> qcow2_do_open() is checking that header.refcount_table_clusters is not
> too large, but it doesn't check that it's greater than zero. Apart
> from the fact that an image like that is obviously corrupted, trying
> to use it crashes QEMU since we end up with a null s->refcount_table
> after qcow2_refcount_init().
> 
> These images can however be repaired, so allow opening them if the
> BDRV_O_CHECK flag is set.
> 
> Signed-off-by: Alberto Garcia 
> ---
>  block/qcow2.c  | 6 ++
>  tests/qemu-iotests/060 | 7 +++
>  tests/qemu-iotests/060.out | 5 +
>  3 files changed, 18 insertions(+)

Reviewed-by: Max Reitz 





Re: [Qemu-block] [PATCH 2/4] qcow2: Prevent allocating L2 tables at offset 0

2017-11-02 Thread Max Reitz
On 2017-11-01 16:42, Alberto Garcia wrote:
> If the refcount data is corrupted then we can end up trying to
> allocate a new L2 table at offset 0 in the image, triggering an
> assertion in the qcow2 cache that would crash QEMU:
> 
>   qcow2_cache_entry_mark_dirty: Assertion `c->entries[i].offset != 0' failed
> 
> This patch adds an explicit check for this scenario and a new test
> case.
> 
> Signed-off-by: Alberto Garcia 
> ---
>  block/qcow2-cluster.c  | 7 +++
>  tests/qemu-iotests/060 | 7 +++
>  tests/qemu-iotests/060.out | 6 ++
>  3 files changed, 20 insertions(+)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index fb10e26068..540af4d19d 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -278,6 +278,13 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table)
>  goto fail;
>  }
>  
> +/* If we're allocating the table at offset 0 then something is wrong */
> +if (l2_offset == 0) {
> +qcow2_signal_corruption(bs, true, -1, -1, "Preventing invalid "
> +"allocation of L2 table at offset 0");
> +return -EIO;

The other error paths in this function use a goto fail.  It doesn't make
a functional difference, but not doing it here looks a bit weird.

(Also, the same remark about offset=0 and size=s->cluster_size applies as in
patch 1.)

Functionally correct, though, so:

Reviewed-by: Max Reitz 

> +}
> +
>  ret = qcow2_cache_flush(bs, s->refcount_block_cache);
>  if (ret < 0) {
>  goto fail;
> diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
> index dead26aeaf..40f85cc216 100755
> --- a/tests/qemu-iotests/060
> +++ b/tests/qemu-iotests/060
> @@ -253,6 +253,13 @@ poke_file "$TEST_IMG" "$rt_offset" "\x00\x00\x00\x00\x00\x00\x00\x00"
>  # causing a refcount block to be allocated at offset 0
>  $QEMU_IO -c "write 0 128k" "$TEST_IMG" | _filter_qemu_io
>  
> +echo
> +echo "=== Testing empty refcount block ==="
> +echo
> +_make_test_img 64M
> +poke_file "$TEST_IMG" "$rb_offset""\x00\x00\x00\x00\x00\x00\x00\x00"
> +$QEMU_IO -c "write 0 64k" "$TEST_IMG" | _filter_qemu_io
> +
>  # success, all done
>  echo "*** done"
>  rm -f $seq.full
> diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
> index 872719009c..5b8b518486 100644
> --- a/tests/qemu-iotests/060.out
> +++ b/tests/qemu-iotests/060.out
> @@ -189,4 +189,10 @@ wrote 65536/65536 bytes at offset 0
>  64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>  qcow2: Marking image as corrupt: Preventing invalid allocation of refcount block at offset 0; further corruption events will be suppressed
>  write failed: Input/output error
> +
> +=== Testing empty refcount block ===
> +
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
> +qcow2: Marking image as corrupt: Preventing invalid allocation of L2 table at offset 0; further corruption events will be suppressed
> +write failed: Input/output error
>  *** done
> 






Re: [Qemu-block] [PATCH 1/4] qcow2: Prevent allocating refcount blocks at offset 0

2017-11-02 Thread Max Reitz
On 2017-11-01 16:42, Alberto Garcia wrote:
> Each entry in the qcow2 cache contains an offset field indicating the
> location of the data in the qcow2 image. If the offset is 0 then it
> means that the entry contains no data and is available to be used when
> needed.
> 
> Because of that it is not possible to store in the cache the first
> cluster of the qcow2 image (offset = 0). This is not a problem because
> that cluster always contains the qcow2 header and we're not using this
> cache for that.
> 
> However, if the qcow2 image is corrupted it can happen that we try to
> allocate a new refcount block at offset 0, triggering this assertion
> and crashing QEMU:
> 
>   qcow2_cache_entry_mark_dirty: Assertion `c->entries[i].offset != 0' failed
> 
> This patch adds an explicit check for this scenario and a new test
> case.
> 
> This problem was originally reported here:
> 
>https://bugs.launchpad.net/qemu/+bug/1728615
> 
> Reported-by: R.Nageswara Sastry 
> Signed-off-by: Alberto Garcia 
> ---
>  block/qcow2-refcount.c |  7 +++
>  tests/qemu-iotests/060 | 11 +++
>  tests/qemu-iotests/060.out |  8 
>  3 files changed, 26 insertions(+)
> 
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index aa3fd6cf17..9059996c4b 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -367,6 +367,13 @@ static int alloc_refcount_block(BlockDriverState *bs,
>  return new_block;
>  }
>  
> +/* If we're allocating the block at offset 0 then something is wrong */
> +if (new_block == 0) {
> +qcow2_signal_corruption(bs, true, -1, -1, "Preventing invalid "

Technically, the offset is 0 and the size is s->cluster_size, but I
won't insist.

Reviewed-by: Max Reitz 

> +"allocation of refcount block at offset 0");
> +return -EIO;
> +}
> +
>  #ifdef DEBUG_ALLOC2
>  fprintf(stderr, "qcow2: Allocate refcount block %d for %" PRIx64
>  " at %" PRIx64 "\n",
> diff --git a/tests/qemu-iotests/060 b/tests/qemu-iotests/060
> index 8e95c450eb..dead26aeaf 100755
> --- a/tests/qemu-iotests/060
> +++ b/tests/qemu-iotests/060
> @@ -242,6 +242,17 @@ poke_file "$TEST_IMG" "$(($l2_offset+8))" "\x80\x00\x00\x00\x00\x06\x2a\x00"
>  # Should emit two error messages
>  $QEMU_IO -c "discard 0 64k" -c "read 64k 64k" "$TEST_IMG" | _filter_qemu_io
>  
> +echo
> +echo "=== Testing empty refcount table with valid L1 and L2 tables ==="
> +echo
> +_make_test_img 64M
> +$QEMU_IO -c "write 0 64k" "$TEST_IMG" | _filter_qemu_io
> +poke_file "$TEST_IMG" "$rt_offset" "\x00\x00\x00\x00\x00\x00\x00\x00"
> +# Since the first data cluster is already allocated this triggers an
> +# allocation with an explicit offset (using qcow2_alloc_clusters_at())
> +# causing a refcount block to be allocated at offset 0
> +$QEMU_IO -c "write 0 128k" "$TEST_IMG" | _filter_qemu_io
> +
>  # success, all done
>  echo "*** done"
>  rm -f $seq.full
> diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
> index 5ca3af491f..872719009c 100644
> --- a/tests/qemu-iotests/060.out
> +++ b/tests/qemu-iotests/060.out
> @@ -181,4 +181,12 @@ qcow2: Marking image as corrupt: Cluster allocation offset 0x62a00 unaligned (L2
>  discard 65536/65536 bytes at offset 0
>  64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>  read failed: Input/output error
> +
> +=== Testing empty refcount table with valid L1 and L2 tables ===
> +
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
> +wrote 65536/65536 bytes at offset 0
> +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +qcow2: Marking image as corrupt: Preventing invalid allocation of refcount block at offset 0; further corruption events will be suppressed
> +write failed: Input/output error
>  *** done
> 






Re: [Qemu-block] [PATCH 0/4] Misc qcow2 corruption checks

2017-11-02 Thread Max Reitz
On 2017-11-01 16:42, Alberto Garcia wrote:
> Misc qcow2 corruption checks
> 
> This series contains a few checks that prevent QEMU from crashing
> under some scenarios with corrupted qcow2 images.
> 
> The first patch solves the crash reported here:
> 
>   https://bugs.launchpad.net/qemu/+bug/1728615
> 
> And the others solve similar crashes that I detected in the process of
> fixing this one.
> 
> Regards,
> 
> Berto

There are two more cases which might need a check that the return value
of an allocation function isn't 0:

The first is qcow2_alloc_bytes() which has an assert(offset) after
potentially setting offset = new_cluster (with new_cluster being the
return value of alloc_clusters_noref()).

The second is qcow2_crypto_hdr_init_func() which is simply missing a
pre-write overlap check.

The rest (besides L2 table and refblock allocation) should be guarded by
the pre-write overlap check.

Do you want to fix these or do we need another volunteer? :-)

Max





Re: [Qemu-block] RFC: use case for adding QMP, block jobs & multiple exports to qemu-nbd ?

2017-11-02 Thread Daniel P. Berrange
On Thu, Nov 02, 2017 at 05:40:28PM +0100, Kashyap Chamarthy wrote:
> [Cc: Matt Booth from Nova upstream; so not snipping the email to retain
> context for Matt.]
> 
> On Thu, Nov 02, 2017 at 12:02:23PM +, Daniel P. Berrange wrote:
> > I've been thinking about a potential design/impl improvement for the way
> > that OpenStack Nova handles disk images when booting virtual machines, and
> > thinking if some enhancements to qemu-nbd could be beneficial...
> 
> Just read through it, very interesting idea.  A couple of things inline.
> 
> > At a high level, OpenStack has a repository of disk images (Glance), and
> > when we go to boot a VM, Nova copies the disk image out of the repository
> > onto the local host's image cache. While doing this, Nova may also enlarge
> > the disk image (e.g. if the original image has a 10GB size, it may do a qemu-img
> > resize to 40GB). Nova then creates a qcow2 overlay with backing file
> > pointing to its local cache. Multiple VMs can be booted in parallel each
> > with their own overlay pointing to the same backing file
> > 
> > The problem with this approach is that VM startup is delayed while we copy
> > the disk image from the glance repository to the local cache, and again
> > while we do the image resize (though the latter is pretty quick really
> > since it's just changing metadata in the image and/or host filesystem)
> > 
> > One might suggest that we avoid the local disk copy and just point the
> > VM directly at an NBD server running in the remote image repository, but
> > this introduces a centralized point of failure. With the local disk copy
> > VMs can safely continue running even if the image repository dies. Running
> > from the local image cache can offer better performance too, particularly
> > if having SSD storage. 
> > 
> > Conceptually what I want to start with is a 3 layer chain
> > 
> >master-disk1.qcow2  (qemu-nbd)
> >   |
> >   |  (format=raw, proto=nbd)
> >   |
> >cache-disk1.qcow2   (qemu-system-XXX)
> >   |
> >   |  (format=qcow2, proto=file)
> >   |
> >   +-  vm-a-disk1.qcow2   (qemu-system-XXX)
> > 
> > NB vm-?-disk.qcow2 sizes may differ from the backing file.
> > Sometimes OS disk images are built with a fairly small root filesystem
> > size, and the guest OS will grow its root FS to fill the actual disk
> > size allowed to the specific VM instance.
> > 
> > The cache-disk1.qcow2 is on each local virt host that needs disk1, and
> > created when the first VM is launched. Further launched VMs can all use
> > this same cached disk.  Now the cache-disk1.qcow2 is not useful as is,
> > because it has no allocated clusters, so after it's created we need to
> > be able to stream content into it from master-disk1.qcow2, in parallel
> > with the VM A booting off vm-a-disk1.qcow2
> > 
> > If there was only a single VM, this would be easy enough, because we
> > can use drive-mirror monitor command to pull master-disk1.qcow2 data
> > into cache-disk1.qcow2 and then remove the backing chain leaving just
> > 
> >cache-disk1.qcow2   (qemu-system-XXX)
> >   |
> 
> Just for my own understanding: in this hypothetical single VM diagram,
> you denote a QEMU binary ("qemu-system-XXX") for 'cache-disk1.qcow2'
> because it will be issuing 'drive-mirror' / 'blockdev-mirror' to the
> 'qemu-nbd' that exported 'master-disk1.qcow2', and "un-chain"
> post completion of 'mirror' job.  Yes?

In this diagram the same QEMU process has both cache-disk1.qcow2 and
vm-a-disk1.qcow2 open - it's just a regular backing file setup.

> 
> >   |  (format=qcow2, proto=file)
> >   |
> >   +-  vm-a-disk1.qcow2  (qemu-system-XXX)
> > 
> > The problem is that many VMs want to use cache-disk1.qcow2 as
> > their disk's backing file, and only one process is permitted to be
> > writing to a disk backing file at any time.
> 
> Can you explain a bit more about how many VMs are trying to write to
> the same backing file 'cache-disk1.qcow2'?  I'd assume it's
> just the "immutable" local backing store (once the previous 'mirror' job
> is completed), based on which Nova creates a qcow2 overlay for each
> instance it boots.

An arbitrary number of vm-*-disk1.qcow2 files could exist, all using
the same cache-disk1.qcow2 image. It's only limited by how many VMs
you can fit on the host. By definition you can only ever have a single
process writing to a qcow2 file though, otherwise corruption will quickly
follow.

> When I pointed this e-mail of yours to Matt Booth on Freenode Nova IRC
> channel, he said the intermediate image (cache-disk1.qcow2) is a COR
> (Copy-On-Read).  I realize what COR is -- every time you read a cluster
> from the backing file, you write that locally, to avoid reading it
> again.

qcow2 doesn't give you COR, only COW. So every read request would have a miss
in cache-disk1.qcow2 and thus have to be fetched from master-disk1.qcow2. The
use of drive-mirror to pull master-disk1.qcow2 contents 

Re: [Qemu-block] RFC: use case for adding QMP, block jobs & multiple exports to qemu-nbd ?

2017-11-02 Thread Kashyap Chamarthy
[Cc: Matt Booth from Nova upstream; so not snipping the email to retain
context for Matt.]

On Thu, Nov 02, 2017 at 12:02:23PM +, Daniel P. Berrange wrote:
> I've been thinking about a potential design/impl improvement for the way
> that OpenStack Nova handles disk images when booting virtual machines, and
> thinking if some enhancements to qemu-nbd could be beneficial...

Just read through it, very interesting idea.  A couple of things inline.

> At a high level, OpenStack has a repository of disk images (Glance), and
> when we go to boot a VM, Nova copies the disk image out of the repository
> onto the local host's image cache. While doing this, Nova may also enlarge
> the disk image (e.g. if the original image has a 10GB size, it may do a qemu-img
> resize to 40GB). Nova then creates a qcow2 overlay with backing file
> pointing to its local cache. Multiple VMs can be booted in parallel each
> with their own overlay pointing to the same backing file
> 
> The problem with this approach is that VM startup is delayed while we copy
> the disk image from the glance repository to the local cache, and again
> while we do the image resize (though the latter is pretty quick really
> since it's just changing metadata in the image and/or host filesystem)
> 
> One might suggest that we avoid the local disk copy and just point the
> VM directly at an NBD server running in the remote image repository, but
> this introduces a centralized point of failure. With the local disk copy
> VMs can safely continue running even if the image repository dies. Running
> from the local image cache can offer better performance too, particularly
> if having SSD storage. 
> 
> Conceptually what I want to start with is a 3 layer chain
> 
>master-disk1.qcow2  (qemu-nbd)
>   |
>   |  (format=raw, proto=nbd)
>   |
>cache-disk1.qcow2   (qemu-system-XXX)
>   |
>   |  (format=qcow2, proto=file)
>   |
>   +-  vm-a-disk1.qcow2   (qemu-system-XXX)
> 
> NB vm-?-disk.qcow2 sizes may differ from the backing file.
> Sometimes OS disk images are built with a fairly small root filesystem
> size, and the guest OS will grow its root FS to fill the actual disk
> size allowed to the specific VM instance.
> 
> The cache-disk1.qcow2 is on each local virt host that needs disk1, and
> created when the first VM is launched. Further launched VMs can all use
> this same cached disk.  Now the cache-disk1.qcow2 is not useful as is,
> because it has no allocated clusters, so after it's created we need to
> be able to stream content into it from master-disk1.qcow2, in parallel
> with the VM A booting off vm-a-disk1.qcow2
> 
> If there was only a single VM, this would be easy enough, because we
> can use drive-mirror monitor command to pull master-disk1.qcow2 data
> into cache-disk1.qcow2 and then remove the backing chain leaving just
> 
>cache-disk1.qcow2   (qemu-system-XXX)
>   |

Just for my own understanding: in this hypothetical single VM diagram,
you denote a QEMU binary ("qemu-system-XXX") for 'cache-disk1.qcow2'
because it will be issuing 'drive-mirror' / 'blockdev-mirror' to the
'qemu-nbd' that exported 'master-disk1.qcow2', and "un-chain"
post completion of 'mirror' job.  Yes?

>   |  (format=qcow2, proto=file)
>   |
>   +-  vm-a-disk1.qcow2  (qemu-system-XXX)
> 
> The problem is that many VMs want to use cache-disk1.qcow2 as
> their disk's backing file, and only one process is permitted to be
> writing to a disk backing file at any time.

Can you explain a bit more about how many VMs are trying to write to
the same backing file 'cache-disk1.qcow2'?  I'd assume it's
just the "immutable" local backing store (once the previous 'mirror' job
is completed), based on which Nova creates a qcow2 overlay for each
instance it boots.

When I pointed this e-mail of yours to Matt Booth on Freenode Nova IRC
channel, he said the intermediate image (cache-disk1.qcow2) is a COR
(Copy-On-Read).  I realize what COR is -- every time you read a cluster
from the backing file, you write that locally, to avoid reading it
again.

> So I can't use the drive-mirror
> in the QEMU processes to deal with this; all QEMU's must see their
> backing file in a consistent read-only state
> 
> I've been wondering if it is possible to add an extra layer of NBD to
> deal with this scenario. i.e. start off with:
> 
>master-disk1.qcow2  (qemu-nbd)
>   |
>   |  (format=raw, proto=nbd)
>   |
>cache-disk1.qcow2  (qemu-nbd)
>   |
>   |  (format=raw, proto=nbd)
>   |
>   +-  vm-a-disk1.qcow2  (qemu-system-XXX)
>   +-  vm-b-disk1.qcow2  (qemu-system-XXX)
>   +-  vm-c-disk1.qcow2  (qemu-system-XXX)
> 
> 
> In this model 'cache-disk1.qcow2' would be opened read-write by a
> qemu-nbd server process, but exported read-only to QEMU. qemu-nbd
> would then do a drive mirror to stream the contents of
> master-disk1.qcow2 into its 

Re: [Qemu-block] [PATCH v6 18/25] block: Add sgfnt_runtime_opts to BlockDriver

2017-11-02 Thread Max Reitz
On 2017-11-02 17:11, Alberto Garcia wrote:
> On Fri 29 Sep 2017 06:53:40 PM CEST, Max Reitz wrote:
>>  QLIST_ENTRY(BlockDriver) list;
>> +
>> +/* Pointer to a NULL-terminated array of names of significant options that
>> + * can be specified for bdrv_open(). A significant option is one that
>> + * changes the data of a BDS.
>> + * If this pointer is NULL, the array is considered empty.
>> + * "filename" and "driver" are always considered significant. */
>> +const char *const *sgfnt_runtime_opts;
>>  };
> 
> Is sgfnt a common acronym? I actually had to look it up...

I'm open for other suggestions to shorten "significant". :-)

(Actually, I can't remember where I got it from.  I think I just made it
up, but then I don't know why I didn't use sgfcnt -- maybe because cnt
looks either like count (or like something else, but that's a different
matter)...?)

Max

>> +static const char *const iscsi_sgfnt_runtime_opts[] = {
>> +"transport",
>> +"portal",
>> +"target",
>> +"user",
>> +"password",
>> +"password-secret",
>> +"lun",
>> +"initiator-name",
>> +"header-digest",
>> +
>> +NULL
>> +};
> 
> Ideally it would be nice to have constants for all these options, but
> this is a pre-existing problem.
> 
> Reviewed-by: Alberto Garcia 
> 
> Berto
> 






Re: [Qemu-block] [PATCH v6 18/25] block: Add sgfnt_runtime_opts to BlockDriver

2017-11-02 Thread Alberto Garcia
On Fri 29 Sep 2017 06:53:40 PM CEST, Max Reitz wrote:
>  QLIST_ENTRY(BlockDriver) list;
> +
> +/* Pointer to a NULL-terminated array of names of significant options that
> + * can be specified for bdrv_open(). A significant option is one that
> + * changes the data of a BDS.
> + * If this pointer is NULL, the array is considered empty.
> + * "filename" and "driver" are always considered significant. */
> +const char *const *sgfnt_runtime_opts;
>  };

Is sgfnt a common acronym? I actually had to look it up...

> +static const char *const iscsi_sgfnt_runtime_opts[] = {
> +"transport",
> +"portal",
> +"target",
> +"user",
> +"password",
> +"password-secret",
> +"lun",
> +"initiator-name",
> +"header-digest",
> +
> +NULL
> +};

Ideally it would be nice to have constants for all these options, but
this is a pre-existing problem.

Reviewed-by: Alberto Garcia 

Berto



Re: [Qemu-block] [PATCH v6 16/25] block: Add 'base-directory' BDS option

2017-11-02 Thread Max Reitz
On 2017-11-02 15:40, Alberto Garcia wrote:
> On Fri 29 Sep 2017 06:53:38 PM CEST, Max Reitz wrote:
>> Using this option, one can directly override what bdrv_dirname() will
>> return. This is useful if one uses e.g. qcow2 on top of quorum (with
>> only protocol BDSs under the quorum BDS) and wants to be able to use
>> relative backing filenames.
>>
>> Signed-off-by: Max Reitz 
> 
> Who would be using this option in practice? I understand that management
> tools should always be using absolute filenames, and this API doesn't
> seem so convenient for humans.

Hmmm.  I seem to recall Eric liked it a couple of years ago when he was
still more on the libvirt side of things. :-)

Max





Re: [Qemu-block] [PATCH v6 09/25] block: Fix bdrv_find_backing_image()

2017-11-02 Thread Max Reitz
On 2017-10-30 15:47, Alberto Garcia wrote:
> On Fri 29 Sep 2017 06:53:31 PM CEST, Max Reitz wrote:
>> @@ -4096,22 +4086,31 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
>>  } else {
>>  /* If not an absolute filename path, make it relative to the current
>>   * image's filename path */
>> -path_combine_deprecated(filename_tmp, PATH_MAX, curr_bs->filename,
>> -backing_file);
>> +filename_tmp = bdrv_make_absolute_filename(curr_bs, backing_file,
>> +   NULL);
>> +if (!filename_tmp) {
>> +continue;
>> +}
>>  
>>  /* We are going to compare absolute pathnames */
>>  if (!realpath(filename_tmp, filename_full)) {
>> +g_free(filename_tmp);
>>  continue;
>>  }
>> +g_free(filename_tmp);
> 
> This feels a bit too verbose, doesn't it? (especially because you're
> doing the same thing twice, see below). It could be made a bit shorter,
> something like:
> 
> bool have_filename = filename_tmp && realpath(filename_tmp, filename_full);
> g_free(filename_tmp);
> if (!have_filename) {
>  continue;
> }

Well, but then again

if (filename_tmp && realpath(filename_tmp, filename_full)) {
g_free(filename_tmp);
continue;
}
g_free(filename_tmp);

has the same length. :-)

(Actually, overall it'd be one line shorter because I'd have to use an
own line for the "bool have_filename" declaration (to put it at the
start of the block).)

So I'll do that, yes.

Thanks for reviewing (and this suggestion)!

Max

> 
>> -path_combine_deprecated(filename_tmp, PATH_MAX, curr_bs->filename,
>> -curr_bs->backing_file);
>> +filename_tmp = bdrv_get_full_backing_filename(curr_bs, NULL);
>> +if (!filename_tmp) {
>> +continue;
>> +}
>>  
>>  if (!realpath(filename_tmp, backing_file_full)) {
>> +g_free(filename_tmp);
>>  continue;
>>  }
>> +g_free(filename_tmp);
> 
> Berto
> 






Re: [Qemu-block] [PATCH v6 16/25] block: Add 'base-directory' BDS option

2017-11-02 Thread Alberto Garcia
On Fri 29 Sep 2017 06:53:38 PM CEST, Max Reitz wrote:
> Using this option, one can directly override what bdrv_dirname() will
> return. This is useful if one uses e.g. qcow2 on top of quorum (with
> only protocol BDSs under the quorum BDS) and wants to be able to use
> relative backing filenames.
>
> Signed-off-by: Max Reitz 

Who would be using this option in practice? I understand that management
tools should always be using absolute filenames, and this API doesn't
seem so convenient for humans.

Berto



[Qemu-block] RFC: use case for adding QMP, block jobs & multiple exports to qemu-nbd ?

2017-11-02 Thread Daniel P. Berrange
I've been thinking about a potential design/impl improvement for the way
that OpenStack Nova handles disk images when booting virtual machines, and
thinking if some enhancements to qemu-nbd could be beneficial...

At a high level, OpenStack has a repository of disk images (Glance), and
when we go to boot a VM, Nova copies the disk image out of the repository
onto the local host's image cache. While doing this, Nova may also enlarge
the disk image (e.g. if the original image has a 10GB size, it may do a qemu-img
resize to 40GB). Nova then creates a qcow2 overlay with backing file
pointing to its local cache. Multiple VMs can be booted in parallel each
with their own overlay pointing to the same backing file

The problem with this approach is that VM startup is delayed while we copy
the disk image from the glance repository to the local cache, and again
while we do the image resize (though the latter is pretty quick really
since it's just changing metadata in the image and/or host filesystem)

One might suggest that we avoid the local disk copy and just point the
VM directly at an NBD server running in the remote image repository, but
this introduces a centralized point of failure. With the local disk copy
VMs can safely continue running even if the image repository dies. Running
from the local image cache can offer better performance too, particularly
if having SSD storage. 

Conceptually what I want to start with is a 3 layer chain

   master-disk1.qcow2  (qemu-nbd)
  |
  |  (format=raw, proto=nbd)
  |
   cache-disk1.qcow2   (qemu-system-XXX)
  |
  |  (format=qcow2, proto=file)
  |
  +-  vm-a-disk1.qcow2   (qemu-system-XXX)

NB vm-?-disk.qcow2 sizes may differ from the backing file.
Sometimes OS disk images are built with a fairly small root filesystem
size, and the guest OS will grow its root FS to fill the actual disk
size allowed to the specific VM instance.

The cache-disk1.qcow2 is on each local virt host that needs disk1, and
created when the first VM is launched. Further launched VMs can all use
this same cached disk.  Now the cache-disk1.qcow2 is not useful as is,
because it has no allocated clusters, so after it's created we need to
be able to stream content into it from master-disk1.qcow2, in parallel
with the VM A booting off vm-a-disk1.qcow2

If there was only a single VM, this would be easy enough, because we
can use drive-mirror monitor command to pull master-disk1.qcow2 data
into cache-disk1.qcow2 and then remove the backing chain leaving just

   cache-disk1.qcow2   (qemu-system-XXX)
  |
  |  (format=qcow2, proto=file)
  |
  +-  vm-a-disk1.qcow2  (qemu-system-XXX)

The problem is that many VMs want to use cache-disk1.qcow2 as
their disk's backing file, and only one process is permitted to be
writing to a disk backing file at any time. So I can't use the drive-mirror
in the QEMU processes to deal with this; all QEMU's must see their
backing file in a consistent read-only state

I've been wondering if it is possible to add an extra layer of NBD to
deal with this scenario. i.e. start off with:

   master-disk1.qcow2  (qemu-nbd)
  |
  |  (format=raw, proto=nbd)
  |
   cache-disk1.qcow2  (qemu-nbd)
  |
  |  (format=raw, proto=nbd)
  |
  +-  vm-a-disk1.qcow2  (qemu-system-XXX)
  +-  vm-b-disk1.qcow2  (qemu-system-XXX)
  +-  vm-c-disk1.qcow2  (qemu-system-XXX)


In this model 'cache-disk1.qcow2' would be opened read-write by a
qemu-nbd server process, but exported read-only to QEMU. qemu-nbd
would then do a drive mirror to stream the contents of
master-disk1.qcow2 into its cache-disk1.qcow2, concurrently with
servicing read requests from many QEMU's vm-*-disk1.qcow2 files
over NBD. When the drive mirror is complete we would again cut
the backing file to give

   cache-disk1.qcow2  (qemu-nbd)
  |
  |  (format=raw, proto=nbd)
  |
  +-  vm-a-disk1.qcow2  (qemu-system-XXX)
  +-  vm-b-disk1.qcow2  (qemu-system-XXX)
  +-  vm-c-disk1.qcow2  (qemu-system-XXX)

Since qemu-nbd no longer needs to write to cache-disk1.qcow2 at this point,
we can further pivot all the QEMU servers to make vm-*-disk1.qcow2 use
format=qcow2,proto=file, allowing the local qemu-nbd to close the disk
image, and potentially exit (assuming it doesn't have other disks to
service). This would leave

   cache-disk1.qcow2  (qemu-system-XXX)
  |
  |  (format=qcow2, proto=file)
  |
  +-  vm-a-disk1.qcow2  (qemu-system-XXX)
  +-  vm-b-disk1.qcow2  (qemu-system-XXX)
  +-  vm-c-disk1.qcow2  (qemu-system-XXX)

Conceptually QEMU has all the pieces necessary to support this kind of
approach to disk images, but they're not exposed by qemu-nbd as it has
no QMP interface of its own.
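
As a sketch, the bottom two layers of that chain can already be assembled with
today's tools (host name, port and file names are illustrative assumptions):

    # image repository host: export the master image read-only over NBD
    qemu-nbd --read-only --persistent --format=qcow2 --port=10809 \
             master-disk1.qcow2

    # virt host: local cache overlay whose backing file is the NBD export
    qemu-img create -f qcow2 \
        -o backing_file=nbd://repo-host:10809,backing_fmt=raw \
        cache-disk1.qcow2

    # per-VM overlay on top of the local cache
    qemu-img create -f qcow2 \
        -o backing_file=cache-disk1.qcow2,backing_fmt=qcow2 \
        vm-a-disk1.qcow2

What is missing, as described above, is a control channel to drive the mirror
and the backing-file pivot inside qemu-nbd itself.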

Another more minor issue is that the disk image repository may have
1000's of images in it, and I don't want to be 

Re: [Qemu-block] [PATCH v13 2/6] qmp: Use ThrottleLimits structure

2017-11-02 Thread Pradeep Jagadeesh

On 10/13/2017 4:26 PM, Eric Blake wrote:

[adding Markus, and block list]

On 10/13/2017 09:16 AM, Alberto Garcia wrote:

On Mon 02 Oct 2017 04:33:28 PM CEST, Pradeep Jagadeesh wrote:

This patch factors out code to use the ThrottleLimits
structure.



 { 'struct': 'BlockIOThrottle',
-  'data': { '*device': 'str', '*id': 'str', 'bps': 'int', 'bps_rd': 'int',
-'bps_wr': 'int', 'iops': 'int', 'iops_rd': 'int', 'iops_wr': 'int',
-'*bps_max': 'int', '*bps_rd_max': 'int',
-'*bps_wr_max': 'int', '*iops_max': 'int',
-'*iops_rd_max': 'int', '*iops_wr_max': 'int',
-'*bps_max_length': 'int', '*bps_rd_max_length': 'int',
-'*bps_wr_max_length': 'int', '*iops_max_length': 'int',
-'*iops_rd_max_length': 'int', '*iops_wr_max_length': 'int',
-'*iops_size': 'int', '*group': 'str' } }
+  'base': 'ThrottleLimits',
+  'data': { '*device': 'str', '*id': 'str', '*group': 'str' } }


So BlockIOThrottle used to have parameters named bps_rd and iops_wr, and
after this patch they become bps-read and iops-write. This breaks the
API completely, as you can see if you run e.g. iotest 129:

AssertionError: failed path traversal for "return" in "{u'error': {u'class': u'GenericError', u'desc': u"Parameter 'iops_rd' is unexpected"}}"

I just checked previous versions of the series and I see that Manos
already warned you of this in v11:

   https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg04698.html


On the bright side, ThrottleLimits is marked 'since 2.11' (added in
commit 432d889e), meaning it has not yet been released, so we CAN fix
the naming in ThrottleLimits to be compatible with BlockIOThrottle if we
want to share the type, as long as we get it done before the 2.11
release.  It does mean tweaking Manos' code to use compatible names
everywhere, but that may be a wise course of action (we tend to favor
'-' in new API names unless there is a strong reason to use '_'; but
sharing code for maximum back-compat would be a reason to use '_').
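
Concretely, the compatible rename would give ThrottleLimits the underscore
spellings that BlockIOThrottle already exposes -- a hypothetical sketch, not
the final schema:

    { 'struct': 'ThrottleLimits',
      'data': { '*iops': 'int', '*iops_rd': 'int', '*iops_wr': 'int',
                '*bps': 'int', '*bps_rd': 'int', '*bps_wr': 'int',
                '*iops_max': 'int', '*iops_rd_max': 'int', '*iops_wr_max': 'int',
                '*bps_max': 'int', '*bps_rd_max': 'int', '*bps_wr_max': 'int',
                '*iops_size': 'int' } }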


Sorry for the late reply, I was out of the office.
I am ready to change them to "_" instead of "-", but I need to ask Manos.

@Manos, what do you say about the above comment? It makes sense to me,
and it will help to reuse a lot of code. Shall we rename the parameters
with "_" instead of "-"?

Regards,
Pradeep