Re: [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:47, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:51:42AM +0100, Klaus Jensen wrote:
> > Hi,
> > 
> > 
> > Changes since v4
> >  - Changed vendor and device id to use a Red Hat allocated one. For
> >backwards compatibility add the 'x-use-intel-id' nvme device
> >parameter. This is off by default but is added as a machine compat
> >property to be true for machine types <= 4.2.
> > 
> >  - SGL mapping code has been refactored.
> 
> Looking pretty good to me. For the series beyond the individually
> reviewed patches:
> 
> Acked-by: Keith Busch 
> 
> If you need to send a v5, you may add my tag to the patches that are not
> substantially modified if you like.

I'll send a v6 with the changes to "nvme: make lba data size
configurable". It won't be substantially changed; I will just restrict
lbads to 9 and 12 as the only valid values.

Thanks for the Acks and Reviews, Keith!


Klaus


Re: [PATCH v5 24/26] nvme: change controller pci id

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:35, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:06AM +0100, Klaus Jensen wrote:
> > There are two reasons for changing this:
> > 
> >   1. The nvme device currently uses an internal Intel device id.
> > 
> >   2. Since commits "nvme: fix write zeroes offset and count" and "nvme:
> >  support multiple namespaces" the controller device no longer has
> >  the quirks that the Linux kernel thinks it has.
> > 
> >  As the quirks are applied based on pci vendor and device id, change
> >  them to get rid of the quirks.
> > 
> > To keep backward compatibility, add a new 'x-use-intel-id' parameter to
> > the nvme device to force use of the Intel vendor and device id. This is
> > off by default, but a compat property is added to set it for machines
> > 4.2 and older.
> > 
> > Signed-off-by: Klaus Jensen 
> 
> Yay, thank you for following through on getting this identifier assigned.
> 
> Reviewed-by: Keith Busch 

This is technically not "officially" sanctioned yet, but I got an
indication from Gerd that we are good to proceed with this.



Re: [PATCH v5 22/26] nvme: support multiple namespaces

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:31, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:04AM +0100, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and register
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> 
> I like this feature a lot, thanks for doing it.
> 
> Reviewed-by: Keith Busch 
> 
> > @@ -1256,18 +1272,24 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, 
> > NvmeCmd *cmd, uint8_t rae,
> >  uint64_t units_read = 0, units_written = 0, read_commands = 0,
> >  write_commands = 0;
> >  NvmeSmartLog smart;
> > -BlockAcctStats *s;
> >  
> >  if (nsid && nsid != 0xffffffff) {
> >  return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> 
> This is totally optional, but worth mentioning: this patch makes it
> possible to remove this check and allow per-namespace smart logs. The
> ID_CTRL.LPA would need to be updated to reflect that if you wanted to
> go that route.

Yeah, I thought about that, but with NVMe v1.4 support arriving in a
later series, there is no longer any namespace-specific data in the
log page anyway.

The spec isn't really clear on what the preferred behavior for a 1.4
compliant device is. Either

  1. LPA bit 0 set and just return the same page for each namespace, or
  2. LPA bit 0 clear and fail when a specific NSID is given (see the
     sketch below)
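A minimal sketch of the two options (illustrative only, not QEMU code: the
status constants follow the NVMe spec and the quoted nvme_smart_info()
check, while the helper names are made up for this example):

    #include <stdint.h>

    #define NVME_NSID_BROADCAST 0xffffffff
    #define NVME_SUCCESS        0x0000
    #define NVME_INVALID_FIELD  0x0002
    #define NVME_DNR            0x4000

    /* Option 1: LPA bit 0 set -- any NSID is accepted and the same
     * controller-wide page is returned for every namespace. */
    static uint16_t smart_nsid_check_per_ns(uint32_t nsid)
    {
        (void)nsid;
        return NVME_SUCCESS;
    }

    /* Option 2: LPA bit 0 clear -- only NSID 0 or the broadcast value
     * is accepted; anything else fails, as in the quoted diff. */
    static uint16_t smart_nsid_check_ctrl_only(uint32_t nsid)
    {
        if (nsid && nsid != NVME_NSID_BROADCAST) {
            return NVME_INVALID_FIELD | NVME_DNR;
        }
        return NVME_SUCCESS;
    }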



Re: [PATCH v5 26/26] nvme: make lba data size configurable

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:43, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:08AM +0100, Klaus Jensen wrote:
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme-ns.c | 2 +-
> >  hw/block/nvme-ns.h | 4 +++-
> >  hw/block/nvme.c| 1 +
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index 0e5be44486f4..981d7101b8f2 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -18,7 +18,7 @@ static int nvme_ns_init(NvmeNamespace *ns)
> >  {
> >  NvmeIdNs *id_ns = &ns->id_ns;
> >  
> > -id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->lbaf[0].ds = ns->params.lbads;
> >  id_ns->nuse = id_ns->ncap = id_ns->nsze =
> >  cpu_to_le64(nvme_ns_nlbas(ns));
> >  
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index b564bac25f6d..f1fe4db78b41 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -7,10 +7,12 @@
> >  
> >  #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> >  DEFINE_PROP_DRIVE("drive", _state, blk), \
> > -DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> > +DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
> > +DEFINE_PROP_UINT8("lbads", _state, _props.lbads, BDRV_SECTOR_BITS)
> 
> I think we need to validate the parameter is between 9 and 12 before
> trusting it can be used safely.
> 
> Alternatively, add supported formats to the lbaf array and let the host
> decide on a live system with the 'format' command.

The device does not yet support Format NVM, but we have a patch ready
for that to be submitted with a new series when this is merged.

For now, while it does not support Format, I will change this patch such
that it defaults to 9 (BDRV_SECTOR_BITS) and only accepts 12 as an
alternative (while always keeping the number of available formats at 1).
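A rough sketch of that restriction (an assumption-laden illustration, not
the actual patch: it takes the params.lbads field from the quoted diff and
reduces QEMU's Error reporting to a plain return code):

    #include <stdint.h>
    #include <stdio.h>

    #define BDRV_SECTOR_BITS 9

    /* Accept only 9 (512-byte LBAs, the default) or 12 (4096-byte LBAs)
     * while Format NVM is unsupported; the number of advertised formats
     * stays at 1. */
    static int nvme_ns_check_lbads(uint8_t lbads)
    {
        if (lbads != BDRV_SECTOR_BITS && lbads != 12) {
            fprintf(stderr,
                    "nvme-ns: unsupported lbads %u (must be 9 or 12)\n",
                    (unsigned)lbads);
            return -1;
        }
        return 0;
    }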


Re: [PATCH] qemu-img: Place the '-i aio' option in alphabetical order

2020-02-05 Thread Stefan Hajnoczi
On Wed, Feb 05, 2020 at 05:30:08PM +0100, Julia Suvorova wrote:
> The '-i AIO' option was accidentally placed after '-n' and '-t'. Move it
> after '--flush-interval'.
> 
> Signed-off-by: Julia Suvorova 
> ---
>  docs/interop/qemu-img.rst | 8 
>  qemu-img-cmds.hx  | 4 ++--
>  2 files changed, 6 insertions(+), 6 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH v3 1/1] qemu-img: Add --target-is-zero to convert

2020-02-05 Thread Eric Blake

On 2/4/20 8:21 AM, David Edmondson wrote:

On Tuesday, 2020-02-04 at 07:31:52 -06, Eric Blake wrote:


On 2/4/20 3:52 AM, David Edmondson wrote:

In many cases the target of a convert operation is a newly provisioned
target that the user knows is blank (filled with zeroes). In this


'filled with zeroes' is accurate for a preallocated image, but reads
awkwardly for a sparse image; it might be better to state 'reads as zero'.





+++ b/docs/interop/qemu-img.rst
@@ -214,6 +214,12 @@ Parameters to convert subcommand:
 will still be printed.  Areas that cannot be read from the source will be
 treated as containing only zeroes.
   
+.. option:: --target-is-zero

+
+  Assume that the destination is filled with zeros. This parameter is


Spelled 'zeroes' just three lines earlier.


My understanding is that "zeros" is the correct plural of "zero" (and
that "zeroes" relates to the use of "zero" as a verb), but perhaps
that's another British English thing?

I don't care enough to fight about it.



https://english.stackexchange.com/questions/3824/what-is-the-plural-form-of-zero 
says both zeros and zeroes are fine for the noun (with UK leaning more 
to zeros), but only zeroes for the verb; but also concedes that zeroes 
is gaining in popularity over time.  I likewise won't bother tweaking 
your patch (I see in v4 that you left it unchanged).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v4 1/1] qemu-img: Add --target-is-zero to convert

2020-02-05 Thread Eric Blake

On 2/5/20 5:02 AM, David Edmondson wrote:

In many cases the target of a convert operation is a newly provisioned
target that the user knows is blank (reads as zero). In this situation
there is no requirement for qemu-img to wastefully zero out the entire
device.

Add a new option, --target-is-zero, allowing the user to indicate that
an existing target device will return zeros for all reads.

Signed-off-by: David Edmondson 
---
  docs/interop/qemu-img.rst |  9 -
  qemu-img-cmds.hx  |  4 ++--
  qemu-img.c| 26 +++---
  3 files changed, 33 insertions(+), 6 deletions(-)


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 26/33] block: Use child_of_bds in remaining places

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Replace child_file by child_of_bds in all remaining places (excluding
tests).

Signed-off-by: Max Reitz 
---



+++ b/block/raw-format.c
@@ -444,6 +444,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, 
int flags,
  BDRVRawState *s = bs->opaque;
  bool has_size;
  uint64_t offset, size;
+BdrvChildRole file_role;
  int ret;
  
  ret = raw_read_options(options, &offset, &has_size, &size, errp);

@@ -451,8 +452,18 @@ static int raw_open(BlockDriverState *bs, QDict *options, 
int flags,
  return ret;
  }
  
-bs->file = bdrv_open_child(NULL, options, "file", bs, &child_file, 0,
-   false, errp);
+/*
+ * Without offset and a size limit, this driver behaves very much
+ * like a filter.  With any such limit, it does not.
+ */
+if (offset || has_size) {
+file_role = BDRV_CHILD_DATA | BDRV_CHILD_PRIMARY;
+} else {
+file_role = BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY;
+}


Cool - the new roles give us the flexibility to be more dynamic!

Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v3 00/13] RFC: [for 5.0]: HMP monitor handlers cleanups

2020-02-05 Thread John Snow



On 1/28/20 11:47 AM, Dr. David Alan Gilbert wrote:
> * John Snow (js...@redhat.com) wrote:
>>
>>
>> On 1/27/20 3:43 PM, Peter Krempa wrote:
>>> On Mon, Jan 27, 2020 at 14:39:02 -0500, John Snow wrote:


 On 1/27/20 5:36 AM, Maxim Levitsky wrote:
> This patch series is bunch of cleanups
> to the hmp monitor code.
>
> This series only touched blockdev related hmp handlers.
>
> No functional changes expected other than
> light error message changes by the last patch.
>
> This was inspired by this bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=1719169
>
> Basically some users still parse hmp error messages,
> and they would like to have them prefixed with 'Error:'
>

 HMP isn't meant to be parsed. It's explicitly *not* API or ABI. I do
 like consistency in my UIs (it's useful for human eyes, too), but I'd
 like to know more about the request.
>>>
>>> That's true as long as there's an stable replacement ... see below.
>>>
>>
>> Thanks for the context!
>>

 Is this request coming from libvirt? Can we wean them off of this
 interface? What do they need as a replacement?
>>>
>>> There are 5 commands that libvirt still has HMP interfaces for:
>>>
>>> drive_add
>>> drive_del
>>>
>>> savevm
>>> loadvm
>>> delvm
>>>
>>> From an upstream point of view there's no value in adding the 'error'
>>> prefix to drive_add/drive_del as libvirt now uses blockdev-add/del QMP
>>> command instead which have implicit error propagation.
>>>
>>
>> As thought.
>>
>>> There are no replacements for the internal snapshot commands, but they
>>> reported the 'error' prefix for some time even before this series.
>>>
>>> Said that, please don't break savevm/loadvm/delvm until a QMP
>>> replacement is added.
>>>
>>
>> Yes, noted. I wonder where userfaultfd write support is these days...
> 
> How would that help you there?
> 

Left at the traffic lights, but there was a thought that we'd be able to
get transactionable save-memory support in QMP if we could use
userfaultfd to do just-in-time copies of memory as needed, until the job
is complete.

This way we could support it properly in QMP and we'd have a replacement
for the HMP version which -- from memory -- is not appropriate for the
QMP channel.

Maybe I imagined this restriction.

--js




Re: [PATCH v2 25/33] block: Make filter drivers use child_of_bds

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Note that some filters have secondary children, namely blkverify (the
image to be verified) and blklogwrites (the log).  This patch does not
touch those children.


I would have guessed blkdebug; but I guess that's not quite a filter for 
other reasons, so it gets covered in a later patch.




Note that for blkverify, the filtered child should not be format-probed.
While there is nothing enforcing this here, in practice, it will not be:
blkverify implements .bdrv_file_open.  The block layer ensures (and in
fact, asserts) that BDRV_O_PROTOCOL is set for every BDS whose driver
implements .bdrv_file_open.  This flag will then be bequeathed to
blkverify's children, and they will thus (by default) not be probed
either.

("By default" refers to the fact that blkverify's other child (the
non-filtered one) will have BDRV_O_PROTOCOL force-unset, because that is
what happens for all non-filtered children of non-format drivers.)

Signed-off-by: Max Reitz 
---
  block/blkdebug.c| 4 +++-
  block/blklogwrites.c| 3 ++-
  block/blkverify.c   | 4 +++-
  block/copy-on-read.c| 5 +++--
  block/filter-compress.c | 5 +++--
  block/replication.c | 3 ++-
  block/throttle.c| 5 +++--
  7 files changed, 19 insertions(+), 10 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 24/33] block: Make format drivers use child_of_bds

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Commonly, they need to pass the BDRV_CHILD_IMAGE set as the
BdrvChildRole; but there are exceptions for drivers with external data
files (qcow2 and vmdk).

Signed-off-by: Max Reitz 
---


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 23/33] block: Drop child_backing

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Signed-off-by: Max Reitz 
---
  block.c   | 62 +++
  include/block/block_int.h |  1 -
  2 files changed, 4 insertions(+), 59 deletions(-)


More code deletion.



diff --git a/block.c b/block.c
index 6b705ee23a..e245b5d8d9 100644
--- a/block.c
+++ b/block.c
@@ -1169,15 +1169,6 @@ static void bdrv_backing_attach(BdrvChild *c)
  parent->backing_blocker);
  }
  
-/* XXX: Will be removed along with child_backing */
-static void bdrv_child_cb_attach_backing(BdrvChild *c)


In fact, more promised code deletion :)

Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 22/33] block: Make backing files child_of_bds children

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Signed-off-by: Max Reitz 


Another sparse commit message (a recurring theme of this series). The 
subject line says 'what', and the patch appears to be faithful to that, 
but if a future bisection lands here, even a one-sentence 'why' would be 
handy; maybe:


This is part of a larger series of unifying block device relationships 
via child_of_bds.


to at least hint that searching nearby commits gives a better why.


---
  block.c | 26 --
  block/backup-top.c  |  2 +-
  block/vvfat.c   |  3 ++-
  tests/test-bdrv-drain.c | 13 +++--
  4 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/block.c b/block.c
index 77755f0c6c..6b705ee23a 100644
--- a/block.c
+++ b/block.c
@@ -2748,6 +2748,20 @@ static bool 
bdrv_inherits_from_recursive(BlockDriverState *child,
  return child != NULL;
  }
  
+/*
+ * Return the BdrvChildRole for @bs's backing child.  bs->backing is
+ * mostly used for COW backing children (role = COW), but also for
+ * filtered children (role = FILTERED | PRIMARY).
+ */
+static BdrvChildRole bdrv_backing_role(BlockDriverState *bs)
+{
+if (bs->drv && bs->drv->is_filter) {
+return BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY;


And here's the first point (that I've spotted at least) in this series 
where you are definitely returning a non-BdrvChildRole through a return 
type of BdrvChildRole, rather than me just guessing you might (the 
integer formed by bitwise-or of two enum values is not itself an enum 
value).  Repeating what I said earlier, the C language is loose enough 
to allow your usage, and your usage is somewhat better self-documenting 
than using an unsigned int; but it would not fly in other languages.


So I won't insist you change it, but at least think about it. (And the 
latter has already happened if you read my paragraph - so can we call 
that enough thought on the matter? ;)



+} else {
+return BDRV_CHILD_COW;
+}
+}
+

Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 2/2] qemu-nbd: Removed deprecated --partition option

2020-02-05 Thread Ján Tomko

On Thu, Jan 23, 2020 at 10:46:50AM -0600, Eric Blake wrote:

The option was deprecated in 4.0.0 (commit 0ae2d546); it's now been
long enough with no complaints to follow through with that process.

Signed-off-by: Eric Blake 
---
docs/interop/qemu-nbd.rst |  15 ++---
qemu-deprecated.texi  |  49 ++
qemu-nbd.c| 133 +-
3 files changed, 24 insertions(+), 173 deletions(-)



Reviewed-by: Ján Tomko 

Jano




Re: [PATCH v2 21/33] block: Drop child_format

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Signed-off-by: Max Reitz 
---


Sparse message, but the diffstat makes it obvious that it is ripping out 
now-unused code, so there's not much more to say.



  block.c   | 29 -
  include/block/block_int.h |  1 -
  2 files changed, 30 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 20/33] block: Switch child_format users to child_of_bds

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Both users (quorum and blkverify) use child_format for
not-really-filtered children, so the appropriate BdrvChildRole in both
cases is DATA.  (Note that this will cause bdrv_inherited_options() to
force-allow format probing.)

Signed-off-by: Max Reitz 
---
  block/blkverify.c | 4 ++--
  block/quorum.c| 7 ---
  2 files changed, 6 insertions(+), 5 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 19/33] raw-format: Split raw_read_options()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Split raw_read_options() into one function that actually just reads the
options, and another that applies them.  This will allow us to detect
whether the user has specified any options before attaching the file
child (so we can decide on its role based on the options).

Signed-off-by: Max Reitz 
---
  block/raw-format.c | 110 ++---
  1 file changed, 65 insertions(+), 45 deletions(-)


Looks like a useful split.

Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 18/33] block: Add bdrv_default_perms()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

This callback can be used by BDSs that use child_of_bds with the
appropriate BdrvChildRole for their children.

Also, make bdrv_format_default_perms() use it for child_of_bds children
(just a temporary solution until we can drop bdrv_format_default_perms()
altogether).

Signed-off-by: Max Reitz 
---
  block.c   | 46 ---
  include/block/block_int.h | 11 ++
  2 files changed, 49 insertions(+), 8 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 17/33] block: Split bdrv_default_perms_for_storage()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

We can be less restrictive about pure data children than those with
metadata on them.

For bdrv_format_default_perms(), we keep the safe option of
bdrv_default_perms_for_metadata() (until we drop
bdrv_format_default_perms() altogether).

That means that bdrv_default_perms_for_data() is unused at this point.
We will use it for all children that have the DATA role, but not the
METADATA role.  So far, no child has any role, so we do not use it, but
that will change.

Signed-off-by: Max Reitz 
---
  block.c | 53 +++--
  1 file changed, 43 insertions(+), 10 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 16/33] block: Pull out bdrv_default_perms_for_storage()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Signed-off-by: Max Reitz 
---


Commit message is short.


  block.c | 71 +
  1 file changed, 46 insertions(+), 25 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 15/33] block: Pull out bdrv_default_perms_for_backing()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Signed-off-by: Max Reitz 


Rather light on the commit message. But looks like straightforward 
refactoring (with the previous patch making it easier to follow).



---
  block.c | 62 +
  1 file changed, 40 insertions(+), 22 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 14/33] block: Distinguish paths in *_format_default_perms

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

bdrv_format_default_perms() has one code path for backing files, and one
for storage files.  We want to pull them out into own functions, so
make sure they are completely distinct before so the next patches will
be a bit cleaner.

Signed-off-by: Max Reitz 
---
  block.c | 19 +--
  1 file changed, 13 insertions(+), 6 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 13/33] block: Add child_of_bds

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Any current user of child_file, child_format, and child_backing can and
should use this generic BdrvChildClass instead, as it can handle all of
these cases.  However, to be able to do so, the users must pass the
appropriate BdrvChildRole when the child is created/attached.  (The
following commits will take care of that.)

Signed-off-by: Max Reitz 
---
  block.c   | 27 +++
  include/block/block_int.h |  1 +
  2 files changed, 28 insertions(+)


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 00/17] Improve qcow2 all-zero detection

2020-02-05 Thread Eric Blake

On 2/5/20 11:04 AM, Max Reitz wrote:

OK, I expected users to come in a separate patch.


I can refactor that better in v2.




That's the use case: when copying into a destination file, it's useful
to know if the destination already reads as all zeroes, before
attempting a fallback to bdrv_make_zero(BDRV_REQ_NO_FALLBACK) or calls
to block status checking for holes.


But that was my point on IRC.  Is it really more useful if
bdrv_make_zero() is just as quick?  (And the fact that NBD doesn’t have
an implementation looks more like a problem with NBD to me.)


That is indeed a thought - why should qemu-img even TRY to call 
bdrv_has_init_zero; it should instead call bdrv_make_zero(), which 
should be as fast as possible (see my latest reply on 9/17 exploring 
that idea more).  Under the hood, we can then make bdrv_make_zero() use 
whatever tricks it needs, whether keeping the driver's 
.bdrv_has_zero_init/_truncate callbacks, adding another callback, making 
write_zeroes faster, or whatever, but instead of making qemu-img sort 
through multiple ideas, the burden would now be hidden in the block layer.




(Considering that at least the code we discussed on IRC didn’t work for
preallocated images, which was the one point where we actually have a
problem in practice.)


And this series DOES improve the case for preallocated qcow2 images, by 
virtue of a new qcow2 autoclear bit.  But again, that may be something 
we want to hide in the driver callback interfaces, while qemu-img just 
blindly calls bdrv_make_zero() and gets success (the image now reads as 
zeroes, either because it was already that way or we did something 
quick) or failure (it is a waste of time to prezero, whether by 
write_zeroes or by trim or by truncate, so just manually write zeroes as 
part of your image copying).





(We have a use case with convert -n to freshly created image files, but
my position on this on IRC was that we want the --target-is-zero flag
for that anyway: Auto-detection may always break, our preferred default
behavior may always change, so if you want convert -n not to touch the
target image except to write non-zero data from the source, we need a
--target-is-zero flag and users need to use it.  Well, management
layers, because I don’t think users would use convert -n anyway.

And with --target-is-zero and users effectively having to use it, I
don’t think that’s a good example of a use case.)


Yes, there will still be cases where you have to use --target-is-zero
because the image itself couldn't report that it already reads as
zeroes, but there are also enough cases where the destination is already
known to read zeroes and it's a shame to tell the user that 'you have to
add --target-is-zero to get faster copying even though we could have
inferred it on your behalf'.


How is it a shame?  I think only management tools would use convert -n.
  Management tools want reliable behavior.  If you want reliable
behavior, you have to use --target-is-zero anyway.  So I don’t see the
actual benefit for qemu-img convert.


qemu-img convert to an NBD destination cannot create the destination, so 
it ALWAYS has to use -n.  I don't know how often users are likely to 
wire up a command line for qemu-img convert with NBD as the destination, 
or whether you are correct that it will be a management app able to 
supply -n (and thus able to supply --target-is-zero).  But at the same 
time, can a management app learn whether it is safe to supply 
--target-is-zero?  With my upcoming NBD patches, 'qemu-img --list' will 
show whether the NBD target is known to start life all zero, and a 
management app could use that mechanism to decide when to pass
--target-is-zero (whether a management app would actually fork qemu-img 
--list, or just be an NBD client itself to do the same thing qemu-img 
would do, is beside the point).


Similarly, this series includes enhancements to 'qemu-img info' on qcow2 
images known to currently read as zero; again, that sort of information 
is necessary somewhere in the chain, whether it be because qemu-img 
consumes the information itself, or because the management app consumes 
the information in order to pass the --target-is-zero option to 
qemu-img, either way, the information needs to be available for consumption.





I suppose there is the point of blockdev-create + blockdev-mirror: This
has exactly the same problem as convert -n.  But again, if you really
want blockdev-mirror not just to force-zero the image, you probably need
to tell it so explicitly (i.e., with a --target-is-zero flag for
blockdev-mirror).

(Well, I suppose we could save us a target-is-zero for mirror if we took
this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
But, well, please no.)

But maybe I’m just an idiot and there is no reason not to take this
series and make blockdev-create + blockdev-mirror do the sensible thing
by default in most cases. *shrug*


My argument for taking the series _is_ that the common case can b

Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}

2020-02-05 Thread Eric Blake

On 2/5/20 11:22 AM, Max Reitz wrote:




And thus callers which just want the trivially obtainable
BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
even though they don’t care about that flag.


True, but only to a minor extent; and the documentation mentions that
the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
block_status loop.


So it must be less expensive than an arbitrarily complex loop.  I think
a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?


If I recall, the tmpfs bug was that it was O(n) where n was based on the 
initial offset and the number of extents prior to that offset.  The 
probe at offset 0 is O(1) (because there are no prior extents), whether 
it reaches the end of the file (the entire image is a hole) or hits data 
beforehand.  It is only probes at later offsets where the speed penalty 
sets in, and where an O(n) loop over all extents turned into an O(n^2) 
traversal time due to the O(n) nature of each later lookup.




What I’m trying to say is that this is not a good limit and can mean
anything.

I do think this limit definition makes sense for callers that want to
know about ZERO_OPEN.  But I don’t know why we would have to let other
callers wait, too.


Keeping separate functions may still be the right approach for v2, 
although I'd still like to name things better ('has_zero_init' vs. 
'has_zero_init_truncate' did not work well for me).  And if I'm renaming 
things, then I'm touching just as much code whether I rename and keep 
separate functions or rename and consolidate into one.





Meanwhile, callers tend to only care about
bdrv_known_zeroes() right after opening an image or right before
resizing (not repeatedly during runtime);


Hm, yes.  I was thinking of parallels, but that only checks once in
parallels_open(), so it’s OK.


and you also argued elsewhere
in this thread that it may be worth having the block layer cache
BDRV_ZERO_OPEN and update the cache on any write,


I didn’t say the block layer, but if it makes sense.


at which point, the
expense in the driver callback really is a one-time call during
bdrv_co_open().


It definitely doesn’t make sense to me to do that call unconditionally
in bdrv_co_open().


Okay, you have a point there - while 'qemu-img convert' cares, not all 
clients of bdrv_co_open() are worried about whether the existing 
contents are zero; so unconditionally priming a cache during 
bdrv_co_open is not as wise as doing things when it will actually be 
useful information.  On the other hand, if it is something that clients 
only use when first opening an image, caching data doesn't make much 
sense either.


So, we know that bdrv_has_zero_init() is only viable on a just-created 
image, bdrv_has_zero_init_truncate() is only viable if you are about to 
resize an image using bdrv_co_truncate(PREALLOC_MODE_OFF).


Hmm - thinking aloud: our ultimate goal is that we want to make it 
easier for algorithms that can be sped up IF the image is currently 
known to be all zero.  Maybe what this means is that we really want to 
be tweaking bdrv_make_zeroes() to do all the work, something along the 
lines of:
- if the image is known to already be all zeroes using an O(1) probe 
(this includes if the image was freshly created and creation sees all 
zeroes, or if a block_status at offset 0 shows a hole for the entire 
image, or if an NBD extension advertises all zero at connection 
time...), return success
- if the image has a FAST truncate, and resizing reads zeroes, we can 
truncate to size 0 and back to the desired size, then return success; 
determining if truncate is fast should be similar to how 
BDRV_REQ_NO_FALLBACK determines whether write zeroes is fast
- if the image supports BDRV_REQ_NO_FALLBACK with write zeroes, we can 
request a write zeroes over the whole image, which will either succeed 
(the image is now quickly zero) or fail (writing zeroes as we go is the 
best we can do)
- if the image could report that it is all zeroes, but only at the cost 
of O(n) work such as a loop over block_status (or even O(n^2) with the 
tmpfs lseek bug), it's easier to report failure than to worry about 
making the image read all zeroes


qemu-img would then only ever need to consult --target-is-zero and 
bdrv_make_zero(), and not worry about any other function calls; while 
the block layer would take care of coordinating whatever other call 
sequences make the most sense in reporting success or failure in getting 
the image into an all-zero state if it was not already there.
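
A rough model of that decision ladder (not QEMU code: the probe names are
placeholders for whichever driver callbacks or request flags, e.g.
BDRV_REQ_NO_FALLBACK, block status at offset 0, or an NBD all-zero hint,
end up implementing each step; error handling is reduced to a boolean):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct Image Image;
    struct Image {
        uint64_t size;
        /* hypothetical O(1)-ish probes and actions */
        bool (*reads_as_zero_fast)(Image *img);      /* fresh, hole at 0, ... */
        bool (*truncate_is_fast_and_zero)(Image *img);
        bool (*truncate)(Image *img, uint64_t size); /* returns success */
        bool (*write_zeroes_no_fallback)(Image *img, uint64_t size);
    };

    /* Returns true if the image now reads as all zeroes; false means the
     * caller should just write zeroes along with the data it copies. */
    static bool make_zero(Image *img)
    {
        if (img->reads_as_zero_fast(img)) {
            return true;                  /* already zero, nothing to do */
        }
        if (img->truncate_is_fast_and_zero(img)) {
            /* shrink to 0 and regrow to the original size */
            return img->truncate(img, 0) && img->truncate(img, img->size);
        }
        if (img->write_zeroes_no_fallback(img, img->size)) {
            return true;                  /* fast whole-image zeroing */
        }
        return false;                     /* O(n) probing is not worth it */
    }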






And in that case, whether the one-time expense is done
via a single function call or via three driver callbacks, the amount of
work is the same; but the driver callback interface is easier if there
is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
though BlockDriverInfo tracks much more than that boolean).

In fact, it may be worth consolidating

Re: [PATCH 1/3] m25p80: Convert to support tracing

2020-02-05 Thread Guenter Roeck
On Wed, Feb 05, 2020 at 06:10:44PM +0100, Cédric Le Goater wrote:
> On 2/5/20 5:35 PM, Guenter Roeck wrote:
> > On Wed, Feb 05, 2020 at 11:05:04AM +0100, Cédric Le Goater wrote:
> >> On 2/3/20 7:09 PM, Guenter Roeck wrote:
> >>> While at it, add some trace messages to help debug problems
> >>> seen when running the latest Linux kernel.
> >>
> >> In case you resend, It would be nice to printout a flash id in the tracing
> >> else I have a patch for it on top of yours. It helped me track a squashfs
> >> corruption on the Aspeed witherspoon-bmc machine which we were after since
> >> 2017 or so. It seems to be a kernel bug.
> >>
> > 
> > I'll send a new version to split patch 2. Not sure I understand what you 
> > mean
> > with the above. If you send me your patch I'll be happy to merge it into 
> > mine,
> > otherwise we can just keep it as a follow-up patch.
> 
> Here is the idea : 
> 
>   
> https://github.com/legoater/qemu/commit/a07727e9cfc8447ea18249ff68a561f7e8883584
> 

Ah, that. I had thought about doing that, but I found displaying the pointer
a bit clumsy. It looks like there isn't really anything else available that
would provide a flash index, and I agree that it would be useful, so I'll
add it for all traces.

Thanks,
Guenter



Re: [PATCH 00/17] Improve qcow2 all-zero detection

2020-02-05 Thread Max Reitz
On 05.02.20 16:14, Vladimir Sementsov-Ogievskiy wrote:
> 05.02.2020 17:47, Vladimir Sementsov-Ogievskiy wrote:
>> 05.02.2020 17:26, Eric Blake wrote:
>>> On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>
> 3. For qcow2
> Hmm. Here, as I understand, the main case is freshly created qcow2,
> which is fully-unallocated. To understand that it is empty, we
> need only to check all L1 entries. And for empty L1 table it is fast.
> So we don't need any qcow2 format improvement to check it.
>

 Ah yes, I forget about preallocated case. Hmm. For preallocated
 clusters,
 we have zero bits in L2 entries. And with them, we even don't need
 preallocated to be filled by zeros, as we never read them (but just
 return
 zeros on read)..
>>>
>>> Scanning all L2 entries is O(n), while an autoclear bit properly
>>> maintained is O(1).
>>>

 Then, may be we want similar flag for L1 entry (this will enable large
 fast write-zero). And may be we want flag which marks the whole image
 as read-zero (it's your flag). So, now I think, my previous idea
 of "all allocated is zero" is worse. As for fully-preallocated images
 we are sure that all clusters are allocated, and it is more native to
 have flags similar to ZERO bit in L2 entry.
>>>
>>> Right now, we don't have any L1 entry flags.  Adding one would
>>> require adding an incompatible feature flag (if older qemu would
>>> choke to see unexpected flags in an L1 entry), or at best an
>>> autoclear feature flag (if the autoclear bit gets cleared because an
>>> older qemu opened the image and couldn't maintain L1 entry flags
>>> correctly, then newer qemu knows it cannot trust those L1 entry
>>> flags).  But as soon as you are talking about adding a feature bit,
>>> then why add one that still requires O(n) traversal to check (true,
>>> the 'n' in an O(n) traversal of L1 tables is much smaller than the
>>> 'n' in an O(n) traversal of L2 tables), when you can instead just add
>>> an O(1) autoclear bit that maintains all_zero status for the image as
>>> a whole?
>>>
>>
>> My suggestion about L1 entry flag is side thing, I understand
>> difference between O(n) and O(1) :) Still additional L1 entry will
>> help to make efficient large block-status and write-zero requests.
>>
>> And I agree that we need top level flag.. I just try to say, that it
>> seems good to make it similar with existing L2 flag. But yes, it would
>> be incomaptible change, as it marks all clusters as ZERO, and older
>> Qemu can't understand it and may treat all clusters as unallocated.
>>
> 
> Still, how long is this O(n) ? We load the whole L1 into memory anyway.
> For example, 16Tb disk with 64K granularity, we'll have 32768 L1
> entries. Will we get sensible performance benefit with an extension? I
> doubt it now. And anyway, if we have an extension, we should fall back
> to this O(n) if we don't have the flag set.

(Sorry, it’s late and I haven’t followed this particular conversation
too closely, but:)

Keep in mind that the default metadata overlap protection mode causes
all L1 entries to be scanned on each I/O write.  So it can’t be that bad.

Max





Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}

2020-02-05 Thread Max Reitz
On 05.02.20 15:07, Eric Blake wrote:
> On 2/5/20 1:51 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
> +typedef enum {
> +    /*
> + * bdrv_known_zeroes() should include this bit if the contents of
> + * a freshly-created image with no backing file reads as all
> + * zeroes without any additional effort.  If .bdrv_co_truncate is
> + * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.

 I understand that this is preexisting logic, but could I ask: why?
 What's wrong
 if driver can guarantee that created file is all-zero, but is not sure
 about
 file resizing? I agree that it's normal for these flags to have the
 same
 value,
 but what is the reason for this restriction?..
>>>
>>> If areas added by truncation (or growth, rather) are always zero, then
>>> the file can always be created with size 0 and grown from there.  Thus,
>>> images where truncation adds zeroed areas will generally always be zero
>>> after creation.
>>
>> This means that if the truncation bit is set, then the create bit should
>> be set. But here we say that if truncation is clear, then the create bit
>> must be clear.
> 
> Max, did we get the logic backwards?

Or maybe my explanation was just wrong.

Because nobody actually forces a driver to use truncate to ensure that
a newly created file will be 0.  Hm.  And more importantly, you can’t
use truncate with PREALLOC_MODE_OFF when you want to create an image
with preallocation.

Let’s see.  The offending commit message says:

> No .bdrv_has_zero_init() implementation returns 1 if growing the file
> would add non-zero areas (at least with PREALLOC_MODE_OFF), so using it
> in lieu of this new function was always safe.
> 
> But on the other hand, it is possible that growing an image that is not
> zero-initialized would still add a zero-initialized area, like when
> using nonpreallocating truncation on a preallocated image.  For callers
> that care only about truncation, not about creation with potential
> preallocation, this new function is useful.

So I suppose the explanation is just the preallocation mode alone;
has_zero_init() is for the image’s actual preallocation mode, whereas
has_zero_init_truncate() is forced to PREALLOC_MODE_OFF.  As such, the
latter is less strict than the former.  So the former cannot be true
when the latter is false.

 So, the only possible combination of flags, when they differ, is
 create=0 and
 truncate=1.. How is it possible?
>>>
>>> For preallocated qcow2 images, it depends on the storage whether they
>>> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
>>> to bdrv_has_zero_init() of s->data_file->bs.
>>>
>>> But when you truncate them (with PREALLOC_MODE_OFF, as
>>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>>> area is always going to be 0, regardless of initial preallocation.
>>
>> ah yes, due to qcow2 zero clusters.
> 
> Hmm. Do we actually set the zero flag on unallocated clusters when
> resizing a qcow2 image?

No.  They are just unallocated, i.e. zero.  (Nodes with backing files
never return true for bdrv_has_zero_init_truncate anyway).

> That would be an O(n) operation (we have to
> visit the L2 entry for each added cluster, even if only to set the zero
> cluster bit).  Or do we instead just rely on the fact that qcow2 is
> inherently sparse, and that when you resize the guest-visible size
> without writing any new clusters, then it is only subsequent guest
> access to those addresses that finally allocate clusters, making resize
> O(1) (update the qcow2 metadata cluster, but not any L2 tables) while
> still reading 0 from the new data.  To some extent, that's what the
> allocation mode is supposed to control.
> 
> What about with external data images, where a resize in guest-visible
> length requires a resize of the underlying data image?  There, we DO
> have to worry about whether the data image resizes with zeroes (as in
> the filesystem) or with random data (as in a block device).

Well, partially: Namely, only with data_file_raw.  Because otherwise the
clusters are still unallocated and thus read as zero.  So yes, then we
do have to worry about that.

With data_file_raw, we have an obligation to make the data file return
the same data as the qcow2 file, so, um.  I wonder whether we actually
take any care of this yet.  If you have some external data file without
zero_init(_truncate), do you get zeroes when reading from the qcow2 node,
but non-zeroes when reading from the raw data file?  That would be OK
without data_file_raw, but not with it.  I suppose I’ll have to test it.

Max





Re: [PATCH 2/3] m25p80: Improve command handling for Jedec and unsupported commands

2020-02-05 Thread Guenter Roeck
On Wed, Feb 05, 2020 at 11:08:07AM +0100, Cédric Le Goater wrote:
> On 2/4/20 3:28 PM, Guenter Roeck wrote:
> > On 2/4/20 12:53 AM, Cédric Le Goater wrote:
> >> On 2/3/20 7:09 PM, Guenter Roeck wrote:
> >>> Always report 6 bytes of JEDEC data. Fill remaining data with 0.
> >>>
> >>> For unsupported commands, keep sending a value of 0 until the chip
> >>> is deselected.
> >>>
> >>> Both changes avoid attempts to decode random commands. Up to now this
> >>> happened if the reported Jedec data was shorter than 6 bytes but the
> >>> host read 6 bytes, and with all unsupported commands.
> >>
> >> Do you have a concrete example for that ? machine and flash model.
> >>
> > 
> > I noticed it while tracking down the bug fixed in patch 3 of the series,
> > ie with AST2500 evb using w25q256 flash, but it happens with all machines
> > using SPI NOR flash (ie all aspeed bmc machines) when running the Linux
> > kernel. Most of the time it doesn't cause harm, unless the host sends
> > an "address" as part of an unsupported command which happens to include
> > a valid command code.
> 
> ok. we will need to model SFDP one day.
> 
> Are you using the OpenBMC images or do you have your own ? 
> 

I am running images built from upstream/stable kernel branches.

Guenter

> > 
> >>> Signed-off-by: Guenter Roeck 
> >>> ---
> >>>   hw/block/m25p80.c | 10 +-
> >>>   1 file changed, 9 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
> >>> index 63e050d7d3..aca75edcc1 100644
> >>> --- a/hw/block/m25p80.c
> >>> +++ b/hw/block/m25p80.c
> >>> @@ -1040,8 +1040,11 @@ static void decode_new_cmd(Flash *s, uint32_t 
> >>> value)
> >>>   for (i = 0; i < s->pi->id_len; i++) {
> >>>   s->data[i] = s->pi->id[i];
> >>>   }
> >>> +    for (; i < SPI_NOR_MAX_ID_LEN; i++) {
> >>> +    s->data[i] = 0;
> >>> +    }
> >>
> >> It seems that data should be reset in m25p80_cs() also.
> >>
> > Are you sure ?
> > 
> > The current implementation sets s->data[] as needed when command decode
> > is complete. That seems less costly to me.
> 
> ok.
> Reviewed-by: Cédric Le Goater 
> 
> 
> Thanks,
> 
> C.
>  
> > Thanks,
> > Guenter
> > 
> >>>   -    s->len = s->pi->id_len;
> >>> +    s->len = SPI_NOR_MAX_ID_LEN;
> >>>   s->pos = 0;
> >>>   s->state = STATE_READING_DATA;
> >>>   break;
> >>> @@ -1158,6 +1161,11 @@ static void decode_new_cmd(Flash *s, uint32_t 
> >>> value)
> >>>   s->quad_enable = false;
> >>>   break;
> >>>   default:
> >>> +    s->pos = 0;
> >>> +    s->len = 1;
> >>> +    s->state = STATE_READING_DATA;
> >>> +    s->data_read_loop = true;
> >>> +    s->data[0] = 0;
> >>>   qemu_log_mask(LOG_GUEST_ERROR, "M25P80: Unknown cmd %x\n", 
> >>> value);
> >>>   break;
> >>>   }
> >>>
> >>
> > 
> 



Re: [PATCH v2 11/33] block: Unify bdrv_child_cb_attach()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Make bdrv_child_cb_attach() call bdrv_backing_attach() for children with
a COW role (and drop the reverse call from bdrv_backing_attach()), so it
can be used for any child (with a proper role set).

Because so far no child has a proper role set, we need a temporary new
callback for child_backing.attach that ensures bdrv_backing_attach() is
called for all COW children that do not have their role set yet.

Signed-off-by: Max Reitz 
---
  block.c | 16 +++-
  1 file changed, 15 insertions(+), 1 deletion(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 12/33] block: Unify bdrv_child_cb_detach()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Make bdrv_child_cb_detach() call bdrv_backing_detach() for children with
a COW role (and drop the reverse call from bdrv_backing_detach()), so it
can be used for any child (with a proper role set).

Because so far no child has a proper role set, we need a temporary new
callback for child_backing.detach that ensures bdrv_backing_detach() is
called for all COW children that do not have their role set yet.

Signed-off-by: Max Reitz 
---
  block.c | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)


Looks oddly familiar after 11/33 :)
Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}

2020-02-05 Thread Max Reitz
On 04.02.20 20:03, Eric Blake wrote:
> On 2/4/20 11:53 AM, Max Reitz wrote:
>> On 31.01.20 18:44, Eric Blake wrote:
>>> Having two slightly-different function names for related purposes is
>>> unwieldy, especially since I envision adding yet another notion of
>>> zero support in an upcoming patch.  It doesn't help that
>>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>>> driver could only return 1 when opening an already-existing image
>>> known to be all zeroes; but in reality many drivers always return 1
>>> because it only applies to a just-created image).
>>
>> I don’t find it misleading, I just find it meaningless, which then makes
>> it open to interpretation (or maybe rather s/interpretation/wishful
>> thinking/).
>>
>>> Refactor all uses
>>> to instead have a single function that returns multiple bits of
>>> information, with better naming and documentation.
>>
>> It doesn’t make sense to me.  How exactly is it unwieldy?  In the sense
>> that we have to deal with multiple rather small implementation functions
>> rather than a big one per driver?  Actually, multiple small functions
>> sounds better to me – unless the three implementations share common code.
> 
> Common code for dealing with encryption, backing files, and so on.  It
> felt like I had a lot of code repetition when keeping functions separate.

Well, I suppose “dealing with” means “if (encrypted || has_backing)”, so
duplicating that doesn’t seem too bad.

>> As for the callers, they only want a single flag out of the three, don’t
>> they?  If so, it doesn’t really matter for them.
> 
> The qemu-img.c caller in patch 10 checks ZERO_CREATE | ZERO_OPEN, so we
> DO have situations of checking more than one bit, vs. needing two
> function calls.

Hm, OK.  Not sure if that place would look worse with two function
calls, but, well.

>> In fact, I can imagine that drivers can trivially return
>> BDRV_ZERO_TRUNCATE information (because the preallocation mode is
>> fixed), whereas BDRV_ZERO_CREATE can be a bit more involved, and
>> BDRV_ZERO_OPEN could take even more time because some (constant-time)
>> inquiries have to be done.
> 
> In looking at the rest of the series, drivers were either completely
> trivial (in which case, declaring:
> 
> .bdrv_has_zero_init = bdrv_has_zero_init_1,
> .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
> 
> was a lot wordier than the new:
> 
> .bdrv_known_zeroes = bdrv_known_zeroes_truncate,

Not sure if that’s bad, though.

> ), or completely spelled out but where both creation and truncation were
> determined in the same amount of effort.

Well, usually, the effort is minimal, but OK.

>> And thus callers which just want the trivially obtainable
>> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
>> even though they don’t care about that flag.
> 
> True, but only to a minor extent; and the documentation mentions that
> the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
> block_status loop.

So it must be less expensive than an arbitrarily complex loop.  I think
a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?

What I’m trying to say is that this is not a good limit and can mean
anything.

I do think this limit definition makes sense for callers that want to
know about ZERO_OPEN.  But I don’t know why we would have to let other
callers wait, too.

> Meanwhile, callers tend to only care about
> bdrv_known_zeroes() right after opening an image or right before
> resizing (not repeatedly during runtime);

Hm, yes.  I was thinking of parallels, but that only checks once in
parallels_open(), so it’s OK.

> and you also argued elsewhere
> in this thread that it may be worth having the block layer cache
> BDRV_ZERO_OPEN and update the cache on any write,

I didn’t say the block layer, but if it makes sense.

> at which point, the
> expense in the driver callback really is a one-time call during
> bdrv_co_open().

It definitely doesn’t make sense to me to do that call unconditionally
in bdrv_co_open().

> And in that case, whether the one-time expense is done
> via a single function call or via three driver callbacks, the amount of
> work is the same; but the driver callback interface is easier if there
> is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
> calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
> though BlockDriverInfo tracks much more than that boolean).
> 
> In fact, it may be worth consolidating known zeroes support into
> BlockDriverInfo.

I’m very skeptical of that.  BDI already has the problem that it doesn’t
know which of the information the caller actually wants and that it is
sometimes used in a quasi-hot path.

Maybe that means it is indeed time to incorporate it into BDI, but the
caller should have a way of specifying what parts of BDI it actually
needs and then drivers can skip anything that isn’t trivially obtainable
that the caller doesn’t need.

>> So I’d leave it as separate

Re: [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag

2020-02-05 Thread Max Reitz
On 04.02.20 18:50, Eric Blake wrote:
> On 2/4/20 11:34 AM, Max Reitz wrote:
> 
>>> +++ b/include/block/block.h
>>> @@ -105,6 +105,16 @@ typedef enum {
>>>    * for drivers that set .bdrv_co_truncate.
>>>    */
>>>   BDRV_ZERO_TRUNCATE  = 0x2,
>>> +
>>> +    /*
>>> + * bdrv_known_zeroes() should include this bit if an image is
>>> + * known to read as all zeroes when first opened; this bit should
>>> + * not be relied on after any writes to the image.
>>
>> Is there a good reason for this?  Because to me this screams like we are
>> going to check this flag without ensuring that the image has actually
>> not been written to yet.  So if it’s generally easy for drivers to stop
>> reporting this flag after a write, then maybe we should do so.
> 
> In patch 15 (implementing things in qcow2), I actually wrote the driver
> to return live results, rather than just open-time results, in part
> because writing the bit to persistent storage in qcow2 means that the
> bit must be accurate, without relying on the block layer's help.
> 
> But for my pending NBD patch (not posted yet, but it will be soon), the
> proposal I'm making for the NBD protocol itself is just open-time, not
> live, and so it would be more work than necessary to make the NBD driver
> report live results.
> 
> But it seems like it should be easy enough to also patch the block layer
> itself to guarantee that callers of bdrv_known_zeroes() cannot see this
> bit set if the block layer has been used in any non-zero transaction, by
> repeating the same logic as used in qcow2 to kill the bit (any
> write/write_compressed/bdrv_copy clear the bit, any trim clears the bit
> if the driver does not guarantee trim reads as zero, any truncate clears
> the bit if the driver does not guarantee truncate reads as zero, etc).
> Basically, the block layer would cache the results of .bdrv_known_zeroes
> during .bdrv_co_open, bdrv_co_pwrite() and friends would update that
> cache, and and bdrv_known_zeroes() would report the cached value rather
> than a fresh call to .bdrv_known_zeroes.

Sounds reasonable to me in general, but I’d prefer it to be fetched
on-demand rather than unconditionally in bdrv_open().

(I realize that this means a tri-state of “known false”, “known true”,
and “not yet queried”.)
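
A minimal sketch of such a lazily filled tri-state (illustrative only; the
driver hook stands in for whatever .bdrv_known_zeroes ends up looking
like, and the write hook is where bdrv_co_pwrite() and friends would
invalidate the cache):

    #include <stdbool.h>

    typedef enum {
        ZERO_OPEN_UNQUERIED, /* driver not asked yet */
        ZERO_OPEN_FALSE,     /* not (or no longer) known to read as zero */
        ZERO_OPEN_TRUE,      /* known to read as all zeroes */
    } ZeroOpenState;

    typedef struct BDS BDS;
    struct BDS {
        ZeroOpenState zero_open;
        bool (*drv_reads_as_zero)(BDS *bs); /* queried on demand */
    };

    static bool known_zeroes(BDS *bs)
    {
        if (bs->zero_open == ZERO_OPEN_UNQUERIED) {
            bs->zero_open = bs->drv_reads_as_zero(bs) ? ZERO_OPEN_TRUE
                                                      : ZERO_OPEN_FALSE;
        }
        return bs->zero_open == ZERO_OPEN_TRUE;
    }

    /* any write/discard/truncate that may leave non-zero data */
    static void note_write(BDS *bs)
    {
        bs->zero_open = ZERO_OPEN_FALSE;
    }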

> Are we worried enough about clients of this interface to make the block
> layer more robust?  (From the maintenance standpoint, the more the block
> layer guarantees, the easier it is to write code that uses the block
> layer; but there is the counter-argument that making the block layer
> track whether an image has been modified means a [slight] penalty to
> every write request to update the boolean.)

Just like Vladimir, I’m worried about repeating the same mistakes we
have made before: That is, most places that called bdrv_has_zero_init() just
did so out of wishful thinking, hoping that it would do what they need
it to.  It didn’t.

Max





Re: [PATCH 1/3] m25p80: Convert to support tracing

2020-02-05 Thread Cédric Le Goater
On 2/5/20 5:35 PM, Guenter Roeck wrote:
> On Wed, Feb 05, 2020 at 11:05:04AM +0100, Cédric Le Goater wrote:
>> On 2/3/20 7:09 PM, Guenter Roeck wrote:
>>> While at it, add some trace messages to help debug problems
>>> seen when running the latest Linux kernel.
>>
>> In case you resend, It would be nice to printout a flash id in the tracing
>> else I have a patch for it on top of yours. It helped me track a squashfs
>> corruption on the Aspeed witherspoon-bmc machine which we were after since
>> 2017 or so. It seems to be a kernel bug.
>>
> 
> I'll send a new version to split patch 2. Not sure I understand what you mean
> with the above. If you send me your patch I'll be happy to merge it into mine,
> otherwise we can just keep it as a follow-up patch.

Here is the idea : 

  
https://github.com/legoater/qemu/commit/a07727e9cfc8447ea18249ff68a561f7e8883584

You can merge and maybe extend to all traces.


In the issue we had, two CS lines could be selected at the same time
and the SPI transfers were getting mixed up. Printing out which
CS is doing what is useful for debugging.

Thanks,
C.



Re: [PATCH v2 10/33] block: Use bdrv_inherited_options()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Let child_file's, child_format's, and child_backing's .inherit_options()
implementations fall back to bdrv_inherited_options() to show that it
would really work for all of these cases, if only the parents passed the
appropriate BdrvChildRole and parent_is_format values.

(Also, make bdrv_open_inherit(), the only place to explicitly call
bdrv_backing_options(), call bdrv_inherited_options() instead.)

This patch should incur only two visible changes, both for child_format
children, both of which are effectively bug fixes:

First, they no longer have discard=unmap set by default.  The reason it
was set is that bdrv_inherited_fmt_options() fell through to
bdrv_protocol_options(), and that set it because "format drivers take
care to send flushes and respect unmap policy".  None of the drivers
that use child_format for their children (quorum and blkverify) are
format drivers, though, so this reasoning does not apply here.

Second, they no longer have BDRV_O_NO_IO force-cleared.  child_format
was used solely for children that do not store any metadata and as such
will not be accessed by their parents as long as those parents do not
receive I/O themselves.  Thus, such children should inherit
BDRV_O_NO_IO.

Signed-off-by: Max Reitz 
---
  block.c | 66 -
  1 file changed, 14 insertions(+), 52 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 00/17] Improve qcow2 all-zero detection

2020-02-05 Thread Max Reitz
On 04.02.20 19:53, Eric Blake wrote:
> On 2/4/20 11:32 AM, Max Reitz wrote:
>> On 31.01.20 18:44, Eric Blake wrote:
>>> Based-on: <20200124103458.1525982-2-david.edmond...@oracle.com>
>>> ([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)
>>>
>>> I'm working on adding an NBD extension that reports whether an image
>>> is already all zero when the client first connects.  I initially
>>> thought I could write the NBD code to just call bdrv_has_zero_init(),
>>> but that turned out to be a bad assumption that instead resulted in
>>> this patch series.  The NBD patch will come later (and cross-posted to
>>> the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
>>> repositories).
>>
>> We had a discussion about this on IRC, and as far as I remember I wasn’t
>> quite sold on the “why”.  So, again, I wonder why this is needed.
>>
>> I mean, it does make intuitive sense to want to know whether an image is
>> fully zero, but if I continue thinking about it I don’t know any case
>> where we would need to figure it out and where we could accept “We don’t
>> know” as an answer.  So I’m looking for use cases, but this cover letter
>> doesn’t mention any.  (And from a quick glance I don’t see this series
>> using the flag, actually.)
> 
> Patch 10/17 has:
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index e60217e6c382..c8519a74f738 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1985,10 +1985,12 @@ static int convert_do_copy(ImgConvertState *s)
>  int64_t sector_num = 0;
> 
>  /* Check whether we have zero initialisation or can get it
> efficiently */
> -    if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
> -    !s->target_has_backing) {
> -    s->has_zero_init = !!(bdrv_known_zeroes(blk_bs(s->target)) &
> -  BDRV_ZERO_CREATE);
> +    if (!s->has_zero_init && s->min_sparse && !s->target_has_backing) {
> +    ret = bdrv_known_zeroes(blk_bs(s->target));
> +    if (ret & BDRV_ZERO_OPEN ||
> +    (s->target_is_new && ret & BDRV_ZERO_CREATE)) {
> +    s->has_zero_init = true;
> +    }
>  }

OK, I expected users to come in a separate patch.

> That's the use case: when copying into a destination file, it's useful
> to know if the destination already reads as all zeroes, before
> attempting a fallback to bdrv_make_zero(BDRV_REQ_NO_FALLBACK) or calls
> to block status checking for holes.

But that was my point on IRC.  Is it really more useful if
bdrv_make_zero() is just as quick?  (And the fact that NBD doesn’t have
an implementation looks more like a problem with NBD to me.)

(Considering that at least the code we discussed on IRC didn’t work for
preallocated images, which was the one point where we actually have a
problem in practice.)

>> (We have a use case with convert -n to freshly created image files, but
>> my position on this on IRC was that we want the --target-is-zero flag
>> for that anyway: Auto-detection may always break, our preferred default
>> behavior may always change, so if you want convert -n not to touch the
>> target image except to write non-zero data from the source, we need a
>> --target-is-zero flag and users need to use it.  Well, management
>> layers, because I don’t think users would use convert -n anyway.
>>
>> And with --target-is-zero and users effectively having to use it, I
>> don’t think that’s a good example of a use case.)
> 
> Yes, there will still be cases where you have to use --target-is-zero
> because the image itself couldn't report that it already reads as
> zeroes, but there are also enough cases where the destination is already
> known to read zeroes and it's a shame to tell the user that 'you have to
> add --target-is-zero to get faster copying even though we could have
> inferred it on your behalf'.

How is it a shame?  I think only management tools would use convert -n.
 Management tools want reliable behavior.  If you want reliable
behavior, you have to use --target-is-zero anyway.  So I don’t see the
actual benefit for qemu-img convert.

>> I suppose there is the point of blockdev-create + blockdev-mirror: This
>> has exactly the same problem as convert -n.  But again, if you really
>> want blockdev-mirror not just to force-zero the image, you probably need
>> to tell it so explicitly (i.e., with a --target-is-zero flag for
>> blockdev-mirror).
>>
>> (Well, I suppose we could save us a target-is-zero for mirror if we took
>> this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
>> But, well, please no.)
>>
>> But maybe I’m just an idiot and there is no reason not to take this
>> series and make blockdev-create + blockdev-mirror do the sensible thing
>> by default in most cases. *shrug*
> 
> My argument for taking the series _is_ that the common case can be made
> more efficient without user effort.

The thing is, I don’t see the user effort.  I don’t think users use
convert -n or backup manually.  And for management tools, it isn’t
really effort

Re: [PATCH v2 09/33] block: Add generic bdrv_inherited_options()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

After the series this patch belongs to, we want to have a common
BdrvChildClass that encompasses all of child_file, child_format, and
child_backing.  Such a single class needs a single .inherit_options()
implementation, and this patch introduces it.

The next patch will show how the existing implementations can fall back
to it just by passing appropriate BdrvChildRole and parent_is_format
values.

Signed-off-by: Max Reitz 
---
  block.c | 84 +
  1 file changed, 84 insertions(+)



No impact until the next patch, but the division of patches was wise.



+/*
+ * First, decide whether to set, clear, or leave BDRV_O_PROTOCOL.
+ * Generally, the question to answer is: Should this child be
+ * format-probed by default?
+ */
+
+/*
+ * Pure and non-filtered data children of non-format nodes should
+ * be probed by default (even when the node itself has BDRV_O_PROTOCOL
+ * set).  This only affects a very limited set of drivers (namely
+ * quorum and blkverify when this comment was written).
+ * Force-clear BDRV_O_PROTOCOL then.
+ */
+if (!parent_is_format &&
+(role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA |
+ BDRV_CHILD_FILTERED)) ==
+BDRV_CHILD_DATA)
+{
+flags &= ~BDRV_O_PROTOCOL;
+}
+
+/*
+ * All children of format nodes (except for COW children) and all
+ * metadata children in general should never be format-probed.
+ * Force-set BDRV_O_PROTOCOL then.
+ */
+if ((parent_is_format && !(role & BDRV_CHILD_COW)) ||
+(role & BDRV_CHILD_METADATA))


Should this use 'else if', to make it obvious that we never have a path 
that both force-clears and force-sets BDRV_O_PROTOCOL?  But a careful 
reading shows that the two 'if' are mutually exclusive, even without the 
second using 'else if'.



+{
+flags |= BDRV_O_PROTOCOL;
+}
+


Looks good!  Lots of decision trees, but also lots of good comments 
backing up that complexity.

Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




[PATCH 1/2] nbd/proto: introduce structured request

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
Idea: reuse structured replies as requests too. For this:

Rename the structured reply "structure" to structured message and,
correspondingly, the structured reply chunk to structured message chunk.

Keep the name "structured reply" for structured messages sent by the
server, and use "structured request" for a structured message sent by
the client.

Share almost all semantics around structured messages between client and
server, except for the chunk type (as reply types and request types are
orthogonal things). Still, share the none-type chunk between server and
client.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 doc/proto.md | 104 +++
 1 file changed, 72 insertions(+), 32 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index 4b067f5..cb0ac56 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -279,12 +279,13 @@ a soft disconnect.
 
 ### Transmission
 
-There are three message types in the transmission phase: the request,
-the simple reply, and the structured reply chunk.  The
+There are three message types in the transmission phase: the simple request,
+the simple reply, and the structured message chunk.  The
 transmission phase consists of a series of transactions, where the
-client submits requests and the server sends corresponding replies
-with either a single simple reply or a series of one or more
-structured reply chunks per request.  The phase continues until
+client submits simple requests or structured requests, and the server sends 
corresponding replies
+with either a single simple reply or a structured reply.  Both structured
+request and structured reply are represented by structured message,
+which in turn is a series of one or more structured message chunks.  The phase 
continues until
 either side terminates transmission; this can be performed cleanly
 only by the client.
 
@@ -295,6 +296,9 @@ reply is also problematic for error handling of the 
`NBD_CMD_READ`
 request.  Therefore, structured replies can be used to create a
 a context-free server stream; see below.
 
+Similarly, without server negotiation, the client MUST use only simple
+requests.
+
 Replies need not be sent in the same order as requests (i.e., requests
 may be handled by the server asynchronously), and structured reply
 chunks from one request may be interleaved with reply messages from
@@ -369,7 +373,7 @@ S: 64 bits, handle
 S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
 *error* is zero)  
 
- Structured reply chunk message
+ Structured message chunk
 
 Some of the major downsides of the default simple reply to
 `NBD_CMD_READ` are as follows.  First, it is not possible to support
@@ -382,9 +386,11 @@ possible to reliably decode the server traffic without 
also having
 context of what pending read requests were sent by the client.
 Therefore structured replies are also permitted if negotiated.
 
-A structured reply in the transmission phase consists of one or
-more structured reply chunk messages.  The server MUST NOT send
-this reply type unless the client has successfully negotiated
+A structured message in the transmission phase consists of one or
+more structured message chunks.
+
+The server MUST NOT send structured reply
+type unless the client has successfully negotiated
 structured replies via `NBD_OPT_STRUCTURED_REPLY`.  Conversely, if
 structured replies are negotiated, the server MUST use a
 structured reply for any response with a payload, and MUST NOT use
@@ -394,12 +400,15 @@ structured reply to all other requests.  The server 
SHOULD prefer
 sending errors via a structured reply, as the error can then be
 accompanied by a string payload to present to a human user.
 
-A structured reply MAY occupy multiple structured chunk messages
+The client MUST NOT send structured messages unless
+NBD_FLAG_STRUCTURED_REQUEST is negotiated by the server.
+
+A structured message MAY occupy multiple structured message chunks
 (all with the same value for "handle"), and the
-`NBD_REPLY_FLAG_DONE` reply flag is used to identify the final
+`NBD_STRUCTURED_FLAG_DONE` reply flag is used to identify the final
 chunk.  Unless further documented by individual requests below,
 the chunks MAY be sent in any order, except that the chunk with
-the flag `NBD_REPLY_FLAG_DONE` MUST be sent last.  Even when a
+the flag `NBD_STRUCTURED_FLAG_DONE` MUST be sent last.  Even when a
 command documents further constraints between chunks of one reply,
 it is always safe to interleave chunks of that reply with messages
 related to other requests.  A server SHOULD try to minimize the
@@ -412,7 +421,7 @@ on the chunks received.
 
 A structured reply chunk message looks as follows:
 
-S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
+S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_MESSAGE_MAGIC`)
 S: 16 bits, flags  
 S: 16 bits, type  
 S: 64 bits, handle  
@@ -1085,6 +1094,8 @@ The field has the following format:
   will be faster than a regular write). Clients MUST NOT set

[PATCH 2/2] nbd/proto: add NBD_CMD_WRITE_ZEROES64

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
Add a new structured request type to represent a 64-bit version of
NBD_CMD_WRITE_ZEROES.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 doc/proto.md | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/doc/proto.md b/doc/proto.md
index cb0ac56..378a800 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -1096,6 +1096,8 @@ The field has the following format:
   is set.
 - bit 12, `NBD_FLAG_STRUCTURED_REQUEST`; allow clients to use
   structured requests.
+- bit 13, `NBD_FLAG_SEND_WRITE_ZEROES64`: documents that the server
+  understands `NBD_CMD_WRITE_ZEROES64` structured request chunk type.
 
 Clients SHOULD ignore unknown flags.
 
@@ -1866,6 +1868,25 @@ NBD_STRUCTURED_FLAG_DONE (NBD_NONE_CHUNK may be used for 
this).
   represents a no-op command, which SHOULD be replied with no errors and
   may be used like ping, to check server availability.
 
+* `NBD_CMD_WRITE_ZEROES64` (6)
+
+  NBD_CMD_WRITE_ZEROES representation with 64bit length.
+
+  The payload is structured as:
+
+  16 bits: command flags
+  64 bits: offset (unsigned)
+  64 bits: length (unsigned)
+
+  The fields have exactly the same meaning as the corresponding fields of
+  the NBD_CMD_WRITE_ZEROES request.
+
+  This request chunk type MUST be the only chunk of the structured message
+  and therefore MUST be accompanied by NBD_STRUCTURED_FLAG_DONE.
+
+  This request chunk type is negotiated by server flag
+  NBD_FLAG_SEND_WRITE_ZEROES64, and MUST NOT be used otherwise.
+
  Request types
 
 The following request types exist:
@@ -2138,6 +2159,11 @@ The following request types exist:
 including one or more sectors beyond the size of the device. It SHOULD
 return `NBD_EPERM` if it receives a write zeroes request on a read-only 
export.
 
+If structured requests are negotiated by the server, the client MAY use
+the 64-bit variant of the command, which has exactly the same behavior;
+the only difference is the 64-bit length field. See NBD_CMD_WRITE_ZEROES64
+above.
+
 * `NBD_CMD_BLOCK_STATUS` (7)
 
 A block status query request. Length and offset define the range
-- 
2.21.0




Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}

2020-02-05 Thread Max Reitz
On 04.02.20 18:51, Eric Blake wrote:
> On 2/4/20 11:42 AM, Max Reitz wrote:
> 
>>>
>>> I understand that this is preexisting logic, but could I ask: why?
>>> What's wrong
>>> if driver can guarantee that created file is all-zero, but is not sure
>>> about
>>> file resizing? I agree that it's normal for these flags to have the same
>>> value,
>>> but what is the reason for this restriction?..
>>
>> If areas added by truncation (or growth, rather) are always zero, then
>> the file can always be created with size 0 and grown from there.  Thus,
>> images where truncation adds zeroed areas will generally always be zero
>> after creation.
>>
>>> So, the only possible combination of flags, when they differ, is
>>> create=0 and
>>> truncate=1... How is it possible?
>>
>> For preallocated qcow2 images, it depends on the storage whether they
>> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
>> to bdrv_has_zero_init() of s->data_file->bs.
>>
>> But when you truncate them (with PREALLOC_MODE_OFF, as
>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>> area is always going to be 0, regardless of initial preallocation.
>>
>>
>> I just noticed a bug there, though: Encrypted qcow2 images will not see
>> areas added through growth as 0.  Hence, qcow2’s
>> bdrv_has_zero_init_truncate() implementation should not return true
>> unconditionally, but only for unencrypted images.
> 
> Hence patch 5 earlier in the series :)

Ah, good. :-)

Max





[PATCH 0/2] Structured requests and 64bit commands

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
Hi all. This is a continuation of thread
"NBD extended (or structured?) request"

Here is a very drafty draft of the feature, to discuss the general
design.

Vladimir Sementsov-Ogievskiy (2):
  nbd/proto: introduce structured request
  nbd/proto: add NBD_CMD_WRITE_ZEROES64

 doc/proto.md | 130 ++-
 1 file changed, 98 insertions(+), 32 deletions(-)

-- 
2.21.0




Re: [PATCH v2 08/33] block: Rename bdrv_inherited_options()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

The other two .inherit_options implementations specify exactly for what
case they are used in their name, so do it for this one as well.

(The actual intention behind this patch is to follow it up with a
generic bdrv_inherited_options() that works for all three cases.)

Signed-off-by: Max Reitz 
---
  block.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 1/3] m25p80: Convert to support tracing

2020-02-05 Thread Guenter Roeck
On Wed, Feb 05, 2020 at 11:05:04AM +0100, Cédric Le Goater wrote:
> On 2/3/20 7:09 PM, Guenter Roeck wrote:
> > While at it, add some trace messages to help debug problems
> > seen when running the latest Linux kernel.
> 
> In case you resend, it would be nice to print out the flash id in the tracing;
> otherwise I have a patch for it on top of yours. It helped me track down a
> squashfs corruption on the Aspeed witherspoon-bmc machine which we had been
> chasing since 2017 or so. It seems to be a kernel bug.
> 

I'll send a new version to split patch 2. Not sure I understand what you mean
with the above. If you send me your patch I'll be happy to merge it into mine,
otherwise we can just keep it as a follow-up patch.

Thanks,
Guenter



Re: [PATCH v2 07/33] block: Pass parent_is_format to .inherit_options()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

We plan to unify the generic .inherit_options() functions.  The
resulting common function will need to decide whether to force-enable
format probing, force-disable it, or leave it as-is.  To make this
decision, it will need to know whether the parent node is a format node
or not (because we never want format probing if the parent is a format
node already (except for the backing chain)).

Signed-off-by: Max Reitz 
---
  block.c   | 37 +++--
  block/block-backend.c |  2 +-
  block/vvfat.c |  2 +-
  include/block/block_int.h |  2 +-
  4 files changed, 30 insertions(+), 13 deletions(-)




@@ -2992,8 +2994,22 @@ static BlockDriverState *bdrv_open_inherit(const char 
*filename,
  bs->explicit_options = qdict_clone_shallow(options);
  
  if (child_class) {

+bool parent_is_format;
+
+if (parent->drv) {
+parent_is_format = parent->drv->is_format;
+} else {
+/*
+ * parent->drv is not set yet because this node is opened for
+ * (potential) format probing.  That means that @parent is going
+ * to be a format node.
+ */
+parent_is_format = true;
+}


Nice comment.

Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 06/33] block: Pass BdrvChildRole to .inherit_options()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Signed-off-by: Max Reitz 


Another sparse commit message.


---
  block.c   | 40 +++
  block/block-backend.c |  3 ++-
  block/vvfat.c |  3 ++-
  include/block/block_int.h |  3 ++-
  4 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/block.c b/block.c
index 7fa7830428..612e86fab9 100644
--- a/block.c
+++ b/block.c
@@ -77,6 +77,7 @@ static BlockDriverState *bdrv_open_inherit(const char 
*filename,
 QDict *options, int flags,
 BlockDriverState *parent,
 const BdrvChildClass *child_class,
+   BdrvChildRole child_role,


And another question about proper typing.

And another:
Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 05/33] block: Pass BdrvChildRole to bdrv_child_perm()

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

Signed-off-by: Max Reitz 


Sparse commit message - is the intent that this is no semantic change 
and just adding a parameter?



---
  block.c | 22 --
  block/backup-top.c  |  3 ++-
  block/blkdebug.c|  5 +++--
  block/blklogwrites.c|  9 +
  block/commit.c  |  1 +
  block/copy-on-read.c|  1 +
  block/mirror.c  |  1 +
  block/quorum.c  |  1 +
  block/replication.c |  1 +
  block/vvfat.c   |  1 +
  include/block/block_int.h   |  5 -
  tests/test-bdrv-drain.c |  5 +++--
  tests/test-bdrv-graph-mod.c |  1 +
  13 files changed, 36 insertions(+), 20 deletions(-)

diff --git a/block.c b/block.c
index c576377650..7fa7830428 100644
--- a/block.c
+++ b/block.c
@@ -1764,12 +1764,12 @@ bool bdrv_is_writable(BlockDriverState *bs)
  
  static void bdrv_child_perm(BlockDriverState *bs, BlockDriverState *child_bs,

  BdrvChild *c, const BdrvChildClass *child_class,
-BlockReopenQueue *reopen_queue,
+BdrvChildRole role, BlockReopenQueue *reopen_queue,


Again, using an enum name where a bitmask of non-enum values is 
self-documenting, but somewhat abusive of C's loose type system.  Is an 
unsigned int any better?


Looks mechanical enough. I'd like a better commit message, but:
Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH for-5.0 v2 10/23] quorum: Implement .bdrv_recurse_can_replace()

2020-02-05 Thread Kevin Wolf
On 05.02.2020 at 16:55, Kevin Wolf wrote:
> On 11.11.2019 at 17:02, Max Reitz wrote:
> > Signed-off-by: Max Reitz 
> > ---
> >  block/quorum.c | 62 ++
> >  1 file changed, 62 insertions(+)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index 3a824e77e3..8ee03e9baf 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -825,6 +825,67 @@ static bool 
> > quorum_recurse_is_first_non_filter(BlockDriverState *bs,
> >  return false;
> >  }
> >  
> > +static bool quorum_recurse_can_replace(BlockDriverState *bs,
> > +   BlockDriverState *to_replace)
> > +{
> > +BDRVQuorumState *s = bs->opaque;
> > +int i;
> > +
> > +for (i = 0; i < s->num_children; i++) {
> > +/*
> > + * We have no idea whether our children show the same data as
> > + * this node (@bs).  It is actually highly likely that
> > + * @to_replace does not, because replacing a broken child is
> > + * one of the main use cases here.
> > + *
> > + * We do know that the new BDS will match @bs, so replacing
> > + * any of our children by it will be safe.  It cannot change
> > + * the data this quorum node presents to its parents.
> > + *
> > + * However, replacing @to_replace by @bs in any of our
> > + * children's chains may change visible data somewhere in
> > + * there.  We therefore cannot recurse down those chains with
> > + * bdrv_recurse_can_replace().
> > + * (More formally, bdrv_recurse_can_replace() requires that
> > + * @to_replace will be replaced by something matching the @bs
> > + * passed to it.  We cannot guarantee that.)
> > + *
> > + * Thus, we can only check whether any of our immediate
> > + * children matches @to_replace.
> > + *
> > + * (In the future, we might add a function to recurse down a
> > + * chain that checks that nothing there cares about a change
> > + * in data from the respective child in question.  For
> > + * example, most filters do not care when their child's data
> > + * suddenly changes, as long as their parents do not care.)
> > + */
> > +if (s->children[i].child->bs == to_replace) {
> > +Error *local_err = NULL;
> > +
> > +/*
> > + * We now have to ensure that there is no other parent
> > + * that cares about replacing this child by a node with
> > + * potentially different data.
> > + */
> > +s->children[i].to_be_replaced = true;
> > +bdrv_child_refresh_perms(bs, s->children[i].child, &local_err);
> > +
> > +/* Revert permissions */
> > +s->children[i].to_be_replaced = false;
> > +bdrv_child_refresh_perms(bs, s->children[i].child, 
> > &error_abort);
> 
> Quite a hack. The two obvious problems are:
> 
> 1. We can't guarantee that we can actually revert the permissions. I
>think we ignore failure to loosen permissions meanwhile so that at
>least the &error_abort doesn't trigger, but bs could still be in the
>wrong state afterwards.
> 
>It would be cleaner to use check+abort instead of actually setting
>the new permission.
> 
> 2. As aborting the permission change makes more obvious, we're checking
>something that might not be true any more when we actually make the
>change.
> 
> Pragmatically, a hack might be good enough here, but it should be
> documented as such (with a short explanation of its shortcomings) at
> least.

Oops, meant to send this as a comment for v3 (which I did apply locally
for review).

Kevin




Re: [PATCH for-5.0 v2 11/23] block: Use bdrv_recurse_can_replace()

2020-02-05 Thread Kevin Wolf
On 11.11.2019 at 17:02, Max Reitz wrote:
> Let check_to_replace_node() use the more specialized
> bdrv_recurse_can_replace() instead of
> bdrv_recurse_is_first_non_filter(), which is too restrictive.
> 
> Signed-off-by: Max Reitz 
> ---
>  block.c | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/block.c b/block.c
> index de53addeb0..7608f21570 100644
> --- a/block.c
> +++ b/block.c
> @@ -6243,6 +6243,17 @@ bool bdrv_recurse_can_replace(BlockDriverState *bs,
>  return false;
>  }
>  
> +/*
> + * Check whether the given @node_name can be replaced by a node that
> + * has the same data as @parent_bs.  If so, return @node_name's BDS;
> + * NULL otherwise.
> + *
> + * @node_name must be a (recursive) *child of @parent_bs (or this
> + * function will return NULL).
> + *
> + * The result (whether the node can be replaced or not) is only valid
> + * for as long as no graph changes occur.
> + */
>  BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
>  const char *node_name, Error **errp)
>  {
> @@ -6267,8 +6278,11 @@ BlockDriverState 
> *check_to_replace_node(BlockDriverState *parent_bs,
>   * Another benefit is that this tests exclude backing files which are
>   * blocked by the backing blockers.
>   */
> -if (!bdrv_recurse_is_first_non_filter(parent_bs, to_replace_bs)) {
> -error_setg(errp, "Only top most non filter can be replaced");
> +if (!bdrv_recurse_can_replace(parent_bs, to_replace_bs)) {
> +error_setg(errp, "Cannot replace '%s' by a node mirrored from '%s', "
> +   "because it cannot be guaranteed that doing so would not "
> +   "lead to an abrupt change of visible data",
> +   node_name, parent_bs->node_name);

If this function is only supposed to be used in the context of the
mirror job, moving it into block/mirror.c could be considered as a
cleanup on top.

Kevin




Re: [PULL 15/18] qemu-img: adds option to use aio engine for benchmarking

2020-02-05 Thread Julia Suvorova
On Mon, Feb 3, 2020 at 11:56 AM Peter Maydell  wrote:
>
> On Thu, 30 Jan 2020 at 21:32, Stefan Hajnoczi  wrote:
> >
> > From: Aarushi Mehta 
> >
> > Signed-off-by: Aarushi Mehta 
> > Acked-by: Stefano Garzarella 
> > Signed-off-by: Stefan Hajnoczi 
> > Message-id: 20200120141858.587874-13-stefa...@redhat.com
> > Message-Id: <20200120141858.587874-13-stefa...@redhat.com>
> > Signed-off-by: Stefan Hajnoczi 
>
> > --- a/qemu-img-cmds.hx
> > +++ b/qemu-img-cmds.hx
> > @@ -20,9 +20,9 @@ STEXI
> >  ETEXI
> >
> >  DEF("bench", img_bench,
> > -"bench [-c count] [-d depth] [-f fmt] 
> > [--flush-interval=flush_interval] [-n] [--no-drain] [-o offset] 
> > [--pattern=pattern] [-q] [-s buffer_size] [-S step_size] [-t cache] [-w] 
> > [-U] filename")
> > +"bench [-c count] [-d depth] [-f fmt] 
> > [--flush-interval=flush_interval] [-n] [--no-drain] [-o offset] 
> > [--pattern=pattern] [-q] [-s buffer_size] [-S step_size] [-t cache] [-i 
> > aio] [-w] [-U] filename")
> >  STEXI
> > -@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] 
> > [--flush-interval=@var{flush_interval}] [-n] [--no-drain] [-o @var{offset}] 
> > [--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-S @var{step_size}] 
> > [-t @var{cache}] [-w] [-U] @var{filename}
> > +@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] 
> > [--flush-interval=@var{flush_interval}] [-n] [--no-drain] [-o @var{offset}] 
> > [--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-S @var{step_size}] 
> > [-t @var{cache}] [-i @var{aio}] [-w] [-U] @var{filename}
> >  ETEXI
>
> > diff --git a/qemu-img.texi b/qemu-img.texi
> > index b5156d6316..20136fcb94 100644
> > --- a/qemu-img.texi
> > +++ b/qemu-img.texi
> > @@ -206,7 +206,7 @@ Command description:
> >  Amends the image format specific @var{options} for the image file
> >  @var{filename}. Not all file formats support this operation.
> >
> > -@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] 
> > [--flush-interval=@var{flush_interval}] [-n] [--no-drain] [-o @var{offset}] 
> > [--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-S @var{step_size}] 
> > [-t @var{cache}] [-w] [-U] @var{filename}
> > +@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] 
> > [--flush-interval=@var{flush_interval}] [-n] [-i @var{aio}] [--no-drain] 
> > [-o @var{offset}] [--pattern=@var{pattern}] [-q] [-s @var{buffer_size}] [-S 
> > @var{step_size}] [-t @var{cache}] [-w] [-U] @var{filename}
>
> Is there a reason why the new '-i aio' option is added to the synopsis
> line after '-t cache' in the DEF() line and the STEXI/ETEXI fragment,
> but after '-n' in the line in the qemu-img.texi file ?

No reason, just an accident.

> All the other options here are in alphabetical order, so logically
> '-i aio' should go in neither of those two places but after
> '--flush-interval'...
>
> (This change is a conflict with the in-flight qemu-img conversion
> to rST; to fix that up I'm going to just change the rST conversion
> to exactly follow the texi here; we can fix the option ordering
> as a followup patch.)


Ok. I'll send a follow-up patch.

Best regards, Julia Suvorova.




Re: [PATCH for-5.0 v2 10/23] quorum: Implement .bdrv_recurse_can_replace()

2020-02-05 Thread Kevin Wolf
On 11.11.2019 at 17:02, Max Reitz wrote:
> Signed-off-by: Max Reitz 
> ---
>  block/quorum.c | 62 ++
>  1 file changed, 62 insertions(+)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index 3a824e77e3..8ee03e9baf 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -825,6 +825,67 @@ static bool 
> quorum_recurse_is_first_non_filter(BlockDriverState *bs,
>  return false;
>  }
>  
> +static bool quorum_recurse_can_replace(BlockDriverState *bs,
> +   BlockDriverState *to_replace)
> +{
> +BDRVQuorumState *s = bs->opaque;
> +int i;
> +
> +for (i = 0; i < s->num_children; i++) {
> +/*
> + * We have no idea whether our children show the same data as
> + * this node (@bs).  It is actually highly likely that
> + * @to_replace does not, because replacing a broken child is
> + * one of the main use cases here.
> + *
> + * We do know that the new BDS will match @bs, so replacing
> + * any of our children by it will be safe.  It cannot change
> + * the data this quorum node presents to its parents.
> + *
> + * However, replacing @to_replace by @bs in any of our
> + * children's chains may change visible data somewhere in
> + * there.  We therefore cannot recurse down those chains with
> + * bdrv_recurse_can_replace().
> + * (More formally, bdrv_recurse_can_replace() requires that
> + * @to_replace will be replaced by something matching the @bs
> + * passed to it.  We cannot guarantee that.)
> + *
> + * Thus, we can only check whether any of our immediate
> + * children matches @to_replace.
> + *
> + * (In the future, we might add a function to recurse down a
> + * chain that checks that nothing there cares about a change
> + * in data from the respective child in question.  For
> + * example, most filters do not care when their child's data
> + * suddenly changes, as long as their parents do not care.)
> + */
> +if (s->children[i].child->bs == to_replace) {
> +Error *local_err = NULL;
> +
> +/*
> + * We now have to ensure that there is no other parent
> + * that cares about replacing this child by a node with
> + * potentially different data.
> + */
> +s->children[i].to_be_replaced = true;
> +bdrv_child_refresh_perms(bs, s->children[i].child, &local_err);
> +
> +/* Revert permissions */
> +s->children[i].to_be_replaced = false;
> +bdrv_child_refresh_perms(bs, s->children[i].child, &error_abort);

Quite a hack. The two obvious problems are:

1. We can't guarantee that we can actually revert the permissions. I
   think we ignore failure to loosen permissions meanwhile so that at
   least the &error_abort doesn't trigger, but bs could still be in the
   wrong state afterwards.

   It would be cleaner to use check+abort instead of actually setting
   the new permission.

2. As aborting the permission change makes more obvious, we're checking
   something that might not be true any more when we actually make the
   change.

Pragmatically, a hack might be good enough here, but it should be
documented as such (with a short explanation of its shortcomings) at
least.

Kevin




Re: [PATCH v3 09/21] quorum: Add QuorumChild.to_be_replaced

2020-02-05 Thread Kevin Wolf
On 30.01.2020 at 22:44, Max Reitz wrote:
> We will need this to verify that Quorum can let one of its children be
> replaced without breaking anything else.
> 
> Signed-off-by: Max Reitz 
> ---
>  block/quorum.c | 25 +
>  1 file changed, 25 insertions(+)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index 59cd524502..6a7224c9e4 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -67,6 +67,13 @@ typedef struct QuorumVotes {
>  
>  typedef struct QuorumChild {
>  BdrvChild *child;
> +
> +/*
> + * If set, check whether this node can be replaced without any
> + * other parent noticing: Unshare CONSISTENT_READ, and take the
> + * WRITE permission.
> + */
> +bool to_be_replaced;

I don't understand these permission changes. How does (preparing for)
detaching a node from quorum make its content invalid? And why do we
suddenly need WRITE permissions even if the quorum node is only used
read-only?

The comment is a bit unclear, too. "check whether" implies that both
outcomes could be true, but it doesn't say what happens in either case.
Is this really "make sure that"?

Kevin




Re: [PATCH v2 04/33] block: Add BdrvChildRole to BdrvChild

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

For now, it is always set to 0.  Later patches in this series will
ensure that all callers pass an appropriate combination of flags.


Sneaky - this re-adds the field you dropped as part of a rename in 2/33. 
 Anyone doing a backport had better be aware that they would need this
whole series, rather than cherry-picking later patches without the 
earlier ones.  But that observation does not affect the patch validity.




Signed-off-by: Max Reitz 
---



+++ b/block.c
@@ -2381,6 +2381,7 @@ static void bdrv_replace_child(BdrvChild *child, 
BlockDriverState *new_bs)
  BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
const char *child_name,
const BdrvChildClass *child_class,
+  BdrvChildRole child_role,


Hmm - C is loose enough to allow the declaration of a parameter as an 
enum type even when its intended usage is to receive a bitwise-or 
derived from that enum but not declared in the enum.  For example, if I 
understand intent correctly, a caller might pass in 0x3 
(BDRV_CHILD_DATA|BDRV_CHILD_METADATA) which does NOT appear in 
BdrvChildRole.  It feels like it might be cleaner to document the 
interface as taking an unsigned int (although then you've lost the 
documentation that the int is derived from BdrvChildRole), than to abuse 
the typesystem to pass values that are not BdrvChildRole through the 
parameter.


Otherwise, this patch is a mechanical addition.
Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 03/33] block: Add BdrvChildRole

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

This enum will supplement BdrvChildClass when it comes to what role (or
combination of roles) a child takes for its parent.

Because empty enums are not allowed, let us just start with it filled.

Signed-off-by: Max Reitz 
---
  include/block/block.h | 27 +++
  1 file changed, 27 insertions(+)

diff --git a/include/block/block.h b/include/block/block.h
index 38963ef203..0f7e8caa5b 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -279,6 +279,33 @@ enum {
  DEFAULT_PERM_UNCHANGED  = BLK_PERM_ALL & ~DEFAULT_PERM_PASSTHROUGH,
  };
  
+typedef enum BdrvChildRole {

+/* Child stores data */
+BDRV_CHILD_DATA = (1 << 0),
+
+/* Child stores metadata */
+BDRV_CHILD_METADATA = (1 << 1),
+
+/* Filtered child */
+BDRV_CHILD_FILTERED = (1 << 2),


I'm not sure this comment does justice to what the flag represents, but I
am not sure what longer comment to put in its place.



+
+/* Child to COW from (backing child) */
+BDRV_CHILD_COW  = (1 << 3),
+
+/*
+ * The primary child.  For most drivers, this is the child whose
+ * filename applies best to the parent node.
+ * Each parent must give this flag to no more than one child at a
+ * time.
+ */
+BDRV_CHILD_PRIMARY  = (1 << 4),
+
+/* Useful combination of flags */
+BDRV_CHILD_IMAGE= BDRV_CHILD_DATA
+  | BDRV_CHILD_METADATA
+  | BDRV_CHILD_PRIMARY,
+} BdrvChildRole;
+
  char *bdrv_perm_names(uint64_t perm);
  uint64_t bdrv_qapi_perm_to_blk_perm(BlockPermission qapi_perm);
  



Whether or not you can improve the comment, the enum makes sense.

Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 02/33] block: Rename BdrvChildRole to BdrvChildClass

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

This structure contains little more than parent callbacks for child state
changes.  It cannot really reflect a child's role, because different
roles may overlap (as we will see when real roles are introduced), and
because parents can have custom callbacks even when the child fulfills a
standard role.

Signed-off-by: Max Reitz 
---


Looks mechanical, and the new name makes sense for the upcoming patches.

Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v4 01/10] hbitmap: assert that we don't create bitmap larger than INT64_MAX

2020-02-05 Thread Eric Blake

On 2/5/20 5:20 AM, Vladimir Sementsov-Ogievskiy wrote:

We have APIs which returns signed int64_t, to be able to return error.


s/returns/return/


Therefore we can't handle bitmaps with absolute size larger than
(INT64_MAX+1). Still, keep maximum to be INT64_MAX which is a bit
safer.

Note, that bitmaps are used to represent disk images, which can't


s/Note,/Note/


exceed INT64_MAX anyway.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
  util/hbitmap.c | 2 ++
  1 file changed, 2 insertions(+)


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 00/17] Improve qcow2 all-zero detection

2020-02-05 Thread Vladimir Sementsov-Ogievskiy

05.02.2020 17:47, Vladimir Sementsov-Ogievskiy wrote:

05.02.2020 17:26, Eric Blake wrote:

On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:


3. For qcow2
Hmm. Here, as I understand, the main case is a freshly created qcow2,
which is fully-unallocated. To understand that it is empty, we
need only to check all L1 entries. And for empty L1 table it is fast.
So we don't need any qcow2 format improvement to check it.



Ah yes, I forgot about the preallocated case. Hmm. For preallocated clusters,
we have zero bits in L2 entries. And with them, we don't even need the
preallocated clusters to be filled with zeros, as we never read them (we just
return zeros on read)...


Scanning all L2 entries is O(n), while an autoclear bit properly maintained is 
O(1).



Then, maybe we want a similar flag for the L1 entry (this would enable large,
fast write-zero). And maybe we want a flag which marks the whole image
as read-zero (it's your flag). So, now I think my previous idea
of "all allocated is zero" is worse. As for fully-preallocated images
we are sure that all clusters are allocated, and it is more natural to
have flags similar to the ZERO bit in an L2 entry.


Right now, we don't have any L1 entry flags.  Adding one would require adding 
an incompatible feature flag (if older qemu would choke to see unexpected flags 
in an L1 entry), or at best an autoclear feature flag (if the autoclear bit 
gets cleared because an older qemu opened the image and couldn't maintain L1 
entry flags correctly, then newer qemu knows it cannot trust those L1 entry 
flags).  But as soon as you are talking about adding a feature bit, then why 
add one that still requires O(n) traversal to check (true, the 'n' in an O(n) 
traversal of L1 tables is much smaller than the 'n' in an O(n) traversal of L2 
tables), when you can instead just add an O(1) autoclear bit that maintains 
all_zero status for the image as a whole?



My suggestion about an L1 entry flag is a side thing; I understand the difference
between O(n) and O(1) :) Still, an additional L1 entry flag would help to make
large block-status and write-zero requests efficient.

And I agree that we need a top-level flag. I just try to say that it seems good
to make it similar to the existing L2 flag. But yes, it would be an incompatible
change, as it marks all clusters as ZERO, and older QEMU can't understand it
and may treat all clusters as unallocated.



Still, how long is this O(n)? We load the whole L1 into memory anyway. For
example, for a 16 TB disk with 64K granularity we'll have 32768 L1 entries. Will
we get a sensible performance benefit from an extension? I doubt it now. And
anyway, if we have an extension, we should fall back to this O(n) scan if we
don't have the flag set.

So, I think the flag is beneficial only for preallocated images.

Hmm, and for such images, if we want, we can define this flag as 'all clusters
are allocated zeroes'. That would prove that the image reads as zero
independently of any backing relations.
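For reference, the O(n) check being discussed is roughly the following
(illustrative sketch only; the field names follow qcow2's BDRVQcow2State, but
this is not actual QEMU code):

/* True if no L1 entry points at an L2 table, i.e. no cluster is allocated;
 * with no backing file that means the image currently reads as all zeroes. */
static bool qcow2_l1_is_all_unallocated(BDRVQcow2State *s)
{
    for (int i = 0; i < s->l1_size; i++) {
        if (s->l1_table[i] & L1E_OFFSET_MASK) {
            return false;   /* at least one L2 table is allocated */
        }
    }
    return true;
}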


--
Best regards,
Vladimir



Re: [PATCH 00/17] Improve qcow2 all-zero detection

2020-02-05 Thread Vladimir Sementsov-Ogievskiy

05.02.2020 17:43, Vladimir Sementsov-Ogievskiy wrote:

05.02.2020 17:22, Eric Blake wrote:

On 2/5/20 3:04 AM, Vladimir Sementsov-Ogievskiy wrote:


[repo.or.cz appears to be down as I type this; I'll post a link to a
repository later when it comes back up]


Now up
https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/qcow2-all-zero-v1





I have several ideas around it.

1. For generic block layer.
Did you consider, as an alternative to BDRV_ZERO_OPEN, exporting the
information through normal block_status? So, if we have the
information, that disk is all-zero, we can always add _ZERO
flag to block-status result.


Makes sense.


And in generic bdrv_is_all_zeroes(),
we can just call block_status(0, disk_size), which will return
ZERO and n=disk_size if driver supports all-zero feature and is
all-zero now.


Less obvious.  block_status is not required to visit the entire disk, even if 
the entire disk is all zero.  For example, qcow2 visits at most one L2 page in 
a call (if the request spans L1 entries, it will be truncated at the boundary, 
even if the region before and after the boundary have the same status).  I'm 
also worried if we still have 32-bit limitations in block_status (ideally, 
we've fixed things to support 64-bit status where possible, but I'm not 
counting on it).
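
(A rough worked example, assuming the default 64 KiB cluster size: one L2 table
is one cluster of 65536 / 8 = 8192 entries, so a single block_status call can
cover at most 8192 * 64 KiB = 512 MiB; confirming that a 16 TiB image is
all-zero this way could take up to 32768 calls, versus one flag check.)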


Not required, but why not do it? If we have the information that the whole disk
has the same ZERO status, there are no reasons to reply to
block_status(0, disk_size) with a smaller n.




I think block-status is a natural way to convey such information, and I
think that we want to get to 64-bit block-status support for qcow2 and
NBD anyway.


Block status requires an O(n) loop over the disk, where n is the number of 
distinct extents possible.  If you get lucky, and block_status(0,size) returns 
a single extent, then yes that can feed the 'is_zeroes' request.  Similarly, a 
single return of non-zero data can instantly tell you that 'is_zeroes' is 
false.  But given that drivers may break up their response on convenient 
boundaries, such as qcow2 on L1 entry granularity, you cannot blindly assume 
that a return of zero data for smaller than the requested size implies non-zero 
data, only that there is insufficient information to tell if the disk is 
all_zeroes without querying further block_status calls, and that's where you 
lose out on the speed compared to just being told up-front from an 'is_zero' 
call.


Yes. But how is it worse than BDRV_ZERO_OPEN? With one block_status call we 
have the same information. If on block_status(0, disk_size) driver replies with 
ZERO but smaller than disk_size, it means that either disk is not all-zero, or 
driver doesn't support 'fast whole-disk zero check' feature, which is equal to 
not supporting BDRV_ZERO_OPEN.





2. For NBD
Again, a possible alternative is BLOCK_STATUS, but we need 64-bit
commands for it. I plan to send a proposal anyway. Still, there is nothing
bad in having two possible paths for receiving all-zero information.
And even with your NBD extension, we can export this information
through block-status [1.]


Yes, having 64-bit BLOCK_STATUS in NBD is orthogonal to this, but both ideas 
are independently useful, and as the level of difficulty in implementing things 
may vary, it is conceivable to have both a server that provides 'is_zero' but 
not BLOCK_STATUS, and a server that provides 64-bit BLOCK_STATUS but not 
'is_zero'.



3. For qcow2
Hmm. Here, as I understand, the main case is a freshly created qcow2,
which is fully-unallocated. To understand that it is empty, we
need only to check all L1 entries. And for empty L1 table it is fast.
So we don't need any qcow2 format improvement to check it.


The benefit of this patch series is that it detects preallocated qcow2 images 
as all_zero.  What's more, scanning all L1 entries is O(n), but detecting an 
autoclear all_zero bit is O(1).  Your proposed L1 scan is accurate for fewer 
cases, and costs more time.


Ah yes, somehow I thought that L1 is not allocated for a fresh image...

Hmm, then possibly we need two new top-level flags: "all-zero" and
"all-unallocated"...



It makes sense only with incompatible semantics. With autoclear semantics it's
better to have a single 'all-allocated-are-zero' flag and not care.


--
Best regards,
Vladimir



Re: [PATCH 00/17] Improve qcow2 all-zero detection

2020-02-05 Thread Vladimir Sementsov-Ogievskiy

05.02.2020 17:26, Eric Blake wrote:

On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:


3. For qcow2
Hmm. Here, as I understand, the main case is a freshly created qcow2,
which is fully-unallocated. To understand that it is empty, we
need only to check all L1 entries. And for empty L1 table it is fast.
So we don't need any qcow2 format improvement to check it.



Ah yes, I forgot about the preallocated case. Hmm. For preallocated clusters,
we have zero bits in L2 entries. And with them, we don't even need the
preallocated clusters to be filled with zeros, as we never read them (we just
return zeros on read)...


Scanning all L2 entries is O(n), while an autoclear bit properly maintained is 
O(1).



Then, maybe we want a similar flag for the L1 entry (this would enable large,
fast write-zero). And maybe we want a flag which marks the whole image
as read-zero (it's your flag). So, now I think my previous idea
of "all allocated is zero" is worse. As for fully-preallocated images
we are sure that all clusters are allocated, and it is more natural to
have flags similar to the ZERO bit in an L2 entry.


Right now, we don't have any L1 entry flags.  Adding one would require adding 
an incompatible feature flag (if older qemu would choke to see unexpected flags 
in an L1 entry), or at best an autoclear feature flag (if the autoclear bit 
gets cleared because an older qemu opened the image and couldn't maintain L1 
entry flags correctly, then newer qemu knows it cannot trust those L1 entry 
flags).  But as soon as you are talking about adding a feature bit, then why 
add one that still requires O(n) traversal to check (true, the 'n' in an O(n) 
traversal of L1 tables is much smaller than the 'n' in an O(n) traversal of L2 
tables), when you can instead just add an O(1) autoclear bit that maintains 
all_zero status for the image as a whole?



My suggestion about an L1 entry flag is a side thing; I understand the difference
between O(n) and O(1) :) Still, an additional L1 entry flag would help to make
large block-status and write-zero requests efficient.

And I agree that we need a top-level flag. I just try to say that it seems good
to make it similar to the existing L2 flag. But yes, it would be an incompatible
change, as it marks all clusters as ZERO, and older QEMU can't understand it
and may treat all clusters as unallocated.


--
Best regards,
Vladimir



Re: [PATCH 00/17] Improve qcow2 all-zero detection

2020-02-05 Thread Vladimir Sementsov-Ogievskiy

05.02.2020 17:22, Eric Blake wrote:

On 2/5/20 3:04 AM, Vladimir Sementsov-Ogievskiy wrote:


[repo.or.cz appears to be down as I type this; I'll post a link to a
repository later when it comes back up]


Now up
https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/qcow2-all-zero-v1





I have several ideas around it.

1. For generic block layer.
Did you consider, as an alternative to BDRV_ZERO_OPEN, exporting the
information through normal block_status? So, if we have the
information, that disk is all-zero, we can always add _ZERO
flag to block-status result.


Makes sense.


And in generic bdrv_is_all_zeroes(),
we can just call block_status(0, disk_size), which will return
ZERO and n=disk_size if driver supports all-zero feature and is
all-zero now.


Less obvious.  block_status is not required to visit the entire disk, even if 
the entire disk is all zero.  For example, qcow2 visits at most one L2 page in 
a call (if the request spans L1 entries, it will be truncated at the boundary, 
even if the region before and after the boundary have the same status).  I'm 
also worried if we still have 32-bit limitations in block_status (ideally, 
we've fixed things to support 64-bit status where possible, but I'm not 
counting on it).


Not required, but why not do it? If we have the information that the whole disk
has the same ZERO status, there are no reasons to reply to
block_status(0, disk_size) with a smaller n.




I think block-status is a natural way to convey such information, and I
think that we want to get to 64-bit block-status support for qcow2 and
NBD anyway.


Block status requires an O(n) loop over the disk, where n is the number of 
distinct extents possible.  If you get lucky, and block_status(0,size) returns 
a single extent, then yes that can feed the 'is_zeroes' request.  Similarly, a 
single return of non-zero data can instantly tell you that 'is_zeroes' is 
false.  But given that drivers may break up their response on convenient 
boundaries, such as qcow2 on L1 entry granularity, you cannot blindly assume 
that a return of zero data for smaller than the requested size implies non-zero 
data, only that there is insufficient information to tell if the disk is 
all_zeroes without querying further block_status calls, and that's where you 
lose out on the speed compared to just being told up-front from an 'is_zero' 
call.


Yes. But how is it worse than BDRV_ZERO_OPEN? With one block_status call we
have the same information. If, for block_status(0, disk_size), the driver replies
with ZERO but a length smaller than disk_size, it means that either the disk is
not all-zero, or the driver doesn't support a 'fast whole-disk zero check'
feature, which is equivalent to not supporting BDRV_ZERO_OPEN.





2. For NBD
Again, a possible alternative is BLOCK_STATUS, but we need 64-bit
commands for it. I plan to send a proposal anyway. Still, there is nothing
bad in having two possible paths for receiving all-zero information.
And even with your NBD extension, we can export this information
through block-status [1.]


Yes, having 64-bit BLOCK_STATUS in NBD is orthogonal to this, but both ideas 
are independently useful, and as the level of difficulty in implementing things 
may vary, it is conceivable to have both a server that provides 'is_zero' but 
not BLOCK_STATUS, and a server that provides 64-bit BLOCK_STATUS but not 
'is_zero'.



3. For qcow2
Hmm. Here, as I understand, the main case is a freshly created qcow2,
which is fully-unallocated. To understand that it is empty, we
need only to check all L1 entries. And for empty L1 table it is fast.
So we don't need any qcow2 format improvement to check it.


The benefit of this patch series is that it detects preallocated qcow2 images 
as all_zero.  What's more, scanning all L1 entries is O(n), but detecting an 
autoclear all_zero bit is O(1).  Your proposed L1 scan is accurate for fewer 
cases, and costs more time.


Ah yes, somehow I thought that L1 is not allocated for a fresh image...

Hmm, then possibly we need two new top-level flags: "all-zero" and
"all-unallocated"...


--
Best regards,
Vladimir



Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}

2020-02-05 Thread Eric Blake

On 2/5/20 8:25 AM, Vladimir Sementsov-Ogievskiy wrote:


But when you truncate them (with PREALLOC_MODE_OFF, as
BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
area is always going to be 0, regardless of initial preallocation.


ah yes, due to qcow2 zero clusters.


Hmm. Do we actually set the zero flag on unallocated clusters when 
resizing a qcow2 image?  That would be an O(n) operation (we have to 
visit the L2 entry for each added cluster, even if only to set the 
zero cluster bit).  Or do we instead just rely on the fact that qcow2 
is inherently sparse, and that when you resize the guest-visible size 
without writing any new clusters, then it is only subsequent guest 
access to those addresses that finally allocate clusters, making 
resize O(1) (update the qcow2 metadata cluster, but not any L2 tables) 
while still reading 0 from the new data.  To some extent, that's what 
the allocation mode is supposed to control.


We must mark new clusters as ZERO at least if there is a _larger_ backing
file, to prevent data from the backing file becoming visible to the guest.
But we don't do it. It's a bug, and there is a series from Kevin fixing it;
I've just pinged it:

"[PATCH for-4.2? v3 0/8] block: Fix resize (extending) of short overlays"


There's a difference for a backing file larger than the qcow2 file, and 
the protocol layer larger than the qcow2 file.  Visually, with the 
following four nodes:


f1 [qcow2 format] <- f2 [qcow2 format]
 |                    |
 v                    v
p1 [file protocol]   p2 [file protocol]

If f1 is larger than f2, then resizing f2 without writing zero clusters 
leaks the data from f1 into f2.  The block layer knows this: prior to 
this series, bdrv_has_zero_init_truncate() returns 0 if bs->backing is 
present; and even in this series, see patch 6/17 which continues to 
force a 0 return rather than calling into the driver if the sizes are 
suspect.  But that is an uncommon corner case; in short, the qcow2 
callback .bdrv_has_zero_init_truncate is NOT reachable in that scenario, 
whether before or after this series.


On the other hand, if p2 is larger than f2, resizing f2 reads zeroes. 
That's because qcow2 HAS to add L2 mappings into p2 before data from p2 
can leak, but .bdrv_co_truncate(PREALLOC_MODE_OFF) does not add any L2 
mappings.  Thus, qcow2 blindly returning 1 for 
.bdrv_has_zero_init_truncate was correct (other than the anomaly of 
bs->encrypted, also fixed earlier in this series).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 02/13] qcrypto-luks: implement encryption key management

2020-02-05 Thread Markus Armbruster
Daniel P. Berrangé  writes:

> On Wed, Feb 05, 2020 at 10:30:11AM +0100, Kevin Wolf wrote:
>> Am 05.02.2020 um 09:24 hat Markus Armbruster geschrieben:
>> > Daniel, Kevin, any comments or objections to the QAPI schema design
>> > sketch developed below?
>> > 
>> > For your convenience, here's the result again:
>> > 
>> > { 'enum': 'LUKSKeyslotState',
>> >   'data': [ 'active', 'inactive' ] }
>> > { 'struct': 'LUKSKeyslotActive',
>> >   'data': { 'secret': 'str',
>> > '*iter-time': 'int } }
>> > { 'union': 'LUKSKeyslotAmend',
>> >   'base': { '*keyslot': 'int',
>> > 'state': 'LUKSKeyslotState' }
>> >   'discriminator': 'state',
>> >   'data': { 'active': 'LUKSKeyslotActive' } }
>
> We need 'secret' in the 'inactive' case too

Yes, my mistake.

>> I think one of the requirements was that you can specify the keyslot not
>> only by using its number, but also by specifying the old secret. Trivial
>> extension, you just get another optional field that can be specified
>> instead of 'keyslot'.
>> 
>> Resulting commands:
>> 
>> Adding a key:
>> qemu-img amend -o encrypt.keys.0.state=active,encrypt.keys.0.secret=sec0 
>> test.qcow2
>> 
>> Deleting a key:
>> qemu-img amend -o encrypt.keys.0.state=inactive,encrypt.keys.0.keyslot=2 
>> test.qcow2
>
> I think this is good as a design.
>
> Expanding the examples to cover all scenarios we've discussed
>
>
>   - Activating a new keyslot, auto-picking slot
>
>  qemu-img amend -o encrypt.keys.0.state=active,\
>encrypt.keys.0.secret=sec0 \
>   test.qcow2
>
> Must raise an error if no free slots
>
>
>   - Activating a new keyslot, picking a specific slot
>
>  qemu-img amend -o encrypt.keys.0.state=active,\
>encrypt.keys.0.secret=sec0 \
>  encrypt.keys.0.keyslot=3 \
>   test.qcow2
>
> Must raise an error if slot is already active

From the "describe desired state" point of view:

* Always succeeds when slot is inactive

* No-op when active and its secret is already the desired secret

* Must raise "in place update refused" error otherwise

>   - Deactivating a old keyslot, auto-picking slot(s) from existing password
>
>  qemu-img amend -o encrypt.keys.0.state=inactive,\
>encrypt.keys.0.secret=sec0 \
>   test.qcow2
>
> Must raise an error if this would leave zero keyslots
> after processing.
>
>
>   - Deactivating a old keyslot, picking a specific slot
>
>  qemu-img amend -o encrypt.keys.0.state=inactive,\
>encrypt.keys.0.keyslot=2 \
>   test.qcow2
>
> Always succeeds even if zero keyslots left.

This one's dangerous.

Here's a variation: permit operations that may or will lose data only
with 'force': true.

When @keyslot is absent, using force has no effect.

When @keyslot is present, using force permits update in place and
deactivating the last slot.




Re: [PATCH 02/13] qcrypto-luks: implement encryption key management

2020-02-05 Thread Markus Armbruster
Kevin Wolf  writes:

> Am 05.02.2020 um 11:03 hat Markus Armbruster geschrieben:
>> Kevin Wolf  writes:
>> 
>> > Am 05.02.2020 um 09:24 hat Markus Armbruster geschrieben:
>> >> Daniel, Kevin, any comments or objections to the QAPI schema design
>> >> sketch developed below?
>> >> 
>> >> For your convenience, here's the result again:
>> >> 
>> >> { 'enum': 'LUKSKeyslotState',
>> >>   'data': [ 'active', 'inactive' ] }
>> >> { 'struct': 'LUKSKeyslotActive',
>> >>   'data': { 'secret': 'str',
>> >> '*iter-time': 'int } }
>> >> { 'union': 'LUKSKeyslotAmend',
>> >>   'base': { '*keyslot': 'int',
>> >> 'state': 'LUKSKeyslotState' }
>> >>   'discriminator': 'state',
>> >>   'data': { 'active': 'LUKSKeyslotActive' } }
>> >
>> > I think one of the requirements was that you can specify the keyslot not
>> > only by using its number, but also by specifying the old secret.
>> 
>> Quoting myself:
>> 
>>   When we don't specify the slot#, then "new state active" selects an
>>   inactive slot (chosen by the system, and "new state inactive selects
>>   slots by secret (commonly just one slot).
>> 
>> This takes care of selecting (active) slots by old secret with "new
>> state inactive".
>
> "new secret inactive" can't select a slot by secret because 'secret'
> doesn't even exist for inactive.

My mistake.  My text leading up to my schema has it, but the schema
itself doesn't.  Obvious fix:

As struct:

{ 'struct': 'LUKSKeyslotUpdate',
  'data': { 'active': 'bool',   # could do enum instead
'*keyslot': 'int',
'*secret': 'str',   # present if @active is true
#   or @keyslot is absent
'*iter-time': 'int' } } # absent if @active is false

As union:

{ 'enum': 'LUKSKeyslotState',
  'data': [ 'active', 'inactive' ] }
{ 'struct': 'LUKSKeyslotActive',
  'data': { 'secret': 'str',
'*iter-time': 'int' } }
{ 'struct': 'LUKSKeyslotInactive',
  'data': { '*secret': 'str' } }# either @secret or @keyslot present
# might want to name this @old-secret
{ 'union': 'LUKSKeyslotAmend',
  'base': { '*keyslot': 'int',
'state': 'LUKSKeyslotState' }
  'discriminator': 'state',
  'data': { 'active': 'LUKSKeyslotActive',
'inactive': 'LUKSKeyslotInactive' } }

The "deactivate secret" operation needs a bit of force to fit into the
amend interface's "describe desired state" mold: the desired state is
(state=inactive, secret=S).  In other words, the inactive slot keeps its
secret, you just can't use it for anything.

Sadly, even with a union, we now have optional members that aren't
really optional: "either @secret or @keyslot present".  To avoid that,
we'd have to come up with sane semantics for "neither" and "both".  Let
me try.

The basic idea is to have @keyslot and @secret each select a set of
slots, and take the intersection.

If @keyslot is present: { @keyslot }
   absent: all slots
If @secret is present: the set of slots holding @secret
  absent: all slots

Neither present: select all slots.
Both present: slot @keyslot if it holds @secret, else no slots

The ability to specify @keyslot and @secret might actually be
appreciated by some users.  Belt *and* suspenders.

Selecting no slots could be a no-op or an error.  As a user, I don't
care as long as I can tell what the command actually changed.

Selecting all slots is an error because deactivating the last slot is.
No different from selecting all slots with a particular secret when no
active slots with different secrets exist.

I'm not sure this is much of an improvement.
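
For illustration only, here is a rough C sketch of the "intersection"
selection described above. All names (LUKSSlot, slot_holds_secret,
select_slots) are invented for the example and are not the qcrypto-luks
API; real code would also verify a secret by trying to unlock the slot
rather than by string comparison.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct LUKSSlot {
    bool active;
    const char *secret;    /* secret held by the slot, if active */
} LUKSSlot;

static bool slot_holds_secret(const LUKSSlot *slot, const char *secret)
{
    return slot->active && slot->secret && strcmp(slot->secret, secret) == 0;
}

/*
 * Mark every slot matched by the request.  @keyslot < 0 and @secret == NULL
 * both mean "absent", i.e. "all slots"; the result is the intersection.
 */
static size_t select_slots(const LUKSSlot *slots, size_t n_slots,
                           int keyslot, const char *secret, bool *selected)
{
    size_t n_selected = 0;

    for (size_t i = 0; i < n_slots; i++) {
        bool by_keyslot = keyslot < 0 || (size_t)keyslot == i;
        bool by_secret = !secret || slot_holds_secret(&slots[i], secret);

        selected[i] = by_keyslot && by_secret;
        n_selected += selected[i];
    }
    return n_selected;
}

With that, "neither present" selects all slots (an error, since deactivating
the last slot is), and "both present" selects @keyslot only if it really
holds @secret.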

>> I intentionally did not provide for selecting (active) slots by old
>> secret with "new state active", because that's unsafe update in place.
>> 
>> We want to update secrets, of course.  But the safe way to do that is to
>> put the new secret into a free slot, and if that succeeds, deactivate
>> the old secret.  If deactivation fails, you're left with both old and
>> new secret, which beats being left with no secret when update in place
>> fails.
>
> Right. I wonder if qemu-img wants support for that specifically
> (possibly with allowing to enter the key interactively) rather than
> requiring the user to call qemu-img amend twice.

Human users may well appreciate such a "replace secret" operation.  As
so often with high-level operations, the difficulty is its failure
modes:

* Activation fails: no change (old secret still active)

* Deactivate fails: both secrets are active

Humans should be able to deal with both failure modes, provided the
error reporting is sane.

If I'd have to program a machine, however, I'd rather use the primitive
operations, because each either succeeds completely or fails completely,
which means I don't have to figure out *how* something failed.

Note that such a high-level "repl

Re: [PATCH 00/17] Improve qcow2 all-zero detection

2020-02-05 Thread Eric Blake

On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:


3. For qcow2
Hmm. Here, as I understand it, the main case is a freshly created qcow2,
which is fully unallocated. To understand that it is empty, we
only need to check all L1 entries. And for an empty L1 table that is fast.
So we don't need any qcow2 format improvement to check it.



Ah yes, I forgot about the preallocated case. Hmm. For preallocated clusters,
we have zero bits in the L2 entries. And with them, we don't even need the
preallocated clusters to be filled with zeros, as we never read them (we just
return zeros on read)..


Scanning all L2 entries is O(n), while an autoclear bit properly 
maintained is O(1).




Then, maybe we want a similar flag for L1 entries (this would enable large,
fast write-zeroes). And maybe we want a flag which marks the whole image
as reads-as-zero (that's your flag). So now I think my previous idea
of "all allocated is zero" is worse. For fully-preallocated images
we are sure that all clusters are allocated, and it is more natural to
have flags similar to the ZERO bit in L2 entries.


Right now, we don't have any L1 entry flags.  Adding one would require 
adding an incompatible feature flag (if older qemu would choke to see 
unexpected flags in an L1 entry), or at best an autoclear feature flag 
(if the autoclear bit gets cleared because an older qemu opened the 
image and couldn't maintain L1 entry flags correctly, then newer qemu 
knows it cannot trust those L1 entry flags).  But as soon as you are 
talking about adding a feature bit, then why add one that still requires 
O(n) traversal to check (true, the 'n' in an O(n) traversal of L1 tables 
is much smaller than the 'n' in an O(n) traversal of L2 tables), when 
you can instead just add an O(1) autoclear bit that maintains all_zero 
status for the image as a whole?
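
To make the O(1)/O(n) contrast concrete, here is a rough sketch of how a
driver could maintain such a flag. Every name here is invented; this is not
the qcow2 code or its on-disk layout, just the shape of the idea.

#include <stdbool.h>
#include <stdint.h>

#define HYP_AUTOCLEAR_ALL_ZERO (1ULL << 0)    /* invented autoclear bit */

typedef struct HypState {
    uint64_t autoclear_features;    /* as read from the image header */
    bool all_zero;                  /* cached in-memory copy */
} HypState;

/* At open: older implementations clear unknown autoclear bits, so if the
 * bit survived, every previous writer understood and maintained it. */
static void hyp_open(HypState *s, uint64_t header_autoclear_features)
{
    s->autoclear_features = header_autoclear_features;
    s->all_zero = s->autoclear_features & HYP_AUTOCLEAR_ALL_ZERO;
}

/* O(1) query; no L1/L2 traversal needed. */
static bool hyp_known_all_zero(const HypState *s)
{
    return s->all_zero;
}

/* Any write of possibly non-zero data clears the flag, in the header first
 * and then in memory, before the data lands on disk. */
static void hyp_before_first_write(HypState *s)
{
    if (s->all_zero) {
        s->autoclear_features &= ~HYP_AUTOCLEAR_ALL_ZERO;
        /* ...rewrite the header autoclear field here... */
        s->all_zero = false;
    }
}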


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}

2020-02-05 Thread Vladimir Sementsov-Ogievskiy

05.02.2020 17:07, Eric Blake wrote:

On 2/5/20 1:51 AM, Vladimir Sementsov-Ogievskiy wrote:


+typedef enum {
+    /*
+ * bdrv_known_zeroes() should include this bit if the contents of
+ * a freshly-created image with no backing file reads as all
+ * zeroes without any additional effort.  If .bdrv_co_truncate is
+ * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.


I understand that this is preexisting logic, but could I ask: why?
What's wrong
if driver can guarantee that created file is all-zero, but is not sure
about
file resizing? I agree that it's normal for these flags to have the same
value,
but what is the reason for this restriction?..


If areas added by truncation (or growth, rather) are always zero, then
the file can always be created with size 0 and grown from there.  Thus,
images where truncation adds zeroed areas will generally always be zero
after creation.


This means that if the truncation bit is set, then the create bit should be set.. But
here we say that if truncation is clear, then the create bit must be clear.


Max, did we get the logic backwards?






So, the only possible combination of flags where they differ is
create=0 and truncate=1.. How is that possible?


For preallocated qcow2 images, it depends on the storage whether they
are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
to bdrv_has_zero_init() of s->data_file->bs.

But when you truncate them (with PREALLOC_MODE_OFF, as
BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
area is always going to be 0, regardless of initial preallocation.


ah yes, due to qcow2 zero clusters.


Hmm. Do we actually set the zero flag on unallocated clusters when resizing a 
qcow2 image?  That would be an O(n) operation (we have to visit the L2 entry 
for each added cluster, even if only to set the zero cluster bit).  Or do we 
instead just rely on the fact that qcow2 is inherently sparse, and that when 
you resize the guest-visible size without writing any new clusters, then it is 
only subsequent guest access to those addresses that finally allocate clusters, 
making resize O(1) (update the qcow2 metadata cluster, but not any L2 tables) 
while still reading 0 from the new data.  To some extent, that's what the 
allocation mode is supposed to control.


We must mark new clusters as ZERO at least if there is a _larger_ backing file, 
to prevent data from the backing file becoming visible to the guest. But we don't 
do it. It's a bug, and there is a series from Kevin fixing it; I've just pinged it:
"[PATCH for-4.2? v3 0/8] block: Fix resize (extending) of short overlays"



What about with external data images, where a resize in guest-visible length 
requires a resize of the underlying data image?  There, we DO have to worry 
about whether the data image resizes with zeroes (as in the filesystem) or with 
random data (as in a block device).




--
Best regards,
Vladimir



Re: [PATCH 00/17] Improve qcow2 all-zero detection

2020-02-05 Thread Eric Blake

On 2/5/20 3:04 AM, Vladimir Sementsov-Ogievskiy wrote:


[repo.or.cz appears to be down as I type this; I'll post a link to a
repository later when it comes back up]


Now up
https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/qcow2-all-zero-v1





I have several ideas around it.

1. For the generic block layer.
Did you consider, as an alternative to BDRV_ZERO_OPEN, exporting the
information through normal block_status? So, if we have the
information that the disk is all-zero, we can always add the _ZERO
flag to the block-status result.


Makes sense.


And in a generic bdrv_is_all_zeroes(),
we can just call block_status(0, disk_size), which will return
ZERO and n=disk_size if the driver supports the all-zero feature and the
disk is all-zero now.


Less obvious.  block_status is not required to visit the entire disk, 
even if the entire disk is all zero.  For example, qcow2 visits at most 
one L2 page in a call (if the request spans L1 entries, it will be 
truncated at the boundary, even if the region before and after the 
boundary have the same status).  I'm also worried if we still have 
32-bit limitations in block_status (ideally, we've fixed things to 
support 64-bit status where possible, but I'm not counting on it).



I think block-status is a natural channel for such information, and I
think that we want to get to 64-bit block-status support
for qcow2 and NBD anyway.


Block status requires an O(n) loop over the disk, where n is the number 
of distinct extents possible.  If you get lucky, and 
block_status(0,size) returns a single extent, then yes that can feed the 
'is_zeroes' request.  Similarly, a single return of non-zero data can 
instantly tell you that 'is_zeroes' is false.  But given that drivers 
may break up their response on convenient boundaries, such as qcow2 on 
L1 entry granularity, you cannot blindly assume that a return of zero 
data for smaller than the requested size implies non-zero data, only 
that there is insufficient information to tell if the disk is all_zeroes 
without querying further block_status calls, and that's where you lose 
out on the speed compared to just being told up-front from an 'is_zero' 
call.
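
To make that concrete, a loop-based probe has to look roughly like the sketch
below (the helper name is invented, error handling is simplified, and the
32-bit concerns above are glossed over); it only finishes early if the very
first reply already covers the whole device:

static int hyp_is_all_zeroes(BlockDriverState *bs)
{
    int64_t size = bdrv_getlength(bs);
    int64_t offset = 0;

    if (size < 0) {
        return size;                 /* propagate -errno */
    }

    while (offset < size) {
        int64_t pnum;
        int ret = bdrv_block_status(bs, offset, size - offset, &pnum,
                                    NULL, NULL);

        if (ret < 0) {
            return ret;
        }
        if (!(ret & BDRV_BLOCK_ZERO) || pnum == 0) {
            return 0;                /* non-zero data, or no progress: be conservative */
        }
        /* The driver may answer for less than we asked, e.g. one L1 entry
         * at a time, so an all-zero image can still take many calls. */
        offset += pnum;
    }
    return 1;                        /* every extent reads as zeroes */
}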




2. For NBD
Again, a possible alternative is BLOCK_STATUS, but we need 64-bit
commands for it. I plan to send a proposal anyway. Still, there is nothing
bad about having two possible paths for receiving all-zero information.
And even with your NBD extension, we can export this information
through block-status [1.]


Yes, having 64-bit BLOCK_STATUS in NBD is orthogonal to this, but both 
ideas are independently useful, and as the level of difficulty in 
implementing things may vary, it is conceivable to have both a server 
that provides 'is_zero' but not BLOCK_STATUS, and a server that provides 
64-bit BLOCK_STATUS but not 'is_zero'.




3. For qcow2
Hmm. Here, as I understand it, the main case is a freshly created qcow2,
which is fully unallocated. To understand that it is empty, we
only need to check all L1 entries. And for an empty L1 table that is fast.
So we don't need any qcow2 format improvement to check it.


The benefit of this patch series is that it detects preallocated qcow2 
images as all_zero.  What's more, scanning all L1 entries is O(n), but 
detecting an autoclear all_zero bit is O(1).  Your proposed L1 scan is 
accurate for fewer cases, and costs more time.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH] iscsi: Cap block count from GET LBA STATUS (CVE-2020-1711)

2020-02-05 Thread Felipe Franciosi
Hi and sorry for the delay, this got lost in my inbox.

> On Jan 28, 2020, at 12:42 PM, Kevin Wolf  wrote:
> 
> Am 28.01.2020 um 13:30 hat Philippe Mathieu-Daudé geschrieben:
>> Hi guys,
>> 
>> (Cc'ing Jon)
>> 
>> On 1/23/20 5:59 PM, Kevin Wolf wrote:
>>> Am 23.01.2020 um 13:44 hat Felipe Franciosi geschrieben:
 When querying an iSCSI server for the provisioning status of blocks (via
 GET LBA STATUS), Qemu only validates that the response descriptor zero's
 LBA matches the one requested. Given the SCSI spec allows servers to
 respond with the status of blocks beyond the end of the LUN, Qemu may
 have its heap corrupted by clearing/setting too many bits at the end of
 its allocmap for the LUN.
 
 A malicious guest in control of the iSCSI server could carefully program
 Qemu's heap (by selectively setting the bitmap) and then smash it.
 
 This limits the number of bits that iscsi_co_block_status() will try to
 update in the allocmap so it can't overflow the bitmap.
 
 Signed-off-by: Felipe Franciosi 
 Signed-off-by: Peter Turschmid 
 Signed-off-by: Raphael Norwitz 
>>> 
>>> Thanks, applied to the block branch.
>> 
>> We are trying to reproduce this, do you already have some code that
>> triggered this issue?
> 
> I don't, maybe Felipe has a reproducer that would crash QEMU.

It's not hard.

1) Attach an iSCSI LUN to Qemu. Do not read data from it (so Qemu
   won't populate the bitmap).
2) Issue a read larger than 64 blocks towards the end of the LUN.
   Qemu will attempt a GET LBA STATUS to find out the provisioning
   status of the blocks.
3) Get your iSCSI server to respond with more blocks than are available
   on the LUN for that LBA.

We did that with an iSCSI server written exclusively for such tests.
All the responses can be controlled or fuzzed. But it should be easy
to modify any existing server (eg. iet).
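
For readers who want the shape of the fix itself: the essence is simply to
clamp the descriptor before it is allowed to touch the allocation map. A
self-contained sketch with invented names (this is not the actual
iscsi_co_block_status() code):

#include <stdint.h>

/* Never trust the block count a target reports in a GET LBA STATUS
 * descriptor beyond the end of the LUN. */
static uint64_t clamp_lba_status_blocks(uint64_t lun_num_blocks,
                                        uint64_t requested_lba,
                                        uint64_t reported_blocks)
{
    if (requested_lba >= lun_num_blocks) {
        return 0;                                       /* nothing to mark */
    }
    if (reported_blocks > lun_num_blocks - requested_lba) {
        reported_blocks = lun_num_blocks - requested_lba;  /* cap at LUN end */
    }
    return reported_blocks;
}

Only the clamped count is then used when setting or clearing allocmap bits,
so a malicious target can no longer push the update past the end of the
bitmap.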

F.

> 
>> I am new to the block API, I noticed the block/blkdebug.c file with
>> 'blkdebug' option, is it helpful to reproduce this issue via HMP?
>> 
>> Any suggestion what would be the easier/quicker way to test this?
> 
> On the QEMU side, you just need to connect to an iscsi backend. The
> malicious response must come from the server, which is not part of QEMU.
> So no, blkdebug won't help you.
> 
>> Looking for iotests examples I see tests/qemu-iotests/147 providing a
>> BuiltinNBD class. Is it the recommended way to go, to mock a iSCSI server?
> 
> That BuiltinNBD class doesn't implement an NBD server, but it just
> starts the built-in NBD server in QEMU and runs some tests against it.
> QEMU doesn't have a built-in iscsi server.
> 
> Kevin



Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}

2020-02-05 Thread Eric Blake

On 2/5/20 1:51 AM, Vladimir Sementsov-Ogievskiy wrote:


+typedef enum {
+    /*
+ * bdrv_known_zeroes() should include this bit if the contents of
+ * a freshly-created image with no backing file reads as all
+ * zeroes without any additional effort.  If .bdrv_co_truncate is
+ * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.


I understand that this is preexisting logic, but could I ask: why?
What's wrong
if driver can guarantee that created file is all-zero, but is not sure
about
file resizing? I agree that it's normal for these flags to have the same
value,
but what is the reason for this restriction?..


If areas added by truncation (or growth, rather) are always zero, then
the file can always be created with size 0 and grown from there.  Thus,
images where truncation adds zeroed areas will generally always be zero
after creation.


This means that if the truncation bit is set, then the create bit should be 
set.. But

here we say that if truncation is clear, then the create bit must be clear.


Max, did we get the logic backwards?






So, the only possible combination of flags where they differ is
create=0 and truncate=1.. How is that possible?


For preallocated qcow2 images, it depends on the storage whether they
are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
to bdrv_has_zero_init() of s->data_file->bs.

But when you truncate them (with PREALLOC_MODE_OFF, as
BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
area is always going to be 0, regardless of initial preallocation.


ah yes, due to qcow2 zero clusters.


Hmm. Do we actually set the zero flag on unallocated clusters when 
resizing a qcow2 image?  That would be an O(n) operation (we have to 
visit the L2 entry for each added cluster, even if only to set the zero 
cluster bit).  Or do we instead just rely on the fact that qcow2 is 
inherently sparse, and that when you resize the guest-visible size 
without writing any new clusters, then it is only subsequent guest 
access to those addresses that finally allocate clusters, making resize 
O(1) (update the qcow2 metadata cluster, but not any L2 tables) while 
still reading 0 from the new data.  To some extent, that's what the 
allocation mode is supposed to control.


What about with external data images, where a resize in guest-visible 
length requires a resize of the underlying data image?  There, we DO 
have to worry about whether the data image resizes with zeroes (as in 
the filesystem) or with random data (as in a block device).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v4 0/4] qmp: Optionally run handlers in coroutines

2020-02-05 Thread Kevin Wolf
Am 21.01.2020 um 19:11 hat Kevin Wolf geschrieben:
> Some QMP command handlers can block the main loop for a relatively long
> time, for example because they perform some I/O. This is quite nasty.
> Allowing such handlers to run in a coroutine where they can yield (and
> therefore release the BQL) while waiting for an event such as I/O
> completion solves the problem.
> 
> This series adds the infrastructure to allow this and switches
> block_resize to run in a coroutine as a first example.
> 
> This is an alternative solution to Marc-André's "monitor: add
> asynchronous command type" series.

Ping?

Kevin




Re: [PATCH v2 01/33] block: Add BlockDriver.is_format

2020-02-05 Thread Eric Blake

On 2/4/20 11:08 AM, Max Reitz wrote:

We want to unify child_format and child_file at some point.  One of the
important things that set format drivers apart from other drivers is
that they do not expect other format nodes under them (except in the
backing chain).  That means we need something on which to distinguish
format drivers from others, and hence this flag.


It _is_ possible to set up a format node on top of another; in fact, our 
testsuite does that in at least iotest 072.  I agree that setups like 
'qcow2 - qcow2 - file' are uncommon, but the setup 'qcow2 - raw - file' 
may be useful for extracting a partition of a raw disk when it is known 
that the partition in that disk itself contains qcow2 data.




Signed-off-by: Max Reitz 
---



+++ b/include/block/block_int.h
@@ -94,6 +94,13 @@ struct BlockDriver {
   * must implement them and return -ENOTSUP.
   */
  bool is_filter;
+/*
+ * Set to true if the BlockDriver is a format driver.  Format nodes
+ * generally do not expect their children to be other format nodes
+ * (except for backing files), and so format probing is disabled
+ * on those children.


Aha - nested formats ARE still allowed when you explicitly request it 
(which is what iotest 72 does) - what you are stating here is that 
implicit probing of children is forbidden for a parent declared as a format 
driver.  That makes more sense.


I'm not sure if the commit message needs a tweak, but the patch itself 
is sane as-is.


Reviewed-by: Eric Blake 


+ */
+bool is_format;
  /*
   * Return true if @to_replace can be replaced by a BDS with the
   * same data as @bs without it affecting @bs's behavior (that is,



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH for-4.2? v3 0/8] block: Fix resize (extending) of short overlays

2020-02-05 Thread Vladimir Sementsov-Ogievskiy

19.12.2019 13:20, Vladimir Sementsov-Ogievskiy wrote:

19.12.2019 13:13, Kevin Wolf wrote:

Am 19.12.2019 um 10:24 hat Vladimir Sementsov-Ogievskiy geschrieben:

10.12.2019 20:46, Kevin Wolf wrote:

Am 22.11.2019 um 17:05 hat Kevin Wolf geschrieben:

See patch 4 for the description of the bug fixed.


I'm applying patches 3 and 5-7 to the block branch because they make
sense on their own.

The real fix will need another approach because the error handling is
broken in this one: If zeroing out fails (either because of NO_FALLBACK
or because of some other I/O error), bdrv_co_truncate() will return
failure, but the image size has already been increased, with potentially
incorrect data in the new area.

To fix this, we need to make sure that zeros will be read before we
commit the new image size to the image file (e.g. qcow2 header) and to
bs->total_sectors. In other words, it must become the responsibility of
the block driver.

To this effect, I'm planning to introduce a PREALLOC_MODE_ZERO_INIT flag
that can be or'ed to the preallocation mode. This will fail by default
because it looks like just another unimplemented preallocation mode to
block drivers. It will be requested explicitly by commit jobs and
automatically added by bdrv_co_truncate() if the backing file would
become visible (like in this series, but now for all preallocation
modes). I'm planning to implement it for qcow2 and file-posix for now,
which should cover most interesting cases.

Does this make sense to you?


This should work. Do you still have this plan in a timeline?


Still planning to do this, but tomorrow is my last working day for this
year. So I guess I'll get to it sometime in January.



Good. Have a nice holiday!




Hi, didn't you forget? I was just going to ping (or resend) my related
"[PATCH 0/4] fix & merge block_status_above and is_allocated_above", so
I'm pinging these patches too...

--
Best regards,
Vladimir



Re: [PATCH v2 4/7] block/block-copy: refactor interfaces to use bytes instead of end

2020-02-05 Thread Vladimir Sementsov-Ogievskiy

29.01.2020 20:12, Andrey Shinkevich wrote:



On 27/11/2019 21:08, Vladimir Sementsov-Ogievskiy wrote:

We have a lot of "chunk_end - start" invocations, let's switch to
bytes/cur_bytes scheme instead.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
   include/block/block-copy.h |  4 +--
   block/block-copy.c | 68 --
   2 files changed, 37 insertions(+), 35 deletions(-)

diff --git a/include/block/block-copy.h b/include/block/block-copy.h
index 0a161724d7..7321b3d305 100644
--- a/include/block/block-copy.h
+++ b/include/block/block-copy.h
@@ -19,8 +19,8 @@
   #include "qemu/co-shared-resource.h"
   
   typedef struct BlockCopyInFlightReq {

-int64_t start_byte;
-int64_t end_byte;
+int64_t start;
+int64_t bytes;
   QLIST_ENTRY(BlockCopyInFlightReq) list;
   CoQueue wait_queue; /* coroutines blocked on this request */
   } BlockCopyInFlightReq;
diff --git a/block/block-copy.c b/block/block-copy.c
index 94e7e855ef..cc273b6cb8 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -26,12 +26,12 @@
   
   static BlockCopyInFlightReq *block_copy_find_inflight_req(BlockCopyState *s,

 int64_t start,
-  int64_t end)
+  int64_t bytes)
   {
   BlockCopyInFlightReq *req;
   
   QLIST_FOREACH(req, &s->inflight_reqs, list) {

-if (end > req->start_byte && start < req->end_byte) {
+if (start + bytes > req->start && start < req->start + req->bytes) {
   return req;
   }
   }
@@ -41,21 +41,21 @@ static BlockCopyInFlightReq 
*block_copy_find_inflight_req(BlockCopyState *s,
   
   static void coroutine_fn block_copy_wait_inflight_reqs(BlockCopyState *s,

  int64_t start,
-   int64_t end)
+   int64_t bytes)
   {
   BlockCopyInFlightReq *req;
   
-while ((req = block_copy_find_inflight_req(s, start, end))) {

+while ((req = block_copy_find_inflight_req(s, start, bytes))) {
   qemu_co_queue_wait(&req->wait_queue, NULL);
   }
   }
   
   static void block_copy_inflight_req_begin(BlockCopyState *s,

 BlockCopyInFlightReq *req,
-  int64_t start, int64_t end)
+  int64_t start, int64_t bytes)
   {
-req->start_byte = start;
-req->end_byte = end;
+req->start = start;
+req->bytes = bytes;
   qemu_co_queue_init(&req->wait_queue);
   QLIST_INSERT_HEAD(&s->inflight_reqs, req, list);
   }
@@ -150,24 +150,26 @@ void block_copy_set_callbacks(
   /*
* block_copy_do_copy
*
- * Do copy of cluser-aligned chunk. @end is allowed to exceed s->len only to
- * cover last cluster when s->len is not aligned to clusters.
+ * Do copy of cluser-aligned chunk. Requested region is allowed to exceed 
s->len


cluster-...


+ * only to cover last cluster when s->len is not aligned to clusters.
*
* No sync here: nor bitmap neighter intersecting requests handling, only 
copy.
*
* Returns 0 on success.
*/
   static int coroutine_fn block_copy_do_copy(BlockCopyState *s,
-   int64_t start, int64_t end,
+   int64_t start, int64_t bytes,
  bool zeroes, bool *error_is_read)
   {
   int ret;
-int nbytes = MIN(end, s->len) - start;
+int nbytes = MIN(start + bytes, s->len) - start;
   void *bounce_buffer = NULL;
   
+assert(start >= 0 && bytes > 0 && INT64_MAX - start >= bytes);

   assert(QEMU_IS_ALIGNED(start, s->cluster_size));
-assert(QEMU_IS_ALIGNED(end, s->cluster_size));
-assert(end < s->len || end == QEMU_ALIGN_UP(s->len, s->cluster_size));
+assert(QEMU_IS_ALIGNED(bytes, s->cluster_size));
+assert(start + bytes <= s->len ||


The '<=' looks correct but I wonder how only '<' worked without
assertion failure before.


No difference: if end == s->len and end is aligned, then len is aligned too and "end == 
QEMU_ALIGN_UP(s->len, s->cluster_size)" is true. But "<=" looks more natural for end and 
file len.


Was the s->len never aligned to the cluster size?


+   start + bytes == QEMU_ALIGN_UP(s->len, s->cluster_size));
   
   if (zeroes) {

   ret = bdrv_co_pwrite_zeroes(s->target, start, nbytes, s->write_flags 
&
@@ -347,7 +349,6 @@ int coroutine_fn block_copy(BlockCopyState *s,
   bool *error_is_read)
   {
   int ret = 0;
-int64_t end = bytes + start; /* bytes */
   BlockCopyInFlightReq req;
   
   /*

@@ -358,58 +359,59 @@ int coroutine_fn block_copy(BlockCopyState *s,
  bdrv_get_aio

Re: [PATCH v4 1/1] qemu-img: Add --target-is-zero to convert

2020-02-05 Thread Vladimir Sementsov-Ogievskiy

05.02.2020 14:02, David Edmondson wrote:

In many cases the target of a convert operation is a newly provisioned
target that the user knows is blank (reads as zero). In this situation
there is no requirement for qemu-img to wastefully zero out the entire
device.

Add a new option, --target-is-zero, allowing the user to indicate that
an existing target device will return zeros for all reads.

Signed-off-by: David Edmondson


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



[PATCH v4 05/10] block/dirty-bitmap: switch _next_dirty_area and _next_zero to int64_t

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
We are going to introduce bdrv_dirty_bitmap_next_dirty so that the same
variable may be used to store its return value and to be its parameter,
so it should be int64_t.

Similarly, we are going to refactor hbitmap_next_dirty_area to use
hbitmap_next_dirty together with hbitmap_next_zero, therefore we want
the hbitmap_next_zero parameter type to be int64_t too.

So, for convenience, update all parameters of *_next_zero and
*_next_dirty_area to be int64_t.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/dirty-bitmap.h |  6 +++---
 include/qemu/hbitmap.h   |  7 +++
 block/dirty-bitmap.c |  6 +++---
 nbd/server.c |  2 +-
 tests/test-hbitmap.c | 36 ++--
 util/hbitmap.c   | 13 -
 6 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index e2b20ecab9..27c72cc56a 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -105,10 +105,10 @@ for (bitmap = bdrv_dirty_bitmap_first(bs); bitmap; \
  bitmap = bdrv_dirty_bitmap_next(bitmap))
 
 char *bdrv_dirty_bitmap_sha256(const BdrvDirtyBitmap *bitmap, Error **errp);
-int64_t bdrv_dirty_bitmap_next_zero(BdrvDirtyBitmap *bitmap, uint64_t offset,
-uint64_t bytes);
+int64_t bdrv_dirty_bitmap_next_zero(BdrvDirtyBitmap *bitmap, int64_t offset,
+int64_t bytes);
 bool bdrv_dirty_bitmap_next_dirty_area(BdrvDirtyBitmap *bitmap,
-   uint64_t *offset, uint64_t *bytes);
+   int64_t *offset, int64_t *bytes);
 BdrvDirtyBitmap *bdrv_reclaim_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
   Error **errp);
 
diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index df922d8517..b6e85f3d5d 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -304,10 +304,10 @@ void hbitmap_iter_init(HBitmapIter *hbi, const HBitmap 
*hb, uint64_t first);
  * @hb: The HBitmap to operate on
  * @start: The bit to start from.
  * @count: Number of bits to proceed. If @start+@count > bitmap size, the whole
- * bitmap is looked through. You can use UINT64_MAX as @count to search up to
+ * bitmap is looked through. You can use INT64_MAX as @count to search up to
  * the bitmap end.
  */
-int64_t hbitmap_next_zero(const HBitmap *hb, uint64_t start, uint64_t count);
+int64_t hbitmap_next_zero(const HBitmap *hb, int64_t start, int64_t count);
 
 /* hbitmap_next_dirty_area:
  * @hb: The HBitmap to operate on
@@ -322,8 +322,7 @@ int64_t hbitmap_next_zero(const HBitmap *hb, uint64_t 
start, uint64_t count);
  * @offset and @bytes appropriately. Otherwise returns false and leaves @offset
  * and @bytes unchanged.
  */
-bool hbitmap_next_dirty_area(const HBitmap *hb, uint64_t *start,
- uint64_t *count);
+bool hbitmap_next_dirty_area(const HBitmap *hb, int64_t *start, int64_t 
*count);
 
 /**
  * hbitmap_iter_next:
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 7039e82520..af9f5411a6 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -860,14 +860,14 @@ char *bdrv_dirty_bitmap_sha256(const BdrvDirtyBitmap 
*bitmap, Error **errp)
 return hbitmap_sha256(bitmap->bitmap, errp);
 }
 
-int64_t bdrv_dirty_bitmap_next_zero(BdrvDirtyBitmap *bitmap, uint64_t offset,
-uint64_t bytes)
+int64_t bdrv_dirty_bitmap_next_zero(BdrvDirtyBitmap *bitmap, int64_t offset,
+int64_t bytes)
 {
 return hbitmap_next_zero(bitmap->bitmap, offset, bytes);
 }
 
 bool bdrv_dirty_bitmap_next_dirty_area(BdrvDirtyBitmap *bitmap,
-   uint64_t *offset, uint64_t *bytes)
+   int64_t *offset, int64_t *bytes)
 {
 return hbitmap_next_dirty_area(bitmap->bitmap, offset, bytes);
 }
diff --git a/nbd/server.c b/nbd/server.c
index 87fcd2e7bf..c422265041 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2055,7 +2055,7 @@ static unsigned int bitmap_to_extents(BdrvDirtyBitmap 
*bitmap, uint64_t offset,
 bool next_dirty = !dirty;
 
 if (dirty) {
-end = bdrv_dirty_bitmap_next_zero(bitmap, begin, UINT64_MAX);
+end = bdrv_dirty_bitmap_next_zero(bitmap, begin, INT64_MAX);
 } else {
 bdrv_set_dirty_iter(it, begin);
 end = bdrv_dirty_iter_next(it);
diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index aeaa0b3f22..9d210dc18c 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -817,8 +817,8 @@ static void test_hbitmap_iter_and_reset(TestHBitmapData 
*data,
 }
 
 static void test_hbitmap_next_zero_check_range(TestHBitmapData *data,
-   uint64_t start,
-   uint64_t count)
+  

[PATCH v4 09/10] nbd/server: use bdrv_dirty_bitmap_next_dirty_area

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
Use bdrv_dirty_bitmap_next_dirty_area for bitmap_to_extents. Since
bdrv_dirty_bitmap_next_dirty_area is very precise in its interface,
we'll never exceed the requested region with the last chunk. So we don't need
dont_fragment, and the bitmap_to_extents() interface becomes clean enough
not to require any comment.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 nbd/server.c | 59 +---
 1 file changed, 19 insertions(+), 40 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 21e665e9e0..0ca9b8b6b8 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2068,57 +2068,36 @@ static int nbd_co_send_block_status(NBDClient *client, 
uint64_t handle,
 return nbd_co_send_extents(client, handle, ea, last, context_id, errp);
 }
 
-/*
- * Populate @ea from a dirty bitmap. Unless @dont_fragment, the
- * final extent may exceed the original @length.
- */
+/* Populate @ea from a dirty bitmap. */
 static void bitmap_to_extents(BdrvDirtyBitmap *bitmap,
   uint64_t offset, uint64_t length,
-  NBDExtentArray *ea, bool dont_fragment)
+  NBDExtentArray *es)
 {
-uint64_t begin = offset, end = offset;
-uint64_t overall_end = offset + length;
-BdrvDirtyBitmapIter *it;
-bool dirty;
+int64_t start, dirty_start, dirty_count;
+int64_t end = offset + length;
+bool full = false;
 
 bdrv_dirty_bitmap_lock(bitmap);
 
-it = bdrv_dirty_iter_new(bitmap);
-dirty = bdrv_dirty_bitmap_get_locked(bitmap, offset);
-
-while (begin < overall_end) {
-bool next_dirty = !dirty;
-
-if (dirty) {
-end = bdrv_dirty_bitmap_next_zero(bitmap, begin, INT64_MAX);
-} else {
-bdrv_set_dirty_iter(it, begin);
-end = bdrv_dirty_iter_next(it);
-}
-if (end == -1 || end - begin > UINT32_MAX) {
-/* Cap to an aligned value < 4G beyond begin. */
-end = MIN(bdrv_dirty_bitmap_size(bitmap),
-  begin + UINT32_MAX + 1 -
-  bdrv_dirty_bitmap_granularity(bitmap));
-next_dirty = dirty;
-}
-if (dont_fragment && end > overall_end) {
-end = overall_end;
-}
-
-if (nbd_extent_array_add(ea, end - begin,
- dirty ? NBD_STATE_DIRTY : 0) < 0) {
+for (start = offset;
+ bdrv_dirty_bitmap_next_dirty_area(bitmap, start, end, INT32_MAX,
+   &dirty_start, &dirty_count);
+ start = dirty_start + dirty_count)
+{
+if ((nbd_extent_array_add(es, dirty_start - start, 0) < 0) ||
+(nbd_extent_array_add(es, dirty_count, NBD_STATE_DIRTY) < 0))
+{
+full = true;
 break;
 }
-begin = end;
-dirty = next_dirty;
 }
 
-bdrv_dirty_iter_free(it);
+if (!full) {
+/* last non dirty extent */
+nbd_extent_array_add(es, end - start, 0);
+}
 
 bdrv_dirty_bitmap_unlock(bitmap);
-
-assert(offset < end);
 }
 
 static int nbd_co_send_bitmap(NBDClient *client, uint64_t handle,
@@ -2129,7 +2108,7 @@ static int nbd_co_send_bitmap(NBDClient *client, uint64_t 
handle,
 unsigned int nb_extents = dont_fragment ? 1 : NBD_MAX_BLOCK_STATUS_EXTENTS;
 g_autoptr(NBDExtentArray) ea = nbd_extent_array_new(nb_extents);
 
-bitmap_to_extents(bitmap, offset, length, ea, dont_fragment);
+bitmap_to_extents(bitmap, offset, length, ea);
 
 return nbd_co_send_extents(client, handle, ea, last, context_id, errp);
 }
-- 
2.21.0




[PATCH v4 04/10] hbitmap: drop meta bitmaps as they are unused

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 include/qemu/hbitmap.h |  21 
 tests/test-hbitmap.c   | 115 -
 util/hbitmap.c |  16 --
 3 files changed, 152 deletions(-)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index 15837a0e2d..df922d8517 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -325,27 +325,6 @@ int64_t hbitmap_next_zero(const HBitmap *hb, uint64_t 
start, uint64_t count);
 bool hbitmap_next_dirty_area(const HBitmap *hb, uint64_t *start,
  uint64_t *count);
 
-/* hbitmap_create_meta:
- * Create a "meta" hbitmap to track dirtiness of the bits in this HBitmap.
- * The caller owns the created bitmap and must call hbitmap_free_meta(hb) to
- * free it.
- *
- * Currently, we only guarantee that if a bit in the hbitmap is changed it
- * will be reflected in the meta bitmap, but we do not yet guarantee the
- * opposite.
- *
- * @hb: The HBitmap to operate on.
- * @chunk_size: How many bits in @hb does one bit in the meta track.
- */
-HBitmap *hbitmap_create_meta(HBitmap *hb, int chunk_size);
-
-/* hbitmap_free_meta:
- * Free the meta bitmap of @hb.
- *
- * @hb: The HBitmap whose meta bitmap should be freed.
- */
-void hbitmap_free_meta(HBitmap *hb);
-
 /**
  * hbitmap_iter_next:
  * @hbi: HBitmapIter to operate on.
diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index e1f867085f..aeaa0b3f22 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -22,7 +22,6 @@
 
 typedef struct TestHBitmapData {
 HBitmap   *hb;
-HBitmap   *meta;
 unsigned long *bits;
 size_t size;
 size_t old_size;
@@ -94,14 +93,6 @@ static void hbitmap_test_init(TestHBitmapData *data,
 }
 }
 
-static void hbitmap_test_init_meta(TestHBitmapData *data,
-   uint64_t size, int granularity,
-   int meta_chunk)
-{
-hbitmap_test_init(data, size, granularity);
-data->meta = hbitmap_create_meta(data->hb, meta_chunk);
-}
-
 static inline size_t hbitmap_test_array_size(size_t bits)
 {
 size_t n = DIV_ROUND_UP(bits, BITS_PER_LONG);
@@ -144,9 +135,6 @@ static void hbitmap_test_teardown(TestHBitmapData *data,
   const void *unused)
 {
 if (data->hb) {
-if (data->meta) {
-hbitmap_free_meta(data->hb);
-}
 hbitmap_free(data->hb);
 data->hb = NULL;
 }
@@ -648,96 +636,6 @@ static void 
test_hbitmap_truncate_shrink_large(TestHBitmapData *data,
 hbitmap_test_truncate(data, size, -diff, 0);
 }
 
-static void hbitmap_check_meta(TestHBitmapData *data,
-   int64_t start, int count)
-{
-int64_t i;
-
-for (i = 0; i < data->size; i++) {
-if (i >= start && i < start + count) {
-g_assert(hbitmap_get(data->meta, i));
-} else {
-g_assert(!hbitmap_get(data->meta, i));
-}
-}
-}
-
-static void hbitmap_test_meta(TestHBitmapData *data,
-  int64_t start, int count,
-  int64_t check_start, int check_count)
-{
-hbitmap_reset_all(data->hb);
-hbitmap_reset_all(data->meta);
-
-/* Test "unset" -> "unset" will not update meta. */
-hbitmap_reset(data->hb, start, count);
-hbitmap_check_meta(data, 0, 0);
-
-/* Test "unset" -> "set" will update meta */
-hbitmap_set(data->hb, start, count);
-hbitmap_check_meta(data, check_start, check_count);
-
-/* Test "set" -> "set" will not update meta */
-hbitmap_reset_all(data->meta);
-hbitmap_set(data->hb, start, count);
-hbitmap_check_meta(data, 0, 0);
-
-/* Test "set" -> "unset" will update meta */
-hbitmap_reset_all(data->meta);
-hbitmap_reset(data->hb, start, count);
-hbitmap_check_meta(data, check_start, check_count);
-}
-
-static void hbitmap_test_meta_do(TestHBitmapData *data, int chunk_size)
-{
-uint64_t size = chunk_size * 100;
-hbitmap_test_init_meta(data, size, 0, chunk_size);
-
-hbitmap_test_meta(data, 0, 1, 0, chunk_size);
-hbitmap_test_meta(data, 0, chunk_size, 0, chunk_size);
-hbitmap_test_meta(data, chunk_size - 1, 1, 0, chunk_size);
-hbitmap_test_meta(data, chunk_size - 1, 2, 0, chunk_size * 2);
-hbitmap_test_meta(data, chunk_size - 1, chunk_size + 1, 0, chunk_size * 2);
-hbitmap_test_meta(data, chunk_size - 1, chunk_size + 2, 0, chunk_size * 3);
-hbitmap_test_meta(data, 7 * chunk_size - 1, chunk_size + 2,
-  6 * chunk_size, chunk_size * 3);
-hbitmap_test_meta(data, size - 1, 1, size - chunk_size, chunk_size);
-hbitmap_test_meta(data, 0, size, 0, size);
-}
-
-static void test_hbitmap_meta_byte(TestHBitmapData *data, const void *unused)
-{
-hbitmap_test_meta_do(data, BITS_PER_BYTE);
-}
-
-static void test_hbitmap_meta_word(TestHBitmapData *data, const void *unused)
-{
-hbitmap_tes

[PATCH v4 10/10] block/qcow2-bitmap: use bdrv_dirty_bitmap_next_dirty

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
The store_bitmap_data() loop does bdrv_set_dirty_iter() on each iteration,
which means that we don't actually need the iterator itself and can use
the simpler bitmap API.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 block/qcow2-bitmap.c | 15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index d41f5d049b..82646c53bd 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -1289,7 +1289,6 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
 const char *bm_name = bdrv_dirty_bitmap_name(bitmap);
 uint8_t *buf = NULL;
-BdrvDirtyBitmapIter *dbi;
 uint64_t *tb;
 uint64_t tb_size =
 size_to_clusters(s,
@@ -1308,12 +1307,14 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 return NULL;
 }
 
-dbi = bdrv_dirty_iter_new(bitmap);
 buf = g_malloc(s->cluster_size);
 limit = bytes_covered_by_bitmap_cluster(s, bitmap);
 assert(DIV_ROUND_UP(bm_size, limit) == tb_size);
 
-while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
+offset = 0;
+while ((offset = bdrv_dirty_bitmap_next_dirty(bitmap, offset, INT64_MAX))
+   >= 0)
+{
 uint64_t cluster = offset / limit;
 uint64_t end, write_size;
 int64_t off;
@@ -1356,23 +1357,17 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 goto fail;
 }
 
-if (end >= bm_size) {
-break;
-}
-
-bdrv_set_dirty_iter(dbi, end);
+offset = end;
 }
 
 *bitmap_table_size = tb_size;
 g_free(buf);
-bdrv_dirty_iter_free(dbi);
 
 return tb;
 
 fail:
 clear_bitmap_table(bs, tb, tb_size);
 g_free(buf);
-bdrv_dirty_iter_free(dbi);
 g_free(tb);
 
 return NULL;
-- 
2.21.0




[PATCH v4 08/10] nbd/server: introduce NBDExtentArray

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
Introduce the NBDExtentArray class to handle extent list creation in a more
controlled way and with fewer OUT parameters in functions.
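
Based purely on the helpers added below, a typical caller looks roughly like
this (surrounding declarations are omitted, so treat it as a sketch rather
than a quote from the patch):

g_autoptr(NBDExtentArray) ea = nbd_extent_array_new(nb_extents);

/* Append extents; -1 means the array is full and no further
 * nbd_extent_array_add() calls may be made. */
if (nbd_extent_array_add(ea, length, flags) < 0) {
    /* stop producing extents; what was added so far is still valid */
}

/* Convert lengths/flags to big-endian exactly once, right before sending;
 * the array must not be modified afterwards. */
nbd_extent_array_convert_to_be(ea);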

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 nbd/server.c | 210 +--
 1 file changed, 118 insertions(+), 92 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index c422265041..21e665e9e0 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1909,27 +1909,98 @@ static int coroutine_fn 
nbd_co_send_sparse_read(NBDClient *client,
 return ret;
 }
 
+typedef struct NBDExtentArray {
+NBDExtent *extents;
+unsigned int nb_alloc;
+unsigned int count;
+uint64_t total_length;
+bool can_add;
+bool converted_to_be;
+} NBDExtentArray;
+
+static NBDExtentArray *nbd_extent_array_new(unsigned int nb_alloc)
+{
+NBDExtentArray *ea = g_new0(NBDExtentArray, 1);
+
+ea->nb_alloc = nb_alloc;
+ea->extents = g_new(NBDExtent, nb_alloc);
+ea->can_add = true;
+
+return ea;
+}
+
+static void nbd_extent_array_free(NBDExtentArray *ea)
+{
+g_free(ea->extents);
+g_free(ea);
+}
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(NBDExtentArray, nbd_extent_array_free);
+
+/* Further modifications of the array after conversion are abandoned */
+static void nbd_extent_array_convert_to_be(NBDExtentArray *ea)
+{
+int i;
+
+assert(!ea->converted_to_be);
+ea->can_add = false;
+ea->converted_to_be = true;
+
+for (i = 0; i < ea->count; i++) {
+ea->extents[i].flags = cpu_to_be32(ea->extents[i].flags);
+ea->extents[i].length = cpu_to_be32(ea->extents[i].length);
+}
+}
+
 /*
- * Populate @extents from block status. Update @bytes to be the actual
- * length encoded (which may be smaller than the original), and update
- * @nb_extents to the number of extents used.
- *
- * Returns zero on success and -errno on bdrv_block_status_above failure.
+ * Add extent to NBDExtentArray. If extent can't be added (no available space),
+ * return -1.
+ * For safety, when returning -1 for the first time, .can_add is set to false,
+ * further call to nbd_extent_array_add() will crash.
+ * (to avoid the situation, when after failing to add an extent (returned -1),
+ * user miss this failure and add another extent, which is successfully added
+ * (array is full, but new extent may be squashed into the last one), then we
+ * have invalid array with skipped extent)
  */
-static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
-  uint64_t *bytes, NBDExtent *extents,
-  unsigned int *nb_extents)
+static int nbd_extent_array_add(NBDExtentArray *ea,
+uint32_t length, uint32_t flags)
 {
-uint64_t remaining_bytes = *bytes;
-NBDExtent *extent = extents, *extents_end = extents + *nb_extents;
-bool first_extent = true;
+assert(ea->can_add);
+
+if (!length) {
+return 0;
+}
+
+/* Extend previous extent if flags are the same */
+if (ea->count > 0 && flags == ea->extents[ea->count - 1].flags) {
+uint64_t sum = (uint64_t)length + ea->extents[ea->count - 1].length;
+
+if (sum <= UINT32_MAX) {
+ea->extents[ea->count - 1].length = sum;
+ea->total_length += length;
+return 0;
+}
+}
+
+if (ea->count >= ea->nb_alloc) {
+ea->can_add = false;
+return -1;
+}
+
+ea->total_length += length;
+ea->extents[ea->count] = (NBDExtent) {.length = length, .flags = flags};
+ea->count++;
 
-assert(*nb_extents);
-while (remaining_bytes) {
+return 0;
+}
+
+static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
+  uint64_t bytes, NBDExtentArray *ea)
+{
+while (bytes) {
 uint32_t flags;
 int64_t num;
-int ret = bdrv_block_status_above(bs, NULL, offset, remaining_bytes,
-  &num, NULL, NULL);
+int ret = bdrv_block_status_above(bs, NULL, offset, bytes, &num,
+  NULL, NULL);
 
 if (ret < 0) {
 return ret;
@@ -1938,60 +2009,37 @@ static int blockstatus_to_extents(BlockDriverState *bs, 
uint64_t offset,
 flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
 (ret & BDRV_BLOCK_ZERO  ? NBD_STATE_ZERO : 0);
 
-if (first_extent) {
-extent->flags = flags;
-extent->length = num;
-first_extent = false;
-} else if (flags == extent->flags) {
-/* extend current extent */
-extent->length += num;
-} else {
-if (extent + 1 == extents_end) {
-break;
-}
-
-/* start new extent */
-extent++;
-extent->flags = flags;
-extent->length = num;
+if (nbd_extent_array_add(ea, num, flags) < 0) {
+return 0;
 }
-of

[PATCH v4 06/10] block/dirty-bitmap: add _next_dirty API

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
We have bdrv_dirty_bitmap_next_zero; let's add a corresponding
bdrv_dirty_bitmap_next_dirty, which is more convenient to use than
bitmap iterators in some cases.

For the tests, modify test_hbitmap_next_zero_check_range to check both
next_zero and next_dirty, and add some new checks.
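
As an illustration of why this is handier than an iterator in some cases, a
walk over all dirty ranges can be written directly against the two lookup
functions. Sketch only; process_dirty_range() is an invented placeholder:

int64_t offset = 0;

for (;;) {
    int64_t dirty_start = bdrv_dirty_bitmap_next_dirty(bitmap, offset,
                                                       INT64_MAX);
    int64_t dirty_end;

    if (dirty_start < 0) {
        break;                                   /* no more dirty bits */
    }
    dirty_end = bdrv_dirty_bitmap_next_zero(bitmap, dirty_start, INT64_MAX);
    if (dirty_end < 0) {
        dirty_end = bdrv_dirty_bitmap_size(bitmap);  /* dirty to the end */
    }
    process_dirty_range(dirty_start, dirty_end - dirty_start);
    offset = dirty_end;
}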

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 include/block/dirty-bitmap.h |   2 +
 include/qemu/hbitmap.h   |  13 
 block/dirty-bitmap.c |   6 ++
 tests/test-hbitmap.c | 130 ---
 util/hbitmap.c   |  60 
 5 files changed, 126 insertions(+), 85 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 27c72cc56a..b1f0de12db 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -105,6 +105,8 @@ for (bitmap = bdrv_dirty_bitmap_first(bs); bitmap; \
  bitmap = bdrv_dirty_bitmap_next(bitmap))
 
 char *bdrv_dirty_bitmap_sha256(const BdrvDirtyBitmap *bitmap, Error **errp);
+int64_t bdrv_dirty_bitmap_next_dirty(BdrvDirtyBitmap *bitmap, int64_t offset,
+ int64_t bytes);
 int64_t bdrv_dirty_bitmap_next_zero(BdrvDirtyBitmap *bitmap, int64_t offset,
 int64_t bytes);
 bool bdrv_dirty_bitmap_next_dirty_area(BdrvDirtyBitmap *bitmap,
diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index b6e85f3d5d..6e9ae51ed3 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -297,6 +297,19 @@ void hbitmap_free(HBitmap *hb);
  */
 void hbitmap_iter_init(HBitmapIter *hbi, const HBitmap *hb, uint64_t first);
 
+/*
+ * hbitmap_next_dirty:
+ *
+ * Find next dirty bit within selected range. If not found, return -1.
+ *
+ * @hb: The HBitmap to operate on
+ * @start: The bit to start from.
+ * @count: Number of bits to proceed. If @start+@count > bitmap size, the whole
+ * bitmap is looked through. You can use INT64_MAX as @count to search up to
+ * the bitmap end.
+ */
+int64_t hbitmap_next_dirty(const HBitmap *hb, int64_t start, int64_t count);
+
 /* hbitmap_next_zero:
  *
  * Find next not dirty bit within selected range. If not found, return -1.
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index af9f5411a6..1b14c8eb26 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -860,6 +860,12 @@ char *bdrv_dirty_bitmap_sha256(const BdrvDirtyBitmap 
*bitmap, Error **errp)
 return hbitmap_sha256(bitmap->bitmap, errp);
 }
 
+int64_t bdrv_dirty_bitmap_next_dirty(BdrvDirtyBitmap *bitmap, int64_t offset,
+ int64_t bytes)
+{
+return hbitmap_next_dirty(bitmap->bitmap, offset, bytes);
+}
+
 int64_t bdrv_dirty_bitmap_next_zero(BdrvDirtyBitmap *bitmap, int64_t offset,
 int64_t bytes)
 {
diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index 9d210dc18c..8905b8a351 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -816,92 +816,108 @@ static void test_hbitmap_iter_and_reset(TestHBitmapData 
*data,
 hbitmap_iter_next(&hbi);
 }
 
-static void test_hbitmap_next_zero_check_range(TestHBitmapData *data,
-   int64_t start,
-   int64_t count)
+static void test_hbitmap_next_x_check_range(TestHBitmapData *data,
+int64_t start,
+int64_t count)
 {
-int64_t ret1 = hbitmap_next_zero(data->hb, start, count);
-int64_t ret2 = start;
+int64_t next_zero = hbitmap_next_zero(data->hb, start, count);
+int64_t next_dirty = hbitmap_next_dirty(data->hb, start, count);
+int64_t next;
 int64_t end = start >= data->size || data->size - start < count ?
 data->size : start + count;
+bool first_bit = hbitmap_get(data->hb, start);
 
-for ( ; ret2 < end && hbitmap_get(data->hb, ret2); ret2++) {
+for (next = start;
+ next < end && hbitmap_get(data->hb, next) == first_bit;
+ next++)
+{
 ;
 }
-if (ret2 == end) {
-ret2 = -1;
+
+if (next == end) {
+next = -1;
 }
 
-g_assert_cmpint(ret1, ==, ret2);
+g_assert_cmpint(next_dirty, ==, first_bit ? start : next);
+g_assert_cmpint(next_zero, ==, first_bit ? next : start);
 }
 
-static void test_hbitmap_next_zero_check(TestHBitmapData *data, int64_t start)
+static void test_hbitmap_next_x_check(TestHBitmapData *data, int64_t start)
 {
-test_hbitmap_next_zero_check_range(data, start, INT64_MAX);
+test_hbitmap_next_x_check_range(data, start, INT64_MAX);
 }
 
-static void test_hbitmap_next_zero_do(TestHBitmapData *data, int granularity)
+static void test_hbitmap_next_x_do(TestHBitmapData *data, int granularity)
 {
 hbitmap_test_init(data, L3, granularity);
-test_hbitmap_next_zero_check(data, 0);
-test_hbitmap_next_zero_check(data, L3 - 1);
-test_hbitmap_next_zero_check_ran

[PATCH v4 07/10] block/dirty-bitmap: improve _next_dirty_area API

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
Firstly, _next_dirty_area is for scenarios where we repeatedly
search for the next dirty area inside some limited region, so it is more
convenient to specify an "end" which does not have to be recalculated on each
iteration.

Secondly, let's add a possibility to limit the size of the resulting area
without limiting the searched area. This will be used in the NBD code in a
further commit. (Note that bdrv_dirty_bitmap_next_dirty_area is now unused.)
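
For reference, the resulting call shape (prototype as below; the 1 MiB limit
is just an arbitrary example value):

int64_t dirty_start, dirty_count;

if (bdrv_dirty_bitmap_next_dirty_area(bitmap, start, end, 1024 * 1024,
                                      &dirty_start, &dirty_count)) {
    /* [dirty_start, dirty_start + dirty_count) is dirty and dirty_count is
     * at most 1 MiB; on false, both out-parameters are left unchanged. */
}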

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 include/block/dirty-bitmap.h |  3 ++-
 include/qemu/hbitmap.h   | 25 +++-
 block/dirty-bitmap.c |  6 +++--
 tests/test-hbitmap.c | 43 +++
 util/hbitmap.c   | 44 ++--
 5 files changed, 75 insertions(+), 46 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index b1f0de12db..8a10029418 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -110,7 +110,8 @@ int64_t bdrv_dirty_bitmap_next_dirty(BdrvDirtyBitmap 
*bitmap, int64_t offset,
 int64_t bdrv_dirty_bitmap_next_zero(BdrvDirtyBitmap *bitmap, int64_t offset,
 int64_t bytes);
 bool bdrv_dirty_bitmap_next_dirty_area(BdrvDirtyBitmap *bitmap,
-   int64_t *offset, int64_t *bytes);
+int64_t start, int64_t end, int64_t max_dirty_count,
+int64_t *dirty_start, int64_t *dirty_count);
 BdrvDirtyBitmap *bdrv_reclaim_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
   Error **errp);
 
diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index 6e9ae51ed3..5e71b6d6f7 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -324,18 +324,21 @@ int64_t hbitmap_next_zero(const HBitmap *hb, int64_t 
start, int64_t count);
 
 /* hbitmap_next_dirty_area:
  * @hb: The HBitmap to operate on
- * @start: in-out parameter.
- * in: the offset to start from
- * out: (if area found) start of found area
- * @count: in-out parameter.
- * in: length of requested region
- * out: length of found area
- *
- * If dirty area found within [@start, @start + @count), returns true and sets
- * @offset and @bytes appropriately. Otherwise returns false and leaves @offset
- * and @bytes unchanged.
+ * @start: the offset to start from
+ * @end: end of requested area
+ * @max_dirty_count: limit for out parameter dirty_count
+ * @dirty_start: on success: start of found area
+ * @dirty_count: on success: length of found area
+ *
+ * If dirty area found within [@start, @end), returns true and sets
+ * @dirty_start and @dirty_count appropriately. @dirty_count will not exceed
+ * @max_dirty_count.
+ * If dirty area was not found, returns false and leaves @dirty_start and
+ * @dirty_count unchanged.
  */
-bool hbitmap_next_dirty_area(const HBitmap *hb, int64_t *start, int64_t 
*count);
+bool hbitmap_next_dirty_area(const HBitmap *hb, int64_t start, int64_t end,
+ int64_t max_dirty_count,
+ int64_t *dirty_start, int64_t *dirty_count);
 
 /**
  * hbitmap_iter_next:
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 1b14c8eb26..063793e316 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -873,9 +873,11 @@ int64_t bdrv_dirty_bitmap_next_zero(BdrvDirtyBitmap 
*bitmap, int64_t offset,
 }
 
 bool bdrv_dirty_bitmap_next_dirty_area(BdrvDirtyBitmap *bitmap,
-   int64_t *offset, int64_t *bytes)
+int64_t start, int64_t end, int64_t max_dirty_count,
+int64_t *dirty_start, int64_t *dirty_count)
 {
-return hbitmap_next_dirty_area(bitmap->bitmap, offset, bytes);
+return hbitmap_next_dirty_area(bitmap->bitmap, start, end, max_dirty_count,
+   dirty_start, dirty_count);
 }
 
 /**
diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index 8905b8a351..b6726cf76b 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -920,18 +920,19 @@ static void 
test_hbitmap_next_x_after_truncate(TestHBitmapData *data,
 test_hbitmap_next_x_check(data, 0);
 }
 
-static void test_hbitmap_next_dirty_area_check(TestHBitmapData *data,
-   int64_t offset,
-   int64_t count)
+static void test_hbitmap_next_dirty_area_check_limited(TestHBitmapData *data,
+   int64_t offset,
+   int64_t count,
+   int64_t max_dirty)
 {
 int64_t off1, off2;
 int64_t len1 = 0, len2;
 bool ret1, ret2;
 int64_t end;
 
-off1 = offset;
-len1 = count;
-ret1 = hbitmap_next_dirty_area(data->hb, &off1, &len1);
+ret1 = hbitmap_next_dirty_area(data->hb,
+offset, count == INT6

[PATCH v4 00/10] Further bitmaps improvements

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
Hi!

The main feature here is the improvement of the _next_dirty_area API,
which I'm going to use later for backup / block-copy.

Somehow I thought that it was already merged, but it seems I even forgot
to send v4.

v4:
01-04: add Max's r-b
05: switch test_hbitmap_next_zero_check_range args to int64_t too
06: fix s/UINT64_MAX/INT64_MAX/ in comment to hbitmap_next_dirty
s/firt_dirty_off/first_dirty_off/
Context changed due to 05 change, but I keep Max's r-b
07: simplify parameter check in hbitmap_next_dirty_area
drop initialization in hbitmap_sparse_merge
add Max's r-b
08: commit message tweak
refactor converted flag to separated converted_to_be and can_add
do not convert to be automatically in nbd_extent_array_add
check uint32 overflow in nbd_extent_array_add
10: drop extra check from store_bitmap_data, add Max's r-b

Vladimir Sementsov-Ogievskiy (10):
  hbitmap: assert that we don't create bitmap larger than INT64_MAX
  hbitmap: move hbitmap_iter_next_word to hbitmap.c
  hbitmap: unpublish hbitmap_iter_skip_words
  hbitmap: drop meta bitmaps as they are unused
  block/dirty-bitmap: switch _next_dirty_area and _next_zero to int64_t
  block/dirty-bitmap: add _next_dirty API
  block/dirty-bitmap: improve _next_dirty_area API
  nbd/server: introduce NBDExtentArray
  nbd/server: use bdrv_dirty_bitmap_next_dirty_area
  block/qcow2-bitmap: use bdrv_dirty_bitmap_next_dirty

 include/block/dirty-bitmap.h |   9 +-
 include/qemu/hbitmap.h   |  97 +++
 block/dirty-bitmap.c |  16 +-
 block/qcow2-bitmap.c |  15 +-
 nbd/server.c | 251 ++--
 tests/test-hbitmap.c | 314 +--
 util/hbitmap.c   | 134 +--
 7 files changed, 375 insertions(+), 461 deletions(-)

-- 
2.21.0




[PATCH v4 03/10] hbitmap: unpublish hbitmap_iter_skip_words

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
The function is internal and even commented as such. Drop its
declaration from the .h file.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 include/qemu/hbitmap.h | 7 ---
 util/hbitmap.c | 2 +-
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index ab227b117f..15837a0e2d 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -297,13 +297,6 @@ void hbitmap_free(HBitmap *hb);
  */
 void hbitmap_iter_init(HBitmapIter *hbi, const HBitmap *hb, uint64_t first);
 
-/* hbitmap_iter_skip_words:
- * @hbi: HBitmapIter to operate on.
- *
- * Internal function used by hbitmap_iter_next and hbitmap_iter_next_word.
- */
-unsigned long hbitmap_iter_skip_words(HBitmapIter *hbi);
-
 /* hbitmap_next_zero:
  *
  * Find next not dirty bit within selected range. If not found, return -1.
diff --git a/util/hbitmap.c b/util/hbitmap.c
index a368dc5ef7..26145d4b9e 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -104,7 +104,7 @@ struct HBitmap {
 /* Advance hbi to the next nonzero word and return it.  hbi->pos
  * is updated.  Returns zero if we reach the end of the bitmap.
  */
-unsigned long hbitmap_iter_skip_words(HBitmapIter *hbi)
+static unsigned long hbitmap_iter_skip_words(HBitmapIter *hbi)
 {
 size_t pos = hbi->pos;
 const HBitmap *hb = hbi->hb;
-- 
2.21.0




[PATCH v4 02/10] hbitmap: move hbitmap_iter_next_word to hbitmap.c

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
The function is definitely internal (it's not used by third parties
and it has a complicated interface). Move it to the .c file.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 include/qemu/hbitmap.h | 30 --
 util/hbitmap.c | 29 +
 2 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index 1bf944ca3d..ab227b117f 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -362,34 +362,4 @@ void hbitmap_free_meta(HBitmap *hb);
  */
 int64_t hbitmap_iter_next(HBitmapIter *hbi);
 
-/**
- * hbitmap_iter_next_word:
- * @hbi: HBitmapIter to operate on.
- * @p_cur: Location where to store the next non-zero word.
- *
- * Return the index of the next nonzero word that is set in @hbi's
- * associated HBitmap, and set *p_cur to the content of that word
- * (bits before the index that was passed to hbitmap_iter_init are
- * trimmed on the first call).  Return -1, and set *p_cur to zero,
- * if all remaining words are zero.
- */
-static inline size_t hbitmap_iter_next_word(HBitmapIter *hbi, unsigned long 
*p_cur)
-{
-unsigned long cur = hbi->cur[HBITMAP_LEVELS - 1];
-
-if (cur == 0) {
-cur = hbitmap_iter_skip_words(hbi);
-if (cur == 0) {
-*p_cur = 0;
-return -1;
-}
-}
-
-/* The next call will resume work from the next word.  */
-hbi->cur[HBITMAP_LEVELS - 1] = 0;
-*p_cur = cur;
-return hbi->pos;
-}
-
-
 #endif
diff --git a/util/hbitmap.c b/util/hbitmap.c
index 7f9b3e0cd7..a368dc5ef7 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -298,6 +298,35 @@ uint64_t hbitmap_count(const HBitmap *hb)
 return hb->count << hb->granularity;
 }
 
+/**
+ * hbitmap_iter_next_word:
+ * @hbi: HBitmapIter to operate on.
+ * @p_cur: Location where to store the next non-zero word.
+ *
+ * Return the index of the next nonzero word that is set in @hbi's
+ * associated HBitmap, and set *p_cur to the content of that word
+ * (bits before the index that was passed to hbitmap_iter_init are
+ * trimmed on the first call).  Return -1, and set *p_cur to zero,
+ * if all remaining words are zero.
+ */
+static size_t hbitmap_iter_next_word(HBitmapIter *hbi, unsigned long *p_cur)
+{
+unsigned long cur = hbi->cur[HBITMAP_LEVELS - 1];
+
+if (cur == 0) {
+cur = hbitmap_iter_skip_words(hbi);
+if (cur == 0) {
+*p_cur = 0;
+return -1;
+}
+}
+
+/* The next call will resume work from the next word.  */
+hbi->cur[HBITMAP_LEVELS - 1] = 0;
+*p_cur = cur;
+return hbi->pos;
+}
+
 /* Count the number of set bits between start and end, not accounting for
  * the granularity.  Also an example of how to use hbitmap_iter_next_word.
  */
-- 
2.21.0




[PATCH v4 01/10] hbitmap: assert that we don't create bitmap larger than INT64_MAX

2020-02-05 Thread Vladimir Sementsov-Ogievskiy
We have APIs which return signed int64_t in order to be able to return
errors. Therefore we can't handle bitmaps with an absolute size larger
than (INT64_MAX + 1). Still, keep the maximum at INT64_MAX, which is a
bit safer.

Note that bitmaps are used to represent disk images, which can't
exceed INT64_MAX anyway.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Max Reitz 
---
 util/hbitmap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/util/hbitmap.c b/util/hbitmap.c
index 242c6e519c..7f9b3e0cd7 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -716,6 +716,7 @@ HBitmap *hbitmap_alloc(uint64_t size, int granularity)
 HBitmap *hb = g_new0(struct HBitmap, 1);
 unsigned i;
 
+assert(size <= INT64_MAX);
 hb->orig_size = size;
 
 assert(granularity >= 0 && granularity < 64);
@@ -746,6 +747,7 @@ void hbitmap_truncate(HBitmap *hb, uint64_t size)
 uint64_t num_elements = size;
 uint64_t old;
 
+assert(size <= INT64_MAX);
 hb->orig_size = size;
 
 /* Size comes in as logical elements, adjust for granularity. */
-- 
2.21.0




Re: [PATCH v1 2/4] virtio: increase virtqueue size for virtio-scsi and virtio-blk

2020-02-05 Thread Stefan Hajnoczi
On Tue, Feb 04, 2020 at 12:59:04PM +0300, Denis Plotnikov wrote:
> 
> 
> On 30.01.2020 17:58, Stefan Hajnoczi wrote:
> > On Wed, Jan 29, 2020 at 05:07:00PM +0300, Denis Plotnikov wrote:
> > > The goal is to reduce the number of requests issued by a guest on
> > > 1M reads/writes. This raises the performance by up to 4% on that kind
> > > of disk access pattern.
> > > 
> > > The maximum chunk size to be used for guest disk access is
> > > limited by the seg_max parameter, which represents the maximum number
> > > of pieces in the scatter-gather list of one guest disk request.
> > > 
> > > Since seg_max is virtqueue_size dependent, increasing the virtqueue
> > > size increases seg_max, which, in turn, increases the maximum size
> > > of data to be read/written from the guest disk.
> > > 
> > > More details in the original problem statement:
> > > https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg03721.html
> > > 
> > > Suggested-by: Denis V. Lunev 
> > > Signed-off-by: Denis Plotnikov 
> > > ---
> > >   hw/core/machine.c  | 3 +++
> > >   include/hw/virtio/virtio.h | 2 +-
> > >   2 files changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > > index 3e288bfceb..8bc401d8b7 100644
> > > --- a/hw/core/machine.c
> > > +++ b/hw/core/machine.c
> > > @@ -28,6 +28,9 @@
> > >   #include "hw/mem/nvdimm.h"
> > >   GlobalProperty hw_compat_4_2[] = {
> > > +{ "virtio-blk-device", "queue-size", "128"},
> > > +{ "virtio-scsi-device", "virtqueue_size", "128"},
> > > +{ "vhost-blk-device", "virtqueue_size", "128"},
> > vhost-blk-device?!  Who has this?  It's not in qemu.git so please omit
> > this line. ;-)
> So in this case the line:
> 
> { "vhost-blk-device", "seg_max_adjust", "off"},
> 
> introduced by my patch:
> 
> commit 1bf8a989a566b2ba41c197004ec2a02562a766a4
> Author: Denis Plotnikov 
> Date:   Fri Dec 20 17:09:04 2019 +0300
> 
>     virtio: make seg_max virtqueue size dependent
> 
> is also wrong. It should be:
> 
> { "vhost-scsi-device", "seg_max_adjust", "off"},
> 
> Am I right?

It's just called "vhost-scsi":

include/hw/virtio/vhost-scsi.h:#define TYPE_VHOST_SCSI "vhost-scsi"

> > 
> > On the other hand, do you want to do this for the vhost-user-blk,
> > vhost-user-scsi, and vhost-scsi devices that exist in qemu.git?  Those
> > devices would benefit from better performance too.
> It seems to be so. We also have a test checking those settings:
> tests/acceptance/virtio_seg_max_adjust.py
> For now it checks virtio-scsi-pci and virtio-blk-pci.
> I'm going to extend it to also check the virtqueue size.
> If I change vhost-user-blk, vhost-user-scsi and vhost-scsi it's worth
> checking those devices too. But I don't know how to form a command line
> for those 3 devices since they involve some third-party components as
> backends (kernel modules, DPDK, etc.) which seem to be unavailable in
> qemu git.
> Is there any way to do it with some stubs available in qemu.git or
> something else?
> If so, could you please point out the proper way to do it?

qemu.git has contrib/vhost-user-blk/ and contrib/vhost-user-scsi/ if
you need to test those vhost-user devices without external dependencies.

Stefan




[PATCH v4 1/1] qemu-img: Add --target-is-zero to convert

2020-02-05 Thread David Edmondson
In many cases the target of a convert operation is a newly provisioned
target that the user knows is blank (reads as zero). In this situation
there is no requirement for qemu-img to wastefully zero out the entire
device.

Add a new option, --target-is-zero, allowing the user to indicate that
an existing target device will return zeros for all reads.
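
For illustration, a typical invocation would be something like (image
names are placeholders):

    qemu-img convert -n --target-is-zero -O raw source.qcow2 target.raw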

Signed-off-by: David Edmondson 
---
 docs/interop/qemu-img.rst |  9 -
 qemu-img-cmds.hx  |  4 ++--
 qemu-img.c| 26 +++---
 3 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/docs/interop/qemu-img.rst b/docs/interop/qemu-img.rst
index fa27e5c7b453..763036857451 100644
--- a/docs/interop/qemu-img.rst
+++ b/docs/interop/qemu-img.rst
@@ -214,6 +214,13 @@ Parameters to convert subcommand:
   will still be printed.  Areas that cannot be read from the source will be
   treated as containing only zeroes.
 
+.. option:: --target-is-zero
+
+  Assume that reading the destination image will always return
+  zeros. This parameter is mutually exclusive with a destination image
+  that has a backing file. It is required to also use the ``-n``
+  parameter to skip image creation.
+
 Parameters to dd subcommand:
 
 .. program:: qemu-img-dd
@@ -366,7 +373,7 @@ Command description:
   4
 Error on reading data
 
-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O 
OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] 
[-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T 
SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] 
[-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] 
OUTPUT_FILENAME
 
   Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM*
   to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 3fd836ca9090..e6f98b75473f 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -39,9 +39,9 @@ SRST
 ERST
 
 DEF("convert", img_convert,
-"convert [--object objectdef] [--image-opts] [--target-image-opts] [-U] 
[-C] [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-T src_cache] [-O output_fmt] [-B 
backing_file] [-o options] [-l snapshot_param] [-S sparse_size] [-m 
num_coroutines] [-W] [--salvage] filename [filename2 [...]] output_filename")
+"convert [--object objectdef] [--image-opts] [--target-image-opts] 
[--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-T 
src_cache] [-O output_fmt] [-B backing_file] [-o options] [-l snapshot_param] 
[-S sparse_size] [-m num_coroutines] [-W] [--salvage] filename [filename2 
[...]] output_filename")
 SRST
-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O 
OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] 
[-m NUM_COROUTINES] [-W] [--salvage] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T 
SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] 
[-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] [--salvage] FILENAME [FILENAME2 
[...]] OUTPUT_FILENAME
 ERST
 
 DEF("create", img_create,
diff --git a/qemu-img.c b/qemu-img.c
index 2b4562b9d9f2..0faf2cd2f530 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -70,6 +70,7 @@ enum {
 OPTION_PREALLOCATION = 265,
 OPTION_SHRINK = 266,
 OPTION_SALVAGE = 267,
+OPTION_TARGET_IS_ZERO = 268,
 };
 
 typedef enum OutputFormat {
@@ -1984,10 +1985,9 @@ static int convert_do_copy(ImgConvertState *s)
 int64_t sector_num = 0;
 
 /* Check whether we have zero initialisation or can get it efficiently */
-if (s->target_is_new && s->min_sparse && !s->target_has_backing) {
+if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
+!s->target_has_backing) {
 s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
-} else {
-s->has_zero_init = false;
 }
 
 if (!s->has_zero_init && !s->target_has_backing &&
@@ -2086,6 +2086,7 @@ static int img_convert(int argc, char **argv)
 {"force-share", no_argument, 0, 'U'},
 {"target-image-opts", no_argument, 0, OPTION_TARGET_IMAGE_OPTS},
 {"salvage", no_argument, 0, OPTION_SALVAGE},
+{"target-is-zero", no_argument, 0, OPTION_TARGET_IS_ZERO},
 {0, 0, 0, 0}
 };
 c = getopt_long(argc, argv, ":hf:O:B:Cco:l:S:pt:T:qnm:WU",
@@ -2209,6 +2210,14 @@ static int img_convert(int argc, char **argv)
 case OPTION_TARGET_IMAGE_OPTS:
 tgt_image_opts = true;

[PATCH v4 0/1] qemu-img: Add --target-is-zero to indicate that a target is blank

2020-02-05 Thread David Edmondson


qemu-img: Add --target-is-zero to indicate that a target is blank

v4:
- Wording in the doc and error message.

v3:
- Merge with the rST docs.
- No more need to fix @var -> @code.

v2:
- Remove target_is_zero, preferring to set has_zero_init
  directly (Mark Kanda).
- Disallow --target-is-zero in the presence of a backing file (Max
  Reitz).
- Add relevant documentation (Max Reitz).
- @var -> @code for options in qemu-img.texi.


David Edmondson (1):
  qemu-img: Add --target-is-zero to convert

 docs/interop/qemu-img.rst |  9 -
 qemu-img-cmds.hx  |  4 ++--
 qemu-img.c| 26 +++---
 3 files changed, 33 insertions(+), 6 deletions(-)

-- 
2.24.1




Re: [PATCH 02/13] qcrypto-luks: implement encryption key management

2020-02-05 Thread Kevin Wolf
Am 05.02.2020 um 11:03 hat Markus Armbruster geschrieben:
> Kevin Wolf  writes:
> 
> > Am 05.02.2020 um 09:24 hat Markus Armbruster geschrieben:
> >> Daniel, Kevin, any comments or objections to the QAPI schema design
> >> sketch developed below?
> >> 
> >> For your convenience, here's the result again:
> >> 
> >> { 'enum': 'LUKSKeyslotState',
> >>   'data': [ 'active', 'inactive' ] }
> >> { 'struct': 'LUKSKeyslotActive',
> >>   'data': { 'secret': 'str',
> >> '*iter-time': 'int } }
> >> { 'union': 'LUKSKeyslotAmend',
> >>   'base': { '*keyslot': 'int',
> >> 'state': 'LUKSKeyslotState' }
> >>   'discriminator': 'state',
> >>   'data': { 'active': 'LUKSKeyslotActive' } }
> >
> > I think one of the requirements was that you can specify the keyslot not
> > only by using its number, but also by specifying the old secret.
> 
> Quoting myself:
> 
>   When we don't specify the slot#, then "new state active" selects an
>   inactive slot (chosen by the system), and "new state inactive" selects
>   slots by secret (commonly just one slot).
> 
> This takes care of selecting (active) slots by old secret with "new
> state inactive".

"new secret inactive" can't select a slot by secret because 'secret'
doesn't even exist for inactive.

> I intentionally did not provide for selecting (active) slots by old
> secret with "new state active", because that's unsafe update in place.
> 
> We want to update secrets, of course.  But the safe way to do that is to
> put the new secret into a free slot, and if that succeeds, deactivate
> the old secret.  If deactivation fails, you're left with both old and
> new secret, which beats being left with no secret when update in place
> fails.

Right. I wonder if qemu-img wants support for that specifically
(possibly with allowing to enter the key interactively) rather than
requiring the user to call qemu-img amend twice.

> >  Trivial
> > extension, you just get another optional field that can be specified
> > instead of 'keyslot'.
> >
> > Resulting commands:
> >
> > Adding a key:
> > qemu-img amend -o 
> > encrypt.keys.0.state=active,encrypt.keys.0.secret=sec0 test.qcow2
> 
> This activates an inactive slot chosen by the system.
> 
> You can activate a specific keyslot N by throwing in
> encrypt.keys.0.keyslot=N.

Yes. The usual case is that you just want to add a new key somwhere.

> > Deleting a key:
> > qemu-img amend -o 
> > encrypt.keys.0.state=inactive,encrypt.keys.0.keyslot=2 test.qcow2
> 
> This deactivates keyslot#2.
> 
> You can deactivate slots holding a specific secret S by replacing
> encrypt.keys.0.keyslot=2 by encrypt.keys.0.secret=S.

Not with your definition above, but with the appropriate changes, this
makes sense.

> > Previous version (if this series is applied unchanged):
> >
> > Adding a key:
> > qemu-img amend -o encrypt.keys.0.new-secret=sec0 test.qcow2
> >
> > Deleting a key:
> > qemu-img amend -o encrypt.keys.0.new-secret=,encrypt.keys.0.keyslot=2 
> > test.qcow2
> >
> > Adding a key gets more complicated with your proposed interface because
> > state must be set explicitly now whereas before it was derived
> > automatically from the fact that if you give a key, only active makes
> > sense.
> 
> The explicitness could be viewed as an improvement :)

Not really. I mean, I really do appreciate the advantages of
-blockdev where needed, but usually I don't want to type all that stuff
for the most common tasks. qemu-img amend is similar.

For deleting, I might actually agree that explicitness is an
improvement, but for creating it's just unnecessary verbosity.

> If you'd prefer implicit here: Max has patches for making union tags
> optional with a default.  They'd let you default active to true.

I guess this would improve the usability in this case.

Kevin




Re: [PATCH 02/13] qcrypto-luks: implement encryption key management

2020-02-05 Thread Daniel P . Berrangé
On Wed, Feb 05, 2020 at 10:30:11AM +0100, Kevin Wolf wrote:
> Am 05.02.2020 um 09:24 hat Markus Armbruster geschrieben:
> > Daniel, Kevin, any comments or objections to the QAPI schema design
> > sketch developed below?
> > 
> > For your convenience, here's the result again:
> > 
> > { 'enum': 'LUKSKeyslotState',
> >   'data': [ 'active', 'inactive' ] }
> > { 'struct': 'LUKSKeyslotActive',
> >   'data': { 'secret': 'str',
> > '*iter-time': 'int } }
> > { 'union': 'LUKSKeyslotAmend',
> >   'base': { '*keyslot': 'int',
> > 'state': 'LUKSKeyslotState' }
> >   'discriminator': 'state',
> >   'data': { 'active': 'LUKSKeyslotActive' } }

We need 'secret' in the 'inactive' case too
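
For example, something like this (just a sketch extending the design
above, not code from the series):

{ 'struct': 'LUKSKeyslotInactive',
  'data': { '*secret': 'str' } }
{ 'union': 'LUKSKeyslotAmend',
  'base': { '*keyslot': 'int',
            'state': 'LUKSKeyslotState' },
  'discriminator': 'state',
  'data': { 'active': 'LUKSKeyslotActive',
            'inactive': 'LUKSKeyslotInactive' } }

so that a slot to deactivate can be selected either by 'keyslot' or by
its old 'secret'.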

> 
> I think one of the requirements was that you can specify the keyslot not
> only by using its number, but also by specifying the old secret. Trivial
> extension, you just get another optional field that can be specified
> instead of 'keyslot'.
> 
> Resulting commands:
> 
> Adding a key:
> qemu-img amend -o encrypt.keys.0.state=active,encrypt.keys.0.secret=sec0 
> test.qcow2
> 
> Deleting a key:
> qemu-img amend -o encrypt.keys.0.state=inactive,encrypt.keys.0.keyslot=2 
> test.qcow2

I think this is good as a design.

Expanding the examples to cover all scenarios we've discussed


  - Activating a new keyslot, auto-picking slot

 qemu-img amend -o encrypt.keys.0.state=active,\
   encrypt.keys.0.secret=sec0 \
test.qcow2

Must raise an error if no free slots


  - Activating a new keyslot, picking a specific slot

 qemu-img amend -o encrypt.keys.0.state=active,\
   encrypt.keys.0.secret=sec0 \
   encrypt.keys.0.keyslot=3 \
test.qcow2

Must raise an error if slot is already active


  - Deactivating a old keyslot, auto-picking slot(s) from existing password

 qemu-img amend -o encrypt.keys.0.state=inactive,\
   encrypt.keys.0.secret=sec0 \
test.qcow2

Must raise an error if this would leave zero keyslots
after processing.


  - Deactivating a old keyslot, picking a specific slot

 qemu-img amend -o encrypt.keys.0.state=inactive,\
   encrypt.keys.0.keyslot=2 \
test.qcow2

Always succeeds even if zero keyslots left.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH 2/3] m25p80: Improve command handling for Jedec and unsupported commands

2020-02-05 Thread Cédric Le Goater
On 2/4/20 3:28 PM, Guenter Roeck wrote:
> On 2/4/20 12:53 AM, Cédric Le Goater wrote:
>> On 2/3/20 7:09 PM, Guenter Roeck wrote:
>>> Always report 6 bytes of JEDEC data. Fill remaining data with 0.
>>>
>>> For unsupported commands, keep sending a value of 0 until the chip
>>> is deselected.
>>>
>>> Both changes avoid attempts to decode random commands. Up to now this
>>> happened if the reported Jedec data was shorter than 6 bytes but the
>>> host read 6 bytes, and with all unsupported commands.
>>
>> Do you have a concrete example for that? Machine and flash model.
>>
> 
> I noticed it while tracking down the bug fixed in patch 3 of the series,
> ie with AST2500 evb using w25q256 flash, but it happens with all machines
> using SPI NOR flash (ie all aspeed bmc machines) when running the Linux
> kernel. Most of the time it doesn't cause harm, unless the host sends
> an "address" as part of an unsupported command which happens to include
> a valid command code.

ok. we will need to model SFDP one day.

Are you using the OpenBMC images or do you have your own ? 

> 
>>> Signed-off-by: Guenter Roeck 
>>> ---
>>>   hw/block/m25p80.c | 10 +-
>>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
>>> index 63e050d7d3..aca75edcc1 100644
>>> --- a/hw/block/m25p80.c
>>> +++ b/hw/block/m25p80.c
>>> @@ -1040,8 +1040,11 @@ static void decode_new_cmd(Flash *s, uint32_t value)
>>>   for (i = 0; i < s->pi->id_len; i++) {
>>>   s->data[i] = s->pi->id[i];
>>>   }
>>> +    for (; i < SPI_NOR_MAX_ID_LEN; i++) {
>>> +    s->data[i] = 0;
>>> +    }
>>
>> It seems that data should be reset in m25p80_cs() also.
>>
> Are you sure ?
> 
> The current implementation sets s->data[] as needed when command decode
> is complete. That seems less costly to me.

ok.
Reviewed-by: Cédric Le Goater 


Thanks,

C.
 
> Thanks,
> Guenter
> 
>>>   -    s->len = s->pi->id_len;
>>> +    s->len = SPI_NOR_MAX_ID_LEN;
>>>   s->pos = 0;
>>>   s->state = STATE_READING_DATA;
>>>   break;
>>> @@ -1158,6 +1161,11 @@ static void decode_new_cmd(Flash *s, uint32_t value)
>>>   s->quad_enable = false;
>>>   break;
>>>   default:
>>> +    s->pos = 0;
>>> +    s->len = 1;
>>> +    s->state = STATE_READING_DATA;
>>> +    s->data_read_loop = true;
>>> +    s->data[0] = 0;
>>>   qemu_log_mask(LOG_GUEST_ERROR, "M25P80: Unknown cmd %x\n", value);
>>>   break;
>>>   }
>>>
>>
> 




Re: [PATCH 1/3] m25p80: Convert to support tracing

2020-02-05 Thread Cédric Le Goater
On 2/3/20 7:09 PM, Guenter Roeck wrote:
> While at it, add some trace messages to help debug problems
> seen when running the latest Linux kernel.

In case you resend, it would be nice to print out a flash id in the tracing;
otherwise I have a patch for it on top of yours. It helped me track a squashfs
corruption on the Aspeed witherspoon-bmc machine which we had been chasing
since 2017 or so. It seems to be a kernel bug.
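
Something along these lines, just as a sketch (the extra trace event
name and its format string are made up for illustration), in the
JEDEC_READ case of decode_new_cmd():

    case JEDEC_READ:
        trace_m25p80_populated_jedec();
        /* Hypothetical extra event carrying the first JEDEC id bytes so
         * the flash model can be identified from the trace log. */
        trace_m25p80_jedec_id(s->pi->id[0], s->pi->id[1], s->pi->id[2]);

with a matching (again hypothetical) line in hw/block/trace-events:

    m25p80_jedec_id(uint8_t b0, uint8_t b1, uint8_t b2) "JEDEC id 0x%02x 0x%02x 0x%02x"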

Thanks,

C. 

> Signed-off-by: Guenter Roeck 
> ---
>  hw/block/m25p80.c | 48 ---
>  hw/block/trace-events | 16 +++
>  2 files changed, 38 insertions(+), 26 deletions(-)
> 
> diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
> index 11ff5b9ad7..63e050d7d3 100644
> --- a/hw/block/m25p80.c
> +++ b/hw/block/m25p80.c
> @@ -32,17 +32,7 @@
>  #include "qemu/module.h"
>  #include "qemu/error-report.h"
>  #include "qapi/error.h"
> -
> -#ifndef M25P80_ERR_DEBUG
> -#define M25P80_ERR_DEBUG 0
> -#endif
> -
> -#define DB_PRINT_L(level, ...) do { \
> -if (M25P80_ERR_DEBUG > (level)) { \
> -fprintf(stderr,  ": %s: ", __func__); \
> -fprintf(stderr, ## __VA_ARGS__); \
> -} \
> -} while (0)
> +#include "trace.h"
>  
>  /* Fields for FlashPartInfo->flags */
>  
> @@ -574,7 +564,8 @@ static void flash_erase(Flash *s, int offset, FlashCMD 
> cmd)
>  abort();
>  }
>  
> -DB_PRINT_L(0, "offset = %#x, len = %d\n", offset, len);
> +trace_m25p80_flash_erase(offset, len);
> +
>  if ((s->pi->flags & capa_to_assert) != capa_to_assert) {
>  qemu_log_mask(LOG_GUEST_ERROR, "M25P80: %d erase size not supported 
> by"
>" device\n", len);
> @@ -607,8 +598,7 @@ void flash_write8(Flash *s, uint32_t addr, uint8_t data)
>  }
>  
>  if ((prev ^ data) & data) {
> -DB_PRINT_L(1, "programming zero to one! addr=%" PRIx32 "  %" PRIx8
> -   " -> %" PRIx8 "\n", addr, prev, data);
> +trace_m25p80_programming_zero_to_one(addr, prev, data);
>  }
>  
>  if (s->pi->flags & EEPROM) {
> @@ -662,6 +652,9 @@ static void complete_collecting_data(Flash *s)
>  
>  s->state = STATE_IDLE;
>  
> +trace_m25p80_complete_collecting(s->cmd_in_progress, n, s->ear,
> + s->cur_addr);
> +
>  switch (s->cmd_in_progress) {
>  case DPP:
>  case QPP:
> @@ -825,7 +818,7 @@ static void reset_memory(Flash *s)
>  break;
>  }
>  
> -DB_PRINT_L(0, "Reset done.\n");
> +trace_m25p80_reset_done();
>  }
>  
>  static void decode_fast_read_cmd(Flash *s)
> @@ -941,9 +934,10 @@ static void decode_qio_read_cmd(Flash *s)
>  
>  static void decode_new_cmd(Flash *s, uint32_t value)
>  {
> -s->cmd_in_progress = value;
>  int i;
> -DB_PRINT_L(0, "decoded new command:%x\n", value);
> +
> +s->cmd_in_progress = value;
> +trace_m25p80_command_decoded(value);
>  
>  if (value != RESET_MEMORY) {
>  s->reset_enable = false;
> @@ -1042,7 +1036,7 @@ static void decode_new_cmd(Flash *s, uint32_t value)
>  break;
>  
>  case JEDEC_READ:
> -DB_PRINT_L(0, "populated jedec code\n");
> +trace_m25p80_populated_jedec();
>  for (i = 0; i < s->pi->id_len; i++) {
>  s->data[i] = s->pi->id[i];
>  }
> @@ -1063,7 +1057,7 @@ static void decode_new_cmd(Flash *s, uint32_t value)
>  case BULK_ERASE_60:
>  case BULK_ERASE:
>  if (s->write_enable) {
> -DB_PRINT_L(0, "chip erase\n");
> +trace_m25p80_chip_erase();
>  flash_erase(s, 0, BULK_ERASE);
>  } else {
>  qemu_log_mask(LOG_GUEST_ERROR, "M25P80: chip erase with write "
> @@ -1184,7 +1178,7 @@ static int m25p80_cs(SSISlave *ss, bool select)
>  s->data_read_loop = false;
>  }
>  
> -DB_PRINT_L(0, "%sselect\n", select ? "de" : "");
> +trace_m25p80_select(select ? "de" : "");
>  
>  return 0;
>  }
> @@ -1194,19 +1188,20 @@ static uint32_t m25p80_transfer8(SSISlave *ss, 
> uint32_t tx)
>  Flash *s = M25P80(ss);
>  uint32_t r = 0;
>  
> +trace_m25p80_transfer(s->state, s->len, s->needed_bytes, s->pos,
> +  s->cur_addr, (uint8_t)tx);
> +
>  switch (s->state) {
>  
>  case STATE_PAGE_PROGRAM:
> -DB_PRINT_L(1, "page program cur_addr=%#" PRIx32 " data=%" PRIx8 "\n",
> -   s->cur_addr, (uint8_t)tx);
> +trace_m25p80_page_program(s->cur_addr, (uint8_t)tx);
>  flash_write8(s, s->cur_addr, (uint8_t)tx);
>  s->cur_addr = (s->cur_addr + 1) & (s->size - 1);
>  break;
>  
>  case STATE_READ:
>  r = s->storage[s->cur_addr];
> -DB_PRINT_L(1, "READ 0x%" PRIx32 "=%" PRIx8 "\n", s->cur_addr,
> -   (uint8_t)r);
> +trace_m25p80_read_byte(s->cur_addr, (uint8_t)r);
>  s->cur_addr = (s->cur_addr + 1) & (s->size - 1);
>  break;
>  
> @@ -1244,6 +1239,7 @@ static uint32_t m25p80_transfer8(SSISlave *ss

Re: [PATCH 02/13] qcrypto-luks: implement encryption key management

2020-02-05 Thread Markus Armbruster
Kevin Wolf  writes:

> Am 05.02.2020 um 09:24 hat Markus Armbruster geschrieben:
>> Daniel, Kevin, any comments or objections to the QAPI schema design
>> sketch developed below?
>> 
>> For your convenience, here's the result again:
>> 
>> { 'enum': 'LUKSKeyslotState',
>>   'data': [ 'active', 'inactive' ] }
>> { 'struct': 'LUKSKeyslotActive',
>>   'data': { 'secret': 'str',
>> '*iter-time': 'int } }
>> { 'union': 'LUKSKeyslotAmend',
>>   'base': { '*keyslot': 'int',
>> 'state': 'LUKSKeyslotState' }
>>   'discriminator': 'state',
>>   'data': { 'active': 'LUKSKeyslotActive' } }
>
> I think one of the requirements was that you can specify the keyslot not
> only by using its number, but also by specifying the old secret.

Quoting myself:

  When we don't specify the slot#, then "new state active" selects an
  inactive slot (chosen by the system), and "new state inactive" selects
  slots by secret (commonly just one slot).

This takes care of selecting (active) slots by old secret with "new
state inactive".

I intentionally did not provide for selecting (active) slots by old
secret with "new state active", because that's unsafe update in place.

We want to update secrets, of course.  But the safe way to do that is to
put the new secret into a free slot, and if that succeeds, deactivate
the old secret.  If deactivation fails, you're left with both old and
new secret, which beats being left with no secret when update in place
fails.
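
Spelled out with the qemu-img option syntax used elsewhere in this
thread (secret names are placeholders), the safe rotation would be two
separate invocations, roughly:

    qemu-img amend -o encrypt.keys.0.state=active,encrypt.keys.0.secret=sec-new test.qcow2
    qemu-img amend -o encrypt.keys.0.state=inactive,encrypt.keys.0.secret=sec-old test.qcow2

i.e. the new secret is in place before anything deactivates the old one.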

>  Trivial
> extension, you just get another optional field that can be specified
> instead of 'keyslot'.
>
> Resulting commands:
>
> Adding a key:
> qemu-img amend -o encrypt.keys.0.state=active,encrypt.keys.0.secret=sec0 
> test.qcow2

This activates an inactive slot chosen by the system.

You can activate a specific keyslot N by throwing in
encrypt.keys.0.keyslot=N.

> Deleting a key:
> qemu-img amend -o encrypt.keys.0.state=inactive,encrypt.keys.0.keyslot=2 
> test.qcow2

This deactivates keyslot#2.

You can deactivate slots holding a specific secret S by replacing
encrypt.keys.0.keyslot=2 by encrypt.keys.0.secret=S.

> Previous version (if this series is applied unchanged):
>
> Adding a key:
> qemu-img amend -o encrypt.keys.0.new-secret=sec0 test.qcow2
>
> Deleting a key:
> qemu-img amend -o encrypt.keys.0.new-secret=,encrypt.keys.0.keyslot=2 
> test.qcow2
>
> Adding a key gets more complicated with your proposed interface because
> state must be set explicitly now whereas before it was derived
> automatically from the fact that if you give a key, only active makes
> sense.

The explicitness could be viewed as an improvement :)

If you'd prefer implicit here: Max has patches for making union tags
optional with a default.  They'd let you default active to true.

> Deleting stays more or less the same, you just change the state instead
> of clearing the secret.

Thanks for your input!




Re: [PATCH 02/13] qcrypto-luks: implement encryption key management

2020-02-05 Thread Kevin Wolf
Am 05.02.2020 um 09:24 hat Markus Armbruster geschrieben:
> Daniel, Kevin, any comments or objections to the QAPI schema design
> sketch developed below?
> 
> For your convenience, here's the result again:
> 
> { 'enum': 'LUKSKeyslotState',
>   'data': [ 'active', 'inactive' ] }
> { 'struct': 'LUKSKeyslotActive',
>   'data': { 'secret': 'str',
> '*iter-time': 'int } }
> { 'union': 'LUKSKeyslotAmend',
>   'base': { '*keyslot': 'int',
> 'state': 'LUKSKeyslotState' }
>   'discriminator': 'state',
>   'data': { 'active': 'LUKSKeyslotActive' } }

I think one of the requirements was that you can specify the keyslot not
only by using its number, but also by specifying the old secret. Trivial
extension, you just get another optional field that can be specified
instead of 'keyslot'.

Resulting commands:

Adding a key:
qemu-img amend -o encrypt.keys.0.state=active,encrypt.keys.0.secret=sec0 
test.qcow2

Deleting a key:
qemu-img amend -o encrypt.keys.0.state=inactive,encrypt.keys.0.keyslot=2 
test.qcow2

Previous version (if this series is applied unchanged):

Adding a key:
qemu-img amend -o encrypt.keys.0.new-secret=sec0 test.qcow2

Deleting a key:
qemu-img amend -o encrypt.keys.0.new-secret=,encrypt.keys.0.keyslot=2 
test.qcow2

Adding a key gets more complicated with your proposed interface because
state must be set explicitly now whereas before it was derived
automatically from the fact that if you give a key, only active makes
sense.

Deleting stays more or less the same, you just change the state instead
of clearing the secret.

Kevin

> Markus Armbruster  writes:
> 
> [...]
> > A keyslot can be either inactive or active.
> >
> > Let's start low-level, i.e. we specify the slot by slot#:
> >
> > state   new state   action
> > inactiveinactivenop
> > inactiveactive  put secret, iter-time, mark active
> > active  inactivemark inactive (effectively deletes secret)
> > active  active  in general, error (unsafe update in place)
> > we can make it a nop when secret, iter-time
> > remain unchanged
> > we can allow unsafe update with force: true
> >
> > As struct:
> >
> > { 'struct': 'LUKSKeyslotUpdate',
> >   'data': { 'active': 'bool',   # could do enum instead
> > 'keyslot': 'int',
> > '*secret': 'str',   # present if @active is true
> > '*iter-time': 'int' } } # absent if @active is false
> >
> > As union:
> >
> > { 'enum': 'LUKSKeyslotState',
> >   'data': [ 'active', 'inactive' ] }
> > { 'struct': 'LUKSKeyslotActive',
> >   'data': { 'secret': 'str',
> > '*iter-time': 'int } }
> > { 'union': 'LUKSKeyslotAmend',
> >   'base': { 'state': 'LUKSKeyslotState' }   # must do enum
> >   'discriminator': 'state',
> >   'data': { 'active': 'LUKSKeyslotActive' } }
> >
> > When we don't specify the slot#, then "new state active" selects an
> > inactive slot (chosen by the system), and "new state inactive" selects
> > slots by secret (commonly just one slot).
> >
> > New state active:
> >
> > state   new state   action
> > inactiveactive  put secret, iter-time, mark active
> > active  active  N/A (system choses inactive slot)
> >
> > New state inactive, for each slot holding the specified secret:
> >
> > state   new state   action
> > inactiveinactiveN/A (inactive slot holds no secret)
> > active  inactivemark inactive (effectively deletes secret)
> >
> > As struct:
> >
> > { 'struct': 'LUKSKeyslotUpdate',
> >   'data': { 'active': 'bool',   # could do enum instead
> > '*keyslot': 'int',
> > '*secret': 'str',   # present if @active is true
> > '*iter-time': 'int' } } # absent if @active is false
> >
> > As union:
> >
> > { 'enum': 'LUKSKeyslotState',
> >   'data': [ 'active', 'inactive' ] }
> > { 'struct': 'LUKSKeyslotActive',
> >   'data': { 'secret': 'str',
> > '*iter-time': 'int } }
> > { 'union': 'LUKSKeyslotAmend',
> >   'base': { '*keyslot': 'int',
> > 'state': 'LUKSKeyslotState' }
> >   'discriminator': 'state',
> >   'data': { 'active': 'LUKSKeyslotActive' } }
> >
> > Union looks more complicated because our union notation sucks[*].  I
> > like it anyway, because you don't have to explain when which optional
> > members aren't actually optional.
> >
> > Regardless of struct vs. union, this supports an active -> active
> > transition only with an explicit keyslot.  Feels fine to me.  If we want
> > to support it without keyslot as well, we need a way to specify both old
> > and new secret.  Do we?
> >
> >
> > [*] I hope to fix that one day.  It's not even hard.




Re: [PATCH 00/17] Improve qcow2 all-zero detection

2020-02-05 Thread Vladimir Sementsov-Ogievskiy

05.02.2020 12:04, Vladimir Sementsov-Ogievskiy wrote:

31.01.2020 20:44, Eric Blake wrote:

Based-on: <20200124103458.1525982-2-david.edmond...@oracle.com>
([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)

I'm working on adding an NBD extension that reports whether an image
is already all zero when the client first connects.  I initially
thought I could write the NBD code to just call bdrv_has_zero_init(),
but that turned out to be a bad assumption that instead resulted in
this patch series.  The NBD patch will come later (and cross-posted to
the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
repositories).

I do have an RFC question on patch 13 - as implemented here, I set a
qcow2 bit if the image has all clusters known zero and no backing
image.  But it may be more useful to instead report whether all
clusters _allocated in this layer_ are zero, at which point the
overall image is all-zero only if the backing file also has that
property (or even make it two bits).  The tweaks to subsequent patches
based on what we think makes the most useful semantics shouldn't be
hard.

[repo.or.cz appears to be down as I type this; I'll post a link to a
repository later when it comes back up]



I have several ideas around it.

1. For generic block layer.
Did you consider, as an alternative to BDRV_ZERO_OPEN, exporting the
information through normal block_status? So, if we have the
information that the disk is all-zero, we can always add the _ZERO
flag to the block-status result. And in a generic bdrv_is_all_zeroes(),
we can just call block_status(0, disk_size), which will return
ZERO and n=disk_size if the driver supports the all-zero feature and is
all-zero now.
I think block-status is a natural way to convey such information, and I
think that we want to get to 64bit block-status support for qcow2 and
nbd anyway.

2. For NBD
Again, a possible alternative is BLOCK_STATUS, but we need 64bit
commands for it. I plan to send a proposal anyway. Still, there is
nothing bad in having two possible paths for receiving the all-zero
information. And even with your NBD extension, we can export this
information through block-status [1].

3. For qcow2
Hmm. Here, as I understand, the main case is a freshly created qcow2,
which is fully unallocated. To understand that it is empty, we
only need to check all L1 entries. And for an empty L1 table this is fast.
So we don't need any qcow2 format improvement to check it.



Ah yes, I forgot about the preallocated case. Hmm. For preallocated clusters,
we have zero bits in the L2 entries. And with them, we don't even need
preallocated clusters to be filled with zeros, as we never read them (we just
return zeros on read).

Then maybe we want a similar flag for L1 entries (this would enable large,
fast write-zeroes). And maybe we want a flag which marks the whole image
as read-zero (that's your flag). So now I think my previous idea
of "all allocated is zero" is worse: for fully-preallocated images
we are sure that all clusters are allocated, and it is more natural to
have flags similar to the ZERO bit in L2 entries.


--
Best regards,
Vladimir



Re: [PATCH 00/17] Improve qcow2 all-zero detection

2020-02-05 Thread Vladimir Sementsov-Ogievskiy

31.01.2020 20:44, Eric Blake wrote:

Based-on: <20200124103458.1525982-2-david.edmond...@oracle.com>
([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)

I'm working on adding an NBD extension that reports whether an image
is already all zero when the client first connects.  I initially
thought I could write the NBD code to just call bdrv_has_zero_init(),
but that turned out to be a bad assumption that instead resulted in
this patch series.  The NBD patch will come later (and cross-posted to
the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
repositories).

I do have an RFC question on patch 13 - as implemented here, I set a
qcow2 bit if the image has all clusters known zero and no backing
image.  But it may be more useful to instead report whether all
clusters _allocated in this layer_ are zero, at which point the
overall image is all-zero only if the backing file also has that
property (or even make it two bits).  The tweaks to subsequent patches
based on what we think makes the most useful semantics shouldn't be
hard.

[repo.or.cz appears to be down as I type this; I'll post a link to a
repository later when it comes back up]



I have several ideas around it.

1. For generic block layer.
Did you consider, as an alternative to BDRV_ZERO_OPEN, exporting the
information through normal block_status? So, if we have the
information that the disk is all-zero, we can always add the _ZERO
flag to the block-status result. And in a generic bdrv_is_all_zeroes(),
we can just call block_status(0, disk_size), which will return
ZERO and n=disk_size if the driver supports the all-zero feature and is
all-zero now. (A minimal sketch of such a helper follows after this list.)
I think block-status is a natural way to convey such information, and I
think that we want to get to 64bit block-status support for qcow2 and
nbd anyway.

2. For NBD
Again, a possible alternative is BLOCK_STATUS, but we need 64bit
commands for it. I plan to send a proposal anyway. Still, there is
nothing bad in having two possible paths for receiving the all-zero
information. And even with your NBD extension, we can export this
information through block-status [1].

3. For qcow2
Hmm. Here, as I understand, the main case is a freshly created qcow2,
which is fully unallocated. To understand that it is empty, we
only need to check all L1 entries. And for an empty L1 table this is fast.
So we don't need any qcow2 format improvement to check it.
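
To illustrate item 1, a minimal sketch of such a generic helper
(assuming the existing bdrv_block_status()/BDRV_BLOCK_ZERO interface;
the helper itself is illustrative, not taken from the series):

    /* Sketch: answer "does the whole image currently read as zeroes?"
     * purely via block status.  Relies on the driver returning
     * BDRV_BLOCK_ZERO with *pnum covering the whole device when it
     * knows the image is all-zero. */
    static bool bdrv_is_all_zeroes(BlockDriverState *bs)
    {
        int64_t size = bdrv_getlength(bs);
        int64_t pnum;
        int ret;

        if (size < 0) {
            return false;
        }
        ret = bdrv_block_status(bs, 0, size, &pnum, NULL, NULL);
        return ret >= 0 && (ret & BDRV_BLOCK_ZERO) && pnum == size;
    }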




--
Best regards,
Vladimir



Re: [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag

2020-02-05 Thread Vladimir Sementsov-Ogievskiy

04.02.2020 20:50, Eric Blake wrote:

On 2/4/20 11:34 AM, Max Reitz wrote:


+++ b/include/block/block.h
@@ -105,6 +105,16 @@ typedef enum {
   * for drivers that set .bdrv_co_truncate.
   */
  BDRV_ZERO_TRUNCATE  = 0x2,
+
+    /*
+ * bdrv_known_zeroes() should include this bit if an image is
+ * known to read as all zeroes when first opened; this bit should
+ * not be relied on after any writes to the image.


Is there a good reason for this?  Because to me this screams like we are
going to check this flag without ensuring that the image has actually
not been written to yet.  So if it’s generally easy for drivers to stop
reporting this flag after a write, then maybe we should do so.


In patch 15 (implementing things in qcow2), I actually wrote the driver to 
return live results, rather than just open-time results, in part because 
writing the bit to persistent storage in qcow2 means that the bit must be 
accurate, without relying on the block layer's help.

But my pending NBD patch (not posted yet, but will be soon), the proposal I'm 
making for the NBD protocol itself is just open-time, not live, and so it would 
be more work than necessary to make the NBD driver report live results.

But it seems like it should be easy enough to also patch the block layer itself 
to guarantee that callers of bdrv_known_zeroes() cannot see this bit set if the 
block layer has been used in any non-zero transaction, by repeating the same 
logic as used in qcow2 to kill the bit (any write/write_compressed/bdrv_copy 
clear the bit, any trim clears the bit if the driver does not guarantee trim 
reads as zero, any truncate clears the bit if the driver does not guarantee 
truncate reads as zero, etc). Basically, the block layer would cache the 
results of .bdrv_known_zeroes during .bdrv_co_open, bdrv_co_pwrite() and 
friends would update that cache, and bdrv_known_zeroes() would report the 
cached value rather than a fresh call to .bdrv_known_zeroes.
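
As a self-contained toy model of that caching (not QEMU code; all names
here are made up just to make the idea concrete):

    #include <stdbool.h>

    typedef struct ToyBDS {
        bool zero_at_open;            /* cached answer from open time */
    } ToyBDS;

    static void toy_open(ToyBDS *bs, bool driver_says_zero)
    {
        /* cache the driver's open-time answer once */
        bs->zero_at_open = driver_says_zero;
    }

    static void toy_pwrite(ToyBDS *bs)
    {
        /* any data-modifying request invalidates the cached bit */
        bs->zero_at_open = false;
    }

    static bool toy_known_zero(const ToyBDS *bs)
    {
        /* report the cached value instead of asking the driver again */
        return bs->zero_at_open;
    }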

Are we worried enough about clients of this interface to make the block layer 
more robust?  (From the maintenance standpoint, the more the block layer 
guarantees, the easier it is to write code that uses the block layer; but there 
is the counter-argument that making the block layer track whether an image has 
been modified means a [slight] penalty to every write request to update the 
boolean.)



I'm in favor of a function like is_all_zero() rather than
is_it_was_all_zeros_when_opened(). I never liked places in the code where
is_zero_init() is used as if it were is_disk_zero(), without any check that
the drive was not modified, or was even created by us.

--
Best regards,
Vladimir



Re: [PATCH 02/13] qcrypto-luks: implement encryption key management

2020-02-05 Thread Markus Armbruster
Daniel, Kevin, any comments or objections to the QAPI schema design
sketch developed below?

For your convenience, here's the result again:

{ 'enum': 'LUKSKeyslotState',
  'data': [ 'active', 'inactive' ] }
{ 'struct': 'LUKSKeyslotActive',
  'data': { 'secret': 'str',
'*iter-time': 'int } }
{ 'union': 'LUKSKeyslotAmend',
  'base': { '*keyslot': 'int',
'state': 'LUKSKeyslotState' }
  'discriminator': 'state',
  'data': { 'active': 'LUKSKeyslotActive' } }

Markus Armbruster  writes:

[...]
> A keyslot can be either inactive or active.
>
> Let's start low-level, i.e. we specify the slot by slot#:
>
> state   new state   action
> inactiveinactivenop
> inactiveactive  put secret, iter-time, mark active
> active  inactivemark inactive (effectively deletes secret)
> active  active  in general, error (unsafe update in place)
> we can make it a nop when secret, iter-time
> remain unchanged
> we can allow unsafe update with force: true
>
> As struct:
>
> { 'struct': 'LUKSKeyslotUpdate',
>   'data': { 'active': 'bool',   # could do enum instead
> 'keyslot': 'int',
> '*secret': 'str',   # present if @active is true
> '*iter-time': 'int' } } # absent if @active is false
>
> As union:
>
> { 'enum': 'LUKSKeyslotState',
>   'data': [ 'active', 'inactive' ] }
> { 'struct': 'LUKSKeyslotActive',
>   'data': { 'secret': 'str',
> '*iter-time': 'int } }
> { 'union': 'LUKSKeyslotAmend',
>   'base': { 'state': 'LUKSKeyslotState' }   # must do enum
>   'discriminator': 'state',
>   'data': { 'active': 'LUKSKeyslotActive' } }
>
> When we don't specify the slot#, then "new state active" selects an
> inactive slot (chosen by the system), and "new state inactive" selects
> slots by secret (commonly just one slot).
>
> New state active:
>
> state   new state   action
> inactiveactive  put secret, iter-time, mark active
> active  active  N/A (system choses inactive slot)
>
> New state inactive, for each slot holding the specified secret:
>
> state   new state   action
> inactiveinactiveN/A (inactive slot holds no secret)
> active  inactivemark inactive (effectively deletes secret)
>
> As struct:
>
> { 'struct': 'LUKSKeyslotUpdate',
>   'data': { 'active': 'bool',   # could do enum instead
> '*keyslot': 'int',
> '*secret': 'str',   # present if @active is true
> '*iter-time': 'int' } } # absent if @active is false
>
> As union:
>
> { 'enum': 'LUKSKeyslotState',
>   'data': [ 'active', 'inactive' ] }
> { 'struct': 'LUKSKeyslotActive',
>   'data': { 'secret': 'str',
> '*iter-time': 'int } }
> { 'union': 'LUKSKeyslotAmend',
>   'base': { '*keyslot': 'int',
> 'state': 'LUKSKeyslotState' }
>   'discriminator': 'state',
>   'data': { 'active': 'LUKSKeyslotActive' } }
>
> Union looks more complicated because our union notation sucks[*].  I
> like it anyway, because you don't have to explain when which optional
> members aren't actually optional.
>
> Regardless of struct vs. union, this supports an active -> active
> transition only with an explicit keyslot.  Feels fine to me.  If we want
> to support it without keyslot as well, we need a way to specify both old
> and new secret.  Do we?
>
>
> [*] I hope to fix that one day.  It's not even hard.