[PATCH for-5.1] nbd: Fix large trim/zero requests

2020-07-22 Thread Eric Blake
Although qemu as NBD client limits requests to <2G, the NBD protocol
allows clients to send requests almost all the way up to 4G.  But
because our block layer is not yet 64-bit clean, we accidentally wrap
such requests into a negative size, and fail with EIO instead of
performing the intended operation.

The bug is visible in modern systems with something as simple as:

$ qemu-img create -f qcow2 /tmp/image.img 5G
$ sudo qemu-nbd --connect=/dev/nbd0 /tmp/image.img
$ sudo blkdiscard /dev/nbd0

or with user-space only:

$ truncate --size=3G file
$ qemu-nbd -f raw file
$ nbdsh -u nbd://localhost:10809 -c 'h.trim(3*1024*1024*1024,0)'

Alas, our iotests do not currently make it easy to add external
dependencies on blkdiscard or nbdsh, so we have to rely on manual
testing for now.

This patch can be reverted when we later improve the overall block
layer to be 64-bit clean, but for now, a minimal fix was deemed less
risky prior to release.

CC: qemu-sta...@nongnu.org
Fixes: 1f4d6d18ed
Fixes: 1c6c4bb7f0
Fixes: https://github.com/systemd/systemd/issues/16242
Signed-off-by: Eric Blake 
---
 nbd/server.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 4752a6c8bc07..029618017c90 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2378,8 +2378,17 @@ static coroutine_fn int nbd_handle_request(NBDClient *client,
         if (request->flags & NBD_CMD_FLAG_FAST_ZERO) {
             flags |= BDRV_REQ_NO_FALLBACK;
         }
-        ret = blk_pwrite_zeroes(exp->blk, request->from + exp->dev_offset,
-                                request->len, flags);
+        ret = 0;
+        /* FIXME simplify this when blk_pwrite_zeroes switches to 64-bit */
+        while (ret == 0 && request->len) {
+            int align = client->check_align ?: 1;
+            int len = MIN(request->len, QEMU_ALIGN_DOWN(BDRV_REQUEST_MAX_BYTES,
+                                                        align));
+            ret = blk_pwrite_zeroes(exp->blk, request->from + exp->dev_offset,
+                                    len, flags);
+            request->len -= len;
+            request->from += len;
+        }
         return nbd_send_generic_reply(client, request->handle, ret,
                                       "writing to file failed", errp);

@@ -2393,8 +2402,17 @@ static coroutine_fn int nbd_handle_request(NBDClient *client,
                                       "flush failed", errp);
 
     case NBD_CMD_TRIM:
-        ret = blk_co_pdiscard(exp->blk, request->from + exp->dev_offset,
-                              request->len);
+        ret = 0;
+        /* FIXME simplify this when blk_co_pdiscard switches to 64-bit */
+        while (ret == 0 && request->len) {
+            int align = client->check_align ?: 1;
+            int len = MIN(request->len, QEMU_ALIGN_DOWN(BDRV_REQUEST_MAX_BYTES,
+                                                        align));
+            ret = blk_co_pdiscard(exp->blk, request->from + exp->dev_offset,
+                                  len);
+            request->len -= len;
+            request->from += len;
+        }
         if (ret == 0 && request->flags & NBD_CMD_FLAG_FUA) {
             ret = blk_co_flush(exp->blk);
         }
-- 
2.27.0




Re: [PATCH v7 33/47] mirror: Deal with filters

2020-07-22 Thread Andrey Shinkevich

On 25.06.2020 18:22, Max Reitz wrote:

This includes some permission limiting (for example, we only need to
take the RESIZE permission for active commits where the base is smaller
than the top).

Use this opportunity to rename qmp_drive_mirror()'s "source" BDS to
"target_backing_bs", because that is what it really refers to.

Signed-off-by: Max Reitz 
---
  qapi/block-core.json |   6 ++-
  block/mirror.c   | 118 +--
  blockdev.c   |  36 +
  3 files changed, 121 insertions(+), 39 deletions(-)


...

diff --git a/block/mirror.c b/block/mirror.c
index 469acf4600..770de3b34e 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -42,6 +42,7 @@ typedef struct MirrorBlockJob {
     BlockBackend *target;
     BlockDriverState *mirror_top_bs;
     BlockDriverState *base;
+    BlockDriverState *base_overlay;
 
     /* The name of the graph node to replace */
     char *replaces;
@@ -677,8 +678,10 @@ static int mirror_exit_common(Job *job)
                              &error_abort);
     if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
         BlockDriverState *backing = s->is_none_mode ? src : s->base;
-        if (backing_bs(target_bs) != backing) {
-            bdrv_set_backing_hd(target_bs, backing, &local_err);
+        BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs);
+
+        if (bdrv_cow_bs(unfiltered_target) != backing) {



I just worry about a filter node of the concurrent job right below the 
unfiltered_target. The filter has unfiltered_target in its parent list. 
Will that filter node be replaced correctly then?



Andrey

...


+        /*
+         * The topmost node with
+         * bdrv_skip_filters(filtered_target) == bdrv_skip_filters(target)
+         */
+        filtered_target = bdrv_cow_bs(bdrv_find_overlay(bs, target));
+
+        assert(bdrv_skip_filters(filtered_target) ==
+               bdrv_skip_filters(target));
+
+        /*
+         * XXX BLK_PERM_WRITE needs to be allowed so we don't block
+         * ourselves at s->base (if writes are blocked for a node, they are
+         * also blocked for its backing file). The other options would be a
+         * second filter driver above s->base (== target).
+         */
+        iter_shared_perms = BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE;
+
+        for (iter = bdrv_filter_or_cow_bs(bs); iter != target;
+             iter = bdrv_filter_or_cow_bs(iter))
+        {
+            if (iter == filtered_target) {



For one filter node only?



+                /*
+                 * From here on, all nodes are filters on the base.
+                 * This allows us to share BLK_PERM_CONSISTENT_READ.
+                 */
+                iter_shared_perms |= BLK_PERM_CONSISTENT_READ;
+            }
+
             ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
-                                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE,
-                                     errp);
+                                     iter_shared_perms, errp);
             if (ret < 0) {
                 goto fail;
             }

...

@@ -3042,6 +3053,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
                          " named node of the graph");
         goto out;
     }
+    replaces_node_name = arg->replaces;



What is the idea behind the variable substitution?

Perhaps this change could be split out into a separate patch.

Andrey





Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-22 Thread Andrzej Jakowski
On 7/22/20 10:21 AM, Klaus Jensen wrote:
> On Jul 22 10:00, Andrzej Jakowski wrote:
>> On 7/22/20 12:43 AM, Klaus Jensen wrote:
>>> @keith, please see below - can you comment on the Linux kernel 2 MB
>>> boundary requirement for the CMB? Or should we hail Stephen (or Logan
>>> maybe) since this seems to be related to p2pdma?
>>>
>>> On Jul 21 14:54, Andrzej Jakowski wrote:
 On 7/15/20 1:06 AM, Klaus Jensen wrote:
> Hi Andrzej,
>
> I've not been ignoring this, but sorry for not following up earlier.
>
> I'm hesitant to merge anything that very obviously breaks an OS that we
> know is used a lot with this device. Also because the issue has
> not been analyzed well enough to actually know if this is a QEMU or
> kernel issue.

 Hi Klaus,

 Thx for your response! I understand your hesitance on merging stuff that
 obviously breaks guest OS. 

>
> Now, as far as I can test, having the MSI-X vector table and PBA in BAR
> 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
> (regardless of IOMMU on/off).
>
> Later, when the issue is better understood, we can add options to set
> offsets, BIRs etc.
>
> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
> be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
> git://git.infradead.org/qemu-nvme.git nvme-next branch.
>
> Can you reproduce the issues with that patch? I can't on a stock Arch
> Linux 5.7.5-arch1-1 kernel.

 While I'm happy that approach with MSIX and PBA in BAR0 works fine, I
 feel that investigation part why it works while mine doesn't is
 missing. It looks to me that both patches are basically following same 
 approach: create memory subregion and overlay on top of other memory
 region. Why one works and the other doesn't then?

 With that in mind, I have recently focused on understanding the problem.
 I observed that when guest assigns address to BAR4, addr field in
 nvme-bar4 memory region gets populated, but it doesn't get automatically
 populated in ctrl_mem (cmb) memory subregion, so later when 
 nvme_addr_is_cmb() 
 is called address check works incorrectly and as a consequence vmm does 
 dma 
 read instead of memcpy.
 I created a patch that sets correct address on ctrl_mem subregion and 
 guest 
 OS boots up correctly.

 When I looked into pci and memory region code I noticed that indeed address
 is only assigned to top level memory region but not to contained 
 subregions.
 I think that because in your approach cmb grabs whole bar exclusively it 
 works
 fine.

 Here is my question (perhaps pci folks can help answer :)): if we consider 
 memory region overlapping for pci devices as valid use case should pci 
 code on configuration cycles walk through all contained subregion and
 update addr field accordingly?

 Thx!

>>>
>>> Hi Andrzej,
>>>
>>> Thanks for looking into this. I think your analysis helped me nail this.
>>> The problem is that we added the use of a subregion and have some
>>> assumptions that no longer hold.
>>>
>>> nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
>>> But when the memory region is a subregion, addr holds an offset into the
>>> parent container instead. Thus, changing all occurrences of
>>> n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
>>> (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
>>> in your original patch[1]. The reason my version worked is because there
>>> was no subregion involved for the CMB, so the existing address
>>> validation calculations were still correct.
>>
>> I'm a little bit concerned with this approach:
>> (n->bar0.addr + n->ctrl_mem.addr) and hoping to have some debate. Let me 
>> describe my understanding of the problem.
> 
> Oh. In the context of your patch I meant bar4 of course, but anyway.
> 
>> It looks to me that addr field sometimes contains *absolute* address (when 
>> no 
>> hierarchy is used) and other times it contains *relative* address (when
>> hierarchy is created). From my perspective use of this field is inconsistent
>> and thus error-prone.  
>> Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
>> solve root problem and is still prone to the same problem if in the future
>> we potentially build even more complicated hierarchy.
>> I think that we could solve it by introducing helper function like
>>
>> hwaddr memory_region_get_abs_addr(MemoryRegion *mr) 
>>
>> to retrieve absolute address and in the documentation indicate that addr 
>> field
>> can be relative or absolute and it is recommended to use above function to 
>> retrieve absolute address.
>> What do you think?
>>
> 
> I'm all for a helper - I was not gonna cheer for the quick'n'dirty fix I
> did just to convince myself 

Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-22 Thread Klaus Jensen
On Jul 22 10:00, Andrzej Jakowski wrote:
> On 7/22/20 12:43 AM, Klaus Jensen wrote:
> > @keith, please see below - can you comment on the Linux kernel 2 MB
> > boundary requirement for the CMB? Or should we hail Stephen (or Logan
> > maybe) since this seems to be related to p2pdma?
> > 
> > On Jul 21 14:54, Andrzej Jakowski wrote:
> >> On 7/15/20 1:06 AM, Klaus Jensen wrote:
> >>> Hi Andrzej,
> >>>
> >>> I've not been ignoring this, but sorry for not following up earlier.
> >>>
> >>> I'm hesitant to merge anything that very obviously breaks an OS that we
> >>> know is used a lot with this device. Also because the issue has
> >>> not been analyzed well enough to actually know if this is a QEMU or
> >>> kernel issue.
> >>
> >> Hi Klaus,
> >>
> >> Thx for your response! I understand your hesitance on merging stuff that
> >> obviously breaks guest OS. 
> >>
> >>>
> >>> Now, as far as I can test, having the MSI-X vector table and PBA in BAR
> >>> 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
> >>> (regardless of IOMMU on/off).
> >>>
> >>> Later, when the issue is better understood, we can add options to set
> >>> offsets, BIRs etc.
> >>>
> >>> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
> >>> be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
> >>> git://git.infradead.org/qemu-nvme.git nvme-next branch.
> >>>
> >>> Can you reproduce the issues with that patch? I can't on a stock Arch
> >>> Linux 5.7.5-arch1-1 kernel.
> >>
> >> While I'm happy that approach with MSIX and PBA in BAR0 works fine, I
> >> feel that investigation part why it works while mine doesn't is
> >> missing. It looks to me that both patches are basically following same 
> >> approach: create memory subregion and overlay on top of other memory
> >> region. Why one works and the other doesn't then?
> >>
> >> With that in mind, I have recently focused on understanding the problem.
> >> I observed that when guest assigns address to BAR4, addr field in
> >> nvme-bar4 memory region gets populated, but it doesn't get automatically
> >> populated in ctrl_mem (cmb) memory subregion, so later when 
> >> nvme_addr_is_cmb() 
> >> is called address check works incorrectly and as a consequence vmm does 
> >> dma 
> >> read instead of memcpy.
> >> I created a patch that sets correct address on ctrl_mem subregion and 
> >> guest 
> >> OS boots up correctly.
> >>
> >> When I looked into pci and memory region code I noticed that indeed address
> >> is only assigned to top level memory region but not to contained 
> >> subregions.
> >> I think that because in your approach cmb grabs whole bar exclusively it 
> >> works
> >> fine.
> >>
> >> Here is my question (perhaps pci folks can help answer :)): if we consider 
> >> memory region overlapping for pci devices as valid use case should pci 
> >> code on configuration cycles walk through all contained subregion and
> >> update addr field accordingly?
> >>
> >> Thx!
> >>
> > 
> > Hi Andrzej,
> > 
> > Thanks for looking into this. I think your analysis helped me nail this.
> > The problem is that we added the use of a subregion and have some
> > assumptions that no longer hold.
> > 
> > nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
> > But when the memory region is a subregion, addr holds an offset into the
> > parent container instead. Thus, changing all occurrences of
> > n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
> > (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
> > in your original patch[1]. The reason my version worked is because there
> > was no subregion involved for the CMB, so the existing address
> > validation calculations were still correct.
> 
> I'm a little bit concerned with this approach:
> (n->bar0.addr + n->ctrl_mem.addr) and hoping to have some debate. Let me 
> describe my understanding of the problem.

Oh. In the context of your patch I meant bar4 of course, but anyway.

> It looks to me that addr field sometimes contains *absolute* address (when no 
> hierarchy is used) and other times it contains *relative* address (when
> hierarchy is created). From my perspective use of this field is inconsistent
> and thus error-prone.  
> Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
> solve root problem and is still prone to the same problem if in the future
> we potentially build even more complicated hierarchy.
> I think that we could solve it by introducing helper function like
> 
> hwaddr memory_region_get_abs_addr(MemoryRegion *mr) 
> 
> to retrieve absolute address and in the documentation indicate that addr field
> can be relative or absolute and it is recommended to use above function to 
> retrieve absolute address.
> What do you think?
> 

I'm all for a helper - I was not gonna cheer for the quick'n'dirty fix I
did just to convince myself that this was the issue ;)

I think the helper might already be there in 

Re: [PATCH for-5.1 1/2] qcow2: Implement v2 zero writes with discard if possible

2020-07-22 Thread Maxim Levitsky
On Wed, 2020-07-22 at 19:14 +0200, Kevin Wolf wrote:
> Am 22.07.2020 um 19:01 hat Maxim Levitsky geschrieben:
> > On Mon, 2020-07-20 at 15:18 +0200, Kevin Wolf wrote:
> > > qcow2 version 2 images don't support the zero flag for clusters, so for
> > > write_zeroes requests, we return -ENOTSUP and get explicit zero buffer
> > > writes. If the image doesn't have a backing file, we can do better: Just
> > > discard the respective clusters.
> > > 
> > > This is relevant for 'qemu-img convert -O qcow2 -n', where qemu-img has
> > > to assume that the existing target image may contain any data, so it has
> > > to write zeroes. Without this patch, this results in a fully allocated
> > > target image, even if the source image was empty.
> > > 
> > > Reported-by: Nir Soffer 
> > > Signed-off-by: Kevin Wolf 
> > > ---
> > >  block/qcow2-cluster.c | 9 ++++++++-
> > >  1 file changed, 8 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> > > index 4b5fc8c4a7..a677ba9f5c 100644
> > > --- a/block/qcow2-cluster.c
> > > +++ b/block/qcow2-cluster.c
> > > @@ -1797,8 +1797,15 @@ int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
> > >      assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
> > >             end_offset >= bs->total_sectors << BDRV_SECTOR_BITS);
> > >  
> > > -    /* The zero flag is only supported by version 3 and newer */
> > > +    /*
> > > +     * The zero flag is only supported by version 3 and newer. However, if we
> > > +     * have no backing file, we can resort to discard in version 2.
> > > +     */
> > >      if (s->qcow_version < 3) {
> > > +        if (!bs->backing) {
> > > +            return qcow2_cluster_discard(bs, offset, bytes,
> > > +                                         QCOW2_DISCARD_REQUEST, false);
> > > +        }
> > >          return -ENOTSUP;
> > >      }
> > >  
> > 
> > From my knowledge of NVMe, I remember that discard doesn't have to zero
> > the blocks. There is a special namespace capability that indicates the
> > contents of a discarded block (Deallocate Logical Block Features).
> > 
> > If and only if that flag indicates that discarded areas read back as
> > zero, then the write-zeroes command can carry a special 'deallocate'
> > flag that hints the controller to discard the sectors.
> > 
> > So wouldn't discarding the clusters carry a theoretical risk of
> > introducing garbage there?
> 
> No, qcow2_cluster_discard() has a defined behaviour. For v2 images, it
> unallocates the cluster in the L2 table (this is only safe without a
> backing file), for v3 images it converts them to zero clusters.

All right then!

Best regards,
Maxim Levitsky
> 
> Kevin





Re: [PATCH for-5.1 1/2] qcow2: Implement v2 zero writes with discard if possible

2020-07-22 Thread Kevin Wolf
Am 22.07.2020 um 19:01 hat Maxim Levitsky geschrieben:
> On Mon, 2020-07-20 at 15:18 +0200, Kevin Wolf wrote:
> > qcow2 version 2 images don't support the zero flag for clusters, so for
> > write_zeroes requests, we return -ENOTSUP and get explicit zero buffer
> > writes. If the image doesn't have a backing file, we can do better: Just
> > discard the respective clusters.
> > 
> > This is relevant for 'qemu-img convert -O qcow2 -n', where qemu-img has
> > to assume that the existing target image may contain any data, so it has
> > to write zeroes. Without this patch, this results in a fully allocated
> > target image, even if the source image was empty.
> > 
> > Reported-by: Nir Soffer 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  block/qcow2-cluster.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> > index 4b5fc8c4a7..a677ba9f5c 100644
> > --- a/block/qcow2-cluster.c
> > +++ b/block/qcow2-cluster.c
> > @@ -1797,8 +1797,15 @@ int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
> >      assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
> >             end_offset >= bs->total_sectors << BDRV_SECTOR_BITS);
> >  
> > -    /* The zero flag is only supported by version 3 and newer */
> > +    /*
> > +     * The zero flag is only supported by version 3 and newer. However, if we
> > +     * have no backing file, we can resort to discard in version 2.
> > +     */
> >      if (s->qcow_version < 3) {
> > +        if (!bs->backing) {
> > +            return qcow2_cluster_discard(bs, offset, bytes,
> > +                                         QCOW2_DISCARD_REQUEST, false);
> > +        }
> >          return -ENOTSUP;
> >      }
> >  
> 
> From my knowledge of NVMe, I remember that discard doesn't have to zero
> the blocks. There is a special namespace capability that indicates the
> contents of a discarded block (Deallocate Logical Block Features).
> 
> If and only if that flag indicates that discarded areas read back as
> zero, then the write-zeroes command can carry a special 'deallocate'
> flag that hints the controller to discard the sectors.
> 
> So wouldn't discarding the clusters carry a theoretical risk of
> introducing garbage there?

No, qcow2_cluster_discard() has a defined behaviour. For v2 images, it
unallocates the cluster in the L2 table (this is only safe without a
backing file), for v3 images it converts them to zero clusters.

Kevin




Re: [PATCH for-5.1 1/2] qcow2: Implement v2 zero writes with discard if possible

2020-07-22 Thread Maxim Levitsky
On Mon, 2020-07-20 at 15:18 +0200, Kevin Wolf wrote:
> qcow2 version 2 images don't support the zero flag for clusters, so for
> write_zeroes requests, we return -ENOTSUP and get explicit zero buffer
> writes. If the image doesn't have a backing file, we can do better: Just
> discard the respective clusters.
> 
> This is relevant for 'qemu-img convert -O qcow2 -n', where qemu-img has
> to assume that the existing target image may contain any data, so it has
> to write zeroes. Without this patch, this results in a fully allocated
> target image, even if the source image was empty.
> 
> Reported-by: Nir Soffer 
> Signed-off-by: Kevin Wolf 
> ---
>  block/qcow2-cluster.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 4b5fc8c4a7..a677ba9f5c 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1797,8 +1797,15 @@ int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
>      assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
>             end_offset >= bs->total_sectors << BDRV_SECTOR_BITS);
>  
> -    /* The zero flag is only supported by version 3 and newer */
> +    /*
> +     * The zero flag is only supported by version 3 and newer. However, if we
> +     * have no backing file, we can resort to discard in version 2.
> +     */
>      if (s->qcow_version < 3) {
> +        if (!bs->backing) {
> +            return qcow2_cluster_discard(bs, offset, bytes,
> +                                         QCOW2_DISCARD_REQUEST, false);
> +        }
>          return -ENOTSUP;
>      }
>  

From my knowledge of NVMe, I remember that discard doesn't have to zero
the blocks. There is a special namespace capability that indicates the
contents of a discarded block (Deallocate Logical Block Features).

If and only if that flag indicates that discarded areas read back as
zero, then the write-zeroes command can carry a special 'deallocate'
flag that hints the controller to discard the sectors.

So wouldn't discarding the clusters carry a theoretical risk of
introducing garbage there?

Best regards,
Maxim Levitsky




Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-22 Thread Andrzej Jakowski
On 7/22/20 12:43 AM, Klaus Jensen wrote:
> @keith, please see below - can you comment on the Linux kernel 2 MB
> boundary requirement for the CMB? Or should we hail Stephen (or Logan
> maybe) since this seems to be related to p2pdma?
> 
> On Jul 21 14:54, Andrzej Jakowski wrote:
>> On 7/15/20 1:06 AM, Klaus Jensen wrote:
>>> Hi Andrzej,
>>>
>>> I've not been ignoring this, but sorry for not following up earlier.
>>>
>>> I'm hesitant to merge anything that very obviously breaks an OS that we
>>> know is used a lot with this device. Also because the issue has
>>> not been analyzed well enough to actually know if this is a QEMU or
>>> kernel issue.
>>
>> Hi Klaus,
>>
>> Thx for your response! I understand your hesitance on merging stuff that
>> obviously breaks guest OS. 
>>
>>>
>>> Now, as far as I can test, having the MSI-X vector table and PBA in BAR
>>> 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
>>> (regardless of IOMMU on/off).
>>>
>>> Later, when the issue is better understood, we can add options to set
>>> offsets, BIRs etc.
>>>
>>> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
>>> be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
>>> git://git.infradead.org/qemu-nvme.git nvme-next branch.
>>>
>>> Can you reproduce the issues with that patch? I can't on a stock Arch
>>> Linux 5.7.5-arch1-1 kernel.
>>
>> While I'm happy that approach with MSIX and PBA in BAR0 works fine, I
>> feel that investigation part why it works while mine doesn't is
>> missing. It looks to me that both patches are basically following same 
>> approach: create memory subregion and overlay on top of other memory
>> region. Why one works and the other doesn't then?
>>
>> With that in mind, I have recently focused on understanding the problem.
>> I observed that when guest assigns address to BAR4, addr field in
>> nvme-bar4 memory region gets populated, but it doesn't get automatically
>> populated in ctrl_mem (cmb) memory subregion, so later when 
>> nvme_addr_is_cmb() 
>> is called address check works incorrectly and as a consequence vmm does dma 
>> read instead of memcpy.
>> I created a patch that sets correct address on ctrl_mem subregion and guest 
>> OS boots up correctly.
>>
>> When I looked into pci and memory region code I noticed that indeed address
>> is only assigned to top level memory region but not to contained subregions.
>> I think that because in your approach cmb grabs whole bar exclusively it 
>> works
>> fine.
>>
>> Here is my question (perhaps pci folks can help answer :)): if we consider 
>> memory region overlapping for pci devices as valid use case should pci 
>> code on configuration cycles walk through all contained subregion and
>> update addr field accordingly?
>>
>> Thx!
>>
> 
> Hi Andrzej,
> 
> Thanks for looking into this. I think your analysis helped me nail this.
> The problem is that we added the use of a subregion and have some
> assumptions that no longer hold.
> 
> nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
> But when the memory region is a subregion, addr holds an offset into the
> parent container instead. Thus, changing all occurrences of
> n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
> (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
> in your original patch[1]. The reason my version worked is because there
> was no subregion involved for the CMB, so the existing address
> validation calculations were still correct.

I'm a little bit concerned with this approach:
(n->bar0.addr + n->ctrl_mem.addr) and hoping to have some debate. Let me 
describe my understanding of the problem.
It looks to me that addr field sometimes contains *absolute* address (when no 
hierarchy is used) and other times it contains *relative* address (when
hierarchy is created). From my perspective use of this field is inconsistent
and thus error-prone.  
Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
solve root problem and is still prone to the same problem if in the future
we potentially build even more complicated hierarchy.
I think that we could solve it by introducing helper function like

hwaddr memory_region_get_abs_addr(MemoryRegion *mr) 

to retrieve absolute address and in the documentation indicate that addr field
can be relative or absolute and it is recommended to use above function to 
retrieve absolute address.
What do you think?

> 
> This leaves us with the Linux kernel complaining about not being able to
> register the CMB if it is not on a 2MB boundary - this is probably just
> a constraint in the kernel that we can't do anything about (but I'm no
> kernel hacker...), which can be fixed by either being "nice" towards the
> Linux kernel by forcing a 2 MB alignment in the device or exposing the
> SZU field such that the user can choose 16MiB size units (or higher)
> instead of 1MiB. I'm leaning towards ensuring the 2 MB alignment 

[PATCH for-5.1] iotests: Select a default machine for the rx and avr targets

2020-07-22 Thread Thomas Huth
If you are building only with either the new rx-softmmu or avr-softmmu
target, "make check-block" fails a couple of tests since there is no
default machine defined in these new targets. We have to select a machine
in the "check" script for these, just like we already do for the arm- and
tricore-softmmu targets.

Signed-off-by: Thomas Huth 
---
 tests/qemu-iotests/check | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
index e0d8049012..0657f7286c 100755
--- a/tests/qemu-iotests/check
+++ b/tests/qemu-iotests/check
@@ -595,15 +595,19 @@ then
 fi
 export QEMU_PROG="$(type -p "$QEMU_PROG")"
 
+export QEMU_OPTIONS="-nodefaults -display none -accel qtest"
 case "$QEMU_PROG" in
     *qemu-system-arm|*qemu-system-aarch64)
-        export QEMU_OPTIONS="-nodefaults -display none -machine virt -accel qtest"
+        export QEMU_OPTIONS="$QEMU_OPTIONS -machine virt"
         ;;
-    *qemu-system-tricore)
-        export QEMU_OPTIONS="-nodefaults -display none -machine tricore_testboard -accel qtest"
+    *qemu-system-avr)
+        export QEMU_OPTIONS="$QEMU_OPTIONS -machine mega2560"
+        ;;
+    *qemu-system-rx)
+        export QEMU_OPTIONS="$QEMU_OPTIONS -machine gdbsim-r5f562n8"
         ;;
-    *)
-        export QEMU_OPTIONS="-nodefaults -display none -accel qtest"
+    *qemu-system-tricore)
+        export QEMU_OPTIONS="$QEMU_OPTIONS -machine tricore_testboard"
         ;;
 esac
 
-- 
2.18.1
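[Editor's note] The resulting option-selection logic can be modelled as a small shell function (the wrapper name `qemu_options` and the binary paths are illustrative, not part of the patch):

```shell
# Common options first, then a per-target default machine where the
# binary (rx, avr, arm, tricore) defines none.
qemu_options() {
    opts="-nodefaults -display none -accel qtest"
    case "$1" in
        *qemu-system-arm|*qemu-system-aarch64) echo "$opts -machine virt" ;;
        *qemu-system-avr)     echo "$opts -machine mega2560" ;;
        *qemu-system-rx)      echo "$opts -machine gdbsim-r5f562n8" ;;
        *qemu-system-tricore) echo "$opts -machine tricore_testboard" ;;
        *)                    echo "$opts" ;;
    esac
}

qemu_options /usr/bin/qemu-system-avr
# prints: -nodefaults -display none -accel qtest -machine mega2560
```

Targets with a sensible built-in default machine (the `*)` arm) keep only the common options, which is what the patch's top-level `export` provides.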




Re: [PATCH 3/4] io/channel-socket: implement non-blocking connect

2020-07-22 Thread Vladimir Sementsov-Ogievskiy

22.07.2020 18:43, Daniel P. Berrangé wrote:

On Wed, Jul 22, 2020 at 06:40:10PM +0300, Vladimir Sementsov-Ogievskiy wrote:

22.07.2020 18:21, Daniel P. Berrangé wrote:

On Wed, Jul 22, 2020 at 06:04:53PM +0300, Vladimir Sementsov-Ogievskiy wrote:

22.07.2020 16:47, Vladimir Sementsov-Ogievskiy wrote:

22.07.2020 15:53, Daniel P. Berrangé wrote:

On Wed, Jul 22, 2020 at 03:43:54PM +0300, Vladimir Sementsov-Ogievskiy wrote:

22.07.2020 14:21, Daniel P. Berrangé wrote:

On Wed, Jul 22, 2020 at 02:00:25PM +0300, Vladimir Sementsov-Ogievskiy wrote:

20.07.2020 21:29, Daniel P. Berrangé wrote:

On Mon, Jul 20, 2020 at 09:07:14PM +0300, Vladimir Sementsov-Ogievskiy wrote:

Utilize new socket API to make a non-blocking connect for inet sockets.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
      include/io/channel-socket.h | 14 +++
      io/channel-socket.c | 74 +
      2 files changed, 88 insertions(+)

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index 777ff5954e..82e868bc02 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -94,6 +94,20 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
                                     SocketAddress *addr,
                                     Error **errp);
+/**
+ * qio_channel_socket_connect_non_blocking_sync:
+ * @ioc: the socket channel object
+ * @addr: the address to connect to
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Attempt to connect to the address @addr using non-blocking mode of
+ * the socket. Function is synchronous, but being called from
+ * coroutine context will yield during connect operation.
+ */
+int qio_channel_socket_connect_non_blocking_sync(QIOChannelSocket *ioc,
+                                                 SocketAddress *addr,
+                                                 Error **errp);
+
 /**
  * qio_channel_socket_connect_async:
  * @ioc: the socket channel object
diff --git a/io/channel-socket.c b/io/channel-socket.c
index e1b4667087..076de7578a 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -22,6 +22,7 @@
 #include "qapi/error.h"
 #include "qapi/qapi-visit-sockets.h"
 #include "qemu/module.h"
+#include "qemu/sockets.h"
 #include "io/channel-socket.h"
 #include "io/channel-watch.h"
 #include "trace.h"
@@ -29,6 +30,8 @@
 #define SOCKET_MAX_FDS 16
 
+static int qio_channel_socket_close(QIOChannel *ioc, Error **errp);
+
      SocketAddress *
      qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
   Error **errp)
@@ -157,6 +160,77 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
      return 0;
      }
+static int qio_channel_inet_connect_non_blocking_sync(QIOChannelSocket *ioc,
+    InetSocketAddress *addr, Error **errp)
+{
+    Error *local_err = NULL;
+    struct addrinfo *infos, *info;
+    int sock = -1;
+
+    infos = inet_parse_connect_saddr(addr, errp);
+    if (!infos) {
+    return -1;
+    }


This call is blocking since it calls getaddrinfo, whose design
offers no ability to do non-blocking DNS lookups. Given this
call, ...


Oh, that's bad, thanks for taking a look at this at an early stage!




+
+    for (info = infos; info != NULL; info = info->ai_next) {
+    bool in_progress;
+
+    error_free(local_err);
+    local_err = NULL;
+
+    sock = inet_connect_addr(addr, info, false, &in_progress, &local_err);
+    if (sock < 0) {
+    continue;
+    }
+
+    if (qio_channel_socket_set_fd(ioc, sock, &local_err) < 0) {
+    close(sock);
+    continue;
+    }
+
+    if (in_progress) {
+    if (qemu_in_coroutine()) {
+    qio_channel_yield(QIO_CHANNEL(ioc), G_IO_OUT);
+    } else {
+    qio_channel_wait(QIO_CHANNEL(ioc), G_IO_OUT);
+    }


...this is offering false assurances of being non-blocking.

If we don't want the current thread to be blocked then we
need to be using the existing qio_channel_socket_connect_async
method or similar. It uses a throw away background thread to
run the connection attempt, and then reports completion back
later, thus avoiding the getaddrinfo design flaw for the callers.

I explicitly didn't want to add a method like the impl in this
patch, because getaddrinfo dooms it, and we already had bugs in
the pre-QIOChannel code where QEMU thought it was non-blocking
but wasn't, due to getaddrinfo lookups.
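For illustration only, the pattern described above -- pushing the blocking
getaddrinfo() off to a throwaway thread so the caller's thread stays free --
can be sketched in plain POSIX. This is a hypothetical standalone demo, not
QEMU's actual qio_channel_socket_connect_async implementation; a numeric
host is used so the demo does not depend on a working resolver:

```c
/* Sketch: run the blocking getaddrinfo() in a worker thread so the
 * calling thread can keep servicing its event loop meanwhile. */
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

struct lookup {
    const char *host, *port;
    struct addrinfo *res;
    int err;
};

static void *lookup_thread(void *opaque)
{
    struct lookup *l = opaque;
    struct addrinfo hints;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;   /* keep the demo offline */
    l->err = getaddrinfo(l->host, l->port, &hints, &l->res);
    return NULL;
}

int main(void)
{
    struct lookup l = { "127.0.0.1", "80", NULL, 0 };
    pthread_t tid;

    pthread_create(&tid, NULL, lookup_thread, &l);
    /* the main thread would keep polling its event loop here */
    pthread_join(tid, NULL);

    if (l.err == 0) {
        char buf[INET_ADDRSTRLEN];
        struct sockaddr_in *sin = (struct sockaddr_in *)l.res->ai_addr;
        inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf));
        printf("resolved: %s\n", buf);
        freeaddrinfo(l.res);
    }
    return l.err ? 1 : 0;
}
```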


IIUC, the main appeal of this method is that the non-blocking
nature is hidden from the caller, who can continue to treat it
as a synchronous call and have the coroutine magic happen
behind the scenes.

IOW, what's needed is a simple way to run the operation in a
thread, and sleep for completion while having the coroutine
yield.

I think this could likely be achieved with QIOTask with an
alternate impl of the qio_task_wait_thread() method that is
friendly to coroutines instead of being based on pthread
condition variable waits.

Re: [PATCH 3/4] io/channel-socket: implement non-blocking connect

2020-07-22 Thread Daniel P . Berrangé
On Wed, Jul 22, 2020 at 06:40:10PM +0300, Vladimir Sementsov-Ogievskiy wrote:
[...]
Re: [PATCH 3/4] io/channel-socket: implement non-blocking connect

2020-07-22 Thread Vladimir Sementsov-Ogievskiy

22.07.2020 18:21, Daniel P. Berrangé wrote:

[...]

I think this could likely be achieved with QIOTask with an
alternate impl of the qio_task_wait_thread() method that is
friendly to coroutines instead of being based on pthread
condition variable waits.


The most simple thing is just to run qio_channel_socket_connect_sync in
a thread with help of thread_pool_submit_co(), which is coroutine-friendly.
And this doesn't need any changes in io/channel.

Re: [PATCH 3/4] io/channel-socket: implement non-blocking connect

2020-07-22 Thread Daniel P . Berrangé
On Wed, Jul 22, 2020 at 06:04:53PM +0300, Vladimir Sementsov-Ogievskiy wrote:
[...]

Re: [PATCH 3/4] io/channel-socket: implement non-blocking connect

2020-07-22 Thread Vladimir Sementsov-Ogievskiy

22.07.2020 16:47, Vladimir Sementsov-Ogievskiy wrote:

[...]

I think this could likely be achieved with QIOTask with an
alternate impl of the qio_task_wait_thread() method that is
friendly to coroutines instead of being based on pthread
condition variable waits.


The most simple thing is just to run qio_channel_socket_connect_sync in
a thread with help of thread_pool_submit_co(), which is coroutine-friendly.
And this doesn't need any changes in io/channel.


Re: [PATCH 3/4] io/channel-socket: implement non-blocking connect

2020-07-22 Thread Vladimir Sementsov-Ogievskiy

22.07.2020 15:53, Daniel P. Berrangé wrote:

[...]

I think this could likely be achieved with QIOTask with an
alternate impl of the qio_task_wait_thread() method that is
friendly to coroutines instead of being based on pthread
condition variable waits.


The most simple thing is just to run qio_channel_socket_connect_sync in
a thread with help of thread_pool_submit_co(), which is coroutine-friendly.
And this doesn't need any changes in io/channel.

Actually, I started with such a design, but decided it's better to use
a non-blocking connect, to avoid dealing with cancelling the connecting
thread on shutdown.

Re: [PATCH 3/4] io/channel-socket: implement non-blocking connect

2020-07-22 Thread Vladimir Sementsov-Ogievskiy

22.07.2020 14:21, Daniel P. Berrangé wrote:

[...]

I think this could likely be achieved with QIOTask with an
alternate impl of the qio_task_wait_thread() method that is
friendly to coroutines instead of being based on pthread
condition variable waits.


The most simple thing is just to run qio_channel_socket_connect_sync in
a thread with help of thread_pool_submit_co(), which is coroutine-friendly.
And this doesn't need any changes in io/channel.

Actually, I started with such a design, but decided it's better to use
a non-blocking connect, to avoid dealing with cancelling the connecting
thread on shutdown.

I think, I'll resend based on 

Re: [PATCH 3/4] io/channel-socket: implement non-blocking connect

2020-07-22 Thread Daniel P . Berrangé
On Wed, Jul 22, 2020 at 03:43:54PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 22.07.2020 14:21, Daniel P. Berrangé wrote:
> > On Wed, Jul 22, 2020 at 02:00:25PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > 20.07.2020 21:29, Daniel P. Berrangé wrote:
> > > > On Mon, Jul 20, 2020 at 09:07:14PM +0300, Vladimir Sementsov-Ogievskiy 
> > > > wrote:
> > > > > Utilize new socket API to make a non-blocking connect for inet 
> > > > > sockets.
> > > > > 
> > > > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > > > ---
> > > > >include/io/channel-socket.h | 14 +++
> > > > >io/channel-socket.c | 74 
> > > > > +
> > > > >2 files changed, 88 insertions(+)
> > > > > 
> > > > > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> > > > > index 777ff5954e..82e868bc02 100644
> > > > > --- a/include/io/channel-socket.h
> > > > > +++ b/include/io/channel-socket.h
> > > > > @@ -94,6 +94,20 @@ int 
> > > > > qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
> > > > >SocketAddress *addr,
> > > > >Error **errp);
> > > > > +/**
> > > > > + * qio_channel_socket_connect_non_blocking_sync:
> > > > > + * @ioc: the socket channel object
> > > > > + * @addr: the address to connect to
> > > > > + * @errp: pointer to a NULL-initialized error object
> > > > > + *
> > > > > + * Attempt to connect to the address @addr using non-blocking mode of
> > > > > + * the socket. Function is synchronous, but being called from
> > > > > + * coroutine context will yield during connect operation.
> > > > > + */
> > > > > +int qio_channel_socket_connect_non_blocking_sync(QIOChannelSocket 
> > > > > *ioc,
> > > > > + SocketAddress *addr,
> > > > > + Error **errp);
> > > > > +
> > > > >/**
> > > > > * qio_channel_socket_connect_async:
> > > > > * @ioc: the socket channel object
> > > > > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > > > > index e1b4667087..076de7578a 100644
> > > > > --- a/io/channel-socket.c
> > > > > +++ b/io/channel-socket.c
> > > > > @@ -22,6 +22,7 @@
> > > > >#include "qapi/error.h"
> > > > >#include "qapi/qapi-visit-sockets.h"
> > > > >#include "qemu/module.h"
> > > > > +#include "qemu/sockets.h"
> > > > >#include "io/channel-socket.h"
> > > > >#include "io/channel-watch.h"
> > > > >#include "trace.h"
> > > > > @@ -29,6 +30,8 @@
> > > > >#define SOCKET_MAX_FDS 16
> > > > > +static int qio_channel_socket_close(QIOChannel *ioc, Error **errp);
> > > > > +
> > > > >SocketAddress *
> > > > >qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
> > > > > Error **errp)
> > > > > @@ -157,6 +160,77 @@ int 
> > > > > qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
> > > > >return 0;
> > > > >}
> > > > > +static int 
> > > > > qio_channel_inet_connect_non_blocking_sync(QIOChannelSocket *ioc,
> > > > > +InetSocketAddress *addr, Error **errp)
> > > > > +{
> > > > > +Error *local_err = NULL;
> > > > > +struct addrinfo *infos, *info;
> > > > > +int sock = -1;
> > > > > +
> > > > > +infos = inet_parse_connect_saddr(addr, errp);
> > > > > +if (!infos) {
> > > > > +return -1;
> > > > > +}
> > > > 
> > > > This call is blocking since it calls getaddrinfo whose design
> > offers no ability to do non-blocking DNS lookups. Given this
> > > > call, ...
> > > 
> > > Oh, that's bad, thanks for taking a look at such an early stage!
> > > 
> > > > 
> > > > > +
> > > > > +for (info = infos; info != NULL; info = info->ai_next) {
> > > > > +bool in_progress;
> > > > > +
> > > > > +error_free(local_err);
> > > > > +local_err = NULL;
> > > > > +
> > > > > +sock = inet_connect_addr(addr, info, false, &in_progress, &local_err);
> > > > > +if (sock < 0) {
> > > > > +continue;
> > > > > +}
> > > > > +
> > > > > +if (qio_channel_socket_set_fd(ioc, sock, &local_err) < 0) {
> > > > > +close(sock);
> > > > > +continue;
> > > > > +}
> > > > > +
> > > > > +if (in_progress) {
> > > > > +if (qemu_in_coroutine()) {
> > > > > +qio_channel_yield(QIO_CHANNEL(ioc), G_IO_OUT);
> > > > > +} else {
> > > > > +qio_channel_wait(QIO_CHANNEL(ioc), G_IO_OUT);
> > > > > +}
> > > > 
> > > > ...this is offering false assurances of being non-blocking.
> > > > 
> > > > If we don't want the current thread to be blocked then we
> > > > need to be using the existing qio_channel_socket_connect_async
> > > > method or similar. It uses a throw away background thread to
> > > > run the connection attempt, and then reports completion back
> > > > later, thus avoiding the getaddrinfo design flaw 

Re: [PATCH v2 11/20] qapi: backup: add x-max-chunk and x-max-workers parameters

2020-07-22 Thread Max Reitz
On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:
> Add new parameters to configure future backup features. The patch
> doesn't introduce aio backup requests (so we actually have only one
> worker) nor requests larger than one cluster. Still, formally we
> satisfy these maximums anyway, so add the parameters now, to facilitate
> a further patch which will really change backup job behavior.
> 
> Options are added with x- prefixes, as the only use for them is some
> very conservative iotests which will be updated soon.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  qapi/block-core.json  |  9 -
>  include/block/block_int.h |  7 +++
>  block/backup.c| 21 +
>  block/replication.c   |  2 +-
>  blockdev.c|  5 +
>  5 files changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 0c7600e4ec..f4abcde34e 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1407,6 +1407,12 @@
>  #
>  # @x-use-copy-range: use copy offloading if possible. Default true. (Since 
> 5.1)
>  #
> +# @x-max-workers: maximum of parallel requests for static data backup. This
> +# doesn't influence copy-before-write operations. (Since: 
> 5.1)

I think I’d prefer something with “background” rather than or in
addition to “static”, like “maximum number of parallel requests for the
sustained background backup operation”.

> +#
> +# @x-max-chunk: maximum chunk length for static data backup. This doesn't
> +#   influence copy-before-write operations. (Since: 5.1)
> +#
>  # Note: @on-source-error and @on-target-error only affect background
>  #   I/O.  If an error occurs during a guest write request, the device's
>  #   rerror/werror actions will be used.
> @@ -1421,7 +1427,8 @@
>  '*on-source-error': 'BlockdevOnError',
>  '*on-target-error': 'BlockdevOnError',
>  '*auto-finalize': 'bool', '*auto-dismiss': 'bool',
> -'*filter-node-name': 'str', '*x-use-copy-range': 'bool'  } }
> +'*filter-node-name': 'str', '*x-use-copy-range': 'bool',
> +'*x-max-workers': 'int', '*x-max-chunk': 'int64' } }
>  
>  ##
>  # @DriveBackup:
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 93b9b3bdc0..d93a170d37 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -1207,6 +1207,11 @@ void mirror_start(const char *job_id, BlockDriverState 
> *bs,
>   * @sync_mode: What parts of the disk image should be copied to the 
> destination.
>   * @sync_bitmap: The dirty bitmap if sync_mode is 'bitmap' or 'incremental'
>   * @bitmap_mode: The bitmap synchronization policy to use.
> + * @max_workers: The limit for parallel requests for main backup loop.
> + *   Must be >= 1.
> + * @max_chunk: The limit for one IO operation length in main backup loop.
> + * Must be not less than job cluster size or zero. Zero means no
> + * specific limit.
>   * @on_source_error: The action to take upon error reading from the source.
>   * @on_target_error: The action to take upon error writing to the target.
>   * @creation_flags: Flags that control the behavior of the Job lifetime.
> @@ -1226,6 +1231,8 @@ BlockJob *backup_job_create(const char *job_id, 
> BlockDriverState *bs,
>  bool compress,
>  const char *filter_node_name,
>  bool use_copy_range,
> +int max_workers,
> +int64_t max_chunk,
>  BlockdevOnError on_source_error,
>  BlockdevOnError on_target_error,
>  int creation_flags,
> diff --git a/block/backup.c b/block/backup.c
> index 76847b4daf..ec2676abc2 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -46,6 +46,8 @@ typedef struct BackupBlockJob {
>  uint64_t len;
>  uint64_t bytes_read;
>  int64_t cluster_size;
> +int max_workers;
> +int64_t max_chunk;
>  
>  BlockCopyState *bcs;
>  } BackupBlockJob;
> @@ -335,6 +337,8 @@ BlockJob *backup_job_create(const char *job_id, 
> BlockDriverState *bs,
>bool compress,
>const char *filter_node_name,
>bool use_copy_range,
> +  int max_workers,
> +  int64_t max_chunk,
>BlockdevOnError on_source_error,
>BlockdevOnError on_target_error,
>int creation_flags,
> @@ -355,6 +359,16 @@ BlockJob *backup_job_create(const char *job_id, 
> BlockDriverState *bs,
>  assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL);
>  assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP);
>  
> +if (max_workers < 1) {
> +error_setg(errp, "At least one worker needed");
> +return NULL;
> +}

Re: [PATCH v2 10/20] job: call job_enter from job_user_pause

2020-07-22 Thread Max Reitz
On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:
> If the main job coroutine called job_yield() (while some background process
> is in progress), we should give it a chance to call job_pause_point().
> This will be used in backup, when it is moved to async block-copy.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  job.c | 1 +
>  1 file changed, 1 insertion(+)

Sounds reasonable to me, although I’d prefer an opinion from John.

So, a middle-weak:

Reviewed-by: Max Reitz 





Re: [PATCH 1/2] pci: pass along the return value of dma_memory_rw

2020-07-22 Thread Michael S. Tsirkin
On Mon, Jun 29, 2020 at 10:20:52PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Some devices might want to know the return value of dma_memory_rw, so
> pass it along instead of ignoring it.
> 
> There are no existing users of the return value, so this patch should be
> safe.
> 
> Signed-off-by: Klaus Jensen 
> Reviewed-by: Philippe Mathieu-Daudé 
> Reviewed-by: Michael S. Tsirkin 
> Acked-by: Keith Busch 


Please feel free to merge this with the patch that uses the
return value.

> ---
>  include/hw/pci/pci.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index a4e9c3341615..2347dc36bfb5 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -786,8 +786,7 @@ static inline AddressSpace 
> *pci_get_address_space(PCIDevice *dev)
>  static inline int pci_dma_rw(PCIDevice *dev, dma_addr_t addr,
>   void *buf, dma_addr_t len, DMADirection dir)
>  {
> -dma_memory_rw(pci_get_address_space(dev), addr, buf, len, dir);
> -return 0;
> +return dma_memory_rw(pci_get_address_space(dev), addr, buf, len, dir);
>  }
>  
>  static inline int pci_dma_read(PCIDevice *dev, dma_addr_t addr,
> -- 
> 2.27.0
> 
> 




Re: [PATCH v2 09/20] blockjob: add set_speed to BlockJobDriver

2020-07-22 Thread Max Reitz
On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:
> We are going to use the async block-copy call in backup, so we'll need to
> pass the backup speed setting through to the block-copy call.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/blockjob_int.h | 2 ++
>  blockjob.c   | 6 ++
>  2 files changed, 8 insertions(+)

Reviewed-by: Max Reitz 





Re: [PATCH v2 08/20] block/block-copy: add block_copy_cancel

2020-07-22 Thread Max Reitz
On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:
> Add a function to cancel a running async block-copy call. It will be used
> in backup.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/block-copy.h |  7 +++
>  block/block-copy.c | 22 +++---
>  2 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/include/block/block-copy.h b/include/block/block-copy.h
> index d40e691123..370a194d3c 100644
> --- a/include/block/block-copy.h
> +++ b/include/block/block-copy.h
> @@ -67,6 +67,13 @@ BlockCopyCallState *block_copy_async(BlockCopyState *s,
>  void block_copy_set_speed(BlockCopyState *s, BlockCopyCallState *call_state,
>uint64_t speed);
>  
> +/*
> + * Cancel running block-copy call.
> + * Cancel leaves block-copy state valid: dirty bits are correct and you may
> + * use cancel + <run block_copy with the same arguments> to emulate
> + * pause/resume.
> + */
> +void block_copy_cancel(BlockCopyCallState *call_state);
> +
>  BdrvDirtyBitmap *block_copy_dirty_bitmap(BlockCopyState *s);
>  void block_copy_set_skip_unallocated(BlockCopyState *s, bool skip);
>  
> diff --git a/block/block-copy.c b/block/block-copy.c
> index 851d9c8aaf..b551feb6c2 100644
> --- a/block/block-copy.c
> +++ b/block/block-copy.c
> @@ -44,6 +44,8 @@ typedef struct BlockCopyCallState {
>  bool failed;
>  bool finished;
>  QemuCoSleepState *sleep_state;
> +bool cancelled;
> +Coroutine *canceller;
>  
>  /* OUT parameters */
>  bool error_is_read;
> @@ -582,7 +584,7 @@ block_copy_dirty_clusters(BlockCopyCallState *call_state)
>  assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
>  assert(QEMU_IS_ALIGNED(bytes, s->cluster_size));
>  
> -while (bytes && aio_task_pool_status(aio) == 0) {
> +while (bytes && aio_task_pool_status(aio) == 0 && 
> !call_state->cancelled) {
>  BlockCopyTask *task;
>  int64_t status_bytes;
>  
> @@ -693,7 +695,7 @@ static int coroutine_fn 
> block_copy_common(BlockCopyCallState *call_state)
>  do {
>  ret = block_copy_dirty_clusters(call_state);
>  
> -if (ret == 0) {
> +if (ret == 0 && !call_state->cancelled) {
>  ret = block_copy_wait_one(call_state->s, call_state->offset,
>call_state->bytes);
>  }
> @@ -707,13 +709,18 @@ static int coroutine_fn 
> block_copy_common(BlockCopyCallState *call_state)
>   * 2. We have waited for some intersecting block-copy request
>   *It may have failed and produced new dirty bits.
>   */
> -} while (ret > 0);
> +} while (ret > 0 && !call_state->cancelled);

Would it be cleaner if block_copy_dirty_cluster() just returned
-ECANCELED?  Or would that pose a problem for its callers or the async
callback?

>  if (call_state->cb) {
>  call_state->cb(ret, call_state->error_is_read,
> call_state->s->progress_opaque);
>  }
>  
> +if (call_state->canceller) {
> +aio_co_wake(call_state->canceller);
> +call_state->canceller = NULL;
> +}
> +
>  call_state->finished = true;
>  
>  return ret;



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 3/4] io/channel-socket: implement non-blocking connect

2020-07-22 Thread Daniel P. Berrangé
On Wed, Jul 22, 2020 at 02:00:25PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 20.07.2020 21:29, Daniel P. Berrangé wrote:
> > On Mon, Jul 20, 2020 at 09:07:14PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > Utilize new socket API to make a non-blocking connect for inet sockets.
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > ---
> > >   include/io/channel-socket.h | 14 +++
> > >   io/channel-socket.c | 74 +
> > >   2 files changed, 88 insertions(+)
> > > 
> > > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> > > index 777ff5954e..82e868bc02 100644
> > > --- a/include/io/channel-socket.h
> > > +++ b/include/io/channel-socket.h
> > > @@ -94,6 +94,20 @@ int qio_channel_socket_connect_sync(QIOChannelSocket 
> > > *ioc,
> > >   SocketAddress *addr,
> > >   Error **errp);
> > > +/**
> > > + * qio_channel_socket_connect_non_blocking_sync:
> > > + * @ioc: the socket channel object
> > > + * @addr: the address to connect to
> > > + * @errp: pointer to a NULL-initialized error object
> > > + *
> > > + * Attempt to connect to the address @addr using non-blocking mode of
> > > + * the socket. Function is synchronous, but being called from
> > > + * coroutine context will yield during connect operation.
> > > + */
> > > +int qio_channel_socket_connect_non_blocking_sync(QIOChannelSocket *ioc,
> > > + SocketAddress *addr,
> > > + Error **errp);
> > > +
> > >   /**
> > >* qio_channel_socket_connect_async:
> > >* @ioc: the socket channel object
> > > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > > index e1b4667087..076de7578a 100644
> > > --- a/io/channel-socket.c
> > > +++ b/io/channel-socket.c
> > > @@ -22,6 +22,7 @@
> > >   #include "qapi/error.h"
> > >   #include "qapi/qapi-visit-sockets.h"
> > >   #include "qemu/module.h"
> > > +#include "qemu/sockets.h"
> > >   #include "io/channel-socket.h"
> > >   #include "io/channel-watch.h"
> > >   #include "trace.h"
> > > @@ -29,6 +30,8 @@
> > >   #define SOCKET_MAX_FDS 16
> > > +static int qio_channel_socket_close(QIOChannel *ioc, Error **errp);
> > > +
> > >   SocketAddress *
> > >   qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
> > >Error **errp)
> > > @@ -157,6 +160,77 @@ int qio_channel_socket_connect_sync(QIOChannelSocket 
> > > *ioc,
> > >   return 0;
> > >   }
> > > +static int qio_channel_inet_connect_non_blocking_sync(QIOChannelSocket 
> > > *ioc,
> > > +InetSocketAddress *addr, Error **errp)
> > > +{
> > > +Error *local_err = NULL;
> > > +struct addrinfo *infos, *info;
> > > +int sock = -1;
> > > +
> > > +infos = inet_parse_connect_saddr(addr, errp);
> > > +if (!infos) {
> > > +return -1;
> > > +}
> > 
> > This call is blocking since it calls getaddrinfo whose design
> > offers no ability to do non-blocking DNS lookups. Given this
> > call, ...
> 
> Oh, that's bad, thanks for taking a look at such an early stage!
> 
> > 
> > > +
> > > +for (info = infos; info != NULL; info = info->ai_next) {
> > > +bool in_progress;
> > > +
> > > +error_free(local_err);
> > > +local_err = NULL;
> > > +
> > > +sock = inet_connect_addr(addr, info, false, &in_progress, &local_err);
> > > +if (sock < 0) {
> > > +continue;
> > > +}
> > > +
> > > +if (qio_channel_socket_set_fd(ioc, sock, &local_err) < 0) {
> > > +close(sock);
> > > +continue;
> > > +}
> > > +
> > > +if (in_progress) {
> > > +if (qemu_in_coroutine()) {
> > > +qio_channel_yield(QIO_CHANNEL(ioc), G_IO_OUT);
> > > +} else {
> > > +qio_channel_wait(QIO_CHANNEL(ioc), G_IO_OUT);
> > > +}
> > 
> > ...this is offering false assurances of being non-blocking.
> > 
> > If we don't want the current thread to be blocked then we
> > need to be using the existing qio_channel_socket_connect_async
> > method or similar. It uses a throw away background thread to
> > run the connection attempt, and then reports completion back
> > later, thus avoiding the getaddrinfo design flaw for the callers.
> > 
> > I explicitly didn't want to add a method like the impl in this
> > patch, because getaddrinfo dooms it and we already had bugs in
> > the pre-QIOChannel code where QEMU thought it was non-blocking
> > but wasn't due to getaddrinfo lookups.
> > 
> > 
> > IIUC, the main appeal of this method is that the non-blocking
> > nature is hidden from the caller who can continue to treat it
> > as a synchronous call and have the coroutine magic happen
> > behind the scenes.
> > 
> > IOW, what's needed is a simple way to run the operation in a
> > thread, and sleep for completion while having the coroutine

Re: [PATCH v2 1/3] migration: Add block-bitmap-mapping parameter

2020-07-22 Thread Vladimir Sementsov-Ogievskiy

21.07.2020 11:02, Max Reitz wrote:

On 20.07.20 18:31, Vladimir Sementsov-Ogievskiy wrote:

16.07.2020 16:53, Max Reitz wrote:

This migration parameter allows mapping block node names and bitmap
names to aliases for the purpose of block dirty bitmap migration.

This way, management tools can use different node and bitmap names on
the source and destination and pass the mapping of how bitmaps are to be
transferred to qemu (on the source, the destination, or even both with
arbitrary aliases in the migration stream).

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
---
Vladimir noted in v1 that it would be better to ignore bitmaps whose
names aren't mapped, or that are on nodes whose names aren't mapped.
One of the reasons he gave was that bitmap migration is inherently a
form of postcopy migration, and we should not break the target when it
is already running because of a bitmap issue.

So in this version, I changed the behavior to just ignore bitmaps
without a mapping on the source side.  On the destination, however, I
kept it an error when an incoming bitmap or node alias is unknown.

This is in violation of Vladimir's (rightful) reasoning that such
errors should never break the already running target, but I decided to
do so anyway for a couple of reasons:

- Ignoring unmapped bitmaps on the source is trivial: We just have to
    not put them into the migration stream.
    On the destination, it isn't so easy: We (I think) would have to
    modify the code to actively skip the bitmap in the stream.
    (On the other hand, erroring out is always easy.)


Agree. Actually, my series "[PATCH v2 00/22] Fix error handling during
bitmap postcopy"
does something like this. But there's no sense in mixing such logic into your
series :)



- Incoming bitmaps with unknown names are already an error before this
    series.  So this isn't introducing new breakage.

- I think it makes sense to drop all bitmaps you don't want to migrate
    (or implicitly drop them by just not specifying them if you don't care
    about them) on the source.  I can't imagine a scenario where it would
    be really useful if the destination could silently drop unknown
    bitmaps.  Unknown bitmaps should just be dropped on the source.

- Choosing to make it an error now doesn't prevent us from relaxing that
    restriction in the future.


Agree, that's all OK. Still, we can at least not error out if we don't receive
some bitmap on the destination for which a mapping is set (I think you do
exactly this; I will see below).



---
   qapi/migration.json    |  92 +++-
   migration/migration.h  |   3 +
   migration/block-dirty-bitmap.c | 373 -
   migration/migration.c  |  30 +++
   monitor/hmp-cmds.c |  30 +++
   5 files changed, 473 insertions(+), 55 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index d5000558c6..1b0fbcef96 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -507,6 +507,44 @@


[..]


   #
+# @block-bitmap-mapping: Maps block nodes and bitmaps on them to
+#  aliases for the purpose of dirty bitmap migration.  Such
+#  aliases may for example be the corresponding names on the
+#  opposite site.
+#  The mapping must be one-to-one, and on the destination also
+#  complete: On the source, bitmaps on nodes where either the
+#  bitmap or the node is not mapped will be ignored.  In
+#  contrast, on the destination, receiving a bitmap (by alias)
+#  from a node (by alias) when either alias is not mapped will
+#  result in an error.


Still, not receiving some bitmap is not an error. That's good. I think it
should be mentioned here too (does it violate the "complete" term?).


Hm.  Well, from the implementation’s perspective, it obviously isn’t an
error, because there’s no list of bitmaps that’s transferred outside of
the bitmaps themselves; so if some bitmap isn’t transferred, the
destination of course never knows about it.  (And “complete” just means
that all received bitmaps must be mapped.)

So I never thought about mentioning that detail here.

How would we even go about documenting that?  “Note that the destination
does not know about bitmaps it does not receive, so there is no
limitation or requirement about the number of bitmaps received, or how
they are named, or on which nodes they are placed.”


The destination "knows" about the bitmaps to receive if block-bitmap-mapping is
set on the destination. Hm, but yes, the mapping is not a list of bitmaps to
receive, it's just a mapping. Maybe: "Note that it's not an error if the source
qemu doesn't find all bitmaps specified in the mapping, or if the destination
doesn't receive all such bitmaps"? But I'm OK without any additional note as
well.




+#  By default (when this parameter has never been set), bitmap
+#  names are mapped to themselves.  Nodes are mapped to their
+#  block device name if there is one, and to their node 
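For illustration, setting this parameter over QMP might look as follows (node, bitmap, and alias names here are invented; the shape follows the per-node mapping with a nested bitmap list described in this series):

```json
{ "execute": "migrate-set-parameters",
  "arguments": {
    "block-bitmap-mapping": [
      { "node-name": "drive0",
        "alias": "node-alias0",
        "bitmaps": [
          { "name": "bitmap0", "alias": "bitmap-alias0" }
        ] }
    ] } }
```

With such a mapping on the source, bitmap0 travels as bitmap-alias0 on node-alias0; the destination needs a matching entry translating the aliases back to its local names.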

Re: [PATCH v2 07/20] block/block-copy: add ratelimit to block-copy

2020-07-22 Thread Max Reitz
On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:
> We are going to directly use one async block-copy operation for backup
> job, so we need rate limitator.

%s/limitator/limiter/g, I think.

> We want to maintain current backup behavior: only background copying is
> limited and copy-before-write operations only participate in limit
> calculation. Therefore we need one rate limitator for block-copy state
> and boolean flag for block-copy call state for actual limitation.
> 
> Note, that we can't just calculate each chunk in limitator after
> successful copying: it will not save us from starting a lot of async
> sub-requests which will exceed limit too much. Instead let's use the
> following scheme on sub-request creation:
> 1. If at the moment limit is not exceeded, create the request and
> account it immediately.
> 2. If at the moment the limit is already exceeded, skip creating the
> sub-request and handle the limit instead (by sleeping).
> With this approach we'll never exceed the limit more than by one
> sub-request (which pretty much matches current backup behavior).

Sounds reasonable.

> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/block-copy.h |  8 +++
>  block/block-copy.c | 44 ++
>  2 files changed, 52 insertions(+)
> 
> diff --git a/include/block/block-copy.h b/include/block/block-copy.h
> index 600984c733..d40e691123 100644
> --- a/include/block/block-copy.h
> +++ b/include/block/block-copy.h
> @@ -59,6 +59,14 @@ BlockCopyCallState *block_copy_async(BlockCopyState *s,
>   int64_t max_chunk,
>   BlockCopyAsyncCallbackFunc cb);
>  
> +/*
> + * Set speed limit for block-copy instance. All block-copy operations 
> related to
> + * this BlockCopyState will participate in speed calculation, but only
> + * block_copy_async calls with @ratelimit=true will be actually limited.
> + */
> +void block_copy_set_speed(BlockCopyState *s, BlockCopyCallState *call_state,
> +  uint64_t speed);
> +
>  BdrvDirtyBitmap *block_copy_dirty_bitmap(BlockCopyState *s);
>  void block_copy_set_skip_unallocated(BlockCopyState *s, bool skip);
>  
> diff --git a/block/block-copy.c b/block/block-copy.c
> index 4114d1fd25..851d9c8aaf 100644
> --- a/block/block-copy.c
> +++ b/block/block-copy.c
> @@ -26,6 +26,7 @@
>  #define BLOCK_COPY_MAX_BUFFER (1 * MiB)
>  #define BLOCK_COPY_MAX_MEM (128 * MiB)
>  #define BLOCK_COPY_MAX_WORKERS 64
> +#define BLOCK_COPY_SLICE_TIME 100000000ULL /* ns */
>  
>  static coroutine_fn int block_copy_task_entry(AioTask *task);
>  
> @@ -36,11 +37,13 @@ typedef struct BlockCopyCallState {
>  int64_t bytes;
>  int max_workers;
>  int64_t max_chunk;
> +bool ratelimit;
>  BlockCopyAsyncCallbackFunc cb;
>  
>  /* State */
>  bool failed;
>  bool finished;
> +QemuCoSleepState *sleep_state;
>  
>  /* OUT parameters */
>  bool error_is_read;
> @@ -103,6 +106,9 @@ typedef struct BlockCopyState {
>  void *progress_opaque;
>  
>  SharedResource *mem;
> +
> +uint64_t speed;
> +RateLimit rate_limit;
>  } BlockCopyState;
>  
>  static BlockCopyTask *find_conflicting_task(BlockCopyState *s,
> @@ -611,6 +617,21 @@ block_copy_dirty_clusters(BlockCopyCallState *call_state)
>  }
>  task->zeroes = ret & BDRV_BLOCK_ZERO;
>  
> +if (s->speed) {
> +if (call_state->ratelimit) {
> +uint64_t ns = ratelimit_calculate_delay(&s->rate_limit, 0);
> +if (ns > 0) {
> +block_copy_task_end(task, -EAGAIN);
> +g_free(task);
> +qemu_co_sleep_ns_wakeable(QEMU_CLOCK_REALTIME, ns,
> +  &call_state->sleep_state);
> +continue;
> +}
> +}
> +
> +ratelimit_calculate_delay(&s->rate_limit, task->bytes);
> +}
> +

Looks good.

>  trace_block_copy_process(s, task->offset);
>  
>  co_get_from_shres(s->mem, task->bytes);
> @@ -649,6 +670,13 @@ out:
>  return ret < 0 ? ret : found_dirty;
>  }
>  
> +static void block_copy_kick(BlockCopyCallState *call_state)
> +{
> +if (call_state->sleep_state) {
> +qemu_co_sleep_wake(call_state->sleep_state);
> +}
> +}
> +
>  /*
>   * block_copy_common
>   *
> @@ -729,6 +757,7 @@ BlockCopyCallState *block_copy_async(BlockCopyState *s,
>  .s = s,
>  .offset = offset,
>  .bytes = bytes,
> +.ratelimit = ratelimit,

Hm, same problem/question as in patch 6: Should the @ratelimit parameter
really be added in patch 5 if it’s used only now?

>  .cb = cb,
>  .max_workers = max_workers ?: BLOCK_COPY_MAX_WORKERS,
>  .max_chunk = max_chunk,
> @@ -752,3 +781,18 @@ void block_copy_set_skip_unallocated(BlockCopyState *s, 
> bool skip)
>  {
>  s->skip_unallocated = skip;
>  }
> +
> +void 

Re: [PATCH 3/4] io/channel-socket: implement non-blocking connect

2020-07-22 Thread Vladimir Sementsov-Ogievskiy

20.07.2020 21:29, Daniel P. Berrangé wrote:

On Mon, Jul 20, 2020 at 09:07:14PM +0300, Vladimir Sementsov-Ogievskiy wrote:

Utilize new socket API to make a non-blocking connect for inet sockets.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/io/channel-socket.h | 14 +++
  io/channel-socket.c | 74 +
  2 files changed, 88 insertions(+)

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index 777ff5954e..82e868bc02 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -94,6 +94,20 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
  SocketAddress *addr,
  Error **errp);
  
+/**

+ * qio_channel_socket_connect_non_blocking_sync:
+ * @ioc: the socket channel object
+ * @addr: the address to connect to
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Attempt to connect to the address @addr using non-blocking mode of
+ * the socket. Function is synchronous, but being called from
+ * coroutine context will yield during connect operation.
+ */
+int qio_channel_socket_connect_non_blocking_sync(QIOChannelSocket *ioc,
+ SocketAddress *addr,
+ Error **errp);
+
  /**
   * qio_channel_socket_connect_async:
   * @ioc: the socket channel object
diff --git a/io/channel-socket.c b/io/channel-socket.c
index e1b4667087..076de7578a 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -22,6 +22,7 @@
  #include "qapi/error.h"
  #include "qapi/qapi-visit-sockets.h"
  #include "qemu/module.h"
+#include "qemu/sockets.h"
  #include "io/channel-socket.h"
  #include "io/channel-watch.h"
  #include "trace.h"
@@ -29,6 +30,8 @@
  
  #define SOCKET_MAX_FDS 16
  
+static int qio_channel_socket_close(QIOChannel *ioc, Error **errp);

+
  SocketAddress *
  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
   Error **errp)
@@ -157,6 +160,77 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
  return 0;
  }
  
+static int qio_channel_inet_connect_non_blocking_sync(QIOChannelSocket *ioc,

+InetSocketAddress *addr, Error **errp)
+{
+Error *local_err = NULL;
+struct addrinfo *infos, *info;
+int sock = -1;
+
+infos = inet_parse_connect_saddr(addr, errp);
+if (!infos) {
+return -1;
+}


This call is blocking since it calls getaddrinfo whose design
offers no ability to do non-blocking DNS lookups. Given this
call, ...


Oh, that's bad, thanks for taking a look at such an early stage!




+
+for (info = infos; info != NULL; info = info->ai_next) {
+bool in_progress;
+
+error_free(local_err);
+local_err = NULL;
+
+sock = inet_connect_addr(addr, info, false, &in_progress, &local_err);
+if (sock < 0) {
+continue;
+}
+
+if (qio_channel_socket_set_fd(ioc, sock, &local_err) < 0) {
+close(sock);
+continue;
+}
+
+if (in_progress) {
+if (qemu_in_coroutine()) {
+qio_channel_yield(QIO_CHANNEL(ioc), G_IO_OUT);
+} else {
+qio_channel_wait(QIO_CHANNEL(ioc), G_IO_OUT);
+}


...this is offering false assurances of being non-blocking.

If we don't want the current thread to be blocked then we
need to be using the existing qio_channel_socket_connect_async
method or similar. It uses a throw away background thread to
run the connection attempt, and then reports completion back
later, thus avoiding the getaddrinfo design flaw for the callers.

I explicitly didn't want to add a method like the impl in this
patch, because getaddrinfo dooms it and we already had bugs in
the pre-QIOChannel code where QEMU thought it was non-blocking
but wasn't due to getaddrinfo lookups.


IIUC, the main appeal of this method is that the non-blocking
nature is hidden from the caller who can continue to treat it
as a synchronous call and have the coroutine magic happen
behind the scenes.

IOW, what's needed is a simple way to run the operation in a
thread, and sleep for completion while having the coroutine
yield.

I think this could likely be achieved with QIOTask with an
alternate impl of the qio_task_wait_thread() method that is
friendly to coroutines instead of being based on pthread
condition variable waits.


The simplest thing is just to run qio_channel_socket_connect_sync in
a thread with the help of thread_pool_submit_co(), which is coroutine-friendly.
And this doesn't need any changes in io/channel.

Actually, I started with such a design, but decided it was better to use a
non-blocking connect, to avoid having to deal with cancelling the connecting
thread on shutdown.

I think, I'll resend based on thread_pool_submit_co().

===

Hmm, there is an async getaddrinfo_a() function. What do you think of it?
But it seems simpler to use a thread than move to 
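The thread-offload idea discussed here — the only reliable way around getaddrinfo()'s blocking design — reduces, outside QEMU's coroutine and thread-pool machinery, to a sketch like this (the coroutine yield/wake is only indicated in comments; AI_NUMERICHOST keeps the demo offline-safe):

```c
#include <netdb.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Worker context: run the blocking getaddrinfo() off the main thread, the
 * way a throwaway connect thread (or thread_pool_submit_co) would. */
struct lookup {
    const char *host, *port;
    int rc;
    struct addrinfo *res;
};

static void *resolve_thread(void *opaque)
{
    struct lookup *l = opaque;
    struct addrinfo hints = { .ai_family = AF_INET,
                              .ai_socktype = SOCK_STREAM,
                              .ai_flags = AI_NUMERICHOST };
    l->rc = getaddrinfo(l->host, l->port, &hints, &l->res);
    return NULL;
}

int main(void)
{
    struct lookup l = { .host = "127.0.0.1", .port = "10809", .rc = -1 };
    pthread_t tid;

    pthread_create(&tid, NULL, resolve_thread, &l);
    /* The main thread would normally keep running its event loop here;
     * a coroutine caller would yield until the worker completes. */
    pthread_join(tid, NULL);

    printf("%s\n", l.rc == 0 ? "resolved" : gai_strerror(l.rc));
    if (l.rc == 0) {
        freeaddrinfo(l.res);
    }
    return l.rc != 0;
}
```

The cancellation problem Vladimir mentions is visible here too: there is no portable way to interrupt the worker mid-getaddrinfo(), so a real implementation must either detach and abandon it, or wait for it on shutdown.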

Re: [PATCH v2 2/4] m25p80: Improve command handling for Jedec commands

2020-07-22 Thread Philippe Mathieu-Daudé
On 7/22/20 10:02 AM, Cédric Le Goater wrote:
> On 7/21/20 9:57 PM, Guenter Roeck wrote:
>> On 7/21/20 10:36 AM, Cédric Le Goater wrote:
>>> Hello,
>>>
>>> On 2/6/20 7:32 PM, Guenter Roeck wrote:
 When requesting JEDEC data using the JEDEC_READ command, the Linux kernel
 always requests 6 bytes. The current implementation only returns three
 bytes, and interprets the remaining three bytes as new commands.
 While this does not matter most of the time, it is at the very least
 confusing. To avoid the problem, always report up to 6 bytes of JEDEC
 data. Fill remaining data with 0.

 Signed-off-by: Guenter Roeck 
 ---
 v2: Split patch into two parts; improved description

  hw/block/m25p80.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

 diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
 index 5ff8d270c4..53bf63856f 100644
 --- a/hw/block/m25p80.c
 +++ b/hw/block/m25p80.c
 @@ -1040,8 +1040,11 @@ static void decode_new_cmd(Flash *s, uint32_t value)
  for (i = 0; i < s->pi->id_len; i++) {
  s->data[i] = s->pi->id[i];
  }
 +for (; i < SPI_NOR_MAX_ID_LEN; i++) {
 +s->data[i] = 0;
 +}
>>>
>>> This is breaking an old firmware (Linux version 2.6.28.9) for a SuperMicro
>>> board : 
>>>
>>> https://www.supermicro.com/en/products/motherboard/X11SSL-F 
>>>
>>> which expects the flash ID to repeat. Erik solved the problem with the 
>>> patch 
>>> below and I think it makes sense to wrap around. Does anyone know what
>>> the expected behavior should be?
>>>
>>
>> No idea what the expected behavior is, but I am fine with the code below
>> if it works.
> 
> I checked on a few real systems and indeed the mx25l25635e behaves
> differently.
> 
> AST2400
> 
> [5.594176] aspeed-smc 1e62.spi: reading JEDEC ID 20:BA:19:10:00:00
> [5.602226] aspeed-smc 1e62.spi: n25q256a (32768 Kbytes)
> ...
> [6.174052] aspeed-smc 1e63.spi: reading JEDEC ID C2:20:19:C2:20:19
> [6.181682] aspeed-smc 1e63.spi: mx25l25635e (32768 Kbytes)
> 
> AST2500
> 
> [1.641317] aspeed-smc 1e62.spi: reading JEDEC ID EF:40:19:00:00:00
> [1.648174] aspeed-smc 1e62.spi: w25q256 (32768 Kbytes)
> ...
> [1.179214] aspeed-smc 1e63.spi: reading JEDEC ID EF:40:19:00:00:00
> [1.186737] aspeed-smc 1e63.spi: w25q256 (32768 Kbytes)
> 
> AST2600
> 
> [1.020255] aspeed-smc 1e62.spi: reading JEDEC ID EF:40:20:00:00:00
> [1.027830] aspeed-smc 1e62.spi: w25q512jv (65536 Kbytes)
> ...
> [1.884171] aspeed-smc 1e63.spi: reading JEDEC ID EF:40:19:00:00:00
> [1.890937] aspeed-smc 1e63.spi: w25q256 (32768 Kbytes)

FWIW QEMU models this one as 64KiB.

> 
> 
> I think we need a special case for it.
> 
> C. 
> 




Re: [PATCH v2 06/20] block/block-copy: add max_chunk and max_workers parameters

2020-07-22 Thread Max Reitz
On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:
> They will be used for backup.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/block-copy.h |  5 +
>  block/block-copy.c | 10 --
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/include/block/block-copy.h b/include/block/block-copy.h
> index ada0d99566..600984c733 100644
> --- a/include/block/block-copy.h
> +++ b/include/block/block-copy.h
> @@ -47,6 +47,11 @@ int coroutine_fn block_copy(BlockCopyState *s, int64_t 
> offset, int64_t bytes,
>  /*
>   * Run block-copy in a coroutine, return state pointer. If finished early
>   * returns NULL (@cb is called anyway).
> + *
> + * @max_workers means maximum of parallel coroutines to execute sub-requests,
> + * must be > 0.
> + *
> + * @max_chunk means maximum length for one IO operation. Zero means 
> unlimited.
>   */
>  BlockCopyCallState *block_copy_async(BlockCopyState *s,
>   int64_t offset, int64_t bytes,

I only just now notice that @max_workers and @max_chunk were already
added in the previous patch, even though they aren’t used there.  Should
we defer adding them until this patch?
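For reference, the clamping added in block_copy_task_create() relies on the convention that zero means "no limit"; a small standalone sketch of that MIN_NON_ZERO logic (mirroring the semantics of QEMU's macro, not its actual C expansion, and using an invented `copy_size`):

```python
def min_non_zero(a, b):
    # Mirrors QEMU's MIN_NON_ZERO(): zero means "no limit", so a zero
    # operand never wins over a non-zero one.
    if a == 0:
        return b
    if b == 0 or b > a:
        return a
    return b

# block_copy_task_create() clamps the task size to the caller's
# max_chunk, where max_chunk == 0 leaves s->copy_size in charge:
copy_size = 1 << 20                               # hypothetical copy size
unlimited = min_non_zero(copy_size, 0)            # -> copy_size (1048576)
limited = min_non_zero(copy_size, 64 * 1024)      # -> caller's limit (65536)
```

So passing max_chunk=0 through block_copy_async() degrades gracefully to the previous behavior, unlike max_workers, which the doc comment says must be > 0.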

> diff --git a/block/block-copy.c b/block/block-copy.c
> index a0477d90f3..4114d1fd25 100644
> --- a/block/block-copy.c
> +++ b/block/block-copy.c
> @@ -34,6 +34,8 @@ typedef struct BlockCopyCallState {
>  BlockCopyState *s;
>  int64_t offset;
>  int64_t bytes;
> +int max_workers;
> +int64_t max_chunk;
>  BlockCopyAsyncCallbackFunc cb;
>  
>  /* State */
> @@ -144,10 +146,11 @@ static BlockCopyTask 
> *block_copy_task_create(BlockCopyState *s,
>   int64_t offset, int64_t bytes)
>  {
>  BlockCopyTask *task;
> +int64_t max_chunk = MIN_NON_ZERO(s->copy_size, call_state->max_chunk);
>  
>  if (!bdrv_dirty_bitmap_next_dirty_area(s->copy_bitmap,
> offset, offset + bytes,
> -   s->copy_size, &offset, &bytes))
> +   max_chunk, &offset, &bytes))
>  {
>  return NULL;
>  }
> @@ -616,7 +619,7 @@ block_copy_dirty_clusters(BlockCopyCallState *call_state)
>  bytes = end - offset;
>  
>  if (!aio && bytes) {
> -aio = aio_task_pool_new(BLOCK_COPY_MAX_WORKERS);
> +aio = aio_task_pool_new(call_state->max_workers);
>  }
>  
>  ret = block_copy_task_run(aio, task);
> @@ -695,6 +698,7 @@ int coroutine_fn block_copy(BlockCopyState *s, int64_t 
> start, int64_t bytes,
>  .s = s,
>  .offset = start,
>  .bytes = bytes,
> +.max_workers = BLOCK_COPY_MAX_WORKERS,
>  };
>  
>  int ret = block_copy_common(&call_state);
> @@ -726,6 +730,8 @@ BlockCopyCallState *block_copy_async(BlockCopyState *s,
>  .offset = offset,
>  .bytes = bytes,
>  .cb = cb,
> +.max_workers = max_workers ?: BLOCK_COPY_MAX_WORKERS,

I thought this must be > 0?

> +.max_chunk = max_chunk,
>  };
>  
>  qemu_coroutine_enter(co);
> 

And I now notice that there’s no newline after block_copy_async().
(Doesn’t concern this patch, of course, but the previous one.)

Max



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 0/3] migration: Add block-bitmap-mapping parameter

2020-07-22 Thread Dr. David Alan Gilbert
* Max Reitz (mre...@redhat.com) wrote:
> RFC v1: https://lists.nongnu.org/archive/html/qemu-block/2020-05/msg00912.html
> RFC v2: https://lists.nongnu.org/archive/html/qemu-block/2020-05/msg00915.html
> v1: https://lists.nongnu.org/archive/html/qemu-devel/2020-06/msg09792.html
> 
> Branch: https://github.com/XanClic/qemu.git migration-bitmap-mapping-v2
> Branch: https://git.xanclic.moe/XanClic/qemu.git migration-bitmap-mapping-v2
> 
> Based-on: <20200626130658.76498-1-vsement...@virtuozzo.com>
>   (“migration/block-dirty-bitmap: fix add_bitmaps_to_list”)
> 
> 
> Hi,
> 
> This new migration parameter allows mapping block node names and bitmap
> names to aliases for the purpose of block dirty bitmap migration.

One random thought is that you might find these block name aliases turn
out to be useful in other places, and an alias list may well turn out to
be generically useful.

Dave

> This way, management tools can use different node names on the source
> and destination and pass the mapping of how bitmaps are to be
> transferred to qemu (on the source, the destination, or even both with
> arbitrary aliases in the migration stream).
> 
> 
> v2:
> - Dropped what used to be patch 1 (the memleak fix), I see the exact
>   same fix has been sent concurrently and has been merged as
>   9728ebfb77f0159f4
> 
> - Patch 1:
>   - Changed documentation to clarify the default behavior
>   - s/5.1/5.2/
>   - Dropped dead assignment
>   - Fixed bitmaps_map memleak
>   - Assert that the bs_name given to add_bitmaps_to_list() must be the
> BDS’s node name if an alias_map is given
>   - On the source side, unmapped bitmaps are simply dropped without
> error
>   - On the destination side, unmapped aliases still result in errors
> (see patch 1 for a short explanation on my reasoning)
>   - Fixed a bug in qmp_query_migrate_parameters(): We have to clone
> s->parameters.block_bitmap_mapping, not
> params->block_bitmap_mapping, or the latter will just stay NULL (and
> so qmp_query_migrate_parameters() won’t return a result for the
> block-bitmap-mapping)
>   - Emit the mapping through HMP’s “info migrate_parameters”
>   - Return an error when trying to set the mapping through HMP (instead
> of just crashing because of an “assert(0)” signalling an unhandled
> migration parameter)
> 
> - Patch 3:
>   - Type alias for BlockBitmapMapping
>   - Check the result of “info migrate_parameters” whenever setting the
> block-bitmap-mapping (just to test the new formatting code)
>   - Catch the qemu.machine.AbnormalShutdown exception on the destination
> VM whenever the migration is expected to fail
> (necessary since commit ef5d474472426eda6abf81)
>   - Cases where we don’t set up a mapping for some bitmap on the source
> are now expected to succeed (without the bitmap being migrated)
> 
> 
> git-backport-diff against v1:
> 
> Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
> [down] : patch is downstream-only
> The flags [FC] indicate (F)unctional and (C)ontextual differences, 
> respectively
> 
> 001/3:[0117] [FC] 'migration: Add block-bitmap-mapping parameter'
> 002/3:[] [--] 'iotests.py: Add wait_for_runstate()'
> 003/3:[0202] [FC] 'iotests: Test node/bitmap aliases during migration'
> 
> 
> Max Reitz (3):
>   migration: Add block-bitmap-mapping parameter
>   iotests.py: Add wait_for_runstate()
>   iotests: Test node/bitmap aliases during migration
> 
>  qapi/migration.json|  92 +-
>  migration/migration.h  |   3 +
>  migration/block-dirty-bitmap.c | 373 
>  migration/migration.c  |  30 ++
>  monitor/hmp-cmds.c |  30 ++
>  tests/qemu-iotests/300 | 511 +
>  tests/qemu-iotests/300.out |   5 +
>  tests/qemu-iotests/group   |   1 +
>  tests/qemu-iotests/iotests.py  |   4 +
>  9 files changed, 994 insertions(+), 55 deletions(-)
>  create mode 100755 tests/qemu-iotests/300
>  create mode 100644 tests/qemu-iotests/300.out
> 
> -- 
> 2.26.2
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH v2 1/3] migration: Add block-bitmap-mapping parameter

2020-07-22 Thread Dr. David Alan Gilbert
* Max Reitz (mre...@redhat.com) wrote:
> This migration parameter allows mapping block node names and bitmap
> names to aliases for the purpose of block dirty bitmap migration.
> 
> This way, management tools can use different node and bitmap names on
> the source and destination and pass the mapping of how bitmaps are to be
> transferred to qemu (on the source, the destination, or even both with
> arbitrary aliases in the migration stream).
> 
> Suggested-by: Vladimir Sementsov-Ogievskiy 
> Signed-off-by: Max Reitz 

For both HMP and migration stuff:
Acked-by: Dr. David Alan Gilbert 

> ---
> Vladimir noted in v1 that it would be better to ignore bitmaps whose
> names aren't mapped, or that are on nodes whose names aren't mapped.
> One of the reasons he gave was that bitmap migration is inherently a
> form of postcopy migration, and we should not break the target when it
> is already running because of a bitmap issue.
> 
> So in this version, I changed the behavior to just ignore bitmaps
> without a mapping on the source side.  On the destination, however, I
> kept it an error when an incoming bitmap or node alias is unknown.
> 
> This is in violation of Vladimir's (rightful) reasoning that such
> errors should never break the already running target, but I decided to
> do so anyway for a couple of reasons:
> 
> - Ignoring unmapped bitmaps on the source is trivial: We just have to
>   not put them into the migration stream.
>   On the destination, it isn't so easy: We (I think) would have to
>   modify the code to actively skip the bitmap in the stream.
>   (On the other hand, erroring out is always easy.)
> 
> - Incoming bitmaps with unknown names are already an error before this
>   series.  So this isn't introducing new breakage.
> 
> - I think it makes sense to drop all bitmaps you don't want to migrate
>   (or implicitly drop them by just not specifying them if you don't care
>   about them) on the source.  I can't imagine a scenario where it would
>   be really useful if the destination could silently drop unknown
>   bitmaps.  Unknown bitmaps should just be dropped on the source.
> 
> - Choosing to make it an error now doesn't prevent us from relaxing that
>   restriction in the future.
> ---
>  qapi/migration.json|  92 +++-
>  migration/migration.h  |   3 +
>  migration/block-dirty-bitmap.c | 373 -
>  migration/migration.c  |  30 +++
>  monitor/hmp-cmds.c |  30 +++
>  5 files changed, 473 insertions(+), 55 deletions(-)
> 
> diff --git a/qapi/migration.json b/qapi/migration.json
> index d5000558c6..1b0fbcef96 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -507,6 +507,44 @@
>'data': [ 'none', 'zlib',
>  { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
>  
> +##
> +# @BitmapMigrationBitmapAlias:
> +#
> +# @name: The name of the bitmap.
> +#
> +# @alias: An alias name for migration (for example the bitmap name on
> +# the opposite site).
> +#
> +# Since: 5.1
> +##
> +{ 'struct': 'BitmapMigrationBitmapAlias',
> +  'data': {
> +  'name': 'str',
> +  'alias': 'str'
> +  } }
> +
> +##
> +# @BitmapMigrationNodeAlias:
> +#
> +# Maps a block node name and the bitmaps it has to aliases for dirty
> +# bitmap migration.
> +#
> +# @node-name: A block node name.
> +#
> +# @alias: An alias block node name for migration (for example the
> +# node name on the opposite site).
> +#
> +# @bitmaps: Mappings for the bitmaps on this node.
> +#
> +# Since: 5.1
> +##
> +{ 'struct': 'BitmapMigrationNodeAlias',
> +  'data': {
> +  'node-name': 'str',
> +  'alias': 'str',
> +  'bitmaps': [ 'BitmapMigrationBitmapAlias' ]
> +  } }
> +
>  ##
>  # @MigrationParameter:
>  #
> @@ -641,6 +679,21 @@
>  #  will consume more CPU.
>  #  Defaults to 1. (Since 5.0)
>  #
> +# @block-bitmap-mapping: Maps block nodes and bitmaps on them to
> +#  aliases for the purpose of dirty bitmap migration.  Such
> +#  aliases may for example be the corresponding names on the
> +#  opposite site.
> +#  The mapping must be one-to-one, and on the destination also
> +#  complete: On the source, bitmaps on nodes where either the
> +#  bitmap or the node is not mapped will be ignored.  In
> +#  contrast, on the destination, receiving a bitmap (by alias)
> +#  from a node (by alias) when either alias is not mapped will
> +#  result in an error.
> +#  By default (when this parameter has never been set), bitmap
> +#  names are mapped to themselves.  Nodes are mapped to their
> +#  block device name if there is one, and to their node name
> +#  otherwise. (Since 5.2)
> +#
>  # Since: 2.4
>  ##
>  { 'enum': 'MigrationParameter',
> @@ -655,7 +708,8 @@
> 'multifd-channels',
> 'xbzrle-cache-size', 'max-postcopy-bandwidth',
> 'max-cpu-throttle', 
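Concretely, a client would set this via migrate-set-parameters; a sketch of the value's shape as a Python literal mirroring the BitmapMigrationNodeAlias / BitmapMigrationBitmapAlias structs above (all node and bitmap names here are invented):

```python
# Shape of a 'block-bitmap-mapping' value matching the QAPI structs
# quoted in the patch: a list of node mappings, each carrying its own
# list of bitmap mappings.
mapping = [
    {
        'node-name': 'drive0',          # hypothetical source node name
        'alias': 'common-drive-alias',  # name used inside the migration stream
        'bitmaps': [
            {'name': 'bitmap0', 'alias': 'common-bitmap-alias'},
        ],
    },
]

qmp_command = {
    'execute': 'migrate-set-parameters',
    'arguments': {'block-bitmap-mapping': mapping},
}
```

Source and destination would each install their own mapping, translating their local names to the shared aliases and back.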

[PATCH for-5.2 v3 2/3] iotests.py: Add wait_for_runstate()

2020-07-22 Thread Max Reitz
Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/iotests.py | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 3590ed78a0..20645a6e7d 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -28,6 +28,7 @@ import signal
 import struct
 import subprocess
 import sys
+import time
 from typing import (Any, Callable, Dict, Iterable,
 List, Optional, Sequence, Tuple, TypeVar)
 import unittest
@@ -803,6 +804,10 @@ class VM(qtest.QEMUQtestMachine):
'Found node %s under %s (but expected %s)' % \
(node['name'], path, expected_node)
 
+def wait_for_runstate(self, runstate: str) -> None:
+while self.qmp('query-status')['return']['status'] != runstate:
+time.sleep(0.2)
+
 index_re = re.compile(r'([^\[]+)\[([^\]]+)\]')
 
 class QMPTestCase(unittest.TestCase):
-- 
2.26.2
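As a side note, the helper above polls query-status with no upper bound; a standalone sketch of the same polling pattern with a deadline guard added (the timeout and the fake QMP callable are my additions, not part of the patch):

```python
import time

def wait_for_runstate(query_status, runstate, timeout=30.0):
    # Poll a query-status callable (stand-in for self.qmp('query-status'))
    # until the VM reports the desired runstate, sleeping briefly between
    # polls, and fail loudly instead of hanging forever.
    deadline = time.monotonic() + timeout
    while query_status()['return']['status'] != runstate:
        if time.monotonic() > deadline:
            raise TimeoutError(f'VM never reached runstate {runstate!r}')
        time.sleep(0.2)

# Simulated QMP endpoint that reaches 'paused' on the second poll:
states = iter(['running', 'paused', 'paused'])
wait_for_runstate(lambda: {'return': {'status': next(states)}}, 'paused')
```

Without the guard, a migration that never reaches the expected runstate would hang the whole iotest until the harness kills it.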




[PATCH for-5.2 v3 3/3] iotests: Test node/bitmap aliases during migration

2020-07-22 Thread Max Reitz
Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/300 | 515 +
 tests/qemu-iotests/300.out |   5 +
 tests/qemu-iotests/group   |   1 +
 3 files changed, 521 insertions(+)
 create mode 100755 tests/qemu-iotests/300
 create mode 100644 tests/qemu-iotests/300.out

diff --git a/tests/qemu-iotests/300 b/tests/qemu-iotests/300
new file mode 100755
index 00..6d684e882a
--- /dev/null
+++ b/tests/qemu-iotests/300
@@ -0,0 +1,515 @@
+#!/usr/bin/env python3
+#
+# Copyright (C) 2020 Red Hat, Inc.
+#
+# Tests for dirty bitmaps migration with node aliases
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import random
+import re
+import time
+from typing import Dict, List, Optional, Union
+import iotests
+import qemu
+
+BlockBitmapMapping = List[Dict[str, Union[str, List[Dict[str, str]]]]]
+
+assert iotests.sock_dir is not None
+mig_sock = os.path.join(iotests.sock_dir, 'mig_sock')
+
+class TestDirtyBitmapMigration(iotests.QMPTestCase):
+src_node_name: str = ''
+dst_node_name: str = ''
+src_bmap_name: str = ''
+dst_bmap_name: str = ''
+
+def setUp(self) -> None:
+self.vm_a = iotests.VM(path_suffix='-a')
+self.vm_a.add_blockdev(f'node-name={self.src_node_name},' \
+   'driver=null-co')
+self.vm_a.launch()
+
+self.vm_b = iotests.VM(path_suffix='-b')
+self.vm_b.add_blockdev(f'node-name={self.dst_node_name},' \
+   'driver=null-co')
+self.vm_b.add_incoming(f'unix:{mig_sock}')
+self.vm_b.launch()
+
+result = self.vm_a.qmp('block-dirty-bitmap-add',
+   node=self.src_node_name,
+   name=self.src_bmap_name)
+self.assert_qmp(result, 'return', {})
+
+# Dirty some random megabytes
+for _ in range(9):
+mb_ofs = random.randrange(1024)
+self.vm_a.hmp_qemu_io(self.src_node_name, f'write {mb_ofs}M 1M')
+
+result = self.vm_a.qmp('x-debug-block-dirty-bitmap-sha256',
+   node=self.src_node_name,
+   name=self.src_bmap_name)
+self.bitmap_hash_reference = result['return']['sha256']
+
+caps = [{'capability': name, 'state': True} \
+for name in ('dirty-bitmaps', 'events')]
+
+for vm in (self.vm_a, self.vm_b):
+result = vm.qmp('migrate-set-capabilities', capabilities=caps)
+self.assert_qmp(result, 'return', {})
+
+def tearDown(self) -> None:
+self.vm_a.shutdown()
+self.vm_b.shutdown()
+try:
+os.remove(mig_sock)
+except OSError:
+pass
+
+def check_bitmap(self, bitmap_name_valid: bool) -> None:
+result = self.vm_b.qmp('x-debug-block-dirty-bitmap-sha256',
+   node=self.dst_node_name,
+   name=self.dst_bmap_name)
+if bitmap_name_valid:
+self.assert_qmp(result, 'return/sha256',
+self.bitmap_hash_reference)
+else:
+self.assert_qmp(result, 'error/desc',
+f"Dirty bitmap '{self.dst_bmap_name}' not found")
+
+def migrate(self, migration_success: bool = True,
+bitmap_name_valid: bool = True) -> None:
+result = self.vm_a.qmp('migrate', uri=f'unix:{mig_sock}')
+self.assert_qmp(result, 'return', {})
+
+status = None
+while status not in ('completed', 'failed'):
+status = self.vm_a.event_wait('MIGRATION')['data']['status']
+
+self.assertEqual(status == 'completed', migration_success)
+if status == 'failed':
+# Wait until the migration has been cleaned up
+# (Otherwise, bdrv_close_all() will abort because the
+# dirty bitmap migration code still holds a reference to
+# the BDS)
+# (Unfortunately, there does not appear to be a nicer way
+# of waiting until a failed migration has been cleaned up)
+timeout_msg = 'Timeout waiting for migration to be cleaned up'
+with iotests.Timeout(30, timeout_msg):
+while os.path.exists(mig_sock):
+time.sleep(0.1)
+
+# Dropping src_node_name will only work once the
+

[PATCH for-5.2 v3 1/3] migration: Add block-bitmap-mapping parameter

2020-07-22 Thread Max Reitz
This migration parameter allows mapping block node names and bitmap
names to aliases for the purpose of block dirty bitmap migration.

This way, management tools can use different node and bitmap names on
the source and destination and pass the mapping of how bitmaps are to be
transferred to qemu (on the source, the destination, or even both with
arbitrary aliases in the migration stream).

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
---
 qapi/migration.json| 104 -
 migration/migration.h  |   3 +
 migration/block-dirty-bitmap.c | 373 -
 migration/migration.c  |  30 +++
 monitor/hmp-cmds.c |  30 +++
 5 files changed, 485 insertions(+), 55 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index d5000558c6..d7e953cd73 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -507,6 +507,44 @@
   'data': [ 'none', 'zlib',
 { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
 
+##
+# @BitmapMigrationBitmapAlias:
+#
+# @name: The name of the bitmap.
+#
+# @alias: An alias name for migration (for example the bitmap name on
+# the opposite site).
+#
+# Since: 5.1
+##
+{ 'struct': 'BitmapMigrationBitmapAlias',
+  'data': {
+  'name': 'str',
+  'alias': 'str'
+  } }
+
+##
+# @BitmapMigrationNodeAlias:
+#
+# Maps a block node name and the bitmaps it has to aliases for dirty
+# bitmap migration.
+#
+# @node-name: A block node name.
+#
+# @alias: An alias block node name for migration (for example the
+# node name on the opposite site).
+#
+# @bitmaps: Mappings for the bitmaps on this node.
+#
+# Since: 5.1
+##
+{ 'struct': 'BitmapMigrationNodeAlias',
+  'data': {
+  'node-name': 'str',
+  'alias': 'str',
+  'bitmaps': [ 'BitmapMigrationBitmapAlias' ]
+  } }
+
 ##
 # @MigrationParameter:
 #
@@ -641,6 +679,25 @@
 #  will consume more CPU.
 #  Defaults to 1. (Since 5.0)
 #
+# @block-bitmap-mapping: Maps block nodes and bitmaps on them to
+#  aliases for the purpose of dirty bitmap migration.  Such
+#  aliases may for example be the corresponding names on the
+#  opposite site.
+#  The mapping must be one-to-one, and on the destination also
+#  complete: On the source, bitmaps on nodes where either the
+#  bitmap or the node is not mapped will be ignored.  In
+#  contrast, on the destination, receiving a bitmap (by alias)
+#  from a node (by alias) when either alias is not mapped will
+#  result in an error.
+#  Note that the destination does not know about bitmaps it
+#  does not receive, so there is no limitation or requirement
+#  regarding the number of bitmaps received, or how they are
+#  named, or on which nodes they are placed.
+#  By default (when this parameter has never been set), bitmap
+#  names are mapped to themselves.  Nodes are mapped to their
+#  block device name if there is one, and to their node name
+#  otherwise. (Since 5.2)
+#
 # Since: 2.4
 ##
 { 'enum': 'MigrationParameter',
@@ -655,7 +712,8 @@
'multifd-channels',
'xbzrle-cache-size', 'max-postcopy-bandwidth',
'max-cpu-throttle', 'multifd-compression',
-   'multifd-zlib-level' ,'multifd-zstd-level' ] }
+   'multifd-zlib-level' ,'multifd-zstd-level',
+   'block-bitmap-mapping' ] }
 
 ##
 # @MigrateSetParameters:
@@ -781,6 +839,25 @@
 #  will consume more CPU.
 #  Defaults to 1. (Since 5.0)
 #
+# @block-bitmap-mapping: Maps block nodes and bitmaps on them to
+#  aliases for the purpose of dirty bitmap migration.  Such
+#  aliases may for example be the corresponding names on the
+#  opposite site.
+#  The mapping must be one-to-one, and on the destination also
+#  complete: On the source, bitmaps on nodes where either the
+#  bitmap or the node is not mapped will be ignored.  In
+#  contrast, on the destination, receiving a bitmap (by alias)
+#  from a node (by alias) when either alias is not mapped will
+#  result in an error.
+#  Note that the destination does not know about bitmaps it
+#  does not receive, so there is no limitation or requirement
+#  regarding the number of bitmaps received, or how they are
+#  named, or on which nodes they are placed.
+#  By default (when this parameter has never been set), bitmap
+#  names are mapped to themselves.  Nodes are mapped to their
+#  block device name if there is one, and to their node name
+#  otherwise. (Since 5.2)
+#
 # Since: 2.4
 ##
 # TODO either fuse back into MigrationParameters, or make
@@ -811,7 +888,8 @@
 '*max-cpu-throttle': 'int',
 '*multifd-compression': 'MultiFDCompression',
 '*multifd-zlib-level': 'int',
-  

[PATCH for-5.2 v3 0/3] migration: Add block-bitmap-mapping parameter

2020-07-22 Thread Max Reitz
RFC v1: https://lists.nongnu.org/archive/html/qemu-block/2020-05/msg00912.html
RFC v2: https://lists.nongnu.org/archive/html/qemu-block/2020-05/msg00915.html
v1: https://lists.nongnu.org/archive/html/qemu-devel/2020-06/msg09792.html
v2: https://lists.nongnu.org/archive/html/qemu-block/2020-07/msg01179.html

Branch: https://github.com/XanClic/qemu.git migration-bitmap-mapping-v3
Branch: https://git.xanclic.moe/XanClic/qemu.git migration-bitmap-mapping-v3

Hi,

This new migration parameter allows mapping block node names and bitmap
names to aliases for the purpose of block dirty bitmap migration.

This way, management tools can use different node names on the source
and destination and pass the mapping of how bitmaps are to be
transferred to qemu (on the source, the destination, or even both with
arbitrary aliases in the migration stream).


v3:
- Patch 1:
  - Add notes on the fact that the destination won’t ever know about any
bitmaps that aren’t sent, so you can freely drop what you want, and
you’re completely free in renaming bitmaps and putting them on
“other” nodes (whatever “other” node means in the context of
migration, because that’s kind of one of the problems this series is
trying to solve: The fact that you can’t trivially match nodes
between source and destination)
  - Fix an assertion

- Patch 2: s/pass/time.sleep(0.2)/

- Patch 3:
  - Add copyright line
  - Use format string instead of %
  - s/pass/time.sleep(0.1)/
  - s/wait_for_runstate/wait_migration/ on the destination to wait for
the migration to actually complete
  - Replace the “info migrate_parameters” parsing code by a multiline
regex
  - Test what happens when the destination has a mapping that isn’t used
because there are no bitmaps to be transferred (which breaks the
assertion in patch 1 as it was in v2)
  - Let verify_dest_has_all_bitmaps() actually verify the bitmaps on the
destination instead of the source


git-backport-diff against v2:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/3:[0014] [FC] 'migration: Add block-bitmap-mapping parameter'
002/3:[0003] [FC] 'iotests.py: Add wait_for_runstate()'
003/3:[0046] [FC] 'iotests: Test node/bitmap aliases during migration'


Max Reitz (3):
  migration: Add block-bitmap-mapping parameter
  iotests.py: Add wait_for_runstate()
  iotests: Test node/bitmap aliases during migration

 qapi/migration.json| 104 ++-
 migration/migration.h  |   3 +
 migration/block-dirty-bitmap.c | 373 
 migration/migration.c  |  30 ++
 monitor/hmp-cmds.c |  30 ++
 tests/qemu-iotests/300 | 515 +
 tests/qemu-iotests/300.out |   5 +
 tests/qemu-iotests/group   |   1 +
 tests/qemu-iotests/iotests.py  |   5 +
 9 files changed, 1011 insertions(+), 55 deletions(-)
 create mode 100755 tests/qemu-iotests/300
 create mode 100644 tests/qemu-iotests/300.out

-- 
2.26.2




Re: [PATCH v2 2/4] m25p80: Improve command handling for Jedec commands

2020-07-22 Thread Cédric Le Goater
On 7/21/20 9:57 PM, Guenter Roeck wrote:
> On 7/21/20 10:36 AM, Cédric Le Goater wrote:
>> Hello,
>>
>> On 2/6/20 7:32 PM, Guenter Roeck wrote:
>>> When requesting JEDEC data using the JEDEC_READ command, the Linux kernel
>>> always requests 6 bytes. The current implementation only returns three
>>> bytes, and interprets the remaining three bytes as new commands.
>>> While this does not matter most of the time, it is at the very least
>>> confusing. To avoid the problem, always report up to 6 bytes of JEDEC
>>> data. Fill remaining data with 0.
>>>
>>> Signed-off-by: Guenter Roeck 
>>> ---
>>> v2: Split patch into two parts; improved description
>>>
>>>  hw/block/m25p80.c | 5 -
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
>>> index 5ff8d270c4..53bf63856f 100644
>>> --- a/hw/block/m25p80.c
>>> +++ b/hw/block/m25p80.c
>>> @@ -1040,8 +1040,11 @@ static void decode_new_cmd(Flash *s, uint32_t value)
>>>  for (i = 0; i < s->pi->id_len; i++) {
>>>  s->data[i] = s->pi->id[i];
>>>  }
>>> +for (; i < SPI_NOR_MAX_ID_LEN; i++) {
>>> +s->data[i] = 0;
>>> +}
>>
>> This is breaking an old firmware (Linux version 2.6.28.9) for a SuperMicro
>> board : 
>>
>>  https://www.supermicro.com/en/products/motherboard/X11SSL-F 
>>
>> which expects the flash ID to repeat. Erik solved the problem with the patch 
>> below and I think it makes sense to wrap around. Does anyone know what
>> the expected behavior should be?
>>
> 
> No idea what the expected behavior is, but I am fine with the code below
> if it works.

I checked on a few real systems and indeed the mx25l25635e behaves
differently.

AST2400

[5.594176] aspeed-smc 1e62.spi: reading JEDEC ID 20:BA:19:10:00:00
[5.602226] aspeed-smc 1e62.spi: n25q256a (32768 Kbytes)
...
[6.174052] aspeed-smc 1e63.spi: reading JEDEC ID C2:20:19:C2:20:19
[6.181682] aspeed-smc 1e63.spi: mx25l25635e (32768 Kbytes)

AST2500

[1.641317] aspeed-smc 1e62.spi: reading JEDEC ID EF:40:19:00:00:00
[1.648174] aspeed-smc 1e62.spi: w25q256 (32768 Kbytes)
...
[1.179214] aspeed-smc 1e63.spi: reading JEDEC ID EF:40:19:00:00:00
[1.186737] aspeed-smc 1e63.spi: w25q256 (32768 Kbytes)

AST2600

[1.020255] aspeed-smc 1e62.spi: reading JEDEC ID EF:40:20:00:00:00
[1.027830] aspeed-smc 1e62.spi: w25q512jv (65536 Kbytes)
...
[1.884171] aspeed-smc 1e63.spi: reading JEDEC ID EF:40:19:00:00:00
[1.890937] aspeed-smc 1e63.spi: w25q256 (32768 Kbytes)


I think we need a special case for it.

C. 
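The two candidate behaviors are easy to model side by side; a sketch using the mx25l25635e ID bytes from the boot log above (zero-fill as in this patch, wrap-around as in Erik's fix):

```python
SPI_NOR_MAX_ID_LEN = 6

def jedec_zero_fill(flash_id):
    # Behavior of this patch: pad the 6-byte JEDEC response with zeroes.
    return [flash_id[i] if i < len(flash_id) else 0
            for i in range(SPI_NOR_MAX_ID_LEN)]

def jedec_wrap_around(flash_id):
    # Behavior the mx25l25635e shows on real hardware: the ID repeats,
    # giving C2:20:19:C2:20:19 as in the AST2400 boot log above.
    return [flash_id[i % len(flash_id)] for i in range(SPI_NOR_MAX_ID_LEN)]

mx25l25635e_id = [0xC2, 0x20, 0x19]
zero_filled = jedec_zero_fill(mx25l25635e_id)   # C2 20 19 00 00 00
wrapped = jedec_wrap_around(mx25l25635e_id)     # C2 20 19 C2 20 19
```

Since the log shows other chips (n25q256a, w25q256) padding with zeroes, a per-chip special case for the wrap-around parts, rather than a global change to either behavior, matches the hardware evidence.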



Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-22 Thread Klaus Jensen
@keith, please see below - can you comment on the Linux kernel 2 MB
boundary requirement for the CMB? Or should we hail Stephen (or Logan
maybe) since this seems to be related to p2pdma?

On Jul 21 14:54, Andrzej Jakowski wrote:
> On 7/15/20 1:06 AM, Klaus Jensen wrote:
> > Hi Andrzej,
> > 
> > I've not been ignoring this, but sorry for not following up earlier.
> > 
> > I'm hesitant to merge anything that very obviously breaks an OS that we
> > know is used a lot with this device. Also because the issue has
> > not been analyzed well enough to actually know if this is a QEMU or
> > kernel issue.
> 
> Hi Klaus,
> 
> Thx for your response! I understand your hesitance about merging stuff that
> obviously breaks the guest OS. 
> 
> > 
> > Now, as far as I can test, having the MSI-X vector table and PBA in BAR
> > 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
> > (regardless of IOMMU on/off).
> > 
> > Later, when the issue is better understood, we can add options to set
> > offsets, BIRs etc.
> > 
> > The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
> > be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
> > git://git.infradead.org/qemu-nvme.git nvme-next branch.
> > 
> > Can you reproduce the issues with that patch? I can't on a stock Arch
> > Linux 5.7.5-arch1-1 kernel.
> 
> While I'm happy that the approach with MSI-X and PBA in BAR0 works fine, I
> feel that the investigation of why it works while mine doesn't is
> missing. It looks to me that both patches follow basically the same
> approach: create a memory subregion and overlay it on top of another memory
> region. Why does one work while the other doesn't?
> 
> With that in mind, I have recently focused on understanding the problem.
> I observed that when the guest assigns an address to BAR4, the addr field
> in the nvme-bar4 memory region gets populated, but it doesn't get
> automatically populated in the ctrl_mem (CMB) memory subregion, so later,
> when nvme_addr_is_cmb() is called, the address check works incorrectly
> and as a consequence the VMM does a DMA read instead of a memcpy.
> I created a patch that sets the correct address on the ctrl_mem subregion,
> and the guest OS boots up correctly.
> 
> When I looked into the PCI and memory region code, I noticed that the
> address is indeed only assigned to the top-level memory region, but not to
> the contained subregions. I think your approach works fine because in it
> the CMB grabs the whole BAR exclusively.
> 
> Here is my question (perhaps the PCI folks can help answer :)): if we
> consider memory region overlapping for PCI devices a valid use case,
> should the PCI code, on configuration cycles, walk through all contained
> subregions and update their addr fields accordingly?
> 
> Thx!
> 

Hi Andrzej,

Thanks for looking into this. I think your analysis helped me nail this.
The problem is that we added the use of a subregion and have some
assumptions that no longer hold.

nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
But when the memory region is a subregion, addr holds an offset into the
parent container instead. Thus, changing all occurrences of
n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
(this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
in your original patch[1]. The reason my version worked is because there
was no subregion involved for the CMB, so the existing address
validation calculations were still correct.

This leaves us with the Linux kernel complaining about not being able to
register the CMB if it is not on a 2MB boundary - this is probably just
a constraint in the kernel that we can't do anything about (but I'm no
kernel hacker...), which can be fixed by either being "nice" towards the
Linux kernel by forcing a 2 MB alignment in the device or exposing the
SZU field such that the user can choose 16MiB size units (or higher)
instead of 1MiB. I'm leaning towards ensuring the 2 MB alignment in the
device such that we do not have to introduce new cmb_size parameters,
while also making it easier for the user to configure. But I'm not
really sure.

  [1]: Message-Id: <20200701214858.28515-3-andrzej.jakow...@linux.intel.com>
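The corrected address check, and the proposed 2 MiB alignment, can be sketched numerically; all BAR addresses, offsets, and sizes below are invented, and align_up stands in for whatever alignment helper the device model would actually use:

```python
# Sketch of the address fix: when the CMB MemoryRegion is a subregion,
# its .addr is an offset into the parent BAR, so validity checks must
# add the parent's absolute address first.
MiB = 1024 * 1024

bar0_addr = 0xfe000000          # hypothetical guest-assigned BAR base
cmb_offset = 4 * MiB            # ctrl_mem.addr: offset within the BAR
cmb_size = 16 * MiB

def nvme_addr_is_cmb(addr):
    # Equivalent of the corrected check: compare against
    # (n->bar0.addr + n->ctrl_mem.addr), not ctrl_mem.addr alone.
    lo = bar0_addr + cmb_offset
    return lo <= addr < lo + cmb_size

def align_up(x, a):
    # Round x up to a power-of-two boundary a, e.g. the 2 MiB boundary
    # the Linux p2pdma code appears to require for the CMB.
    return (x + a - 1) & ~(a - 1)

first_cmb_byte_ok = nvme_addr_is_cmb(bar0_addr + cmb_offset)  # True
raw_offset_ok = nvme_addr_is_cmb(cmb_offset)                  # False: offset alone is not an address
aligned_offset = align_up(3 * MiB, 2 * MiB)                   # 4 MiB
```

The second check is exactly the failure mode described above: with the buggy code, a guest address was compared against the bare offset, so CMB addresses were misclassified and the controller fell back to DMA reads.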