Re: [PATCH v2 17/24] vhost: vhost-user: update vhost_dev_virtqueue_restart()

2022-08-23 Thread Kangjie Xu



On 2022/8/24 12:03, Jason Wang wrote:


On 2022/8/16 09:06, Kangjie Xu wrote:

Update vhost_dev_virtqueue_restart() for vhost-user scenario.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/virtio/vhost.c | 26 ++
  1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index a0d6824353..bd90cfe13a 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1937,11 +1937,29 @@ int vhost_dev_virtqueue_restart(struct vhost_dev *hdev, VirtIODevice *vdev,
                                 int idx)
 {
     const VhostOps *vhost_ops = hdev->vhost_ops;
+    int r;
 
     assert(vhost_ops);
 
-    return vhost_virtqueue_start(hdev,
-                                 vdev,
-                                 hdev->vqs + idx,
-                                 hdev->vq_index + idx);
+    r = vhost_virtqueue_start(hdev,
+                              vdev,
+                              hdev->vqs + idx,
+                              hdev->vq_index + idx);
+    if (r < 0) {
+        goto err_start;
+    }
+
+    if (vhost_ops->vhost_set_single_vring_enable) {
+        r = vhost_ops->vhost_set_single_vring_enable(hdev,
+                                                     hdev->vq_index + idx,
+                                                     1);



Rethinking the names, I think the current ones are kind of confusing:

    .vhost_set_single_vring_enable = vhost_user_set_single_vring_enable,

    .vhost_set_vring_enable = vhost_user_set_vring_enable,

I wonder if it's better:

1) rename vhost_set_vring_enable to vhost_set_dev_enable()

then we can

2) use vhost_set_vring_enable() for per vq enabling?

Thanks
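To make the proposal concrete, here is a minimal sketch of what the reshuffled ops table could look like. The struct and stub implementations are hypothetical, reduced versions for illustration only, not the actual QEMU `VhostOps`:

```c
#include <assert.h>
#include <stddef.h>

struct vhost_dev;

/* Hypothetical, reduced VhostOps after the rename proposed above: the old
 * device-wide vhost_set_vring_enable becomes vhost_set_dev_enable, which
 * frees the vring-granular name for per-vq enabling. */
typedef struct VhostOpsSketch {
    /* was vhost_set_vring_enable: enables/disables every vq of the dev */
    int (*vhost_set_dev_enable)(struct vhost_dev *dev, int enable);
    /* per-vq enabling, as needed for vq reset */
    int (*vhost_set_vring_enable)(struct vhost_dev *dev, int index,
                                  int enable);
} VhostOpsSketch;

/* Stub backends, standing in for the vhost-user implementations. */
static int set_dev_enable_stub(struct vhost_dev *dev, int enable)
{
    (void)dev;
    return enable;          /* echo the flag to show which path ran */
}

static int set_vring_enable_stub(struct vhost_dev *dev, int index, int enable)
{
    (void)dev;
    return (index >= 0 && (enable == 0 || enable == 1)) ? 0 : -1;
}

static const VhostOpsSketch user_ops_sketch = {
    .vhost_set_dev_enable = set_dev_enable_stub,
    .vhost_set_vring_enable = set_vring_enable_stub,
};
```

With this split, the restart path in vhost_dev_virtqueue_restart() would call the per-vq op, while the existing device-wide users move to the renamed op.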



I agree. It looks better, I will refactor this part.

Thanks.


+        if (r < 0) {
+            goto err_start;
+        }
+    }
+
+    return 0;
+
+err_start:
+    return r;
 }




Re: [PATCH v2 16/24] vhost: vhost-user: update vhost_dev_virtqueue_stop()

2022-08-23 Thread Kangjie Xu


On 2022/8/24 11:56, Jason Wang wrote:


On 2022/8/16 09:06, Kangjie Xu wrote:

Update vhost_dev_virtqueue_stop() for vhost-user scenario.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/virtio/vhost.c | 19 +++
  1 file changed, 19 insertions(+)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index fc3f550c76..a0d6824353 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1908,10 +1908,29 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
 void vhost_dev_virtqueue_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
                               int idx)
 {
+    const VhostOps *vhost_ops = hdev->vhost_ops;
+    struct vhost_vring_state state;
+    int r;
+
+    assert(vhost_ops);
+
+    if (vhost_ops->vhost_reset_vring) {
+        state.index = hdev->vq_index + idx;
+        r = vhost_ops->vhost_reset_vring(hdev, &state);
+        if (r < 0) {
+            goto err_queue_reset;
+        }
+    }
+



So this worries me:

1) having two similar functions

vhost_virtqueue_stop()

and

vhost_dev_virtqueue_stop()

It can easily confuse people who want to stop the device.

I think we need to rename vhost_dev_virtqueue_stop() to 
vhost_virtqueue_reset(), since it has different semantics compared to stop:


1) stop means the virtqueue state is preserved, e.g. the index can be 
synced via get_vring_base()

2) reset means the virtqueue state is lost

Thanks
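The stop-vs-reset distinction can be sketched with a toy model. The struct below is hypothetical and heavily simplified; it only illustrates which state survives each operation, not the real vhost data structures:

```c
#include <assert.h>

/* Hypothetical, simplified per-queue state. */
struct vq_sketch {
    unsigned int last_avail_idx;
    int enabled;
};

/* stop: the queue is disabled but its state is preserved; the index can
 * still be handed back to the device model, as get_vring_base() does. */
static unsigned int vq_stop(struct vq_sketch *vq)
{
    vq->enabled = 0;
    return vq->last_avail_idx;
}

/* reset: the queue is disabled and its state is lost. */
static void vq_reset(struct vq_sketch *vq)
{
    vq->enabled = 0;
    vq->last_avail_idx = 0;
}
```

Under this model, a caller of stop can later restart from the returned index, while a caller of reset must re-initialize the queue from scratch.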



Totally agree, will fix.

Thanks


  vhost_virtqueue_unmap(hdev,
                        vdev,
                        hdev->vqs + idx,
                        idx);
+
+    return;
+
+err_queue_reset:
+    error_report("Error when stopping the queue.");
  }
    int vhost_dev_virtqueue_restart(struct vhost_dev *hdev, 
VirtIODevice *vdev,

Re: [PATCH v2 18/24] vhost-net: vhost-user: update vhost_net_virtqueue_stop()

2022-08-23 Thread Kangjie Xu



On 2022/8/24 12:05, Jason Wang wrote:


On 2022/8/16 09:06, Kangjie Xu wrote:

Update vhost_net_virtqueue_stop() for vhost-user scenario.



Let's explain why it is needed now, whether it causes any issue, and 
whether it's a bug fix.


Thanks


This patch is to support vq reset for vhost-user.

We need this simply because the behavior of vhost_ops->get_vq_index() is 
different in vhost-user and vhost-kernel.


vhost_user_get_vq_index(dev, idx) simply returns "idx".

vhost_kernel_get_vq_index(dev, idx) returns "idx - dev->vq_index".

Thanks
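The difference described above can be made concrete with a small sketch. The struct is a reduced stand-in, not the real `vhost_dev`; the real implementations live in hw/virtio/vhost-user.c and hw/virtio/vhost-backend.c:

```c
#include <assert.h>

/* Reduced vhost_dev, just enough to show the indexing difference. */
struct vhost_dev_sketch {
    int vq_index;   /* first absolute vq handled by this vhost device */
    int nvqs;       /* number of vqs it handles */
};

/* vhost-user: backend messages carry absolute queue indexes, so the
 * index is returned unchanged. */
static int user_get_vq_index(struct vhost_dev_sketch *dev, int idx)
{
    assert(idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs);
    return idx;
}

/* vhost-kernel: ioctls take indexes relative to the vhost device, so
 * the device's base index is subtracted. */
static int kernel_get_vq_index(struct vhost_dev_sketch *dev, int idx)
{
    assert(idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs);
    return idx - dev->vq_index;
}
```

This is why the patch subtracts `net->dev.vq_index` again for vhost-user before calling vhost_dev_virtqueue_stop(): the absolute index returned by the vhost-user op must be converted back to a device-relative one.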





Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/net/vhost_net.c | 4 ++++
  1 file changed, 4 insertions(+)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 2ab67e875e..c0d408f3b4 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -533,6 +533,10 @@ void vhost_net_virtqueue_stop(VirtIODevice *vdev, NetClientState *nc,

  assert(r >= 0);
  }
  +    if (net->nc->info->type == NET_CLIENT_DRIVER_VHOST_USER) {
+    idx = idx - net->dev.vq_index;
+    }
+
  vhost_dev_virtqueue_stop(&net->dev, vdev, idx);
  }




Re: [PATCH 4/5] virtio-net: Update virtio-net curr_queue_pairs in vdpa backends

2022-08-23 Thread Jason Wang



On 2022/8/20 01:13, Eugenio Pérez wrote:

It was returned as error before. Instead of it, simply update the
corresponding field so qemu can send it in the migration data.

Signed-off-by: Eugenio Pérez 
---



Looks correct.

Adding Si Wei for double check.

Thanks



  hw/net/virtio-net.c | 17 ++---
  1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index dd0d056fde..63a8332cd0 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1412,19 +1412,14 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd,
  return VIRTIO_NET_ERR;
  }
  
-/* Avoid changing the number of queue_pairs for vdpa device in

- * userspace handler. A future fix is needed to handle the mq
- * change in userspace handler with vhost-vdpa. Let's disable
- * the mq handling from userspace for now and only allow get
- * done through the kernel. Ripples may be seen when falling
- * back to userspace, but without doing it qemu process would
- * crash on a recursive entry to virtio_net_set_status().
- */
+n->curr_queue_pairs = queue_pairs;
  if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-return VIRTIO_NET_ERR;
+/*
+ * Avoid updating the backend for a vdpa device: We're only interested
+ * in updating the device model queues.
+ */
+return VIRTIO_NET_OK;
  }
-
-n->curr_queue_pairs = queue_pairs;
  /* stop the backend before changing the number of queue_pairs to avoid 
handling a
   * disabled queue */
  virtio_net_set_status(vdev, vdev->status);





Re: [PATCH 2/5] vdpa: Add vhost_vdpa_net_load_mq

2022-08-23 Thread Jason Wang



On 2022/8/20 01:13, Eugenio Pérez wrote:

Same way as with the MAC, restore the expected number of queues at
device's start.

Signed-off-by: Eugenio Pérez 
---
  net/vhost-vdpa.c | 33 +
  1 file changed, 33 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 1e0dbfcced..96fd3bc835 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -391,6 +391,35 @@ static int vhost_vdpa_net_load_mac(VhostVDPAState *s,
  return 0;
  }
  
+static int vhost_vdpa_net_load_mq(VhostVDPAState *s,

+  const VirtIONet *n)
+{
+uint64_t features = n->parent_obj.guest_features;
+ssize_t dev_written;
+void *cursor = s->cvq_cmd_out_buffer;
+if (!(features & BIT_ULL(VIRTIO_NET_F_MQ))) {
+return 0;
+}
+
+*(struct virtio_net_ctrl_hdr *)cursor = (struct virtio_net_ctrl_hdr) {
+.class = VIRTIO_NET_CTRL_MQ,
+.cmd = VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET,
+};
+cursor += sizeof(struct virtio_net_ctrl_hdr);
+*(struct virtio_net_ctrl_mq *)cursor = (struct virtio_net_ctrl_mq) {
+.virtqueue_pairs = cpu_to_le16(n->curr_queue_pairs),
+};



Such casting is not elegant, let's just prepare buffer and then do the 
copy inside vhost_vdpa_net_cvq_add()?




+cursor += sizeof(struct virtio_net_ctrl_mq);
+
+dev_written = vhost_vdpa_net_cvq_add(s, cursor - s->cvq_cmd_out_buffer,
+ sizeof(virtio_net_ctrl_ack));
+if (unlikely(dev_written < 0)) {
+return dev_written;
+}
+
+return *((virtio_net_ctrl_ack *)s->cvq_cmd_in_buffer) != VIRTIO_NET_OK;



So I think we should have a dedicated buffer just for ack, then there's 
no need for such casting.


Thanks
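The review suggestion above, preparing the command in local structs and copying once instead of casting a moving cursor, could look roughly like this. The structs and constants are hypothetical reduced stand-ins mirroring the virtio-net control virtqueue layout (VIRTIO_NET_CTRL_MQ is class 4, VQ_PAIRS_SET is command 0 in the spec); the real types live in the virtio headers:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reduced stand-ins for the virtio-net ctrl structs.
 * "class_" avoids clashing with C++ tooling; the real field is "class". */
struct ctrl_hdr_sketch { uint8_t class_; uint8_t cmd; };
struct ctrl_mq_sketch  { uint16_t virtqueue_pairs; };

/* Build the MQ command in local structs and copy into the out buffer in
 * one place, so no caller needs to cast a cursor pointer. pairs_le is
 * assumed already converted to little-endian by the caller. */
static size_t build_mq_cmd(uint8_t *out, uint16_t pairs_le)
{
    const struct ctrl_hdr_sketch hdr = {
        .class_ = 4,   /* VIRTIO_NET_CTRL_MQ */
        .cmd = 0,      /* VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET */
    };
    const struct ctrl_mq_sketch mq = { .virtqueue_pairs = pairs_le };

    memcpy(out, &hdr, sizeof(hdr));
    memcpy(out + sizeof(hdr), &mq, sizeof(mq));
    return sizeof(hdr) + sizeof(mq);
}
```

A dedicated ack buffer, as suggested, would similarly let the caller read the `virtio_net_ctrl_ack` status without casting `cvq_cmd_in_buffer`.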



+}
+
  static int vhost_vdpa_net_load(NetClientState *nc)
  {
  VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
@@ -409,6 +438,10 @@ static int vhost_vdpa_net_load(NetClientState *nc)
  if (unlikely(r < 0)) {
  return r;
  }
+r = vhost_vdpa_net_load_mq(s, n);
+if (unlikely(r)) {
+return r;
+}
  
  return 0;

  }





Re: [PATCH 1/5] vdpa: extract vhost_vdpa_net_load_mac from vhost_vdpa_net_load

2022-08-23 Thread Jason Wang



On 2022/8/20 01:13, Eugenio Pérez wrote:

Since there may be many commands we need to issue to load the NIC
state, let's split them in individual functions

Signed-off-by: Eugenio Pérez 



Acked-by: Jason Wang 



---
  net/vhost-vdpa.c | 39 +--
  1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 97b658f412..1e0dbfcced 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -363,21 +363,10 @@ static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
  return vhost_svq_poll(svq);
  }
  
-static int vhost_vdpa_net_load(NetClientState *nc)

+static int vhost_vdpa_net_load_mac(VhostVDPAState *s,
+  const VirtIONet *n)
  {
-VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
-const struct vhost_vdpa *v = &s->vhost_vdpa;
-const VirtIONet *n;
-uint64_t features;
-
-assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
-
-if (!v->shadow_vqs_enabled) {
-return 0;
-}
-
-n = VIRTIO_NET(v->dev->vdev);
-features = n->parent_obj.guest_features;
+uint64_t features = n->parent_obj.guest_features;
  if (features & BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR)) {
  const struct virtio_net_ctrl_hdr ctrl = {
  .class = VIRTIO_NET_CTRL_MAC,
@@ -402,6 +391,28 @@ static int vhost_vdpa_net_load(NetClientState *nc)
  return 0;
  }
  
+static int vhost_vdpa_net_load(NetClientState *nc)

+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_vdpa *v = &s->vhost_vdpa;
+const VirtIONet *n;
+int r;
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+if (!v->shadow_vqs_enabled) {
+return 0;
+}
+
+n = VIRTIO_NET(v->dev->vdev);
+r = vhost_vdpa_net_load_mac(s, n);
+if (unlikely(r < 0)) {
+return r;
+}
+
+return 0;
+}
+
  static NetClientInfo net_vhost_vdpa_cvq_info = {
  .type = NET_CLIENT_DRIVER_VHOST_VDPA,
  .size = sizeof(VhostVDPAState),





Re: [PATCH 00/24] Support VIRTIO_F_RING_RESET for virtio-net, vhost-user, vhost-kernel in virtio pci-modern

2022-08-23 Thread Jason Wang



On 2022/8/16 09:06, Kangjie Xu wrote:

Host Kernel Patch:
 
https://github.com/middaywords/linux/commit/19a91e0d7167b2031e46078c6215c213b89cb2c3



Btw, this patch could be posted right now. Then it would be easier 
for people to try your series.


Thanks




Re: [PATCH 00/24] Support VIRTIO_F_RING_RESET for virtio-net, vhost-user, vhost-kernel in virtio pci-modern

2022-08-23 Thread Jason Wang



On 2022/8/16 09:06, Kangjie Xu wrote:

The virtio queue reset function has already been defined in the virtio spec 1.2.
The relevant virtio spec information is here:

 https://github.com/oasis-tcs/virtio-spec/issues/124
 https://github.com/oasis-tcs/virtio-spec/issues/139

This patch set is to support this function in QEMU. It consists of several 
parts:
1. Patches 1-7 are the basic interfaces for vq reset in virtio and virtio-pci.
2. Patches 8-12 support vq stop and vq restart for vhost-kernel.
3. Patches 13-19 support vq stop and vq restart for vhost-user.
4. Patches 20-22 support vq reset and re-enable for virtio-net.
5. Patches 23-24 enable the vq reset feature for vhost-kernel and vhost-user.

The process of virtqueue reset can be concluded as:
1. The virtqueue is disabled when VIRTIO_PCI_COMMON_Q_RESET is written.
2. Then the virtqueue can be optionally restarted(re-enabled).

Since this patch set involves multiple modules and seems a bit messy, we 
briefly describe the
calling process for different modes below.
virtio-net:
1. VIRTIO_PCI_COMMON_Q_RESET is written [virtio-pci]
 -> virtio_queue_reset() [virtio]
 -> virtio_net_queue_reset() [virtio-net]
 -> __virtio_queue_reset()
2. VIRTIO_PCI_COMMON_Q_ENABLE is written [virtio-pci]
 -> set enabled, reset status of vq.
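The virtio-net-only flow above can be condensed into a toy register model. This is a hypothetical sketch of the observable state transitions, not the QEMU virtio-pci code:

```c
#include <assert.h>

/* Hypothetical condensed vq state as seen through the PCI registers. */
struct pci_vq_sketch {
    int enabled;
    int reset;                    /* what VIRTIO_PCI_COMMON_Q_RESET reads back */
    unsigned int last_avail_idx;
};

/* Write to VIRTIO_PCI_COMMON_Q_RESET: disable the vq and drop its state,
 * as __virtio_queue_reset() does in the flow above. */
static void q_reset_write(struct pci_vq_sketch *vq)
{
    vq->enabled = 0;
    vq->reset = 1;
    vq->last_avail_idx = 0;       /* state is lost */
}

/* Write to VIRTIO_PCI_COMMON_Q_ENABLE: "set enabled, reset status of vq". */
static void q_enable_write(struct pci_vq_sketch *vq)
{
    vq->enabled = 1;
    vq->reset = 0;
}
```

The vhost-kernel and vhost-user flows below insert backend calls (set_backend, RESET_VRING, SET_VRING_ENABLE) between these two register writes, but the guest-visible state machine is the same.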

vhost-kernel:
1. VIRTIO_PCI_COMMON_Q_RESET is written [virtio-pci]
 -> virtio_queue_reset() [virtio]
 -> virtio_net_queue_reset() [virtio-net]
 -> vhost_net_virtqueue_stop() [vhost-net]
 -> vhost_net_set_backend() [vhost]
 -> vhost_dev_virtqueue_stop()
 -> vhost_virtqueue_unmap()
 -> __virtio_queue_reset()
2. VIRTIO_PCI_COMMON_Q_ENABLE is written [virtio-pci]
 -> virtio_queue_enable() [virtio]
 -> virtio_net_queue_enable() [virtio-net]
 -> vhost_net_virtqueue_restart() [vhost-net]
 -> vhost_dev_virtqueue_restart() [vhost]
 -> vhost_virtqueue_start()
 -> vhost_net_set_backend()
 -> set enabled, reset status of vq.

vhost-user:
1. VIRTIO_PCI_COMMON_Q_RESET is written [virtio-pci]
 -> virtio_queue_reset() [virtio]
 -> virtio_net_queue_reset() [virtio-net]
 -> vhost_net_virtqueue_stop() [vhost-net]
 -> vhost_dev_virtqueue_stop() [vhost]
 -> vhost_user_reset_vring() [vhost-user]
 -> send VHOST_USER_RESET_VRING to the device
 -> vhost_virtqueue_unmap()
 -> __virtio_queue_reset()
2. VIRTIO_PCI_COMMON_Q_ENABLE is written [virtio-pci]
 -> virtio_queue_enable() [virtio]
 -> virtio_net_queue_enable() [virtio-net]
 -> vhost_net_virtqueue_restart() [vhost-net]
 -> vhost_dev_virtqueue_restart() [vhost]
 -> vhost_virtqueue_start()
 -> vhost_user_set_single_vring_enable [vhost-user]
 -> send VHOST_USER_SET_VRING_ENABLE to the device
 -> set enabled, reset status of vq.


Test environment:
 Host: 5.19.0-rc3 (With vq reset support)
 Qemu: QEMU emulator version 7.0.50
 Guest: 5.19.0-rc3 (With vq reset support)
 DPDK: 22.07-rc1 (With vq reset support)
 Test Cmd: ethtool -g eth1; ethtool -G eth1 rx $1 tx $2; ethtool -g eth1;

 The driver can resize the virtio queue; then the virtio queue reset
 function should be triggered.

 The default is split mode, modify Qemu virtio-net to add PACKED feature to
 test packed mode.

Guest Kernel Patch:
 
https://lore.kernel.org/bpf/20220801063902.129329-1-xuanz...@linux.alibaba.com/

DPDK Patch:
 
https://github.com/middaywords/dpdk/compare/72206323a5dd3182b13f61b25a64abdddfee595c...eabadfac7953da66bc10ffb8284b490d09bb7ec7

Host Kernel Patch:
 
https://github.com/middaywords/linux/commit/19a91e0d7167b2031e46078c6215c213b89cb2c3

Looking forward to your review and comments. Thanks.



I suggest splitting the vhost-user part into another series since:

1) vhost-net kernel is ready for this feature

2) vhost-user probably requires some extensions of the current protocol

Then it can speed up the merging and reduce the rebasing of a huge series.

Thanks




Kangjie Xu (19):
   virtio: introduce virtio_queue_enable()
   virtio: core: vq reset feature negotiation support
   virtio-pci: support queue enable
   vhost: extract the logic of unmapping the vrings and desc
   vhost: introduce vhost_dev_virtqueue_stop()
   vhost: introduce vhost_dev_virtqueue_restart()
   vhost-net: vhost-kernel: introduce vhost_net_virtqueue_stop()
   vhost-net: vhost-kernel: introduce vhost_net_virtqueue_restart()
   docs: vhost-user: add VHOST_USER_RESET_VRING message
   vhost-user: introduce vhost_reset_vring() interface
   vhost-user: add op to enable or disable a single vring
   vhost: vhost-user: update vhost_dev_virtqueue_stop()
   vhost: vhost-user: update vhost_dev_virtqueue_restart()
   

Re: [PATCH v2 19/24] vhost-net: vhost-user: update vhost_net_virtqueue_restart()

2022-08-23 Thread Jason Wang



On 2022/8/16 09:06, Kangjie Xu wrote:

Update vhost_net_virtqueue_restart() for vhost-user scenario.



Similar to the previous patch, let's explain why it is needed.

Thanks




Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/net/vhost_net.c | 4 ++++
  1 file changed, 4 insertions(+)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index c0d408f3b4..778081e54a 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -556,6 +556,10 @@ int vhost_net_virtqueue_restart(VirtIODevice *vdev, NetClientState *nc,
  
  idx = vhost_ops->vhost_get_vq_index(&net->dev, vq_index);
  
+if (net->nc->info->type == NET_CLIENT_DRIVER_VHOST_USER) {

+idx = idx - net->dev.vq_index;
+}
+
  r = vhost_dev_virtqueue_restart(&net->dev, vdev, idx);
  if (r < 0) {
  goto err_start;





Re: [PATCH v2 11/24] vhost-net: vhost-kernel: introduce vhost_net_virtqueue_stop()

2022-08-23 Thread Kangjie Xu



On 2022/8/24 10:40, Jason Wang wrote:


On 2022/8/16 09:06, Kangjie Xu wrote:

Introduce vhost_virtqueue_stop(), which can reset the virtqueue
in the device. Then it will unmap vrings and the desc of the
virtqueue.

This patch only considers the case for vhost-kernel, when
NetClientDriver is NET_CLIENT_DRIVER_TAP.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/net/vhost_net.c  | 21 +
  include/net/vhost_net.h |  2 ++
  2 files changed, 23 insertions(+)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index ccac5b7a64..aa60dd901c 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -514,3 +514,24 @@ int vhost_net_set_mtu(struct vhost_net *net, uint16_t mtu)

    return vhost_ops->vhost_net_set_mtu(&net->dev, mtu);
  }
+
+void vhost_net_virtqueue_stop(VirtIODevice *vdev, NetClientState *nc,
+  int vq_index)
+{
+    VHostNetState *net = get_vhost_net(nc->peer);
+    const VhostOps *vhost_ops = net->dev.vhost_ops;
+    struct vhost_vring_file file = { .fd = -1 };
+    int idx;
+
+    assert(vhost_ops);
+
+    idx = vhost_ops->vhost_get_vq_index(&net->dev, vq_index);
+
+    if (net->nc->info->type == NET_CLIENT_DRIVER_TAP) {
+        file.index = idx;
+        int r = vhost_net_set_backend(&net->dev, &file);
+        assert(r >= 0);
+    }



Let's have a vhost_ops here instead of open-coding it.

Thanks

I double-checked it; vhost_net_set_backend() is already a wrapper around 
vhost_ops->vhost_net_set_backend().


It seems that, to simplify it further, we could just add idx and fd to the 
parameter list of vhost_net_set_backend().


Thanks
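The simplification suggested above could look roughly like this. All names and types below are hypothetical reduced stand-ins, sketching the proposed signature change rather than the real QEMU code:

```c
#include <stddef.h>

/* Hypothetical reduced types standing in for the real vhost structs. */
struct vring_file_sketch { unsigned int index; int fd; };
struct net_dev_sketch { int dummy; };

/* Stand-in for vhost_ops->vhost_net_set_backend(): pretend the ioctl
 * succeeds for any fd >= -1 (-1 means "detach backend"). */
static int backend_op_sketch(struct net_dev_sketch *dev,
                             struct vring_file_sketch *file)
{
    (void)dev;
    return file->fd >= -1 ? 0 : -1;
}

/* Proposed refactor: take idx and fd directly and build the
 * vhost_vring_file internally, so callers such as
 * vhost_net_virtqueue_stop() no longer open-code the struct. */
static int set_backend_sketch(struct net_dev_sketch *dev,
                              unsigned int idx, int fd)
{
    struct vring_file_sketch file = { .index = idx, .fd = fd };
    return backend_op_sketch(dev, &file);
}
```

With this shape, the stop path would reduce to a single call like `set_backend_sketch(dev, idx, -1)` to detach the backend from one queue.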




+
+    vhost_dev_virtqueue_stop(&net->dev, vdev, idx);
+}
diff --git a/include/net/vhost_net.h b/include/net/vhost_net.h
index 387e913e4e..9b3aaf3814 100644
--- a/include/net/vhost_net.h
+++ b/include/net/vhost_net.h
@@ -48,4 +48,6 @@ uint64_t vhost_net_get_acked_features(VHostNetState *net);

    int vhost_net_set_mtu(struct vhost_net *net, uint16_t mtu);
+void vhost_net_virtqueue_stop(VirtIODevice *vdev, NetClientState *nc,
+                              int vq_index);
  #endif




Re: [PATCH v2 0/4] hw/arm/virt: Improve address assignment for high memory regions

2022-08-23 Thread Gavin Shan

Hi Marc,

On 8/15/22 4:29 PM, Gavin Shan wrote:

There are three high memory regions, which are VIRT_HIGH_REDIST2,
VIRT_HIGH_PCIE_ECAM and VIRT_HIGH_PCIE_MMIO. Their base addresses
are floating on highest RAM address. However, they can be disabled
in several cases.
 
(1) One specific high memory region is disabled by developer by

 toggling vms->highmem_{redists, ecam, mmio}.
 
(2) VIRT_HIGH_PCIE_ECAM region is disabled on machine, which is

 'virt-2.12' or earlier.
 
(3) VIRT_HIGH_PCIE_ECAM region is disabled when firmware is loaded

 on 32-bits system.
 
(4) One specific high memory region is disabled when it breaks the

 PA space limit.
 
The current implementation of virt_set_memmap() isn't comprehensive

because the space for one specific high memory region is always
reserved from the PA space for case (1), (2) and (3). In the code,
'base' and 'vms->highest_gpa' are always increased for those three
cases. It's unnecessary since the assigned space of the disabled
high memory region won't be used afterwards.

The series intends to improve the address assignment for these
high memory regions:

PATCH[1] and PATCH[2] are cleanup and preparatory works.
PATCH[3] improves address assignment for these high memory regions
PATCH[4] moves the address assignment logic into standalone helper

History
===
v1: https://lists.nongnu.org/archive/html/qemu-arm/2022-08/msg00013.html

Changelog
=
v2:
   * Split the patches for easier review (Gavin)
   * Improved changelog (Marc)
   * Use 'bool fits' in virt_set_high_memmap()  (Eric)



Could you help to review when you have free cycles? It's just a kind
ping :)

Thanks,
Gavin



Gavin Shan (4):
   hw/arm/virt: Rename variable size to region_size in virt_set_memmap()
   hw/arm/virt: Introduce variable region_base in virt_set_memmap()
   hw/arm/virt: Improve address assignment for high memory regions
   virt/hw/virt: Add virt_set_high_memmap() helper

  hw/arm/virt.c | 84 ++-
  1 file changed, 50 insertions(+), 34 deletions(-)






Re: [PATCH v2 15/24] vhost-user: add op to enable or disable a single vring

2022-08-23 Thread Kangjie Xu



On 2022/8/24 10:53, Jason Wang wrote:


On 2022/8/16 09:06, Kangjie Xu wrote:

The interface to set the enable status for a single vring is lacking in
VhostOps, since vhost_set_vring_enable_op manipulates all
virtqueues in a device.

Resetting a single vq will rely on this interface.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/virtio/vhost-user.c    | 26 +++---
  include/hw/virtio/vhost-backend.h |  3 +++
  2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 56033f7a92..8307976cda 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1199,6 +1199,22 @@ static int vhost_user_set_vring_base(struct vhost_dev *dev,

  return vhost_set_vring(dev, VHOST_USER_SET_VRING_BASE, ring);
  }
  +static int vhost_user_set_single_vring_enable(struct vhost_dev *dev,
+  int index,
+  int enable)
+{
+    if (index < dev->vq_index || index >= dev->vq_index + dev->nvqs) {
+        return -EINVAL;
+    }
+
+    struct vhost_vring_state state = {
+        .index = index,
+        .num   = enable,
+    };
+
+    return vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state);
+}
+
  static int vhost_user_set_vring_enable(struct vhost_dev *dev, int 
enable)

  {
  int i;
@@ -1208,13 +1224,8 @@ static int vhost_user_set_vring_enable(struct 
vhost_dev *dev, int enable)

  }
    for (i = 0; i < dev->nvqs; ++i) {
-        int ret;
-        struct vhost_vring_state state = {
-            .index = dev->vq_index + i,
-            .num   = enable,
-        };
-
-        ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state);



Then I'd squash this into the previous patch, or re-order so that this patch 
(vhost_user_set_single_vring_enable()) comes first.


Thanks


Sorry, I don't get why we should re-order them, since these two patches 
are independent.


Thanks

+        int ret = vhost_user_set_single_vring_enable(dev, dev->vq_index + i,
+                                                     enable);
  if (ret < 0) {
  /*
   * Restoring the previous state is likely infeasible, 
as well as

@@ -2668,6 +2679,7 @@ const VhostOps user_ops = {
  .vhost_reset_vring = vhost_user_reset_vring,
  .vhost_reset_device = vhost_user_reset_device,
  .vhost_get_vq_index = vhost_user_get_vq_index,
+    .vhost_set_single_vring_enable = vhost_user_set_single_vring_enable,

  .vhost_set_vring_enable = vhost_user_set_vring_enable,
  .vhost_requires_shm_log = vhost_user_requires_shm_log,
  .vhost_migration_done = vhost_user_migration_done,
diff --git a/include/hw/virtio/vhost-backend.h 
b/include/hw/virtio/vhost-backend.h

index f23bf71a8d..38f6b752ff 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -83,6 +83,8 @@ typedef int (*vhost_reset_vring_op)(struct 
vhost_dev *dev,

  struct vhost_vring_state *ring);
  typedef int (*vhost_reset_device_op)(struct vhost_dev *dev);
  typedef int (*vhost_get_vq_index_op)(struct vhost_dev *dev, int idx);
+typedef int (*vhost_set_single_vring_enable_op)(struct vhost_dev *dev,
+                                                int index, int enable);
  typedef int (*vhost_set_vring_enable_op)(struct vhost_dev *dev,
   int enable);
  typedef bool (*vhost_requires_shm_log_op)(struct vhost_dev *dev);
@@ -158,6 +160,7 @@ typedef struct VhostOps {
  vhost_reset_device_op vhost_reset_device;
  vhost_reset_vring_op vhost_reset_vring;
  vhost_get_vq_index_op vhost_get_vq_index;
+    vhost_set_single_vring_enable_op vhost_set_single_vring_enable;
  vhost_set_vring_enable_op vhost_set_vring_enable;
  vhost_requires_shm_log_op vhost_requires_shm_log;
  vhost_migration_done_op vhost_migration_done;




Re: [PATCH v2 14/24] vhost-user: introduce vhost_reset_vring() interface

2022-08-23 Thread Kangjie Xu



On 2022/8/24 10:50, Jason Wang wrote:


On 2022/8/16 09:06, Kangjie Xu wrote:

Introduce the interface vhost_reset_vring(). The interface is a wrapper
to send a VHOST_USER_RESET_VRING message to the back-end. It will reset
an individual vring in the back-end. Meanwhile, it will wait for a reply
to ensure the reset has been completed.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/virtio/vhost-user.c    | 41 +++
  include/hw/virtio/vhost-backend.h |  3 +++
  2 files changed, 44 insertions(+)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 75b8df21a4..56033f7a92 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -126,6 +126,7 @@ typedef enum VhostUserRequest {
  VHOST_USER_GET_MAX_MEM_SLOTS = 36,
  VHOST_USER_ADD_MEM_REG = 37,
  VHOST_USER_REM_MEM_REG = 38,
+    VHOST_USER_RESET_VRING = 41,
  VHOST_USER_MAX
  } VhostUserRequest;
@@ -1498,6 +1499,45 @@ static int vhost_user_get_max_memslots(struct vhost_dev *dev,

  return 0;
  }
  +static int vhost_user_reset_vring(struct vhost_dev *dev,
+  struct vhost_vring_state *ring)
+{
+    int ret;
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_RESET_VRING,
+        .hdr.flags = VHOST_USER_VERSION,



Do we need VHOST_USER_NEED_REPLY_MASK here?

Other looks good.

Thanks

No, it works the same as vhost_user_get_features() or 
vhost_user_get_vring_base().


Thanks




+        .payload.state = *ring,
+        .hdr.size = sizeof(msg.payload.state),
+    };
+
+    if (!virtio_has_feature(dev->acked_features, VIRTIO_F_RING_RESET)) {
+        return -ENOTSUP;
+    }
+
+    ret = vhost_user_write(dev, &msg, NULL, 0);
+    if (ret < 0) {
+    return ret;
+    }
+
+    ret = vhost_user_read(dev, &msg);
+    if (ret < 0) {
+    return ret;
+    }
+
+    if (msg.hdr.request != VHOST_USER_RESET_VRING) {
+        error_report("Received unexpected msg type. Expected %d received %d",
+                     VHOST_USER_RESET_VRING, msg.hdr.request);
+        return -EPROTO;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.state)) {
+        error_report("Received bad msg size.");
+        return -EPROTO;
+    }
+
+    return 0;
+}
+
  static int vhost_user_reset_device(struct vhost_dev *dev)
  {
  VhostUserMsg msg = {
@@ -2625,6 +2665,7 @@ const VhostOps user_ops = {
  .vhost_set_features = vhost_user_set_features,
  .vhost_get_features = vhost_user_get_features,
  .vhost_set_owner = vhost_user_set_owner,
+    .vhost_reset_vring = vhost_user_reset_vring,
  .vhost_reset_device = vhost_user_reset_device,
  .vhost_get_vq_index = vhost_user_get_vq_index,
  .vhost_set_vring_enable = vhost_user_set_vring_enable,
diff --git a/include/hw/virtio/vhost-backend.h 
b/include/hw/virtio/vhost-backend.h

index eab46d7f0b..f23bf71a8d 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -79,6 +79,8 @@ typedef int (*vhost_get_features_op)(struct 
vhost_dev *dev,

   uint64_t *features);
  typedef int (*vhost_set_backend_cap_op)(struct vhost_dev *dev);
  typedef int (*vhost_set_owner_op)(struct vhost_dev *dev);
+typedef int (*vhost_reset_vring_op)(struct vhost_dev *dev,
+    struct vhost_vring_state *ring);
  typedef int (*vhost_reset_device_op)(struct vhost_dev *dev);
  typedef int (*vhost_get_vq_index_op)(struct vhost_dev *dev, int idx);
  typedef int (*vhost_set_vring_enable_op)(struct vhost_dev *dev,
@@ -154,6 +156,7 @@ typedef struct VhostOps {
  vhost_set_backend_cap_op vhost_set_backend_cap;
  vhost_set_owner_op vhost_set_owner;
  vhost_reset_device_op vhost_reset_device;
+    vhost_reset_vring_op vhost_reset_vring;
  vhost_get_vq_index_op vhost_get_vq_index;
  vhost_set_vring_enable_op vhost_set_vring_enable;
  vhost_requires_shm_log_op vhost_requires_shm_log;




Re: [PATCH v2 15/24] vhost-user: add op to enable or disable a single vring

2022-08-23 Thread Jason Wang



在 2022/8/16 09:06, Kangjie Xu 写道:

The interface to set enable status for a single vring is lacked in
VhostOps, since the vhost_set_vring_enable_op will manipulate all
virtqueues in a device.

Resetting a single vq will rely on this interface.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/virtio/vhost-user.c| 26 +++---
  include/hw/virtio/vhost-backend.h |  3 +++
  2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 56033f7a92..8307976cda 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1199,6 +1199,22 @@ static int vhost_user_set_vring_base(struct vhost_dev 
*dev,
  return vhost_set_vring(dev, VHOST_USER_SET_VRING_BASE, ring);
  }
  
+static int vhost_user_set_single_vring_enable(struct vhost_dev *dev,

+  int index,
+  int enable)
+{
+if (index < dev->vq_index || index >= dev->vq_index + dev->nvqs) {
+return -EINVAL;
+}
+
+struct vhost_vring_state state = {
+.index = index,
+.num   = enable,
+};
+
+return vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state);
+}
+
  static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
  {
  int i;
@@ -1208,13 +1224,8 @@ static int vhost_user_set_vring_enable(struct vhost_dev 
*dev, int enable)
  }
  
  for (i = 0; i < dev->nvqs; ++i) {

-int ret;
-struct vhost_vring_state state = {
-.index = dev->vq_index + i,
-.num   = enable,
-};
-
-ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state);



Then I'd squash this into the previous patch, or re-order so that this patch 
(vhost_user_set_single_vring_enable()) comes first.


Thanks



+int ret = vhost_user_set_single_vring_enable(dev, dev->vq_index + i,
+ enable);
  if (ret < 0) {
  /*
   * Restoring the previous state is likely infeasible, as well as
@@ -2668,6 +2679,7 @@ const VhostOps user_ops = {
  .vhost_reset_vring = vhost_user_reset_vring,
  .vhost_reset_device = vhost_user_reset_device,
  .vhost_get_vq_index = vhost_user_get_vq_index,
+.vhost_set_single_vring_enable = vhost_user_set_single_vring_enable,
  .vhost_set_vring_enable = vhost_user_set_vring_enable,
  .vhost_requires_shm_log = vhost_user_requires_shm_log,
  .vhost_migration_done = vhost_user_migration_done,
diff --git a/include/hw/virtio/vhost-backend.h 
b/include/hw/virtio/vhost-backend.h
index f23bf71a8d..38f6b752ff 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -83,6 +83,8 @@ typedef int (*vhost_reset_vring_op)(struct vhost_dev *dev,
  struct vhost_vring_state *ring);
  typedef int (*vhost_reset_device_op)(struct vhost_dev *dev);
  typedef int (*vhost_get_vq_index_op)(struct vhost_dev *dev, int idx);
+typedef int (*vhost_set_single_vring_enable_op)(struct vhost_dev *dev,
+int index, int enable);
  typedef int (*vhost_set_vring_enable_op)(struct vhost_dev *dev,
   int enable);
  typedef bool (*vhost_requires_shm_log_op)(struct vhost_dev *dev);
@@ -158,6 +160,7 @@ typedef struct VhostOps {
  vhost_reset_device_op vhost_reset_device;
  vhost_reset_vring_op vhost_reset_vring;
  vhost_get_vq_index_op vhost_get_vq_index;
+vhost_set_single_vring_enable_op vhost_set_single_vring_enable;
  vhost_set_vring_enable_op vhost_set_vring_enable;
  vhost_requires_shm_log_op vhost_requires_shm_log;
  vhost_migration_done_op vhost_migration_done;





Re: [PATCH v2 12/24] vhost-net: vhost-kernel: introduce vhost_net_virtqueue_restart()

2022-08-23 Thread Kangjie Xu



在 2022/8/24 10:44, Jason Wang 写道:


在 2022/8/16 09:06, Kangjie Xu 写道:

Introduce vhost_net_virtqueue_restart(), which can restart a
virtqueue when vhost-net is already running. If it fails
to restart the virtqueue, the device will be stopped.

This patch only considers the case for vhost-kernel, when
NetClientDriver is NET_CLIENT_DRIVER_TAP.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 



I would explain why the current vhost_net_start_one()/vhost_net_stop_one() 
can't work. Is it because it works at the queue pair level? If yes, can we 
restructure the code and try to reuse it?


Thanks


Because vhost_net_start_one()/vhost_net_stop_one() works at device level.

The queue pair level start/stop are vhost_virtqueue_start() and 
vhost_virtqueue_stop().


What we can reuse is the vhost_virtqueue_start(). vhost_virtqueue_stop() 
cannot be reused because it will destroy device.


I think we do not need to restructure because we've already had an 
abstraction vhost_virtqueue_start().


Thanks.




---
  hw/net/vhost_net.c  | 48 +
  include/net/vhost_net.h |  2 ++
  2 files changed, 50 insertions(+)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index aa60dd901c..2ab67e875e 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -535,3 +535,51 @@ void vhost_net_virtqueue_stop(VirtIODevice 
*vdev, NetClientState *nc,

    vhost_dev_virtqueue_stop(>dev, vdev, idx);
  }
+
+int vhost_net_virtqueue_restart(VirtIODevice *vdev, NetClientState *nc,
+    int vq_index)
+{
+    VHostNetState *net = get_vhost_net(nc->peer);
+    const VhostOps *vhost_ops = net->dev.vhost_ops;
+    struct vhost_vring_file file = { };
+    int idx, r;
+
+    if (!net->dev.started) {
+    return 0;
+    }
+
+    assert(vhost_ops);
+
+    idx = vhost_ops->vhost_get_vq_index(&net->dev, vq_index);
+
+    r = vhost_dev_virtqueue_restart(&net->dev, vdev, idx);
+    if (r < 0) {
+    goto err_start;
+    }
+
+    if (net->nc->info->type == NET_CLIENT_DRIVER_TAP) {
+    file.index = idx;
+    file.fd = net->backend;
+    r = vhost_net_set_backend(&net->dev, &file);
+    if (r < 0) {
+    r = -errno;
+    goto err_start;
+    }
+    }
+
+    return 0;
+
+err_start:
+    error_report("Error when restarting the queue.");
+
+    if (net->nc->info->type == NET_CLIENT_DRIVER_TAP) {
+    file.fd = -1;
+    file.index = idx;
+    int r = vhost_net_set_backend(&net->dev, &file);
+    assert(r >= 0);
+    }
+
+    vhost_dev_stop(&net->dev, vdev);
+
+    return r;
+}
diff --git a/include/net/vhost_net.h b/include/net/vhost_net.h
index 9b3aaf3814..e11a297380 100644
--- a/include/net/vhost_net.h
+++ b/include/net/vhost_net.h
@@ -50,4 +50,6 @@ int vhost_net_set_mtu(struct vhost_net *net, 
uint16_t mtu);
    void vhost_net_virtqueue_stop(VirtIODevice *vdev, NetClientState 
*nc,

    int vq_index);
+int vhost_net_virtqueue_restart(VirtIODevice *vdev, NetClientState *nc,
+    int vq_index);
  #endif




Re: [PATCH v2 14/24] vhost-user: introduce vhost_reset_vring() interface

2022-08-23 Thread Jason Wang



在 2022/8/16 09:06, Kangjie Xu 写道:

Introduce the interface vhost_reset_vring(). The interface is a wrapper
to send a VHOST_USER_RESET_VRING message to the back-end. It will reset
an individual vring in the back-end. Meanwhile, it will wait for a reply
to ensure the reset has been completed.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/virtio/vhost-user.c| 41 +++
  include/hw/virtio/vhost-backend.h |  3 +++
  2 files changed, 44 insertions(+)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 75b8df21a4..56033f7a92 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -126,6 +126,7 @@ typedef enum VhostUserRequest {
  VHOST_USER_GET_MAX_MEM_SLOTS = 36,
  VHOST_USER_ADD_MEM_REG = 37,
  VHOST_USER_REM_MEM_REG = 38,
+VHOST_USER_RESET_VRING = 41,
  VHOST_USER_MAX
  } VhostUserRequest;
  
@@ -1498,6 +1499,45 @@ static int vhost_user_get_max_memslots(struct vhost_dev *dev,

  return 0;
  }
  
+static int vhost_user_reset_vring(struct vhost_dev *dev,

+  struct vhost_vring_state *ring)
+{
+int ret;
+VhostUserMsg msg = {
+.hdr.request = VHOST_USER_RESET_VRING,
+.hdr.flags = VHOST_USER_VERSION,



Do we need VHOST_USER_NEED_REPLY_MASK here?

Other looks good.

Thanks



+.payload.state = *ring,
+.hdr.size = sizeof(msg.payload.state),
+};
+
+if (!virtio_has_feature(dev->acked_features, VIRTIO_F_RING_RESET)) {
+return -ENOTSUP;
+}
+
+ret = vhost_user_write(dev, &msg, NULL, 0);
+if (ret < 0) {
+return ret;
+}
+
+ret = vhost_user_read(dev, &msg);
+if (ret < 0) {
+return ret;
+}
+
+if (msg.hdr.request != VHOST_USER_RESET_VRING) {
+error_report("Received unexpected msg type. Expected %d received %d",
+ VHOST_USER_RESET_VRING, msg.hdr.request);
+return -EPROTO;
+}
+
+if (msg.hdr.size != sizeof(msg.payload.state)) {
+error_report("Received bad msg size.");
+return -EPROTO;
+}
+
+return 0;
+}
+
  static int vhost_user_reset_device(struct vhost_dev *dev)
  {
  VhostUserMsg msg = {
@@ -2625,6 +2665,7 @@ const VhostOps user_ops = {
  .vhost_set_features = vhost_user_set_features,
  .vhost_get_features = vhost_user_get_features,
  .vhost_set_owner = vhost_user_set_owner,
+.vhost_reset_vring = vhost_user_reset_vring,
  .vhost_reset_device = vhost_user_reset_device,
  .vhost_get_vq_index = vhost_user_get_vq_index,
  .vhost_set_vring_enable = vhost_user_set_vring_enable,
diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index eab46d7f0b..f23bf71a8d 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -79,6 +79,8 @@ typedef int (*vhost_get_features_op)(struct vhost_dev *dev,
   uint64_t *features);
  typedef int (*vhost_set_backend_cap_op)(struct vhost_dev *dev);
  typedef int (*vhost_set_owner_op)(struct vhost_dev *dev);
+typedef int (*vhost_reset_vring_op)(struct vhost_dev *dev,
+struct vhost_vring_state *ring);
  typedef int (*vhost_reset_device_op)(struct vhost_dev *dev);
  typedef int (*vhost_get_vq_index_op)(struct vhost_dev *dev, int idx);
  typedef int (*vhost_set_vring_enable_op)(struct vhost_dev *dev,
@@ -154,6 +156,7 @@ typedef struct VhostOps {
  vhost_set_backend_cap_op vhost_set_backend_cap;
  vhost_set_owner_op vhost_set_owner;
  vhost_reset_device_op vhost_reset_device;
+vhost_reset_vring_op vhost_reset_vring;
  vhost_get_vq_index_op vhost_get_vq_index;
  vhost_set_vring_enable_op vhost_set_vring_enable;
  vhost_requires_shm_log_op vhost_requires_shm_log;





Re: [PATCH v2 11/24] vhost-net: vhost-kernel: introduce vhost_net_virtqueue_stop()

2022-08-23 Thread Kangjie Xu


在 2022/8/24 10:40, Jason Wang 写道:


在 2022/8/16 09:06, Kangjie Xu 写道:

Introduce vhost_virtqueue_stop(), which can reset the virtqueue
in the device. Then it will unmap vrings and the desc of the
virtqueue.

This patch only considers the case for vhost-kernel, when
NetClientDriver is NET_CLIENT_DRIVER_TAP.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/net/vhost_net.c  | 21 +
  include/net/vhost_net.h |  2 ++
  2 files changed, 23 insertions(+)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index ccac5b7a64..aa60dd901c 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -514,3 +514,24 @@ int vhost_net_set_mtu(struct vhost_net *net, 
uint16_t mtu)

    return vhost_ops->vhost_net_set_mtu(&net->dev, mtu);
  }
+
+void vhost_net_virtqueue_stop(VirtIODevice *vdev, NetClientState *nc,
+  int vq_index)
+{
+    VHostNetState *net = get_vhost_net(nc->peer);
+    const VhostOps *vhost_ops = net->dev.vhost_ops;
+    struct vhost_vring_file file = { .fd = -1 };
+    int idx;
+
+    assert(vhost_ops);
+
+    idx = vhost_ops->vhost_get_vq_index(&net->dev, vq_index);
+
+    if (net->nc->info->type == NET_CLIENT_DRIVER_TAP) {
+    file.index = idx;
+    int r = vhost_net_set_backend(&net->dev, &file);
+    assert(r >= 0);
+    }



Let's have a vhost_ops here instead of open code it.

Thanks


I share your view.

Thanks




+
+    vhost_dev_virtqueue_stop(&net->dev, vdev, idx);
+}
diff --git a/include/net/vhost_net.h b/include/net/vhost_net.h
index 387e913e4e..9b3aaf3814 100644
--- a/include/net/vhost_net.h
+++ b/include/net/vhost_net.h
@@ -48,4 +48,6 @@ uint64_t vhost_net_get_acked_features(VHostNetState 
*net);

    int vhost_net_set_mtu(struct vhost_net *net, uint16_t mtu);
  +void vhost_net_virtqueue_stop(VirtIODevice *vdev, NetClientState *nc,
+  int vq_index);
  #endif

Re: [PATCH v2 13/24] docs: vhost-user: add VHOST_USER_RESET_VRING message

2022-08-23 Thread Jason Wang



在 2022/8/16 09:06, Kangjie Xu 写道:

To support the reset operation for an individual virtqueue,
we introduce a new message VHOST_USER_RESET_VRING. This
message is submitted by the front-end to reset an individual
virtqueue to initial states in the back-end. The reply is
needed to ensure that the reset operation is complete.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 



Acked-by: Jason Wang 



---
  docs/interop/vhost-user.rst | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 3f18ab424e..ce7991b9d3 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1422,6 +1422,16 @@ Front-end message types
query the back-end for its device status as defined in the Virtio
specification.
  
+``VHOST_USER_RESET_VRING``

+  :id: 41
+  :equivalent ioctl: N/A
+  :request payload: vring state description
+  :reply payload: ``u64``
+
+  When the ``VIRTIO_F_RING_RESET`` feature has been successfully
+  negotiated, this message is submitted by the front-end to reset an
+  individual virtqueue to initial states in the back-end. It will ask
+  for a reply to ensure the virtqueue is successfully reset in the back-end.
  
  Back-end message types

  --





Re: [PATCH v2 10/24] vhost: introduce vhost_dev_virtqueue_restart()

2022-08-23 Thread Kangjie Xu



在 2022/8/24 10:37, Jason Wang 写道:


在 2022/8/16 09:06, Kangjie Xu 写道:

Introduce vhost_dev_virtqueue_restart(), which can restart the
virtqueue when the vhost has already started running.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/virtio/vhost.c | 13 +
  include/hw/virtio/vhost.h |  2 ++
  2 files changed, 15 insertions(+)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 1bca9ff48d..fc3f550c76 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1913,3 +1913,16 @@ void vhost_dev_virtqueue_stop(struct vhost_dev 
*hdev, VirtIODevice *vdev,

    hdev->vqs + idx,
    idx);
  }
+
+int vhost_dev_virtqueue_restart(struct vhost_dev *hdev, VirtIODevice 
*vdev,

+    int idx)
+{
+    const VhostOps *vhost_ops = hdev->vhost_ops;
+
+    assert(vhost_ops);



So we had the comment like:

    /* should only be called after backend is connected */

in vhost_virtqueue_mask().

If this assert has the same reason, let's add a comment here.



Get it.

+
+    return vhost_virtqueue_start(hdev,
+ vdev,
+ hdev->vqs + idx,
+ hdev->vq_index + idx);



So it's just a wrapper around vhost_virtqueue_start(); is there any value in 
having a restart wrapper?


Thanks

Because in subsequent patches, to support vhost-user, we will add 
vhost_ops->vhost_set_single_vring_enable() in this function.


Thanks.




+}
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 574888440c..b3394b6348 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -291,4 +291,6 @@ int vhost_dev_get_inflight(struct vhost_dev *dev, 
uint16_t queue_size,
    void vhost_dev_virtqueue_stop(struct vhost_dev *hdev, 
VirtIODevice *vdev,

    int idx);
+int vhost_dev_virtqueue_restart(struct vhost_dev *hdev, VirtIODevice 
*vdev,

+    int idx);
  #endif




Re: [PATCH v2 12/24] vhost-net: vhost-kernel: introduce vhost_net_virtqueue_restart()

2022-08-23 Thread Jason Wang



在 2022/8/16 09:06, Kangjie Xu 写道:

Introduce vhost_net_virtqueue_restart(), which can restart a
virtqueue when vhost-net is already running. If it fails
to restart the virtqueue, the device will be stopped.

This patch only considers the case for vhost-kernel, when
NetClientDriver is NET_CLIENT_DRIVER_TAP.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 



I would explain why the current vhost_net_start_one()/vhost_net_stop_one() 
can't work. Is it because it works at the queue pair level? If yes, can we 
restructure the code and try to reuse it?


Thanks



---
  hw/net/vhost_net.c  | 48 +
  include/net/vhost_net.h |  2 ++
  2 files changed, 50 insertions(+)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index aa60dd901c..2ab67e875e 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -535,3 +535,51 @@ void vhost_net_virtqueue_stop(VirtIODevice *vdev, 
NetClientState *nc,
  
  vhost_dev_virtqueue_stop(&net->dev, vdev, idx);

  }
+
+int vhost_net_virtqueue_restart(VirtIODevice *vdev, NetClientState *nc,
+int vq_index)
+{
+VHostNetState *net = get_vhost_net(nc->peer);
+const VhostOps *vhost_ops = net->dev.vhost_ops;
+struct vhost_vring_file file = { };
+int idx, r;
+
+if (!net->dev.started) {
+return 0;
+}
+
+assert(vhost_ops);
+
+idx = vhost_ops->vhost_get_vq_index(&net->dev, vq_index);
+
+r = vhost_dev_virtqueue_restart(&net->dev, vdev, idx);
+if (r < 0) {
+goto err_start;
+}
+
+if (net->nc->info->type == NET_CLIENT_DRIVER_TAP) {
+file.index = idx;
+file.fd = net->backend;
+r = vhost_net_set_backend(&net->dev, &file);
+if (r < 0) {
+r = -errno;
+goto err_start;
+}
+}
+
+return 0;
+
+err_start:
+error_report("Error when restarting the queue.");
+
+if (net->nc->info->type == NET_CLIENT_DRIVER_TAP) {
+file.fd = -1;
+file.index = idx;
+int r = vhost_net_set_backend(&net->dev, &file);
+assert(r >= 0);
+}
+
+vhost_dev_stop(&net->dev, vdev);
+
+return r;
+}
diff --git a/include/net/vhost_net.h b/include/net/vhost_net.h
index 9b3aaf3814..e11a297380 100644
--- a/include/net/vhost_net.h
+++ b/include/net/vhost_net.h
@@ -50,4 +50,6 @@ int vhost_net_set_mtu(struct vhost_net *net, uint16_t mtu);
  
  void vhost_net_virtqueue_stop(VirtIODevice *vdev, NetClientState *nc,

int vq_index);
+int vhost_net_virtqueue_restart(VirtIODevice *vdev, NetClientState *nc,
+int vq_index);
  #endif
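[Editor's note] The error path in vhost_net_virtqueue_restart() above first detaches the tap backend (fd = -1), then stops the device, so the kernel never keeps polling a half-configured queue. A simplified, hypothetical model of that restart-with-rollback flow (not QEMU code; names are invented for the sketch):

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of the net state: one backend fd, started flag. */
struct toy_net {
    int backend_fd;   /* fd attached to the vhost device, -1 = detached */
    bool started;
};

static int toy_set_backend(struct toy_net *net, int fd)
{
    net->backend_fd = fd;
    return 0;
}

/* restart_ok simulates whether the vring restart itself succeeded.
 * On failure, roll back: detach the backend and stop the device,
 * mirroring the err_start path in vhost_net_virtqueue_restart(). */
int toy_restart(struct toy_net *net, int fd, bool restart_ok)
{
    if (!restart_ok) {
        toy_set_backend(net, -1);   /* rollback: detach backend */
        net->started = false;       /* ...and stop the device   */
        return -EIO;
    }
    return toy_set_backend(net, fd);
}
```

The ordering matters: detaching before stopping guarantees the backend cannot observe the queue in an inconsistent state during teardown.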





Re: [PATCH v2 09/24] vhost: introduce vhost_dev_virtqueue_stop()

2022-08-23 Thread Jason Wang



在 2022/8/23 16:03, Kangjie Xu 写道:


在 2022/8/23 15:52, Jason Wang 写道:


在 2022/8/16 09:06, Kangjie Xu 写道:

Introduce vhost_dev_virtqueue_stop(), which can unmap the
vrings and the desc of it.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/virtio/vhost.c | 9 +
  include/hw/virtio/vhost.h | 3 +++
  2 files changed, 12 insertions(+)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e467dfc7bc..1bca9ff48d 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1904,3 +1904,12 @@ int vhost_net_set_backend(struct vhost_dev 
*hdev,

    return -ENOSYS;
  }
+
+void vhost_dev_virtqueue_stop(struct vhost_dev *hdev, VirtIODevice 
*vdev,

+  int idx)
+{
+    vhost_virtqueue_unmap(hdev,
+  vdev,
+  hdev->vqs + idx,
+  idx);



So I think the unmap is not sufficient; we need backend-specific 
support. E.g. for vhost-kernel, don't we need a SET_BACKEND here?


Thanks

But SET_BACKEND of vhost-net needs a parameter fd in vhost_vring_file: 
that is net->backend of VHostNetState.


If we add the fd parameter or struct vhost_vring_file to 
vhost_dev_virtqueue_stop/restart, it exposes some implementation 
details in the parameter list.


And that seems not good? So I put SET_BACKEND in the vhost-net module. 
The workflow is similar to vhost_net_start_one().



That looks fine.

Thanks




Thanks




+}
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index a346f23d13..574888440c 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -288,4 +288,7 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
 struct vhost_inflight *inflight);
  int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t 
queue_size,

 struct vhost_inflight *inflight);
+
+void vhost_dev_virtqueue_stop(struct vhost_dev *hdev, VirtIODevice 
*vdev,

+  int idx);
  #endif







Re: [PATCH v2 11/24] vhost-net: vhost-kernel: introduce vhost_net_virtqueue_stop()

2022-08-23 Thread Jason Wang



在 2022/8/16 09:06, Kangjie Xu 写道:

Introduce vhost_virtqueue_stop(), which can reset the virtqueue
in the device. Then it will unmap vrings and the desc of the
virtqueue.

This patch only considers the case for vhost-kernel, when
NetClientDriver is NET_CLIENT_DRIVER_TAP.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/net/vhost_net.c  | 21 +
  include/net/vhost_net.h |  2 ++
  2 files changed, 23 insertions(+)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index ccac5b7a64..aa60dd901c 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -514,3 +514,24 @@ int vhost_net_set_mtu(struct vhost_net *net, uint16_t mtu)
  
  return vhost_ops->vhost_net_set_mtu(&net->dev, mtu);

  }
+
+void vhost_net_virtqueue_stop(VirtIODevice *vdev, NetClientState *nc,
+  int vq_index)
+{
+VHostNetState *net = get_vhost_net(nc->peer);
+const VhostOps *vhost_ops = net->dev.vhost_ops;
+struct vhost_vring_file file = { .fd = -1 };
+int idx;
+
+assert(vhost_ops);
+
+idx = vhost_ops->vhost_get_vq_index(&net->dev, vq_index);
+
+if (net->nc->info->type == NET_CLIENT_DRIVER_TAP) {
+file.index = idx;
+int r = vhost_net_set_backend(&net->dev, &file);
+assert(r >= 0);
+}



Let's have a vhost_ops here instead of open code it.

Thanks



+
+vhost_dev_virtqueue_stop(&net->dev, vdev, idx);
+}
diff --git a/include/net/vhost_net.h b/include/net/vhost_net.h
index 387e913e4e..9b3aaf3814 100644
--- a/include/net/vhost_net.h
+++ b/include/net/vhost_net.h
@@ -48,4 +48,6 @@ uint64_t vhost_net_get_acked_features(VHostNetState *net);
  
  int vhost_net_set_mtu(struct vhost_net *net, uint16_t mtu);
  
+void vhost_net_virtqueue_stop(VirtIODevice *vdev, NetClientState *nc,

+  int vq_index);
  #endif





Re: [PATCH v2 10/24] vhost: introduce vhost_dev_virtqueue_restart()

2022-08-23 Thread Jason Wang



在 2022/8/16 09:06, Kangjie Xu 写道:

Introduce vhost_dev_virtqueue_restart(), which can restart the
virtqueue when the vhost has already started running.

Signed-off-by: Kangjie Xu 
Signed-off-by: Xuan Zhuo 
---
  hw/virtio/vhost.c | 13 +
  include/hw/virtio/vhost.h |  2 ++
  2 files changed, 15 insertions(+)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 1bca9ff48d..fc3f550c76 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1913,3 +1913,16 @@ void vhost_dev_virtqueue_stop(struct vhost_dev *hdev, 
VirtIODevice *vdev,
hdev->vqs + idx,
idx);
  }
+
+int vhost_dev_virtqueue_restart(struct vhost_dev *hdev, VirtIODevice *vdev,
+int idx)
+{
+const VhostOps *vhost_ops = hdev->vhost_ops;
+
+assert(vhost_ops);



So we had the comment like:

    /* should only be called after backend is connected */

in vhost_virtqueue_mask().

If this assert has the same reason, let's add a comment here.



+
+return vhost_virtqueue_start(hdev,
+ vdev,
+ hdev->vqs + idx,
+ hdev->vq_index + idx);



So it's just a wrapper around vhost_virtqueue_start(); is there any value in 
having a restart wrapper?


Thanks



+}
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 574888440c..b3394b6348 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -291,4 +291,6 @@ int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t 
queue_size,
  
  void vhost_dev_virtqueue_stop(struct vhost_dev *hdev, VirtIODevice *vdev,

int idx);
+int vhost_dev_virtqueue_restart(struct vhost_dev *hdev, VirtIODevice *vdev,
+int idx);
  #endif





Re: [PATCH v10 00/12] NIC vhost-vdpa state restore via Shadow CVQ

2022-08-23 Thread Jason Wang



在 2022/8/24 02:30, Eugenio Pérez 写道:

CVQ of net vhost-vdpa devices can be intercepted since the addition of x-svq.
The virtio-net device model is updated. The migration was blocked because
although the state can be migrated between VMMs it was not possible to restore
it on the destination NIC.

This series adds support for SVQ to inject external messages without the guest's
knowledge, so before the guest is resumed all the guest visible state is
restored. It is done using standard CVQ messages, so the vhost-vdpa device does
not need to learn how to restore it: As long as they have the feature, they
know how to handle it.

Depending on the device, this series may need fixes with message id [1] to
achieve full live migration.

[1] <20220823182008.97141-1-epere...@redhat.com>



I've queued the two series in net-next branch:

https://github.com/jasowang/qemu/commits/net-next

Thanks




v10:
- Rebase on latest fixes of [1].

v9:
- Use guest acked features instead of device's.
- Minors: fix typos and patch messages, constify vhost_vdpa and VirtIONet vars,
   delete unneeded increment of cursor.

v8:
- Rename NetClientInfo load to start, so is symmetrical with stop()
- Delete copy of device's in buffer at vhost_vdpa_net_load

v7:
- Remove accidental double free.

v6:
- Move map and unmap of the buffers to the start and stop of the device. This
   implies more callbacks on NetClientInfo, but simplifies the SVQ CVQ code.
- Not assume that in buffer is sizeof(virtio_net_ctrl_ack) in
   vhost_vdpa_net_cvq_add
- Reduce the number of changes from previous versions
- Delete unused memory barrier

v5:
- Rename s/start/load/
- Use independent NetClientInfo to only add load callback on cvq.
- Accept out sg instead of dev_buffers[] at vhost_vdpa_net_cvq_map_elem
- Use only out size instead of iovec dev_buffers to know if the descriptor is
   effectively available, allowing to delete artificial !NULL VirtQueueElement
   on vhost_svq_add call.

v4:
- Actually use NetClientInfo callback.

v3:
- Route vhost-vdpa start code through NetClientInfo callback.
- Delete extra vhost_net_stop_one() call.

v2:
- Fix SIGSEGV dereferencing SVQ when not in svq mode

v1 from RFC:
- Do not reorder DRIVER_OK & enable patches.
- Delete leftovers

Eugenio Pérez (12):
   vhost: stop transfer elem ownership in vhost_handle_guest_kick
   vhost: use SVQ element ndescs instead of opaque data for desc
 validation
   vhost: Delete useless read memory barrier
   vhost: Do not depend on !NULL VirtQueueElement on vhost_svq_flush
   vhost_net: Add NetClientInfo start callback
   vhost_net: Add NetClientInfo stop callback
   vdpa: add net_vhost_vdpa_cvq_info NetClientInfo
   vdpa: Move command buffers map to start of net device
   vdpa: extract vhost_vdpa_net_cvq_add from
 vhost_vdpa_net_handle_ctrl_avail
   vhost_net: add NetClientState->load() callback
   vdpa: Add virtio-net mac address via CVQ at start
   vdpa: Delete CVQ migration blocker

  include/hw/virtio/vhost-vdpa.h |   1 -
  include/net/net.h  |   6 +
  hw/net/vhost_net.c |  17 +++
  hw/virtio/vhost-shadow-virtqueue.c |  27 ++--
  hw/virtio/vhost-vdpa.c |  15 --
  net/vhost-vdpa.c   | 224 ++---
  6 files changed, 177 insertions(+), 113 deletions(-)






Re: [PATCH] net: tulip: Restrict DMA engine to memories

2022-08-23 Thread Jason Wang



在 2022/8/24 10:26, Jason Wang 写道:

On Sun, Aug 21, 2022 at 8:29 PM Zheyu Ma  wrote:

The DMA engine is started by I/O access and then itself accesses the
I/O registers, triggering a reentrancy bug.

The following log can reveal it:
==5637==ERROR: AddressSanitizer: stack-overflow
 #0 0x5595435f6078 in tulip_xmit_list_update qemu/hw/net/tulip.c:673
 #1 0x5595435f204a in tulip_write qemu/hw/net/tulip.c:805:13
 #2 0x559544637f86 in memory_region_write_accessor 
qemu/softmmu/memory.c:492:5
 #3 0x5595446379fa in access_with_adjusted_size qemu/softmmu/memory.c:554:18
 #4 0x5595446372fa in memory_region_dispatch_write qemu/softmmu/memory.c
 #5 0x55954468b74c in flatview_write_continue qemu/softmmu/physmem.c:2825:23
 #6 0x559544683662 in flatview_write qemu/softmmu/physmem.c:2867:12
 #7 0x5595446833f3 in address_space_write qemu/softmmu/physmem.c:2963:18
 #8 0x5595435fb082 in dma_memory_rw_relaxed 
/home/mzy/truman/third_party/qemu/include/sysemu/dma.h:87:12
 #9 0x5595435fb082 in dma_memory_rw 
/home/mzy/truman/third_party/qemu/include/sysemu/dma.h:130:12
 #10 0x5595435fb082 in dma_memory_write 
/home/mzy/truman/third_party/qemu/include/sysemu/dma.h:171:12
 #11 0x5595435fb082 in stl_le_dma 
/home/mzy/truman/third_party/qemu/include/sysemu/dma.h:272:1
 #12 0x5595435fb082 in stl_le_pci_dma 
/home/mzy/truman/third_party/qemu/include/hw/pci/pci.h:910:1
 #13 0x5595435fb082 in tulip_desc_write qemu/hw/net/tulip.c:101:9
 #14 0x5595435f7e3d in tulip_xmit_list_update qemu/hw/net/tulip.c:706:9
 #15 0x5595435f204a in tulip_write qemu/hw/net/tulip.c:805:13

Fix this bug by restricting the DMA engine to memories regions.

Signed-off-by: Zheyu Ma 

Queued for 7.2.



Ok, I saw v2, so I've queued that version since it has a better change log.

Thanks




Thanks


---
  hw/net/tulip.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/net/tulip.c b/hw/net/tulip.c
index 097e905bec..b9e42c322a 100644
--- a/hw/net/tulip.c
+++ b/hw/net/tulip.c
@@ -70,7 +70,7 @@ static const VMStateDescription vmstate_pci_tulip = {
  static void tulip_desc_read(TULIPState *s, hwaddr p,
  struct tulip_descriptor *desc)
  {
-const MemTxAttrs attrs = MEMTXATTRS_UNSPECIFIED;
+const MemTxAttrs attrs = { .memory = true };

  if (s->csr[0] & CSR0_DBO) {
  ldl_be_pci_dma(&s->dev, p, &desc->status, attrs);
@@ -88,7 +88,7 @@ static void tulip_desc_read(TULIPState *s, hwaddr p,
  static void tulip_desc_write(TULIPState *s, hwaddr p,
  struct tulip_descriptor *desc)
  {
-const MemTxAttrs attrs = MEMTXATTRS_UNSPECIFIED;
+const MemTxAttrs attrs = { .memory = true };

  if (s->csr[0] & CSR0_DBO) {
  stl_be_pci_dma(&s->dev, p, desc->status, attrs);
--
2.25.1






Re: [PATCH] net: tulip: Restrict DMA engine to memories

2022-08-23 Thread Jason Wang
On Sun, Aug 21, 2022 at 8:29 PM Zheyu Ma  wrote:
>
> The DMA engine is started by I/O access and then itself accesses the
> I/O registers, triggering a reentrancy bug.
>
> The following log can reveal it:
> ==5637==ERROR: AddressSanitizer: stack-overflow
> #0 0x5595435f6078 in tulip_xmit_list_update qemu/hw/net/tulip.c:673
> #1 0x5595435f204a in tulip_write qemu/hw/net/tulip.c:805:13
> #2 0x559544637f86 in memory_region_write_accessor 
> qemu/softmmu/memory.c:492:5
> #3 0x5595446379fa in access_with_adjusted_size 
> qemu/softmmu/memory.c:554:18
> #4 0x5595446372fa in memory_region_dispatch_write qemu/softmmu/memory.c
> #5 0x55954468b74c in flatview_write_continue 
> qemu/softmmu/physmem.c:2825:23
> #6 0x559544683662 in flatview_write qemu/softmmu/physmem.c:2867:12
> #7 0x5595446833f3 in address_space_write qemu/softmmu/physmem.c:2963:18
> #8 0x5595435fb082 in dma_memory_rw_relaxed 
> /home/mzy/truman/third_party/qemu/include/sysemu/dma.h:87:12
> #9 0x5595435fb082 in dma_memory_rw 
> /home/mzy/truman/third_party/qemu/include/sysemu/dma.h:130:12
> #10 0x5595435fb082 in dma_memory_write 
> /home/mzy/truman/third_party/qemu/include/sysemu/dma.h:171:12
> #11 0x5595435fb082 in stl_le_dma 
> /home/mzy/truman/third_party/qemu/include/sysemu/dma.h:272:1
> #12 0x5595435fb082 in stl_le_pci_dma 
> /home/mzy/truman/third_party/qemu/include/hw/pci/pci.h:910:1
> #13 0x5595435fb082 in tulip_desc_write qemu/hw/net/tulip.c:101:9
> #14 0x5595435f7e3d in tulip_xmit_list_update qemu/hw/net/tulip.c:706:9
> #15 0x5595435f204a in tulip_write qemu/hw/net/tulip.c:805:13
>
> Fix this bug by restricting the DMA engine to memories regions.
>
> Signed-off-by: Zheyu Ma 

Queued for 7.2.

Thanks

> ---
>  hw/net/tulip.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/net/tulip.c b/hw/net/tulip.c
> index 097e905bec..b9e42c322a 100644
> --- a/hw/net/tulip.c
> +++ b/hw/net/tulip.c
> @@ -70,7 +70,7 @@ static const VMStateDescription vmstate_pci_tulip = {
>  static void tulip_desc_read(TULIPState *s, hwaddr p,
>  struct tulip_descriptor *desc)
>  {
> -const MemTxAttrs attrs = MEMTXATTRS_UNSPECIFIED;
> +const MemTxAttrs attrs = { .memory = true };
>
>  if (s->csr[0] & CSR0_DBO) {
>  ldl_be_pci_dma(&s->dev, p, &desc->status, attrs);
> @@ -88,7 +88,7 @@ static void tulip_desc_read(TULIPState *s, hwaddr p,
>  static void tulip_desc_write(TULIPState *s, hwaddr p,
>  struct tulip_descriptor *desc)
>  {
> -const MemTxAttrs attrs = MEMTXATTRS_UNSPECIFIED;
> +const MemTxAttrs attrs = { .memory = true };
>
>  if (s->csr[0] & CSR0_DBO) {
>  stl_be_pci_dma(&s->dev, p, desc->status, attrs);
> --
> 2.25.1
>




EBUSY when using NVMe Block Driver with multiple devices in the same IOMMU group

2022-08-23 Thread Martin Oliveira
Hello,

I'm trying to use the QEMU NVMe userspace driver and I'm hitting an error when 
trying to use more than one device from an IOMMU group:

Failed to open VFIO group file: /dev/vfio/39: Device or resource busy

If devices belong to different IOMMU groups, then it works as expected.

For each device, I bind it to vfio-pci and then use something like this:

-drive file=nvme://:26:00.0,if=none,id=drive0,format=raw
-device virtio-blk,drive=drive0,id=virtio0,serial=nvme0

Using the file-based protocol (file=/dev/nvme0n1) works with multiple devices 
from the same group.

My host is running a 5.19 kernel and QEMU is the latest upstream (a8cc5842b5cb).

Thanks,
Martin


Re: [PATCH 1/9] hw/isa/vt82c686: QOM'ify Super I/O creation

2022-08-23 Thread BALATON Zoltan

On Tue, 23 Aug 2022, Bernhard Beschow wrote:

On Tue, Aug 23, 2022 at 2:35 AM BALATON Zoltan  wrote:


On Tue, 23 Aug 2022, Bernhard Beschow wrote:

The object creation now happens in chip-specific init methods which
allows the realize methods to be consolidated into one method. Shifting
the logic into the init methods has the additional advantage that the
parent object's init methods are called implicitly.


This and later patches titled "QOM'ify" don't do what I understand by
QOMifying, which is turning an old device model into a QOM object. These
devices are already QOMified; what's really done here is that they are
moved into ViaISAState or embedded in it and created as part of the
south bridge rather than individually by the boards. Either my
understanding of what QOMify means is wrong or these patches are misnamed.



I think your understanding is correct. Peter refers to it as the
embed-the-device-struct style [1] which I can take as inspiration for
renaming my patches in v2.

Technically, via_isa_realize() is the realize method of the abstract
TYPE_VIA_ISA class, which was overridden by the subclasses. Since QOM
does not call overridden methods implicitly, it had to be called
explicitly from the overriding realize methods of the subclasses. Now
that we no longer have to override the method, it could be set once on
the VIA_ISA class and things may work that way; but doing it as in this
patch is also OK, perhaps as a reminder that this superclass method
should be called by any overriding method if one is added in the future
for some reason.



Right. This would involve moving some code around and creating a class
struct. Do you think it's worth it?


You mean a class_init function to assign realize, since the class struct
via_isa_info already exists. But yes, it would then need to be moved
after via_isa_realize. As I wrote above, I don't think this is worth it,
especially because if a realize method were re-added to the subclasses
in the future, it would be easy to forget to revert this and call the
superclass method. So assigning via_isa_realize to the subclass as done
here is OK, and I can think of it as a reminder to call this method when
overriding it. Also, compared to adding a class_init function, it's
shorter this way, even if it slightly mixes different objects. So in the
end I'm OK with this as it is.


Regards,
BALATON Zoltan


Best regards,
Bernhard

[1]
http://patchwork.ozlabs.org/project/qemu-devel/patch/20220205175913.31738-2-shen...@gmail.com/

Regards,

BALATON Zoltan


Signed-off-by: Bernhard Beschow 
---
hw/isa/vt82c686.c | 33 ++---
1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/hw/isa/vt82c686.c b/hw/isa/vt82c686.c
index 8f656251b8..0217c98fe4 100644
--- a/hw/isa/vt82c686.c
+++ b/hw/isa/vt82c686.c
@@ -544,7 +544,7 @@ struct ViaISAState {
qemu_irq cpu_intr;
qemu_irq *isa_irqs;
ISABus *isa_bus;
-ViaSuperIOState *via_sio;
+ViaSuperIOState via_sio;
};

static const VMStateDescription vmstate_via = {
@@ -602,6 +602,11 @@ static void via_isa_realize(PCIDevice *d, Error

**errp)

d->wmask[i] = 0;
}
}
+
+/* Super I/O */
+if (!qdev_realize(DEVICE(&s->via_sio), BUS(s->isa_bus), errp)) {
+return;
+}
}

/* TYPE_VT82C686B_ISA */
@@ -615,7 +620,7 @@ static void vt82c686b_write_config(PCIDevice *d,

uint32_t addr,

pci_default_write_config(d, addr, val, len);
if (addr == 0x85) {
/* BIT(1): enable or disable superio config io ports */
-via_superio_io_enable(s->via_sio, val & BIT(1));
+via_superio_io_enable(&s->via_sio, val & BIT(1));
}
}

@@ -639,13 +644,11 @@ static void vt82c686b_isa_reset(DeviceState *dev)
pci_conf[0x77] = 0x10; /* GPIO Control 1/2/3/4 */
}

-static void vt82c686b_realize(PCIDevice *d, Error **errp)
+static void vt82c686b_init(Object *obj)
{
-ViaISAState *s = VIA_ISA(d);
+ViaISAState *s = VIA_ISA(obj);

-via_isa_realize(d, errp);
-s->via_sio = VIA_SUPERIO(isa_create_simple(s->isa_bus,
-   TYPE_VT82C686B_SUPERIO));
+object_initialize_child(obj, "sio", >via_sio,

TYPE_VT82C686B_SUPERIO);

}

static void vt82c686b_class_init(ObjectClass *klass, void *data)
@@ -653,7 +656,7 @@ static void vt82c686b_class_init(ObjectClass *klass,

void *data)

DeviceClass *dc = DEVICE_CLASS(klass);
PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);

-k->realize = vt82c686b_realize;
+k->realize = via_isa_realize;
k->config_write = vt82c686b_write_config;
k->vendor_id = PCI_VENDOR_ID_VIA;
k->device_id = PCI_DEVICE_ID_VIA_82C686B_ISA;
@@ -670,6 +673,7 @@ static const TypeInfo vt82c686b_isa_info = {
.name  = TYPE_VT82C686B_ISA,
.parent= TYPE_VIA_ISA,
.instance_size = sizeof(ViaISAState),
+.instance_init = vt82c686b_init,
.class_init= vt82c686b_class_init,
};

@@ -684,7 +688,7 @@ static void vt8231_write_config(PCIDevice *d,

uint32_t addr,


Re: [PATCH 8/9] hw/isa/vt82c686: QOM'ify RTC creation

2022-08-23 Thread BALATON Zoltan

On Tue, 23 Aug 2022, Bernhard Beschow wrote:

On Tue, Aug 23, 2022 at 2:20 AM BALATON Zoltan  wrote:

On Tue, 23 Aug 2022, Bernhard Beschow wrote:

Signed-off-by: Bernhard Beschow 
---
hw/isa/vt82c686.c | 12 +++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/hw/isa/vt82c686.c b/hw/isa/vt82c686.c
index 47f2fd2669..ee745d5d49 100644
--- a/hw/isa/vt82c686.c
+++ b/hw/isa/vt82c686.c
@@ -546,6 +546,7 @@ struct ViaISAState {
qemu_irq cpu_intr;
qemu_irq *isa_irqs;
ViaSuperIOState via_sio;
+RTCState rtc;
PCIIDEState ide;
UHCIState uhci[2];
ViaPMState pm;
@@ -567,6 +568,7 @@ static void via_isa_init(Object *obj)
{
ViaISAState *s = VIA_ISA(obj);

+object_initialize_child(obj, "rtc", >rtc, TYPE_MC146818_RTC);
object_initialize_child(obj, "ide", >ide, "via-ide");
object_initialize_child(obj, "uhci1", >uhci[0],

"vt82c686b-usb-uhci");

object_initialize_child(obj, "uhci2", >uhci[1],

"vt82c686b-usb-uhci");

@@ -615,7 +617,15 @@ static void via_isa_realize(PCIDevice *d, Error

**errp)

isa_bus_irqs(isa_bus, s->isa_irqs);
i8254_pit_init(isa_bus, 0x40, 0, NULL);
i8257_dma_init(isa_bus, 0);
-mc146818_rtc_init(isa_bus, 2000, NULL);
+
+/* RTC */
+qdev_prop_set_int32(DEVICE(&s->rtc), "base_year", 2000);
+if (!qdev_realize(DEVICE(&s->rtc), BUS(isa_bus), errp)) {
+return;
+}
+object_property_add_alias(qdev_get_machine(), "rtc-time",

OBJECT(&s->rtc),

+  "date");
+isa_connect_gpio_out(ISA_DEVICE(&s->rtc), 0, s->rtc.isairq);

for (i = 0; i < PCI_CONFIG_HEADER_SIZE; i++) {
if (i < PCI_COMMAND || i >= PCI_REVISION_ID) {



This actually introduces code duplication, as all other places except piix4
seem to still use the init function (probably to ensure that the rtc-time
alias on the machine is properly set), so I'd keep this the same as
everything else and drop this patch until this init function is removed
from all other places as well.



Hi Zoltan,

Thanks for the fast reply! Regarding code homogeneity and duplication I've
made a similar argument for mc146818_rtc_init() in the past [1] and I've
learnt that my patch went backwards. Incidentally, Peter mentioned vt686c
as a candidate for the embed-the-device-struct style which - again
incidentally - I've now done.


I've seen patches embedding devices recently, but in this case it didn't 
look that simple because of the rtc-time alias.



The rtc-time alias is actually only used by a couple of PPC machines, of
which Pegasos II is one. So the alias actually needs to be created only
for these machines, and identifying the cases where it has to be preserved
requires a lot of careful investigation. In the Pegasos II case this seems
especially complicated, since one needs to look through several layers of
devices. During my work on the VT82xx south bridges I've gained enough
knowledge that I'd like to make this simplifying contribution.


I've used it to implement the get-time-of-day rtas call with VOF in 
pegasos2 because otherwise it would need to access internals of the RTC 
model and/or duplicate some code. Here's the message discussing this:


https://lists.nongnu.org/archive/html/qemu-ppc/2021-10/msg00170.html

so this alias still seems to be the simplest way.

I think the primary function of this alias is not these PPC machines but 
some QMP/HMP command, or making the guest time available from the monitor 
or something like that, so it's probably also used from there; therefore 
all RTCs should probably have it, but I'm not sure about that.



Our discussion makes me realize that the creation of the alias could now
actually be moved to the Pegasos II board. This way, the Pegasos II board
would both create and consume that alias, which seems to remove quite some
cognitive load. Do you agree? Would moving the alias to the board work for
you?


Yes, I think that would be better. This way the vt82xx and piix4 would be 
similar, and the alias would also be clear within the pegasos2 code; the 
board also has the machine directly at that point, so it's clearer that way.


Regards,
BALATON Zoltan



Re: [PATCH 7/9] hw/isa/vt82c686: QOM'ify ac97 and mc97 creation

2022-08-23 Thread BALATON Zoltan

On Tue, 23 Aug 2022, Bernhard Beschow wrote:

On Tue, Aug 23, 2022 at 2:44 AM BALATON Zoltan  wrote:


On Tue, 23 Aug 2022, Bernhard Beschow wrote:

Resolves duplicate code in the boards.

Signed-off-by: Bernhard Beschow 
---
hw/isa/vt82c686.c   | 16 
hw/mips/fuloong2e.c |  4 
hw/ppc/pegasos2.c   |  4 
3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/hw/isa/vt82c686.c b/hw/isa/vt82c686.c
index b964d1a760..47f2fd2669 100644
--- a/hw/isa/vt82c686.c
+++ b/hw/isa/vt82c686.c
@@ -549,6 +549,8 @@ struct ViaISAState {
PCIIDEState ide;
UHCIState uhci[2];
ViaPMState pm;
+PCIDevice ac97;
+PCIDevice mc97;
};

static const VMStateDescription vmstate_via = {
@@ -568,6 +570,8 @@ static void via_isa_init(Object *obj)
object_initialize_child(obj, "ide", >ide, "via-ide");
object_initialize_child(obj, "uhci1", >uhci[0],

"vt82c686b-usb-uhci");

object_initialize_child(obj, "uhci2", >uhci[1],

"vt82c686b-usb-uhci");

+object_initialize_child(obj, "ac97", >ac97, TYPE_VIA_AC97);
+object_initialize_child(obj, "mc97", >mc97, TYPE_VIA_MC97);
}

static const TypeInfo via_isa_info = {
@@ -644,6 +648,18 @@ static void via_isa_realize(PCIDevice *d, Error

**errp)

if (!qdev_realize(DEVICE(&s->pm), BUS(pci_bus), errp)) {
return;
}
+
+/* Function 5: AC97 Audio */
+qdev_prop_set_int32(DEVICE(&s->ac97), "addr", d->devfn + 5);
+if (!qdev_realize(DEVICE(&s->ac97), BUS(pci_bus), errp)) {
+return;
+}
+
+/* Function 6: AC97 Modem */
+qdev_prop_set_int32(DEVICE(&s->mc97), "addr", d->devfn + 6);
+if (!qdev_realize(DEVICE(&s->mc97), BUS(pci_bus), errp)) {
+return;
+}
}

/* TYPE_VT82C686B_ISA */
diff --git a/hw/mips/fuloong2e.c b/hw/mips/fuloong2e.c
index f05474348f..ea1aef3049 100644
--- a/hw/mips/fuloong2e.c
+++ b/hw/mips/fuloong2e.c
@@ -207,10 +207,6 @@ static void vt82c686b_southbridge_init(PCIBus

*pci_bus, int slot, qemu_irq intc,


dev = PCI_DEVICE(object_resolve_path_component(OBJECT(dev), "pm"));
*i2c_bus = I2C_BUS(qdev_get_child_bus(DEVICE(dev), "i2c"));
-
-/* Audio support */
-pci_create_simple(pci_bus, PCI_DEVFN(slot, 5), TYPE_VIA_AC97);
-pci_create_simple(pci_bus, PCI_DEVFN(slot, 6), TYPE_VIA_MC97);
}

/* Network support */
diff --git a/hw/ppc/pegasos2.c b/hw/ppc/pegasos2.c
index 4e29e42fba..89ef4aed8b 100644
--- a/hw/ppc/pegasos2.c
+++ b/hw/ppc/pegasos2.c
@@ -171,10 +171,6 @@ static void pegasos2_init(MachineState *machine)
spd_data = spd_data_generate(DDR, machine->ram_size);
smbus_eeprom_init_one(i2c_bus, 0x57, spd_data);

-/* VT8231 function 5-6: AC97 Audio & Modem */
-pci_create_simple(pci_bus, PCI_DEVFN(12, 5), TYPE_VIA_AC97);
-pci_create_simple(pci_bus, PCI_DEVFN(12, 6), TYPE_VIA_MC97);
-


This removes the last function created here so the comment above saying:
/* VT8231 function 0: PCI-to-ISA Bridge */
is now stale and may be removed as well.



Sure, I'll remove it in v2. What about the comment saying:
/* VT8231 function 4: Power Management Controller */
?


I thought that was removed by patch 6, but indeed it wasn't. I think that's 
now also stale and can be dropped (or replaced by something saying SPD 
EEPROM, but the remaining code is fairly clear without a comment, so just 
removing it is fine).


Regards,
BALATON Zoltan



[PATCH v7 16/20] accel/tcg: Add fast path for translator_ld*

2022-08-23 Thread Richard Henderson
Cache the translation from guest to host address, so we may
use direct loads when we hit on the primary translation page.

Look up the second translation page only once, during translation.
This obviates another lookup of the second page within tb_gen_code
after translation.

Fixes a bug in that plugin_insn_append should be passed the bytes
in the original memory order, not bswapped by pieces.

Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 include/exec/translator.h |  63 +++
 accel/tcg/translate-all.c |  23 +++
 accel/tcg/translator.c| 126 +-
 3 files changed, 141 insertions(+), 71 deletions(-)

diff --git a/include/exec/translator.h b/include/exec/translator.h
index 69db0f5c21..329a42fe46 100644
--- a/include/exec/translator.h
+++ b/include/exec/translator.h
@@ -81,24 +81,14 @@ typedef enum DisasJumpType {
  * Architecture-agnostic disassembly context.
  */
 typedef struct DisasContextBase {
-const TranslationBlock *tb;
+TranslationBlock *tb;
 target_ulong pc_first;
 target_ulong pc_next;
 DisasJumpType is_jmp;
 int num_insns;
 int max_insns;
 bool singlestep_enabled;
-#ifdef CONFIG_USER_ONLY
-/*
- * Guest address of the last byte of the last protected page.
- *
- * Pages containing the translated instructions are made non-writable in
- * order to achieve consistency in case another thread is modifying the
- * code while translate_insn() fetches the instruction bytes piecemeal.
- * Such writer threads are blocked on mmap_lock() in page_unprotect().
- */
-target_ulong page_protect_end;
-#endif
+void *host_addr[2];
 } DisasContextBase;
 
 /**
@@ -183,24 +173,43 @@ bool translator_use_goto_tb(DisasContextBase *db, 
target_ulong dest);
  * the relevant information at translation time.
  */
 
-#define GEN_TRANSLATOR_LD(fullname, type, load_fn, swap_fn) \
-type fullname ## _swap(CPUArchState *env, DisasContextBase *dcbase, \
-   abi_ptr pc, bool do_swap);   \
-static inline type fullname(CPUArchState *env,  \
-DisasContextBase *dcbase, abi_ptr pc)   \
-{   \
-return fullname ## _swap(env, dcbase, pc, false);   \
+uint8_t translator_ldub(CPUArchState *env, DisasContextBase *db, abi_ptr pc);
+uint16_t translator_lduw(CPUArchState *env, DisasContextBase *db, abi_ptr pc);
+uint32_t translator_ldl(CPUArchState *env, DisasContextBase *db, abi_ptr pc);
+uint64_t translator_ldq(CPUArchState *env, DisasContextBase *db, abi_ptr pc);
+
+static inline uint16_t
+translator_lduw_swap(CPUArchState *env, DisasContextBase *db,
+ abi_ptr pc, bool do_swap)
+{
+uint16_t ret = translator_lduw(env, db, pc);
+if (do_swap) {
+ret = bswap16(ret);
 }
+return ret;
+}
 
-#define FOR_EACH_TRANSLATOR_LD(F)   \
-F(translator_ldub, uint8_t, cpu_ldub_code, /* no swap */)   \
-F(translator_lduw, uint16_t, cpu_lduw_code, bswap16)\
-F(translator_ldl, uint32_t, cpu_ldl_code, bswap32)  \
-F(translator_ldq, uint64_t, cpu_ldq_code, bswap64)
+static inline uint32_t
+translator_ldl_swap(CPUArchState *env, DisasContextBase *db,
+abi_ptr pc, bool do_swap)
+{
+uint32_t ret = translator_ldl(env, db, pc);
+if (do_swap) {
+ret = bswap32(ret);
+}
+return ret;
+}
 
-FOR_EACH_TRANSLATOR_LD(GEN_TRANSLATOR_LD)
-
-#undef GEN_TRANSLATOR_LD
+static inline uint64_t
+translator_ldq_swap(CPUArchState *env, DisasContextBase *db,
+abi_ptr pc, bool do_swap)
+{
+uint64_t ret = translator_ldq(env, db, pc);
+if (do_swap) {
+ret = bswap64(ret);
+}
+return ret;
+}
 
 /*
  * Return whether addr is on the same page as where disassembly started.
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 587886aa4e..f5e8592d4a 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1385,8 +1385,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 {
 CPUArchState *env = cpu->env_ptr;
 TranslationBlock *tb, *existing_tb;
-tb_page_addr_t phys_pc, phys_page2;
-target_ulong virt_page2;
+tb_page_addr_t phys_pc;
 tcg_insn_unit *gen_code_buf;
 int gen_code_size, search_size, max_insns;
 #ifdef CONFIG_PROFILER
@@ -1429,6 +1428,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 tb->flags = flags;
 tb->cflags = cflags;
 tb->trace_vcpu_dstate = *cpu->trace_dstate;
+tb->page_addr[0] = phys_pc;
+tb->page_addr[1] = -1;
 tcg_ctx->tb_cflags = cflags;
  tb_overflow:
 
@@ -1622,13 +1623,11 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 }
 
 /*
- * If the TB is not associated with a physical RAM 

[PATCH v7 17/20] target/s390x: Make translator stop before the end of a page

2022-08-23 Thread Richard Henderson
From: Ilya Leoshkevich 

Right now the translator stops right *after* the end of a page, which
breaks reporting of fault locations when the last instruction of a
multi-insn translation block crosses a page boundary.

Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Richard Henderson 
Message-Id: <20220817150506.592862-3-...@linux.ibm.com>
Signed-off-by: Richard Henderson 
---
 target/s390x/tcg/translate.c |  15 +++-
 tests/tcg/s390x/noexec.c | 106 +++
 tests/tcg/multiarch/noexec.c.inc | 139 +++
 tests/tcg/s390x/Makefile.target  |   1 +
 4 files changed, 257 insertions(+), 4 deletions(-)
 create mode 100644 tests/tcg/s390x/noexec.c
 create mode 100644 tests/tcg/multiarch/noexec.c.inc

diff --git a/target/s390x/tcg/translate.c b/target/s390x/tcg/translate.c
index d4c0b9b3a2..1d2dddab1c 100644
--- a/target/s390x/tcg/translate.c
+++ b/target/s390x/tcg/translate.c
@@ -6609,6 +6609,14 @@ static void s390x_tr_insn_start(DisasContextBase 
*dcbase, CPUState *cs)
 dc->insn_start = tcg_last_op();
 }
 
+static target_ulong get_next_pc(CPUS390XState *env, DisasContext *s,
+uint64_t pc)
+{
+uint64_t insn = ld_code2(env, s, pc);
+
+return pc + get_ilen((insn >> 8) & 0xff);
+}
+
 static void s390x_tr_translate_insn(DisasContextBase *dcbase, CPUState *cs)
 {
 CPUS390XState *env = cs->env_ptr;
@@ -6616,10 +6624,9 @@ static void s390x_tr_translate_insn(DisasContextBase 
*dcbase, CPUState *cs)
 
 dc->base.is_jmp = translate_one(env, dc);
 if (dc->base.is_jmp == DISAS_NEXT) {
-uint64_t page_start;
-
-page_start = dc->base.pc_first & TARGET_PAGE_MASK;
-if (dc->base.pc_next - page_start >= TARGET_PAGE_SIZE || dc->ex_value) 
{
+if (!is_same_page(dcbase, dc->base.pc_next) ||
+!is_same_page(dcbase, get_next_pc(env, dc, dc->base.pc_next)) ||
+dc->ex_value) {
 dc->base.is_jmp = DISAS_TOO_MANY;
 }
 }
diff --git a/tests/tcg/s390x/noexec.c b/tests/tcg/s390x/noexec.c
new file mode 100644
index 00..15d007d07f
--- /dev/null
+++ b/tests/tcg/s390x/noexec.c
@@ -0,0 +1,106 @@
+#include "../multiarch/noexec.c.inc"
+
+static void *arch_mcontext_pc(const mcontext_t *ctx)
+{
+return (void *)ctx->psw.addr;
+}
+
+static int arch_mcontext_arg(const mcontext_t *ctx)
+{
+return ctx->gregs[2];
+}
+
+static void arch_flush(void *p, int len)
+{
+}
+
+extern char noexec_1[];
+extern char noexec_2[];
+extern char noexec_end[];
+
+asm("noexec_1:\n"
+"   lgfi %r2,1\n"   /* %r2 is 0 on entry, set 1. */
+"noexec_2:\n"
+"   lgfi %r2,2\n"   /* %r2 is 0/1; set 2. */
+"   br %r14\n"  /* return */
+"noexec_end:");
+
+extern char exrl_1[];
+extern char exrl_2[];
+extern char exrl_end[];
+
+asm("exrl_1:\n"
+"   exrl %r0, exrl_2\n"
+"   br %r14\n"
+"exrl_2:\n"
+"   lgfi %r2,2\n"
+"exrl_end:");
+
+int main(void)
+{
+struct noexec_test noexec_tests[] = {
+{
+.name = "fallthrough",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2,
+.entry_ofs = noexec_1 - noexec_2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = 0,
+.expected_arg = 1,
+},
+{
+.name = "jump",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2,
+.entry_ofs = 0,
+.expected_si_ofs = 0,
+.expected_pc_ofs = 0,
+.expected_arg = 0,
+},
+{
+.name = "exrl",
+.test_code = exrl_1,
+.test_len = exrl_end - exrl_1,
+.page_ofs = exrl_1 - exrl_2,
+.entry_ofs = exrl_1 - exrl_2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = exrl_1 - exrl_2,
+.expected_arg = 0,
+},
+{
+.name = "fallthrough [cross]",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2 - 2,
+.entry_ofs = noexec_1 - noexec_2 - 2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = -2,
+.expected_arg = 1,
+},
+{
+.name = "jump [cross]",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2 - 2,
+.entry_ofs = -2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = -2,
+.expected_arg = 0,
+},
+{
+.name = "exrl [cross]",
+.test_code = exrl_1,
+.test_len = exrl_end - exrl_1,
+.page_ofs = exrl_1 - exrl_2 - 2,
+.entry_ofs = exrl_1 - exrl_2 - 2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = exrl_1 - exrl_2 - 2,
+

[PATCH v7 15/20] accel/tcg: Add pc and host_pc params to gen_intermediate_code

2022-08-23 Thread Richard Henderson
Pass these along to translator_loop -- pc may be used instead
of tb->pc, and host_pc is currently unused.  Adjust all targets
at one time.

Acked-by: Alistair Francis 
Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 include/exec/exec-all.h   |  1 -
 include/exec/translator.h | 24 
 accel/tcg/translate-all.c |  6 --
 accel/tcg/translator.c|  9 +
 target/alpha/translate.c  |  5 +++--
 target/arm/translate.c|  5 +++--
 target/avr/translate.c|  5 +++--
 target/cris/translate.c   |  5 +++--
 target/hexagon/translate.c|  6 --
 target/hppa/translate.c   |  5 +++--
 target/i386/tcg/translate.c   |  5 +++--
 target/loongarch/translate.c  |  6 --
 target/m68k/translate.c   |  5 +++--
 target/microblaze/translate.c |  5 +++--
 target/mips/tcg/translate.c   |  5 +++--
 target/nios2/translate.c  |  5 +++--
 target/openrisc/translate.c   |  6 --
 target/ppc/translate.c|  5 +++--
 target/riscv/translate.c  |  5 +++--
 target/rx/translate.c |  5 +++--
 target/s390x/tcg/translate.c  |  5 +++--
 target/sh4/translate.c|  5 +++--
 target/sparc/translate.c  |  5 +++--
 target/tricore/translate.c|  6 --
 target/xtensa/translate.c |  6 --
 25 files changed, 97 insertions(+), 53 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 9f35e3b7a9..bcad607c4e 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -39,7 +39,6 @@ typedef ram_addr_t tb_page_addr_t;
 #define TB_PAGE_ADDR_FMT RAM_ADDR_FMT
 #endif
 
-void gen_intermediate_code(CPUState *cpu, TranslationBlock *tb, int max_insns);
 void restore_state_to_opc(CPUArchState *env, TranslationBlock *tb,
   target_ulong *data);
 
diff --git a/include/exec/translator.h b/include/exec/translator.h
index 45b9268ca4..69db0f5c21 100644
--- a/include/exec/translator.h
+++ b/include/exec/translator.h
@@ -26,6 +26,19 @@
 #include "exec/translate-all.h"
 #include "tcg/tcg.h"
 
+/**
+ * gen_intermediate_code
+ * @cpu: cpu context
+ * @tb: translation block
+ * @max_insns: max number of instructions to translate
+ * @pc: guest virtual program counter address
+ * @host_pc: host physical program counter address
+ *
+ * This function must be provided by the target, which should create
+ * the target-specific DisasContext, and then invoke translator_loop.
+ */
+void gen_intermediate_code(CPUState *cpu, TranslationBlock *tb, int max_insns,
+   target_ulong pc, void *host_pc);
 
 /**
  * DisasJumpType:
@@ -123,11 +136,13 @@ typedef struct TranslatorOps {
 
 /**
  * translator_loop:
- * @ops: Target-specific operations.
- * @db: Disassembly context.
  * @cpu: Target vCPU.
  * @tb: Translation block.
  * @max_insns: Maximum number of insns to translate.
+ * @pc: guest virtual program counter address
+ * @host_pc: host physical program counter address
+ * @ops: Target-specific operations.
+ * @db: Disassembly context.
  *
  * Generic translator loop.
  *
@@ -141,8 +156,9 @@ typedef struct TranslatorOps {
  * - When single-stepping is enabled (system-wide or on the current vCPU).
  * - When too many instructions have been translated.
  */
-void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
- CPUState *cpu, TranslationBlock *tb, int max_insns);
+void translator_loop(CPUState *cpu, TranslationBlock *tb, int max_insns,
+ target_ulong pc, void *host_pc,
+ const TranslatorOps *ops, DisasContextBase *db);
 
 void translator_loop_temp_check(DisasContextBase *db);
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index b83161a081..587886aa4e 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -46,6 +46,7 @@
 
 #include "exec/cputlb.h"
 #include "exec/translate-all.h"
+#include "exec/translator.h"
 #include "qemu/bitmap.h"
 #include "qemu/qemu-print.h"
 #include "qemu/timer.h"
@@ -1392,11 +1393,12 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
TCGProfile *prof = &tcg_ctx->prof;
 int64_t ti;
 #endif
+void *host_pc;
 
 assert_memory_lock();
 qemu_thread_jit_write();
 
-phys_pc = get_page_addr_code(env, pc);
+phys_pc = get_page_addr_code_hostp(env, pc, &host_pc);
 
 if (phys_pc == -1) {
 /* Generate a one-shot TB with 1 insn in it */
@@ -1444,7 +1446,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 tcg_func_start(tcg_ctx);
 
 tcg_ctx->cpu = env_cpu(env);
-gen_intermediate_code(cpu, tb, max_insns);
+gen_intermediate_code(cpu, tb, max_insns, pc, host_pc);
 assert(tb->size != 0);
 tcg_ctx->cpu = NULL;
 max_insns = tb->icount;
diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c
index fe7af9b943..3eef30d93a 100644
--- a/accel/tcg/translator.c
+++ b/accel/tcg/translator.c
@@ -51,16 +51,17 @@ static inline void 

[PATCH v7 14/20] accel/tcg: Remove translator_ldsw

2022-08-23 Thread Richard Henderson
The only user can easily use translator_lduw and
adjust the type to signed during the return.

Reviewed-by: Alistair Francis 
Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 include/exec/translator.h   | 1 -
 target/i386/tcg/translate.c | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/exec/translator.h b/include/exec/translator.h
index 0d0bf3a31e..45b9268ca4 100644
--- a/include/exec/translator.h
+++ b/include/exec/translator.h
@@ -178,7 +178,6 @@ bool translator_use_goto_tb(DisasContextBase *db, 
target_ulong dest);
 
 #define FOR_EACH_TRANSLATOR_LD(F)   \
 F(translator_ldub, uint8_t, cpu_ldub_code, /* no swap */)   \
-F(translator_ldsw, int16_t, cpu_ldsw_code, bswap16) \
 F(translator_lduw, uint16_t, cpu_lduw_code, bswap16)\
 F(translator_ldl, uint32_t, cpu_ldl_code, bswap32)  \
 F(translator_ldq, uint64_t, cpu_ldq_code, bswap64)
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index b7972f0ff5..a23417d058 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2033,7 +2033,7 @@ static inline uint8_t x86_ldub_code(CPUX86State *env, 
DisasContext *s)
 
 static inline int16_t x86_ldsw_code(CPUX86State *env, DisasContext *s)
 {
-return translator_ldsw(env, >base, advance_pc(env, s, 2));
+return translator_lduw(env, >base, advance_pc(env, s, 2));
 }
 
 static inline uint16_t x86_lduw_code(CPUX86State *env, DisasContext *s)
-- 
2.34.1




[PATCH v7 18/20] target/i386: Make translator stop before the end of a page

2022-08-23 Thread Richard Henderson
From: Ilya Leoshkevich 

Right now the translator stops right *after* the end of a page, which
breaks reporting of fault locations when the last instruction of a
multi-insn translation block crosses a page boundary.

An implementation, like the one arm and s390x have, would require an
i386 length disassembler, which is burdensome to maintain. Another
alternative would be to single-step at the end of a guest page, but
this may come with a performance impact.

Fix by snapshotting disassembly state and restoring it after we figure
out we crossed a page boundary. This includes rolling back cc_op
updates and emitted ops.

Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Richard Henderson 
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1143
Message-Id: <20220817150506.592862-4-...@linux.ibm.com>
[rth: Simplify end-of-insn cross-page checks.]
Signed-off-by: Richard Henderson 
---
 target/i386/tcg/translate.c  | 64 ---
 tests/tcg/x86_64/noexec.c| 75 
 tests/tcg/x86_64/Makefile.target |  3 +-
 3 files changed, 116 insertions(+), 26 deletions(-)
 create mode 100644 tests/tcg/x86_64/noexec.c

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 4836c889e0..b184fe33b8 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -130,6 +130,7 @@ typedef struct DisasContext {
 TCGv_i64 tmp1_i64;
 
 sigjmp_buf jmpbuf;
+TCGOp *prev_insn_end;
 } DisasContext;
 
 /* The environment in which user-only runs is constrained. */
@@ -2008,6 +2009,12 @@ static uint64_t advance_pc(CPUX86State *env, 
DisasContext *s, int num_bytes)
 {
 uint64_t pc = s->pc;
 
+/* This is a subsequent insn that crosses a page boundary.  */
+if (s->base.num_insns > 1 &&
+!is_same_page(>base, s->pc + num_bytes - 1)) {
+siglongjmp(s->jmpbuf, 2);
+}
+
 s->pc += num_bytes;
 if (unlikely(s->pc - s->pc_start > X86_MAX_INSN_LENGTH)) {
 /* If the instruction's 16th byte is on a different page than the 1st, 
a
@@ -4556,6 +4563,8 @@ static target_ulong disas_insn(DisasContext *s, CPUState 
*cpu)
 int modrm, reg, rm, mod, op, opreg, val;
 target_ulong next_eip, tval;
 target_ulong pc_start = s->base.pc_next;
+bool orig_cc_op_dirty = s->cc_op_dirty;
+CCOp orig_cc_op = s->cc_op;
 
 s->pc_start = s->pc = pc_start;
 s->override = -1;
@@ -4568,9 +4577,22 @@ static target_ulong disas_insn(DisasContext *s, CPUState 
*cpu)
 s->rip_offset = 0; /* for relative ip address */
 s->vex_l = 0;
 s->vex_v = 0;
-if (sigsetjmp(s->jmpbuf, 0) != 0) {
+switch (sigsetjmp(s->jmpbuf, 0)) {
+case 0:
+break;
+case 1:
 gen_exception_gpf(s);
 return s->pc;
+case 2:
+/* Restore state that may affect the next instruction. */
+s->cc_op_dirty = orig_cc_op_dirty;
+s->cc_op = orig_cc_op;
+s->base.num_insns--;
+tcg_remove_ops_after(s->prev_insn_end);
+s->base.is_jmp = DISAS_TOO_MANY;
+return pc_start;
+default:
+g_assert_not_reached();
 }
 
 prefixes = 0;
@@ -8632,6 +8654,7 @@ static void i386_tr_insn_start(DisasContextBase *dcbase, 
CPUState *cpu)
 {
 DisasContext *dc = container_of(dcbase, DisasContext, base);
 
+dc->prev_insn_end = tcg_last_op();
 tcg_gen_insn_start(dc->base.pc_next, dc->cc_op);
 }
 
@@ -8652,31 +8675,22 @@ static void i386_tr_translate_insn(DisasContextBase 
*dcbase, CPUState *cpu)
 #endif
 
 pc_next = disas_insn(dc, cpu);
-
-if (dc->flags & (HF_TF_MASK | HF_INHIBIT_IRQ_MASK)) {
-/* if single step mode, we generate only one instruction and
-   generate an exception */
-/* if irq were inhibited with HF_INHIBIT_IRQ_MASK, we clear
-   the flag and abort the translation to give the irqs a
-   chance to happen */
-dc->base.is_jmp = DISAS_TOO_MANY;
-} else if ((tb_cflags(dc->base.tb) & CF_USE_ICOUNT)
-   && ((pc_next & TARGET_PAGE_MASK)
-   != ((pc_next + TARGET_MAX_INSN_SIZE - 1)
-   & TARGET_PAGE_MASK)
-   || (pc_next & ~TARGET_PAGE_MASK) == 0)) {
-/* Do not cross the boundary of the pages in icount mode,
-   it can cause an exception. Do it only when boundary is
-   crossed by the first instruction in the block.
-   If current instruction already crossed the bound - it's ok,
-   because an exception hasn't stopped this code.
- */
-dc->base.is_jmp = DISAS_TOO_MANY;
-} else if ((pc_next - dc->base.pc_first) >= (TARGET_PAGE_SIZE - 32)) {
-dc->base.is_jmp = DISAS_TOO_MANY;
-}
-
 dc->base.pc_next = pc_next;
+
+if (dc->base.is_jmp == DISAS_NEXT) {
+if (dc->flags & (HF_TF_MASK | HF_INHIBIT_IRQ_MASK)) {
+/*
+ * If single step mode, we generate only one instruction and
+ * generate an exception.
+ 

[PATCH v7 19/20] target/riscv: Add MAX_INSN_LEN and insn_len

2022-08-23 Thread Richard Henderson
These will be useful in properly ending the TB.

Reviewed-by: Alistair Francis 
Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 target/riscv/translate.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index 38666ddc91..a719aa6e63 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -1022,6 +1022,14 @@ static uint32_t opcode_at(DisasContextBase *dcbase, 
target_ulong pc)
 /* Include decoders for factored-out extensions */
 #include "decode-XVentanaCondOps.c.inc"
 
+/* The specification allows for longer insns, but not supported by qemu. */
+#define MAX_INSN_LEN  4
+
+static inline int insn_len(uint16_t first_word)
+{
+return (first_word & 3) == 3 ? 4 : 2;
+}
+
 static void decode_opc(CPURISCVState *env, DisasContext *ctx, uint16_t opcode)
 {
 /*
@@ -1037,7 +1045,7 @@ static void decode_opc(CPURISCVState *env, DisasContext 
*ctx, uint16_t opcode)
 };
 
 /* Check for compressed insn */
-if (extract16(opcode, 0, 2) != 3) {
+if (insn_len(opcode) == 2) {
 if (!has_ext(ctx, RVC)) {
 gen_exception_illegal(ctx);
 } else {
-- 
2.34.1




[PATCH v7 20/20] target/riscv: Make translator stop before the end of a page

2022-08-23 Thread Richard Henderson
Right now the translator stops right *after* the end of a page, which
breaks reporting of fault locations when the last instruction of a
multi-insn translation block crosses a page boundary.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1155
Reviewed-by: Alistair Francis 
Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 target/riscv/translate.c  | 17 +--
 tests/tcg/riscv64/noexec.c| 79 +++
 tests/tcg/riscv64/Makefile.target |  1 +
 3 files changed, 93 insertions(+), 4 deletions(-)
 create mode 100644 tests/tcg/riscv64/noexec.c

diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index a719aa6e63..f8af6daa70 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -1154,12 +1154,21 @@ static void riscv_tr_translate_insn(DisasContextBase 
*dcbase, CPUState *cpu)
 }
 ctx->nftemp = 0;
 
+/* Only the first insn within a TB is allowed to cross a page boundary. */
 if (ctx->base.is_jmp == DISAS_NEXT) {
-target_ulong page_start;
-
-page_start = ctx->base.pc_first & TARGET_PAGE_MASK;
-if (ctx->base.pc_next - page_start >= TARGET_PAGE_SIZE) {
+if (!is_same_page(&ctx->base, ctx->base.pc_next)) {
 ctx->base.is_jmp = DISAS_TOO_MANY;
+} else {
+unsigned page_ofs = ctx->base.pc_next & ~TARGET_PAGE_MASK;
+
+if (page_ofs > TARGET_PAGE_SIZE - MAX_INSN_LEN) {
+uint16_t next_insn = cpu_lduw_code(env, ctx->base.pc_next);
+int len = insn_len(next_insn);
+
+if (!is_same_page(&ctx->base, ctx->base.pc_next + len)) {
+ctx->base.is_jmp = DISAS_TOO_MANY;
+}
+}
 }
 }
 }
diff --git a/tests/tcg/riscv64/noexec.c b/tests/tcg/riscv64/noexec.c
new file mode 100644
index 00..86f64b28db
--- /dev/null
+++ b/tests/tcg/riscv64/noexec.c
@@ -0,0 +1,79 @@
+#include "../multiarch/noexec.c.inc"
+
+static void *arch_mcontext_pc(const mcontext_t *ctx)
+{
+return (void *)ctx->__gregs[REG_PC];
+}
+
+static int arch_mcontext_arg(const mcontext_t *ctx)
+{
+return ctx->__gregs[REG_A0];
+}
+
+static void arch_flush(void *p, int len)
+{
+__builtin___clear_cache(p, p + len);
+}
+
+extern char noexec_1[];
+extern char noexec_2[];
+extern char noexec_end[];
+
+asm(".option push\n"
+".option norvc\n"
+"noexec_1:\n"
+"   li a0,1\n"   /* a0 is 0 on entry, set 1. */
+"noexec_2:\n"
+"   li a0,2\n"  /* a0 is 0/1; set 2. */
+"   ret\n"
+"noexec_end:\n"
+".option pop");
+
+int main(void)
+{
+struct noexec_test noexec_tests[] = {
+{
+.name = "fallthrough",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2,
+.entry_ofs = noexec_1 - noexec_2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = 0,
+.expected_arg = 1,
+},
+{
+.name = "jump",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2,
+.entry_ofs = 0,
+.expected_si_ofs = 0,
+.expected_pc_ofs = 0,
+.expected_arg = 0,
+},
+{
+.name = "fallthrough [cross]",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2 - 2,
+.entry_ofs = noexec_1 - noexec_2 - 2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = -2,
+.expected_arg = 1,
+},
+{
+.name = "jump [cross]",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2 - 2,
+.entry_ofs = -2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = -2,
+.expected_arg = 0,
+},
+};
+
+return test_noexec(noexec_tests,
+   sizeof(noexec_tests) / sizeof(noexec_tests[0]));
+}
diff --git a/tests/tcg/riscv64/Makefile.target 
b/tests/tcg/riscv64/Makefile.target
index d41bf6d60d..b5b89dfb0e 100644
--- a/tests/tcg/riscv64/Makefile.target
+++ b/tests/tcg/riscv64/Makefile.target
@@ -3,3 +3,4 @@
 
 VPATH += $(SRC_PATH)/tests/tcg/riscv64
 TESTS += test-div
+TESTS += noexec
-- 
2.34.1




[PATCH v7 05/20] linux-user: Clear translations and tb_jmp_cache on mprotect()

2022-08-23 Thread Richard Henderson
From: Ilya Leoshkevich 

Currently it's possible to execute pages that do not have PAGE_EXEC
if there is an existing translation block. Fix by clearing tb_jmp_cache
and invalidating TBs, which forces recheck of permission bits.

Signed-off-by: Ilya Leoshkevich 
Message-Id: <20220817150506.592862-2-...@linux.ibm.com>
[rth: Invalidate is required -- e.g. riscv fallthrough cross test]
Signed-off-by: Richard Henderson 

---
 linux-user/mmap.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/linux-user/mmap.c b/linux-user/mmap.c
index 048c4135af..e9dc8848be 100644
--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -115,6 +115,7 @@ int target_mprotect(abi_ulong start, abi_ulong len, int 
target_prot)
 {
 abi_ulong end, host_start, host_end, addr;
 int prot1, ret, page_flags, host_prot;
+CPUState *cpu;
 
 trace_target_mprotect(start, len, target_prot);
 
@@ -177,7 +178,14 @@ int target_mprotect(abi_ulong start, abi_ulong len, int 
target_prot)
 goto error;
 }
 }
+
 page_set_flags(start, start + len, page_flags);
+tb_invalidate_phys_range(start, start + len);
+
+CPU_FOREACH(cpu) {
+cpu_tb_jmp_cache_clear(cpu);
+}
+
 mmap_unlock();
 return 0;
 error:
-- 
2.34.1




[PATCH v7 12/20] accel/tcg: Use probe_access_internal for softmmu get_page_addr_code_hostp

2022-08-23 Thread Richard Henderson
Simplify the implementation of get_page_addr_code_hostp
by reusing the existing probe_access infrastructure.

Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 accel/tcg/cputlb.c | 76 --
 1 file changed, 26 insertions(+), 50 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 80a3eb4f1c..8fad2d9b83 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1482,56 +1482,6 @@ static bool victim_tlb_hit(CPUArchState *env, size_t 
mmu_idx, size_t index,
   victim_tlb_hit(env, mmu_idx, index, offsetof(CPUTLBEntry, TY), \
  (ADDR) & TARGET_PAGE_MASK)
 
-/*
- * Return a ram_addr_t for the virtual address for execution.
- *
- * Return -1 if we can't translate and execute from an entire page
- * of RAM.  This will force us to execute by loading and translating
- * one insn at a time, without caching.
- *
- * NOTE: This function will trigger an exception if the page is
- * not executable.
- */
-tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
-void **hostp)
-{
-uintptr_t mmu_idx = cpu_mmu_index(env, true);
-uintptr_t index = tlb_index(env, mmu_idx, addr);
-CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
-void *p;
-
-if (unlikely(!tlb_hit(entry->addr_code, addr))) {
-if (!VICTIM_TLB_HIT(addr_code, addr)) {
-tlb_fill(env_cpu(env), addr, 0, MMU_INST_FETCH, mmu_idx, 0);
-index = tlb_index(env, mmu_idx, addr);
-entry = tlb_entry(env, mmu_idx, addr);
-
-if (unlikely(entry->addr_code & TLB_INVALID_MASK)) {
-/*
- * The MMU protection covers a smaller range than a target
- * page, so we must redo the MMU check for every insn.
- */
-return -1;
-}
-}
-assert(tlb_hit(entry->addr_code, addr));
-}
-
-if (unlikely(entry->addr_code & TLB_MMIO)) {
-/* The region is not backed by RAM.  */
-if (hostp) {
-*hostp = NULL;
-}
-return -1;
-}
-
-p = (void *)((uintptr_t)addr + entry->addend);
-if (hostp) {
-*hostp = p;
-}
-return qemu_ram_addr_from_host_nofail(p);
-}
-
 static void notdirty_write(CPUState *cpu, vaddr mem_vaddr, unsigned size,
CPUIOTLBEntry *iotlbentry, uintptr_t retaddr)
 {
@@ -1687,6 +1637,32 @@ void *tlb_vaddr_to_host(CPUArchState *env, abi_ptr addr,
 return flags ? NULL : host;
 }
 
+/*
+ * Return a ram_addr_t for the virtual address for execution.
+ *
+ * Return -1 if we can't translate and execute from an entire page
+ * of RAM.  This will force us to execute by loading and translating
+ * one insn at a time, without caching.
+ *
+ * NOTE: This function will trigger an exception if the page is
+ * not executable.
+ */
+tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
+void **hostp)
+{
+void *p;
+
+(void)probe_access_internal(env, addr, 1, MMU_INST_FETCH,
+cpu_mmu_index(env, true), false, &p, 0);
+if (p == NULL) {
+return -1;
+}
+if (hostp) {
+*hostp = p;
+}
+return qemu_ram_addr_from_host_nofail(p);
+}
+
 #ifdef CONFIG_PLUGIN
 /*
  * Perform a TLB lookup and populate the qemu_plugin_hwaddr structure.
-- 
2.34.1




[PATCH v7 13/20] accel/tcg: Document the faulting lookup in tb_lookup_cmp

2022-08-23 Thread Richard Henderson
It was non-obvious to me why we can raise an exception in
the middle of a comparison function, but it works.
While nearby, use TARGET_PAGE_ALIGN instead of open-coding.

Signed-off-by: Richard Henderson 
---
 accel/tcg/cpu-exec.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 7887af6f45..5f43b9769a 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -198,7 +198,16 @@ static bool tb_lookup_cmp(const void *p, const void *d)
 tb_page_addr_t phys_page2;
 target_ulong virt_page2;
 
-virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
+/*
+ * We know that the first page matched, and an otherwise valid TB
+ * encountered an incomplete instruction at the end of that page,
+ * therefore we know that generating a new TB from the current PC
+ * must also require reading from the next page -- even if the
+ * second pages do not match, and therefore the resulting insn
+ * is different for the new TB.  Therefore any exception raised
+ * here by the faulting lookup is not premature.
+ */
+virt_page2 = TARGET_PAGE_ALIGN(desc->pc);
 phys_page2 = get_page_addr_code(desc->env, virt_page2);
 if (tb->page_addr[1] == phys_page2) {
 return true;
-- 
2.34.1




[PATCH v7 07/20] accel/tcg: Introduce is_same_page()

2022-08-23 Thread Richard Henderson
From: Ilya Leoshkevich 

Introduce a function that checks whether a given address is on the same
page as where disassembly started. Having it improves readability of
the following patches.

Reviewed-by: Alistair Francis 
Signed-off-by: Ilya Leoshkevich 
Message-Id: <20220811095534.241224-3-...@linux.ibm.com>
Reviewed-by: Richard Henderson 
[rth: Make the DisasContextBase parameter const.]
Signed-off-by: Richard Henderson 
---
 include/exec/translator.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/exec/translator.h b/include/exec/translator.h
index 7db6845535..0d0bf3a31e 100644
--- a/include/exec/translator.h
+++ b/include/exec/translator.h
@@ -187,4 +187,14 @@ FOR_EACH_TRANSLATOR_LD(GEN_TRANSLATOR_LD)
 
 #undef GEN_TRANSLATOR_LD
 
+/*
+ * Return whether addr is on the same page as where disassembly started.
+ * Translators can use this to enforce the rule that only single-insn
+ * translation blocks are allowed to cross page boundaries.
+ */
+static inline bool is_same_page(const DisasContextBase *db, target_ulong addr)
+{
+return ((addr ^ db->pc_first) & TARGET_PAGE_MASK) == 0;
+}
+
 #endif /* EXEC__TRANSLATOR_H */
-- 
2.34.1




[PATCH v7 06/20] tests/tcg/i386: Move smc_code2 to an executable section

2022-08-23 Thread Richard Henderson
We're about to start validating PAGE_EXEC, which means
that we've got to put this code into a section that is
both writable and executable.

Note that this test did not run on hardware beforehand either.

Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 tests/tcg/i386/test-i386.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/tcg/i386/test-i386.c b/tests/tcg/i386/test-i386.c
index ac8d5a3c1f..e6b308a2c0 100644
--- a/tests/tcg/i386/test-i386.c
+++ b/tests/tcg/i386/test-i386.c
@@ -1998,7 +1998,7 @@ uint8_t code[] = {
 0xc3, /* ret */
 };
 
-asm(".section \".data\"\n"
+asm(".section \".data_x\",\"awx\"\n"
 "smc_code2:\n"
 "movl 4(%esp), %eax\n"
 "movl %eax, smc_patch_addr2 + 1\n"
-- 
2.34.1




[PATCH v7 10/20] accel/tcg: Make tb_htable_lookup static

2022-08-23 Thread Richard Henderson
The function is not used outside of cpu-exec.c.  Move it and
its subroutines up in the file, before the first use.

Reviewed-by: Alistair Francis 
Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 include/exec/exec-all.h |   3 -
 accel/tcg/cpu-exec.c| 122 
 2 files changed, 61 insertions(+), 64 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 0475ec6007..9f35e3b7a9 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -552,9 +552,6 @@ void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr, 
MemTxAttrs attrs);
 #endif
 void tb_flush(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
-TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
-   target_ulong cs_base, uint32_t flags,
-   uint32_t cflags);
 void tb_set_jmp_target(TranslationBlock *tb, int n, uintptr_t addr);
 
 /* GETPC is the true target of the return instruction that we'll execute.  */
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index d18081ca6f..7887af6f45 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -170,6 +170,67 @@ uint32_t curr_cflags(CPUState *cpu)
 return cflags;
 }
 
+struct tb_desc {
+target_ulong pc;
+target_ulong cs_base;
+CPUArchState *env;
+tb_page_addr_t phys_page1;
+uint32_t flags;
+uint32_t cflags;
+uint32_t trace_vcpu_dstate;
+};
+
+static bool tb_lookup_cmp(const void *p, const void *d)
+{
+const TranslationBlock *tb = p;
+const struct tb_desc *desc = d;
+
+if (tb->pc == desc->pc &&
+tb->page_addr[0] == desc->phys_page1 &&
+tb->cs_base == desc->cs_base &&
+tb->flags == desc->flags &&
+tb->trace_vcpu_dstate == desc->trace_vcpu_dstate &&
+tb_cflags(tb) == desc->cflags) {
+/* check next page if needed */
+if (tb->page_addr[1] == -1) {
+return true;
+} else {
+tb_page_addr_t phys_page2;
+target_ulong virt_page2;
+
+virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
+phys_page2 = get_page_addr_code(desc->env, virt_page2);
+if (tb->page_addr[1] == phys_page2) {
+return true;
+}
+}
+}
+return false;
+}
+
+static TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
+  target_ulong cs_base, uint32_t flags,
+  uint32_t cflags)
+{
+tb_page_addr_t phys_pc;
+struct tb_desc desc;
+uint32_t h;
+
+desc.env = cpu->env_ptr;
+desc.cs_base = cs_base;
+desc.flags = flags;
+desc.cflags = cflags;
+desc.trace_vcpu_dstate = *cpu->trace_dstate;
+desc.pc = pc;
+phys_pc = get_page_addr_code(desc.env, pc);
+if (phys_pc == -1) {
+return NULL;
+}
+desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
+h = tb_hash_func(phys_pc, pc, flags, cflags, *cpu->trace_dstate);
+return qht_lookup_custom(&tb_ctx.htable, &desc, h, tb_lookup_cmp);
+}
+
 /* Might cause an exception, so have a longjmp destination ready */
 static inline TranslationBlock *tb_lookup(CPUState *cpu, target_ulong pc,
   target_ulong cs_base,
@@ -485,67 +546,6 @@ void cpu_exec_step_atomic(CPUState *cpu)
 end_exclusive();
 }
 
-struct tb_desc {
-target_ulong pc;
-target_ulong cs_base;
-CPUArchState *env;
-tb_page_addr_t phys_page1;
-uint32_t flags;
-uint32_t cflags;
-uint32_t trace_vcpu_dstate;
-};
-
-static bool tb_lookup_cmp(const void *p, const void *d)
-{
-const TranslationBlock *tb = p;
-const struct tb_desc *desc = d;
-
-if (tb->pc == desc->pc &&
-tb->page_addr[0] == desc->phys_page1 &&
-tb->cs_base == desc->cs_base &&
-tb->flags == desc->flags &&
-tb->trace_vcpu_dstate == desc->trace_vcpu_dstate &&
-tb_cflags(tb) == desc->cflags) {
-/* check next page if needed */
-if (tb->page_addr[1] == -1) {
-return true;
-} else {
-tb_page_addr_t phys_page2;
-target_ulong virt_page2;
-
-virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
-phys_page2 = get_page_addr_code(desc->env, virt_page2);
-if (tb->page_addr[1] == phys_page2) {
-return true;
-}
-}
-}
-return false;
-}
-
-TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
-   target_ulong cs_base, uint32_t flags,
-   uint32_t cflags)
-{
-tb_page_addr_t phys_pc;
-struct tb_desc desc;
-uint32_t h;
-
-desc.env = cpu->env_ptr;
-desc.cs_base = cs_base;
-desc.flags = flags;
-desc.cflags = cflags;
-

[PATCH v7 11/20] accel/tcg: Move qemu_ram_addr_from_host_nofail to physmem.c

2022-08-23 Thread Richard Henderson
The base qemu_ram_addr_from_host function is already in
softmmu/physmem.c; move the nofail version to be adjacent.

Reviewed-by: Alistair Francis 
Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 include/exec/cpu-common.h |  1 +
 accel/tcg/cputlb.c| 12 
 softmmu/physmem.c | 12 
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 2281be4e10..d909429427 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -72,6 +72,7 @@ typedef uintptr_t ram_addr_t;
 void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
 /* This should not be used by devices.  */
 ram_addr_t qemu_ram_addr_from_host(void *ptr);
+ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
 RAMBlock *qemu_ram_block_by_name(const char *name);
 RAMBlock *qemu_ram_block_from_host(void *ptr, bool round_offset,
ram_addr_t *offset);
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 43bd65c973..80a3eb4f1c 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1283,18 +1283,6 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
 prot, mmu_idx, size);
 }
 
-static inline ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr)
-{
-ram_addr_t ram_addr;
-
-ram_addr = qemu_ram_addr_from_host(ptr);
-if (ram_addr == RAM_ADDR_INVALID) {
-error_report("Bad ram pointer %p", ptr);
-abort();
-}
-return ram_addr;
-}
-
 /*
  * Note: tlb_fill() can trigger a resize of the TLB. This means that all of the
  * caller's prior references to the TLB table (e.g. CPUTLBEntry pointers) must
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index dc3c3e5f2e..d4c30e99ea 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2460,6 +2460,18 @@ ram_addr_t qemu_ram_addr_from_host(void *ptr)
 return block->offset + offset;
 }
 
+ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr)
+{
+ram_addr_t ram_addr;
+
+ram_addr = qemu_ram_addr_from_host(ptr);
+if (ram_addr == RAM_ADDR_INVALID) {
+error_report("Bad ram pointer %p", ptr);
+abort();
+}
+return ram_addr;
+}
+
 static MemTxResult flatview_read(FlatView *fv, hwaddr addr,
  MemTxAttrs attrs, void *buf, hwaddr len);
 static MemTxResult flatview_write(FlatView *fv, hwaddr addr, MemTxAttrs attrs,
-- 
2.34.1




[PATCH v7 04/20] linux-user: Honor PT_GNU_STACK

2022-08-23 Thread Richard Henderson
Map the stack executable if required by default or on demand.

Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 include/elf.h|  1 +
 linux-user/qemu.h|  1 +
 linux-user/elfload.c | 19 ++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/elf.h b/include/elf.h
index 3a4bcb646a..3d6b9062c0 100644
--- a/include/elf.h
+++ b/include/elf.h
@@ -31,6 +31,7 @@ typedef int64_t  Elf64_Sxword;
 #define PT_LOPROC  0x7000
 #define PT_HIPROC  0x7fff
 
+#define PT_GNU_STACK  (PT_LOOS + 0x474e551)
 #define PT_GNU_PROPERTY   (PT_LOOS + 0x474e553)
 
 #define PT_MIPS_REGINFO   0x7000
diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index 7d90de1b15..e2e93fbd1d 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -48,6 +48,7 @@ struct image_info {
 uint32_telf_flags;
 int personality;
 abi_ulong   alignment;
+boolexec_stack;
 
 /* Generic semihosting knows about these pointers. */
 abi_ulong   arg_strings;   /* strings for argv */
diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index b20d513929..90375c6b74 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -232,6 +232,7 @@ static bool init_guest_commpage(void)
 #define ELF_ARCHEM_386
 
 #define ELF_PLATFORM get_elf_platform()
+#define EXSTACK_DEFAULT true
 
 static const char *get_elf_platform(void)
 {
@@ -308,6 +309,7 @@ static void elf_core_copy_regs(target_elf_gregset_t *regs, 
const CPUX86State *en
 
 #define ELF_ARCHEM_ARM
 #define ELF_CLASS   ELFCLASS32
+#define EXSTACK_DEFAULT true
 
 static inline void init_thread(struct target_pt_regs *regs,
struct image_info *infop)
@@ -776,6 +778,7 @@ static inline void init_thread(struct target_pt_regs *regs,
 #else
 
 #define ELF_CLASS   ELFCLASS32
+#define EXSTACK_DEFAULT true
 
 #endif
 
@@ -973,6 +976,7 @@ static void elf_core_copy_regs(target_elf_gregset_t *regs, 
const CPUPPCState *en
 
 #define ELF_CLASS   ELFCLASS64
 #define ELF_ARCHEM_LOONGARCH
+#define EXSTACK_DEFAULT true
 
 #define elf_check_arch(x) ((x) == EM_LOONGARCH)
 
@@ -1068,6 +1072,7 @@ static uint32_t get_elf_hwcap(void)
 #define ELF_CLASS   ELFCLASS32
 #endif
 #define ELF_ARCHEM_MIPS
+#define EXSTACK_DEFAULT true
 
 #ifdef TARGET_ABI_MIPSN32
 #define elf_check_abi(x) ((x) & EF_MIPS_ABI2)
@@ -1806,6 +1811,10 @@ static inline void init_thread(struct target_pt_regs 
*regs,
 #define bswaptls(ptr) bswap32s(ptr)
 #endif
 
+#ifndef EXSTACK_DEFAULT
+#define EXSTACK_DEFAULT false
+#endif
+
 #include "elf.h"
 
 /* We must delay the following stanzas until after "elf.h". */
@@ -2081,6 +2090,7 @@ static abi_ulong setup_arg_pages(struct linux_binprm 
*bprm,
  struct image_info *info)
 {
 abi_ulong size, error, guard;
+int prot;
 
 size = guest_stack_size;
 if (size < STACK_LOWER_LIMIT) {
@@ -2091,7 +2101,11 @@ static abi_ulong setup_arg_pages(struct linux_binprm 
*bprm,
 guard = qemu_real_host_page_size();
 }
 
-error = target_mmap(0, size + guard, PROT_READ | PROT_WRITE,
+prot = PROT_READ | PROT_WRITE;
+if (info->exec_stack) {
+prot |= PROT_EXEC;
+}
+error = target_mmap(0, size + guard, prot,
 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 if (error == -1) {
 perror("mmap stack");
@@ -2921,6 +2935,7 @@ static void load_elf_image(const char *image_name, int 
image_fd,
  */
 loaddr = -1, hiaddr = 0;
 info->alignment = 0;
+info->exec_stack = EXSTACK_DEFAULT;
 for (i = 0; i < ehdr->e_phnum; ++i) {
 struct elf_phdr *eppnt = phdr + i;
 if (eppnt->p_type == PT_LOAD) {
@@ -2963,6 +2978,8 @@ static void load_elf_image(const char *image_name, int 
image_fd,
 if (!parse_elf_properties(image_fd, info, eppnt, bprm_buf, &err)) {
 goto exit_errmsg;
 }
+} else if (eppnt->p_type == PT_GNU_STACK) {
+info->exec_stack = eppnt->p_flags & PF_X;
 }
 }
 
-- 
2.34.1




[PATCH v7 03/20] linux-user/x86_64: Allocate vsyscall page as a commpage

2022-08-23 Thread Richard Henderson
We're about to start validating PAGE_EXEC, which means that we've
got to make the vsyscall page executable.  We had been special-casing
this entirely within translate.

Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 linux-user/elfload.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index 29d910c4cc..b20d513929 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -195,6 +195,27 @@ static void elf_core_copy_regs(target_elf_gregset_t *regs, 
const CPUX86State *en
 (*regs)[26] = tswapreg(env->segs[R_GS].selector & 0x);
 }
 
+#if ULONG_MAX >= TARGET_VSYSCALL_PAGE
+#define INIT_GUEST_COMMPAGE
+static bool init_guest_commpage(void)
+{
+/*
+ * The vsyscall page is at a high negative address aka kernel space,
+ * which means that we cannot actually allocate it with target_mmap.
+ * We still should be able to use page_set_flags, unless the user
+ * has specified -R reserved_va, which would trigger an assert().
+ */
+if (reserved_va != 0 &&
+TARGET_VSYSCALL_PAGE + TARGET_PAGE_SIZE >= reserved_va) {
+error_report("Cannot allocate vsyscall page");
+exit(EXIT_FAILURE);
+}
+page_set_flags(TARGET_VSYSCALL_PAGE,
+   TARGET_VSYSCALL_PAGE + TARGET_PAGE_SIZE,
+   PAGE_EXEC | PAGE_VALID);
+return true;
+}
+#endif
 #else
 
 #define ELF_START_MMAP 0x8000
@@ -2360,8 +2381,10 @@ static abi_ulong create_elf_tables(abi_ulong p, int 
argc, int envc,
 #else
 #define HI_COMMPAGE 0
 #define LO_COMMPAGE -1
+#ifndef INIT_GUEST_COMMPAGE
 #define init_guest_commpage() true
 #endif
+#endif
 
 static void pgb_fail_in_use(const char *image_name)
 {
-- 
2.34.1




[PATCH v7 08/20] accel/tcg: Properly implement get_page_addr_code for user-only

2022-08-23 Thread Richard Henderson
The current implementation is a no-op, simply returning addr.
This is incorrect, because we ought to be checking the page
permissions for execution.

Make get_page_addr_code inline for both implementations.

Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Acked-by: Alistair Francis 
Signed-off-by: Richard Henderson 
---
 include/exec/exec-all.h | 85 ++---
 accel/tcg/cputlb.c  |  5 ---
 accel/tcg/user-exec.c   | 15 
 3 files changed, 43 insertions(+), 62 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 311e5fb422..0475ec6007 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -598,43 +598,44 @@ struct MemoryRegionSection *iotlb_to_section(CPUState 
*cpu,
  hwaddr index, MemTxAttrs attrs);
 #endif
 
-#if defined(CONFIG_USER_ONLY)
-void mmap_lock(void);
-void mmap_unlock(void);
-bool have_mmap_lock(void);
-
 /**
- * get_page_addr_code() - user-mode version
+ * get_page_addr_code_hostp()
  * @env: CPUArchState
  * @addr: guest virtual address of guest code
  *
- * Returns @addr.
+ * See get_page_addr_code() (full-system version) for documentation on the
+ * return value.
+ *
+ * Sets *@hostp (when @hostp is non-NULL) as follows.
+ * If the return value is -1, sets *@hostp to NULL. Otherwise, sets *@hostp
+ * to the host address where @addr's content is kept.
+ *
+ * Note: this function can trigger an exception.
+ */
+tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
+void **hostp);
+
+/**
+ * get_page_addr_code()
+ * @env: CPUArchState
+ * @addr: guest virtual address of guest code
+ *
+ * If we cannot translate and execute from the entire RAM page, or if
+ * the region is not backed by RAM, returns -1. Otherwise, returns the
+ * ram_addr_t corresponding to the guest code at @addr.
+ *
+ * Note: this function can trigger an exception.
  */
 static inline tb_page_addr_t get_page_addr_code(CPUArchState *env,
 target_ulong addr)
 {
-return addr;
+return get_page_addr_code_hostp(env, addr, NULL);
 }
 
-/**
- * get_page_addr_code_hostp() - user-mode version
- * @env: CPUArchState
- * @addr: guest virtual address of guest code
- *
- * Returns @addr.
- *
- * If @hostp is non-NULL, sets *@hostp to the host address where @addr's 
content
- * is kept.
- */
-static inline tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env,
-  target_ulong addr,
-  void **hostp)
-{
-if (hostp) {
-*hostp = g2h_untagged(addr);
-}
-return addr;
-}
+#if defined(CONFIG_USER_ONLY)
+void mmap_lock(void);
+void mmap_unlock(void);
+bool have_mmap_lock(void);
 
 /**
  * adjust_signal_pc:
@@ -691,36 +692,6 @@ G_NORETURN void cpu_loop_exit_sigbus(CPUState *cpu, 
target_ulong addr,
 static inline void mmap_lock(void) {}
 static inline void mmap_unlock(void) {}
 
-/**
- * get_page_addr_code() - full-system version
- * @env: CPUArchState
- * @addr: guest virtual address of guest code
- *
- * If we cannot translate and execute from the entire RAM page, or if
- * the region is not backed by RAM, returns -1. Otherwise, returns the
- * ram_addr_t corresponding to the guest code at @addr.
- *
- * Note: this function can trigger an exception.
- */
-tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr);
-
-/**
- * get_page_addr_code_hostp() - full-system version
- * @env: CPUArchState
- * @addr: guest virtual address of guest code
- *
- * See get_page_addr_code() (full-system version) for documentation on the
- * return value.
- *
- * Sets *@hostp (when @hostp is non-NULL) as follows.
- * If the return value is -1, sets *@hostp to NULL. Otherwise, sets *@hostp
- * to the host address where @addr's content is kept.
- *
- * Note: this function can trigger an exception.
- */
-tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
-void **hostp);
-
 void tlb_reset_dirty(CPUState *cpu, ram_addr_t start1, ram_addr_t length);
 void tlb_set_dirty(CPUState *cpu, target_ulong vaddr);
 
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index a46f3a654d..43bd65c973 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1544,11 +1544,6 @@ tb_page_addr_t get_page_addr_code_hostp(CPUArchState 
*env, target_ulong addr,
 return qemu_ram_addr_from_host_nofail(p);
 }
 
-tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr)
-{
-return get_page_addr_code_hostp(env, addr, NULL);
-}
-
 static void notdirty_write(CPUState *cpu, vaddr mem_vaddr, unsigned size,
CPUIOTLBEntry *iotlbentry, uintptr_t retaddr)
 {
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index 20ada5472b..cd232967e6 100644
--- a/accel/tcg/user-exec.c
+++ 

[PATCH v7 09/20] accel/tcg: Unlock mmap_lock after longjmp

2022-08-23 Thread Richard Henderson
The mmap_lock is held around tb_gen_code.  While the comment
is correct that the lock is dropped when tb_gen_code runs out
of memory, the lock is *not* dropped when an exception is
raised reading code for translation.

Acked-by: Alistair Francis 
Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 accel/tcg/cpu-exec.c  | 12 ++--
 accel/tcg/user-exec.c |  3 ---
 2 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index a565a3f8ec..d18081ca6f 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -462,13 +462,11 @@ void cpu_exec_step_atomic(CPUState *cpu)
cpu_tb_exec(cpu, tb, &tb_exit);
 cpu_exec_exit(cpu);
 } else {
-/*
- * The mmap_lock is dropped by tb_gen_code if it runs out of
- * memory.
- */
 #ifndef CONFIG_SOFTMMU
 clear_helper_retaddr();
-tcg_debug_assert(!have_mmap_lock());
+if (have_mmap_lock()) {
+mmap_unlock();
+}
 #endif
 if (qemu_mutex_iothread_locked()) {
 qemu_mutex_unlock_iothread();
@@ -936,7 +934,9 @@ int cpu_exec(CPUState *cpu)
 
 #ifndef CONFIG_SOFTMMU
 clear_helper_retaddr();
-tcg_debug_assert(!have_mmap_lock());
+if (have_mmap_lock()) {
+mmap_unlock();
+}
 #endif
 if (qemu_mutex_iothread_locked()) {
 qemu_mutex_unlock_iothread();
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index cd232967e6..a27d814f19 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -80,10 +80,7 @@ MMUAccessType adjust_signal_pc(uintptr_t *pc, bool is_write)
  * (and if the translator doesn't handle page boundaries correctly
  * there's little we can do about that here).  Therefore, do not
  * trigger the unwinder.
- *
- * Like tb_gen_code, release the memory lock before cpu_loop_exit.
  */
-mmap_unlock();
 *pc = 0;
 return MMU_INST_FETCH;
 }
-- 
2.34.1




[PATCH v7 00/20] linux-user: Fix siginfo_t contents when jumping to non-readable pages

2022-08-23 Thread Richard Henderson
Changes from v6:
  * Fix an unintentional behaviour change in patches 8 & 12, which
the old patches 13 & 14 (now removed) had been working around.
  * Added a new documentation patch 13.


r~


Ilya Leoshkevich (4):
  linux-user: Clear translations and tb_jmp_cache on mprotect()
  accel/tcg: Introduce is_same_page()
  target/s390x: Make translator stop before the end of a page
  target/i386: Make translator stop before the end of a page

Richard Henderson (16):
  linux-user/arm: Mark the commpage executable
  linux-user/hppa: Allocate page zero as a commpage
  linux-user/x86_64: Allocate vsyscall page as a commpage
  linux-user: Honor PT_GNU_STACK
  tests/tcg/i386: Move smc_code2 to an executable section
  accel/tcg: Properly implement get_page_addr_code for user-only
  accel/tcg: Unlock mmap_lock after longjmp
  accel/tcg: Make tb_htable_lookup static
  accel/tcg: Move qemu_ram_addr_from_host_nofail to physmem.c
  accel/tcg: Use probe_access_internal for softmmu
get_page_addr_code_hostp
  accel/tcg: Document the faulting lookup in tb_lookup_cmp
  accel/tcg: Remove translator_ldsw
  accel/tcg: Add pc and host_pc params to gen_intermediate_code
  accel/tcg: Add fast path for translator_ld*
  target/riscv: Add MAX_INSN_LEN and insn_len
  target/riscv: Make translator stop before the end of a page

 include/elf.h |   1 +
 include/exec/cpu-common.h |   1 +
 include/exec/exec-all.h   |  89 ++-
 include/exec/translator.h |  96 +---
 linux-user/arm/target_cpu.h   |   4 +-
 linux-user/qemu.h |   1 +
 accel/tcg/cpu-exec.c  | 143 --
 accel/tcg/cputlb.c|  93 ++-
 accel/tcg/translate-all.c |  29 +++---
 accel/tcg/translator.c| 135 +---
 accel/tcg/user-exec.c |  18 +++-
 linux-user/elfload.c  |  82 +++--
 linux-user/mmap.c |   8 ++
 softmmu/physmem.c |  12 +++
 target/alpha/translate.c  |   5 +-
 target/arm/translate.c|   5 +-
 target/avr/translate.c|   5 +-
 target/cris/translate.c   |   5 +-
 target/hexagon/translate.c|   6 +-
 target/hppa/translate.c   |   5 +-
 target/i386/tcg/translate.c   |  71 +--
 target/loongarch/translate.c  |   6 +-
 target/m68k/translate.c   |   5 +-
 target/microblaze/translate.c |   5 +-
 target/mips/tcg/translate.c   |   5 +-
 target/nios2/translate.c  |   5 +-
 target/openrisc/translate.c   |   6 +-
 target/ppc/translate.c|   5 +-
 target/riscv/translate.c  |  32 +--
 target/rx/translate.c |   5 +-
 target/s390x/tcg/translate.c  |  20 +++--
 target/sh4/translate.c|   5 +-
 target/sparc/translate.c  |   5 +-
 target/tricore/translate.c|   6 +-
 target/xtensa/translate.c |   6 +-
 tests/tcg/i386/test-i386.c|   2 +-
 tests/tcg/riscv64/noexec.c|  79 +
 tests/tcg/s390x/noexec.c  | 106 ++
 tests/tcg/x86_64/noexec.c |  75 
 tests/tcg/multiarch/noexec.c.inc  | 139 +
 tests/tcg/riscv64/Makefile.target |   1 +
 tests/tcg/s390x/Makefile.target   |   1 +
 tests/tcg/x86_64/Makefile.target  |   3 +-
 43 files changed, 971 insertions(+), 365 deletions(-)
 create mode 100644 tests/tcg/riscv64/noexec.c
 create mode 100644 tests/tcg/s390x/noexec.c
 create mode 100644 tests/tcg/x86_64/noexec.c
 create mode 100644 tests/tcg/multiarch/noexec.c.inc

-- 
2.34.1




[PATCH v7 01/20] linux-user/arm: Mark the commpage executable

2022-08-23 Thread Richard Henderson
We're about to start validating PAGE_EXEC, which means
that we've got to mark the commpage executable.  We had
been placing the commpage outside of reserved_va, which
was incorrect and led to an abort.

Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 linux-user/arm/target_cpu.h | 4 ++--
 linux-user/elfload.c| 6 +-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/linux-user/arm/target_cpu.h b/linux-user/arm/target_cpu.h
index 709d19bc9e..89ba274cfc 100644
--- a/linux-user/arm/target_cpu.h
+++ b/linux-user/arm/target_cpu.h
@@ -34,9 +34,9 @@ static inline unsigned long arm_max_reserved_va(CPUState *cs)
 } else {
 /*
  * We need to be able to map the commpage.
- * See validate_guest_space in linux-user/elfload.c.
+ * See init_guest_commpage in linux-user/elfload.c.
  */
-return 0xul;
+return 0xul;
 }
 }
 #define MAX_RESERVED_VA  arm_max_reserved_va
diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index ce902dbd56..3e3dc02499 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -398,7 +398,8 @@ enum {
 
 static bool init_guest_commpage(void)
 {
-void *want = g2h_untagged(HI_COMMPAGE & -qemu_host_page_size);
+abi_ptr commpage = HI_COMMPAGE & -qemu_host_page_size;
+void *want = g2h_untagged(commpage);
 void *addr = mmap(want, qemu_host_page_size, PROT_READ | PROT_WRITE,
   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
 
@@ -417,6 +418,9 @@ static bool init_guest_commpage(void)
 perror("Protecting guest commpage");
 exit(EXIT_FAILURE);
 }
+
+page_set_flags(commpage, commpage + qemu_host_page_size,
+   PAGE_READ | PAGE_EXEC | PAGE_VALID);
 return true;
 }
 
-- 
2.34.1




[PATCH v7 02/20] linux-user/hppa: Allocate page zero as a commpage

2022-08-23 Thread Richard Henderson
We're about to start validating PAGE_EXEC, which means that we've
got to mark page zero executable.  We had been special casing this
entirely within translate.

Acked-by: Ilya Leoshkevich 
Tested-by: Ilya Leoshkevich 
Signed-off-by: Richard Henderson 
---
 linux-user/elfload.c | 34 +++---
 1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index 3e3dc02499..29d910c4cc 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -1646,6 +1646,34 @@ static inline void init_thread(struct target_pt_regs 
*regs,
 regs->gr[31] = infop->entry;
 }
 
+#define LO_COMMPAGE  0
+
+static bool init_guest_commpage(void)
+{
+void *want = g2h_untagged(LO_COMMPAGE);
+void *addr = mmap(want, qemu_host_page_size, PROT_NONE,
+  MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+
+if (addr == MAP_FAILED) {
+perror("Allocating guest commpage");
+exit(EXIT_FAILURE);
+}
+if (addr != want) {
+return false;
+}
+
+/*
+ * On Linux, page zero is normally marked execute only + gateway.
+ * Normal read or write is supposed to fail (thus PROT_NONE above),
+ * but specific offsets have kernel code mapped to raise permissions
+ * and implement syscalls.  Here, simply mark the page executable.
+ * Special case the entry points during translation (see do_page_zero).
+ */
+page_set_flags(LO_COMMPAGE, LO_COMMPAGE + TARGET_PAGE_SIZE,
+   PAGE_EXEC | PAGE_VALID);
+return true;
+}
+
 #endif /* TARGET_HPPA */
 
 #ifdef TARGET_XTENSA
@@ -2326,12 +2354,12 @@ static abi_ulong create_elf_tables(abi_ulong p, int 
argc, int envc,
 }
 
 #if defined(HI_COMMPAGE)
-#define LO_COMMPAGE 0
+#define LO_COMMPAGE -1
 #elif defined(LO_COMMPAGE)
 #define HI_COMMPAGE 0
 #else
 #define HI_COMMPAGE 0
-#define LO_COMMPAGE 0
+#define LO_COMMPAGE -1
 #define init_guest_commpage() true
 #endif
 
@@ -2555,7 +2583,7 @@ static void pgb_static(const char *image_name, abi_ulong 
orig_loaddr,
 } else {
 offset = -(HI_COMMPAGE & -align);
 }
-} else if (LO_COMMPAGE != 0) {
+} else if (LO_COMMPAGE != -1) {
 loaddr = MIN(loaddr, LO_COMMPAGE & -align);
 }
 
-- 
2.34.1




Re: [PATCH 0/2] target/arm: armv7m_load_kernel() improvements

2022-08-23 Thread Richard Henderson

On 8/23/22 09:04, Peter Maydell wrote:

Two small patches to armv7m_load_kernel().  The first is just getting
rid of some dead code, that I noticed while working on the function.
The second is to make boards pass armv7m_load_kernel() the base
address for loading guest (non-ELF) binaries.  At the moment we
assume all M-profile boards start at address 0; this happens to be
true for all the ones we implement right now, but it's not true in
general.  In particular the Teeny board has its ROM at 0x0020_.

I thought about having armv7m_load_kernel() be "clever" and ask the
CPU what init-svtor/init-nsvtor were set to, but that seems like it
might have unanticipated consequences[*].  "Just pass the base address"
is simpler and is how A-profile does it (though for A-profile it's
the loader_start field in struct arm_boot_info rather than an extra
argument).

[*] eg where the board has the rom/flash aliased at both address
0 and some other address, and init-svtor points at an alias;
also Secure vs NonSecure address spaces and loading...


Thanks, queued to target-arm.next.


r~



Re: [PATCH v2 00/10] target/arm: Implement FEAT_PMUv3p5

2022-08-23 Thread Richard Henderson
On Mon, 22 Aug 2022 at 06:24, Peter Maydell  wrote:
>
> This patchset implements the Armv8.5 feature FEAT_PMUv3p5, which is
> a set of minor enhancements to the PMU:
>  * EL2 and EL3 can now prohibit the cycle counter from counting
>when in EL2 or when Secure, using new MDCR_EL2.HCCD and
>MDCR_EL3.SCCD bits
>  * event counters are now 64 bits, with the overflow detection
>configurably at the 32 bit or 64 bit mark
>
> It also fixes a set of bugs in the existing PMU emulation which I
> discovered while trying to test my additions.
>
> This is of course all intended for 7.2.
>
> Changes v1->v2:
>  * fixed indent error, comment typo
>  * a non-change: opted not to use bitwise |= for bool
>  * fixed patch 8 to implement MDCR_EL3.SCCD, not some
>weird mix of MCCD and SCCD
>  * update emulation.rst to note feature is implemented
>
> Patch 8 is the only one that needs review.

Thanks, queued to target-arm.next.


r~



Re: [PATCH 0/6] target/arm: Fix v8 AArch32 RAZ ID regs; implement FEAT_ETS

2022-08-23 Thread Richard Henderson

On 8/19/22 04:00, Peter Maydell wrote:

The main aim of this patchset is to implement FEAT_ETS.
FEAT_ETS provides tighter guarantees on some memory orderings
involving translation table walks that permit guest code to
skip the context-synchronization event they would otherwise
need to do after a TLB maintenance operation. QEMU already
provides the tighter guarantees this feature requires, so
all we need to do is advertise it in the ID registers...

...except that it turns out that for AArch32 this is done
in ID_MMFR5, which is a new-in-v8.6 register that we don't
implement yet. So we need to provide it. And while I was
doing that I noticed that we accidentally forgot to
implement a big chunk of the "reserved for future ID
registers, must RAZ" cp15 space for v8 AArch32. So the
big bit of the patchset is sorting that out :-)


Thanks, queued to target-arm.next.


r~



Re: [PATCH] hw/arm/bcm2835_property: Add support for RPI_FIRMWARE_FRAMEBUFFER_GET_NUM_DISPLAYS

2022-08-23 Thread Richard Henderson

On 8/12/22 07:35, Enrik Berkhan wrote:

In more recent Raspbian OS Linux kernels, the fb driver gives up
immediately if RPI_FIRMWARE_FRAMEBUFFER_GET_NUM_DISPLAYS fails or no
displays are reported.

This change simply always reports one display. It makes bcm2835_fb work
again with these more recent kernels.

Signed-off-by: Enrik Berkhan 


Thanks, queued to target-arm.next.


r~



Re: [PATCH v2] target/arm: Add cortex-a35

2022-08-23 Thread Richard Henderson

On 8/18/22 17:20, Hao Wu wrote:

Add cortex A35 core and enable it for virt board.

Signed-off-by: Hao Wu 
Reviewed-by: Joe Komlodi 
---
  docs/system/arm/virt.rst |  1 +
  hw/arm/virt.c|  1 +
  target/arm/cpu64.c   | 80 
  3 files changed, 82 insertions(+)


Thanks, queued to target-arm.next.


r~



[PATCH 2/2] target/s390x: Align __excp_addr in s390_probe_access

2022-08-23 Thread Richard Henderson
Per the comment in s390_cpu_record_sigsegv, the saved address
is always page aligned.

Signed-off-by: Richard Henderson 
---
 target/s390x/tcg/mem_helper.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/s390x/tcg/mem_helper.c b/target/s390x/tcg/mem_helper.c
index 4c0f8baa39..19ea7d2f8d 100644
--- a/target/s390x/tcg/mem_helper.c
+++ b/target/s390x/tcg/mem_helper.c
@@ -147,7 +147,7 @@ static int s390_probe_access(CPUArchState *env, 
target_ulong addr, int size,
 #if defined(CONFIG_USER_ONLY)
 flags = page_get_flags(addr);
 if (!(flags & (access_type == MMU_DATA_LOAD ?  PAGE_READ : 
PAGE_WRITE_ORG))) {
-env->__excp_addr = addr;
+env->__excp_addr = addr & TARGET_PAGE_MASK;
 flags = (flags & PAGE_VALID) ? PGM_PROTECTION : PGM_ADDRESSING;
 if (nonfault) {
 return flags;
-- 
2.34.1




[PATCH 1/2] Revert "target/s390x: Use probe_access_flags in s390_probe_access"

2022-08-23 Thread Richard Henderson
This reverts commit db9aab5783a2fb62250e12f0c4cfed5e1778c189.

This patch breaks the contract of s390_probe_access, in that
it no longer returns an exception code, nor sets __excp_addr.

Reported-by: David Hildenbrand 
Signed-off-by: Richard Henderson 
---
 target/s390x/tcg/mem_helper.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/target/s390x/tcg/mem_helper.c b/target/s390x/tcg/mem_helper.c
index fc52aa128b..4c0f8baa39 100644
--- a/target/s390x/tcg/mem_helper.c
+++ b/target/s390x/tcg/mem_helper.c
@@ -142,12 +142,20 @@ static int s390_probe_access(CPUArchState *env, 
target_ulong addr, int size,
  MMUAccessType access_type, int mmu_idx,
  bool nonfault, void **phost, uintptr_t ra)
 {
-#if defined(CONFIG_USER_ONLY)
-return probe_access_flags(env, addr, access_type, mmu_idx,
-  nonfault, phost, ra);
-#else
 int flags;
 
+#if defined(CONFIG_USER_ONLY)
+flags = page_get_flags(addr);
+if (!(flags & (access_type == MMU_DATA_LOAD ?  PAGE_READ : 
PAGE_WRITE_ORG))) {
+env->__excp_addr = addr;
+flags = (flags & PAGE_VALID) ? PGM_PROTECTION : PGM_ADDRESSING;
+if (nonfault) {
+return flags;
+}
+tcg_s390_program_interrupt(env, flags, ra);
+}
+*phost = g2h(env_cpu(env), addr);
+#else
 /*
  * For !CONFIG_USER_ONLY, we cannot rely on TLB_INVALID_MASK or haddr==NULL
  * to detect if there was an exception during tlb_fill().
@@ -166,8 +174,8 @@ static int s390_probe_access(CPUArchState *env, 
target_ulong addr, int size,
  (access_type == MMU_DATA_STORE
   ? BP_MEM_WRITE : BP_MEM_READ), ra);
 }
-return 0;
 #endif
+return 0;
 }
 
 static int access_prepare_nf(S390Access *access, CPUS390XState *env,
-- 
2.34.1




[PATCH 0/2] target/s390x: s390_probe_access fixes

2022-08-23 Thread Richard Henderson
First, as pointed out by David; second by inspection.

I really wish there were a better way to structure this,
but alas, I don't see any alternatives that aren't just
different but similar amounts of ugly.


r~


Richard Henderson (2):
  Revert "target/s390x: Use probe_access_flags in s390_probe_access"
  target/s390x: Align __excp_addr in s390_probe_access

 target/s390x/tcg/mem_helper.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

-- 
2.34.1




Re: [PULL 1/6] tests/avocado: push default timeout to QemuBaseTest

2022-08-23 Thread Richard Henderson

On 8/23/22 08:25, Alex Bennée wrote:

All of the QEMU tests eventually end up derived from this class. Move
the default timeout from LinuxTest to ensure we catch them all. As 15
minutes is fairly excessive, we drop the default down to 2 minutes,
which is a more reasonable target for tests to aim for.

Signed-off-by: Alex Bennée 
Reviewed-by: Richard Henderson 
Message-Id: <20220822165608.2980552-2-alex.ben...@linaro.org>

diff --git a/tests/avocado/avocado_qemu/__init__.py 
b/tests/avocado/avocado_qemu/__init__.py
index ed4853c805..0efd2bd212 100644
--- a/tests/avocado/avocado_qemu/__init__.py
+++ b/tests/avocado/avocado_qemu/__init__.py
@@ -227,6 +227,10 @@ def exec_command_and_wait_for_pattern(test, command,
  _console_interaction(test, success_message, failure_message, command + 
'\r')
  
  class QemuBaseTest(avocado.Test):

+
+# default timeout for all tests, can be overridden
+timeout = 120
+
  def _get_unique_tag_val(self, tag_name):
  """
  Gets a tag value, if unique for a key
@@ -512,7 +516,6 @@ class LinuxTest(LinuxSSHMixIn, QemuSystemTest):
  to start with than the more vanilla `QemuSystemTest` class.
  """
  
-timeout = 900

  distro = None
  username = 'root'
  password = 'password'


Bah.

https://gitlab.com/qemu-project/qemu/-/jobs/2923804714

 (001/192) tests/avocado/boot_linux.py:BootLinuxX8664.test_pc_i440fx_tcg:  INTERRUPTED: 
Test interrupted by SIGTERM\nRunner error occurred: Timeout reached\nOriginal status: 
ERROR\n{'name': '001-tests/avocado/boot_linux.py:BootLinuxX8664.test_pc_i440fx_tcg', 
'logdir': '/builds/qemu-project/qemu/build/tests/results/job-2022-08-23T21.03-6d06db2/t... 
(120.85 s)
 (003/192) tests/avocado/boot_linux.py:BootLinuxX8664.test_pc_q35_tcg:  INTERRUPTED: Test 
interrupted by SIGTERM\nRunner error occurred: Timeout reached\nOriginal status: 
ERROR\n{'name': '003-tests/avocado/boot_linux.py:BootLinuxX8664.test_pc_q35_tcg', 
'logdir': 
'/builds/qemu-project/qemu/build/tests/results/job-2022-08-23T21.03-6d06db2/test... (120.81 s)


The previous successful run had

 (001/192) tests/avocado/boot_linux.py:BootLinuxX8664.test_pc_i440fx_tcg:  PASS 
(257.00 s)
 (003/192) tests/avocado/boot_linux.py:BootLinuxX8664.test_pc_q35_tcg:  PASS 
(238.67 s)


r~



Re: [PATCH 2/2] target/arm: Make boards pass base address to armv7m_load_kernel()

2022-08-23 Thread Richard Henderson

On 8/23/22 09:04, Peter Maydell wrote:

Currently armv7m_load_kernel() takes the size of the block of memory
where it should load the initial guest image, but assumes that it
should always load it at address 0.  This happens to be true of all
our M-profile boards at the moment, but it isn't guaranteed to always
be so: M-profile CPUs can be configured (via init-svtor and
init-nsvtor, which match equivalent hardware configuration signals)
to have the initial vector table at any address, not just zero.  (For
instance the Teeny board has the boot ROM at address 0x0200_.)

Add a base address argument to armv7m_load_kernel(), so that
callers now pass in both base address and size. All the current
callers pass 0, so this is not a behaviour change.

Signed-off-by: Peter Maydell
---


Reviewed-by: Richard Henderson 

r~



Re: [PATCH 1/2] target/arm: Remove useless TARGET_BIG_ENDIAN check in armv7m_load_kernel()

2022-08-23 Thread Richard Henderson

On 8/23/22 09:04, Peter Maydell wrote:

Arm system emulation targets always have TARGET_BIG_ENDIAN clear, so
there is no need to have handling in armv7m_load_kernel() for the
case when it is defined.  Remove the unnecessary code.

Side notes:
  * our M-profile implementation is always little-endian (that is, it
makes the IMPDEF choice that the read-only AIRCR.ENDIANNESS is 0)
  * if we did want to handle big-endian ELF files here we should do it
the way that hw/arm/boot.c:arm_load_elf() does, by looking at the
ELF header to see what endianness the file itself is

Signed-off-by: Peter Maydell 


Reviewed-by: Richard Henderson 


r~



Re: [PATCH v9 1/3] hw/intc: Move mtimer/mtimecmp to aclint

2022-08-23 Thread Alistair Francis
On Thu, Aug 11, 2022 at 4:57 AM Atish Patra  wrote:
>
> Historically, the mtime/mtimecmp registers have been part of the CPU because
> they are per-hart entities. However, they actually belong to the aclint,
> which is an MMIO device.
>
> Move them to the ACLINT device. This also emulates the real hardware
> more closely.
>
> Reviewed-by: Anup Patel 
> Reviewed-by: Alistair Francis 
> Reviewed-by: Andrew Jones 
> Signed-off-by: Atish Patra 

This patch breaks my multi-socket boot.

When using OpenSBI 1.1 and Linux 5.19-rc7

qemu-system-riscv64 \
-machine virt \
-serial mon:stdio -serial null -nographic \
-append "root=/dev/vda rw highres=off  console=ttyS0 ip=dhcp earlycon=sbi" \
-device virtio-net-device,netdev=net0,mac=52:54:00:12:34:02 \
-netdev user,id=net0 \
-object rng-random,filename=/dev/urandom,id=rng0 \
-device virtio-rng-device,rng=rng0 \
-smp 4 \
-d guest_errors \
-m 2G \
-object memory-backend-ram,size=1G,policy=bind,host-nodes=0,id=ram-node0 \
-numa node,memdev=ram-node0 \
-object memory-backend-ram,size=1G,policy=bind,host-nodes=0,id=ram-node1 \
-numa node,memdev=ram-node1 \
-numa cpu,node-id=0,core-id=0 \
-numa cpu,node-id=0,core-id=1 \
-numa cpu,node-id=1,core-id=2 \
-numa cpu,node-id=1,core-id=3 \
-kernel ./images/qemuriscv64/Image
-bios default

It looks like OpenSBI hangs when booting after applying this patch

Alistair

> ---
>  hw/intc/riscv_aclint.c | 41 --
>  hw/timer/ibex_timer.c  | 18 ++-
>  include/hw/intc/riscv_aclint.h |  2 ++
>  include/hw/timer/ibex_timer.h  |  2 ++
>  target/riscv/cpu.h |  2 --
>  target/riscv/machine.c |  5 ++---
>  6 files changed, 42 insertions(+), 28 deletions(-)
>
> diff --git a/hw/intc/riscv_aclint.c b/hw/intc/riscv_aclint.c
> index e7942c4e5a32..a125c73d535c 100644
> --- a/hw/intc/riscv_aclint.c
> +++ b/hw/intc/riscv_aclint.c
> @@ -32,6 +32,7 @@
>  #include "hw/intc/riscv_aclint.h"
>  #include "qemu/timer.h"
>  #include "hw/irq.h"
> +#include "migration/vmstate.h"
>
>  typedef struct riscv_aclint_mtimer_callback {
>  RISCVAclintMTimerState *s;
> @@ -65,8 +66,8 @@ static void 
> riscv_aclint_mtimer_write_timecmp(RISCVAclintMTimerState *mtimer,
>
>  uint64_t rtc_r = cpu_riscv_read_rtc(mtimer);
>
> -cpu->env.timecmp = value;
> -if (cpu->env.timecmp <= rtc_r) {
> +mtimer->timecmp[hartid] = value;
> +if (mtimer->timecmp[hartid] <= rtc_r) {
>  /*
>   * If we're setting an MTIMECMP value in the "past",
>   * immediately raise the timer interrupt
> @@ -77,7 +78,7 @@ static void 
> riscv_aclint_mtimer_write_timecmp(RISCVAclintMTimerState *mtimer,
>
>  /* otherwise, set up the future timer interrupt */
>  qemu_irq_lower(mtimer->timer_irqs[hartid - mtimer->hartid_base]);
> -diff = cpu->env.timecmp - rtc_r;
> +diff = mtimer->timecmp[hartid] - rtc_r;
>  /* back to ns (note args switched in muldiv64) */
>  uint64_t ns_diff = muldiv64(diff, NANOSECONDS_PER_SECOND, timebase_freq);
>
> @@ -102,7 +103,7 @@ static void 
> riscv_aclint_mtimer_write_timecmp(RISCVAclintMTimerState *mtimer,
>  next = MIN(next, INT64_MAX);
>  }
>
> -timer_mod(cpu->env.timer, next);
> +timer_mod(mtimer->timers[hartid], next);
>  }
>
>  /*
> @@ -133,11 +134,11 @@ static uint64_t riscv_aclint_mtimer_read(void *opaque, 
> hwaddr addr,
>"aclint-mtimer: invalid hartid: %zu", hartid);
>  } else if ((addr & 0x7) == 0) {
>  /* timecmp_lo for RV32/RV64 or timecmp for RV64 */
> -uint64_t timecmp = env->timecmp;
> +uint64_t timecmp = mtimer->timecmp[hartid];
>  return (size == 4) ? (timecmp & 0x) : timecmp;
>  } else if ((addr & 0x7) == 4) {
>  /* timecmp_hi */
> -uint64_t timecmp = env->timecmp;
> +uint64_t timecmp = mtimer->timecmp[hartid];
>  return (timecmp >> 32) & 0x;
>  } else {
>  qemu_log_mask(LOG_UNIMP,
> @@ -177,7 +178,7 @@ static void riscv_aclint_mtimer_write(void *opaque, 
> hwaddr addr,
>  } else if ((addr & 0x7) == 0) {
>  if (size == 4) {
>  /* timecmp_lo for RV32/RV64 */
> -uint64_t timecmp_hi = env->timecmp >> 32;
> +uint64_t timecmp_hi = mtimer->timecmp[hartid] >> 32;
>  riscv_aclint_mtimer_write_timecmp(mtimer, RISCV_CPU(cpu), 
> hartid,
>  timecmp_hi << 32 | (value & 0x));
>  } else {
> @@ -188,7 +189,7 @@ static void riscv_aclint_mtimer_write(void *opaque, 
> hwaddr addr,
>  } else if ((addr & 0x7) == 4) {
>  if (size == 4) {
>  /* timecmp_hi for RV32/RV64 */
> -uint64_t timecmp_lo = env->timecmp;
> +uint64_t timecmp_lo = mtimer->timecmp[hartid];
>  

[PATCH] gitlab-ci/custom-runners: Disable -static-pie for ubuntu-20.04-aarch64

2022-08-23 Thread Richard Henderson
The project has reached the magic size at which we see

/usr/aarch64-linux-gnu/lib/libc.a(init-first.o): in function 
`__libc_init_first':
(.text+0x10): relocation truncated to fit: R_AARCH64_LD64_GOTPAGE_LO15 against \
symbol `__environ' defined in .bss section in 
/usr/aarch64-linux-gnu/lib/libc.a(environ.o)
/usr/bin/ld: (.text+0x10): warning: too many GOT entries for -fpic, please 
recompile with -fPIC

The bug has been reported upstream, but in the meantime there is
nothing we can do except build a non-pie executable.

Signed-off-by: Richard Henderson 
---

Irritatingly, once this is fixed, we're still in a broken state because
we're now picking up an x86_64 cross-compiler, without all of the
required static libraries:

  https://gitlab.com/qemu-project/qemu/-/jobs/2923714301#L3028

Cross compilers
aarch64  : cc
arm  : arm-linux-gnueabihf-gcc
i386 : i686-linux-gnu-gcc
mips64el : mips64el-linux-gnuabi64-gcc
mipsel   : mipsel-linux-gnu-gcc
riscv64  : riscv64-linux-gnu-gcc
s390x: s390x-linux-gnu-gcc
x86_64   : x86_64-linux-gnu-gcc

where we hadn't done so just 4 days ago:

  https://gitlab.com/qemu-project/qemu/-/jobs/2908305198

Cross compilers
aarch64  : cc
arm  : arm-linux-gnueabihf-gcc
i386 : i686-linux-gnu-gcc
riscv64  : riscv64-linux-gnu-gcc
s390x: s390x-linux-gnu-gcc

Alex? I think you're the only one who would know how this host
is supposed to be configured for gitlab...

I guess I'm not going to let this affect the release, but we
do have quite a number of annoyingly consistent failures now.
We should either fix them or disable them.


r~

---
 .gitlab-ci.d/custom-runners/ubuntu-20.04-aarch64.yml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/.gitlab-ci.d/custom-runners/ubuntu-20.04-aarch64.yml 
b/.gitlab-ci.d/custom-runners/ubuntu-20.04-aarch64.yml
index 3d878914e7..85a234801a 100644
--- a/.gitlab-ci.d/custom-runners/ubuntu-20.04-aarch64.yml
+++ b/.gitlab-ci.d/custom-runners/ubuntu-20.04-aarch64.yml
@@ -16,7 +16,9 @@ ubuntu-20.04-aarch64-all-linux-static:
  # --disable-glusterfs is needed because there's no static version of those 
libs in distro supplied packages
  - mkdir build
  - cd build
- - ../configure --enable-debug --static --disable-system --disable-glusterfs 
--disable-libssh
+ # Disable -static-pie due to build error with system libc:
+ # https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1987438
+ - ../configure --enable-debug --static --disable-system --disable-glusterfs 
--disable-libssh --disable-pie --extra-cflags='-fno-pie -no-pie'
|| { cat config.log meson-logs/meson-log.txt; exit 1; }
  - make --output-sync -j`nproc --ignore=40`
  - make --output-sync -j`nproc --ignore=40` check V=1
-- 
2.34.1




Re: [PATCH v13 3/6] target/riscv: Add few cache related PMU events

2022-08-23 Thread Alistair Francis
On Wed, Aug 24, 2022 at 5:19 AM Atish Kumar Patra  wrote:
>
>
>
> On Mon, Aug 22, 2022 at 5:38 PM Alistair Francis  wrote:
>>
>> On Wed, Aug 17, 2022 at 9:24 AM Atish Patra  wrote:
>> >
>> > From: Atish Patra 
>> >
>> > Qemu can monitor the following cache related PMU events through
>> > tlb_fill functions.
>> >
>> > 1. DTLB load/store miss
>> > 3. ITLB prefetch miss
>> >
>> > Increment the PMU counter in tlb_fill function.
>> >
>> > Reviewed-by: Alistair Francis 
>> > Tested-by: Heiko Stuebner 
>> > Signed-off-by: Atish Patra 
>> > Signed-off-by: Atish Patra 
>>
>> This patch causes a number of test failures.
>>
>> On some boots I see lots of these errors printed:
>>
>> qemu-system-riscv64: GLib: g_hash_table_lookup: assertion 'hash_table
>> != NULL' failed
>>
>> and I'm unable to boot Linux.
>>
>> The command line is:
>>
>> qemu-system-riscv64 \
>> -machine sifive_u \
>> -serial mon:stdio -serial null -nographic \
>> -append "root=/dev/ram rw highres=off console=ttySIF0 ip=dhcp
>> earlycon=sbi" \
>> -smp 5 \
>> -d guest_errors \
>> -kernel ./images/qemuriscv64/Image \
>> -initrd ./images/qemuriscv64/buildroot/rootfs.cpio \
>> -bios default -m 256M
>>
>> I see that failure with the 5.19-rc7 kernel and OpenSBI 1.1.
>>
>> I also see the same messages on other machines when running baremetal
>> code. I'm going to drop these patches from my tree
>>
>
> Sorry for the breakage.  This should fix the issue. We just need to add a few 
> sanity checks
> for the platforms that don't support these events.
>
> diff --git a/target/riscv/pmu.c b/target/riscv/pmu.c
> index ad33b81b2ea0..0a473010cadd 100644
> --- a/target/riscv/pmu.c
> +++ b/target/riscv/pmu.c
> @@ -187,6 +187,9 @@ int riscv_pmu_incr_ctr(RISCVCPU *cpu, enum 
> riscv_pmu_event_idx event_idx)
>  CPURISCVState *env = >env;
>  gpointer value;
>
> +if (!cpu->cfg.pmu_num)
> +return 0;
> +
>  value = g_hash_table_lookup(cpu->pmu_event_ctr_map,
>  GUINT_TO_POINTER(event_idx));
>  if (!value) {
> @@ -221,6 +224,9 @@ bool riscv_pmu_ctr_monitor_instructions(CPURISCVState 
> *env,
>  }
>
>  cpu = RISCV_CPU(env_cpu(env));
> +if (!cpu->pmu_event_ctr_map)
> +return false;
> +
>  event_idx = RISCV_PMU_EVENT_HW_INSTRUCTIONS;
>  ctr_idx = GPOINTER_TO_UINT(g_hash_table_lookup(cpu->pmu_event_ctr_map,
> GUINT_TO_POINTER(event_idx)));
> @@ -243,6 +249,9 @@ bool riscv_pmu_ctr_monitor_cycles(CPURISCVState *env, 
> uint32_t target_ctr)
>  }
>
>  cpu = RISCV_CPU(env_cpu(env));
> +if (!cpu->pmu_event_ctr_map)
> +return false;
> +
>  event_idx = RISCV_PMU_EVENT_HW_CPU_CYCLES;
>  ctr_idx = GPOINTER_TO_UINT(g_hash_table_lookup(cpu->pmu_event_ctr_map,
> GUINT_TO_POINTER(event_idx)));
> @@ -280,7 +289,7 @@ int riscv_pmu_update_event_map(CPURISCVState *env, 
> uint64_t value,
>  uint32_t event_idx;
>  RISCVCPU *cpu = RISCV_CPU(env_cpu(env));
>
> -if (!riscv_pmu_counter_valid(cpu, ctr_idx)) {
> +if (!riscv_pmu_counter_valid(cpu, ctr_idx) || !cpu->pmu_event_ctr_map) {
>  return -1;
>  }
>
> Should I respin the series or send this as a fix?

Can you wait till tomorrow, rebase on my branch and then send a new
series? I'm just chasing down another issue today

Alistair

>
>> Alistair
>>
>> > ---
>> >  target/riscv/cpu_helper.c | 25 +
>> >  1 file changed, 25 insertions(+)
>> >
>> > diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
>> > index 1e4faa84e839..81948b37dd9a 100644
>> > --- a/target/riscv/cpu_helper.c
>> > +++ b/target/riscv/cpu_helper.c
>> > @@ -21,11 +21,13 @@
>> >  #include "qemu/log.h"
>> >  #include "qemu/main-loop.h"
>> >  #include "cpu.h"
>> > +#include "pmu.h"
>> >  #include "exec/exec-all.h"
>> >  #include "instmap.h"
>> >  #include "tcg/tcg-op.h"
>> >  #include "trace.h"
>> >  #include "semihosting/common-semi.h"
>> > +#include "cpu_bits.h"
>> >
>> >  int riscv_cpu_mmu_index(CPURISCVState *env, bool ifetch)
>> >  {
>> > @@ -1188,6 +1190,28 @@ void riscv_cpu_do_unaligned_access(CPUState *cs, 
>> > vaddr addr,
>> >  cpu_loop_exit_restore(cs, retaddr);
>> >  }
>> >
>> > +
>> > +static void pmu_tlb_fill_incr_ctr(RISCVCPU *cpu, MMUAccessType 
>> > access_type)
>> > +{
>> > +enum riscv_pmu_event_idx pmu_event_type;
>> > +
>> > +switch (access_type) {
>> > +case MMU_INST_FETCH:
>> > +pmu_event_type = RISCV_PMU_EVENT_CACHE_ITLB_PREFETCH_MISS;
>> > +break;
>> > +case MMU_DATA_LOAD:
>> > +pmu_event_type = RISCV_PMU_EVENT_CACHE_DTLB_READ_MISS;
>> > +break;
>> > +case MMU_DATA_STORE:
>> > +pmu_event_type = RISCV_PMU_EVENT_CACHE_DTLB_WRITE_MISS;
>> > +break;
>> > +default:
>> > +return;
>> > +}
>> > +
>> > +riscv_pmu_incr_ctr(cpu, pmu_event_type);
>> > +}
>> > +
>> >  bool riscv_cpu_tlb_fill(CPUState *cs, vaddr address, 

Re: [RFC v4 00/11] blkio: add libblkio BlockDriver

2022-08-23 Thread Stefan Hajnoczi
On Tue, Aug 23, 2022 at 08:31:03PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 8/23/22 01:23, Stefan Hajnoczi wrote:
> > The remainder of the patch series reworks the existing QEMU 
> > bdrv_register_buf()
> > API so the virtio-blk emulation can efficiently map guest RAM for libblkio - some
> > libblkio drivers require that I/O buffer memory is pre-registered (think 
> > VFIO,
> > vhost, etc).
> 
> Hi!
> 
> So patches 01-11 are for performance optimization? Don't you have some 
> performance measurements for it?

Good point, I'll gather some data and share it.

Stefan




Re: [RFC v4 11/11] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint

2022-08-23 Thread Stefan Hajnoczi
On Tue, Aug 23, 2022 at 10:01:59AM +0200, David Hildenbrand wrote:
> On 23.08.22 00:24, Stefan Hajnoczi wrote:
> > Register guest RAM using BlockRAMRegistrar and set the
> > BDRV_REQ_REGISTERED_BUF flag so block drivers can optimize memory
> > accesses in I/O requests.
> > 
> > This is for vdpa-blk, vhost-user-blk, and other I/O interfaces that rely
> > on DMA mapping/unmapping.
> 
> Can you explain why we're monitoring RAMRegistrar to hook into "guest
> RAM" and not go the usual path of the MemoryListener?

The requirements are similar to VFIO, which uses RAMBlockNotifier. We
need to learn about all guest RAM because that's where I/O buffers are
located.

Do you think RAMBlockNotifier should be avoided?

> What will BDRV_REQ_REGISTERED_BUF actually do? Pin all guest memory in
> the worst case such as io_uring fixed buffers would do (I hope not)?

BDRV_REQ_REGISTERED_BUF is a hint that no bounce buffer is necessary
because the I/O buffer is located in memory that was previously
registered with bdrv_register_buf().

The RAMBlockNotifier calls bdrv_register_buf() to let the libblkio
driver know about RAM. Some libblkio drivers ignore this hint, io_uring
may use the fixed buffers feature, vhost-user sends the shared memory
file descriptors to the vhost device server, and VFIO/vhost may pin
pages.

So the blkio block driver doesn't add anything new, it's the union of
VFIO/vhost/vhost-user/etc memory requirements.

Stefan




Re: [PATCH v13 3/6] target/riscv: Add few cache related PMU events

2022-08-23 Thread Atish Kumar Patra
On Mon, Aug 22, 2022 at 5:38 PM Alistair Francis 
wrote:

> On Wed, Aug 17, 2022 at 9:24 AM Atish Patra  wrote:
> >
> > From: Atish Patra 
> >
> > Qemu can monitor the following cache related PMU events through
> > tlb_fill functions.
> >
> > 1. DTLB load/store miss
> > 3. ITLB prefetch miss
> >
> > Increment the PMU counter in tlb_fill function.
> >
> > Reviewed-by: Alistair Francis 
> > Tested-by: Heiko Stuebner 
> > Signed-off-by: Atish Patra 
> > Signed-off-by: Atish Patra 
>
> This patch causes a number of test failures.
>
> On some boots I see lots of these errors printed:
>
> qemu-system-riscv64: GLib: g_hash_table_lookup: assertion 'hash_table
> != NULL' failed
>
> and I'm unable to boot Linux.
>
> The command line is:
>
> qemu-system-riscv64 \
> -machine sifive_u \
> -serial mon:stdio -serial null -nographic \
> -append "root=/dev/ram rw highres=off console=ttySIF0 ip=dhcp
> earlycon=sbi" \
> -smp 5 \
> -d guest_errors \
> -kernel ./images/qemuriscv64/Image \
> -initrd ./images/qemuriscv64/buildroot/rootfs.cpio \
> -bios default -m 256M
>
> I see that failure with the 5.19-rc7 kernel and OpenSBI 1.1.
>
> I also see the same messages on other machines when running baremetal
> code. I'm going to drop these patches from my tree
>
>
Sorry for the breakage. This should fix the issue. We just need to add a few
sanity checks for the platforms that don't support these events.

diff --git a/target/riscv/pmu.c b/target/riscv/pmu.c
index ad33b81b2ea0..0a473010cadd 100644
--- a/target/riscv/pmu.c
+++ b/target/riscv/pmu.c
@@ -187,6 +187,9 @@ int riscv_pmu_incr_ctr(RISCVCPU *cpu, enum riscv_pmu_event_idx event_idx)
 CPURISCVState *env = &cpu->env;
 gpointer value;

+if (!cpu->cfg.pmu_num)
+return 0;
+
 value = g_hash_table_lookup(cpu->pmu_event_ctr_map,
 GUINT_TO_POINTER(event_idx));
 if (!value) {
@@ -221,6 +224,9 @@ bool riscv_pmu_ctr_monitor_instructions(CPURISCVState *env,
 }

 cpu = RISCV_CPU(env_cpu(env));
+if (!cpu->pmu_event_ctr_map)
+return false;
+
 event_idx = RISCV_PMU_EVENT_HW_INSTRUCTIONS;
 ctr_idx = GPOINTER_TO_UINT(g_hash_table_lookup(cpu->pmu_event_ctr_map,
GUINT_TO_POINTER(event_idx)));
@@ -243,6 +249,9 @@ bool riscv_pmu_ctr_monitor_cycles(CPURISCVState *env, uint32_t target_ctr)
 }

 cpu = RISCV_CPU(env_cpu(env));
+if (!cpu->pmu_event_ctr_map)
+return false;
+
 event_idx = RISCV_PMU_EVENT_HW_CPU_CYCLES;
 ctr_idx = GPOINTER_TO_UINT(g_hash_table_lookup(cpu->pmu_event_ctr_map,
GUINT_TO_POINTER(event_idx)));
@@ -280,7 +289,7 @@ int riscv_pmu_update_event_map(CPURISCVState *env, uint64_t value,
 uint32_t event_idx;
 RISCVCPU *cpu = RISCV_CPU(env_cpu(env));

-if (!riscv_pmu_counter_valid(cpu, ctr_idx)) {
+if (!riscv_pmu_counter_valid(cpu, ctr_idx) || !cpu->pmu_event_ctr_map) {
 return -1;
 }

Should I respin the series or send this as a fix?

Alistair
>
> > ---
> >  target/riscv/cpu_helper.c | 25 +
> >  1 file changed, 25 insertions(+)
> >
> > diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
> > index 1e4faa84e839..81948b37dd9a 100644
> > --- a/target/riscv/cpu_helper.c
> > +++ b/target/riscv/cpu_helper.c
> > @@ -21,11 +21,13 @@
> >  #include "qemu/log.h"
> >  #include "qemu/main-loop.h"
> >  #include "cpu.h"
> > +#include "pmu.h"
> >  #include "exec/exec-all.h"
> >  #include "instmap.h"
> >  #include "tcg/tcg-op.h"
> >  #include "trace.h"
> >  #include "semihosting/common-semi.h"
> > +#include "cpu_bits.h"
> >
> >  int riscv_cpu_mmu_index(CPURISCVState *env, bool ifetch)
> >  {
> > @@ -1188,6 +1190,28 @@ void riscv_cpu_do_unaligned_access(CPUState *cs,
> vaddr addr,
> >  cpu_loop_exit_restore(cs, retaddr);
> >  }
> >
> > +
> > +static void pmu_tlb_fill_incr_ctr(RISCVCPU *cpu, MMUAccessType
> access_type)
> > +{
> > +enum riscv_pmu_event_idx pmu_event_type;
> > +
> > +switch (access_type) {
> > +case MMU_INST_FETCH:
> > +pmu_event_type = RISCV_PMU_EVENT_CACHE_ITLB_PREFETCH_MISS;
> > +break;
> > +case MMU_DATA_LOAD:
> > +pmu_event_type = RISCV_PMU_EVENT_CACHE_DTLB_READ_MISS;
> > +break;
> > +case MMU_DATA_STORE:
> > +pmu_event_type = RISCV_PMU_EVENT_CACHE_DTLB_WRITE_MISS;
> > +break;
> > +default:
> > +return;
> > +}
> > +
> > +riscv_pmu_incr_ctr(cpu, pmu_event_type);
> > +}
> > +
> >  bool riscv_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
> >  MMUAccessType access_type, int mmu_idx,
> >  bool probe, uintptr_t retaddr)
> > @@ -1286,6 +1310,7 @@ bool riscv_cpu_tlb_fill(CPUState *cs, vaddr
> address, int size,
> >  }
> >  }
> >  } else {
> > +pmu_tlb_fill_incr_ctr(cpu, access_type);
> >  /* Single stage lookup */
> >  ret = 
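The guard pattern in the fix above can be shown in a self-contained form. A plain array stands in for the GHashTable, and all type and function names here are illustrative, not QEMU's:

```c
#include <assert.h>
#include <stddef.h>

#define NO_CTR (-1)

typedef struct {
    int pmu_num;           /* 0 => platform has no programmable counters */
    int *event_ctr_map;    /* NULL until an event is programmed */
} cpu_t;

/* Returns 1 when a counter mapped to event_idx would be incremented,
 * 0 when the event is silently ignored (as in the sanity checks above). */
int pmu_incr_ctr(const cpu_t *cpu, int event_idx, int n_events)
{
    if (!cpu->pmu_num) {
        return 0;          /* no PMU on this platform */
    }
    if (!cpu->event_ctr_map) {
        return 0;          /* map never allocated: nothing to look up */
    }
    if (event_idx < 0 || event_idx >= n_events ||
        cpu->event_ctr_map[event_idx] == NO_CTR) {
        return 0;          /* event not monitored by any counter */
    }
    return 1;
}
```

The point is that both the "no PMU" and the "map never allocated" cases bail out before any lookup, which is what the `g_hash_table_lookup` assertion failures were hitting.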

Re: [PULL 0/1] Block layer patches

2022-08-23 Thread Richard Henderson

On 8/23/22 07:04, Kevin Wolf wrote:

The following changes since commit ba58ccbef60338d0b7334c714589a6423a3e7f91:

   Merge tag 'for-7.1-hppa' of https://github.com/hdeller/qemu-hppa into 
staging (2022-08-19 09:35:29 -0700)

are available in the Git repository at:

   git://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 51e15194b0a091e5c40aab2eb234a1d36c5c58ee:

   scsi-generic: Fix emulated block limits VPD page (2022-08-23 16:01:13 +0200)


Block layer patches

- scsi-generic: Fix I/O errors due to wrong block limits


Applied, thanks.  Please update https://wiki.qemu.org/ChangeLog/7.1 as 
appropriate.


r~





Re: [PATCH 7/9] hw/isa/vt82c686: QOM'ify ac97 and mc97 creation

2022-08-23 Thread Bernhard Beschow
On Tue, Aug 23, 2022 at 2:44 AM BALATON Zoltan  wrote:

> On Tue, 23 Aug 2022, Bernhard Beschow wrote:
> > Resolves duplicate code in the boards.
> >
> > Signed-off-by: Bernhard Beschow 
> > ---
> > hw/isa/vt82c686.c   | 16 
> > hw/mips/fuloong2e.c |  4 
> > hw/ppc/pegasos2.c   |  4 
> > 3 files changed, 16 insertions(+), 8 deletions(-)
> >
> > diff --git a/hw/isa/vt82c686.c b/hw/isa/vt82c686.c
> > index b964d1a760..47f2fd2669 100644
> > --- a/hw/isa/vt82c686.c
> > +++ b/hw/isa/vt82c686.c
> > @@ -549,6 +549,8 @@ struct ViaISAState {
> > PCIIDEState ide;
> > UHCIState uhci[2];
> > ViaPMState pm;
> > +PCIDevice ac97;
> > +PCIDevice mc97;
> > };
> >
> > static const VMStateDescription vmstate_via = {
> > @@ -568,6 +570,8 @@ static void via_isa_init(Object *obj)
> > object_initialize_child(obj, "ide", &s->ide, "via-ide");
> > object_initialize_child(obj, "uhci1", &s->uhci[0], "vt82c686b-usb-uhci");
> > object_initialize_child(obj, "uhci2", &s->uhci[1], "vt82c686b-usb-uhci");
> > +object_initialize_child(obj, "ac97", &s->ac97, TYPE_VIA_AC97);
> > +object_initialize_child(obj, "mc97", &s->mc97, TYPE_VIA_MC97);
> > }
> >
> > static const TypeInfo via_isa_info = {
> > @@ -644,6 +648,18 @@ static void via_isa_realize(PCIDevice *d, Error
> **errp)
> > if (!qdev_realize(DEVICE(&s->pm), BUS(pci_bus), errp)) {
> > return;
> > }
> > +
> > +/* Function 5: AC97 Audio */
> > +qdev_prop_set_int32(DEVICE(&s->ac97), "addr", d->devfn + 5);
> > +if (!qdev_realize(DEVICE(&s->ac97), BUS(pci_bus), errp)) {
> > +return;
> > +}
> > +
> > +/* Function 6: AC97 Modem */
> > +qdev_prop_set_int32(DEVICE(&s->mc97), "addr", d->devfn + 6);
> > +if (!qdev_realize(DEVICE(&s->mc97), BUS(pci_bus), errp)) {
> > +return;
> > +}
> > }
> >
> > /* TYPE_VT82C686B_ISA */
> > diff --git a/hw/mips/fuloong2e.c b/hw/mips/fuloong2e.c
> > index f05474348f..ea1aef3049 100644
> > --- a/hw/mips/fuloong2e.c
> > +++ b/hw/mips/fuloong2e.c
> > @@ -207,10 +207,6 @@ static void vt82c686b_southbridge_init(PCIBus
> *pci_bus, int slot, qemu_irq intc,
> >
> > dev = PCI_DEVICE(object_resolve_path_component(OBJECT(dev), "pm"));
> > *i2c_bus = I2C_BUS(qdev_get_child_bus(DEVICE(dev), "i2c"));
> > -
> > -/* Audio support */
> > -pci_create_simple(pci_bus, PCI_DEVFN(slot, 5), TYPE_VIA_AC97);
> > -pci_create_simple(pci_bus, PCI_DEVFN(slot, 6), TYPE_VIA_MC97);
> > }
> >
> > /* Network support */
> > diff --git a/hw/ppc/pegasos2.c b/hw/ppc/pegasos2.c
> > index 4e29e42fba..89ef4aed8b 100644
> > --- a/hw/ppc/pegasos2.c
> > +++ b/hw/ppc/pegasos2.c
> > @@ -171,10 +171,6 @@ static void pegasos2_init(MachineState *machine)
> > spd_data = spd_data_generate(DDR, machine->ram_size);
> > smbus_eeprom_init_one(i2c_bus, 0x57, spd_data);
> >
> > -/* VT8231 function 5-6: AC97 Audio & Modem */
> > -pci_create_simple(pci_bus, PCI_DEVFN(12, 5), TYPE_VIA_AC97);
> > -pci_create_simple(pci_bus, PCI_DEVFN(12, 6), TYPE_VIA_MC97);
> > -
>
> This removes the last function created here so the comment above saying:
> /* VT8231 function 0: PCI-to-ISA Bridge */
> is now stale and may be removed as well.
>

Sure, I'll remove it in v2. What about the comment saying:
/* VT8231 function 4: Power Management Controller */
?

Thanks,
Bernhard

>
> Regards,
> BALATON Zoltan
>
> > /* other PC hardware */
> > pci_vga_init(pci_bus);
> >
> >
>
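The embed-init-realize shape used throughout this series can be reduced to a self-contained sketch: the south bridge embeds its function devices, init constructs them, and realize brings each one up in order, stopping at the first failure. `device_t` and the `fail` flag are illustrative stand-ins, not QOM API:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    const char *name;
    bool realized;
    bool fail;               /* injected fault, for demonstration only */
} device_t;

typedef struct {
    device_t ac97;
    device_t mc97;
} via_isa_t;

/* Mirrors via_isa_init(): construct the embedded children. */
static void via_isa_init(via_isa_t *s)
{
    s->ac97 = (device_t){ "ac97", false, false };
    s->mc97 = (device_t){ "mc97", false, false };
}

static bool device_realize(device_t *d)
{
    if (d->fail) {
        return false;
    }
    d->realized = true;
    return true;
}

/* Mirrors via_isa_realize(): realize children in order and propagate
 * the first error, so later children are never touched. */
static bool via_isa_realize(via_isa_t *s)
{
    if (!device_realize(&s->ac97)) {
        return false;
    }
    return device_realize(&s->mc97);
}
```

The early return on failure is the behavior the `if (!qdev_realize(...)) return;` lines in the patch encode.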


[PATCH v10 12/12] vdpa: Delete CVQ migration blocker

2022-08-23 Thread Eugenio Pérez
We can restore the device state in the destination via CVQ now. Remove
the migration blocker.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 include/hw/virtio/vhost-vdpa.h |  1 -
 hw/virtio/vhost-vdpa.c | 15 ---
 net/vhost-vdpa.c   |  2 --
 3 files changed, 18 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index d10a89303e..d85643 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -35,7 +35,6 @@ typedef struct vhost_vdpa {
 bool shadow_vqs_enabled;
 /* IOVA mapping used by the Shadow Virtqueue */
 VhostIOVATree *iova_tree;
-Error *migration_blocker;
 GPtrArray *shadow_vqs;
 const VhostShadowVirtqueueOps *shadow_vq_ops;
 void *shadow_vq_ops_opaque;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 23ae5ef48b..7468e44b87 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1033,13 +1033,6 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
 return true;
 }
 
-if (v->migration_blocker) {
-int r = migrate_add_blocker(v->migration_blocker, &err);
-if (unlikely(r < 0)) {
-return false;
-}
-}
-
 for (i = 0; i < v->shadow_vqs->len; ++i) {
 VirtQueue *vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
 VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
@@ -1082,10 +1075,6 @@ err:
 vhost_svq_stop(svq);
 }
 
-if (v->migration_blocker) {
-migrate_del_blocker(v->migration_blocker);
-}
-
 return false;
 }
 
@@ -1101,10 +1090,6 @@ static void vhost_vdpa_svqs_stop(struct vhost_dev *dev)
 VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
 vhost_vdpa_svq_unmap_rings(dev, svq);
 }
-
-if (v->migration_blocker) {
-migrate_del_blocker(v->migration_blocker);
-}
 }
 
 static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 640434d1ea..6ce68fcd3f 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -555,8 +555,6 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
 
 s->vhost_vdpa.shadow_vq_ops = &vhost_vdpa_net_svq_ops;
 s->vhost_vdpa.shadow_vq_ops_opaque = s;
-error_setg(&s->vhost_vdpa.migration_blocker,
-   "Migration disabled: vhost-vdpa uses CVQ.");
 }
 ret = vhost_vdpa_add(nc, (void *)&s->vhost_vdpa, queue_pair_index, nvqs);
 if (ret) {
-- 
2.31.1




Re: [PATCH 1/9] hw/isa/vt82c686: QOM'ify Super I/O creation

2022-08-23 Thread Bernhard Beschow
On Tue, Aug 23, 2022 at 2:35 AM BALATON Zoltan  wrote:

> On Tue, 23 Aug 2022, Bernhard Beschow wrote:
> > The object creation now happens in chip-specific init methods, which
> > allows the realize methods to be consolidated into one method. Shifting
> > the logic into the init methods has the additional advantage that the
> > parent object's init methods are called implicitly.
>
> This and later patches titled "QOM'ify" don't do what I understand by
> QOMifying, which is turning an old device model into a QOM object. These
> devices are already QOMified; what's really done here is that they are
> moved to the ViaISAState or embedded into it and created as part of the
> south bridge rather than individually by the boards. Either my
> understanding of what QOMify means is wrong or these patches are misnamed.
>

I think your understanding is correct. Peter refers to it as the
embed-the-device-struct style [1] which I can take as inspiration for
renaming my patches in v2.

Technically via_isa_realize() is the realize method of the abstract
> TYPE_VIA_ISA class, which was overridden by the subclasses; but since QOM
> does not call overridden methods implicitly, it had to be explicitly
> called in the overriding realize methods of the subclasses. Now that we
> don't have to override the method, maybe it could be set once on the
> VIA_ISA class and it may work that way, but as it's done here is also OK,
> maybe as a reminder that this super class method should be called by any
> overriding method if one is added in the future for some reason.
>

Right. This would involve moving some code around and creating a class
struct. Do you think it's worth it?

Best regards,
Bernhard

[1]
http://patchwork.ozlabs.org/project/qemu-devel/patch/20220205175913.31738-2-shen...@gmail.com/

Regards,
> BALATON Zoltan
>
> > Signed-off-by: Bernhard Beschow 
> > ---
> > hw/isa/vt82c686.c | 33 ++---
> > 1 file changed, 18 insertions(+), 15 deletions(-)
> >
> > diff --git a/hw/isa/vt82c686.c b/hw/isa/vt82c686.c
> > index 8f656251b8..0217c98fe4 100644
> > --- a/hw/isa/vt82c686.c
> > +++ b/hw/isa/vt82c686.c
> > @@ -544,7 +544,7 @@ struct ViaISAState {
> > qemu_irq cpu_intr;
> > qemu_irq *isa_irqs;
> > ISABus *isa_bus;
> > -ViaSuperIOState *via_sio;
> > +ViaSuperIOState via_sio;
> > };
> >
> > static const VMStateDescription vmstate_via = {
> > @@ -602,6 +602,11 @@ static void via_isa_realize(PCIDevice *d, Error
> **errp)
> > d->wmask[i] = 0;
> > }
> > }
> > +
> > +/* Super I/O */
> > +if (!qdev_realize(DEVICE(&s->via_sio), BUS(s->isa_bus), errp)) {
> > +return;
> > +}
> > }
> >
> > /* TYPE_VT82C686B_ISA */
> > @@ -615,7 +620,7 @@ static void vt82c686b_write_config(PCIDevice *d,
> uint32_t addr,
> > pci_default_write_config(d, addr, val, len);
> > if (addr == 0x85) {
> > /* BIT(1): enable or disable superio config io ports */
> > -via_superio_io_enable(s->via_sio, val & BIT(1));
> > +via_superio_io_enable(&s->via_sio, val & BIT(1));
> > }
> > }
> >
> > @@ -639,13 +644,11 @@ static void vt82c686b_isa_reset(DeviceState *dev)
> > pci_conf[0x77] = 0x10; /* GPIO Control 1/2/3/4 */
> > }
> >
> > -static void vt82c686b_realize(PCIDevice *d, Error **errp)
> > +static void vt82c686b_init(Object *obj)
> > {
> > -ViaISAState *s = VIA_ISA(d);
> > +ViaISAState *s = VIA_ISA(obj);
> >
> > -via_isa_realize(d, errp);
> > -s->via_sio = VIA_SUPERIO(isa_create_simple(s->isa_bus,
> > -   TYPE_VT82C686B_SUPERIO));
> > +object_initialize_child(obj, "sio", &s->via_sio, TYPE_VT82C686B_SUPERIO);
> > }
> >
> > static void vt82c686b_class_init(ObjectClass *klass, void *data)
> > @@ -653,7 +656,7 @@ static void vt82c686b_class_init(ObjectClass *klass,
> void *data)
> > DeviceClass *dc = DEVICE_CLASS(klass);
> > PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
> >
> > -k->realize = vt82c686b_realize;
> > +k->realize = via_isa_realize;
> > k->config_write = vt82c686b_write_config;
> > k->vendor_id = PCI_VENDOR_ID_VIA;
> > k->device_id = PCI_DEVICE_ID_VIA_82C686B_ISA;
> > @@ -670,6 +673,7 @@ static const TypeInfo vt82c686b_isa_info = {
> > .name  = TYPE_VT82C686B_ISA,
> > .parent= TYPE_VIA_ISA,
> > .instance_size = sizeof(ViaISAState),
> > +.instance_init = vt82c686b_init,
> > .class_init= vt82c686b_class_init,
> > };
> >
> > @@ -684,7 +688,7 @@ static void vt8231_write_config(PCIDevice *d,
> uint32_t addr,
> > pci_default_write_config(d, addr, val, len);
> > if (addr == 0x50) {
> > /* BIT(2): enable or disable superio config io ports */
> > -via_superio_io_enable(s->via_sio, val & BIT(2));
> > +via_superio_io_enable(>via_sio, val & BIT(2));
> > }
> > }
> >
> > @@ -703,13 +707,11 @@ static void vt8231_isa_reset(DeviceState *dev)
> > pci_conf[0x6b] = 0x01; /* Fast IR I/O Base */
> > }

[PATCH v10 11/12] vdpa: Add virtio-net mac address via CVQ at start

2022-08-23 Thread Eugenio Pérez
This is needed so the destination vdpa device sees the same state the
guest set in the source.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
v9:
* Use guest acked features instead of device's.
* Constify vhost_vdpa and VirtIONet variables.
* Delete unneeded increment of cursor.

v8:
* Delete unneeded copy from device's in buffer.

v6:
* Map and unmap command buffers at the start and end of device usage.

v5:
* Rename s/start/load/
* Use independent NetClientInfo to only add load callback on cvq.
---
 net/vhost-vdpa.c | 40 
 1 file changed, 40 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 3575bf64ee..640434d1ea 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -363,11 +363,51 @@ static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, 
size_t out_len,
 return vhost_svq_poll(svq);
 }
 
+static int vhost_vdpa_net_load(NetClientState *nc)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+const struct vhost_vdpa *v = &s->vhost_vdpa;
+const VirtIONet *n;
+uint64_t features;
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+if (!v->shadow_vqs_enabled) {
+return 0;
+}
+
+n = VIRTIO_NET(v->dev->vdev);
+features = n->parent_obj.guest_features;
+if (features & BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR)) {
+const struct virtio_net_ctrl_hdr ctrl = {
+.class = VIRTIO_NET_CTRL_MAC,
+.cmd = VIRTIO_NET_CTRL_MAC_ADDR_SET,
+};
+char *cursor = s->cvq_cmd_out_buffer;
+ssize_t dev_written;
+
+memcpy(cursor, &ctrl, sizeof(ctrl));
+cursor += sizeof(ctrl);
+memcpy(cursor, n->mac, sizeof(n->mac));
+
+dev_written = vhost_vdpa_net_cvq_add(s, sizeof(ctrl) + sizeof(n->mac),
+ sizeof(virtio_net_ctrl_ack));
+if (unlikely(dev_written < 0)) {
+return dev_written;
+}
+
+return *((virtio_net_ctrl_ack *)s->cvq_cmd_in_buffer) != VIRTIO_NET_OK;
+}
+
+return 0;
+}
+
 static NetClientInfo net_vhost_vdpa_cvq_info = {
 .type = NET_CLIENT_DRIVER_VHOST_VDPA,
 .size = sizeof(VhostVDPAState),
 .receive = vhost_vdpa_receive,
 .start = vhost_vdpa_net_cvq_start,
+.load = vhost_vdpa_net_load,
 .stop = vhost_vdpa_net_cvq_stop,
 .cleanup = vhost_vdpa_cleanup,
 .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
-- 
2.31.1
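The out-buffer layout that vhost_vdpa_net_load() builds above, a control header immediately followed by the 6-byte MAC, can be checked with a small stand-alone sketch. The struct is a simplified stand-in for the virtio-net one; the class/cmd values 1/1 are the spec's VIRTIO_NET_CTRL_MAC / VIRTIO_NET_CTRL_MAC_ADDR_SET:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct ctrl_hdr {
    uint8_t class;   /* VIRTIO_NET_CTRL_MAC == 1 */
    uint8_t cmd;     /* VIRTIO_NET_CTRL_MAC_ADDR_SET == 1 */
};

/* Pack header then MAC into a contiguous command buffer, as the
 * cursor-based memcpy sequence in the patch does.  Returns the total
 * number of bytes written. */
size_t build_mac_cmd(uint8_t *buf, const uint8_t mac[6])
{
    const struct ctrl_hdr ctrl = { 1, 1 };
    uint8_t *cursor = buf;

    memcpy(cursor, &ctrl, sizeof(ctrl));
    cursor += sizeof(ctrl);
    memcpy(cursor, mac, 6);
    return sizeof(ctrl) + 6;
}
```

Since both header fields are single bytes, the struct has no padding and the command is exactly 8 bytes long.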




Re: [PATCH] audio: exit(1) if audio backend failed to be found or initialized

2022-08-23 Thread Volker Rümelin

From: Marc-André Lureau

If you specify a known backend but it isn't compiled in, or failed to
initialize, you get a simple warning and the "none" backend as a
fallback, and QEMU runs happily:

$ qemu-system-x86_64 -audiodev id=audio,driver=dsound
audio: Unknown audio driver `dsound'
audio: warning: Using timer based audio emulation
...

Instead, QEMU should fail to start:
$ qemu-system-x86_64 -audiodev id=audio,driver=dsound
audio: Unknown audio driver `dsound'
$

Resolves:
https://bugzilla.redhat.com/show_bug.cgi?id=1983493

Signed-off-by: Marc-André Lureau
---
  audio/audio.h |  2 +-
  audio/audio.c | 14 +++---
  softmmu/vl.c  |  4 +++-
  3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/audio/audio.h b/audio/audio.h
index b5e17cd218..27e67079a0 100644
--- a/audio/audio.h
+++ b/audio/audio.h
@@ -170,7 +170,7 @@ void audio_sample_from_uint64(void *samples, int pos,
  
  void audio_define(Audiodev *audio);

  void audio_parse_option(const char *opt);
-void audio_init_audiodevs(void);
+bool audio_init_audiodevs(void);
  void audio_legacy_help(void);
  
  AudioState *audio_state_by_name(const char *name);

diff --git a/audio/audio.c b/audio/audio.c
index a02f3ce5c6..76b8735b44 100644
--- a/audio/audio.c
+++ b/audio/audio.c
@@ -1743,7 +1743,6 @@ static AudioState *audio_init(Audiodev *dev, const char 
*name)
  atexit(audio_cleanup);
  atexit_registered = true;
  }
-QTAILQ_INSERT_TAIL(&audio_states, s, list);
  
  s->ts = timer_new_ns(QEMU_CLOCK_VIRTUAL, audio_timer, s);
  
@@ -1769,6 +1768,10 @@ static AudioState *audio_init(Audiodev *dev, const char *name)

  } else {
  dolog ("Unknown audio driver `%s'\n", drvname);
  }
+if (!done) {
+free_audio_state(s);
+return NULL;


Hi Marc-André,

I don't understand why you move the QTAILQ_INSERT_TAIL(&audio_states, s,
list) macro down. Without this move you wouldn't need the additional
free_audio_state() call; audio_cleanup() takes care of freeing the audio state.


The rest looks good. You can add my

Reviewed-by: Volker Rümelin 

even if you insist on keeping the patch as it is.

With best regards,
Volker


+}
  } else {
  for (i = 0; audio_prio_list[i]; i++) {
   AudiodevListEntry *e = audiodev_find(&head, audio_prio_list[i]);
@@ -1806,6 +1809,7 @@ static AudioState *audio_init(Audiodev *dev, const char 
*name)
 "(Audio can continue looping even after stopping the VM)\n");
  }
  
+QTAILQ_INSERT_TAIL(&audio_states, s, list);

   QLIST_INIT (&s->card_head);
  vmstate_register (NULL, 0, _audio, s);
  return s;
@@ -2119,13 +2123,17 @@ void audio_define(Audiodev *dev)
   QSIMPLEQ_INSERT_TAIL(&audiodevs, e, next);
  }
  
-void audio_init_audiodevs(void)

+bool audio_init_audiodevs(void)
  {
  AudiodevListEntry *e;
  
   QSIMPLEQ_FOREACH(e, &audiodevs, next) {

-audio_init(e->dev, NULL);
+if (!audio_init(e->dev, NULL)) {
+return false;
+}
  }
+
+return true;
  }
  
  audsettings audiodev_to_audsettings(AudiodevPerDirectionOptions *pdo)

diff --git a/softmmu/vl.c b/softmmu/vl.c
index 706bd7cff7..dea4005e47 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -1885,7 +1885,9 @@ static void qemu_create_early_backends(void)
   * setting machine properties, so they can be referred to.
   */
   configure_blockdev(&bdo_queue, machine_class, snapshot);
-audio_init_audiodevs();
+if (!audio_init_audiodevs()) {
+exit(1);
+}
  }
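The ordering being discussed, appending a state to the global list only after it initialized successfully so the failure path can free it directly, can be sketched in isolation. All names here are illustrative stand-ins, not QEMU's audio code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct audio_state {
    struct audio_state *next;
    bool driver_ok;
} audio_state_t;

static audio_state_t *audio_states;   /* list of fully initialized states */

/* On failure, the state was never inserted into the list, so freeing it
 * here cannot leave a dangling list entry for a later cleanup pass. */
static audio_state_t *audio_init(bool driver_works)
{
    audio_state_t *s = calloc(1, sizeof(*s));
    s->driver_ok = driver_works;
    if (!s->driver_ok) {
        free(s);
        return NULL;
    }
    s->next = audio_states;           /* insert only after success */
    audio_states = s;
    return s;
}

static int live_states(void)
{
    int n = 0;
    for (audio_state_t *s = audio_states; s; s = s->next) {
        n++;
    }
    return n;
}
```

The alternative Volker describes, inserting first and relying on the cleanup pass to free list members, also works; the two schemes just place the free in different code paths.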
  
  





[PATCH v10 09/12] vdpa: extract vhost_vdpa_net_cvq_add from vhost_vdpa_net_handle_ctrl_avail

2022-08-23 Thread Eugenio Pérez
So we can reuse it to inject state messages.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
--
v7:
* Remove double free error

v6:
* Do not assume in buffer sent to the device is sizeof(virtio_net_ctrl_ack)

v5:
* Do not use an artificial !NULL VirtQueueElement
* Use only out size instead of iovec dev_buffers for these functions.
---
 net/vhost-vdpa.c | 59 +++-
 1 file changed, 38 insertions(+), 21 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 452d10ed93..3575bf64ee 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -331,6 +331,38 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
 }
 }
 
+static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
+  size_t in_len)
+{
+/* Buffers for the device */
+const struct iovec out = {
+.iov_base = s->cvq_cmd_out_buffer,
+.iov_len = out_len,
+};
+const struct iovec in = {
+.iov_base = s->cvq_cmd_in_buffer,
+.iov_len = sizeof(virtio_net_ctrl_ack),
+};
+VhostShadowVirtqueue *svq = g_ptr_array_index(s->vhost_vdpa.shadow_vqs, 0);
+int r;
+
+r = vhost_svq_add(svq, &out, 1, &in, 1, NULL);
+if (unlikely(r != 0)) {
+if (unlikely(r == -ENOSPC)) {
+qemu_log_mask(LOG_GUEST_ERROR, "%s: No space on device queue\n",
+  __func__);
+}
+return r;
+}
+
+/*
+ * We can poll here since we've had BQL from the time we sent the
+ * descriptor. Also, we need to take the answer before SVQ pulls by itself,
+ * when BQL is released
+ */
+return vhost_svq_poll(svq);
+}
+
 static NetClientInfo net_vhost_vdpa_cvq_info = {
 .type = NET_CLIENT_DRIVER_VHOST_VDPA,
 .size = sizeof(VhostVDPAState),
@@ -387,23 +419,18 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
 void *opaque)
 {
 VhostVDPAState *s = opaque;
-size_t in_len, dev_written;
+size_t in_len;
 virtio_net_ctrl_ack status = VIRTIO_NET_ERR;
 /* Out buffer sent to both the vdpa device and the device model */
 struct iovec out = {
 .iov_base = s->cvq_cmd_out_buffer,
 };
-/* In buffer sent to the device */
-const struct iovec dev_in = {
-.iov_base = s->cvq_cmd_in_buffer,
-.iov_len = sizeof(virtio_net_ctrl_ack),
-};
 /* in buffer used for device model */
 const struct iovec in = {
 .iov_base = &status,
 .iov_len = sizeof(status),
 };
-int r = -EINVAL;
+ssize_t dev_written = -EINVAL;
 bool ok;
 
 out.iov_len = iov_to_buf(elem->out_sg, elem->out_num, 0,
@@ -414,21 +441,11 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
 goto out;
 }
 
-r = vhost_svq_add(svq, &out, 1, &dev_in, 1, elem);
-if (unlikely(r != 0)) {
-if (unlikely(r == -ENOSPC)) {
-qemu_log_mask(LOG_GUEST_ERROR, "%s: No space on device queue\n",
-  __func__);
-}
+dev_written = vhost_vdpa_net_cvq_add(s, out.iov_len, sizeof(status));
+if (unlikely(dev_written < 0)) {
 goto out;
 }
 
-/*
- * We can poll here since we've had BQL from the time we sent the
- * descriptor. Also, we need to take the answer before SVQ pulls by itself,
- * when BQL is released
- */
-dev_written = vhost_svq_poll(svq);
 if (unlikely(dev_written < sizeof(status))) {
 error_report("Insufficient written data (%zu)", dev_written);
 goto out;
@@ -436,7 +453,7 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
 
 memcpy(&status, s->cvq_cmd_in_buffer, sizeof(status));
 if (status != VIRTIO_NET_OK) {
-goto out;
+return VIRTIO_NET_ERR;
 }
 
 status = VIRTIO_NET_ERR;
@@ -453,7 +470,7 @@ out:
 }
 vhost_svq_push_elem(svq, elem, MIN(in_len, sizeof(status)));
 g_free(elem);
-return r;
+return dev_written < 0 ? dev_written : 0;
 }
 
 static const VhostShadowVirtqueueOps vhost_vdpa_net_svq_ops = {
-- 
2.31.1
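The submit-then-poll shape of vhost_vdpa_net_cvq_add() above can be sketched with a toy one-slot queue. These are stand-ins, not the real shadow virtqueue API; the real poll blocks until the device marks the descriptor used:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef struct {
    int busy;            /* one descriptor slot */
    size_t pending_in;   /* bytes the "device" will report as written */
} svq_t;

/* Adding to a full queue fails with -ENOSPC, as vhost_svq_add() does. */
int svq_add(svq_t *svq, size_t in_len)
{
    if (svq->busy) {
        return -ENOSPC;
    }
    svq->busy = 1;
    svq->pending_in = in_len;
    return 0;
}

/* Returns the number of bytes the device wrote into the in buffer. */
size_t svq_poll(svq_t *svq)
{
    svq->busy = 0;       /* descriptor used and reclaimed */
    return svq->pending_in;
}

/* Mirrors the add-then-poll pairing: error from add is propagated,
 * otherwise the poll result is returned. */
long cvq_add(svq_t *svq, size_t in_len)
{
    int r = svq_add(svq, in_len);
    if (r != 0) {
        return r;
    }
    return (long)svq_poll(svq);
}
```

Pairing the poll directly with the add is what makes it safe to reuse this helper for state-restore commands, since the caller holds the BQL for the whole round trip.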




[PATCH v10 10/12] vhost_net: add NetClientState->load() callback

2022-08-23 Thread Eugenio Pérez
It allows per-net-client operations right after the device's successful
start; in particular, loading the device status.

Vhost-vdpa net will use it to add the CVQ buffers to restore the device
status.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
v5: Rename start / load, naming it more specifically.
---
 include/net/net.h  | 2 ++
 hw/net/vhost_net.c | 7 +++
 2 files changed, 9 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index 476ad45b9a..81d0b21def 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -45,6 +45,7 @@ typedef struct NICConf {
 typedef void (NetPoll)(NetClientState *, bool enable);
 typedef bool (NetCanReceive)(NetClientState *);
 typedef int (NetStart)(NetClientState *);
+typedef int (NetLoad)(NetClientState *);
 typedef void (NetStop)(NetClientState *);
 typedef ssize_t (NetReceive)(NetClientState *, const uint8_t *, size_t);
 typedef ssize_t (NetReceiveIOV)(NetClientState *, const struct iovec *, int);
@@ -74,6 +75,7 @@ typedef struct NetClientInfo {
 NetReceiveIOV *receive_iov;
 NetCanReceive *can_receive;
 NetStart *start;
+NetLoad *load;
 NetStop *stop;
 NetCleanup *cleanup;
 LinkStatusChanged *link_status_changed;
diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 9d4b334453..d28f8b974b 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -281,6 +281,13 @@ static int vhost_net_start_one(struct vhost_net *net,
 }
 }
 }
+
+if (net->nc->info->load) {
+r = net->nc->info->load(net->nc);
+if (r < 0) {
+goto fail;
+}
+}
 return 0;
 fail:
 file.fd = -1;
-- 
2.31.1
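A minimal sketch of the optional-callback pattern this patch introduces: after a successful start, call `info->load` only if the client provides it, and propagate its error so the caller can unwind. Stand-in types, not QEMU's NetClientInfo:

```c
#include <assert.h>
#include <stddef.h>

typedef struct client client_t;
typedef int (*load_fn)(client_t *);

struct client {
    load_fn load;      /* optional: may be NULL */
    int loaded;
};

/* Mirrors the hunk in vhost_net_start_one(): load runs only after the
 * start path succeeded, and a negative return aborts the start. */
int net_start_one(client_t *c)
{
    /* ... device start would happen here ... */
    if (c->load) {
        int r = c->load(c);
        if (r < 0) {
            return r;
        }
    }
    return 0;
}

static int load_ok(client_t *c)   { c->loaded = 1; return 0; }
static int load_fail(client_t *c) { (void)c; return -1; }
```

Clients that don't need post-start work simply leave the pointer NULL and are unaffected.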




Re: [PATCH 8/9] hw/isa/vt82c686: QOM'ify RTC creation

2022-08-23 Thread Bernhard Beschow
On Tue, Aug 23, 2022 at 2:20 AM BALATON Zoltan  wrote:

> On Tue, 23 Aug 2022, Bernhard Beschow wrote:
> > Signed-off-by: Bernhard Beschow 
> > ---
> > hw/isa/vt82c686.c | 12 +++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/isa/vt82c686.c b/hw/isa/vt82c686.c
> > index 47f2fd2669..ee745d5d49 100644
> > --- a/hw/isa/vt82c686.c
> > +++ b/hw/isa/vt82c686.c
> > @@ -546,6 +546,7 @@ struct ViaISAState {
> > qemu_irq cpu_intr;
> > qemu_irq *isa_irqs;
> > ViaSuperIOState via_sio;
> > +RTCState rtc;
> > PCIIDEState ide;
> > UHCIState uhci[2];
> > ViaPMState pm;
> > @@ -567,6 +568,7 @@ static void via_isa_init(Object *obj)
> > {
> > ViaISAState *s = VIA_ISA(obj);
> >
> > +object_initialize_child(obj, "rtc", &s->rtc, TYPE_MC146818_RTC);
> > object_initialize_child(obj, "ide", &s->ide, "via-ide");
> > object_initialize_child(obj, "uhci1", &s->uhci[0], "vt82c686b-usb-uhci");
> > object_initialize_child(obj, "uhci2", &s->uhci[1], "vt82c686b-usb-uhci");
> > @@ -615,7 +617,15 @@ static void via_isa_realize(PCIDevice *d, Error
> **errp)
> > isa_bus_irqs(isa_bus, s->isa_irqs);
> > i8254_pit_init(isa_bus, 0x40, 0, NULL);
> > i8257_dma_init(isa_bus, 0);
> > -mc146818_rtc_init(isa_bus, 2000, NULL);
> > +
> > +/* RTC */
> > +qdev_prop_set_int32(DEVICE(&s->rtc), "base_year", 2000);
> > +if (!qdev_realize(DEVICE(&s->rtc), BUS(isa_bus), errp)) {
> > +return;
> > +}
> > +object_property_add_alias(qdev_get_machine(), "rtc-time", OBJECT(&s->rtc),
> > +  "date");
> > +isa_connect_gpio_out(ISA_DEVICE(&s->rtc), 0, s->rtc.isairq);
> >
> > for (i = 0; i < PCI_CONFIG_HEADER_SIZE; i++) {
> > if (i < PCI_COMMAND || i >= PCI_REVISION_ID) {
> >
>
> This actually introduces code duplication as all other places except piix4
> seem to still use the init function (probably to ensure that the rtc-time
> alias on the machine is properly set) so I'd keep this the same as
> everything else and drop this patch until this init function is removed
> from all other places as well.
>

Hi Zoltan,

Thanks for the fast reply! Regarding code homogeneity and duplication I've
made a similar argument for mc146818_rtc_init() in the past [1] and I've
learnt that my patch went backwards. Incidentally, Peter mentioned vt686c
as a candidate for the embed-the-device-struct style which - again
incidentally - I've now done.

The rtc-time alias is actually only used by a couple of PPC machines where
Pegasos II is one of them. So the alias actually needs to be created only
for these machines, and identifying the cases where it has to be preserved
requires a lot of careful investigation. In the Pegasos II case this seems
especially complicated since one needs to look through several layers of
devices. During my work on the VT82xx south bridges I've gained some
knowledge such that I'd like to make this simplifying contribution.

Our discussion makes me realize that the creation of the alias could now
actually be moved to the Pegasos II board. This way, the Pegasos II board
would both create and consume that alias, which seems to remove quite some
cognitive load. Do you agree? Would moving the alias to the board work for
you?

Thanks,
Bernhard

[1]
http://patchwork.ozlabs.org/project/qemu-devel/patch/20220205175913.31738-2-shen...@gmail.com/

>
> Regards,
> BALATON Zoltan
>


[PATCH v10 08/12] vdpa: Move command buffers map to start of net device

2022-08-23 Thread Eugenio Pérez
As this series will reuse them to restore the device state at the end of
a migration (or a device start), let's allocate only once at the device
start so we don't duplicate their map and unmap.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
v10: Solve conflict on vhost_iova_tree_remove call by value
---
 net/vhost-vdpa.c | 123 ++-
 1 file changed, 58 insertions(+), 65 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 1a597c2e92..452d10ed93 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -263,29 +263,20 @@ static size_t vhost_vdpa_net_cvq_cmd_page_len(void)
 return ROUND_UP(vhost_vdpa_net_cvq_cmd_len(), qemu_real_host_page_size());
 }
 
-/** Copy and map a guest buffer. */
-static bool vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v,
-   const struct iovec *out_data,
-   size_t out_num, size_t data_len, void *buf,
-   size_t *written, bool write)
+/** Map CVQ buffer. */
+static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
+  bool write)
 {
 DMAMap map = {};
 int r;
 
-if (unlikely(!data_len)) {
-qemu_log_mask(LOG_GUEST_ERROR, "%s: invalid legnth of %s buffer\n",
-  __func__, write ? "in" : "out");
-return false;
-}
-
-*written = iov_to_buf(out_data, out_num, 0, buf, data_len);
 map.translated_addr = (hwaddr)(uintptr_t)buf;
-map.size = vhost_vdpa_net_cvq_cmd_page_len() - 1;
+map.size = size - 1;
 map.perm = write ? IOMMU_RW : IOMMU_RO,
r = vhost_iova_tree_map_alloc(v->iova_tree, &map);
 if (unlikely(r != IOVA_OK)) {
 error_report("Cannot map injected element");
-return false;
+return r;
 }
 
 r = vhost_vdpa_dma_map(v, map.iova, vhost_vdpa_net_cvq_cmd_page_len(), buf,
@@ -294,50 +285,58 @@ static bool vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v,
 goto dma_map_err;
 }
 
-return true;
+return 0;
 
 dma_map_err:
 vhost_iova_tree_remove(v->iova_tree, map);
-return false;
+return r;
 }
 
-/**
- * Copy the guest element into a dedicated buffer suitable to be sent to NIC
- *
- * @iov: [0] is the out buffer, [1] is the in one
- */
-static bool vhost_vdpa_net_cvq_map_elem(VhostVDPAState *s,
-VirtQueueElement *elem,
-struct iovec *iov)
+static int vhost_vdpa_net_cvq_start(NetClientState *nc)
 {
-size_t in_copied;
-bool ok;
+VhostVDPAState *s;
+int r;
 
-iov[0].iov_base = s->cvq_cmd_out_buffer;
-ok = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, elem->out_sg, elem->out_num,
-vhost_vdpa_net_cvq_cmd_len(), iov[0].iov_base,
-&iov[0].iov_len, false);
-if (unlikely(!ok)) {
-return false;
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+s = DO_UPCAST(VhostVDPAState, nc, nc);
+if (!s->vhost_vdpa.shadow_vqs_enabled) {
+return 0;
 }
 
-iov[1].iov_base = s->cvq_cmd_in_buffer;
-ok = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, NULL, 0,
-sizeof(virtio_net_ctrl_ack), iov[1].iov_base,
-&in_copied, true);
-if (unlikely(!ok)) {
+r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
+   vhost_vdpa_net_cvq_cmd_page_len(), false);
+if (unlikely(r < 0)) {
+return r;
+}
+
+r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_in_buffer,
+   vhost_vdpa_net_cvq_cmd_page_len(), true);
+if (unlikely(r < 0)) {
vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
-return false;
 }
 
-iov[1].iov_len = sizeof(virtio_net_ctrl_ack);
-return true;
+return r;
+}
+
+static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+if (s->vhost_vdpa.shadow_vqs_enabled) {
+vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
+vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_in_buffer);
+}
 }
 
 static NetClientInfo net_vhost_vdpa_cvq_info = {
 .type = NET_CLIENT_DRIVER_VHOST_VDPA,
 .size = sizeof(VhostVDPAState),
 .receive = vhost_vdpa_receive,
+.start = vhost_vdpa_net_cvq_start,
+.stop = vhost_vdpa_net_cvq_stop,
 .cleanup = vhost_vdpa_cleanup,
 .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
 .has_ufo = vhost_vdpa_has_ufo,
@@ -348,19 +347,17 @@ static NetClientInfo net_vhost_vdpa_cvq_info = {
  * Do not forward commands not supported by SVQ. Otherwise, the device could
  * accept it and qemu would not know how to update the device model.
  */
-static bool vhost_vdpa_net_cvq_validate_cmd(const struct iovec *out,
-

[PATCH v10 04/12] vhost: Do not depend on !NULL VirtQueueElement on vhost_svq_flush

2022-08-23 Thread Eugenio Pérez
Since QEMU will be able to inject new elements on CVQ to restore the
state, we no longer need to depend on a VirtQueueElement to know whether
a new element has been used by the device. Instead of checking that,
check whether there are new elements using only the used idx in
vhost_svq_flush.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
v6: Change less from the previous function
---
 hw/virtio/vhost-shadow-virtqueue.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index 8df5296f24..e8e5bbc368 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -499,17 +499,20 @@ static void vhost_svq_flush(VhostShadowVirtqueue *svq,
 size_t vhost_svq_poll(VhostShadowVirtqueue *svq)
 {
 int64_t start_us = g_get_monotonic_time();
+uint32_t len;
+
 do {
-uint32_t len;
-VirtQueueElement *elem = vhost_svq_get_buf(svq, &len);
-if (elem) {
-return len;
+if (vhost_svq_more_used(svq)) {
+break;
 }
 
 if (unlikely(g_get_monotonic_time() - start_us > 10e6)) {
 return 0;
 }
 } while (true);
+
+vhost_svq_get_buf(svq, &len);
+return len;
 }
 
 /**
-- 
2.31.1




[PATCH v10 02/12] vhost: use SVQ element ndescs instead of opaque data for desc validation

2022-08-23 Thread Eugenio Pérez
Since we're going to allow SVQ to add elements without the guest's
knowledge and without its own VirtQueueElement, it's easier to check
whether an element is a valid head by checking something other than the
VirtQueueElement.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 hw/virtio/vhost-shadow-virtqueue.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index a1261d4a0f..b35aeef4bd 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -414,7 +414,7 @@ static VirtQueueElement 
*vhost_svq_get_buf(VhostShadowVirtqueue *svq,
 return NULL;
 }
 
-if (unlikely(!svq->desc_state[used_elem.id].elem)) {
+if (unlikely(!svq->desc_state[used_elem.id].ndescs)) {
 qemu_log_mask(LOG_GUEST_ERROR,
 "Device %s says index %u is used, but it was not available",
 svq->vdev->name, used_elem.id);
@@ -422,6 +422,7 @@ static VirtQueueElement 
*vhost_svq_get_buf(VhostShadowVirtqueue *svq,
 }
 
 num = svq->desc_state[used_elem.id].ndescs;
+svq->desc_state[used_elem.id].ndescs = 0;
 last_used_chain = vhost_svq_last_desc_of_chain(svq, num, used_elem.id);
 svq->desc_next[last_used_chain] = svq->free_head;
 svq->free_head = used_elem.id;
-- 
2.31.1




[PATCH v10 07/12] vdpa: add net_vhost_vdpa_cvq_info NetClientInfo

2022-08-23 Thread Eugenio Pérez
Next patches will add a new info callback to restore NIC status through
CVQ. Since only the CVQ vhost device is needed, create it with a new
NetClientInfo.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
v5: Create a new NetClientInfo instead of reusing the dataplane one.
---
 net/vhost-vdpa.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index a49e7e649d..1a597c2e92 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -334,6 +334,16 @@ static bool vhost_vdpa_net_cvq_map_elem(VhostVDPAState *s,
 return true;
 }
 
+static NetClientInfo net_vhost_vdpa_cvq_info = {
+.type = NET_CLIENT_DRIVER_VHOST_VDPA,
+.size = sizeof(VhostVDPAState),
+.receive = vhost_vdpa_receive,
+.cleanup = vhost_vdpa_cleanup,
+.has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
+.has_ufo = vhost_vdpa_has_ufo,
+.check_peer_type = vhost_vdpa_check_peer_type,
+};
+
 /**
  * Do not forward commands not supported by SVQ. Otherwise, the device could
  * accept it and qemu would not know how to update the device model.
@@ -475,7 +485,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
nc = qemu_new_net_client(&net_vhost_vdpa_info, peer, device,
  name);
 } else {
-nc = qemu_new_net_control_client(&net_vhost_vdpa_info, peer,
+nc = qemu_new_net_control_client(&net_vhost_vdpa_cvq_info, peer,
  device, name);
 }
 snprintf(nc->info_str, sizeof(nc->info_str), TYPE_VHOST_VDPA);
-- 
2.31.1




[PATCH v10 03/12] vhost: Delete useless read memory barrier

2022-08-23 Thread Eugenio Pérez
As discussed in previous series [1], this memory barrier is useless with
the atomic read of used idx at vhost_svq_more_used. Deleting it.

[1] https://lists.nongnu.org/archive/html/qemu-devel/2022-07/msg02616.html

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 hw/virtio/vhost-shadow-virtqueue.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index b35aeef4bd..8df5296f24 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -509,9 +509,6 @@ size_t vhost_svq_poll(VhostShadowVirtqueue *svq)
 if (unlikely(g_get_monotonic_time() - start_us > 10e6)) {
 return 0;
 }
-
-/* Make sure we read new used_idx */
-smp_rmb();
 } while (true);
 }
 
-- 
2.31.1




[PATCH v10 00/12] NIC vhost-vdpa state restore via Shadow CVQ

2022-08-23 Thread Eugenio Pérez
CVQ of net vhost-vdpa devices can be intercepted since the addition of x-svq.
The virtio-net device model is updated. Migration was blocked because,
although the state could be migrated between VMMs, it was not possible to
restore it on the destination NIC.

This series adds support for SVQ to inject external messages without the guest's
knowledge, so all the guest-visible state is restored before the guest is
resumed. It is done using standard CVQ messages, so the vhost-vdpa device does
not need to learn how to restore it: as long as devices have the feature, they
know how to handle it.

Depending on the device, this series may need fixes with message id [1] to
achieve full live migration.

[1] <20220823182008.97141-1-epere...@redhat.com>

v10:
- Rebase on latest fixes of [1].

v9:
- Use guest acked features instead of device's.
- Minors: fix typos and patch messages, constify vhost_vdpa and VirtIONet vars,
  delete unneeded increment of cursor.

v8:
- Rename NetClientInfo load to start, so is symmetrical with stop()
- Delete copy of device's in buffer at vhost_vdpa_net_load

v7:
- Remove accidental double free.

v6:
- Move map and unmap of the buffers to the start and stop of the device. This
  implies more callbacks on NetClientInfo, but simplifies the SVQ CVQ code.
- Not assume that in buffer is sizeof(virtio_net_ctrl_ack) in
  vhost_vdpa_net_cvq_add
- Reduce the number of changes from previous versions
- Delete unused memory barrier

v5:
- Rename s/start/load/
- Use independent NetClientInfo to only add load callback on cvq.
- Accept out sg instead of dev_buffers[] at vhost_vdpa_net_cvq_map_elem
- Use only out size instead of iovec dev_buffers to know if the descriptor is
  effectively available, allowing to delete artificial !NULL VirtQueueElement
  on vhost_svq_add call.

v4:
- Actually use NetClientInfo callback.

v3:
- Route vhost-vdpa start code through NetClientInfo callback.
- Delete extra vhost_net_stop_one() call.

v2:
- Fix SIGSEGV dereferencing SVQ when not in svq mode

v1 from RFC:
- Do not reorder DRIVER_OK & enable patches.
- Delete leftovers

Eugenio Pérez (12):
  vhost: stop transfer elem ownership in vhost_handle_guest_kick
  vhost: use SVQ element ndescs instead of opaque data for desc
validation
  vhost: Delete useless read memory barrier
  vhost: Do not depend on !NULL VirtQueueElement on vhost_svq_flush
  vhost_net: Add NetClientInfo start callback
  vhost_net: Add NetClientInfo stop callback
  vdpa: add net_vhost_vdpa_cvq_info NetClientInfo
  vdpa: Move command buffers map to start of net device
  vdpa: extract vhost_vdpa_net_cvq_add from
vhost_vdpa_net_handle_ctrl_avail
  vhost_net: add NetClientState->load() callback
  vdpa: Add virtio-net mac address via CVQ at start
  vdpa: Delete CVQ migration blocker

 include/hw/virtio/vhost-vdpa.h |   1 -
 include/net/net.h  |   6 +
 hw/net/vhost_net.c |  17 +++
 hw/virtio/vhost-shadow-virtqueue.c |  27 ++--
 hw/virtio/vhost-vdpa.c |  15 --
 net/vhost-vdpa.c   | 224 ++---
 6 files changed, 177 insertions(+), 113 deletions(-)

-- 
2.31.1





[PATCH v10 06/12] vhost_net: Add NetClientInfo stop callback

2022-08-23 Thread Eugenio Pérez
Used by the backend to perform actions after the device is stopped.

In particular, vdpa net uses it to unmap CVQ buffers from the device,
cleaning up the actions performed in prepare().

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
v9: Replace performend by performed in patch message
---
 include/net/net.h  | 2 ++
 hw/net/vhost_net.c | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index ad9e80083a..476ad45b9a 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -45,6 +45,7 @@ typedef struct NICConf {
 typedef void (NetPoll)(NetClientState *, bool enable);
 typedef bool (NetCanReceive)(NetClientState *);
 typedef int (NetStart)(NetClientState *);
+typedef void (NetStop)(NetClientState *);
 typedef ssize_t (NetReceive)(NetClientState *, const uint8_t *, size_t);
 typedef ssize_t (NetReceiveIOV)(NetClientState *, const struct iovec *, int);
 typedef void (NetCleanup) (NetClientState *);
@@ -73,6 +74,7 @@ typedef struct NetClientInfo {
 NetReceiveIOV *receive_iov;
 NetCanReceive *can_receive;
 NetStart *start;
+NetStop *stop;
 NetCleanup *cleanup;
 LinkStatusChanged *link_status_changed;
 QueryRxFilter *query_rx_filter;
diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 2e0baeba26..9d4b334453 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -320,6 +320,9 @@ static void vhost_net_stop_one(struct vhost_net *net,
 net->nc->info->poll(net->nc, true);
 }
vhost_dev_stop(&net->dev, dev);
+if (net->nc->info->stop) {
+net->nc->info->stop(net->nc);
+}
vhost_dev_disable_notifiers(&net->dev, dev);
 }
 
-- 
2.31.1




[PATCH v10 05/12] vhost_net: Add NetClientInfo start callback

2022-08-23 Thread Eugenio Pérez
This is used by the backend to perform actions before the device is
started.

In particular, vdpa net uses it to map CVQ buffers to the device, so it
can send control commands using them.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
v9: Rename also in patch message
v8: Rename NetClientInfo prepare callback to start, so it aligns with
future "stop"
---
 include/net/net.h  | 2 ++
 hw/net/vhost_net.c | 7 +++
 2 files changed, 9 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index 523136c7ac..ad9e80083a 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -44,6 +44,7 @@ typedef struct NICConf {
 
 typedef void (NetPoll)(NetClientState *, bool enable);
 typedef bool (NetCanReceive)(NetClientState *);
+typedef int (NetStart)(NetClientState *);
 typedef ssize_t (NetReceive)(NetClientState *, const uint8_t *, size_t);
 typedef ssize_t (NetReceiveIOV)(NetClientState *, const struct iovec *, int);
 typedef void (NetCleanup) (NetClientState *);
@@ -71,6 +72,7 @@ typedef struct NetClientInfo {
 NetReceive *receive_raw;
 NetReceiveIOV *receive_iov;
 NetCanReceive *can_receive;
+NetStart *start;
 NetCleanup *cleanup;
 LinkStatusChanged *link_status_changed;
 QueryRxFilter *query_rx_filter;
diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index ccac5b7a64..2e0baeba26 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -244,6 +244,13 @@ static int vhost_net_start_one(struct vhost_net *net,
 struct vhost_vring_file file = { };
 int r;
 
+if (net->nc->info->start) {
+r = net->nc->info->start(net->nc);
+if (r < 0) {
+return r;
+}
+}
+
r = vhost_dev_enable_notifiers(&net->dev, dev);
 if (r < 0) {
 goto fail_notifiers;
-- 
2.31.1




[PATCH v10 01/12] vhost: stop transfer elem ownership in vhost_handle_guest_kick

2022-08-23 Thread Eugenio Pérez
It was easier to allow vhost_svq_add to handle the memory. Now that we
will allow qemu to add elements to a SVQ without the guest's knowledge,
it's better to handle it in the caller.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 hw/virtio/vhost-shadow-virtqueue.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index 82a784d250..a1261d4a0f 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -233,9 +233,6 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
 /**
  * Add an element to a SVQ.
  *
- * The caller must check that there is enough slots for the new element. It
- * takes ownership of the element: In case of failure not ENOSPC, it is free.
- *
  * Return -EINVAL if element is invalid, -ENOSPC if dev queue is full
  */
 int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
@@ -252,7 +249,6 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct 
iovec *out_sg,
 
 ok = vhost_svq_add_split(svq, out_sg, out_num, in_sg, in_num, _head);
 if (unlikely(!ok)) {
-g_free(elem);
 return -EINVAL;
 }
 
@@ -293,7 +289,7 @@ static void vhost_handle_guest_kick(VhostShadowVirtqueue 
*svq)
 virtio_queue_set_notification(svq->vq, false);
 
 while (true) {
-VirtQueueElement *elem;
+g_autofree VirtQueueElement *elem;
 int r;
 
 if (svq->next_guest_avail_elem) {
@@ -324,12 +320,14 @@ static void vhost_handle_guest_kick(VhostShadowVirtqueue 
*svq)
  * queue the current guest descriptor and ignore kicks
  * until some elements are used.
  */
-svq->next_guest_avail_elem = elem;
+svq->next_guest_avail_elem = g_steal_pointer(&elem);
 }
 
 /* VQ is full or broken, just return and ignore kicks */
 return;
 }
+/* elem belongs to SVQ or external caller now */
+elem = NULL;
 }
 
 virtio_queue_set_notification(svq->vq, true);
-- 
2.31.1




[PATCH v2 6/7] vhost: Always store new kick fd on vhost_svq_set_svq_kick_fd

2022-08-23 Thread Eugenio Pérez
Because of this, a file descriptor can be unbound twice if
vhost_svq_set_svq_kick_fd is called twice. Since it comes from vhost and
not from SVQ, that file descriptor could be something other than the
guest's vhost notifier.

Likewise, the same can happen if a guest starts and stops the device
multiple times.

Reported-by: Lei Yang 
Fixes: dff4426fa6 ("vhost: Add Shadow VirtQueue kick forwarding capabilities")
Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 hw/virtio/vhost-shadow-virtqueue.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index e4956728dd..82a784d250 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -602,13 +602,13 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, 
int svq_kick_fd)
 event_notifier_set_handler(svq_kick, NULL);
 }
 
+event_notifier_init_fd(svq_kick, svq_kick_fd);
 /*
  * event_notifier_set_handler already checks for guest's notifications if
  * they arrive at the new file descriptor in the switch, so there is no
  * need to explicitly check for them.
  */
 if (poll_start) {
-event_notifier_init_fd(svq_kick, svq_kick_fd);
 event_notifier_set(svq_kick);
 event_notifier_set_handler(svq_kick, vhost_handle_guest_kick_notifier);
 }
@@ -655,7 +655,7 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, 
VirtIODevice *vdev,
  */
 void vhost_svq_stop(VhostShadowVirtqueue *svq)
 {
-event_notifier_set_handler(&svq->svq_kick, NULL);
+vhost_svq_set_svq_kick_fd(svq, VHOST_FILE_UNBIND);
 g_autofree VirtQueueElement *next_avail_elem = NULL;
 
 if (!svq->vq) {
-- 
2.31.1




[PATCH v2 3/7] util: accept iova_tree_remove_parameter by value

2022-08-23 Thread Eugenio Pérez
It's convenient to call iova_tree_remove on a map returned from
iova_tree_find or iova_tree_find_iova. With the current code this is not
possible, since we would free the map and then try to search for it
again.

Fix it by accepting the map by value, forcing a copy of the
argument. Not applying a Fixes tag, since there is no such use at the
moment.

Signed-off-by: Eugenio Pérez 
---
v2: Accept map parameter by value instead of make a copy
---
 hw/virtio/vhost-iova-tree.h | 2 +-
 include/qemu/iova-tree.h| 2 +-
 hw/i386/intel_iommu.c   | 6 +++---
 hw/virtio/vhost-iova-tree.c | 2 +-
 hw/virtio/vhost-vdpa.c  | 6 +++---
 net/vhost-vdpa.c| 4 ++--
 util/iova-tree.c| 4 ++--
 7 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
index 6a4f24e0f9..4adfd79ff0 100644
--- a/hw/virtio/vhost-iova-tree.h
+++ b/hw/virtio/vhost-iova-tree.h
@@ -22,6 +22,6 @@ G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, 
vhost_iova_tree_delete);
 const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
 const DMAMap *map);
 int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
-void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map);
+void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
 
 #endif
diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 16bbfdf5f8..8528e5c98f 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -73,7 +73,7 @@ int iova_tree_insert(IOVATree *tree, const DMAMap *map);
  * all the mappings that are included in the provided range will be
  * removed from the tree.  Here map->translated_addr is meaningless.
  */
-void iova_tree_remove(IOVATree *tree, const DMAMap *map);
+void iova_tree_remove(IOVATree *tree, DMAMap map);
 
 /**
  * iova_tree_find:
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2162394e08..05d53a1aa9 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1187,7 +1187,7 @@ static int vtd_page_walk_one(IOMMUTLBEvent *event, 
vtd_page_walk_info *info)
 return ret;
 }
 /* Drop any existing mapping */
-iova_tree_remove(as->iova_tree, &target);
+iova_tree_remove(as->iova_tree, target);
 /* Recover the correct type */
 event->type = IOMMU_NOTIFIER_MAP;
 entry->perm = cache_perm;
@@ -1200,7 +1200,7 @@ static int vtd_page_walk_one(IOMMUTLBEvent *event, 
vtd_page_walk_info *info)
 trace_vtd_page_walk_one_skip_unmap(entry->iova, entry->addr_mask);
 return 0;
 }
-iova_tree_remove(as->iova_tree, &target);
+iova_tree_remove(as->iova_tree, target);
 }
 
 trace_vtd_page_walk_one(info->domain_id, entry->iova,
@@ -3563,7 +3563,7 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, 
IOMMUNotifier *n)
 
 map.iova = n->start;
 map.size = size;
-iova_tree_remove(as->iova_tree, &map);
+iova_tree_remove(as->iova_tree, map);
 }
 
 static void vtd_address_space_unmap_all(IntelIOMMUState *s)
diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
index 67bf6d57ab..3d03395a77 100644
--- a/hw/virtio/vhost-iova-tree.c
+++ b/hw/virtio/vhost-iova-tree.c
@@ -104,7 +104,7 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap 
*map)
  * @iova_tree: The vhost iova tree
  * @map: The map to remove
  */
-void vhost_iova_tree_remove(VhostIOVATree *iova_tree, const DMAMap *map)
+void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
 {
 iova_tree_remove(iova_tree->iova_taddr_map, map);
 }
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 7e28d2f674..87e0ad393f 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -240,7 +240,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
 
 fail_map:
 if (v->shadow_vqs_enabled) {
-vhost_iova_tree_remove(v->iova_tree, &mem_region);
+vhost_iova_tree_remove(v->iova_tree, mem_region);
 }
 
 fail:
@@ -300,7 +300,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener 
*listener,
 return;
 }
 iova = result->iova;
-vhost_iova_tree_remove(v->iova_tree, result);
+vhost_iova_tree_remove(v->iova_tree, *result);
 }
 vhost_vdpa_iotlb_batch_begin_once(v);
 ret = vhost_vdpa_dma_unmap(v, iova, int128_get64(llsize));
@@ -944,7 +944,7 @@ static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, 
DMAMap *needle,
needle->perm == IOMMU_RO);
 if (unlikely(r != 0)) {
 error_setg_errno(errp, -r, "Cannot map region to device");
-vhost_iova_tree_remove(v->iova_tree, needle);
+vhost_iova_tree_remove(v->iova_tree, *needle);
 }
 
 return r == 0;
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 303447a68e..a49e7e649d 

[PATCH v2 7/7] vdpa: Use ring hwaddr at vhost_vdpa_svq_unmap_ring

2022-08-23 Thread Eugenio Pérez
Reduce code duplication.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 hw/virtio/vhost-vdpa.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index e208dd000e..23ae5ef48b 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -884,10 +884,12 @@ static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
 /**
  * Unmap a SVQ area in the device
  */
-static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
-  const DMAMap *needle)
+static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr addr)
 {
-const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, needle);
+const DMAMap needle = {
+.translated_addr = addr,
+};
+const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, &needle);
 hwaddr size;
 int r;
 
@@ -909,17 +911,14 @@ static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa 
*v,
 static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
const VhostShadowVirtqueue *svq)
 {
-DMAMap needle = {};
 struct vhost_vdpa *v = dev->opaque;
 struct vhost_vring_addr svq_addr;
 
vhost_svq_get_vring_addr(svq, &svq_addr);
 
-needle.translated_addr = svq_addr.desc_user_addr;
-vhost_vdpa_svq_unmap_ring(v, &needle);
+vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr);
 
-needle.translated_addr = svq_addr.used_user_addr;
-vhost_vdpa_svq_unmap_ring(v, &needle);
+vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr);
 }
 
 /**
@@ -997,7 +996,7 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
 if (unlikely(!ok)) {
 error_prepend(errp, "Cannot create vq device region: ");
-vhost_vdpa_svq_unmap_ring(v, &driver_region);
+vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
 }
 addr->used_user_addr = device_region.iova;
 
-- 
2.31.1




[PATCH v2 2/7] vdpa: do not save failed dma maps in SVQ iova tree

2022-08-23 Thread Eugenio Pérez
If a map fails for whatever reason, it must not be saved in the tree.
Otherwise, qemu will try to unmap it in cleanup, leading to more errors.

Fixes: 34e3c94eda ("vdpa: Add custom IOTLB translations to SVQ")
Reported-by: Lei Yang 
Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 hw/virtio/vhost-vdpa.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 983d3697b0..7e28d2f674 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -176,6 +176,7 @@ static void vhost_vdpa_listener_commit(MemoryListener 
*listener)
 static void vhost_vdpa_listener_region_add(MemoryListener *listener,
MemoryRegionSection *section)
 {
+DMAMap mem_region = {};
 struct vhost_vdpa *v = container_of(listener, struct vhost_vdpa, listener);
 hwaddr iova;
 Int128 llend, llsize;
@@ -212,13 +213,13 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
 
 llsize = int128_sub(llend, int128_make64(iova));
 if (v->shadow_vqs_enabled) {
-DMAMap mem_region = {
-.translated_addr = (hwaddr)(uintptr_t)vaddr,
-.size = int128_get64(llsize) - 1,
-.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
-};
+int r;
 
-int r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
+mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
+mem_region.size = int128_get64(llsize) - 1,
+mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
+
+r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
 if (unlikely(r != IOVA_OK)) {
 error_report("Can't allocate a mapping (%d)", r);
 goto fail;
@@ -232,11 +233,16 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
  vaddr, section->readonly);
 if (ret) {
 error_report("vhost vdpa map fail!");
-goto fail;
+goto fail_map;
 }
 
 return;
 
+fail_map:
+if (v->shadow_vqs_enabled) {
+vhost_iova_tree_remove(v->iova_tree, &mem_region);
+}
+
 fail:
 /*
  * On the initfn path, store the first error in the container so we
-- 
2.31.1




[PATCH v2 0/7] vDPA shadow virtqueue iova tree fixes.

2022-08-23 Thread Eugenio Pérez
Collection of iova tree fixes detected preparing live migration with real
devices and multiqueue.

These cannot be triggered in simple setups (vdpa_sim_net, no display, no
device reset with different features) but it's possible to trigger them with
real devices or if the kernel fails some step like memory mapping / unmapping.

First two patches are already in the list at [1]. Last one is not a fix by
itself but a straightforward merge of the same code.

[1] https://lists.nongnu.org/archive/html/qemu-devel/2022-08/msg00773.html

v2:
* Accept iova_tree_remove map arg by value
* Add error message on unmap fail

Eugenio Pérez (7):
  vdpa: Skip the maps not in the iova tree
  vdpa: do not save failed dma maps in SVQ iova tree
  util: accept iova_tree_remove_parameter by value
  vdpa: Remove SVQ vring from iova_tree at shutdown
  vdpa: Make SVQ vring unmapping return void
  vhost: Always store new kick fd on vhost_svq_set_svq_kick_fd
  vdpa: Use ring hwaddr at vhost_vdpa_svq_unmap_ring

 hw/virtio/vhost-iova-tree.h|  2 +-
 include/qemu/iova-tree.h   |  2 +-
 hw/i386/intel_iommu.c  |  6 +--
 hw/virtio/vhost-iova-tree.c|  2 +-
 hw/virtio/vhost-shadow-virtqueue.c |  4 +-
 hw/virtio/vhost-vdpa.c | 75 --
 net/vhost-vdpa.c   |  4 +-
 util/iova-tree.c   |  4 +-
 8 files changed, 51 insertions(+), 48 deletions(-)

-- 
2.31.1





[PATCH v2 5/7] vdpa: Make SVQ vring unmapping return void

2022-08-23 Thread Eugenio Pérez
Nothing actually reads the return value, but an error while cleaning some
entries could cause device stop to abort, making a restart impossible.
Better to explicitly ignore the return value.

Reported-by: Lei Yang 
Fixes: 34e3c94eda ("vdpa: Add custom IOTLB translations to SVQ")
Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 hw/virtio/vhost-vdpa.c | 32 ++--
 1 file changed, 10 insertions(+), 22 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index e16e0e222e..e208dd000e 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -884,7 +884,7 @@ static int vhost_vdpa_svq_set_fds(struct vhost_dev *dev,
 /**
  * Unmap a SVQ area in the device
  */
-static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
+static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
   const DMAMap *needle)
 {
 const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, needle);
@@ -893,38 +893,33 @@ static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa 
*v,
 
 if (unlikely(!result)) {
 error_report("Unable to find SVQ address to unmap");
-return false;
+return;
 }
 
 size = ROUND_UP(result->size, qemu_real_host_page_size());
 r = vhost_vdpa_dma_unmap(v, result->iova, size);
 if (unlikely(r < 0)) {
 error_report("Unable to unmap SVQ vring: %s (%d)", g_strerror(-r), -r);
-return false;
+return;
 }
 
 vhost_iova_tree_remove(v->iova_tree, *result);
-return r == 0;
 }
 
-static bool vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
+static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
const VhostShadowVirtqueue *svq)
 {
 DMAMap needle = {};
 struct vhost_vdpa *v = dev->opaque;
 struct vhost_vring_addr svq_addr;
-bool ok;
 
vhost_svq_get_vring_addr(svq, &svq_addr);
 
 needle.translated_addr = svq_addr.desc_user_addr;
-ok = vhost_vdpa_svq_unmap_ring(v, &needle);
-if (unlikely(!ok)) {
-return false;
-}
+vhost_vdpa_svq_unmap_ring(v, &needle);
 
 needle.translated_addr = svq_addr.used_user_addr;
-return vhost_vdpa_svq_unmap_ring(v, &needle);
+vhost_vdpa_svq_unmap_ring(v, &needle);
 }
 
 /**
@@ -1095,26 +1090,22 @@ err:
 return false;
 }
 
-static bool vhost_vdpa_svqs_stop(struct vhost_dev *dev)
+static void vhost_vdpa_svqs_stop(struct vhost_dev *dev)
 {
 struct vhost_vdpa *v = dev->opaque;
 
 if (!v->shadow_vqs) {
-return true;
+return;
 }
 
 for (unsigned i = 0; i < v->shadow_vqs->len; ++i) {
 VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
-bool ok = vhost_vdpa_svq_unmap_rings(dev, svq);
-if (unlikely(!ok)) {
-return false;
-}
+vhost_vdpa_svq_unmap_rings(dev, svq);
 }
 
 if (v->migration_blocker) {
 migrate_del_blocker(v->migration_blocker);
 }
-return true;
 }
 
 static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
@@ -1131,10 +1122,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, 
bool started)
 }
 vhost_vdpa_set_vring_ready(dev);
 } else {
-ok = vhost_vdpa_svqs_stop(dev);
-if (unlikely(!ok)) {
-return -1;
-}
+vhost_vdpa_svqs_stop(dev);
 vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
 }
 
-- 
2.31.1




[PATCH v2 4/7] vdpa: Remove SVQ vring from iova_tree at shutdown

2022-08-23 Thread Eugenio Pérez
Although the device will be reset before usage, the right thing to do is
to clean it up.

Reported-by: Lei Yang 
Fixes: 34e3c94eda ("vdpa: Add custom IOTLB translations to SVQ")
Signed-off-by: Eugenio Pérez 
---
v2:
* Call vhost_iova_tree_remove with the map as value.
* report_error on vhost_vdpa_dma_unmap fail
---
 hw/virtio/vhost-vdpa.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 87e0ad393f..e16e0e222e 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -898,6 +898,12 @@ static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
 
 size = ROUND_UP(result->size, qemu_real_host_page_size());
 r = vhost_vdpa_dma_unmap(v, result->iova, size);
+if (unlikely(r < 0)) {
+error_report("Unable to unmap SVQ vring: %s (%d)", g_strerror(-r), -r);
+return false;
+}
+
+vhost_iova_tree_remove(v->iova_tree, *result);
 return r == 0;
 }
 
-- 
2.31.1




[PATCH v2 1/7] vdpa: Skip the maps not in the iova tree

2022-08-23 Thread Eugenio Pérez
The next patch will skip registering, in the iova tree, DMA maps that the
vdpa device rejects. We need to consider that here, or we cause a SIGSEGV
accessing result.

Reported-by: Lei Yang 
Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 hw/virtio/vhost-vdpa.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3ff9ce3501..983d3697b0 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -289,6 +289,10 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
 };
 
 result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
+if (!result) {
+/* The memory listener map wasn't mapped */
+return;
+}
 iova = result->iova;
 vhost_iova_tree_remove(v->iova_tree, result);
 }
-- 
2.31.1




Re: [PATCH for-7.2 v2 10/20] hw/ppc: set machine->fdt in spapr machine

2022-08-23 Thread Daniel Henrique Barboza




On 8/23/22 05:58, Alexey Kardashevskiy wrote:



On 22/08/2022 20:30, Daniel Henrique Barboza wrote:



On 8/22/22 00:29, Alexey Kardashevskiy wrote:



On 22/08/2022 13:05, David Gibson wrote:

On Fri, Aug 19, 2022 at 06:42:34AM -0300, Daniel Henrique Barboza wrote:



On 8/18/22 23:11, Alexey Kardashevskiy wrote:



On 05/08/2022 19:39, Daniel Henrique Barboza wrote:

The pSeries machine never bothered with the common machine->fdt
attribute. We do all the FDT related work using spapr->fdt_blob.

We're going to introduce HMP commands to read and save the FDT, which
will rely on setting machine->fdt properly to work across all machine
archs/types.



Out of curiosity - why a new HMP command? Isn't QOM'ing this ms::fdt property
enough?


I tried to do the minimal changes needed for the commands to work. ms::fdt is
one of the few MachineState fields that hasn't been QOMified by
machine_class_init() yet. All pre-existing code that uses ms::fdt is using the
pointer directly. Making QOMified use of it would require extra patches
in machine.c to QOMify the property first.

There's also the issue with how each machine is creating the FDT. Most are using
helpers from device_tree.c, some are creating it from scratch, others require
a .dtb file, most of them are not doing a fdt_pack() and so on. To really QOMify
the use of ms::fdt we would need some machine hooks that standardize all that.
I believe it's worth the trouble, but it would be too much to do
right now.


Hmm.. I think this depends on what you mean by "QOM"ify exactly.  If
you're meaning make the full DT representation QOM objects, that you
can look into in detail, then, yes, that's pretty complicated.

I suspect what Alexey was suggesting though, was merely to make
ms::fdt accessible as a single bytestring property on the machine QOM
object.  Effectively it's just "dumpdtb" but as a property get.



Yes, I meant the bytestream, as dtc can easily decompile it into a DTS.



I'm not 100% certain if QOM can safely represent arbitrary bytestrings
as QOM properties, which would need checking.


I am not sure either but rather than adding another command to HMP, I'd explore 
this option first.



I'm not sure what you mean by that. The HMP version of 'dumpdtb' is more
flexible than the current "-machine dumpdtb", an extra machine option that
causes the guest to exit after writing the dtb.


True. Especially with CAS :)


And 'info fdt' is a new command that
makes it easier to inspect specific nodes/props.


btw what is this new command going to do? decompile the tree or save dtb?


At this moment, 'info fdt  [prop]' is using the current ms->fdt bytestream
plus libfdt to output nodes and properties.




I don't see how making ms::fdt retrievable by object_property_get() internally
(remember that ms::fdt is not fully QOMified, so there's no introspection of its
value from the QEMU monitor) would make any of these new HMP commands obsolete.


Well, there are QMP and HMP, and my feeling was that HMP is slowly getting
deprecated or something and QMP is the superior one. So I thought that since this
FDT is a property with no associated action, making it a QOM property would do.


I don't have an opinion about HMP being deprecated and QMP being the preferred
choice. Perhaps someone else reading this thread can tell more about it.



For ages I've been using a python3 script to talk to QMP, as HMP is really quite limited; the only
thing in HMP which is not in QMP is dumping memory ("x", "xp"), and in that case I
wrap HMP into QMP and keep using QMP :)



Apparently we have a QMP enthusiast over here ;)



Daniel







Thanks,


Daniel











Leif's email issues resolved

2022-08-23 Thread Leif Lindholm
Hi everyone,

As some of you might have noticed, my email setup broke a little over
a month ago. At the time I was on leave, so did not notice for a
while. And upon my return, it turned out it had broken in New and
Exciting ways - so it took some time to restore.

This has now been achieved, and I am back on the list and able to
post. (And hopefully a bit more attentive than before my break.)

/
Leif



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-23 Thread Gupta, Pankaj




Actually the current version allows you to delay the allocation to a
later time (e.g. page fault time) if you don't call fallocate() on the
private fd. fallocate() was necessary in previous versions because we
treated existence in the fd as 'private', but in this version we track
private/shared info in KVM, so we don't rely on that fact from memory
backing stores.


Does this also mean reservation of guest physical memory with secure
processor (both for SEV-SNP & TDX) will also happen at page fault time?

Do we plan to keep it this way?


If you are talking about accepting memory by the guest, it is initiated by
the guest and has nothing to do with page fault time vs fallocate()
allocation of host memory. I mean acceptance happens after host memory
allocation, but they are not in lockstep; acceptance can happen much later.


No, I meant reserving guest physical memory range from hypervisor e.g with
RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).


As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way
for userspace to pre-map guest memory.

I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic
vCPU-scoped ioctl() that allows userspace to pre-map guest memory.  Supporting
initializing guest private memory with a source page can be implemented via a
flag.  That also gives KVM line of sight to in-place "conversion", e.g. another
flag could be added to say that the dest is also the source.


Questions to clarify *my* understanding here:

- Do you suggest turning KVM_TDX_INIT_MEM_REGION into a generic ioctl to
  pre-map guest private memory in addition to initializing the payload
  (in-place encryption or just copying a page to guest private memory)?

- Want to clarify "pre-map": are you suggesting using the ioctl
  to avoid the RMP/PAMT registration at guest page fault time, and instead
  pre-map guest private memory, i.e. allocate and do RMP/PAMT
  registration before running the actual guest vCPUs?

Thanks,
Pankaj



The TDX and SNP restrictions would then become addition restrictions on when
initializing with a source is allowed (and VMs that don't have guest private
memory wouldn't allow the flag at all).





