Re: [PATCH v12 02/40] virtio: struct virtio_config_ops add callbacks for queue_reset
On 2022/7/20 11:03, Xuan Zhuo wrote:

Reset can be divided into the following four steps (example):
 1. transport: notify the device to reset the queue
 2. vring: recycle the buffers submitted
 3. vring: reset/resize the vring (may re-alloc)
 4. transport: mmap the vring to the device, and enable the queue

In order to support queue reset, add two callbacks to struct
virtio_config_ops to implement steps 1 and 4.

Signed-off-by: Xuan Zhuo
Acked-by: Jason Wang
---
 include/linux/virtio_config.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index b47c2e7ed0ee..36ec7be1f480 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -78,6 +78,18 @@ struct virtio_shm_region {
  * @set_vq_affinity: set the affinity for a virtqueue (optional).
  * @get_vq_affinity: get the affinity for a virtqueue (optional).
  * @get_shm_region: get a shared memory region based on the index.
+ * @disable_vq_and_reset: reset a queue individually (optional).
+ *	vq: the virtqueue
+ *	Returns 0 on success or error status
+ *	disable_vq_and_reset will guarantee that the callbacks are disabled and
+ *	synchronized.
+ *	Except for the callback, the caller should guarantee that the vring is
+ *	not accessed by any functions of virtqueue.
+ * @enable_vq_after_reset: enable a reset queue
+ *	vq: the virtqueue
+ *	Returns 0 on success or error status
+ *	If disable_vq_and_reset is set, then enable_vq_after_reset must also be
+ *	set.
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -104,6 +116,8 @@ struct virtio_config_ops {
 			    int index);
 	bool (*get_shm_region)(struct virtio_device *vdev,
 			       struct virtio_shm_region *region, u8 id);
+	int (*disable_vq_and_reset)(struct virtqueue *vq);
+	int (*enable_vq_after_reset)(struct virtqueue *vq);
 };

 /* If driver didn't advertise the feature, it will never appear.
*/ ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v12 01/40] virtio: record the maximum queue num supported by the device.
On 2022/7/20 11:03, Xuan Zhuo wrote:

virtio-net can display the maximum (supported by hardware) ring size in
"ethtool -g eth0". When the subsequent patch implements vring reset, it
can judge whether the ring size passed by the driver is legal based on
this.

Signed-off-by: Xuan Zhuo
Acked-by: Jason Wang
---
 arch/um/drivers/virtio_uml.c             | 1 +
 drivers/platform/mellanox/mlxbf-tmfifo.c | 2 ++
 drivers/remoteproc/remoteproc_virtio.c   | 2 ++
 drivers/s390/virtio/virtio_ccw.c         | 3 +++
 drivers/virtio/virtio_mmio.c             | 2 ++
 drivers/virtio/virtio_pci_legacy.c       | 2 ++
 drivers/virtio/virtio_pci_modern.c       | 2 ++
 drivers/virtio/virtio_vdpa.c             | 2 ++
 include/linux/virtio.h                   | 2 ++
 9 files changed, 18 insertions(+)

diff --git a/arch/um/drivers/virtio_uml.c b/arch/um/drivers/virtio_uml.c
index 82ff3785bf69..e719af8bdf56 100644
--- a/arch/um/drivers/virtio_uml.c
+++ b/arch/um/drivers/virtio_uml.c
@@ -958,6 +958,7 @@ static struct virtqueue *vu_setup_vq(struct virtio_device *vdev,
 		goto error_create;
 	}
 	vq->priv = info;
+	vq->num_max = num;
 	num = virtqueue_get_vring_size(vq);

 	if (vu_dev->protocol_features &
diff --git a/drivers/platform/mellanox/mlxbf-tmfifo.c b/drivers/platform/mellanox/mlxbf-tmfifo.c
index 38800e86ed8a..1ae3c56b66b0 100644
--- a/drivers/platform/mellanox/mlxbf-tmfifo.c
+++ b/drivers/platform/mellanox/mlxbf-tmfifo.c
@@ -959,6 +959,8 @@ static int mlxbf_tmfifo_virtio_find_vqs(struct virtio_device *vdev,
 			goto error;
 		}

+		vq->num_max = vring->num;
+
 		vqs[i] = vq;
 		vring->vq = vq;
 		vq->priv = vring;
diff --git a/drivers/remoteproc/remoteproc_virtio.c b/drivers/remoteproc/remoteproc_virtio.c
index d43d74733f0a..0f7706e23eb9 100644
--- a/drivers/remoteproc/remoteproc_virtio.c
+++ b/drivers/remoteproc/remoteproc_virtio.c
@@ -125,6 +125,8 @@ static struct virtqueue *rp_find_vq(struct virtio_device *vdev,
 		return ERR_PTR(-ENOMEM);
 	}

+	vq->num_max = num;
+
 	rvring->vq = vq;
 	vq->priv = rvring;
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 161d3b141f0d..6b86d0280d6b 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -530,6 +530,9 @@ static struct virtqueue *virtio_ccw_setup_vq(struct virtio_device *vdev,
 		err = -ENOMEM;
 		goto out_err;
 	}
+
+	vq->num_max = info->num;
+
 	/* it may have been reduced */
 	info->num = virtqueue_get_vring_size(vq);
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 083ff1eb743d..a20d5a6b5819 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -403,6 +403,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned int in
 		goto error_new_virtqueue;
 	}

+	vq->num_max = num;
+
 	/* Activate the queue */
 	writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
 	if (vm_dev->version == 1) {
diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
index a5e5721145c7..2257f1b3d8ae 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -135,6 +135,8 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 	if (!vq)
 		return ERR_PTR(-ENOMEM);

+	vq->num_max = num;
+
 	q_pfn = virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT;
 	if (q_pfn >> 32) {
 		dev_err(&vp_dev->pci_dev->dev,
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 623906b4996c..e7e0b8c850f6 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -218,6 +218,8 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 	if (!vq)
 		return ERR_PTR(-ENOMEM);

+	vq->num_max = num;
+
 	/* activate the queue */
 	vp_modern_set_queue_size(mdev, index, virtqueue_get_vring_size(vq));
 	vp_modern_queue_address(mdev, index, virtqueue_get_desc_addr(vq),
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index c40f7deb6b5a..9670cc79371d 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -183,6 +183,8 @@ virtio_vdpa_setup_vq(struct virtio_device *vdev, unsigned int index,
 		goto error_new_virtqueue;
 	}

+	vq->num_max = max_num;
+
 	/* Setup virtqueue callback */
 	cb.callback = callback ? virtio_vdpa_virtqueue_cb : NULL;
 	cb.private = info;
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index d8fdf170637c..129bde7521e3 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -19,6 +19,7 @@
  * @priv: a pointer for
Re: VIRTIO_NET_F_MTU not negotiated
On Wed, Jul 20, 2022 at 08:17:29AM +, Eli Cohen wrote: > > From: Eugenio Perez Martin > > Sent: Wednesday, July 20, 2022 10:47 AM > > To: Eli Cohen > > Cc: Michael S. Tsirkin ; Jason Wang ; > > virtualization@lists.linux-foundation.org > > Subject: Re: VIRTIO_NET_F_MTU not negotiated > > > > On Wed, Jul 20, 2022 at 8:30 AM Eli Cohen wrote: > > > > > > > > > > > > > -Original Message- > > > > From: Eugenio Perez Martin > > > > Sent: Tuesday, July 19, 2022 5:51 PM > > > > To: Eli Cohen > > > > Cc: Michael S. Tsirkin ; Jason Wang > > > > ; virtualization@lists.linux-foundation.org > > > > Subject: Re: VIRTIO_NET_F_MTU not negotiated > > > > > > > > On Tue, Jul 19, 2022 at 4:02 PM Eli Cohen wrote: > > > > > > > > > > > From: Eli Cohen > > > > > > Sent: Tuesday, July 19, 2022 4:53 PM > > > > > > To: Michael S. Tsirkin > > > > > > Cc: Jason Wang ; Eugenio Perez Martin > > > > > > ; virtualization@lists.linux- > > > > foundation.org > > > > > > Subject: RE: VIRTIO_NET_F_MTU not negotiated > > > > > > > > > > > > > > > > > > > > On Tue, Jul 19, 2022 at 01:25:42PM +, Eli Cohen wrote: > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > mlx5_vdpa is offering VIRTIO_NET_F_MTU. However the driver (is > > > > > > > > it qemu > > > > > > > > responsibility?) does not accept and it ends up not negotiated. > > > > > > > > > > > > > > qemu is responsible for passing features to driver. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't see how can the driver refuse to negotiate this. What > > > > > > > > if the hardware > > > > > > > > has a limitation with respect to mtu? > > > > > > > > > > > > > > Then it can fail FEATURES_OK > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I noticed this when I created the device with mtu of 1000. I > > > > > > > > expected the > > > > > > > > netdev at the vm to have mtu of 1000 and any attempt to go > > > > > > > > beyond should fail > > > > > > > > but that's not the case. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Comments? > > > > > > > > > > > > > > > > > > > > > Any chance mtu is too small? > > > > > > > > > > > > > > > > > > > MIN_MTU is 68 bytes and I was trying 1000. I tried also 1300 but > > > > > > same thing. > > > > > > > > > > > > > if (virtio_has_feature(vdev, VIRTIO_NET_F_MTU)) { > > > > > > > int mtu = virtio_cread16(vdev, > > > > > > > offsetof(struct > > > > > > > virtio_net_config, > > > > > > > mtu)); > > > > > > > if (mtu < MIN_MTU) > > > > > > > __virtio_clear_bit(vdev, > > > > > > > VIRTIO_NET_F_MTU); > > > > > > > } > > > > > > > > > > > > > > any chance it's on power or another BE system? > > > > > > > > > > > > > > > > > > > No, it's x86_64. > > > > > > > > > > > > I will test also vdpa device running on the host. > > > > > > > > > > vdpa running on the host (using virtio_vdpa) behaves as expected. > > > > > Is there a quick way to check if qemu fails to pass the information > > > > > to the driver on the guest? > > > > > > > > > > > > > Can you trace with "-trace 'vhost_vdpa_set_features' -trace > > > > 'vhost_vdpa_set_features'"? They're parameters of qemu. > > > > > > I get this: > > > vhost_vdpa_set_features dev: 0x5595ae9751a0 features: 0x3008b1803 > > > > > > VIRTIO_NET_F_MTU is bit 3 and it seems to be off. > > > > > > > Sorry, I put trace on vhost_vdpa *_set_* features twice in my message. > > > > Could you try tracing vhost_vdpa_get_features too? That way we know if > > qemu offers it to the guest. > > > > Sure. > I get these two successive prints right as the VM starts booting: > vhost_vdpa_get_features dev: 0x55c87e4651e0 features: 0x300cb182b > vhost_vdpa_get_features dev: 0x55c87e4627a0 features: 0x300cb182b > > Later on I get this: > vhost_vdpa_set_features dev: 0x55c87e4651e0 features: 0x3008b1803 > > # qemu-system-x86_64 --version > QEMU emulator version 7.0.0 (v7.0.0) > Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers > > > > > Thanks! 
> so it's there but driver does not ack it.

-- 
MST
Re: [PATCH] virtio: Force DMA restricted devices through DMA API
On Wed, Jul 20, 2022 at 08:27:51AM +, Keir Fraser wrote: > On Wed, Jul 20, 2022 at 02:59:53AM -0400, Michael S. Tsirkin wrote: > > On Tue, Jul 19, 2022 at 04:11:50PM +, Keir Fraser wrote: > > > On Tue, Jul 19, 2022 at 08:51:54AM -0700, Christoph Hellwig wrote: > > > > On Tue, Jul 19, 2022 at 03:46:08PM +, Keir Fraser wrote: > > > > > However, if the general idea at least is acceptable, would the > > > > > implementation be acceptable if I add an explicit API for this to the > > > > > DMA subsystem, and hide the detail there? > > > > > > > > I don't think so. The right thing to key off is > > > > VIRTIO_F_ACCESS_PLATFORM, which really should be set in any modern > > > > virtio device after all the problems we had with the lack of it. > > > > > > Ok. Certainly the flag description in virtio spec fits the bill. > > > > Maybe. But balloon really isn't a normal device. Yes the rings kind of > > operate more or less normally. However consider for example free page > > hinting. We stick a page address in ring for purposes of telling host it > > can blow it away at any later time unless we write something into the > > page. Free page reporting is similar. > > Sending gigabytes of memory through swiotlb is at minimum not > > a good idea. > > I don't think any balloon use case needs the page's guest data to be > made available to the host device. What the device side does with > reported guest-physical page addresses is somewhat VMM/Hyp specific, > but I think it's fair to say it will know how to reclaim or track > pages by guest PA, and bouncing reported/hinted page addresses through > a SWIOTLB or IOMMU would not be of any use to any use case I can think > of. > > As far as I can see, the only translation that makes sense at all for > virtio balloon is in ring management. > > > Conversely, is it even okay security wise that host can blow away any > > guest page? Because with balloon GFNs do not go through bounce > > buffering. 
>
> Ok, I think this is fair and I can address that by describing the use
> case more broadly.
>
> The short answer is that there will be more patches forthcoming,
> because the balloon driver will need to tell the hypervisor (EL2 Hyp
> in the ARM PKVM case) that it is willingly relinquishing memory
> pages. So, there will be a patch to add a new HVC to PKVM Hyp, and a
> patch to detect and use the new HVC via a new API that hooks into
> Linux's balloon infrastructure.
>
> So the use case is that virtio_balloon needs to set up its rings and
> descriptors in a shared memory region, as requested via
> dma-restricted-pool and the VIRTIO_F_ACCESS_PLATFORM flag. This is
> required because the host device has no access to regular guest
> memory.
>
> However, in PKVM, page notifications will notify both the (trusted)
> Hypervisor, via hypercall, and the (untrusted) VMM, via virtio. Guest
> physical addresses are fine here. The VMM understands guest PAs
> perfectly well, it's just not normally allowed to access their
> contents or otherwise map or use those pages itself.

OK ... and I wonder whether this extends the balloon device in some
way? Is there or can there be a new feature bit for this functionality?

> > > > > Or a completely different approach would be to revert the patch
> > > > > e41b1355508d which clears VIRTIO_F_ACCESS_PLATFORM in the balloon
> > > > > driver. MST: That's back in your court, as it's your patch!
> > > >
> > > > Which also means this needs to be addressed, but I don't think a
> > > > simple revert is enough.
> > >
> > > Well here are two possible approaches:
> > >
> > > 1. Revert e41b1355508d outright. I'm not even sure what it would mean
> > > for reported pages to go through IOMMU. And VIRTIO_F_ACCESS_PLATFORM
> > > is no longer IOMMU-specific anyway.
> > >
> > > 2. Continue to clear the flag during virtio_balloon negotiation, but
> > > remember that it was offered, and test for that in vring_use_dma_api()
> > > as well as, or instead of, virtio_has_dma_quirk().
> > >
> > > Do either of those appeal?
> >
> > I think the use case needs to be documented better.
>
> I hope the above is helpful in giving some more context.
>
> Perhaps it would make more sense to re-submit this patch as part of
> a larger series that includes the rest of the mechanism needed to
> support virtio_balloon on PKVM?
>
> Thanks,
> Keir

I suspect so, yes.

> >
> > --
> > MST
> >
Re: [RFC PATCH v1 0/3] virtio/vsock: use SO_RCVLOWAT to set POLLIN/POLLRDNORM
On Wed, Jul 20, 2022 at 06:07:47AM +, Arseniy Krasnov wrote: On 19.07.2022 15:58, Stefano Garzarella wrote: On Mon, Jul 18, 2022 at 08:12:52AM +, Arseniy Krasnov wrote: Hello, during my experiments with zerocopy receive, i found, that in some cases, poll() implementation violates POSIX: when socket has non- default SO_RCVLOWAT(e.g. not 1), poll() will always set POLLIN and POLLRDNORM bits in 'revents' even number of bytes available to read on socket is smaller than SO_RCVLOWAT value. In this case,user sees POLLIN flag and then tries to read data(for example using 'read()' call), but read call will be blocked, because SO_RCVLOWAT logic is supported in dequeue loop in af_vsock.c. But the same time, POSIX requires that: "POLLIN Data other than high-priority data may be read without blocking. POLLRDNORM Normal data may be read without blocking." See https://www.open-std.org/jtc1/sc22/open/n4217.pdf, page 293. So, we have, that poll() syscall returns POLLIN, but read call will be blocked. Also in man page socket(7) i found that: "Since Linux 2.6.28, select(2), poll(2), and epoll(7) indicate a socket as readable only if at least SO_RCVLOWAT bytes are available." I checked TCP callback for poll()(net/ipv4/tcp.c, tcp_poll()), it uses SO_RCVLOWAT value to set POLLIN bit, also i've tested TCP with this case for TCP socket, it works as POSIX required. I tried to look at the code and it seems that only TCP complies with it or am I wrong? Yes, i checked AF_UNIX, it also don't care about that. It calls skb_queue_empty() that of course ignores SO_RCVLOWAT. I've added some fixes to af_vsock.c and virtio_transport_common.c, test is also implemented. What do You think guys? Nice, thanks for fixing this and for the test! I left some comments, but I think the series is fine if we will support it in all transports. Ack I'd just like to understand if it's just TCP complying with it or I'm missing some check included in the socket layer that we could reuse. 
Seems sock_poll() which is socket layer entry point for poll() doesn't contain any such checks @David, @Jakub, @Paolo, any advice? Thanks, Stefano PS: moreover, i found one more interesting thing with TCP and poll: TCP receive logic wakes up poll waiter only when number of available bytes > SO_RCVLOWAT. E.g. it prevents "spurious" wake ups, when poll will be woken up because new data arrived, but POLLIN to allow user dequeue this data won't be set(as amount of data is too small). See tcp_data_ready() in net/ipv4/tcp_input.c Do you mean that we should call sk->sk_data_ready(sk) checking SO_RCVLOWAT? It seems fine, maybe we can add vsock_data_ready() in af_vsock.c that transports should call instead of calling sk->sk_data_ready(sk) directly. Then we can something similar to tcp_data_ready(). Thanks, Stefano ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 5/5] vduse: Support querying information of IOVA regions
On Wed, Jul 20, 2022 at 12:42 PM Xie Yongji wrote: > > This introduces a new ioctl: VDUSE_IOTLB_GET_INFO to > support querying some information of IOVA regions. > > Now it can be used to query whether the IOVA region > supports userspace memory registration. > > Signed-off-by: Xie Yongji > --- > drivers/vdpa/vdpa_user/vduse_dev.c | 39 ++ > include/uapi/linux/vduse.h | 25 +++ > 2 files changed, 64 insertions(+) > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > b/drivers/vdpa/vdpa_user/vduse_dev.c > index eedff0a3885a..cc4a9a700c24 100644 > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > @@ -1228,6 +1228,45 @@ static long vduse_dev_ioctl(struct file *file, > unsigned int cmd, >umem.size); > break; > } > + case VDUSE_IOTLB_GET_INFO: { > + struct vduse_iova_info info; > + struct vhost_iotlb_map *map; > + struct vduse_iova_domain *domain = dev->domain; > + > + ret = -EFAULT; > + if (copy_from_user(&info, argp, sizeof(info))) > + break; > + > + ret = -EINVAL; > + if (info.start > info.last) > + break; > + > + if (!is_mem_zero((const char *)info.reserved, > +sizeof(info.reserved))) > + break; > + > + spin_lock(&domain->iotlb_lock); > + map = vhost_iotlb_itree_first(domain->iotlb, > + info.start, info.last); > + if (map) { > + info.start = map->start; > + info.last = map->last; > + info.capability = 0; > + if (domain->bounce_map && map->start >= 0 && > + map->last < domain->bounce_size) We don't check map->start and map->last in GET_FD, any reason for those checks here? > + info.capability |= > VDUSE_IOVA_CAP_UMEM_SUPPORT; Not a native speaker, but I think CAP is duplicated with SUPPORT. Something like VDUSE_IOVA_CAP_UMEM might be better. 
> + } > + spin_unlock(&domain->iotlb_lock); > + if (!map) > + break; > + > + ret = -EFAULT; > + if (copy_to_user(argp, &info, sizeof(info))) > + break; > + > + ret = 0; > + break; > + } > default: > ret = -ENOIOCTLCMD; > break; > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h > index 9885e0571f09..720fb0b6d8ff 100644 > --- a/include/uapi/linux/vduse.h > +++ b/include/uapi/linux/vduse.h > @@ -233,6 +233,31 @@ struct vduse_iova_umem { > /* De-register the userspace memory. Caller should set iova and size field. > */ > #define VDUSE_IOTLB_DEREG_UMEM _IOW(VDUSE_BASE, 0x19, struct vduse_iova_umem) > > +/** > + * struct vduse_iova_info - information of one IOVA region > + * @start: start of the IOVA region > + * @last: last of the IOVA region > + * @capability: capability of the IOVA regsion > + * @reserved: for future use, needs to be initialized to zero > + * > + * Structure used by VDUSE_IOTLB_GET_INFO ioctl to get information of > + * one IOVA region. > + */ > +struct vduse_iova_info { > + __u64 start; > + __u64 last; > +#define VDUSE_IOVA_CAP_UMEM_SUPPORT (1 << 0) > + __u64 capability; > + __u64 reserved[3]; > +}; > + > +/* > + * Find the first IOVA region that overlaps with the range [start, last] > + * and return some information on it. Caller should set start and last > fields. > + */ > +#define VDUSE_IOTLB_GET_INFO _IOW(VDUSE_BASE, 0x1a, struct vduse_iova_info) This should be _IOWR() ? Thanks > + > + > /* The control messages definition for read(2)/write(2) on /dev/vduse/$NAME > */ > > /** > -- > 2.20.1 > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 4/5] vduse: Support registering userspace memory for IOVA regions
On Wed, Jul 20, 2022 at 12:42 PM Xie Yongji wrote: > > Introduce two ioctls: VDUSE_IOTLB_REG_UMEM and > VDUSE_IOTLB_DEREG_UMEM to support registering > and de-registering userspace memory for IOVA > regions. > > Now it only supports registering userspace memory > for bounce buffer region in virtio-vdpa case. > > Signed-off-by: Xie Yongji Acked-by: Jason Wang > --- > drivers/vdpa/vdpa_user/vduse_dev.c | 141 + > include/uapi/linux/vduse.h | 23 + > 2 files changed, 164 insertions(+) > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > b/drivers/vdpa/vdpa_user/vduse_dev.c > index 3bc27de58f46..eedff0a3885a 100644 > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > @@ -21,6 +21,8 @@ > #include > #include > #include > +#include > +#include > #include > #include > #include > @@ -64,6 +66,13 @@ struct vduse_vdpa { > struct vduse_dev *dev; > }; > > +struct vduse_umem { > + unsigned long iova; > + unsigned long npages; > + struct page **pages; > + struct mm_struct *mm; > +}; > + > struct vduse_dev { > struct vduse_vdpa *vdev; > struct device *dev; > @@ -95,6 +104,8 @@ struct vduse_dev { > u8 status; > u32 vq_num; > u32 vq_align; > + struct vduse_umem *umem; > + struct mutex mem_lock; > }; > > struct vduse_dev_msg { > @@ -917,6 +928,102 @@ static int vduse_dev_queue_irq_work(struct vduse_dev > *dev, > return ret; > } > > +static int vduse_dev_dereg_umem(struct vduse_dev *dev, > + u64 iova, u64 size) > +{ > + int ret; > + > + mutex_lock(&dev->mem_lock); > + ret = -ENOENT; > + if (!dev->umem) > + goto unlock; > + > + ret = -EINVAL; > + if (dev->umem->iova != iova || size != dev->domain->bounce_size) > + goto unlock; > + > + vduse_domain_remove_user_bounce_pages(dev->domain); > + unpin_user_pages_dirty_lock(dev->umem->pages, > + dev->umem->npages, true); > + atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm); > + mmdrop(dev->umem->mm); > + vfree(dev->umem->pages); > + kfree(dev->umem); > + dev->umem = NULL; > + ret = 0; > +unlock: > + 
mutex_unlock(&dev->mem_lock); > + return ret; > +} > + > +static int vduse_dev_reg_umem(struct vduse_dev *dev, > + u64 iova, u64 uaddr, u64 size) > +{ > + struct page **page_list = NULL; > + struct vduse_umem *umem = NULL; > + long pinned = 0; > + unsigned long npages, lock_limit; > + int ret; > + > + if (!dev->domain->bounce_map || > + size != dev->domain->bounce_size || > + iova != 0 || uaddr & ~PAGE_MASK) > + return -EINVAL; > + > + mutex_lock(&dev->mem_lock); > + ret = -EEXIST; > + if (dev->umem) > + goto unlock; > + > + ret = -ENOMEM; > + npages = size >> PAGE_SHIFT; > + page_list = __vmalloc(array_size(npages, sizeof(struct page *)), > + GFP_KERNEL_ACCOUNT); > + umem = kzalloc(sizeof(*umem), GFP_KERNEL); > + if (!page_list || !umem) > + goto unlock; > + > + mmap_read_lock(current->mm); > + > + lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK)); > + if (npages + atomic64_read(¤t->mm->pinned_vm) > lock_limit) > + goto out; > + > + pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE, > + page_list, NULL); > + if (pinned != npages) { > + ret = pinned < 0 ? 
pinned : -ENOMEM; > + goto out; > + } > + > + ret = vduse_domain_add_user_bounce_pages(dev->domain, > +page_list, pinned); > + if (ret) > + goto out; > + > + atomic64_add(npages, ¤t->mm->pinned_vm); > + > + umem->pages = page_list; > + umem->npages = pinned; > + umem->iova = iova; > + umem->mm = current->mm; > + mmgrab(current->mm); > + > + dev->umem = umem; > +out: > + if (ret && pinned > 0) > + unpin_user_pages(page_list, pinned); > + > + mmap_read_unlock(current->mm); > +unlock: > + if (ret) { > + vfree(page_list); > + kfree(umem); > + } > + mutex_unlock(&dev->mem_lock); > + return ret; > +} > + > static long vduse_dev_ioctl(struct file *file, unsigned int cmd, > unsigned long arg) > { > @@ -1089,6 +1196,38 @@ static long vduse_dev_ioctl(struct file *file, > unsigned int cmd, > ret = vduse_dev_queue_irq_work(dev, &dev->vqs[index].inject); > break; > } > + case VDUSE_IOTLB_REG_UMEM: { > + struct vduse_iova_umem umem; > + > + ret = -EFAULT;
Re: [RFC PATCH v1 3/3] vsock_test: POLLIN + SO_RCVLOWAT test.
On Wed, Jul 20, 2022 at 05:46:01AM +, Arseniy Krasnov wrote: On 19.07.2022 15:52, Stefano Garzarella wrote: On Mon, Jul 18, 2022 at 08:19:06AM +, Arseniy Krasnov wrote: This adds test to check, that when poll() returns POLLIN and POLLRDNORM bits, next read call won't block. Signed-off-by: Arseniy Krasnov --- tools/testing/vsock/vsock_test.c | 90 1 file changed, 90 insertions(+) diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c index dc577461afc2..8e394443eaf6 100644 --- a/tools/testing/vsock/vsock_test.c +++ b/tools/testing/vsock/vsock_test.c @@ -18,6 +18,7 @@ #include #include #include +#include #include "timeout.h" #include "control.h" @@ -596,6 +597,90 @@ static void test_seqpacket_invalid_rec_buffer_server(const struct test_opts *opt close(fd); } +static void test_stream_poll_rcvlowat_server(const struct test_opts *opts) +{ +#define RCVLOWAT_BUF_SIZE 128 + int fd; + int i; + + fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL); + if (fd < 0) { + perror("accept"); + exit(EXIT_FAILURE); + } + + /* Send 1 byte. */ + send_byte(fd, 1, 0); + + control_writeln("SRVSENT"); + + /* Just empirically delay value. */ + sleep(4); Why we need this sleep()? Purpose of sleep() is to move client in state, when it has 1 byte of rx data and poll() won't wake. For example: client:server: waits for "SRVSENT" send 1 byte send "SRVSENT" poll() sleep ... poll sleeps ... send rest of data poll wake up I think, without sleep there is chance, that client enters poll() when whole data from server is already received, thus test will be useless(it just tests Right, I see (maybe add a comment in the test). poll()). May be i can remove "SRVSENT" as sleep is enough. I think it's fine. An alternative could be to use the `timeout` of poll(): client:server: waits for "SRVSENT" send 1 byte send "SRVSENT" poll(, timeout = 1 * 1000) wait for "CLNSENT" poll should return 0 send "CLNSENT" poll(, timeout = 10 * 1000) ... poll sleeps ... 
send rest of data poll wake up I don't have a strong opinion, also your version seems fine, just an alternative ;-) Maybe in your version you can add a 10 sec timeout to poll, to avoid that the test stuck for some reason (failing if the timeout is reached). Thanks, Stefano ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v1 4/6] dma-buf: Acquire wait-wound context on attachment
Am 19.07.22 um 22:05 schrieb Dmitry Osipenko: On 7/15/22 09:59, Dmitry Osipenko wrote: On 7/15/22 09:50, Christian König wrote: Am 15.07.22 um 02:52 schrieb Dmitry Osipenko: Intel i915 GPU driver uses wait-wound mutex to lock multiple GEMs on the attachment to the i915 dma-buf. In order to let all drivers utilize shared wait-wound context during attachment in a general way, make dma-buf core to acquire the ww context internally for the attachment operation and update i915 driver to use the importer's ww context instead of the internal one. From now on all dma-buf exporters shall use the importer's ww context for the attachment operation. Signed-off-by: Dmitry Osipenko --- drivers/dma-buf/dma-buf.c | 8 +- drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c | 2 +- .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 2 +- drivers/gpu/drm/i915/gem/i915_gem_object.h | 6 ++--- drivers/gpu/drm/i915/i915_gem_evict.c | 2 +- drivers/gpu/drm/i915/i915_gem_ww.c | 26 +++ drivers/gpu/drm/i915/i915_gem_ww.h | 15 +-- 7 files changed, 47 insertions(+), 14 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 0ee588276534..37545ecb845a 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -807,6 +807,8 @@ static struct sg_table * __map_dma_buf(struct dma_buf_attachment *attach, * Optionally this calls &dma_buf_ops.attach to allow device-specific attach * functionality. * + * Exporters shall use ww_ctx acquired by this function. 
+ * * Returns: * * A pointer to newly created &dma_buf_attachment on success, or a negative @@ -822,6 +824,7 @@ dma_buf_dynamic_attach_unlocked(struct dma_buf *dmabuf, struct device *dev, void *importer_priv) { struct dma_buf_attachment *attach; + struct ww_acquire_ctx ww_ctx; int ret; if (WARN_ON(!dmabuf || !dev)) @@ -841,7 +844,8 @@ dma_buf_dynamic_attach_unlocked(struct dma_buf *dmabuf, struct device *dev, attach->importer_ops = importer_ops; attach->importer_priv = importer_priv; - dma_resv_lock(dmabuf->resv, NULL); + ww_acquire_init(&ww_ctx, &reservation_ww_class); + dma_resv_lock(dmabuf->resv, &ww_ctx); That won't work like this. The core property of a WW context is that you need to unwind all the locks and re-quire them with the contended one first. When you statically lock the imported one here you can't do that any more. You're right. I felt that something is missing here, but couldn't notice. I'll think more about this and enable CONFIG_DEBUG_WW_MUTEX_SLOWPATH. Thank you! Christian, do you think we could make an excuse for the attach() callback and make the exporter responsible for taking the resv lock? It will be inconsistent with the rest of the callbacks, where importer takes the lock, but it will be the simplest and least invasive solution. It's very messy to do a cross-driver ww locking, I don't think it's the right approach. So to summarize the following calls will require that the caller hold the resv lock: 1. dma_buf_pin()/dma_buf_unpin() 2. dma_buf_map_attachment()/dma_buf_unmap_attachment() 3. dma_buf_vmap()/dma_buf_vunmap() 4. dma_buf_move_notify() The following calls require that caller does not held the resv lock: 1. dma_buf_attach()/dma_buf_dynamic_attach()/dma_buf_detach() 2. dma_buf_export()/dma_buf_fd() 3. dma_buf_get()/dma_buf_put() 4. dma_buf_begin_cpu_access()/dma_buf_end_cpu_access() If that's correct than that would work for me as well, but we should probably document this. 
Or let me ask the other way around: What calls exactly do you need to change to solve your original issue? That was vmap/vunmap, wasn't it? If yes then let's concentrate on those for the moment. Regards, Christian. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
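Under the convention summarized above, the caller-side picture looks like this (non-runnable kernel-style pseudocode; error handling elided, and the exact rules are still under discussion in this thread):

```
/* attach/detach: called WITHOUT the resv lock held;
 * the core takes it internally */
attach = dma_buf_dynamic_attach(dmabuf, dev, &importer_ops, priv);

/* pin/unpin, map/unmap, vmap/vunmap, move_notify:
 * the CALLER holds the resv lock */
dma_resv_lock(dmabuf->resv, NULL);
sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
/* ... use the mapping ... */
dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
dma_resv_unlock(dmabuf->resv);

/* detach: again without the lock held */
dma_buf_detach(dmabuf, attach);
```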
Re: [RFC PATCH v1 2/3] virtio/vsock: use 'target' in notify_poll_in callback
On Wed, Jul 20, 2022 at 05:38:03AM +, Arseniy Krasnov wrote:

On 19.07.2022 15:48, Stefano Garzarella wrote:

On Mon, Jul 18, 2022 at 08:17:31AM +, Arseniy Krasnov wrote:

This callback controls the setting of the POLLIN and POLLRDNORM output bits of the poll() syscall, but in some cases it is incorrect to set them as soon as the socket has at least 1 byte of available data. Use 'target', which already exists and is equal to sk_rcvlowat in this case.

Signed-off-by: Arseniy Krasnov
---
 net/vmw_vsock/virtio_transport_common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index ec2c2afbf0d0..591908740992 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -634,7 +634,7 @@ virtio_transport_notify_poll_in(struct vsock_sock *vsk,
				       size_t target,
				       bool *data_ready_now)
{
-	if (vsock_stream_has_data(vsk))
+	if (vsock_stream_has_data(vsk) >= target)
		*data_ready_now = true;
	else
		*data_ready_now = false;

Perhaps we can take the opportunity to clean up the code in this way:

	*data_ready_now = vsock_stream_has_data(vsk) >= target;

Ack

Anyway, I think we also need to fix the other transports (vmci and hyperv), what do you think?

For vmci the fix looks clear. For hyperv I need to check more, because it already uses some internal target value.

Yep, I see. Maybe you can pass `target` to hvs_channel_readable() and use it as a parameter of HVS_PKT_LEN(). @Dexuan what do you think?

Thanks,
Stefano
Re: [PATCH] virtio: Force DMA restricted devices through DMA API
On Tue, Jul 19, 2022 at 04:11:50PM +, Keir Fraser wrote:
> On Tue, Jul 19, 2022 at 08:51:54AM -0700, Christoph Hellwig wrote:
> > On Tue, Jul 19, 2022 at 03:46:08PM +, Keir Fraser wrote:
> > > However, if the general idea at least is acceptable, would the
> > > implementation be acceptable if I add an explicit API for this to the
> > > DMA subsystem, and hide the detail there?
> >
> > I don't think so. The right thing to key off is
> > VIRTIO_F_ACCESS_PLATFORM, which really should be set in any modern
> > virtio device after all the problems we had with the lack of it.
>
> Ok. Certainly the flag description in the virtio spec fits the bill.

Maybe. But balloon really isn't a normal device. Yes, the rings kind of operate more or less normally. However, consider for example free page hinting: we stick a page address in the ring for the purpose of telling the host it can blow the page away at any later time unless we write something into it. Free page reporting is similar. Sending gigabytes of memory through swiotlb is at minimum not a good idea.

Conversely, is it even okay security-wise that the host can blow away any guest page? Because with balloon, GFNs do not go through bounce buffering.

> > > Or a completely different approach would be to revert the patch
> > > e41b1355508d which clears VIRTIO_F_ACCESS_PLATFORM in the balloon
> > > driver. MST: That's back in your court, as it's your patch!
> >
> > Which also means this needs to be addressed, but I don't think a
> > simple revert is enough.
>
> Well here are two possible approaches:
>
> 1. Revert e41b1355508d outright. I'm not even sure what it would mean
> for reported pages to go through IOMMU. And VIRTIO_F_ACCESS_PLATFORM
> is no longer IOMMU-specific anyway.
>
> 2. Continue to clear the flag during virtio_balloon negotiation, but
> remember that it was offered, and test for that in vring_use_dma_api()
> as well as, or instead of, virtio_has_dma_quirk().
>
> Do either of those appeal?
I think the use case needs to be documented better.

--
MST