Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address

2019-09-09 Thread Jason Wang



On 2019/9/9 12:45 PM, Michael S. Tsirkin wrote:

Since idx can be speculated, I guess we need array_index_nospec here?

So we have

ACQUIRE(mmu_lock)

get idx

RELEASE(mmu_lock)

ACQUIRE(mmu_lock)

read array[idx]

RELEASE(mmu_lock)

Then I think idx can't be speculated, considering we've passed RELEASE +
ACQUIRE?

I don't think memory barriers have anything to do with speculation,
they are architectural.



Oh right. Let me add array_index_nospec() in next version.
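
For reference, a minimal sketch of the intended pattern
(vhost_mark_map_dirty() below is a hypothetical helper used only for
illustration; the real call sites in the next version may differ):

#include <linux/nospec.h>

/* Clamp the userspace-derived index with array_index_nospec() before it
 * is used to index vq->maps[], so the CPU cannot speculate past the
 * bounds check even though the access sits between the ACQUIRE/RELEASE
 * of mmu_lock.
 */
static void vhost_mark_map_dirty(struct vhost_virtqueue *vq, int idx)
{
        spin_lock(&vq->mmu_lock);
        idx = array_index_nospec(idx, VHOST_NUM_ADDRS);
        if (vq->maps[idx])
                vhost_set_map_dirty(vq, vq->maps[idx], idx);
        spin_unlock(&vq->mmu_lock);
}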

Thanks



Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address

2019-09-08 Thread Michael S. Tsirkin
On Mon, Sep 09, 2019 at 10:18:57AM +0800, Jason Wang wrote:
> 
> > On 2019/9/8 7:05 PM, Michael S. Tsirkin wrote:
> > On Thu, Sep 05, 2019 at 08:27:36PM +0800, Jason Wang wrote:
> > > This is a rework on the commit 7f466032dc9e ("vhost: access vq
> > > metadata through kernel virtual address").
> > > 
> > > It was noticed that the copy_to/from_user() friends that were used to
> > > access virtqueue metadata tend to be very expensive for a dataplane
> > > implementation like vhost since they involve lots of software checks,
> > > speculation barriers,
> > So if we drop the speculation barrier,
> > there's a problem here in that the access will now be speculated.
> > This effectively disables the defence-in-depth effect of commit
> > b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd
> > ("x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec").
> > 
> > 
> > So now we need to sprinkle array_index_nospec or barrier_nospec over the
> > code whenever we use an index we got from userspace.
> > See below for some examples.
> > 
> > 
> > > hardware feature toggling (e.g. SMAP). The
> > > extra cost will be more obvious when transferring small packets since
> > > the time spent on metadata access becomes more significant.
> > > 
> > > This patch tries to eliminate those overheads by accessing the
> > > metadata through a direct mapping of those pages. Invalidation
> > > callbacks are implemented for cooperation with general VM management
> > > (swap, KSM, THP or NUMA balancing). We will try to get the direct
> > > mapping of vq metadata before each round of packet processing if it
> > > doesn't exist. If we fail, we simply fall back to the
> > > copy_to/from_user() friends.
> > > 
> > > Invalidation, direct mapping access and setup are synchronized
> > > through a spinlock. This takes a step back from the original commit
> > > 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > address"), which tried to use RCU, which is suspicious and hard to
> > > review. This won't perform as well as RCU because of the atomic;
> > > this could be addressed by future optimization.
> > > 
> > > This method may not work for highmem pages, which require a
> > > temporary mapping, so we just fall back to the normal
> > > copy_to/from_user(). It may also not work for archs with virtually
> > > tagged caches, since extra cache flushing would be needed to
> > > eliminate the alias; that would result in complex logic and bad
> > > performance. For those archs, this patch simply goes with the
> > > copy_to/from_user() friends. This is done by ruling out the kernel
> > > mapping code through ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
> > > 
> > > Note that this is only done when device IOTLB is not enabled. We
> > > could use a similar method to optimize the IOTLB in the future.
> > > 
> > > Tests show at most about a 22% improvement in TX PPS when using
> > > virtio-user + vhost_net + xdp1 + TAP on a 4.0GHz Kaby Lake.
> > > 
> > >         SMAP on | SMAP off
> > > Before: 4.9Mpps | 6.9Mpps
> > > After:  6.0Mpps | 7.5Mpps
> > > 
> > > On an older Sandy Bridge CPU without SMAP support, TX PPS doesn't see
> > > any difference.
> > Why isn't Kaby Lake with SMAP off the same as Sandy Bridge?
> 
> 
> I don't know, I guess it was because the atomic is l
> 
> 
> > 
> > 
> > > Cc: Andrea Arcangeli 
> > > Cc: James Bottomley 
> > > Cc: Christoph Hellwig 
> > > Cc: David Miller 
> > > Cc: Jerome Glisse 
> > > Cc: Jason Gunthorpe 
> > > Cc: linux...@kvack.org
> > > Cc: linux-arm-ker...@lists.infradead.org
> > > Cc: linux-par...@vger.kernel.org
> > > Signed-off-by: Jason Wang 
> > > Signed-off-by: Michael S. Tsirkin 
> > > ---
> > >   drivers/vhost/vhost.c | 551 +-
> > >   drivers/vhost/vhost.h |  41 
> > >   2 files changed, 589 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 791562e03fe0..f98155f28f02 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
> > >   __vhost_vq_meta_reset(d->vqs[i]);
> > >   }
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +static void vhost_map_unprefetch(struct vhost_map *map)
> > > +{
> > > + kfree(map->pages);
> > > + kfree(map);
> > > +}
> > > +
> > > +static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
> > > + struct vhost_map *map, int index)
> > > +{
> > > + struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > + int i;
> > > +
> > > + if (uaddr->write) {
> > > + for (i = 0; i < map->npages; i++)
> > > + set_page_dirty(map->pages[i]);
> > > + }
> > > +}
> > > +
> > > +static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
> > > +{
> > > + struct vhost_map *map[VHOST_NUM_ADDRS];
> > > + int i;
> > > +
> > > + spin_lock(&vq->mmu_lock);
> > > + for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> > > + map[i] = vq->maps[i];
> > > + if (map[i]) {
> > > + vhost_set_map_dirty(vq, map[i], 

Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address

2019-09-08 Thread Jason Wang



On 2019/9/9 10:18 AM, Jason Wang wrote:


On an older Sandy Bridge CPU without SMAP support, TX PPS doesn't see
any difference.

Why isn't Kaby Lake with SMAP off the same as Sandy Bridge?



I don't know, I guess it was because the atomic is l 



Sorry, I meant the atomic costs less on Kaby Lake.

Thanks




Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address

2019-09-08 Thread Jason Wang



On 2019/9/8 7:05 PM, Michael S. Tsirkin wrote:

On Thu, Sep 05, 2019 at 08:27:36PM +0800, Jason Wang wrote:

This is a rework on the commit 7f466032dc9e ("vhost: access vq
metadata through kernel virtual address").

It was noticed that the copy_to/from_user() friends that were used to
access virtqueue metadata tend to be very expensive for a dataplane
implementation like vhost since they involve lots of software checks,
speculation barriers,

So if we drop the speculation barrier,
there's a problem here in that the access will now be speculated.
This effectively disables the defence-in-depth effect of commit
b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd
("x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec").


So now we need to sprinkle array_index_nospec or barrier_nospec over the
code whenever we use an index we got from userspace.
See below for some examples.



hardware feature toggling (e.g. SMAP). The
extra cost will be more obvious when transferring small packets since
the time spent on metadata access becomes more significant.

This patch tries to eliminate those overheads by accessing the
metadata through a direct mapping of those pages. Invalidation
callbacks are implemented for cooperation with general VM management
(swap, KSM, THP or NUMA balancing). We will try to get the direct
mapping of vq metadata before each round of packet processing if it
doesn't exist. If we fail, we simply fall back to the
copy_to/from_user() friends.

Invalidation, direct mapping access and setup are synchronized
through a spinlock. This takes a step back from the original commit
7f466032dc9e ("vhost: access vq metadata through kernel virtual
address"), which tried to use RCU, which is suspicious and hard to
review. This won't perform as well as RCU because of the atomic;
this could be addressed by future optimization.

This method may not work for highmem pages, which require a
temporary mapping, so we just fall back to the normal
copy_to/from_user(). It may also not work for archs with virtually
tagged caches, since extra cache flushing would be needed to
eliminate the alias; that would result in complex logic and bad
performance. For those archs, this patch simply goes with the
copy_to/from_user() friends. This is done by ruling out the kernel
mapping code through ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.

Note that this is only done when device IOTLB is not enabled. We
could use a similar method to optimize the IOTLB in the future.

Tests show at most about a 22% improvement in TX PPS when using
virtio-user + vhost_net + xdp1 + TAP on a 4.0GHz Kaby Lake.

        SMAP on | SMAP off
Before: 4.9Mpps | 6.9Mpps
After:  6.0Mpps | 7.5Mpps

On an older Sandy Bridge CPU without SMAP support, TX PPS doesn't see
any difference.

Why isn't Kaby Lake with SMAP off the same as Sandy Bridge?



I don't know, I guess it was because the atomic is l






Cc: Andrea Arcangeli 
Cc: James Bottomley 
Cc: Christoph Hellwig 
Cc: David Miller 
Cc: Jerome Glisse 
Cc: Jason Gunthorpe 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-par...@vger.kernel.org
Signed-off-by: Jason Wang 
Signed-off-by: Michael S. Tsirkin 
---
  drivers/vhost/vhost.c | 551 +-
  drivers/vhost/vhost.h |  41 
  2 files changed, 589 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 791562e03fe0..f98155f28f02 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
__vhost_vq_meta_reset(d->vqs[i]);
  }
  
+#if VHOST_ARCH_CAN_ACCEL_UACCESS

+static void vhost_map_unprefetch(struct vhost_map *map)
+{
+   kfree(map->pages);
+   kfree(map);
+}
+
+static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
+   struct vhost_map *map, int index)
+{
+   struct vhost_uaddr *uaddr = &vq->uaddrs[index];
+   int i;
+
+   if (uaddr->write) {
+   for (i = 0; i < map->npages; i++)
+   set_page_dirty(map->pages[i]);
+   }
+}
+
+static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
+{
+   struct vhost_map *map[VHOST_NUM_ADDRS];
+   int i;
+
+   spin_lock(&vq->mmu_lock);
+   for (i = 0; i < VHOST_NUM_ADDRS; i++) {
+   map[i] = vq->maps[i];
+   if (map[i]) {
+   vhost_set_map_dirty(vq, map[i], i);
+   vq->maps[i] = NULL;
+   }
+   }
+   spin_unlock(&vq->mmu_lock);
+
+   /* No need for synchronization since we are serialized with
+* memory accessors (e.g vq mutex held).
+*/
+
+   for (i = 0; i < VHOST_NUM_ADDRS; i++)
+   if (map[i])
+   vhost_map_unprefetch(map[i]);
+
+}
+
+static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
+{
+   int i;
+
+   vhost_uninit_vq_maps(vq);
+   for (i = 0; i < VHOST_NUM_ADDRS; i++)
+   vq->uaddrs[i].size = 0;
+}
+
+static bool 

Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address

2019-09-08 Thread Michael S. Tsirkin
On Thu, Sep 05, 2019 at 08:27:36PM +0800, Jason Wang wrote:
> This is a rework on the commit 7f466032dc9e ("vhost: access vq
> metadata through kernel virtual address").
> 
> It was noticed that the copy_to/from_user() friends that were used to
> access virtqueue metadata tend to be very expensive for a dataplane
> implementation like vhost since they involve lots of software checks,
> speculation barriers,

So if we drop the speculation barrier,
there's a problem here in that the access will now be speculated.
This effectively disables the defence-in-depth effect of commit
b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd
("x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec").


So now we need to sprinkle array_index_nospec or barrier_nospec over the
code whenever we use an index we got from userspace.
See below for some examples.


> hardware feature toggling (e.g. SMAP). The
> extra cost will be more obvious when transferring small packets since
> the time spent on metadata access becomes more significant.
> 
> This patch tries to eliminate those overheads by accessing the
> metadata through a direct mapping of those pages. Invalidation
> callbacks are implemented for cooperation with general VM management
> (swap, KSM, THP or NUMA balancing). We will try to get the direct
> mapping of vq metadata before each round of packet processing if it
> doesn't exist. If we fail, we simply fall back to the
> copy_to/from_user() friends.
> 
> Invalidation, direct mapping access and setup are synchronized
> through a spinlock. This takes a step back from the original commit
> 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> address"), which tried to use RCU, which is suspicious and hard to
> review. This won't perform as well as RCU because of the atomic;
> this could be addressed by future optimization.
> 
> This method may not work for highmem pages, which require a
> temporary mapping, so we just fall back to the normal
> copy_to/from_user(). It may also not work for archs with virtually
> tagged caches, since extra cache flushing would be needed to
> eliminate the alias; that would result in complex logic and bad
> performance. For those archs, this patch simply goes with the
> copy_to/from_user() friends. This is done by ruling out the kernel
> mapping code through ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
> 
> Note that this is only done when device IOTLB is not enabled. We
> could use a similar method to optimize the IOTLB in the future.
> 
> Tests show at most about a 22% improvement in TX PPS when using
> virtio-user + vhost_net + xdp1 + TAP on a 4.0GHz Kaby Lake.
> 
>         SMAP on | SMAP off
> Before: 4.9Mpps | 6.9Mpps
> After:  6.0Mpps | 7.5Mpps
> 
> On an older Sandy Bridge CPU without SMAP support, TX PPS doesn't see
> any difference.

Why isn't Kaby Lake with SMAP off the same as Sandy Bridge?


> Cc: Andrea Arcangeli 
> Cc: James Bottomley 
> Cc: Christoph Hellwig 
> Cc: David Miller 
> Cc: Jerome Glisse 
> Cc: Jason Gunthorpe 
> Cc: linux...@kvack.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-par...@vger.kernel.org
> Signed-off-by: Jason Wang 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/vhost/vhost.c | 551 +-
>  drivers/vhost/vhost.h |  41 
>  2 files changed, 589 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 791562e03fe0..f98155f28f02 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
>   __vhost_vq_meta_reset(d->vqs[i]);
>  }
>  
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +static void vhost_map_unprefetch(struct vhost_map *map)
> +{
> + kfree(map->pages);
> + kfree(map);
> +}
> +
> +static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
> + struct vhost_map *map, int index)
> +{
> + struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> + int i;
> +
> + if (uaddr->write) {
> + for (i = 0; i < map->npages; i++)
> + set_page_dirty(map->pages[i]);
> + }
> +}
> +
> +static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
> +{
> + struct vhost_map *map[VHOST_NUM_ADDRS];
> + int i;
> +
> + spin_lock(&vq->mmu_lock);
> + for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> + map[i] = vq->maps[i];
> + if (map[i]) {
> + vhost_set_map_dirty(vq, map[i], i);
> + vq->maps[i] = NULL;
> + }
> + }
> + spin_unlock(&vq->mmu_lock);
> +
> + /* No need for synchronization since we are serialized with
> +  * memory accessors (e.g vq mutex held).
> +  */
> +
> + for (i = 0; i < VHOST_NUM_ADDRS; i++)
> + if (map[i])
> + vhost_map_unprefetch(map[i]);
> +
> +}
> +
> +static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
> +{
> + int i;
> +
> + vhost_uninit_vq_maps(vq);
> + for (i = 0; i < VHOST_NUM_ADDRS; i++)
> + 

Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address

2019-09-06 Thread Jason Wang



On 2019/9/6 11:21 AM, Hillf Danton wrote:

On Thu, 5 Sep 2019 20:27:36 +0800, Jason Wang wrote:

+static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
+   struct vhost_map *map, int index)
+{
+   struct vhost_uaddr *uaddr = &vq->uaddrs[index];
+   int i;
+
+   if (uaddr->write) {
+   for (i = 0; i < map->npages; i++)
+   set_page_dirty(map->pages[i]);
+   }

Not sure about the need to set page dirty under page lock.



Just to make sure I understand the issue. Do you mean there's no need
for set_page_dirty() here? If yes, is there any other function that
already does this?


Thanks



[PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address

2019-09-05 Thread Jason Wang
This is a rework on the commit 7f466032dc9e ("vhost: access vq
metadata through kernel virtual address").

It was noticed that the copy_to/from_user() friends that were used to
access virtqueue metadata tend to be very expensive for a dataplane
implementation like vhost since they involve lots of software checks,
speculation barriers and hardware feature toggling (e.g. SMAP). The
extra cost will be more obvious when transferring small packets since
the time spent on metadata access becomes more significant.

This patch tries to eliminate those overheads by accessing the
metadata through a direct mapping of those pages. Invalidation
callbacks are implemented for cooperation with general VM management
(swap, KSM, THP or NUMA balancing). We will try to get the direct
mapping of vq metadata before each round of packet processing if it
doesn't exist. If we fail, we simply fall back to the
copy_to/from_user() friends.

Invalidation, direct mapping access and setup are synchronized
through a spinlock. This takes a step back from the original commit
7f466032dc9e ("vhost: access vq metadata through kernel virtual
address"), which tried to use RCU, which is suspicious and hard to
review. This won't perform as well as RCU because of the atomic;
this could be addressed by future optimization.
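
To illustrate the fast path and fallback described above, a metadata
read looks roughly like the sketch below (this is a simplified example,
not a hunk from this patch; the helper name and the map->addr field are
assumptions made for illustration):

/* Simplified sketch of a metadata accessor: take mmu_lock, use the
 * kernel mapping if it is still present, otherwise drop the lock and
 * fall back to the usual userspace access.
 */
static int vhost_get_avail_idx_sketch(struct vhost_virtqueue *vq,
                                      __virtio16 *idx)
{
        struct vhost_map *map;
        struct vring_avail *avail;

        spin_lock(&vq->mmu_lock);
        map = vq->maps[VHOST_ADDR_AVAIL];
        if (likely(map)) {
                avail = map->addr;      /* direct kernel virtual address */
                *idx = avail->idx;
                spin_unlock(&vq->mmu_lock);
                return 0;
        }
        spin_unlock(&vq->mmu_lock);

        /* Mapping was invalidated (or never prefetched): fall back. */
        return __get_user(*idx, &vq->avail->idx);
}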

This method may not work for highmem pages, which require a
temporary mapping, so we just fall back to the normal
copy_to/from_user(). It may also not work for archs with virtually
tagged caches, since extra cache flushing would be needed to
eliminate the alias; that would result in complex logic and bad
performance. For those archs, this patch simply goes with the
copy_to/from_user() friends. This is done by ruling out the kernel
mapping code through ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
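
The gating itself is a compile-time check; a sketch of what it can look
like is below (the actual definition lives elsewhere in the series, in
vhost.h, so the exact conditions here are an assumption):

/* Sketch of the compile-time gate: only enable the kernel-VA accessors
 * on archs that do not need extra dcache flushing for kernel mappings
 * and that provide MMU notifiers for invalidation.
 */
#if defined(CONFIG_MMU_NOTIFIER) && ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 0
#define VHOST_ARCH_CAN_ACCEL_UACCESS 1
#else
#define VHOST_ARCH_CAN_ACCEL_UACCESS 0
#endif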

Note that this is only done when device IOTLB is not enabled. We
could use a similar method to optimize the IOTLB in the future.

Tests show at most about a 22% improvement in TX PPS when using
virtio-user + vhost_net + xdp1 + TAP on a 4.0GHz Kaby Lake.

        SMAP on | SMAP off
Before: 4.9Mpps | 6.9Mpps
After:  6.0Mpps | 7.5Mpps

On an older Sandy Bridge CPU without SMAP support, TX PPS doesn't see
any difference.

Cc: Andrea Arcangeli 
Cc: James Bottomley 
Cc: Christoph Hellwig 
Cc: David Miller 
Cc: Jerome Glisse 
Cc: Jason Gunthorpe 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-par...@vger.kernel.org
Signed-off-by: Jason Wang 
Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/vhost.c | 551 +-
 drivers/vhost/vhost.h |  41 
 2 files changed, 589 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 791562e03fe0..f98155f28f02 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
__vhost_vq_meta_reset(d->vqs[i]);
 }
 
+#if VHOST_ARCH_CAN_ACCEL_UACCESS
+static void vhost_map_unprefetch(struct vhost_map *map)
+{
+   kfree(map->pages);
+   kfree(map);
+}
+
+static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
+   struct vhost_map *map, int index)
+{
+   struct vhost_uaddr *uaddr = &vq->uaddrs[index];
+   int i;
+
+   if (uaddr->write) {
+   for (i = 0; i < map->npages; i++)
+   set_page_dirty(map->pages[i]);
+   }
+}
+
+static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
+{
+   struct vhost_map *map[VHOST_NUM_ADDRS];
+   int i;
+
+   spin_lock(&vq->mmu_lock);
+   for (i = 0; i < VHOST_NUM_ADDRS; i++) {
+   map[i] = vq->maps[i];
+   if (map[i]) {
+   vhost_set_map_dirty(vq, map[i], i);
+   vq->maps[i] = NULL;
+   }
+   }
+   spin_unlock(&vq->mmu_lock);
+
+   /* No need for synchronization since we are serialized with
+* memory accessors (e.g vq mutex held).
+*/
+
+   for (i = 0; i < VHOST_NUM_ADDRS; i++)
+   if (map[i])
+   vhost_map_unprefetch(map[i]);
+
+}
+
+static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
+{
+   int i;
+
+   vhost_uninit_vq_maps(vq);
+   for (i = 0; i < VHOST_NUM_ADDRS; i++)
+   vq->uaddrs[i].size = 0;
+}
+
+static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
+unsigned long start,
+unsigned long end)
+{
+   if (unlikely(!uaddr->size))
+   return false;
+
+   return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
+}
+
+static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
+{
+   spin_lock(&vq->mmu_lock);
+}
+
+static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
+{
+   spin_unlock(&vq->mmu_lock);
+}
+
+static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
+int index,
+