Re: [Qemu-devel] [PATCH v7 2/6] virtio-pmem: Add virtio pmem driver

2019-05-10 Thread Pankaj Gupta


> >
> > Hi Dan,
> >
> > Thank you for the review. Please see my reply inline.
> >
> > >
> > > Hi Pankaj,
> > >
> > > Some minor file placement comments below.
> >
> > Sure.
> >
> > >
> > > On Thu, Apr 25, 2019 at 10:02 PM Pankaj Gupta  wrote:
> > > >
> > > > This patch adds virtio-pmem driver for KVM guest.
> > > >
> > > > Guest reads the persistent memory range information from
> > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > > creates a nd_region object with the persistent memory
> > > > range information so that existing 'nvdimm/pmem' driver
> > > > can reserve this into system memory map. This way
> > > > 'virtio-pmem' driver uses existing functionality of pmem
> > > > driver to register persistent memory compatible for DAX
> > > > capable filesystems.
> > > >
> > > > This also provides function to perform guest flush over
> > > > VIRTIO from 'pmem' driver when userspace performs flush
> > > > on DAX memory range.
> > > >
> > > > Signed-off-by: Pankaj Gupta 
> > > > ---
> > > >  drivers/nvdimm/virtio_pmem.c | 114 +
> > > >  drivers/virtio/Kconfig   |  10 +++
> > > >  drivers/virtio/Makefile  |   1 +
> > > >  drivers/virtio/pmem.c| 118 +++
> > > >  include/linux/virtio_pmem.h  |  60 
> > > >  include/uapi/linux/virtio_ids.h  |   1 +
> > > >  include/uapi/linux/virtio_pmem.h |  10 +++
> > > >  7 files changed, 314 insertions(+)
> > > >  create mode 100644 drivers/nvdimm/virtio_pmem.c
> > > >  create mode 100644 drivers/virtio/pmem.c
> > > >  create mode 100644 include/linux/virtio_pmem.h
> > > >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > > >
> > > > diff --git a/drivers/nvdimm/virtio_pmem.c
> > > > b/drivers/nvdimm/virtio_pmem.c
> > > > new file mode 100644
> > > > index ..66b582f751a3
> > > > --- /dev/null
> > > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > > @@ -0,0 +1,114 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * virtio_pmem.c: Virtio pmem Driver
> > > > + *
> > > > + * Discovers persistent memory range information
> > > > + * from host and provides a virtio based flushing
> > > > + * interface.
> > > > + */
> > > > +#include <linux/virtio_pmem.h>
> > > > +#include "nd.h"
> > > > +
> > > > + /* The interrupt handler */
> > > > +void host_ack(struct virtqueue *vq)
> > > > +{
> > > > +   unsigned int len;
> > > > +   unsigned long flags;
> > > > +   struct virtio_pmem_request *req, *req_buf;
> > > > +   struct virtio_pmem *vpmem = vq->vdev->priv;
> > > > +
> > > > +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > +   while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > > > +   req->done = true;
> > > > +   wake_up(&req->host_acked);
> > > > +
> > > > +   if (!list_empty(&vpmem->req_list)) {
> > > > +   req_buf = list_first_entry(&vpmem->req_list,
> > > > +   struct virtio_pmem_request, list);
> > > > +   list_del(&vpmem->req_list);
> > > > +   req_buf->wq_buf_avail = true;
> > > > +   wake_up(&req_buf->wq_buf);
> > > > +   }
> > > > +   }
> > > > +   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(host_ack);
> > > > +
> > > > + /* The request submission function */
> > > > +int virtio_pmem_flush(struct nd_region *nd_region)
> > > > +{
> > > > +   int err;
> > > > +   unsigned long flags;
> > > > +   struct scatterlist *sgs[2], sg, ret;
> > > > +   struct virtio_device *vdev = nd_region->provider_data;
> > > > +   struct virtio_pmem *vpmem = vdev->priv;
> > > > +   struct virtio_pmem_request *req;
> > > > +
> > > > +   might_sleep();
> > > > +   req = kmalloc(sizeof(*req), GFP_KERNEL);
> > > > +   if (!req)
> > > > +   return -ENOMEM;
> > > > +
> > > > +   req->done = req->wq_buf_avail = false;
> > > > +   strcpy(req->name, "FLUSH");
> > > > +   init_waitqueue_head(&req->host_acked);
> > > > +   init_waitqueue_head(&req->wq_buf);
> > > > +   sg_init_one(&sg, req->name, strlen(req->name));
> > > > +   sgs[0] = &sg;
> > > > +   sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > > +   sgs[1] = &ret;
> > > > +
> > > > +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > +   err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req,
> > > > +   GFP_ATOMIC);
> > > > +   if (err) {
> > > > +   dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > > > +
> > > > +   list_add_tail(&vpmem->req_list, &req->list);
> > > > +   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +
> > > > +   /* When host has read buffer, this completes via host_ack */
> > > > +   wait_event(req->wq_buf, req->wq_buf_avail);
> > > > +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > +   }
> > > > +   err = 

Re: [Qemu-devel] [PATCH v8 3/6] libnvdimm: add dax_dev sync flag

2019-05-10 Thread Pankaj Gupta


> > > >
> > > > This patch adds 'DAXDEV_SYNC' flag which is set
> > > > for nd_region doing synchronous flush. This is
> > > > later used to disable MAP_SYNC functionality for
> > > > ext4 & xfs filesystems on devices that don't
> > > > support synchronous flush.
> > > >
> > > > Signed-off-by: Pankaj Gupta 
> > > > ---
> > > >  drivers/dax/bus.c|  2 +-
> > > >  drivers/dax/super.c  | 13 -
> > > >  drivers/md/dm.c  |  3 ++-
> > > >  drivers/nvdimm/pmem.c|  5 -
> > > >  drivers/nvdimm/region_devs.c |  7 +++
> > > >  include/linux/dax.h  |  8 ++--
> > > >  include/linux/libnvdimm.h|  1 +
> > > >  7 files changed, 33 insertions(+), 6 deletions(-)
> > > [..]
> > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > index 043f0761e4a0..ee007b75d9fd 100644
> > > > --- a/drivers/md/dm.c
> > > > +++ b/drivers/md/dm.c
> > > > @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor)
> > > > sprintf(md->disk->disk_name, "dm-%d", minor);
> > > >
> > > > if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
> > > > -   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
> > > > +   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops,
> > > > +DAXDEV_F_SYNC);
> > >
> > > Apologies for not realizing this until now, but this is broken.
> > > Imagine a device-mapper configuration composed of both 'async'
> > > virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified
> > > across all members. I would change this argument to '0' and then
> > > arrange for it to be set at dm_table_supports_dax() time after
> > > validating that all components support synchronous dax.
> >
> > o.k. Need to set 'DAXDEV_F_SYNC' flag after verifying all the target
> > components support synchronous DAX.
> >
> > Just a question: if a device-mapper configuration is composed of both
> > virtio-pmem and pmem devices, do we want to configure device-mapper
> > for async flush?
> 
> If it's composed of both then, yes, it needs to be async flush at the
> device-mapper level. Otherwise MAP_SYNC will succeed and fail to
> trigger fsync on the host file when necessary for the virtio-pmem
> backed portion of the device-mapper device.

o.k. Agree.

Thank you,
Pankaj
> 
> 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v8 3/6] libnvdimm: add dax_dev sync flag

2019-05-10 Thread Dan Williams
On Fri, May 10, 2019 at 5:45 PM Pankaj Gupta  wrote:
>
>
>
> > >
> > > This patch adds 'DAXDEV_SYNC' flag which is set
> > > for nd_region doing synchronous flush. This is
> > > later used to disable MAP_SYNC functionality for
> > > ext4 & xfs filesystems on devices that don't
> > > support synchronous flush.
> > >
> > > Signed-off-by: Pankaj Gupta 
> > > ---
> > >  drivers/dax/bus.c|  2 +-
> > >  drivers/dax/super.c  | 13 -
> > >  drivers/md/dm.c  |  3 ++-
> > >  drivers/nvdimm/pmem.c|  5 -
> > >  drivers/nvdimm/region_devs.c |  7 +++
> > >  include/linux/dax.h  |  8 ++--
> > >  include/linux/libnvdimm.h|  1 +
> > >  7 files changed, 33 insertions(+), 6 deletions(-)
> > [..]
> > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > index 043f0761e4a0..ee007b75d9fd 100644
> > > --- a/drivers/md/dm.c
> > > +++ b/drivers/md/dm.c
> > > @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor)
> > > sprintf(md->disk->disk_name, "dm-%d", minor);
> > >
> > > if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
> > > -   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
> > > +   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops,
> > > +DAXDEV_F_SYNC);
> >
> > Apologies for not realizing this until now, but this is broken.
> > Imagine a device-mapper configuration composed of both 'async'
> > virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified
> > across all members. I would change this argument to '0' and then
> > arrange for it to be set at dm_table_supports_dax() time after
> > validating that all components support synchronous dax.
>
> o.k. Need to set 'DAXDEV_F_SYNC' flag after verifying all the target
> components support synchronous DAX.
>
> Just a question: if a device-mapper configuration is composed of both
> virtio-pmem and pmem devices, do we want to configure device-mapper
> for async flush?

If it's composed of both then, yes, it needs to be async flush at the
device-mapper level. Otherwise MAP_SYNC will succeed and fail to
trigger fsync on the host file when necessary for the virtio-pmem
backed portion of the device-mapper device.


Re: [PATCH v8 3/6] libnvdimm: add dax_dev sync flag

2019-05-10 Thread Pankaj Gupta



> >
> > This patch adds 'DAXDEV_SYNC' flag which is set
> > for nd_region doing synchronous flush. This is
> > later used to disable MAP_SYNC functionality for
> > ext4 & xfs filesystems on devices that don't
> > support synchronous flush.
> >
> > Signed-off-by: Pankaj Gupta 
> > ---
> >  drivers/dax/bus.c|  2 +-
> >  drivers/dax/super.c  | 13 -
> >  drivers/md/dm.c  |  3 ++-
> >  drivers/nvdimm/pmem.c|  5 -
> >  drivers/nvdimm/region_devs.c |  7 +++
> >  include/linux/dax.h  |  8 ++--
> >  include/linux/libnvdimm.h|  1 +
> >  7 files changed, 33 insertions(+), 6 deletions(-)
> [..]
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 043f0761e4a0..ee007b75d9fd 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor)
> > sprintf(md->disk->disk_name, "dm-%d", minor);
> >
> > if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
> > -   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
> > +   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops,
> > +DAXDEV_F_SYNC);
> 
> Apologies for not realizing this until now, but this is broken.
> Imagine a device-mapper configuration composed of both 'async'
> virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified
> across all members. I would change this argument to '0' and then
> arrange for it to be set at dm_table_supports_dax() time after
> validating that all components support synchronous dax.

o.k. Need to set 'DAXDEV_F_SYNC' flag after verifying all the target
components support synchronous DAX.

Just a question: if a device-mapper configuration is composed of both
virtio-pmem and pmem devices, do we want to configure device-mapper for
async flush?

Thank you,
Pankaj 
> 


Re: [PATCH v8 0/6] virtio pmem driver

2019-05-10 Thread Pankaj Gupta


> >
> >  Hi Michael & Dan,
> >
> >  Please review/ack the patch series from LIBNVDIMM & VIRTIO side.
> >  We have ack on ext4, xfs patches(4, 5 & 6) patch 2. Still need
> >  your ack on nvdimm patches(1 & 3) & virtio patch 2.
> 
> I was planning to merge these via the nvdimm tree, not ack them. Did
> you have another maintainer lined up to take these patches?

Sorry for not being clear on this. I meant the same.

I proposed the patch series to be merged via the nvdimm tree, as you kindly
agreed. We only need an ack on virtio patch 2 from Michael.

Thank you for all your help.

Best regards,
Pankaj Gupta

> 


Re: [Qemu-devel] [PATCH v7 2/6] virtio-pmem: Add virtio pmem driver

2019-05-10 Thread Dan Williams
On Wed, May 8, 2019 at 4:19 AM Pankaj Gupta  wrote:
>
>
> Hi Dan,
>
> Thank you for the review. Please see my reply inline.
>
> >
> > Hi Pankaj,
> >
> > Some minor file placement comments below.
>
> Sure.
>
> >
> > On Thu, Apr 25, 2019 at 10:02 PM Pankaj Gupta  wrote:
> > >
> > > This patch adds virtio-pmem driver for KVM guest.
> > >
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > >
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> > >
> > > Signed-off-by: Pankaj Gupta 
> > > ---
> > >  drivers/nvdimm/virtio_pmem.c | 114 +
> > >  drivers/virtio/Kconfig   |  10 +++
> > >  drivers/virtio/Makefile  |   1 +
> > >  drivers/virtio/pmem.c| 118 +++
> > >  include/linux/virtio_pmem.h  |  60 
> > >  include/uapi/linux/virtio_ids.h  |   1 +
> > >  include/uapi/linux/virtio_pmem.h |  10 +++
> > >  7 files changed, 314 insertions(+)
> > >  create mode 100644 drivers/nvdimm/virtio_pmem.c
> > >  create mode 100644 drivers/virtio/pmem.c
> > >  create mode 100644 include/linux/virtio_pmem.h
> > >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > >
> > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > > new file mode 100644
> > > index ..66b582f751a3
> > > --- /dev/null
> > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > @@ -0,0 +1,114 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * virtio_pmem.c: Virtio pmem Driver
> > > + *
> > > + * Discovers persistent memory range information
> > > + * from host and provides a virtio based flushing
> > > + * interface.
> > > + */
> > > +#include <linux/virtio_pmem.h>
> > > +#include "nd.h"
> > > +
> > > + /* The interrupt handler */
> > > +void host_ack(struct virtqueue *vq)
> > > +{
> > > +   unsigned int len;
> > > +   unsigned long flags;
> > > +   struct virtio_pmem_request *req, *req_buf;
> > > +   struct virtio_pmem *vpmem = vq->vdev->priv;
> > > +
> > > +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > +   while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > > +   req->done = true;
> > > +   wake_up(&req->host_acked);
> > > +
> > > +   if (!list_empty(&vpmem->req_list)) {
> > > +   req_buf = list_first_entry(&vpmem->req_list,
> > > +   struct virtio_pmem_request, list);
> > > +   list_del(&vpmem->req_list);
> > > +   req_buf->wq_buf_avail = true;
> > > +   wake_up(&req_buf->wq_buf);
> > > +   }
> > > +   }
> > > +   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +}
> > > +EXPORT_SYMBOL_GPL(host_ack);
> > > +
> > > + /* The request submission function */
> > > +int virtio_pmem_flush(struct nd_region *nd_region)
> > > +{
> > > +   int err;
> > > +   unsigned long flags;
> > > +   struct scatterlist *sgs[2], sg, ret;
> > > +   struct virtio_device *vdev = nd_region->provider_data;
> > > +   struct virtio_pmem *vpmem = vdev->priv;
> > > +   struct virtio_pmem_request *req;
> > > +
> > > +   might_sleep();
> > > +   req = kmalloc(sizeof(*req), GFP_KERNEL);
> > > +   if (!req)
> > > +   return -ENOMEM;
> > > +
> > > +   req->done = req->wq_buf_avail = false;
> > > +   strcpy(req->name, "FLUSH");
> > > +   init_waitqueue_head(&req->host_acked);
> > > +   init_waitqueue_head(&req->wq_buf);
> > > +   sg_init_one(&sg, req->name, strlen(req->name));
> > > +   sgs[0] = &sg;
> > > +   sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > +   sgs[1] = &ret;
> > > +
> > > +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > +   err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req,
> > > +   GFP_ATOMIC);
> > > +   if (err) {
> > > +   dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > > +
> > > +   list_add_tail(&vpmem->req_list, &req->list);
> > > +   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +
> > > +   /* When host has read buffer, this completes via host_ack */
> > > +   wait_event(req->wq_buf, req->wq_buf_avail);
> > > +   spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > +   }
> > > +   err = virtqueue_kick(vpmem->req_vq);
> > > +   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +
> > > +   if (!err) {
> > > +   err = -EIO;
> > > +   goto ret;
> > > +   }
> > > +   /* When host has 

Re: [PATCH v2 2/8] vsock/virtio: free packets during the socket release

2019-05-10 Thread David Miller
From: Stefano Garzarella 
Date: Fri, 10 May 2019 14:58:37 +0200

> @@ -827,12 +827,20 @@ static bool virtio_transport_close(struct vsock_sock *vsk)
>  
>  void virtio_transport_release(struct vsock_sock *vsk)
>  {
> + struct virtio_vsock_sock *vvs = vsk->trans;
> + struct virtio_vsock_buf *buf;
>   struct sock *sk = &vsk->sk;
>   bool remove_sock = true;
>  
>   lock_sock(sk);
>   if (sk->sk_type == SOCK_STREAM)
>   remove_sock = virtio_transport_close(vsk);
> + while (!list_empty(&vvs->rx_queue)) {
> + buf = list_first_entry(&vvs->rx_queue,
> +struct virtio_vsock_buf, list);

Please use list_for_each_entry_safe().


Re: [PATCH v8 3/6] libnvdimm: add dax_dev sync flag

2019-05-10 Thread Dan Williams
On Fri, May 10, 2019 at 8:53 AM Pankaj Gupta  wrote:
>
> This patch adds 'DAXDEV_SYNC' flag which is set
> for nd_region doing synchronous flush. This is
> later used to disable MAP_SYNC functionality for
> ext4 & xfs filesystems on devices that don't
> support synchronous flush.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/dax/bus.c|  2 +-
>  drivers/dax/super.c  | 13 -
>  drivers/md/dm.c  |  3 ++-
>  drivers/nvdimm/pmem.c|  5 -
>  drivers/nvdimm/region_devs.c |  7 +++
>  include/linux/dax.h  |  8 ++--
>  include/linux/libnvdimm.h|  1 +
>  7 files changed, 33 insertions(+), 6 deletions(-)
[..]
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 043f0761e4a0..ee007b75d9fd 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor)
> sprintf(md->disk->disk_name, "dm-%d", minor);
>
> if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
> -   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
> +   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops,
> +DAXDEV_F_SYNC);

Apologies for not realizing this until now, but this is broken.
Imagine a device-mapper configuration composed of both 'async'
virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified
across all members. I would change this argument to '0' and then
arrange for it to be set at dm_table_supports_dax() time after
validating that all components support synchronous dax.


Re: [PATCH v8 0/6] virtio pmem driver

2019-05-10 Thread Dan Williams
On Fri, May 10, 2019 at 8:52 AM Pankaj Gupta  wrote:
>
>  Hi Michael & Dan,
>
>  Please review/ack the patch series from LIBNVDIMM & VIRTIO side.
>  We have ack on ext4, xfs patches(4, 5 & 6) patch 2. Still need
>  your ack on nvdimm patches(1 & 3) & virtio patch 2.

I was planning to merge these via the nvdimm tree, not ack them. Did
you have another maintainer lined up to take these patches?


Re: [PATCH v8 0/6] virtio pmem driver

2019-05-10 Thread Michael S. Tsirkin
On Fri, May 10, 2019 at 09:21:56PM +0530, Pankaj Gupta wrote:
>  Hi Michael & Dan,
> 
>  Please review/ack the patch series from LIBNVDIMM & VIRTIO side.

Thanks!
Hope to do this early next week.

>  We have ack on ext4, xfs patches(4, 5 & 6) patch 2. Still need
>  your ack on nvdimm patches(1 & 3) & virtio patch 2. 
> 
>  Changes done from v7 are only in patch(2 & 3) and not
>  affecting existing reviews. Request to please review.
>  
> 
>  This patch series has implementation for "virtio pmem". 
>  "virtio pmem" is fake persistent memory(nvdimm) in guest 
>  which allows to bypass the guest page cache. This also
>  implements a VIRTIO based asynchronous flush mechanism.  
>  
>  Sharing guest kernel driver in this patchset with the 
>  changes suggested in v4. Tested with Qemu side device 
>  emulation [6] for virtio-pmem. Documented the impact of
>  possible page cache side channel attacks with suggested
>  countermeasures.
> 
>  Details of project idea for 'virtio pmem' flushing interface 
>  is shared [3] & [4].
> 
>  Implementation is divided into two parts:
>  New virtio pmem guest driver and qemu code changes for new 
>  virtio pmem paravirtualized device.
> 
> 1. Guest virtio-pmem kernel driver
> -
>- Reads persistent memory range from paravirt device and 
>  registers with 'nvdimm_bus'.  
>- 'nvdimm/pmem' driver uses this information to allocate 
>  persistent memory region and setup filesystem operations 
>  to the allocated memory. 
>- virtio pmem driver implements asynchronous flushing 
>  interface to flush from guest to host.
> 
> 2. Qemu virtio-pmem device
> -
>- Creates virtio pmem device and exposes a memory range to 
>  KVM guest. 
>- At host side this is file backed memory which acts as 
>  persistent memory. 
>- Qemu side flush uses aio thread pool API's and virtio 
>  for asynchronous guest multi request handling. 
> 
>David Hildenbrand CCed also posted a modified version[7] of 
>qemu virtio-pmem code based on updated Qemu memory device API. 
> 
>  Virtio-pmem security implications and countermeasures:
>  -
> 
>  In previous posting of kernel driver, there was discussion [9]
>  on possible implications of page cache side channel attacks with 
>  virtio pmem. After thorough analysis of details of known side 
>  channel attacks, below are the suggestions:
> 
>  - Depends entirely on how host backing image file is mapped 
>into guest address space. 
> 
>  - virtio-pmem device emulation, by default shared mapping is used
>to map host backing file. It is recommended to use separate
>backing file at host side for every guest. This will prevent
>any possibility of executing common code from multiple guests
>and any chance of inferring guest local data based on
>execution time.
> 
>  - If backing file is required to be shared among multiple guests 
>it is recommended not to support host page cache eviction 
>commands from the guest driver. This will avoid any possibility
>of inferring guest local data or host data from another guest. 
> 
>  - Proposed device specification [8] for virtio-pmem device with 
>details of possible security implications and suggested 
>countermeasures for device emulation.
> 
>  Virtio-pmem errors handling:
>  
>   Checked behaviour of virtio-pmem for below types of errors
>   Need suggestions on expected behaviour for handling these errors?
> 
>   - Hardware Errors: Uncorrectable recoverable Errors: 
>   a] virtio-pmem: 
> - As per current logic if error page belongs to Qemu process, 
>   host MCE handler isolates(hwpoison) that page and send SIGBUS. 
>   Qemu SIGBUS handler injects exception to KVM guest. 
> - KVM guest then isolates the page and send SIGBUS to guest 
>   userspace process which has mapped the page. 
>   
>   b] Existing implementation for ACPI pmem driver: 
> - Handles such errors with MCE notifier and creates a list 
>   of bad blocks. Read/direct access DAX operation return EIO 
>   if accessed memory page fall in bad block list.
> - It also starts background scrubbing.
> - Similar functionality can be reused in virtio-pmem with MCE 
>   notifier but without scrubbing(no ACPI/ARS)? Need inputs to 
>   confirm if this behaviour is ok or needs any change?
> 
> Changes from PATCH v7: [1]
>  - Corrected pending request queue logic (patch 2) - Jakub Staroń
>  - Used unsigned long flags for passing DAXDEV_F_SYNC (patch 3) - Dan
>  - Fixed typo =>  vma 'flag' to 'vm_flag' (patch 4)
>  - Added rob in patch 6 & patch 2
> 
> Changes from PATCH v6: [1]
>  - Corrected comment format in patch 5 & patch 6. [Dave]
>  - Changed variable declaration indentation in patch 6 [Darrick]
>  - Add Reviewed-by tag by 'Jan Kara' in patch 4 & patch 5
> 

[PATCH v8 3/6] libnvdimm: add dax_dev sync flag

2019-05-10 Thread Pankaj Gupta
This patch adds 'DAXDEV_SYNC' flag which is set
for nd_region doing synchronous flush. This is
later used to disable MAP_SYNC functionality for
ext4 & xfs filesystems on devices that don't
support synchronous flush.

Signed-off-by: Pankaj Gupta 
---
 drivers/dax/bus.c|  2 +-
 drivers/dax/super.c  | 13 -
 drivers/md/dm.c  |  3 ++-
 drivers/nvdimm/pmem.c|  5 -
 drivers/nvdimm/region_devs.c |  7 +++
 include/linux/dax.h  |  8 ++--
 include/linux/libnvdimm.h|  1 +
 7 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 2109cfe80219..5f184e751c82 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -388,7 +388,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
 * No 'host' or dax_operations since there is no access to this
 * device outside of mmap of the resulting character device.
 */
-   dax_dev = alloc_dax(dev_dax, NULL, NULL);
+   dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
if (!dax_dev)
goto err;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index bbd57ca0634a..b6c44b5062e9 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -186,6 +186,8 @@ enum dax_device_flags {
DAXDEV_ALIVE,
/* gate whether dax_flush() calls the low level flush routine */
DAXDEV_WRITE_CACHE,
+   /* flag to check if device supports synchronous flush */
+   DAXDEV_SYNC,
 };
 
 /**
@@ -354,6 +356,12 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
 
+bool dax_synchronous(struct dax_device *dax_dev)
+{
+   return test_bit(DAXDEV_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(dax_synchronous);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
lockdep_assert_held(&dax_srcu);
@@ -508,7 +516,7 @@ static void dax_add_host(struct dax_device *dax_dev, const char *host)
 }
 
 struct dax_device *alloc_dax(void *private, const char *__host,
-   const struct dax_operations *ops)
+   const struct dax_operations *ops, unsigned long flags)
 {
struct dax_device *dax_dev;
const char *host;
@@ -531,6 +539,9 @@ struct dax_device *alloc_dax(void *private, const char *__host,
dax_add_host(dax_dev, host);
dax_dev->ops = ops;
dax_dev->private = private;
+   if (flags & DAXDEV_F_SYNC)
+   set_bit(DAXDEV_SYNC, &dax_dev->flags);
+
return dax_dev;
 
  err_dev:
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 043f0761e4a0..ee007b75d9fd 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor)
sprintf(md->disk->disk_name, "dm-%d", minor);
 
if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
-   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
+   dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops,
+DAXDEV_F_SYNC);
if (!dax_dev)
goto bad;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0279eb1da3ef..bdbd2b414d3d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -365,6 +365,7 @@ static int pmem_attach_disk(struct device *dev,
struct gendisk *disk;
void *addr;
int rc;
+   unsigned long flags = 0UL;
 
pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
if (!pmem)
@@ -462,7 +463,9 @@ static int pmem_attach_disk(struct device *dev,
nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
disk->bb = &pmem->bb;
 
-   dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+   if (is_nvdimm_sync(nd_region))
+   flags = DAXDEV_F_SYNC;
+   dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags);
if (!dax_dev) {
put_disk(disk);
return -ENOMEM;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b4ef7d9ff22e..f3ea5369d20a 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1197,6 +1197,13 @@ int nvdimm_has_cache(struct nd_region *nd_region)
 }
 EXPORT_SYMBOL_GPL(nvdimm_has_cache);
 
+bool is_nvdimm_sync(struct nd_region *nd_region)
+{
+   return is_nd_pmem(&nd_region->dev) &&
+   !test_bit(ND_REGION_ASYNC, &nd_region->flags);
+}
+EXPORT_SYMBOL_GPL(is_nvdimm_sync);
+
 struct conflict_context {
struct nd_region *nd_region;
resource_size_t start, size;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0dd316a74a29..ed75b7d9d178 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,6 +7,9 @@
 #include 
 #include 
 
+/* Flag for synchronous flush */
+#define DAXDEV_F_SYNC (1UL << 0)
+
 typedef unsigned long dax_entry_t;
 
 struct iomap_ops;
@@ -32,18 +35,19 @@ extern struct attribute_group 

[PATCH v8 4/6] dax: check synchronous mapping is supported

2019-05-10 Thread Pankaj Gupta
This patch introduces 'daxdev_mapping_supported' helper
which checks if 'MAP_SYNC' is supported with filesystem
mapping. It also checks if corresponding dax_device is
synchronous. Virtio pmem device is asynchronous and
does not support VM_SYNC.

Suggested-by: Jan Kara 
Signed-off-by: Pankaj Gupta 
Reviewed-by: Jan Kara 
---
 include/linux/dax.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index c97fc0cc7167..41b4a5db6305 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,6 +41,18 @@ void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
 bool dax_synchronous(struct dax_device *dax_dev);
+/*
+ * Check if given mapping is supported by the file / underlying device.
+ */
+static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
+   struct dax_device *dax_dev)
+{
+   if (!(vma->vm_flags & VM_SYNC))
+   return true;
+   if (!IS_DAX(file_inode(vma->vm_file)))
+   return false;
+   return dax_synchronous(dax_dev);
+}
 #else
 static inline struct dax_device *dax_get_by_host(const char *host)
 {
@@ -68,6 +80,11 @@ static inline bool dax_write_cache_enabled(struct dax_device *dax_dev)
 {
return false;
 }
+static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
+   struct dax_device *dax_dev)
+{
+   return !(vma->vm_flags & VM_SYNC);
+}
 #endif
 
 struct writeback_control;
-- 
2.20.1



[PATCH v8 6/6] xfs: disable map_sync for async flush

2019-05-10 Thread Pankaj Gupta
Don't support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and xfs.

Signed-off-by: Pankaj Gupta 
Reviewed-by: Darrick J. Wong 
---
 fs/xfs/xfs_file.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a7ceae90110e..f17652cca5ff 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1203,11 +1203,14 @@ xfs_file_mmap(
struct file *filp,
struct vm_area_struct *vma)
 {
+   struct dax_device   *dax_dev;
+
+   dax_dev = xfs_find_daxdev_for_inode(file_inode(filp));
/*
-* We don't support synchronous mappings for non-DAX files. At least
-* until someone comes with a sensible use case.
+* We don't support synchronous mappings for non-DAX files and
+* for DAX files if underneath dax_device is not synchronous.
 */
-   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+   if (!daxdev_mapping_supported(vma, dax_dev))
return -EOPNOTSUPP;
 
file_accessed(filp);
-- 
2.20.1



[PATCH v8 5/6] ext4: disable map_sync for async flush

2019-05-10 Thread Pankaj Gupta
Don't support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and ext4.

Signed-off-by: Pankaj Gupta 
Reviewed-by: Jan Kara 
---
 fs/ext4/file.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 98ec11f69cd4..dee549339e13 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -360,15 +360,17 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
struct inode *inode = file->f_mapping->host;
+   struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+   struct dax_device *dax_dev = sbi->s_daxdev;
 
-   if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+   if (unlikely(ext4_forced_shutdown(sbi)))
return -EIO;
 
/*
-* We don't support synchronous mappings for non-DAX files. At least
-* until someone comes with a sensible use case.
+* We don't support synchronous mappings for non-DAX files and
+* for DAX files if underneath dax_device is not synchronous.
 */
-   if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
+   if (!daxdev_mapping_supported(vma, dax_dev))
return -EOPNOTSUPP;
 
file_accessed(file);
-- 
2.20.1



[PATCH v8 2/6] virtio-pmem: Add virtio pmem driver

2019-05-10 Thread Pankaj Gupta
This patch adds virtio-pmem driver for KVM guest.

Guest reads the persistent memory range information from
Qemu over VIRTIO and registers it on nvdimm_bus. It also
creates a nd_region object with the persistent memory
range information so that existing 'nvdimm/pmem' driver
can reserve this into system memory map. This way
'virtio-pmem' driver uses existing functionality of pmem
driver to register persistent memory compatible for DAX
capable filesystems.

This also provides function to perform guest flush over
VIRTIO from 'pmem' driver when userspace performs flush
on DAX memory range.
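The flush submission path in nd_virtio.c below degrades gracefully when the virtqueue is full: with a free descriptor the request is added and the queue kicked; on -ENOSPC it parks on req_list until host_ack() frees a slot. A toy model of that decision (illustrative names, not the driver's API):

```c
#include <assert.h>

enum submit_result { SUBMIT_KICKED, SUBMIT_QUEUED };

/* Model of virtio_pmem_flush(): if virtqueue_add_sgs() would return
 * -ENOSPC (no free descriptors), the request waits on req_list for
 * host_ack() to wake it; otherwise it is kicked immediately. */
static enum submit_result submit_flush(int free_descriptors)
{
    if (free_descriptors == 0)   /* virtqueue_add_sgs() == -ENOSPC */
        return SUBMIT_QUEUED;    /* wait_event() until wq_buf_avail */
    return SUBMIT_KICKED;        /* virtqueue_kick() */
}
```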

Signed-off-by: Pankaj Gupta 
Reviewed-by: Yuval Shaia 
---
 drivers/nvdimm/Makefile  |   1 +
 drivers/nvdimm/nd_virtio.c   | 129 +++
 drivers/nvdimm/virtio_pmem.c | 117 
 drivers/virtio/Kconfig   |  10 +++
 include/linux/virtio_pmem.h  |  60 ++
 include/uapi/linux/virtio_ids.h  |   1 +
 include/uapi/linux/virtio_pmem.h |  10 +++
 7 files changed, 328 insertions(+)
 create mode 100644 drivers/nvdimm/nd_virtio.c
 create mode 100644 drivers/nvdimm/virtio_pmem.c
 create mode 100644 include/linux/virtio_pmem.h
 create mode 100644 include/uapi/linux/virtio_pmem.h

diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 6f2a088afad6..cefe233e0b52 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_ND_BTT) += nd_btt.o
 obj-$(CONFIG_ND_BLK) += nd_blk.o
 obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
 obj-$(CONFIG_OF_PMEM) += of_pmem.o
+obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o nd_virtio.o
 
 nd_pmem-y := pmem.o
 
diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
new file mode 100644
index ..ed7ddcc5a62c
--- /dev/null
+++ b/drivers/nvdimm/nd_virtio.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ */
+#include 
+#include "nd.h"
+
+ /* The interrupt handler */
+void host_ack(struct virtqueue *vq)
+{
+   unsigned int len;
+   unsigned long flags;
+   struct virtio_pmem_request *req, *req_buf;
+   struct virtio_pmem *vpmem = vq->vdev->priv;
+
+   spin_lock_irqsave(&vpmem->pmem_lock, flags);
+   while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
+   req->done = true;
+   wake_up(&req->host_acked);
+
+   if (!list_empty(&vpmem->req_list)) {
+   req_buf = list_first_entry(&vpmem->req_list,
+   struct virtio_pmem_request, list);
+   req_buf->wq_buf_avail = true;
+   wake_up(&req_buf->wq_buf);
+   list_del(&req_buf->list);
+   }
+   }
+   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+}
+EXPORT_SYMBOL_GPL(host_ack);
+
+ /* The request submission function */
+int virtio_pmem_flush(struct nd_region *nd_region)
+{
+   int err, err1;
+   unsigned long flags;
+   struct scatterlist *sgs[2], sg, ret;
+   struct virtio_device *vdev = nd_region->provider_data;
+   struct virtio_pmem *vpmem = vdev->priv;
+   struct virtio_pmem_request *req;
+
+   might_sleep();
+   req = kmalloc(sizeof(*req), GFP_KERNEL);
+   if (!req)
+   return -ENOMEM;
+
+   req->done = false;
+   strcpy(req->name, "FLUSH");
+   init_waitqueue_head(&req->host_acked);
+   init_waitqueue_head(&req->wq_buf);
+   INIT_LIST_HEAD(&req->list);
+   sg_init_one(&sg, req->name, strlen(req->name));
+   sgs[0] = &sg;
+   sg_init_one(&ret, &req->ret, sizeof(req->ret));
+   sgs[1] = &ret;
+
+   spin_lock_irqsave(&vpmem->pmem_lock, flags);
+/*
+ * If virtqueue_add_sgs returns -ENOSPC then req_vq virtual
+ * queue does not have a free descriptor. We add the request
+ * to req_list and wait for host_ack to wake us up when free
+ * slots are available.
+ */
+   while ((err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req,
+   GFP_ATOMIC)) == -ENOSPC) {
+
+   dev_err(&vdev->dev, "failed to send command to virtio pmem"\
+   "device, no free slots in the virtqueue\n");
+   req->wq_buf_avail = false;
+   list_add_tail(&req->list, &vpmem->req_list);
+   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+   /* When host has read buffer, this completes via host_ack */
+   wait_event(req->wq_buf, req->wq_buf_avail);
+   spin_lock_irqsave(&vpmem->pmem_lock, flags);
+   }
+   err1 = virtqueue_kick(vpmem->req_vq);
+   spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+   /*
+* virtqueue_add_sgs failed with error different than -ENOSPC, we can't
+* do anything about that.
+*/
+   if (err || !err1) {
+   dev_info(&vdev->dev, "failed to send command to virtio pmem 

[PATCH v8 1/6] libnvdimm: nd_region flush callback support

2019-05-10 Thread Pankaj Gupta
This patch adds functionality to perform flush from guest
to host over VIRTIO. We are registering a callback based
on 'nd_region' type. virtio_pmem driver requires this special
flush function. For rest of the region types we are registering
existing flush function. Report error returned by host fsync
failure to userspace.
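The region-type-based registration described above amounts to an optional per-region callback with a generic fallback, whose return value is propagated so that a host fsync failure reaches userspace. A minimal sketch under assumed names (nd_region_model and the helpers are illustrative, not the kernel's types):

```c
#include <assert.h>
#include <stddef.h>

struct nd_region_model {
    int (*flush)(struct nd_region_model *region);  /* set by virtio_pmem */
};

static int generic_flush(struct nd_region_model *r)    { (void)r; return 0; }
static int virtio_flush_eio(struct nd_region_model *r) { (void)r; return -5; /* -EIO */ }

/* nvdimm_flush() analogue: prefer the region's own callback, fall back
 * to the existing flush path, and propagate the result to the caller. */
static int region_flush(struct nd_region_model *r)
{
    return r->flush ? r->flush(r) : generic_flush(r);
}

static int flush_plain(void)
{
    struct nd_region_model r = { NULL };             /* ordinary region */
    return region_flush(&r);
}

static int flush_virtio(void)
{
    struct nd_region_model r = { virtio_flush_eio }; /* virtio region, host fsync fails */
    return region_flush(&r);
}
```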

Signed-off-by: Pankaj Gupta 
---
 drivers/acpi/nfit/core.c |  4 ++--
 drivers/nvdimm/claim.c   |  6 --
 drivers/nvdimm/nd.h  |  1 +
 drivers/nvdimm/pmem.c| 13 -
 drivers/nvdimm/region_devs.c | 26 --
 include/linux/libnvdimm.h|  8 +++-
 6 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 5a389a4f4f65..08dde76cf459 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2434,7 +2434,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
offset = to_interleave_offset(offset, mmio);
 
writeq(cmd, mmio->addr.base + offset);
-   nvdimm_flush(nfit_blk->nd_region);
+   nvdimm_flush(nfit_blk->nd_region, NULL);
 
if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
readq(mmio->addr.base + offset);
@@ -2483,7 +2483,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
}
 
if (rw)
-   nvdimm_flush(nfit_blk->nd_region);
+   nvdimm_flush(nfit_blk->nd_region, NULL);
 
rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
return rc;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index fb667bf469c7..13510bae1e6f 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -263,7 +263,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
sector_t sector = offset >> 9;
-   int rc = 0;
+   int rc = 0, ret = 0;
 
if (unlikely(!size))
return 0;
@@ -301,7 +301,9 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
}
 
memcpy_flushcache(nsio->addr + offset, buf, size);
-   nvdimm_flush(to_nd_region(ndns->dev.parent));
+   ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL);
+   if (ret)
+   rc = ret;
 
return rc;
 }
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index a5ac3b240293..0c74d2428bd7 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -159,6 +159,7 @@ struct nd_region {
struct badblocks bb;
struct nd_interleave_set *nd_set;
struct nd_percpu_lane __percpu *lane;
+   int (*flush)(struct nd_region *nd_region, struct bio *bio);
struct nd_mapping mapping[0];
 };
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bc2f700feef8..f719245da170 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -192,6 +192,7 @@ static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 
 static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 {
+   int ret = 0;
blk_status_t rc = 0;
bool do_acct;
unsigned long start;
@@ -201,7 +202,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
struct nd_region *nd_region = to_region(pmem);
 
if (bio->bi_opf & REQ_PREFLUSH)
-   nvdimm_flush(nd_region);
+   ret = nvdimm_flush(nd_region, bio);
 
do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter) {
@@ -216,7 +217,10 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
nd_iostat_end(bio, start);
 
if (bio->bi_opf & REQ_FUA)
-   nvdimm_flush(nd_region);
+   ret = nvdimm_flush(nd_region, bio);
+
+   if (ret)
+   bio->bi_status = errno_to_blk_status(ret);
 
bio_endio(bio);
return BLK_QC_T_NONE;
@@ -469,7 +473,6 @@ static int pmem_attach_disk(struct device *dev,
}
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
pmem->dax_dev = dax_dev;
-
gendev = disk_to_dev(disk);
gendev->groups = pmem_attribute_groups;
 
@@ -527,14 +530,14 @@ static int nd_pmem_remove(struct device *dev)
sysfs_put(pmem->bb_state);
pmem->bb_state = NULL;
}
-   nvdimm_flush(to_nd_region(dev->parent));
+   nvdimm_flush(to_nd_region(dev->parent), NULL);
 
return 0;
 }
 
 static void nd_pmem_shutdown(struct device *dev)
 {
-   nvdimm_flush(to_nd_region(dev->parent));
+   nvdimm_flush(to_nd_region(dev->parent), NULL);
 }
 
 static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b4ef7d9ff22e..e5b59708865e 100644
--- a/drivers/nvdimm/region_devs.c
+++ 

[PATCH v8 0/6] virtio pmem driver

2019-05-10 Thread Pankaj Gupta
 Hi Michael & Dan,

 Please review/ack the patch series from LIBNVDIMM & VIRTIO side.
 We have acks on the ext4 & xfs patches (4, 5 & 6) and a review on patch 2.
 Still need your ack on the nvdimm patches (1 & 3) and the virtio patch (2). 

 Changes done from v7 are only in patch(2 & 3) and not
 affecting existing reviews. Request to please review.
 

 This patch series implements "virtio pmem": fake persistent
 memory (nvdimm) in the guest which allows bypassing the guest
 page cache. It also implements a VIRTIO based asynchronous
 flush mechanism.  
 
 Sharing guest kernel driver in this patchset with the 
 changes suggested in v4. Tested with Qemu side device 
 emulation [6] for virtio-pmem. Documented the impact of
 possible page cache side channel attacks with suggested
 countermeasures.

 Details of project idea for 'virtio pmem' flushing interface 
 is shared [3] & [4].

 Implementation is divided into two parts:
 New virtio pmem guest driver and qemu code changes for new 
 virtio pmem paravirtualized device.

1. Guest virtio-pmem kernel driver
-
   - Reads persistent memory range from paravirt device and 
 registers with 'nvdimm_bus'.  
   - 'nvdimm/pmem' driver uses this information to allocate 
 persistent memory region and setup filesystem operations 
 to the allocated memory. 
   - virtio pmem driver implements asynchronous flushing 
 interface to flush from guest to host.

2. Qemu virtio-pmem device
-
   - Creates virtio pmem device and exposes a memory range to 
 KVM guest. 
   - At host side this is file backed memory which acts as 
 persistent memory. 
   - Qemu side flush uses aio thread pool API's and virtio 
 for asynchronous guest multi request handling. 

   David Hildenbrand CCed also posted a modified version[7] of 
   qemu virtio-pmem code based on updated Qemu memory device API. 

 Virtio-pmem security implications and countermeasures:
 -

 In previous posting of kernel driver, there was discussion [9]
 on possible implications of page cache side channel attacks with 
 virtio pmem. After thorough analysis of details of known side 
 channel attacks, below are the suggestions:

 - Depends entirely on how host backing image file is mapped 
   into guest address space. 

 - virtio-pmem device emulation, by default shared mapping is used
   to map host backing file. It is recommended to use separate
   backing file at host side for every guest. This will prevent
   any possibility of executing common code from multiple guests
   and any chance of inferring guest local data based on 
   execution time.

 - If backing file is required to be shared among multiple guests 
   it is recommended not to support host page cache eviction 
   commands from the guest driver. This will avoid any possibility
   of inferring guest local data or host data from another guest. 

 - Proposed device specification [8] for virtio-pmem device with 
   details of possible security implications and suggested 
   countermeasures for device emulation.

 Virtio-pmem errors handling:
 
  Checked the behaviour of virtio-pmem for the below types of errors.
  Need suggestions on the expected behaviour for handling these errors:

  - Hardware Errors: Uncorrectable recoverable Errors: 
  a] virtio-pmem: 
- As per current logic if error page belongs to Qemu process, 
  host MCE handler isolates(hwpoison) that page and send SIGBUS. 
  Qemu SIGBUS handler injects exception to KVM guest. 
- KVM guest then isolates the page and send SIGBUS to guest 
  userspace process which has mapped the page. 
  
  b] Existing implementation for ACPI pmem driver: 
- Handles such errors with MCE notifier and creates a list 
  of bad blocks. Read/direct access DAX operation return EIO 
  if accessed memory page fall in bad block list.
- It also starts background scrubbing.  
- Similar functionality can be reused in virtio-pmem with MCE 
  notifier but without scrubbing(no ACPI/ARS)? Need inputs to 
  confirm if this behaviour is ok or needs any change?

Changes from PATCH v7: [1]
 - Corrected pending request queue logic (patch 2) - Jakub Staroń
 - Used unsigned long flags for passing DAXDEV_F_SYNC (patch 3) - Dan
 - Fixed typo =>  vma 'flag' to 'vm_flag' (patch 4)
 - Added rob in patch 6 & patch 2

Changes from PATCH v6: [1]
 - Corrected comment format in patch 5 & patch 6. [Dave]
 - Changed variable declaration indentation in patch 6 [Darrick]
 - Add Reviewed-by tag by 'Jan Kara' in patch 4 & patch 5

Changes from PATCH v5: [2]
  Changes suggested in by - [Cornelia, Yuval]
- Remove assignment chaining in virtio driver
- Better error message and remove not required free
- Check nd_region before use

  Changes suggested by - [Jan Kara]
- dax_synchronous() for !CONFIG_DAX
- Correct 'daxdev_mapping_supported' comment and 

Re: [PATCH 05/10] s390/cio: introduce DMA pools to cio

2019-05-10 Thread Cornelia Huck
On Fri, 10 May 2019 00:11:12 +0200
Halil Pasic  wrote:

> On Thu, 9 May 2019 12:11:06 +0200
> Cornelia Huck  wrote:
> 
> > On Wed, 8 May 2019 23:22:10 +0200
> > Halil Pasic  wrote:
> >   
> > > On Wed, 8 May 2019 15:18:10 +0200 (CEST)
> > > Sebastian Ott  wrote:  
> >   
> > > > > @@ -1063,6 +1163,7 @@ static int __init css_bus_init(void)
> > > > >   unregister_reboot_notifier(&css_reboot_notifier);
> > > > >   goto out_unregister;
> > > > >   }
> > > > > + cio_dma_pool_init();  
> > > > 
> > > > This is too late for early devices (ccw console!).
> > > 
> > > You have already raised concern about this last time (thanks). I think
> > > I've addressed this issue: the cio_dma_pool is only used by the airq
> > > stuff. I don't think the ccw console needs it. Please have another look
> > > at patch #6, and explain your concern in more detail if it persists.  
> > 
> > What about changing the naming/adding comments here, so that (1) folks
> > aren't confused by the same thing in the future and (2) folks don't try
> > to use that pool for something needed for the early ccw consoles?
> >   
> 
> I'm all for clarity! Suggestions for better names?

css_aiv_dma_pool, maybe? Or is there other cross-device stuff that may
need it?


Re: [PATCH 01/10] virtio/s390: use vring_create_virtqueue

2019-05-10 Thread Cornelia Huck
On Tue, 7 May 2019 15:58:12 +0200
Christian Borntraeger  wrote:

> On 05.05.19 13:15, Cornelia Huck wrote:
> > On Sat, 4 May 2019 16:03:40 +0200
> > Halil Pasic  wrote:
> >   
> >> On Fri, 3 May 2019 16:04:48 -0400
> >> "Michael S. Tsirkin"  wrote:
> >>  
> >>> On Fri, May 03, 2019 at 11:17:24AM +0200, Cornelia Huck wrote:
>  On Fri, 26 Apr 2019 20:32:36 +0200
>  Halil Pasic  wrote:
>  
> > The commit 2a2d1382fe9d ("virtio: Add improved queue allocation API")
> > establishes a new way of allocating virtqueues (as a part of the effort
> > that taught DMA to virtio rings).
> >
> > In the future we will want virtio-ccw to use the DMA API as well.
> >
> > Let us switch from the legacy method of allocating virtqueues to
> > vring_create_virtqueue() as the first step into that direction.
> >
> > Signed-off-by: Halil Pasic 
> > ---
> >  drivers/s390/virtio/virtio_ccw.c | 30 +++---
> >  1 file changed, 11 insertions(+), 19 deletions(-)
> 
>  Reviewed-by: Cornelia Huck 
> 
>  I'd vote for merging this patch right away for 5.2.
> >>>
> >>> So which tree is this going through? mine?
> >>> 
> >>
> >> Christian, what do you think? If the whole series is supposed to go in
> >> in one go (which I hope it is), via Martin's tree could be the simplest
> >> route IMHO.  
> > 
> > 
> > The first three patches are virtio(-ccw) only and the those are the ones
> > that I think are ready to go.
> > 
> > I'm not feeling comfortable going forward with the remainder as it
> > stands now; waiting for some other folks to give feedback. (They are
> > touching/interacting with code parts I'm not so familiar with, and lack
> > of documentation, while not the developers' fault, does not make it
> > easier.)
> > 
> > Michael, would you like to pick up 1-3 for your tree directly? That
> > looks like the easiest way.  
> 
> Agreed. Michael please pick 1-3.
> We will continue to review 4- first and then see which tree is best.

Michael, please let me know if you'll pick directly or whether I should
post a series.

[Given that the patches are from one virtio-ccw maintainer and reviewed
by the other, picking directly would eliminate an unnecessary
indirection :)]


[PATCH v2 1/8] vsock/virtio: limit the memory used per-socket

2019-05-10 Thread Stefano Garzarella
Since virtio-vsock was introduced, the buffers filled by the host
and pushed to the guest using the vring are queued directly in
a per-socket list, avoiding a copy.
These buffers are preallocated by the guest with a fixed
size (4 KB).

The maximum amount of memory used by each socket should be
controlled by the credit mechanism.
The default credit available per-socket is 256 KB, but if we use
only 1 byte per packet, the guest can queue up to 262144 of 4 KB
buffers, using up to 1 GB of memory per-socket. In addition, the
guest will continue to fill the vring with new 4 KB free buffers
to avoid starvation of other sockets.
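The 1 GB figure above follows directly from the numbers quoted; a quick back-of-the-envelope check:

```c
#include <assert.h>

/* 256 KiB of credit spent 1 byte at a time pins one 4 KiB preallocated
 * buffer per packet: 262144 packets * 4096 bytes = 1 GiB per socket. */
static unsigned long long worst_case_bytes(unsigned long long credit,
                                           unsigned long long payload_per_pkt,
                                           unsigned long long buf_size)
{
    return (credit / payload_per_pkt) * buf_size;
}
```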

This patch solves this issue copying the payload in a new buffer.
Then it is queued in the per-socket list, and the 4KB buffer used
by the host is freed.

In this way, the memory used by each socket respects the credit
available, and we still avoid starvation, paying the cost of an
extra memory copy. When the buffer is completely full we do a
"zero-copy", moving the buffer directly into the per-socket list.

Signed-off-by: Stefano Garzarella 
---
 drivers/vhost/vsock.c   |  2 +
 include/linux/virtio_vsock.h|  8 +++
 net/vmw_vsock/virtio_transport.c|  1 +
 net/vmw_vsock/virtio_transport_common.c | 95 ++---
 4 files changed, 81 insertions(+), 25 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index bb5fc0e9fbc2..7964e2daee09 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -320,6 +320,8 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
return NULL;
}
 
+   pkt->buf_len = pkt->len;
+
nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
if (nbytes != pkt->len) {
vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index e223e2632edd..345f04ee9193 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -54,9 +54,17 @@ struct virtio_vsock_pkt {
void *buf;
u32 len;
u32 off;
+   u32 buf_len;
bool reply;
 };
 
+struct virtio_vsock_buf {
+   struct list_head list;
+   void *addr;
+   u32 len;
+   u32 off;
+};
+
 struct virtio_vsock_pkt_info {
u32 remote_cid, remote_port;
struct vsock_sock *vsk;
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 15eb5d3d4750..af1d2ce12f54 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -280,6 +280,7 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
break;
}
 
+   pkt->buf_len = buf_len;
pkt->len = buf_len;
 
sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 602715fc9a75..0248d6808755 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -65,6 +65,9 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
pkt->buf = kmalloc(len, GFP_KERNEL);
if (!pkt->buf)
goto out_pkt;
+
+   pkt->buf_len = len;
+
err = memcpy_from_msg(pkt->buf, info->msg, len);
if (err)
goto out;
@@ -86,6 +89,46 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
return NULL;
 }
 
+static struct virtio_vsock_buf *
+virtio_transport_alloc_buf(struct virtio_vsock_pkt *pkt, bool zero_copy)
+{
+   struct virtio_vsock_buf *buf;
+
+   if (pkt->len == 0)
+   return NULL;
+
+   buf = kzalloc(sizeof(*buf), GFP_KERNEL);
+   if (!buf)
+   return NULL;
+
+   /* If the buffer in the virtio_vsock_pkt is full, we can move it to
+* the new virtio_vsock_buf avoiding the copy, because we are sure that
+* we are not using more memory than that counted by the credit mechanism.
+*/
+   if (zero_copy && pkt->len == pkt->buf_len) {
+   buf->addr = pkt->buf;
+   pkt->buf = NULL;
+   } else {
+   buf->addr = kmalloc(pkt->len, GFP_KERNEL);
+   if (!buf->addr) {
+   kfree(buf);
+   return NULL;
+   }
+
+   memcpy(buf->addr, pkt->buf, pkt->len);
+   }
+
+   buf->len = pkt->len;
+
+   return buf;
+}
+
+static void virtio_transport_free_buf(struct virtio_vsock_buf *buf)
+{
+   kfree(buf->addr);
+   kfree(buf);
+}
+
 /* Packet capture */
 static struct sk_buff *virtio_transport_build_skb(void *opaque)
 {
@@ -190,17 +233,15 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
return virtio_transport_get_ops()->send_pkt(pkt);
 }
 
-static void virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
-   

[PATCH v2 0/8] vsock/virtio: optimizations to increase the throughput

2019-05-10 Thread Stefano Garzarella
While I was testing this new series (v2) I discovered a huge use of memory
and a memory leak in the virtio-vsock driver in the guest when I sent
1-byte packets to the guest.

These issues are present since the introduction of the virtio-vsock
driver. I added the patches 1 and 2 to fix them in this series in order
to better track the performance trends.

v1: https://patchwork.kernel.org/cover/10885431/

v2:
- Add patch 1 to limit the memory usage
- Add patch 2 to avoid memory leak during the socket release
- Add patch 3 to fix locking of fwd_cnt and buf_alloc
- Patch 4: fix 'free_space' type (u32 instead of s64) [Stefan]
- Patch 5: Avoid integer underflow of iov_len [Stefan]
- Patch 5: Fix packet capture in order to see the exact packets that are
   delivered. [Stefan]
- Add patch 8 to make the RX buffer size tunable [Stefan]

Below are the benchmarks step by step. I used iperf3 [1] modified with VSOCK
support.
As Michael suggested in the v1, I booted host and guest with 'nosmap', and I
added a column with virtio-net+vhost-net performance.

A brief description of patches:
- Patches 1+2: limit the memory usage with an extra copy and avoid memory leak
- Patches 3+4: fix locking and reduce the number of credit update messages sent
   to the transmitter
- Patches 5+6: allow the host to split packets on multiple buffers and use
   VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size allowed
- Patches 7+8: increase RX buffer size to 64 KiB

host -> guest [Gbps]
pkt_size  before opt   p 1+2    p 3+4    p 5+6    p 7+8   virtio-net + vhost
                                                           TCP_NODELAY
64           0.068     0.063    0.130    0.131    0.128    0.188    0.187
256          0.274     0.236    0.392    0.338    0.282    0.749    0.654
512          0.531     0.457    0.862    0.725    0.602    1.419    1.414
1K           0.954     0.827    1.591    1.598    1.548    2.599    2.640
2K           1.783     1.543    3.731    3.637    3.469    4.530    4.754
4K           3.332     3.436    7.164    7.124    6.494    7.738    7.696
8K           5.792     5.530   11.653   11.787   11.444   12.307   11.850
16K          8.405     8.462   16.372   16.855   17.562   16.936   16.954
32K         14.208    13.669   18.945   20.009   23.128   21.980   23.015
64K         21.082    18.893   20.266   20.903   30.622   27.290   27.383
128K        20.696    20.148   20.112   21.746   32.152   30.446   30.990
256K        20.801    20.589   20.725   22.685   34.721   33.151   32.745
512K        21.220    20.465   20.432   22.106   34.496   36.847   31.096

guest -> host [Gbps]
pkt_size  before opt   p 1+2    p 3+4    p 5+6    p 7+8   virtio-net + vhost
                                                           TCP_NODELAY
64           0.089     0.091    0.120    0.115    0.117    0.274    0.272
256          0.352     0.354    0.452    0.445    0.451    1.085    1.136
512          0.705     0.704    0.893    0.858    0.898    2.131    1.882
1K           1.394     1.433    1.721    1.669    1.691    3.984    3.576
2K           2.818     2.874    3.316    3.249    3.303    6.719    6.359
4K           5.293     5.397    6.129    5.933    6.082   10.105    9.860
8K           8.890     9.151   10.990   10.545   10.519   15.239   14.868
16K         11.444    11.018   12.074   15.255   15.577   20.551   20.848
32K         11.229    10.875   10.857   24.401   25.227   26.294   26.380
64K         10.832    10.545   10.816   39.487   39.616   34.996   32.041
128K        10.435    10.241   10.500   39.813   40.012   38.379   35.055
256K        10.263     9.866    9.845   34.971   35.143   36.559   37.232
512K        10.224    10.060   10.092   35.469   34.627   34.963   33.401

As Stefan suggested in the v1, this time I measured also the efficiency in this
way:
efficiency = Mbps / (%CPU_Host + %CPU_Guest)

The '%CPU_Guest' is taken inside the VM. I know that it is not the best way,
but it's provided for free from iperf3 and could be an indication.
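For clarity, the metric used in the efficiency table is just the measured throughput divided by the summed CPU utilisation:

```c
#include <assert.h>

/* efficiency = Mbps / (%CPU_Host + %CPU_Guest), as defined above. */
static double efficiency(double mbps, double cpu_host_pct, double cpu_guest_pct)
{
    return mbps / (cpu_host_pct + cpu_guest_pct);
}
```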

host -> guest efficiency [Mbps / (%CPU_Host + %CPU_Guest)]
pkt_size  before opt   p 1+2    p 3+4    p 5+6    p 7+8   virtio-net + vhost
                                                           TCP_NODELAY
64            0.94      0.59     3.96     4.06     4.09     2.82     2.11
256           2.62      2.50     6.45     6.09     5.81     9.64     8.73
512           5.16      4.87    13.16    12.39    11.67    17.83    17.76
1K            9.16      8.85    24.98    24.97    25.01    32.57    32.04
2K           17.41     17.03    49.09    48.59    49.22    55.31    57.14
4K           32.99     33.62    90.80    90.98    91.72    91.79    91.40
8K           58.51     59.98   153.53   170.83   167.31   137.51   132.85
16K          89.32 

[PATCH v2 4/8] vsock/virtio: reduce credit update messages

2019-05-10 Thread Stefano Garzarella
In order to reduce the number of credit update messages,
we send them only when the space available seen by the
transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE.
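The receiver tracks fwd_cnt as of the last credit update (last_fwd_cnt in the hunk below); the space the transmitter still believes it has is buf_alloc minus the bytes consumed since then, and an update is only sent once that drops below one maximum packet. A sketch of the policy (the 64 KiB constant stands in for the kernel's VIRTIO_VSOCK_MAX_PKT_BUF_SIZE):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PKT_BUF_SIZE (64u * 1024u)  /* stand-in for VIRTIO_VSOCK_MAX_PKT_BUF_SIZE */

/* free_space is what the transmitter last heard (buf_alloc) minus the
 * bytes the receiver has consumed since that credit update. */
static bool need_credit_update(uint32_t buf_alloc, uint32_t fwd_cnt,
                               uint32_t last_fwd_cnt)
{
    uint32_t free_space = buf_alloc - (fwd_cnt - last_fwd_cnt);
    return free_space < MAX_PKT_BUF_SIZE;
}
```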

Signed-off-by: Stefano Garzarella 
---
 include/linux/virtio_vsock.h|  1 +
 net/vmw_vsock/virtio_transport_common.c | 16 +---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index fb5954fc85c8..84b72026d327 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -41,6 +41,7 @@ struct virtio_vsock_sock {
/* Protected by rx_lock */
u32 buf_alloc;
u32 fwd_cnt;
+   u32 last_fwd_cnt;
u32 rx_bytes;
struct list_head rx_queue;
 };
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index f2e4e128bc86..b61fd5e29a1f 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -247,6 +247,7 @@ static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs, u32 len)
void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
 {
spin_lock_bh(&vvs->rx_lock);
+   vvs->last_fwd_cnt = vvs->fwd_cnt;
pkt->hdr.fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
pkt->hdr.buf_alloc = cpu_to_le32(vvs->buf_alloc);
spin_unlock_bh(&vvs->rx_lock);
@@ -297,6 +298,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
struct virtio_vsock_sock *vvs = vsk->trans;
struct virtio_vsock_buf *buf;
size_t bytes, total = 0;
+   u32 free_space;
int err = -EFAULT;
 
spin_lock_bh(&vvs->rx_lock);
@@ -327,11 +329,19 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
virtio_transport_free_buf(buf);
}
}
+
+   free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
+
spin_unlock_bh(&vvs->rx_lock);
 
-   /* Send a credit pkt to peer */
-   virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM,
-   NULL);
+   /* We send a credit update only when the space available seen
+* by the transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE
+*/
+   if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
+   virtio_transport_send_credit_update(vsk,
+   VIRTIO_VSOCK_TYPE_STREAM,
+   NULL);
+   }
 
return total;
 
-- 
2.20.1



[PATCH v2 5/8] vhost/vsock: split packets to send using multiple buffers

2019-05-10 Thread Stefano Garzarella
If the packets to be sent to the guest are bigger than the available
buffer, we can split them across multiple buffers, fixing the length
in the packet header.
This is safe since virtio-vsock supports only stream sockets.
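The split in the hunk below boils down to clamping the remaining payload to the space left in the guest buffer after the packet header; the sender then advances pkt->off by that amount and requeues the packet until everything is sent. A sketch (the 44-byte header size in the test is illustrative):

```c
#include <assert.h>

/* Remaining payload (len - off), clamped to what fits in the buffer
 * after the header; each call models one transmitted chunk. */
static unsigned long chunk_len(unsigned long pkt_len, unsigned long pkt_off,
                               unsigned long iov_len, unsigned long hdr_len)
{
    unsigned long payload_len = pkt_len - pkt_off;

    if (payload_len > iov_len - hdr_len)
        payload_len = iov_len - hdr_len;
    return payload_len;
}
```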

Signed-off-by: Stefano Garzarella 
---
 drivers/vhost/vsock.c   | 51 +++--
 net/vmw_vsock/virtio_transport_common.c | 15 ++--
 2 files changed, 52 insertions(+), 14 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 7964e2daee09..fb731d09f5f1 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -94,7 +94,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
struct iov_iter iov_iter;
unsigned out, in;
size_t nbytes;
-   size_t len;
+   size_t iov_len, payload_len;
int head;
 
spin_lock_bh(&vsock->send_pkt_list_lock);
@@ -139,8 +139,24 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
 
-   len = iov_length(&vq->iov[out], in);
-   iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
+   iov_len = iov_length(&vq->iov[out], in);
+   if (iov_len < sizeof(pkt->hdr)) {
+   virtio_transport_free_pkt(pkt);
+   vq_err(vq, "Buffer len [%zu] too small\n", iov_len);
+   break;
+   }
+
+   iov_iter_init(&iov_iter, READ, &vq->iov[out], in, iov_len);
+   payload_len = pkt->len - pkt->off;
+
+   /* If the packet is greater than the space available in the
+* buffer, we split it using multiple buffers.
+*/
+   if (payload_len > iov_len - sizeof(pkt->hdr))
+   payload_len = iov_len - sizeof(pkt->hdr);
+
+   /* Set the correct length in the header */
+   pkt->hdr.len = cpu_to_le32(payload_len);
 
nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
if (nbytes != sizeof(pkt->hdr)) {
@@ -149,16 +165,34 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
 
-   nbytes = copy_to_iter(pkt->buf, pkt->len, _iter);
-   if (nbytes != pkt->len) {
+   nbytes = copy_to_iter(pkt->buf + pkt->off, payload_len,
+ &iov_iter);
+   if (nbytes != payload_len) {
virtio_transport_free_pkt(pkt);
vq_err(vq, "Faulted on copying pkt buf\n");
break;
}
 
-   vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
+   vhost_add_used(vq, head, sizeof(pkt->hdr) + payload_len);
added = true;
 
+   /* Deliver to monitoring devices all correctly transmitted
+* packets.
+*/
+   virtio_transport_deliver_tap_pkt(pkt);
+
+   pkt->off += payload_len;
+
+   /* If we didn't send all the payload we can requeue the packet
+* to send it with the next available buffer.
+*/
+   if (pkt->off < pkt->len) {
+   spin_lock_bh(&vsock->send_pkt_list_lock);
+   list_add(&pkt->list, &vsock->send_pkt_list);
+   spin_unlock_bh(&vsock->send_pkt_list_lock);
+   continue;
+   }
+
if (pkt->reply) {
int val;
 
@@ -169,11 +203,6 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
restart_tx = true;
}
 
-   /* Deliver to monitoring devices all correctly transmitted
-* packets.
-*/
-   virtio_transport_deliver_tap_pkt(pkt);
-
virtio_transport_free_pkt(pkt);
}
if (added)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index b61fd5e29a1f..3f313bcd6a26 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -135,8 +135,17 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
struct virtio_vsock_pkt *pkt = opaque;
struct af_vsockmon_hdr *hdr;
struct sk_buff *skb;
+   size_t payload_len;
+   void *payload_buf;
 
-   skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + pkt->len,
+   /* A packet could be split to fit the RX buffer, so we can retrieve
+* the payload length from the header and the buffer pointer taking
+* care of the offset in the original packet.
+*/
+   payload_len = le32_to_cpu(pkt->hdr.len);
+   payload_buf = pkt->buf + pkt->off;
+
+   skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + payload_len,
GFP_ATOMIC);
if (!skb)
return NULL;

[PATCH v2 3/8] vsock/virtio: fix locking for fwd_cnt and buf_alloc

2019-05-10 Thread Stefano Garzarella
fwd_cnt is written under rx_lock, so we should read it while
holding the same spinlock, even when we are in the TX path.

Also move buf_alloc under rx_lock, and add the missing locking
where we modify it.

Signed-off-by: Stefano Garzarella 
---
 include/linux/virtio_vsock.h| 2 +-
 net/vmw_vsock/virtio_transport_common.c | 6 --
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 345f04ee9193..fb5954fc85c8 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -35,11 +35,11 @@ struct virtio_vsock_sock {
 
/* Protected by tx_lock */
u32 tx_cnt;
-   u32 buf_alloc;
u32 peer_fwd_cnt;
u32 peer_buf_alloc;
 
/* Protected by rx_lock */
+   u32 buf_alloc;
u32 fwd_cnt;
u32 rx_bytes;
struct list_head rx_queue;
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 65c8b4a23f2b..f2e4e128bc86 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -246,10 +246,10 @@ static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs, u32 len)
 
 void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
 {
-   spin_lock_bh(&vvs->tx_lock);
+   spin_lock_bh(&vvs->rx_lock);
pkt->hdr.fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
pkt->hdr.buf_alloc = cpu_to_le32(vvs->buf_alloc);
-   spin_unlock_bh(&vvs->tx_lock);
+   spin_unlock_bh(&vvs->rx_lock);
 }
 EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
 
@@ -469,7 +469,9 @@ void virtio_transport_set_buffer_size(struct vsock_sock *vsk, u64 val)
if (val > vvs->buf_size_max)
vvs->buf_size_max = val;
vvs->buf_size = val;
+   spin_lock_bh(&vvs->rx_lock);
vvs->buf_alloc = val;
+   spin_unlock_bh(&vvs->rx_lock);
 }
 EXPORT_SYMBOL_GPL(virtio_transport_set_buffer_size);
 
-- 
2.20.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v2 7/8] vsock/virtio: increase RX buffer size to 64 KiB

2019-05-10 Thread Stefano Garzarella
In order to increase host -> guest throughput with large packets,
we can use 64 KiB RX buffers.

Signed-off-by: Stefano Garzarella 
---
 include/linux/virtio_vsock.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 84b72026d327..5a9d25be72df 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -10,7 +10,7 @@
 #define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE  128
 #define VIRTIO_VSOCK_DEFAULT_BUF_SIZE  (1024 * 256)
 #define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE  (1024 * 256)
-#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE   (1024 * 4)
+#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE   (1024 * 64)
 #define VIRTIO_VSOCK_MAX_BUF_SIZE  0xUL
 #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE  (1024 * 64)
 
-- 
2.20.1



[PATCH v2 8/8] vsock/virtio: make the RX buffer size tunable

2019-05-10 Thread Stefano Garzarella
The RX buffer size determines the memory consumption of the
vsock/virtio guest driver, so we make it tunable through
a module parameter.

The allowed sizes are between 4 KiB and 64 KiB, in order to be
compatible with old host drivers.

Suggested-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 
---
 include/linux/virtio_vsock.h |  1 +
 net/vmw_vsock/virtio_transport.c | 27 ++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 5a9d25be72df..b9f8c3d91f80 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -13,6 +13,7 @@
 #define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE   (1024 * 64)
 #define VIRTIO_VSOCK_MAX_BUF_SIZE  0xUL
 #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE  (1024 * 64)
+#define VIRTIO_VSOCK_MIN_PKT_BUF_SIZE  (1024 * 4)
 
 enum {
VSOCK_VQ_RX = 0, /* for host to guest data */
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index af1d2ce12f54..732398b4e28f 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -66,6 +66,31 @@ struct virtio_vsock {
u32 guest_cid;
 };
 
+static unsigned int rx_buf_size = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+
+static int param_set_rx_buf_size(const char *val, const struct kernel_param *kp)
+{
+   unsigned int size;
+   int ret;
+
+   ret = kstrtouint(val, 0, &size);
+   if (ret)
+   return ret;
+
+   if (size < VIRTIO_VSOCK_MIN_PKT_BUF_SIZE ||
+   size > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
+   return -EINVAL;
+
+   return param_set_uint(val, kp);
+};
+
+static const struct kernel_param_ops param_ops_rx_buf_size = {
+   .set = param_set_rx_buf_size,
+   .get = param_get_uint,
+};
+
+module_param_cb(rx_buf_size, &param_ops_rx_buf_size, &rx_buf_size, 0644);
+
 static struct virtio_vsock *virtio_vsock_get(void)
 {
return the_virtio_vsock;
@@ -261,7 +286,7 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
 
 static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
 {
-   int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+   int buf_len = rx_buf_size;
struct virtio_vsock_pkt *pkt;
struct scatterlist hdr, buf, *sgs[2];
struct virtqueue *vq;
-- 
2.20.1



[PATCH v2 6/8] vsock/virtio: change the maximum packet size allowed

2019-05-10 Thread Stefano Garzarella
Now that we are able to split packets, we can avoid limiting
their size to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE.
Instead, we can use VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the maximum
packet size.

Signed-off-by: Stefano Garzarella 
---
 net/vmw_vsock/virtio_transport_common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 3f313bcd6a26..63606525755d 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -219,8 +219,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
vvs = vsk->trans;
 
/* we can send less than pkt_len bytes */
-   if (pkt_len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
-   pkt_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+   if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
+   pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
 
/* virtio_transport_get_credit might return less than pkt_len credit */
pkt_len = virtio_transport_get_credit(vvs, pkt_len);
-- 
2.20.1



[PATCH v2 2/8] vsock/virtio: free packets during the socket release

2019-05-10 Thread Stefano Garzarella
When the socket is released, we should free all packets
queued in the per-socket list in order to avoid a memory
leak.

Signed-off-by: Stefano Garzarella 
---
 net/vmw_vsock/virtio_transport_common.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 0248d6808755..65c8b4a23f2b 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -827,12 +827,20 @@ static bool virtio_transport_close(struct vsock_sock *vsk)
 
 void virtio_transport_release(struct vsock_sock *vsk)
 {
+   struct virtio_vsock_sock *vvs = vsk->trans;
+   struct virtio_vsock_buf *buf;
struct sock *sk = &vsk->sk;
bool remove_sock = true;
 
lock_sock(sk);
if (sk->sk_type == SOCK_STREAM)
remove_sock = virtio_transport_close(vsk);
+   while (!list_empty(&vvs->rx_queue)) {
+   buf = list_first_entry(&vvs->rx_queue,
+  struct virtio_vsock_buf, list);
+   list_del(&buf->list);
+   virtio_transport_free_buf(buf);
+   }
release_sock(sk);
 
if (remove_sock)
-- 
2.20.1



Re: [PATCH 08/10] virtio/s390: add indirection to indicators access

2019-05-10 Thread Halil Pasic
On Fri, 10 May 2019 09:43:08 +0200
Pierre Morel  wrote:

> On 09/05/2019 20:26, Halil Pasic wrote:
> > On Thu, 9 May 2019 14:01:01 +0200
> > Pierre Morel  wrote:
> > 
> >> On 08/05/2019 16:31, Pierre Morel wrote:
> >>> On 26/04/2019 20:32, Halil Pasic wrote:
>  This will come in handy soon when we pull out the indicators from
>  virtio_ccw_device to a memory area that is shared with the hypervisor
>  (in particular for protected virtualization guests).
> 
>  Signed-off-by: Halil Pasic 
>  ---
>     drivers/s390/virtio/virtio_ccw.c | 40
>  +---
>     1 file changed, 25 insertions(+), 15 deletions(-)
> 
>  diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
>  index bb7a92316fc8..1f3e7d56924f 100644
>  --- a/drivers/s390/virtio/virtio_ccw.c
>  +++ b/drivers/s390/virtio/virtio_ccw.c
>  @@ -68,6 +68,16 @@ struct virtio_ccw_device {
>     void *airq_info;
>     };
>  +static inline unsigned long *indicators(struct virtio_ccw_device *vcdev)
>  +{
>  +    return &vcdev->indicators;
>  +}
>  +
>  +static inline unsigned long *indicators2(struct virtio_ccw_device
>  *vcdev)
>  +{
>  +    return &vcdev->indicators2;
>  +}
>  +
>     struct vq_info_block_legacy {
>     __u64 queue;
>     __u32 align;
>  @@ -337,17 +347,17 @@ static void virtio_ccw_drop_indicator(struct
>  virtio_ccw_device *vcdev,
>     ccw->cda = (__u32)(unsigned long) thinint_area;
>     } else {
>     /* payload is the address of the indicators */
>  -    indicatorp = kmalloc(sizeof(&vcdev->indicators),
>  +    indicatorp = kmalloc(sizeof(indicators(vcdev)),
>      GFP_DMA | GFP_KERNEL);
>     if (!indicatorp)
>     return;
>     *indicatorp = 0;
>     ccw->cmd_code = CCW_CMD_SET_IND;
>  -    ccw->count = sizeof(&vcdev->indicators);
>  +    ccw->count = sizeof(indicators(vcdev));
> >>>
> >>> This looks strange to me. Was already weird before.
> >>> Lucky for us, indicators are long...
> >>> maybe just sizeof(long)
> >>
> > 
> > I'm not sure I understand where are you coming from...
> > 
> > With CCW_CMD_SET_IND we tell the hypervisor the guest physical address
> > at which the so called classic indicators. There is a comment that
> > makes this obvious. The argument of the sizeof was and remained a
> > pointer type. AFAIU this is what bothers you.
> >>
> >> AFAIK the size of the indicators (AIV/AIS) is not restricted by the
> >> architecture.
> > 
> > The size of vcdev->indicators is restricted or defined by the virtio
> > specification. Please have a look at '4.3.2.6.1 Setting Up Classic Queue
> > Indicators' here:
> > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-1630002
> > 
> > Since with Linux on s390 only 64 bit is supported, both the sizes are in
> > line with the specification. Using u64 would semantically match the spec
> > better, modulo pre virtio 1.0 which ain't specified. I did not want to
> > do changes that are not necessary for what I'm trying to accomplish. If
> > we want we can change these to u64 with a patch on top.
> 
> I mean you are changing these lines already, so why not do it right 
> while at it?
> 

This patch is about adding the indirection so we can move the member
painlessly. Mixing in different stuff would be a bad practice.

BTW I just explained that it isn't wrong, so I really do not understand
what you mean by 'why not doing it right'. Can you please explain?

Did you agree with the rest of my comment? I mean there was more to it.

Regards,
Halil
