Re: [PATCH 0/5] vhost-scsi: Add support for host virtualized target

2012-09-10 Thread Asias He
Hello Nicholas,

On 09/07/2012 02:48 PM, Nicholas A. Bellinger wrote:
> From: Nicholas Bellinger 
> 
> Hello Anthony & Co,
> 
> This is the fourth installment to add host virtualized target support for
> the mainline tcm_vhost fabric driver using Linux v3.6-rc into QEMU 1.3.0-rc.
> 
> The series is available directly from the following git branch:
> 
>git://git.kernel.org/pub/scm/virt/kvm/nab/qemu-kvm.git vhost-scsi-for-1.3
> 
> Note the code is cut against yesterday's QEMU head, and despite the name
> of the tree is based upon mainline qemu.org git code + has thus far been
> running overnight with > 100K IOPs small block 4k workloads using v3.6-rc2+
> based target code with RAMDISK_DR backstores.

Are you still seeing the performance degradation discussed in the thread

 "vhost-scsi port to v1.1.0 + MSI-X performance regression"

?

> Other than some minor fuzz between jumping from QEMU 1.2.0 -> 1.2.50, this
> series is functionally identical to what's been posted for vhost-scsi RFC-v3
> to qemu-devel.
> 
> Please consider applying these patches for an initial vhost-scsi merge into
> QEMU 1.3.0-rc code, or let us know what else you'd like to see addressed for
> this series in order to merge.
> 
> Thank you!
> 
> --nab
> 
> Nicholas Bellinger (2):
>   monitor: Rename+move net_handle_fd_param -> monitor_handle_fd_param
>   virtio-scsi: Set max_target=0 during vhost-scsi operation
> 
> Stefan Hajnoczi (3):
>   vhost: Pass device path to vhost_dev_init()
>   vhost-scsi: add -vhost-scsi host device for use with tcm-vhost
>   virtio-scsi: Add start/stop functionality for vhost-scsi
> 
>  configure|   10 +++
>  hw/Makefile.objs |1 +
>  hw/qdev-properties.c |   41 +++
>  hw/vhost-scsi.c  |  190 
> ++
>  hw/vhost-scsi.h  |   62 
>  hw/vhost.c   |5 +-
>  hw/vhost.h   |3 +-
>  hw/vhost_net.c   |2 +-
>  hw/virtio-pci.c  |2 +
>  hw/virtio-scsi.c |   55 ++-
>  hw/virtio-scsi.h |1 +
>  monitor.c|   18 +
>  monitor.h|1 +
>  net.c|   18 -
>  net.h|2 -
>  net/socket.c |2 +-
>  net/tap.c|4 +-
>  qemu-common.h|1 +
>  qemu-config.c|   19 +
>  qemu-options.hx  |4 +
>  vl.c |   18 +
>  21 files changed, 431 insertions(+), 28 deletions(-)
>  create mode 100644 hw/vhost-scsi.c
>  create mode 100644 hw/vhost-scsi.h
> 


-- 
Asias


Re: [Xen-devel] [patch 3/3] xen/privcmd: remove const modifier from declaration

2012-09-10 Thread David Vrabel
On 10/09/12 10:03, Jan Beulich wrote:
>>> On 08.09.12 at 11:58, Dan Carpenter  wrote:
>> When we use this pointer, we cast away the const modifier and modify the
>> data.  I think it was an accident to declare it as const.
> 
> NAK - the const is very valid here, as the v2 interface (as opposed
> to the v1 one) does _not_ modify this array (or if it does, it's a
> bug). This is a guarantee made to user mode, so it should also be
> expressed that way in the interface.
> 
> But of course the cast used before this patch isn't right either, as
> it indeed inappropriately discards the qualifier. Afaiu this was done
> to simplify the internal workings of the code, but I don't think it's
> desirable to sacrifice type safety for implementation simplicity.

m.arr here isn't const as m is really the V1 structure where m.arr is
non-const.

David
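
For context on the qualifier question: the cast Jan refers to sits in
privcmd_ioctl_mmap_batch(), where the V1 write-back path reuses m.arr.  A
simplified before/after sketch of keeping the header const while making the
address space explicit (illustrative only; the actual change is the follow-up
"add a __user annotation to a cast" patch further down in this digest):

    /* Cast under discussion: sparse warns because the __user address
     * space is silently dropped (the const is cast away as well). */
    state.user_mfn = (xen_pfn_t *)m.arr;

    /* Sketch of the minimal alternative: annotate the cast instead of
     * de-constifying the interface. */
    state.user_mfn = (xen_pfn_t __user *)m.arr;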


Re: [Xen-devel] [patch 1/3] xen/privcmd: check for integer overflow in ioctl

2012-09-10 Thread David Vrabel
On 08/09/12 10:52, Dan Carpenter wrote:
> If m.num is too large then the "m.num * sizeof(*m.arr)" multiplication
> could overflow and the access_ok() check wouldn't test the right size.

m.num is range checked later on so it doesn't matter that the
access_ok() checks might be wrong.  A bit subtle, perhaps.

David

> Signed-off-by: Dan Carpenter 
> ---
> Only needed in linux-next.
> 
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 215a3c0..fdff8f9 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -325,6 +325,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, 
> int version)
>   return -EFAULT;
>   /* Returns per-frame error in m.arr. */
>   m.err = NULL;
> + if (m.num > SIZE_MAX / sizeof(*m.arr))
> + return -EINVAL;
>   if (!access_ok(VERIFY_WRITE, m.arr, m.num * sizeof(*m.arr)))
>   return -EFAULT;
>   break;
> @@ -332,6 +334,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, 
> int version)
>   if (copy_from_user(&m, udata, sizeof(struct 
> privcmd_mmapbatch_v2)))
>   return -EFAULT;
>   /* Returns per-frame error code in m.err. */
> + if (m.num > SIZE_MAX / sizeof(*m.err))
> + return -EINVAL;
>   if (!access_ok(VERIFY_WRITE, m.err, m.num * (sizeof(*m.err
>   return -EFAULT;
>   break;
> 
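
For readers skimming the patch: the added check is the standard guard against
the byte-count computation wrapping before access_ok() sees it - divide first,
reject anything that would overflow.  A stand-alone illustration of the idiom
in user-space C (hypothetical helper name, not the kernel code):

    #include <stdint.h>
    #include <stddef.h>
    #include <errno.h>

    /* Compute nmemb * elem_size into *bytes, or return -EINVAL if the
     * multiplication would not fit in size_t. */
    static int checked_array_size(size_t nmemb, size_t elem_size, size_t *bytes)
    {
            if (elem_size && nmemb > SIZE_MAX / elem_size)
                    return -EINVAL;         /* would wrap around */
            *bytes = nmemb * elem_size;
            return 0;
    }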



Re: [patch 3/3] xen/privcmd: remove const modifier from declaration

2012-09-10 Thread Andres Lagar-Cavilla
On Sep 8, 2012, at 5:58 AM, Dan Carpenter wrote:

> When we use this pointer, we cast away the const modifier and modify the
> data.  I think it was an accident to declare it as const.
> 
> Signed-off-by: Dan Carpenter 
Acked-by: Andres Lagar-Cavilla 

Thanks for the careful review & hardening.
Andres

> 
> diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h
> index a853168..58ed953 100644
> --- a/include/xen/privcmd.h
> +++ b/include/xen/privcmd.h
> @@ -69,7 +69,7 @@ struct privcmd_mmapbatch_v2 {
>   unsigned int num; /* number of pages to populate */
>   domid_t dom;  /* target domain */
>   __u64 addr;   /* virtual address */
> - const xen_pfn_t __user *arr; /* array of mfns */
> + xen_pfn_t __user *arr; /* array of mfns */
>   int __user *err;  /* array of error codes */
> };
> 
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 0ce006a..fceb83e 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -389,7 +389,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, 
> int version)
> 
>   if (state.global_error && (version == 1)) {
>   /* Write back errors in second pass. */
> - state.user_mfn = (xen_pfn_t *)m.arr;
> + state.user_mfn = m.arr;
>   state.err  = err_array;
>   ret = traverse_pages(m.num, sizeof(xen_pfn_t),
>&pagelist, mmap_return_errors_v1, &state);



Re: [patch 2/3] xen/privcmd: return -EFAULT on error

2012-09-10 Thread Andres Lagar-Cavilla

On Sep 8, 2012, at 5:57 AM, Dan Carpenter wrote:

> __copy_to_user() returns the number of bytes remaining to be copied but
> we want to return a negative error code here.
> 
> Signed-off-by: Dan Carpenter 
Acked-by: Andres Lagar-Cavilla 
> 
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index fdff8f9..0ce006a 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -393,8 +393,11 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, 
> int version)
>   state.err  = err_array;
>   ret = traverse_pages(m.num, sizeof(xen_pfn_t),
>&pagelist, mmap_return_errors_v1, &state);
> - } else if (version == 2)
> + } else if (version == 2) {
>   ret = __copy_to_user(m.err, err_array, m.num * sizeof(int));
> + if (ret)
> + ret = -EFAULT;
> + }
> 
>   /* If we have not had any EFAULT-like global errors then set the global
>* error to -ENOENT if necessary. */



Re: [patch 1/3] xen/privcmd: check for integer overflow in ioctl

2012-09-10 Thread Andres Lagar-Cavilla
On Sep 8, 2012, at 5:52 AM, Dan Carpenter wrote:

> If m.num is too large then the "m.num * sizeof(*m.arr)" multiplication
> could overflow and the access_ok() check wouldn't test the right size.
> 
> Signed-off-by: Dan Carpenter 
Acked-by: Andres Lagar-Cavilla 
> ---
> Only needed in linux-next.
> 
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 215a3c0..fdff8f9 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -325,6 +325,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, 
> int version)
>   return -EFAULT;
>   /* Returns per-frame error in m.arr. */
>   m.err = NULL;
> + if (m.num > SIZE_MAX / sizeof(*m.arr))
> + return -EINVAL;
>   if (!access_ok(VERIFY_WRITE, m.arr, m.num * sizeof(*m.arr)))
>   return -EFAULT;
>   break;
> @@ -332,6 +334,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, 
> int version)
>   if (copy_from_user(&m, udata, sizeof(struct 
> privcmd_mmapbatch_v2)))
>   return -EFAULT;
>   /* Returns per-frame error code in m.err. */
> + if (m.num > SIZE_MAX / sizeof(*m.err))
> + return -EINVAL;
>   if (!access_ok(VERIFY_WRITE, m.err, m.num * (sizeof(*m.err
>   return -EFAULT;
>   break;



Re: [RFC 1/2] virtio_console: Add support for DMA memory allocation

2012-09-10 Thread Michael S. Tsirkin
On Thu, Sep 06, 2012 at 08:27:35AM +0300, Michael S. Tsirkin wrote:
> On Thu, Sep 06, 2012 at 11:34:25AM +0930, Rusty Russell wrote:
> > "Michael S. Tsirkin"  writes:
> > > On Tue, Sep 04, 2012 at 06:58:47PM +0200, Sjur Brændeland wrote:
> > >> Hi Michael,
> > >> 
> > >> > Exactly. Though if we just fail load it will be much less code.
> > >> >
> > >> > Generally, using a feature bit for this is a bit of a problem though:
> > >> > normally driver is expected to be able to simply ignore
> > >> > a feature bit. In this case driver is required to
> > >> > do something so a feature bit is not a good fit.
> > >> > I am not sure what the right thing to do is.
> > >> 
> > >> I see - so in order to avoid the binding between driver and device
> > >> there are two options I guess. Either make virtio_dev_match() or
> > >> virtcons_probe() fail. Neither of them seems like the obvious choice.
> > >> 
> > >> Maybe adding a check for VIRTIO_CONSOLE_F_DMA_MEM match
> > >> between device and driver in virtcons_probe() is the lesser evil?
> > >> 
> > >> Regards,
> > >> Sjur
> > >
> > > A simplest thing to do is change dev id. rusty?
> > 
> > For generic usage, this is correct.  But my opinion is that fallback on
> > feature non-ack is quality-of-implementation issue: great to have, but
> > there are cases where you just want to fail with "you're too old".
> > 
> > And in this case, an old system simply will never work.  So it's a
> > question of how graceful the failure is.
> > 
> > Can your userspace loader can refuse to proceed if the driver doesn't
> > ack the bits?  If so, it's simpler than a whole new ID.
> > 
> > Cheers,
> > Rusty.
> 
> Yes but how can it signal guest that it will never proceed?
> 
> Also grep for BUG_ON in core found this:
> 
>drv->remove(dev);
> 
>/* Driver should have reset device. */
>BUG_ON(dev->config->get_status(dev));
> 
> I think below is what Sjur refers to.
> I think below is a good idea for 3.6. Thoughts?
> 
> --->
> 
> virtio: don't crash when device is buggy
> 
> Because of a sanity check in virtio_dev_remove, a buggy device can crash
> kernel.  And in case of rproc it's userspace so it's not a good idea.
> We are unloading a driver so how bad can it be?
> Be less aggressive in handling this error: if it's a driver bug,
> warning once should be enough.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> --

Rusty?

> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> index c3b3f7f..1e8659c 100644
> --- a/drivers/virtio/virtio.c
> +++ b/drivers/virtio/virtio.c
> @@ -159,7 +159,7 @@ static int virtio_dev_remove(struct device *_d)
>   drv->remove(dev);
>  
>   /* Driver should have reset device. */
> - BUG_ON(dev->config->get_status(dev));
> + WARN_ON_ONCE(dev->config->get_status(dev));
>  
>   /* Acknowledge the device's existence again. */
>   add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> 
> -- 
> MST


Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 04:49:40PM -0400, Mike Waychison wrote:
> On Mon, Sep 10, 2012 at 3:59 PM, Michael S. Tsirkin  wrote:
> > On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
> >> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin  
> >> wrote:
> >> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> >> >> This implementation of a virtio balloon driver uses the page cache to
> >> >> "store" pages that have been released to the host.  The communication
> >> >> (outside of target counts) is one way--the guest notifies the host when
> >> >> it adds a page to the page cache, allowing the host to madvise(2) with
> >> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> >> >> (via the regular page reclaim).  This means that inflating the balloon
> >> >> is similar to the existing balloon mechanism, but the deflate is
> >> >> different--it re-uses existing Linux kernel functionality to
> >> >> automatically reclaim.
> >> >>
> >> >> Signed-off-by: Frank Swiderski 
> >>
> >> Hi Michael,
> >>
> >> I'm very sorry that Frank and I have been silent on these threads.
> >> I've been out of the office and Frank has been been swamped :)
> >>
> >> I'll take a stab at answering some of your questions below, and
> >> hopefully we can end up on the same page.
> >>
> >> > I've been trying to understand this, and I have
> >> > a question: what exactly is the benefit
> >> > of this new device?
> >>
> >> The key difference between this device/driver and the pre-existing
> >> virtio_balloon device/driver is in how the memory pressure loop is
> >> controlled.
> >>
> >> With the pre-existing balloon device/driver, the control loop for how
> >> much memory a given VM is allowed to use is controlled completely by
> >> the host.  This is probably fine if the goal is to pack as much work
> >> on a given host as possible, but it says nothing about the expected
> >> performance that any given VM is expecting to have.  Specifically, it
> >> allows the host to set a target goal for the size of a VM, and the
> >> driver in the guest does whatever is needed to get to that goal.  This
> >> is great for systems where one wants to "grow or shrink" a VM from the
> >> outside.
> >>
> >>
> >> This behaviour however doesn't match what applications actually expect
> >> from a memory control loop however.  In a native setup, an application
> >> can usually expect to allocate memory from the kernel on an as-needed
> >> basis, and can in turn return memory back to the system (using a heap
> >> implementation that actually releases memory that is).  The dynamic
> >> size of an application is completely controlled by the application,
> >> and there is very little that cluster management software can do to
> >> ensure that the application fits some prescribed size.
> >>
> >> We recognized this in the development of our cluster management
> >> software long ago, so our systems are designed for managing tasks that
> >> have a dynamic memory footprint.  Overcommit is possible (as most
> >> applications do not use the full reservation of memory they asked for
> >> originally), letting us do things like schedule lower priority/lower
> >> service-classification work using resources that are otherwise
> >> available in stand-by for high-priority/low-latency workloads.
> >
> > OK I am not sure I got this right so pls tell me if this summary is
> > correct (note: this does not talk about what guest does with memory,
> > just what it is that device does):
> >
> > - existing balloon is told lower limit on target size by host and pulls in 
> > at least
> >   target size. Guest can inflate > target size if it likes
> >   and then it is OK to deflate back to target size but not less.
> 
> Is this true?   I take it nothing is keeping the existing balloon
> driver from going over the target, but the same can be said about
> either balloon implementation.
> 
> > - your balloon is told upper limit on target size by host and pulls at most
> >   target size. Guest can deflate down to 0 at any point.
> >
> > If so I think both approaches make sense and in fact they
> > can be useful at the same time for the same guest.
> > In that case, I see two ways how this can be done:
> >
> > 1. two devices: existing balloon + cache balloon
> > 2. add "upper limit" to existing balloon
> >
> > A single device looks a bit more natural in that we don't
> > really care in which balloon a page is as long as we
> > are between lower and upper limit. Right?
> 
> I agree that this may be better done using a single device if possible.

I am not sure myself, just asking.

> > From implementation POV we could have it use
> > pagecache for pages above lower limit but that
> > is a separate question about driver design,
> > I would like to make sure I understand the high level design first.

Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Mike Waychison
On Mon, Sep 10, 2012 at 3:59 PM, Michael S. Tsirkin  wrote:
> On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
>> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin  wrote:
>> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
>> >> This implementation of a virtio balloon driver uses the page cache to
>> >> "store" pages that have been released to the host.  The communication
>> >> (outside of target counts) is one way--the guest notifies the host when
>> >> it adds a page to the page cache, allowing the host to madvise(2) with
>> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
>> >> (via the regular page reclaim).  This means that inflating the balloon
>> >> is similar to the existing balloon mechanism, but the deflate is
>> >> different--it re-uses existing Linux kernel functionality to
>> >> automatically reclaim.
>> >>
>> >> Signed-off-by: Frank Swiderski 
>>
>> Hi Michael,
>>
>> I'm very sorry that Frank and I have been silent on these threads.
>> I've been out of the office and Frank has been been swamped :)
>>
>> I'll take a stab at answering some of your questions below, and
>> hopefully we can end up on the same page.
>>
>> > I've been trying to understand this, and I have
>> > a question: what exactly is the benefit
>> > of this new device?
>>
>> The key difference between this device/driver and the pre-existing
>> virtio_balloon device/driver is in how the memory pressure loop is
>> controlled.
>>
>> With the pre-existing balloon device/driver, the control loop for how
>> much memory a given VM is allowed to use is controlled completely by
>> the host.  This is probably fine if the goal is to pack as much work
>> on a given host as possible, but it says nothing about the expected
>> performance that any given VM is expecting to have.  Specifically, it
>> allows the host to set a target goal for the size of a VM, and the
>> driver in the guest does whatever is needed to get to that goal.  This
>> is great for systems where one wants to "grow or shrink" a VM from the
>> outside.
>>
>>
>> This behaviour however doesn't match what applications actually expect
>> from a memory control loop however.  In a native setup, an application
>> can usually expect to allocate memory from the kernel on an as-needed
>> basis, and can in turn return memory back to the system (using a heap
>> implementation that actually releases memory that is).  The dynamic
>> size of an application is completely controlled by the application,
>> and there is very little that cluster management software can do to
>> ensure that the application fits some prescribed size.
>>
>> We recognized this in the development of our cluster management
>> software long ago, so our systems are designed for managing tasks that
>> have a dynamic memory footprint.  Overcommit is possible (as most
>> applications do not use the full reservation of memory they asked for
>> originally), letting us do things like schedule lower priority/lower
>> service-classification work using resources that are otherwise
>> available in stand-by for high-priority/low-latency workloads.
>
> OK I am not sure I got this right so pls tell me if this summary is
> correct (note: this does not talk about what guest does with memory,
> just what it is that device does):
>
> - existing balloon is told lower limit on target size by host and pulls in at 
> least
>   target size. Guest can inflate > target size if it likes
>   and then it is OK to deflate back to target size but not less.

Is this true?   I take it nothing is keeping the existing balloon
driver from going over the target, but the same can be said about
either balloon implementation.

> - your balloon is told upper limit on target size by host and pulls at most
>   target size. Guest can deflate down to 0 at any point.
>
> If so I think both approaches make sense and in fact they
> can be useful at the same time for the same guest.
> In that case, I see two ways how this can be done:
>
> 1. two devices: existing balloon + cache balloon
> 2. add "upper limit" to existing balloon
>
> A single device looks a bit more natural in that we don't
> really care in which balloon a page is as long as we
> are between lower and upper limit. Right?

I agree that this may be better done using a single device if possible.

> From implementation POV we could have it use
> pagecache for pages above lower limit but that
> is a separate question about driver design,
> I would like to make sure I understand the high
> level design first.

I agree that this is an implementation detail that is separate from
discussions of high and low limits.  That said, there are several
advantages to pushing these pages to the 

Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin  wrote:
> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> >> This implementation of a virtio balloon driver uses the page cache to
> >> "store" pages that have been released to the host.  The communication
> >> (outside of target counts) is one way--the guest notifies the host when
> >> it adds a page to the page cache, allowing the host to madvise(2) with
> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> >> (via the regular page reclaim).  This means that inflating the balloon
> >> is similar to the existing balloon mechanism, but the deflate is
> >> different--it re-uses existing Linux kernel functionality to
> >> automatically reclaim.
> >>
> >> Signed-off-by: Frank Swiderski 
> 
> Hi Michael,
> 
> I'm very sorry that Frank and I have been silent on these threads.
> I've been out of the office and Frank has been been swamped :)
> 
> I'll take a stab at answering some of your questions below, and
> hopefully we can end up on the same page.
> 
> > I've been trying to understand this, and I have
> > a question: what exactly is the benefit
> > of this new device?
> 
> The key difference between this device/driver and the pre-existing
> virtio_balloon device/driver is in how the memory pressure loop is
> controlled.
> 
> With the pre-existing balloon device/driver, the control loop for how
> much memory a given VM is allowed to use is controlled completely by
> the host.  This is probably fine if the goal is to pack as much work
> on a given host as possible, but it says nothing about the expected
> performance that any given VM is expecting to have.  Specifically, it
> allows the host to set a target goal for the size of a VM, and the
> driver in the guest does whatever is needed to get to that goal.  This
> is great for systems where one wants to "grow or shrink" a VM from the
> outside.
> 
> 
> This behaviour however doesn't match what applications actually expect
> from a memory control loop however.  In a native setup, an application
> can usually expect to allocate memory from the kernel on an as-needed
> basis, and can in turn return memory back to the system (using a heap
> implementation that actually releases memory that is).  The dynamic
> size of an application is completely controlled by the application,
> and there is very little that cluster management software can do to
> ensure that the application fits some prescribed size.
> 
> We recognized this in the development of our cluster management
> software long ago, so our systems are designed for managing tasks that
> have a dynamic memory footprint.  Overcommit is possible (as most
> applications do not use the full reservation of memory they asked for
> originally), letting us do things like schedule lower priority/lower
> service-classification work using resources that are otherwise
> available in stand-by for high-priority/low-latency workloads.

OK I am not sure I got this right so pls tell me if this summary is
correct (note: this does not talk about what guest does with memory, 
just what it is that device does):

- existing balloon is told lower limit on target size by host and pulls in at 
least
  target size. Guest can inflate > target size if it likes
  and then it is OK to deflate back to target size but not less.
- your balloon is told upper limit on target size by host and pulls at most
  target size. Guest can deflate down to 0 at any point.

If so I think both approaches make sense and in fact they
can be useful at the same time for the same guest.
In that case, I see two ways how this can be done:

1. two devices: existing balloon + cache balloon
2. add "upper limit" to existing balloon

A single device looks a bit more natural in that we don't
really care in which balloon a page is as long as we
are between lower and upper limit. Right?
From implementation POV we could have it use
pagecache for pages above lower limit but that
is a separate question about driver design,
I would like to make sure I understand the high
level design first.
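
To make option 2 above concrete, one purely hypothetical way to express both
bounds in a single device would be an extra field in the balloon config space
next to the existing target; this is only an illustration of the idea, not a
proposed or existing interface:

    /* Hypothetical sketch -- NOT the current virtio_balloon_config. */
    struct virtio_balloon_config {
            __le32 num_pages;     /* existing: target set by the host (the
                                   * "lower limit" in the summary above) */
            __le32 actual;        /* existing: pages actually ballooned */
            __le32 num_pages_max; /* hypothetical: upper limit the guest may
                                   * inflate to on its own */
    };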





> >
> > Note that users could not care less about how a driver
> > is implemented internally.
> >
> > Is there some workload where you see VM working better with
> > this than regular balloon? Any numbers?
> 
> This device is less about performance as it is about getting the
> memory size of a job (or in this case, a job in a VM) to grow and
> shrink as the application workload sees fit, much like how processes
> today can grow and shrink without external direction.

Still, e.g. swap in host achieves more or less the same functionality.
I am guessing balloon can work better by getting more cooperation
from guest but aren't there any tests showing this is true in practice?


> >
> > Also, can't we just replace existing balloon implementation
> > with this one?
> 
> Perhaps, but as described above, both devices have very diff

Re: [PATCHv4] virtio-spec: virtio network device multiqueue support

2012-09-10 Thread Rick Jones

On 09/09/2012 07:12 PM, Rusty Russell wrote:

OK, I read the spec (pasted below for easy of reading), but I'm still
confused over how this will work.

I thought normal net drivers have the hardware provide an rxhash for
each packet, and we map that to CPU to queue the packet on[1].  We hope
that the receiving process migrates to that CPU, so xmit queue
matches.

For virtio this would mean a new per-packet rxhash value, right?

Why are we doing something different?  What am I missing?

Thanks,
Rusty.
[1] Everything I Know About Networking I Learned From LWN:
 https://lwn.net/Articles/362339/


In my taxonomy at least, "multi-queue" predates RPS and RFS and is 
simply where the NIC via some means, perhaps a headers hash, separates 
incoming frames to different queues.


RPS can be thought of as doing something similar inside the host.  That 
could be used to get a spread from an otherwise "dumb" NIC (certainly 
that is what one of its predecessors - Inbound Packet Scheduling - used 
it for in HP-UX 10.20), or it could be used to augment the multi-queue 
support of a not-so-dumb NIC - say if said NIC had a limit of queues 
that was rather lower than the number of cores/threads in the host. 
Indeed some driver/NIC combinations provide a hash value to the host for 
the host to use as it sees fit.


However, there is still the matter of a single thread of an application 
servicing multiple connections, each of which would hash to different 
locations.


RFS (Receive Flow Steering) then goes one step further, and looks up 
where the flow endpoint was last accessed and steers the traffic there. 
 The idea being that a thread of execution servicing multiple flows 
will have the traffic of those flows sent to the same place.  It then 
allows the scheduler to decide where things should be run rather than 
the networking code.


rick jones



Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Mike Waychison
On Mon, Sep 10, 2012 at 2:04 PM, Rik van Riel  wrote:
> On 09/10/2012 01:37 PM, Mike Waychison wrote:
>>
>> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin 
>> wrote:
>
>
>>> Also can you pls answer Avi's question?
>>> How is overcommit managed?
>>
>>
>> Overcommit in our deployments is managed using memory cgroups on the
>> host.  This allows us to have very directed policies as to how
>> competing VMs on a host may overcommit.
>
>
> How do your memory cgroups lead to guests inflating their balloons?

The control loop that is driving the cgroup on the host can still move
the balloon target page count causing the balloon in the guest to try
and inflate.  This allows the host to effectively slowly grow the
balloon in the guest, allowing reclaim of guest free memory, followed
by guest page cache (and memory on the host system).  This can then be
compared with the subsequent growth (as this balloon setup allows the
guest to grow as it sees fit), which in effect gives us a memory
pressure indicator on the host, allowing it to back-off shrinking the
guest if the guest balloon quickly deflates.

The net effect is an opportunistic release of memory from the guest
back to the host, and the ability to quickly grow a VM's memory
footprint as the workload within it requires.

This dynamic memory sizing of VMs is much more in line with what we
can expect from native tasks today (and which is what our resource
management systems are designed to handle).


Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Rik van Riel

On 09/10/2012 01:37 PM, Mike Waychison wrote:

On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin  wrote:



Also can you pls answer Avi's question?
How is overcommit managed?


Overcommit in our deployments is managed using memory cgroups on the
host.  This allows us to have very directed policies as to how
competing VMs on a host may overcommit.


How do your memory cgroups lead to guests inflating their balloons?

--
All rights reversed


Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Mike Waychison
On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin  wrote:
> On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
>> This implementation of a virtio balloon driver uses the page cache to
>> "store" pages that have been released to the host.  The communication
>> (outside of target counts) is one way--the guest notifies the host when
>> it adds a page to the page cache, allowing the host to madvise(2) with
>> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
>> (via the regular page reclaim).  This means that inflating the balloon
>> is similar to the existing balloon mechanism, but the deflate is
>> different--it re-uses existing Linux kernel functionality to
>> automatically reclaim.
>>
>> Signed-off-by: Frank Swiderski 

Hi Michael,

I'm very sorry that Frank and I have been silent on these threads.
I've been out of the office and Frank has been been swamped :)

I'll take a stab at answering some of your questions below, and
hopefully we can end up on the same page.

> I've been trying to understand this, and I have
> a question: what exactly is the benefit
> of this new device?

The key difference between this device/driver and the pre-existing
virtio_balloon device/driver is in how the memory pressure loop is
controlled.

With the pre-existing balloon device/driver, the control loop for how
much memory a given VM is allowed to use is controlled completely by
the host.  This is probably fine if the goal is to pack as much work
on a given host as possible, but it says nothing about the expected
performance that any given VM is expecting to have.  Specifically, it
allows the host to set a target goal for the size of a VM, and the
driver in the guest does whatever is needed to get to that goal.  This
is great for systems where one wants to "grow or shrink" a VM from the
outside.


This behaviour however doesn't match what applications actually expect
from a memory control loop however.  In a native setup, an application
can usually expect to allocate memory from the kernel on an as-needed
basis, and can in turn return memory back to the system (using a heap
implementation that actually releases memory that is).  The dynamic
size of an application is completely controlled by the application,
and there is very little that cluster management software can do to
ensure that the application fits some prescribed size.

We recognized this in the development of our cluster management
software long ago, so our systems are designed for managing tasks that
have a dynamic memory footprint.  Overcommit is possible (as most
applications do not use the full reservation of memory they asked for
originally), letting us do things like schedule lower priority/lower
service-classification work using resources that are otherwise
available in stand-by for high-priority/low-latency workloads.

>
> Note that users could not care less about how a driver
> is implemented internally.
>
> Is there some workload where you see VM working better with
> this than regular balloon? Any numbers?

This device is less about performance as it is about getting the
memory size of a job (or in this case, a job in a VM) to grow and
shrink as the application workload sees fit, much like how processes
today can grow and shrink without external direction.

>
> Also, can't we just replace existing balloon implementation
> with this one?

Perhaps, but as described above, both devices have very different
characteristics.

> Why it is so important to deflate silently?

It may not be so important to deflate silently.  I'm not sure why it
is important that we deflate "loudly" though either :)  Doing so seems
like unnecessary guest/host communication IMO, especially if the guest
is expecting to be able to grow to totalram (and the host isn't able
to nack any pages reclaimed anyway...).
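
For readers who have not looked at the device side: the host-side release the
patch description above refers to is madvise(2) with MADV_DONTNEED on the
backing memory of the pages the guest reported.  A minimal host-side sketch in
plain C (not the actual QEMU code):

    #include <sys/mman.h>
    #include <stdio.h>

    /* Drop the host backing for a guest page range the guest has ballooned.
     * For anonymous memory, later accesses fault in zero pages again. */
    static int drop_guest_range(void *host_virtual_addr, size_t len)
    {
            if (madvise(host_virtual_addr, len, MADV_DONTNEED) != 0) {
                    perror("madvise(MADV_DONTNEED)");
                    return -1;
            }
            return 0;
    }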

> I guess filesystem does not currently get a callback
> before page is reclaimed but this is an implementation detail -
> maybe this can be fixed?

I do not follow this question.

>
> Also can you pls answer Avi's question?
> How is overcommit managed?

Overcommit in our deployments is managed using memory cgroups on the
host.  This allows us to have very directed policies as to how
competing VMs on a host may overcommit.

>
>
>> ---
>>  drivers/virtio/Kconfig  |   13 +
>>  drivers/virtio/Makefile |1 +
>>  drivers/virtio/virtio_fileballoon.c |  636 
>> +++
>>  include/linux/virtio_balloon.h  |9 +
>>  include/linux/virtio_ids.h  |1 +
>>  5 files changed, 660 insertions(+), 0 deletions(-)
>>  create mode 100644 drivers/virtio/virtio_fileballoon.c
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index f38b17a..cffa2a7 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -35,6 +35,19 @@ config VIRTIO_BALLOON
>>
>>If unsure, say M.
>>
>> +config VIRTIO_FILEBALLOON
>> + tristate "Virtio page cache-backed balloon

Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-10 Thread Thomas Lendacky
On Friday, September 07, 2012 09:19:04 AM Rusty Russell wrote:
> "Michael S. Tsirkin"  writes:
> > On Thu, Sep 06, 2012 at 05:27:23PM +0930, Rusty Russell wrote:
> >> "Michael S. Tsirkin"  writes:
> >> > Yes without checksum net core always linearizes packets, so yes it is
> >> > screwed.
> >> > For -net, skb always allocates space for 17 frags + linear part so
> >> > it seems sane to do same in virtio core, and allocate, for -net,
> >> > up to max_frags + 1 from cache.
> >> > We can adjust it: no _SG -> 2 otherwise 18.
> >> 
> >> But I thought it used individual buffers these days?
> > 
> > Yes for receive, no for transmit. That's probably why
> > we should have the threshold per vq, not per device, BTW.
> 
> Can someone actually run with my histogram patch and see what the real
> numbers are?

Somehow some HTML got in my first reply, resending...

I ran some TCP_RR and TCP_STREAM sessions, both host-to-guest and
guest-to-host, with a form of the histogram patch applied against a
RHEL6.3 kernel. The histogram values were reset after each test.

Here are the results:

60 session TCP_RR from host-to-guest with 256 byte request and 256 byte
response for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=7818456):
 1: 7818456 
Size distribution for output (max=7816698):
 2: 149 
 3: 7816698 
 4: 2   
 5: 1   
Size distribution for control (max=1):
 0: 0


4 session TCP_STREAM from host-to-guest with 4K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=16050941):
 1: 16050941 
Size distribution for output (max=1877796):
 2: 1877796 
 3: 5   
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from host-to-guest with 16K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=16831151):
 1: 16831151 
Size distribution for output (max=1923965):
 2: 1923965 
 3: 5   
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from guest-to-host with 4K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=1316069):
 1: 1316069 
Size distribution for output (max=879213):
 2: 24  
 3: 24097   #
 4: 23176   #
 5: 3412
 6: 4446
 7: 4663
 8: 4195
 9: 3772
10: 3388
11: 3666
12: 2885
13: 2759
14: 2997
15: 3060
16: 2651
17: 2235
18: 92721   ##  
19: 879213  
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from guest-to-host with 16K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=1428590):
 1: 1428590 
Size distribution for output (max=957774):
 2: 20
 3: 54955   ###
 4: 34281   ##
 5: 2967
 6: 3394
 7: 9400
 8: 3061
 9: 3397
10: 3258
11: 3275
12: 3147
13: 2876
14: 2747
15: 2832
16: 2013
17: 1670
18: 100369  ##
19: 957774  
Size distribution for control (max=1):
 0: 0

Thanks,
Tom

> 
> I'm not convinced that the ideal 17-buffer case actually happens as much
> as we think.  And if it's not happening with this netperf test, we're
> testing the wrong thing.
> 
> Thanks,
> Rusty.
> 



Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 10:47:15AM -0500, Thomas Lendacky wrote:
> On Friday, September 07, 2012 09:19:04 AM Rusty Russell wrote:
> > "Michael S. Tsirkin"  writes:
> > > On Thu, Sep 06, 2012 at 05:27:23PM +0930, Rusty Russell wrote:
> > >> "Michael S. Tsirkin"  writes:
> > >> > Yes without checksum net core always linearizes packets, so yes it is
> > >> > screwed.
> > >> > For -net, skb always allocates space for 17 frags + linear part so
> > >> > it seems sane to do same in virtio core, and allocate, for -net,
> > >> > up to max_frags + 1 from cache.
> > >> > We can adjust it: no _SG -> 2 otherwise 18.
> > >>
> > >> But I thought it used individual buffers these days?
> > >
> > > Yes for receive, no for transmit. That's probably why
> > > we should have the threshold per vq, not per device, BTW.
> >
> > Can someone actually run with my histogram patch and see what the real
> > numbers are?
>
> I ran some TCP_RR and TCP_STREAM sessions, both host-to-guest and
> guest-to-host, with a form of the histogram patch applied against a
> RHEL6.3 kernel. The histogram values were reset after each test.
>
> Here are the results:
>
> 60 session TCP_RR from host-to-guest with 256 byte request and 256 byte
> response for 60 seconds:
>
> Queue histogram for virtio1:
> Size distribution for input (max=7818456):
> 1: 7818456
> Size distribution for output (max=7816698):
> 2: 149
> 3: 7816698

Here, a threshold would help.

> 4: 2
> 5: 1
> Size distribution for control (max=1):
> 0: 0
>
> 4 session TCP_STREAM from host-to-guest with 4K message size for 60 seconds:
>
> Queue histogram for virtio1:
> Size distribution for input (max=16050941):
> 1: 16050941
> Size distribution for output (max=1877796):
> 2: 1877796
> 3: 5
> Size distribution for control (max=1):
> 0: 0
>
> 4 session TCP_STREAM from host-to-guest with 16K message size for 60 seconds:
>
> Queue histogram for virtio1:
> Size distribution for input (max=16831151):
> 1: 16831151
> Size distribution for output (max=1923965):
> 2: 1923965
> 3: 5
> Size distribution for control (max=1):
> 0: 0

Hmm for virtio net output we do always use 2 s/g, this is because of a
qemu bug. Maybe it's time we fixed this, added a feature bit?
This would fix above without threshold hacks.
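
For anyone wondering where the constant "2" in the output column comes from:
the virtio-net transmit path always places the virtio_net_hdr in its own
scatterlist entry ahead of the packet data, so even a fully linear skb costs
two descriptors.  Roughly (a sketch of the shape, not the exact driver code):

    /* sg[0] carries the virtio-net header, sg[1..] the packet data. */
    sg_set_buf(&sg[0], &hdr, sizeof(hdr));
    num = skb_to_sgvec(skb, &sg[1], 0, skb->len);
    virtqueue_add_buf(vq, sg, num + 1, 0, skb, GFP_ATOMIC);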


> 4 session TCP_STREAM from guest-to-host with 4K message size for 60 seconds:
>
> Queue histogram for virtio1:
> Size distribution for input (max=1316069):
> 1: 1316069
> Size distribution for output (max=879213):
> 2: 24
> 3: 24097 #
> 4: 23176 #
> 5: 3412
> 6: 4446
> 7: 4663
> 8: 4195
> 9: 3772
> 10: 3388
> 11: 3666
> 12: 2885
> 13: 2759
> 14: 2997
> 15: 3060
> 16: 2651
> 17: 2235
> 18: 92721 ##
> 19: 879213
> Size distribution for control (max=1):
> 0: 0
>
> 4 session TCP_STREAM from guest-to-host with 16K message size for 60 seconds:
>
> Queue histogram for virtio1:
> Size distribution for input (max=1428590):
> 1: 1428590
> Size distribution for output (max=957774):
> 2: 20
> 3: 54955 ###
> 4: 34281 ##
> 5: 2967
> 6: 3394
> 7: 9400
> 8: 3061
> 9: 3397
> 10: 3258
> 11: 3275
> 12: 3147
> 13: 2876
> 14: 2747
> 15: 2832
> 16: 2013
> 17: 1670
> 18: 100369 ##
> 19: 957774
> Size distribution for control (max=1):
> 0: 0
>
> Thanks,
> Tom

In these tests we would have to set threshold pretty high.
I wonder whether the following makes any difference: the idea is to
A. get less false cache sharing by allocating full cache lines
B. better locality by using same cache for multiple sizes

So we get some of the wins of the threshold without bothering
with a cache.

Will try to test but not until later this week.

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5aa43c3..c184712 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -132,7 +132,8 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
unsigned head;
int i;
 
-   desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
+   desc = k

Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-10 Thread Paolo Bonzini
Il 06/09/2012 07:02, Michael S. Tsirkin ha scritto:
>> > It might be worth just unconditionally having a cache for the 2
>> > descriptor case.  This is what I get with qemu tap, though for some
>> > reason the device features don't have guest or host CSUM, so my setup is
>> > probably screwed:
> Yes without checksum net core always linearizes packets, so yes it is
> screwed.
> For -net, skb always allocates space for 17 frags + linear part so
> it seems sane to do same in virtio core, and allocate, for -net,
> up to max_frags + 1 from cache.
> We can adjust it: no _SG -> 2 otherwise 18.
> 
> Not sure about other drivers, maybe really use 2 there for now.

2 should also be good for virtio-blk and virtio-scsi 4KB random rw
workloads.

Paolo



Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-10 Thread Thomas Lendacky
On Friday, September 07, 2012 09:19:04 AM Rusty Russell wrote:
> "Michael S. Tsirkin"  writes:
> > On Thu, Sep 06, 2012 at 05:27:23PM +0930, Rusty Russell wrote:
> >> "Michael S. Tsirkin"  writes:
> >> > Yes without checksum net core always linearizes packets, so yes it is
> >> > screwed.
> >> > For -net, skb always allocates space for 17 frags + linear part so
> >> > it seems sane to do same in virtio core, and allocate, for -net,
> >> > up to max_frags + 1 from cache.
> >> > We can adjust it: no _SG -> 2 otherwise 18.
> >> 
> >> But I thought it used individual buffers these days?
> > 
> > Yes for receive, no for transmit. That's probably why
> > we should have the threshold per vq, not per device, BTW.
> 
> Can someone actually run with my histogram patch and see what the real
> numbers are?
> 

I ran some TCP_RR and TCP_STREAM sessions, both host-to-guest and
guest-to-host, with a form of the histogram patch applied against a
RHEL6.3 kernel. The histogram values were reset after each test.

Here are the results:

60 session TCP_RR from host-to-guest with 256 byte request and 256 byte
response for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=7818456):
 1: 7818456 
Size distribution for output (max=7816698):
 2: 149 
 3: 7816698 
 4: 2   
 5: 1   
Size distribution for control (max=1):
 0: 0


4 session TCP_STREAM from host-to-guest with 4K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=16050941):
 1: 16050941 
Size distribution for output (max=1877796):
 2: 1877796 
 3: 5   
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from host-to-guest with 16K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=16831151):
 1: 16831151 
Size distribution for output (max=1923965):
 2: 1923965 
 3: 5   
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from guest-to-host with 4K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=1316069):
 1: 1316069 
Size distribution for output (max=879213):
 2: 24  
 3: 24097   #
 4: 23176   #
 5: 3412
 6: 4446
 7: 4663
 8: 4195
 9: 3772
10: 3388
11: 3666
12: 2885
13: 2759
14: 2997
15: 3060
16: 2651
17: 2235
18: 92721   ##  
19: 879213  
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from guest-to-host with 16K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=1428590):
 1: 1428590 
Size distribution for output (max=957774):
 2: 20
 3: 54955   ###
 4: 34281   ##
 5: 2967
 6: 3394
 7: 9400
 8: 3061
 9: 3397
10: 3258
11: 3275
12: 3147
13: 2876
14: 2747
15: 2832
16: 2013
17: 1670
18: 100369  ##
19: 957774  
Size distribution for control (max=1):
 0: 0

Thanks,
Tom

> I'm not convinced that the ideal 17-buffer case actually happens as much
> as we think.  And if it's not happening with this netperf test, we're
> testing the wrong thing.
> 
> Thanks,
> Rusty.
> 

[RFCv3] virtio_console: Add support for virtio remoteproc serial

2012-09-10 Thread sjur . brandeland
From: Sjur Brændeland 

Add a virtio remoteproc serial driver: VIRTIO_ID_RPROC_SERIAL (0xB)
for communicating with a remote processor in an asymmetric
multi-processing configuration.

The virtio remoteproc serial driver reuses the existing virtio_console
implementation, and adds support for DMA allocation of data buffers
but disables support for tty console, multiple ports, and the control queue.

Signed-off-by: Sjur Brændeland 
cc: Rusty Russell 
cc: Michael S. Tsirkin 
cc: Amit Shah 
cc: Ohad Ben-Cohen 
cc: Linus Walleij 
cc: virtualization@lists.linux-foundation.org
cc: linux-ker...@vger.kernel.org
---

Hi Rusty,

...

>Sorry for the back and forth, I've been pondering MST's points.
>If we make a new dma-multiport device (eg. ID 11), how ugly is the code?

I don't think it's too bad.
It's a few more code lines - as I have added a new virtio_driver
definition and feature table. The rest is just replacing the check
on feature bits with driver type. I no longer need to check on
feature bits in the probe function, as the match on device type is
done automatically by the driver framework.

I actually like this new approach better.
It solves the issues Michael has pointed out, and we don't have to
think through side effects of weird combinations of feature bits.
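
For reference, the "new virtio_driver definition and feature table" mentioned
above boils down to registering a second driver keyed on the new device ID.
A rough sketch of its shape (field values and the probe/remove callback names
are assumptions for illustration, not copied from the patch below):

    static struct virtio_device_id rproc_serial_id_table[] = {
            { VIRTIO_ID_RPROC_SERIAL, VIRTIO_DEV_ANY_ID },
            { 0 },
    };

    static unsigned int rproc_serial_features[] = {
            /* intentionally empty: no console, no multiport, no control queue */
    };

    static struct virtio_driver virtio_rproc_serial = {
            .feature_table          = rproc_serial_features,
            .feature_table_size     = ARRAY_SIZE(rproc_serial_features),
            .driver.name            = "virtio_rproc_serial",
            .driver.owner           = THIS_MODULE,
            .id_table               = rproc_serial_id_table,
            .probe                  = virtcons_probe,
            .remove                 = virtcons_remove,
    };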

>It would be a virtio console with DMA buffers and no console, just the
>multiport stuff.  This would have no impact on the current spec for
>virtio console.

Yes, and this way it will also be easier to explain what this new
feature does.

Regards,
Sjur

 drivers/char/virtio_console.c |  133 +
 include/linux/virtio_ids.h|1 +
 2 files changed, 109 insertions(+), 25 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index cdf2f54..08593aa 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "../tty/hvc/hvc_console.h"
 
 /*
@@ -323,6 +324,52 @@ static bool is_console_port(struct port *port)
return false;
 }
 
+#ifdef CONFIG_HAS_DMA
+static inline bool is_rproc_serial(struct virtio_device *vdev)
+{
+   return vdev->id.device == VIRTIO_ID_RPROC_SERIAL;
+}
+#else
+static inline bool is_rproc_serial(struct virtio_device *vdev)
+{
+   return false;
+}
+#endif
+
+/* Allocate data buffer from DMA memory if requested */
+static inline void *
+alloc_databuf(struct virtio_device *vdev, size_t size, gfp_t flag)
+{
+   if (is_rproc_serial(vdev)) {
+   dma_addr_t dma;
+   struct device *dev = &vdev->dev;
+   /*
+* Allocate DMA memory from ancestors. Finding the ancestor
+* is a bit quirky when DMA_MEMORY_INCLUDES_CHILDREN is not
+* implemented.
+*/
+   dev = dev->parent ? dev->parent : dev;
+   dev = dev->parent ? dev->parent : dev;
+   return dma_alloc_coherent(dev, size, &dma, flag);
+   }
+   return kzalloc(size, flag);
+}
+
+static inline void
+free_databuf(struct virtio_device *vdev, size_t size, void *cpu_addr)
+{
+
+   if (is_rproc_serial(vdev)) {
+   struct device *dev = &vdev->dev;
+   dma_addr_t dma_handle = virt_to_bus(cpu_addr);
+   dev = dev->parent ? dev->parent : dev;
+   dev = dev->parent ? dev->parent : dev;
+   dma_free_coherent(dev, size, cpu_addr, dma_handle);
+   return;
+   }
+   kfree(cpu_addr);
+}
+
 static inline bool use_multiport(struct ports_device *portdev)
 {
/*
@@ -334,20 +381,21 @@ static inline bool use_multiport(struct ports_device 
*portdev)
return portdev->vdev->features[0] & (1 << VIRTIO_CONSOLE_F_MULTIPORT);
 }
 
-static void free_buf(struct port_buffer *buf)
+static void
+free_buf(struct virtqueue *vq, struct port_buffer *buf, size_t buf_size)
 {
-   kfree(buf->buf);
+   free_databuf(vq->vdev, buf_size, buf->buf);
kfree(buf);
 }
 
-static struct port_buffer *alloc_buf(size_t buf_size)
+static struct port_buffer *alloc_buf(struct virtqueue *vq, size_t buf_size)
 {
struct port_buffer *buf;
 
buf = kmalloc(sizeof(*buf), GFP_KERNEL);
if (!buf)
goto fail;
-   buf->buf = kzalloc(buf_size, GFP_KERNEL);
+   buf->buf = alloc_databuf(vq->vdev, buf_size, GFP_KERNEL);
if (!buf->buf)
goto free_buf;
buf->len = 0;
@@ -414,7 +462,7 @@ static void discard_port_data(struct port *port)
port->stats.bytes_discarded += buf->len - buf->offset;
if (add_inbuf(port->in_vq, buf) < 0) {
err++;
-   free_buf(buf);
+   free_buf(port->in_vq, buf, PAGE_SIZE);
}
port->inbuf = NULL;
buf = get_inbuf(port);
@@ -485,7 +533,7 @@ static void reclaim_consumed_buffers(struct port *port)

Re: [Xen-devel] [patch 1/3] xen/privcmd: check for integer overflow in ioctl

2012-09-10 Thread Dan Carpenter
On Mon, Sep 10, 2012 at 11:35:11AM +0100, David Vrabel wrote:
> On 08/09/12 10:52, Dan Carpenter wrote:
> > If m.num is too large then the "m.num * sizeof(*m.arr)" multiplication
> > could overflow and the access_ok() check wouldn't test the right size.
> 
> m.num is range checked later on so it doesn't matter that the
> access_ok() checks might be wrong.  A bit subtle, perhaps.
> 

Yeah.  It's too subtle for my static checker but not so subtle for
a human being.  Laziness on my part.

Please drop this patch.

regards,
dan carpenter



Re: [PATCHv4] virtio-spec: virtio network device multiqueue support

2012-09-10 Thread Jason Wang

On 09/10/2012 02:33 PM, Michael S. Tsirkin wrote:

On Mon, Sep 10, 2012 at 09:27:38AM +0300, Michael S. Tsirkin wrote:

On Mon, Sep 10, 2012 at 09:16:29AM +0300, Michael S. Tsirkin wrote:

On Mon, Sep 10, 2012 at 11:42:25AM +0930, Rusty Russell wrote:

OK, I read the spec (pasted below for easy of reading), but I'm still
confused over how this will work.

I thought normal net drivers have the hardware provide an rxhash for
each packet, and we map that to CPU to queue the packet on[1].  We hope
that the receiving process migrates to that CPU, so xmit queue
matches.

This only works sometimes.  For example it's common to pin netperf to a
cpu to get consistent performance.  Proper hardware must obey what
applications want it to do, not the other way around.


For virtio this would mean a new per-packet rxhash value, right?

Why are we doing something different?  What am I missing?

Thanks,
Rusty.
[1] Everything I Know About Networking I Learned From LWN:
 https://lwn.net/Articles/362339/

I think you missed this:

Some network interfaces can help with the distribution of incoming
packets; they have multiple receive queues and multiple interrupt lines.
Others, though, are equipped with a single queue, meaning that the
driver for that hardware must deal with all incoming packets in a
single, serialized stream. Parallelizing such a stream requires some
intelligence on the part of the host operating system.

In other words RPS is a hack to speed up networking on cheapo
hardware, this is one of the reasons it is off by default.
Good hardware has multiple receive queues.
We can implement a good one so we do not need RPS.

Also not all guest OS-es support RPS.

Does this clarify?

I would like to add that on many processors, sending
IPIs between guest CPUs requires exits on sending *and*
receiving path, making it very expensive.

A final addition: what you suggest above would be
"TX follows RX", right?
It is in anticipation of something like that, that I made
steering programming so generic.
I think TX follows RX is more immediately useful for reasons above
but we can add both to spec and let drivers and devices
decide what they want to support.
Pls let me know.


AFAIK, ixgbe does "rx follows tx". The only difference between ixgbe 
and virtio-net is that the ixgbe driver programs the flow director during 
packet transmission, while we suggest doing it silently in the device for 
simplicity. Even with this, more co-operation is still needed from the 
driver (e.g. ixgbe tries to use a per-cpu queue by setting an affinity hint 
and using the CPU id to choose the txq, an approach that could be reused in 
the virtio-net driver).
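
To illustrate the kind of reuse meant here, the per-cpu selection could be
hooked into the driver's ndo_select_queue callback along these lines (a sketch
only, with an invented function name; current virtio-net does not necessarily
do this):

/* sketch -- pick the tx queue of the CPU doing the transmit, so a flow
 * keeps hitting the same queue as long as the application stays put */
static u16 virtnet_select_txq(struct net_device *dev, struct sk_buff *skb)
{
	return smp_processor_id() % dev->real_num_tx_queues;
}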



---
Transmit Packet Steering

When VIRTIO_NET_F_MULTIQUEUE feature bit is negotiated, guest can use any of 
multiple configured transmit queues to transmit a given packet. To avoid packet 
reordering by the device (which generally leads to performance degradation), the 
driver should attempt to utilize the same transmit virtqueue for all packets of a 
given transmit flow. For bi-directional protocols (in practice, TCP), a given 
network connection can utilize both transmit and receive queues. For best 
performance, packets from a single connection should utilize the paired 
transmit and receive queues from the same virtqueue pair; for example both 
transmitqN and receiveqN. This rule makes it possible to optimize processing on 
the device side, but this is not a hard requirement: devices should function 
correctly even when this rule is not followed.

The driver selects an active steering rule using the VIRTIO_NET_CTRL_STEERING 
command (this both controls which virtqueue is selected for a given packet on 
receive and notifies the device which virtqueues are about to be used for transmit).

This command accepts a single out argument in the following format:

#define VIRTIO_NET_CTRL_STEERING   4
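
The struct layout for this argument is not reproduced in this excerpt; judging
from the field names referenced below, it is something along these lines
(hypothetical layout -- the field names match the text, the widths are assumed):

struct virtio_net_ctrl_steering {
	u8 rule;	/* one of the VIRTIO_NET_CTRL_STEERING_* rules */
	u8 reserved;	/* assumed padding; see the actual spec patch */
	u16 param;	/* extra, rule-specific parameter */
};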

The field rule specifies the function used to select the transmit virtqueue for a 
given packet; the field param makes it possible to pass an extra parameter where 
appropriate. When rule is set to VIRTIO_NET_CTRL_STEERING_SINGLE (the default), 
all packets are steered to the default virtqueue transmitq (1) and param is unused. 
When rule is set to VIRTIO_NET_CTRL_STEERING_RX_FOLLOWS_TX, packets are steered by 
the driver to the first N=(param+1) multiqueue virtqueues transmitq1...transmitqN; 
the default transmitq is unused. The driver must have configured all these (param+1) 
virtqueues beforehand.

Supported steering rules can be added and removed in the future. The driver should 
check that the request to change the steering rule was successful by checking the 
ack value of the command. As selecting a specific steering rule is an optimization 
feature, drivers should avoid hard failure and fall back on using a supported 
steering rule if this command fails. The default steering rule is 
VIRTIO_NET_CTRL_STEERING_SINGLE. It will not be removed.
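
A driver-side flow honouring this could look roughly like the following
(illustrative only; vi, num_queue_pairs and send_ctrl_cmd() are stand-ins for
whatever state and control-virtqueue helper the driver already has, and the
struct reuses the hypothetical layout shown above):

	struct virtio_net_ctrl_steering s = {
		.rule  = VIRTIO_NET_CTRL_STEERING_RX_FOLLOWS_TX,
		.param = num_queue_pairs - 1,
	};

	if (send_ctrl_cmd(vi, VIRTIO_NET_CTRL_STEERING, &s, sizeof(s)) != VIRTIO_NET_OK)
		/* optimization only: quietly keep the default SINGLE rule */
		s.rule = VIRTIO_NET_CTRL_STEERING_SINGLE;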

When the steering rule 

[patch 3/3 v2] xen/privcmd: add a __user annotation to a cast

2012-09-10 Thread Dan Carpenter
Sparse complains that we lose the __user annotation in the cast here.

drivers/xen/privcmd.c:388:35: warning: cast removes address space of expression
drivers/xen/privcmd.c:388:32: warning: incorrect type in assignment (different address spaces)
drivers/xen/privcmd.c:388:32:    expected unsigned long [noderef] [usertype] *[addressable] [assigned] user_mfn
drivers/xen/privcmd.c:388:32:    got unsigned long [usertype] *

Signed-off-by: Dan Carpenter 
---
v2: In v1 I removed the const from the declaration but now I just removed it
from the cast.  This data can either be a v1 or a v2 type struct.  The m.arr
data is const in version 2 but not in version 1.

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 0ce006a..fceb83e 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -389,7 +389,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 
if (state.global_error && (version == 1)) {
/* Write back errors in second pass. */
-   state.user_mfn = (xen_pfn_t *)m.arr;
+   state.user_mfn = (xen_pfn_t __user *)m.arr;
state.err  = err_array;
ret = traverse_pages(m.num, sizeof(xen_pfn_t),
 &pagelist, mmap_return_errors_v1, &state);


Re: [Xen-devel] [patch 3/3] xen/privcmd: remove const modifier from declaration

2012-09-10 Thread Jan Beulich
>>> On 08.09.12 at 11:58, Dan Carpenter  wrote:
> When we use this pointer, we cast away the const modifier and modify the
> data.  I think it was an accident to declare it as const.

NAK - the const is very valid here, as the v2 interface (as opposed
to the v1 one) does _not_ modify this array (or if it does, it's a
bug). This is a guarantee made to user mode, so it should also be
expressed that way in the interface.

But of course the cast used before this patch isn't right either, as
it indeed inappropriately discards the qualifier. Afaiu this was done
to simplify the internal workings of the code, but I don't think it's
desirable to sacrifice type safety for implementation simplicity.

Jan

> Signed-off-by: Dan Carpenter 
> 
> diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h
> index a853168..58ed953 100644
> --- a/include/xen/privcmd.h
> +++ b/include/xen/privcmd.h
> @@ -69,7 +69,7 @@ struct privcmd_mmapbatch_v2 {
>   unsigned int num; /* number of pages to populate */
>   domid_t dom;  /* target domain */
>   __u64 addr;   /* virtual address */
> - const xen_pfn_t __user *arr; /* array of mfns */
> + xen_pfn_t __user *arr; /* array of mfns */
>   int __user *err;  /* array of error codes */
>  };
>  
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 0ce006a..fceb83e 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -389,7 +389,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>  
>   if (state.global_error && (version == 1)) {
>   /* Write back errors in second pass. */
> - state.user_mfn = (xen_pfn_t *)m.arr;
> + state.user_mfn = m.arr;
>   state.err  = err_array;
>   ret = traverse_pages(m.num, sizeof(xen_pfn_t),
>&pagelist, mmap_return_errors_v1, &state);
> 





Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Michael S. Tsirkin
On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> This implementation of a virtio balloon driver uses the page cache to
> "store" pages that have been released to the host.  The communication
> (outside of target counts) is one way--the guest notifies the host when
> it adds a page to the page cache, allowing the host to madvise(2) with
> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> (via the regular page reclaim).  This means that inflating the balloon
> is similar to the existing balloon mechanism, but the deflate is
> different--it re-uses existing Linux kernel functionality to
> automatically reclaim.
> 
> Signed-off-by: Frank Swiderski 

I've been trying to understand this, and I have
a question: what exactly is the benefit
of this new device?

Note that users could not care less about how a driver
is implemented internally.

Is there some workload where you see VM working better with
this than regular balloon? Any numbers?

Also, can't we just replace the existing balloon implementation
with this one?  Why is it so important to deflate silently?
I guess the filesystem does not currently get a callback
before a page is reclaimed, but this is an implementation detail -
maybe this can be fixed?

Also can you pls answer Avi's question?
How is overcommit managed?


> ---
>  drivers/virtio/Kconfig  |   13 +
>  drivers/virtio/Makefile |1 +
>  drivers/virtio/virtio_fileballoon.c |  636 +++
>  include/linux/virtio_balloon.h  |9 +
>  include/linux/virtio_ids.h  |1 +
>  5 files changed, 660 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/virtio/virtio_fileballoon.c
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index f38b17a..cffa2a7 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -35,6 +35,19 @@ config VIRTIO_BALLOON
>  
>If unsure, say M.
>  
> +config VIRTIO_FILEBALLOON
> + tristate "Virtio page cache-backed balloon driver"
> + select VIRTIO
> + select VIRTIO_RING
> + ---help---
> +  This driver supports decreasing and automatically reclaiming the
> +  memory within a guest VM.  Unlike VIRTIO_BALLOON, this driver instead
> +  tries to maintain a specific target balloon size using the page cache.
> +  This allows the guest to implicitly deflate the balloon by flushing
> +  pages from the cache and touching the page.
> +
> +  If unsure, say N.
> +
>   config VIRTIO_MMIO
>   tristate "Platform bus driver for memory mapped virtio devices (EXPERIMENTAL)"
>   depends on HAS_IOMEM && EXPERIMENTAL
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 5a4c63c..7ca0a3f 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
>  obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
>  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> +obj-$(CONFIG_VIRTIO_FILEBALLOON) += virtio_fileballoon.o
> diff --git a/drivers/virtio/virtio_fileballoon.c 
> b/drivers/virtio/virtio_fileballoon.c
> new file mode 100644
> index 000..ff252ec
> --- /dev/null
> +++ b/drivers/virtio/virtio_fileballoon.c
> @@ -0,0 +1,636 @@
> +/* Virtio file (page cache-backed) balloon implementation, inspired by
> + * Dor Laor and Marcelo Tosatti's implementations, and based on Rusty Russell's
> + * implementation.
> + *
> + * This implementation of the virtio balloon driver re-uses the page cache to
> + * allow memory consumed by inflating the balloon to be reclaimed by linux.  It
> + * creates and mounts a bare-bones filesystem containing a single inode.  When
> + * the host requests the balloon to inflate, it does so by "reading" pages at
> + * offsets into the inode mapping's page_tree.  The host is notified when the
> + * pages are added to the page_tree, allowing it (the host) to madvise(2) the
> + * corresponding host memory, reducing the RSS of the virtual machine.  In this
> + * implementation, the host is only notified when a page is added to the
> + * balloon.  Reclaim happens under the existing TTFP logic, which flushes unused
> + * pages in the page cache.  If the host used MADV_DONTNEED, then when the guest
> + * uses the page, the zero page will be mapped in, allowing automatic (and fast,
> + * compared to requiring a host notification via a virtio queue to get memory
> + * back) reclaim.
> + *
> + *  Copyright 2008 Rusty Russell IBM Corporation
> + *  Copyright 2011 Frank Swiderski Google Inc
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY W
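
The mechanism described in the header comment can be pictured with a small
guest-side sketch (my own illustration, not code from this patch): inflating by
one page means pulling a page into the balloon inode's page cache at some
offset and telling the host its pfn, after which the page stays reclaimable
like any other page-cache page.

/* illustrative only -- struct virtballoon, vb->mapping and
 * tell_host_page_added() are made-up names */
static int fileballoon_inflate_one(struct virtballoon *vb, pgoff_t off)
{
	struct page *page;

	page = find_or_create_page(vb->mapping, off, GFP_HIGHUSER);
	if (!page)
		return -ENOMEM;
	tell_host_page_added(vb, page_to_pfn(page));	/* host can now MADV_DONTNEED it */
	unlock_page(page);
	page_cache_release(page);	/* keep only the page-cache reference */
	return 0;
}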

Re: [RFCv2 1/2] virtio_console: Add support for DMA memory allocation

2012-09-10 Thread Rusty Russell
sjur.brandel...@stericsson.com writes:

> From: Sjur Brændeland 
>
> Add feature VIRTIO_CONSOLE_F_DMA_MEM. If the architecture has
> DMA support and this feature bit is set, the virtio data buffers
> will be allocated from DMA memory. If the device requests
> the feature VIRTIO_CONSOLE_F_DMA_MEM, but the architecture
> doesn't support DMA, the driver's probe function will fail.
>
> This is needed for using virtio_console from the remoteproc
> framework.

Sorry for the back and forth, I've been pondering MST's points.

If we make a new dma-multiport device (eg. ID 11), how ugly is the code?

It would be a virtio console with DMA buffers and no console, just the
multiport stuff.  This would have no impact on the current spec for
virtio console.
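
Concretely, that would mean reserving a fresh device ID and reusing only the
multiport plumbing of virtio_console, e.g. (purely illustrative -- no ID or
name has been assigned in this thread):

	/* include/linux/virtio_ids.h -- hypothetical entry */
	#define VIRTIO_ID_DMA_CONSOLE	11	/* multiport only, DMA buffers, no hvc console */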

Thanks,
Rusty.