Re: [PATCH 0/5] vhost-scsi: Add support for host virtualized target
Hello Nicholas, On 09/07/2012 02:48 PM, Nicholas A. Bellinger wrote: > From: Nicholas Bellinger > > Hello Anthony & Co, > > This is the fourth installment to add host virtualized target support for > the mainline tcm_vhost fabric driver using Linux v3.6-rc into QEMU 1.3.0-rc. > > The series is available directly from the following git branch: > >git://git.kernel.org/pub/scm/virt/kvm/nab/qemu-kvm.git vhost-scsi-for-1.3 > > Note the code is cut against yesterday's QEMU head, and dispite the name > of the tree is based upon mainline qemu.org git code + has thus far been > running overnight with > 100K IOPs small block 4k workloads using v3.6-rc2+ > based target code with RAMDISK_DR backstores. Are you still seeing the performance degradation discussed in the thread "vhost-scsi port to v1.1.0 + MSI-X performance regression" ? > Other than some minor fuzz between jumping from QEMU 1.2.0 -> 1.2.50, this > series is functionally identical to what's been posted for vhost-scsi RFC-v3 > to qemu-devel. > > Please consider applying these patches for an initial vhost-scsi merge into > QEMU 1.3.0-rc code, or let us know what else you'd like to see addressed for > this series to in order to merge. > > Thank you! > > --nab > > Nicholas Bellinger (2): > monitor: Rename+move net_handle_fd_param -> monitor_handle_fd_param > virtio-scsi: Set max_target=0 during vhost-scsi operation > > Stefan Hajnoczi (3): > vhost: Pass device path to vhost_dev_init() > vhost-scsi: add -vhost-scsi host device for use with tcm-vhost > virtio-scsi: Add start/stop functionality for vhost-scsi > > configure| 10 +++ > hw/Makefile.objs |1 + > hw/qdev-properties.c | 41 +++ > hw/vhost-scsi.c | 190 > ++ > hw/vhost-scsi.h | 62 > hw/vhost.c |5 +- > hw/vhost.h |3 +- > hw/vhost_net.c |2 +- > hw/virtio-pci.c |2 + > hw/virtio-scsi.c | 55 ++- > hw/virtio-scsi.h |1 + > monitor.c| 18 + > monitor.h|1 + > net.c| 18 - > net.h|2 - > net/socket.c |2 +- > net/tap.c|4 +- > qemu-common.h|1 + > qemu-config.c| 19 + > qemu-options.hx |4 + > vl.c | 18 + > 21 files changed, 431 insertions(+), 28 deletions(-) > create mode 100644 hw/vhost-scsi.c > create mode 100644 hw/vhost-scsi.h > -- Asias ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [patch 3/3] xen/privcmd: remove const modifier from declaration
On 10/09/12 10:03, Jan Beulich wrote: On 08.09.12 at 11:58, Dan Carpenter wrote: >> When we use this pointer, we cast away the const modifier and modify the >> data. I think it was an accident to declare it as const. > > NAK - the const is very valid here, as the v2 interface (as opposed > to the v1 one) does _not_ modify this array (or if it does, it's a > bug). This is a guarantee made to user mode, so it should also be > expressed that way in the interface. > > But of course the cast used before this patch isn't right either, as > it indeed inappropriately discards the qualifier. Afaiu this was done > to simplify the internal workings of the code, but I don't think it's > desirable to sacrifice type safety for implementation simplicity. m.arr here isn't const as m is really the V1 structure where m.arr is non-const. David ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
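[For context, the two ioctl argument layouts being discussed look roughly like this; the V2 structure is taken from the patch quoted elsewhere in this thread, while the V1 layout and its comments are paraphrased from memory of include/xen/privcmd.h and should be checked against the header:]

struct privcmd_mmapbatch {              /* V1 */
        unsigned int num;               /* number of pages to populate */
        domid_t dom;                    /* target domain */
        __u64 addr;                     /* virtual address */
        xen_pfn_t __user *arr;          /* array of mfns; errors written back here */
};

struct privcmd_mmapbatch_v2 {           /* V2 */
        unsigned int num;               /* number of pages to populate */
        domid_t dom;                    /* target domain */
        __u64 addr;                     /* virtual address */
        const xen_pfn_t __user *arr;    /* array of mfns; not modified by V2 */
        int __user *err;                /* array of error codes */
};

[The kernel handles a V1 call through the V2 structure internally, which is why the V1 write-back path needs a cast on m.arr at all.]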
Re: [Xen-devel] [patch 1/3] xen/privcmd: check for integer overflow in ioctl
On 08/09/12 10:52, Dan Carpenter wrote: > If m.num is too large then the "m.num * sizeof(*m.arr)" multiplication > could overflow and the access_ok() check wouldn't test the right size. m.num is range checked later on so it doesn't matter that the access_ok() checks might be wrong. A bit subtle, perhaps. David > Signed-off-by: Dan Carpenter > --- > Only needed in linux-next. > > diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c > index 215a3c0..fdff8f9 100644 > --- a/drivers/xen/privcmd.c > +++ b/drivers/xen/privcmd.c > @@ -325,6 +325,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, > int version) > return -EFAULT; > /* Returns per-frame error in m.arr. */ > m.err = NULL; > + if (m.num > SIZE_MAX / sizeof(*m.arr)) > + return -EINVAL; > if (!access_ok(VERIFY_WRITE, m.arr, m.num * sizeof(*m.arr))) > return -EFAULT; > break; > @@ -332,6 +334,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, > int version) > if (copy_from_user(&m, udata, sizeof(struct > privcmd_mmapbatch_v2))) > return -EFAULT; > /* Returns per-frame error code in m.err. */ > + if (m.num > SIZE_MAX / sizeof(*m.err)) > + return -EINVAL; > if (!access_ok(VERIFY_WRITE, m.err, m.num * (sizeof(*m.err > return -EFAULT; > break; > > ___ > Xen-devel mailing list > xen-de...@lists.xen.org > http://lists.xen.org/xen-devel ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [patch 3/3] xen/privcmd: remove const modifier from declaration
On Sep 8, 2012, at 5:58 AM, Dan Carpenter wrote: > When we use this pointer, we cast away the const modifier and modify the > data. I think it was an accident to declare it as const. > > Signed-off-by: Dan Carpenter Acked-by: Andres Lagar-Cavilla Thanks for the careful review & hardening. Andres > > diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h > index a853168..58ed953 100644 > --- a/include/xen/privcmd.h > +++ b/include/xen/privcmd.h > @@ -69,7 +69,7 @@ struct privcmd_mmapbatch_v2 { > unsigned int num; /* number of pages to populate */ > domid_t dom; /* target domain */ > __u64 addr; /* virtual address */ > - const xen_pfn_t __user *arr; /* array of mfns */ > + xen_pfn_t __user *arr; /* array of mfns */ > int __user *err; /* array of error codes */ > }; > > diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c > index 0ce006a..fceb83e 100644 > --- a/drivers/xen/privcmd.c > +++ b/drivers/xen/privcmd.c > @@ -389,7 +389,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, > int version) > > if (state.global_error && (version == 1)) { > /* Write back errors in second pass. */ > - state.user_mfn = (xen_pfn_t *)m.arr; > + state.user_mfn = m.arr; > state.err = err_array; > ret = traverse_pages(m.num, sizeof(xen_pfn_t), >&pagelist, mmap_return_errors_v1, &state); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [patch 2/3] xen/privcmd: return -EFAULT on error
On Sep 8, 2012, at 5:57 AM, Dan Carpenter wrote: > __copy_to_user() returns the number of bytes remaining to be copied but > we want to return a negative error code here. > > Signed-off-by: Dan Carpenter Acked-by: Andres Lagar-Cavilla > > diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c > index fdff8f9..0ce006a 100644 > --- a/drivers/xen/privcmd.c > +++ b/drivers/xen/privcmd.c > @@ -393,8 +393,11 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, > int version) > state.err = err_array; > ret = traverse_pages(m.num, sizeof(xen_pfn_t), >&pagelist, mmap_return_errors_v1, &state); > - } else if (version == 2) > + } else if (version == 2) { > ret = __copy_to_user(m.err, err_array, m.num * sizeof(int)); > + if (ret) > + ret = -EFAULT; > + } > > /* If we have not had any EFAULT-like global errors then set the global >* error to -ENOENT if necessary. */ ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [patch 1/3] xen/privcmd: check for integer overflow in ioctl
On Sep 8, 2012, at 5:52 AM, Dan Carpenter wrote: > If m.num is too large then the "m.num * sizeof(*m.arr)" multiplication > could overflow and the access_ok() check wouldn't test the right size. > > Signed-off-by: Dan Carpenter Acked-by: Andres Lagar-Cavilla > --- > Only needed in linux-next. > > diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c > index 215a3c0..fdff8f9 100644 > --- a/drivers/xen/privcmd.c > +++ b/drivers/xen/privcmd.c > @@ -325,6 +325,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, > int version) > return -EFAULT; > /* Returns per-frame error in m.arr. */ > m.err = NULL; > + if (m.num > SIZE_MAX / sizeof(*m.arr)) > + return -EINVAL; > if (!access_ok(VERIFY_WRITE, m.arr, m.num * sizeof(*m.arr))) > return -EFAULT; > break; > @@ -332,6 +334,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, > int version) > if (copy_from_user(&m, udata, sizeof(struct > privcmd_mmapbatch_v2))) > return -EFAULT; > /* Returns per-frame error code in m.err. */ > + if (m.num > SIZE_MAX / sizeof(*m.err)) > + return -EINVAL; > if (!access_ok(VERIFY_WRITE, m.err, m.num * (sizeof(*m.err > return -EFAULT; > break; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
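[To see the wrap-around this check guards against, here is a small stand-alone illustration with hypothetical values, assuming a 32-bit size_t and an 8-byte array element; this is not code from the patch:]

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* Assume sizeof(*m.arr) == 8 and 32-bit arithmetic. */
        uint32_t num = 0x20000001;      /* attacker-controlled m.num */
        uint32_t bytes = num * 8;       /* wraps modulo 2^32 to 8 */

        /* access_ok() would then only validate 8 bytes, while the ioctl
         * later walks num elements; the SIZE_MAX / sizeof(*m.arr) check
         * rejects such values up front. */
        printf("num=%u, checked bytes=%u\n", num, bytes);
        return 0;
}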
Re: [RFC 1/2] virtio_console: Add support for DMA memory allocation
On Thu, Sep 06, 2012 at 08:27:35AM +0300, Michael S. Tsirkin wrote: > On Thu, Sep 06, 2012 at 11:34:25AM +0930, Rusty Russell wrote: > > "Michael S. Tsirkin" writes: > > > On Tue, Sep 04, 2012 at 06:58:47PM +0200, Sjur Brændeland wrote: > > >> Hi Michael, > > >> > > >> > Exactly. Though if we just fail load it will be much less code. > > >> > > > >> > Generally, using a feature bit for this is a bit of a problem though: > > >> > normally driver is expected to be able to simply ignore > > >> > a feature bit. In this case driver is required to > > >> > do something so a feature bit is not a good fit. > > >> > I am not sure what the right thing to do is. > > >> > > >> I see - so in order to avoid the binding between driver and device > > >> there are two options I guess. Either make virtio_dev_match() or > > >> virtcons_probe() fail. Neither of them seems like the obvious choice. > > >> > > >> Maybe adding a check for VIRTIO_CONSOLE_F_DMA_MEM match > > >> between device and driver in virtcons_probe() is the lesser evil? > > >> > > >> Regards, > > >> Sjur > > > > > > A simplest thing to do is change dev id. rusty? > > > > For generic usage, this is correct. But my opinion is that fallback on > > feature non-ack is quality-of-implementation issue: great to have, but > > there are cases where you just want to fail with "you're too old". > > > > And in this case, an old system simply will never work. So it's a > > question of how graceful the failure is. > > > > Can your userspace loader can refuse to proceed if the driver doesn't > > ack the bits? If so, it's simpler than a whole new ID. > > > > Cheers, > > Rusty. > > Yes but how can it signal guest that it will never proceed? > > Also grep for BUG_ON in core found this: > >drv->remove(dev); > >/* Driver should have reset device. */ >BUG_ON(dev->config->get_status(dev)); > > I think below is what Sjur refers to. > I think below is a good idea for 3.6. Thoughts? > > ---> > > virtio: don't crash when device is buggy > > Because of a sanity check in virtio_dev_remove, a buggy device can crash > kernel. And in case of rproc it's userspace so it's not a good idea. > We are unloading a driver so how bad can it be? > Be less aggressive in handling this error: if it's a driver bug, > warning once should be enough. > > Signed-off-by: Michael S. Tsirkin > > -- Rusty? > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > index c3b3f7f..1e8659c 100644 > --- a/drivers/virtio/virtio.c > +++ b/drivers/virtio/virtio.c > @@ -159,7 +159,7 @@ static int virtio_dev_remove(struct device *_d) > drv->remove(dev); > > /* Driver should have reset device. */ > - BUG_ON(dev->config->get_status(dev)); > + WARN_ON_ONCE(dev->config->get_status(dev)); > > /* Acknowledge the device's existence again. */ > add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE); > > -- > MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] Add a page cache-backed balloon device driver.
On Mon, Sep 10, 2012 at 04:49:40PM -0400, Mike Waychison wrote: > On Mon, Sep 10, 2012 at 3:59 PM, Michael S. Tsirkin wrote: > > On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote: > >> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin > >> wrote: > >> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote: > >> >> This implementation of a virtio balloon driver uses the page cache to > >> >> "store" pages that have been released to the host. The communication > >> >> (outside of target counts) is one way--the guest notifies the host when > >> >> it adds a page to the page cache, allowing the host to madvise(2) with > >> >> MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit > >> >> (via the regular page reclaim). This means that inflating the balloon > >> >> is similar to the existing balloon mechanism, but the deflate is > >> >> different--it re-uses existing Linux kernel functionality to > >> >> automatically reclaim. > >> >> > >> >> Signed-off-by: Frank Swiderski > >> > >> Hi Michael, > >> > >> I'm very sorry that Frank and I have been silent on these threads. > >> I've been out of the office and Frank has been been swamped :) > >> > >> I'll take a stab at answering some of your questions below, and > >> hopefully we can end up on the same page. > >> > >> > I've been trying to understand this, and I have > >> > a question: what exactly is the benefit > >> > of this new device? > >> > >> The key difference between this device/driver and the pre-existing > >> virtio_balloon device/driver is in how the memory pressure loop is > >> controlled. > >> > >> With the pre-existing balloon device/driver, the control loop for how > >> much memory a given VM is allowed to use is controlled completely by > >> the host. This is probably fine if the goal is to pack as much work > >> on a given host as possible, but it says nothing about the expected > >> performance that any given VM is expecting to have. Specifically, it > >> allows the host to set a target goal for the size of a VM, and the > >> driver in the guest does whatever is needed to get to that goal. This > >> is great for systems where one wants to "grow or shrink" a VM from the > >> outside. > >> > >> > >> This behaviour however doesn't match what applications actually expect > >> from a memory control loop however. In a native setup, an application > >> can usually expect to allocate memory from the kernel on an as-needed > >> basis, and can in turn return memory back to the system (using a heap > >> implementation that actually releases memory that is). The dynamic > >> size of an application is completely controlled by the application, > >> and there is very little that cluster management software can do to > >> ensure that the application fits some prescribed size. > >> > >> We recognized this in the development of our cluster management > >> software long ago, so our systems are designed for managing tasks that > >> have a dynamic memory footprint. Overcommit is possible (as most > >> applications do not use the full reservation of memory they asked for > >> originally), letting us do things like schedule lower priority/lower > >> service-classification work using resources that are otherwise > >> available in stand-by for high-priority/low-latency workloads.
> > > > OK I am not sure I got this right so pls tell me if this summary is > > correct (note: this does not talk about what guest does with memory, > > ust what it is that device does): > > > > - existing balloon is told lower limit on target size by host and pulls in > > at least > > target size. Guest can inflate > target size if it likes > > and then it is OK to deflate back to target size but not less. > > Is this true? I take it nothing is keeping the existing balloon > driver from going over the target, but the same can be said about > either balloon implementation. > > > - your balloon is told upper limit on target size by host and pulls at most > > target size. Guest can deflate down to 0 at any point. > > > > If so I think both approaches make sense and in fact they > > can be useful at the same time for the same guest. > > In that case, I see two ways how this can be done: > > > > 1. two devices: existing ballon + cache balloon > > 2. add "upper limit" to existing ballon > > > > A single device looks a bit more natural in that we don't > > really care in which balloon a page is as long as we > > are between lower and upper limit. Right? > > I agree that this may be better done using a single device if possible. I am not sure myself, just asking. > > From implementation POV we could have it use > > pagecache for pages above lower limit but that > > is a separate question about driver design, > > I would like to make sure I understand the high level design first.
Re: [PATCH] Add a page cache-backed balloon device driver.
On Mon, Sep 10, 2012 at 3:59 PM, Michael S. Tsirkin wrote: > On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote: >> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin wrote: >> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote: >> >> This implementation of a virtio balloon driver uses the page cache to >> >> "store" pages that have been released to the host. The communication >> >> (outside of target counts) is one way--the guest notifies the host when >> >> it adds a page to the page cache, allowing the host to madvise(2) with >> >> MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit >> >> (via the regular page reclaim). This means that inflating the balloon >> >> is similar to the existing balloon mechanism, but the deflate is >> >> different--it re-uses existing Linux kernel functionality to >> >> automatically reclaim. >> >> >> >> Signed-off-by: Frank Swiderski >> >> Hi Michael, >> >> I'm very sorry that Frank and I have been silent on these threads. >> I've been out of the office and Frank has been been swamped :) >> >> I'll take a stab at answering some of your questions below, and >> hopefully we can end up on the same page. >> >> > I've been trying to understand this, and I have >> > a question: what exactly is the benefit >> > of this new device? >> >> The key difference between this device/driver and the pre-existing >> virtio_balloon device/driver is in how the memory pressure loop is >> controlled. >> >> With the pre-existing balloon device/driver, the control loop for how >> much memory a given VM is allowed to use is controlled completely by >> the host. This is probably fine if the goal is to pack as much work >> on a given host as possible, but it says nothing about the expected >> performance that any given VM is expecting to have. Specifically, it >> allows the host to set a target goal for the size of a VM, and the >> driver in the guest does whatever is needed to get to that goal. This >> is great for systems where one wants to "grow or shrink" a VM from the >> outside. >> >> >> This behaviour however doesn't match what applications actually expect >> from a memory control loop however. In a native setup, an application >> can usually expect to allocate memory from the kernel on an as-needed >> basis, and can in turn return memory back to the system (using a heap >> implementation that actually releases memory that is). The dynamic >> size of an application is completely controlled by the application, >> and there is very little that cluster management software can do to >> ensure that the application fits some prescribed size. >> >> We recognized this in the development of our cluster management >> software long ago, so our systems are designed for managing tasks that >> have a dynamic memory footprint. Overcommit is possible (as most >> applications do not use the full reservation of memory they asked for >> originally), letting us do things like schedule lower priority/lower >> service-classification work using resources that are otherwise >> available in stand-by for high-priority/low-latency workloads. > > OK I am not sure I got this right so pls tell me if this summary is > correct (note: this does not talk about what guest does with memory, > ust what it is that device does): > > - existing balloon is told lower limit on target size by host and pulls in at > least > target size.
Guest can inflate > target size if it likes > and then it is OK to deflate back to target size but not less. Is this true? I take it nothing is keeping the existing balloon driver from going over the target, but the same can be said about either balloon implementation. > - your balloon is told upper limit on target size by host and pulls at most > target size. Guest can deflate down to 0 at any point. > > If so I think both approaches make sense and in fact they > can be useful at the same time for the same guest. > In that case, I see two ways how this can be done: > > 1. two devices: existing ballon + cache balloon > 2. add "upper limit" to existing ballon > > A single device looks a bit more natural in that we don't > really care in which balloon a page is as long as we > are between lower and upper limit. Right? I agree that this may be better done using a single device if possible. > From implementation POV we could have it use > pagecache for pages above lower limit but that > is a separate question about driver design, > I would like to make sure I understand the high > level design first. I agree that this is an implementation detail that is separate from discussions of high and low limits. That said, there are several advantages to pushing these pages to the
Re: [PATCH] Add a page cache-backed balloon device driver.
On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote: > On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin wrote: > > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote: > >> This implementation of a virtio balloon driver uses the page cache to > >> "store" pages that have been released to the host. The communication > >> (outside of target counts) is one way--the guest notifies the host when > >> it adds a page to the page cache, allowing the host to madvise(2) with > >> MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit > >> (via the regular page reclaim). This means that inflating the balloon > >> is similar to the existing balloon mechanism, but the deflate is > >> different--it re-uses existing Linux kernel functionality to > >> automatically reclaim. > >> > >> Signed-off-by: Frank Swiderski > > Hi Michael, > > I'm very sorry that Frank and I have been silent on these threads. > I've been out of the office and Frank has been been swamped :) > > I'll take a stab at answering some of your questions below, and > hopefully we can end up on the same page. > > > I've been trying to understand this, and I have > > a question: what exactly is the benefit > > of this new device? > > The key difference between this device/driver and the pre-existing > virtio_balloon device/driver is in how the memory pressure loop is > controlled. > > With the pre-existing balloon device/driver, the control loop for how > much memory a given VM is allowed to use is controlled completely by > the host. This is probably fine if the goal is to pack as much work > on a given host as possible, but it says nothing about the expected > performance that any given VM is expecting to have. Specifically, it > allows the host to set a target goal for the size of a VM, and the > driver in the guest does whatever is needed to get to that goal. This > is great for systems where one wants to "grow or shrink" a VM from the > outside. > > > This behaviour however doesn't match what applications actually expect > from a memory control loop however. In a native setup, an application > can usually expect to allocate memory from the kernel on an as-needed > basis, and can in turn return memory back to the system (using a heap > implementation that actually releases memory that is). The dynamic > size of an application is completely controlled by the application, > and there is very little that cluster management software can do to > ensure that the application fits some prescribed size. > > We recognized this in the development of our cluster management > software long ago, so our systems are designed for managing tasks that > have a dynamic memory footprint. Overcommit is possible (as most > applications do not use the full reservation of memory they asked for > originally), letting us do things like schedule lower priority/lower > service-classification work using resources that are otherwise > available in stand-by for high-priority/low-latency workloads. OK I am not sure I got this right so pls tell me if this summary is correct (note: this does not talk about what guest does with memory, ust what it is that device does): - existing balloon is told lower limit on target size by host and pulls in at least target size. Guest can inflate > target size if it likes and then it is OK to deflate back to target size but not less. - your balloon is told upper limit on target size by host and pulls at most target size. Guest can deflate down to 0 at any point. 
If so I think both approaches make sense and in fact they can be useful at the same time for the same guest. In that case, I see two ways how this can be done: 1. two devices: existing ballon + cache balloon 2. add "upper limit" to existing ballon A single device looks a bit more natural in that we don't really care in which balloon a page is as long as we are between lower and upper limit. Right? >From implementation POV we could have it use pagecache for pages above lower limit but that is a separate question about driver design, I would like to make sure I understand the high level design first. > > > > Note that users could not care less about how a driver > > is implemented internally. > > > > Is there some workload where you see VM working better with > > this than regular balloon? Any numbers? > > This device is less about performance as it is about getting the > memory size of a job (or in this case, a job in a VM) to grow and > shrink as the application workload sees fit, much like how processes > today can grow and shrink without external direction. Still, e.g. swap in host achieves more or less the same functionality. I am guessing balloon can work better by getting more cooperation from guest but aren't there any tests showing this is true in practice? > > > > Also, can't we just replace existing balloon implementation > > with this one? > > Perhaps, but as described above, both devices have very diff
Re: [PATCHv4] virtio-spec: virtio network device multiqueue support
On 09/09/2012 07:12 PM, Rusty Russell wrote: OK, I read the spec (pasted below for ease of reading), but I'm still confused over how this will work. I thought normal net drivers have the hardware provide an rxhash for each packet, and we map that to CPU to queue the packet on[1]. We hope that the receiving process migrates to that CPU, so xmit queue matches. For virtio this would mean a new per-packet rxhash value, right? Why are we doing something different? What am I missing? Thanks, Rusty. [1] Everything I Know About Networking I Learned From LWN: https://lwn.net/Articles/362339/

In my taxonomy at least, "multi-queue" predates RPS and RFS and is simply where the NIC via some means, perhaps a headers hash, separates incoming frames to different queues. RPS can be thought of as doing something similar inside the host. That could be used to get a spread from an otherwise "dumb" NIC (certainly that is what one of its predecessors - Inbound Packet Scheduling - used it for in HP-UX 10.20), or it could be used to augment the multi-queue support of a not-so-dumb NIC - say if said NIC had a limit of queues that was rather lower than the number of cores/threads in the host. Indeed some driver/NIC combinations provide a hash value to the host for the host to use as it sees fit. However, there is still the matter of a single thread of an application servicing multiple connections, each of which would hash to different locations. RFS (Receive Flow Steering) then goes one step further, and looks up where the flow endpoint was last accessed and steers the traffic there. The idea being that a thread of execution servicing multiple flows will have the traffic of those flows sent to the same place. It then allows the scheduler to decide where things should be run rather than the networking code.

rick jones ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
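[To make the taxonomy concrete, the steering step in all of these schemes boils down to mapping a per-flow hash onto a queue or CPU. A minimal sketch follows; names are made up for illustration, and real NICs and RPS use an indirection table rather than a plain modulo:]

#include <stdint.h>

/* Hardware multi-queue: the NIC hashes the headers and picks an RX queue. */
unsigned int pick_rx_queue(uint32_t rxhash, unsigned int num_queues)
{
        return rxhash % num_queues;
}

/* RPS: the host does the same spreading in software, onto CPUs. */
unsigned int pick_rps_cpu(uint32_t rxhash, const unsigned int *cpu_map,
                          unsigned int map_len)
{
        return cpu_map[rxhash % map_len];
}

/* RFS goes one step further: steer to wherever the consuming thread last
 * ran, i.e. a table lookup keyed by the same hash (not shown here). */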
Re: [PATCH] Add a page cache-backed balloon device driver.
On Mon, Sep 10, 2012 at 2:04 PM, Rik van Riel wrote: > On 09/10/2012 01:37 PM, Mike Waychison wrote: >> >> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin >> wrote: > > >>> Also can you pls answer Avi's question? >>> How is overcommit managed? >> >> >> Overcommit in our deployments is managed using memory cgroups on the >> host. This allows us to have very directed policies as to how >> competing VMs on a host may overcommit. > > > How do your memory cgroups lead to guests inflating their balloons? The control loop that is driving the cgroup on the host can still move the balloon target page count causing the balloon in the guest to try and inflate. This allows the host to effectively slowly grow the balloon in the guest, allowing reclaim of guest free memory, followed by guest page cache (and memory on the host system). This can then be compared with the subsequent growth (as this balloon setup allows the guest to grow as it sees fit), which in effect gives us a memory pressure indicator on the host, allowing it to back-off shrinking the guest if the guest balloon quickly deflates. The net effect is an opportunistic release of memory from the guest back to the host, and the ability to quickly grow a VM's memory footprint as the workload within it requires. This dynamic memory sizing of VMs is much more in line with what we can expect from native tasks today (and which is what our resource management systems are designed to handle). ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
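[As a rough sketch of the host-side policy loop described above - purely illustrative pseudologic with invented names, not code from the driver or from any real management stack:]

#include <stdint.h>
#include <stdbool.h>

struct vm_balloon {
        uint64_t target_pages;  /* what the host is currently asking for */
        uint64_t actual_pages;  /* what the guest balloon actually holds */
};

/* Called periodically by a hypothetical host-side controller. */
void host_balloon_tick(struct vm_balloon *b, uint64_t step)
{
        /* If the guest quickly deflated well below the previous target,
         * treat that as a memory-pressure signal and back off; otherwise
         * keep growing the balloon slowly to reclaim free guest memory. */
        bool guest_pushed_back = b->actual_pages + step < b->target_pages;

        if (guest_pushed_back)
                b->target_pages = b->actual_pages;
        else
                b->target_pages += step;
}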
Re: [PATCH] Add a page cache-backed balloon device driver.
On 09/10/2012 01:37 PM, Mike Waychison wrote: On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin wrote: Also can you pls answer Avi's question? How is overcommit managed? Overcommit in our deployments is managed using memory cgroups on the host. This allows us to have very directed policies as to how competing VMs on a host may overcommit. How do your memory cgroups lead to guests inflating their balloons? -- All rights reversed ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] Add a page cache-backed balloon device driver.
On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin wrote: > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote: >> This implementation of a virtio balloon driver uses the page cache to >> "store" pages that have been released to the host. The communication >> (outside of target counts) is one way--the guest notifies the host when >> it adds a page to the page cache, allowing the host to madvise(2) with >> MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit >> (via the regular page reclaim). This means that inflating the balloon >> is similar to the existing balloon mechanism, but the deflate is >> different--it re-uses existing Linux kernel functionality to >> automatically reclaim. >> >> Signed-off-by: Frank Swiderski Hi Michael, I'm very sorry that Frank and I have been silent on these threads. I've been out of the office and Frank has been been swamped :) I'll take a stab at answering some of your questions below, and hopefully we can end up on the same page. > I've been trying to understand this, and I have > a question: what exactly is the benefit > of this new device? The key difference between this device/driver and the pre-existing virtio_balloon device/driver is in how the memory pressure loop is controlled. With the pre-existing balloon device/driver, the control loop for how much memory a given VM is allowed to use is controlled completely by the host. This is probably fine if the goal is to pack as much work on a given host as possible, but it says nothing about the expected performance that any given VM is expecting to have. Specifically, it allows the host to set a target goal for the size of a VM, and the driver in the guest does whatever is needed to get to that goal. This is great for systems where one wants to "grow or shrink" a VM from the outside. This behaviour however doesn't match what applications actually expect from a memory control loop however. In a native setup, an application can usually expect to allocate memory from the kernel on an as-needed basis, and can in turn return memory back to the system (using a heap implementation that actually releases memory that is). The dynamic size of an application is completely controlled by the application, and there is very little that cluster management software can do to ensure that the application fits some prescribed size. We recognized this in the development of our cluster management software long ago, so our systems are designed for managing tasks that have a dynamic memory footprint. Overcommit is possible (as most applications do not use the full reservation of memory they asked for originally), letting us do things like schedule lower priority/lower service-classification work using resources that are otherwise available in stand-by for high-priority/low-latency workloads. > > Note that users could not care less about how a driver > is implemented internally. > > Is there some workload where you see VM working better with > this than regular balloon? Any numbers? This device is less about performance as it is about getting the memory size of a job (or in this case, a job in a VM) to grow and shrink as the application workload sees fit, much like how processes today can grow and shrink without external direction. > > Also, can't we just replace existing balloon implementation > with this one? Perhaps, but as described above, both devices have very different characteristics. > Why it is so important to deflate silently? It may not be so important to deflate silently. 
I'm not sure why it is important that we deflate "loudly" though either :) Doing so seems like unnecessary guest/host communication IMO, especially if the guest is expecting to be able to grow to totalram (and the host isn't able to nack any pages reclaimed anyway...). > I guess filesystem does not currently get a callback > before page is reclaimed but this isan implementation detail - > maybe this can be fixed? I do not follow this question. > > Also can you pls answer Avi's question? > How is overcommit managed? Overcommit in our deployments is managed using memory cgroups on the host. This allows us to have very directed policies as to how competing VMs on a host may overcommit. > > >> --- >> drivers/virtio/Kconfig | 13 + >> drivers/virtio/Makefile |1 + >> drivers/virtio/virtio_fileballoon.c | 636 >> +++ >> include/linux/virtio_balloon.h |9 + >> include/linux/virtio_ids.h |1 + >> 5 files changed, 660 insertions(+), 0 deletions(-) >> create mode 100644 drivers/virtio/virtio_fileballoon.c >> >> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig >> index f38b17a..cffa2a7 100644 >> --- a/drivers/virtio/Kconfig >> +++ b/drivers/virtio/Kconfig >> @@ -35,6 +35,19 @@ config VIRTIO_BALLOON >> >>If unsure, say M. >> >> +config VIRTIO_FILEBALLOON >> + tristate "Virtio page cache-backed balloon
Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible
On Friday, September 07, 2012 09:19:04 AM Rusty Russell wrote: > "Michael S. Tsirkin" writes: > > On Thu, Sep 06, 2012 at 05:27:23PM +0930, Rusty Russell wrote: > >> "Michael S. Tsirkin" writes: > >> > Yes without checksum net core always linearizes packets, so yes it is > >> > screwed. > >> > For -net, skb always allocates space for 17 frags + linear part so > >> > it seems sane to do same in virtio core, and allocate, for -net, > >> > up to max_frags + 1 from cache. > >> > We can adjust it: no _SG -> 2 otherwise 18. > >> > >> But I thought it used individual buffers these days? > > > > Yes for receive, no for transmit. That's probably why > > we should have the threshold per vq, not per device, BTW. > > Can someone actually run with my histogram patch and see what the real > numbers are? Somehow some HTML got in my first reply, resending... I ran some TCP_RR and TCP_STREAM sessions, both host-to-guest and guest-to-host, with a form of the histogram patch applied against a RHEL6.3 kernel. The histogram values were reset after each test. Here are the results: 60 session TCP_RR from host-to-guest with 256 byte request and 256 byte response for 60 seconds: Queue histogram for virtio1: Size distribution for input (max=7818456): 1: 7818456 Size distribution for output (max=7816698): 2: 149 3: 7816698 4: 2 5: 1 Size distribution for control (max=1): 0: 0 4 session TCP_STREAM from host-to-guest with 4K message size for 60 seconds: Queue histogram for virtio1: Size distribution for input (max=16050941): 1: 16050941 Size distribution for output (max=1877796): 2: 1877796 3: 5 Size distribution for control (max=1): 0: 0 4 session TCP_STREAM from host-to-guest with 16K message size for 60 seconds: Queue histogram for virtio1: Size distribution for input (max=16831151): 1: 16831151 Size distribution for output (max=1923965): 2: 1923965 3: 5 Size distribution for control (max=1): 0: 0 4 session TCP_STREAM from guest-to-host with 4K message size for 60 seconds: Queue histogram for virtio1: Size distribution for input (max=1316069): 1: 1316069 Size distribution for output (max=879213): 2: 24 3: 24097 # 4: 23176 # 5: 3412 6: 4446 7: 4663 8: 4195 9: 3772 10: 3388 11: 3666 12: 2885 13: 2759 14: 2997 15: 3060 16: 2651 17: 2235 18: 92721 ## 19: 879213 Size distribution for control (max=1): 0: 0 4 session TCP_STREAM from guest-to-host with 16K message size for 60 seconds: Queue histogram for virtio1: Size distribution for input (max=1428590): 1: 1428590 Size distribution for output (max=957774): 2: 20 3: 54955 ### 4: 34281 ## 5: 2967 6: 3394 7: 9400 8: 3061 9: 3397 10: 3258 11: 3275 12: 3147 13: 2876 14: 2747 15: 2832 16: 2013 17: 1670 18: 100369 ## 19: 957774 Size distribution for control (max=1): 0: 0 Thanks, Tom > > I'm not convinced that the ideal 17-buffer case actually happens as much > as we think. And if it's not happening with this netperf test, we're > testing the wrong thing. > > Thanks, > Rusty. > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible
On Mon, Sep 10, 2012 at 10:47:15AM -0500, Thomas Lendacky wrote: > On Friday, September 07, 2012 09:19:04 AM Rusty Russell wrote: > > > "Michael S. Tsirkin" writes: > > > > On Thu, Sep 06, 2012 at 05:27:23PM +0930, Rusty Russell wrote: > > > >> "Michael S. Tsirkin" writes: > > > >> > Yes without checksum net core always linearizes packets, so yes it is > > > >> > screwed. > > > >> > For -net, skb always allocates space for 17 frags + linear part so > > > >> > it seems sane to do same in virtio core, and allocate, for -net, > > > >> > up to max_frags + 1 from cache. > > > >> > We can adjust it: no _SG -> 2 otherwise 18. > > > >> > > > >> But I thought it used individual buffers these days? > > > > > > > > Yes for receive, no for transmit. That's probably why > > > > we should have the threshold per vq, not per device, BTW. > > > > > > Can someone actually run with my histogram patch and see what the real > > > numbers are? > > > > > > > I ran some TCP_RR and TCP_STREAM sessions, both host-to-guest and > > guest-to-host, with a form of the histogram patch applied against a > > RHEL6.3 kernel. The histogram values were reset after each test. > > > > Here are the results: > > > > 60 session TCP_RR from host-to-guest with 256 byte request and 256 byte > > response for 60 seconds: > > > > Queue histogram for virtio1: > > Size distribution for input (max=7818456): > > 1: 7818456 > > Size distribution for output (max=7816698): > > 2: 149 > > 3: 7816698 Here, a threshold would help. > > 4: 2 > > 5: 1 > > Size distribution for control (max=1): > > 0: 0 > > > > > > 4 session TCP_STREAM from host-to-guest with 4K message size for 60 seconds: > > > > Queue histogram for virtio1: > > Size distribution for input (max=16050941): > > 1: 16050941 > > Size distribution for output (max=1877796): > > 2: 1877796 > > 3: 5 > > Size distribution for control (max=1): > > 0: 0 > > > > 4 session TCP_STREAM from host-to-guest with 16K message size for 60 seconds: > > > > Queue histogram for virtio1: > > Size distribution for input (max=16831151): > > 1: 16831151 > > Size distribution for output (max=1923965): > > 2: 1923965 > > 3: 5 > > Size distribution for control (max=1): > > 0: 0 > Hmm for virtio net output we do always use 2 s/g, this is because of a qemu bug. Maybe it's time we fixed this, added a feature bit? This would fix above without threshold hacks. > > 4 session TCP_STREAM from guest-to-host with 4K message size for 60 seconds: > > > > Queue histogram for virtio1: > > Size distribution for input (max=1316069): > > 1: 1316069 > > Size distribution for output (max=879213): > > 2: 24 > > 3: 24097 # > > 4: 23176 # > > 5: 3412 > > 6: 4446 > > 7: 4663 > > 8: 4195 > > 9: 3772 > > 10: 3388 > > 11: 3666 > > 12: 2885 > > 13: 2759 > > 14: 2997 > > 15: 3060 > > 16: 2651 > > 17: 2235 > > 18: 92721 ## > > 19: 879213 > > Size distribution for control (max=1): > > 0: 0 > > > > 4 session TCP_STREAM from guest-to-host with 16K message size for 60 seconds: > > > > Queue histogram for virtio1: > > Size distribution for input (max=1428590): > > 1: 1428590 > > Size distribution for output (max=957774): > > 2: 20 > > 3: 54955 ### > > 4: 34281 ## > > 5: 2967 > > 6: 3394 > > 7: 9400 > > 8: 3061 > > 9: 3397 > > 10: 3258 > > 11: 3275 > > 12: 3147 > > 13: 2876 > > 14: 2747 > > 15: 2832 > > 16: 2013 > > 17: 1670 > > 18: 100369 ## > > 19: 957774 > > Size distribution for control (max=1): > > 0: 0 > > > > Thanks, > > Tom In these tests we would have to set threshold pretty high. 
I wonder whether the following makes any difference: the idea is to A. get less false cache sharing by allocating full cache lines B. better locality by using same cache for multiple sizes So we get some of the wins of the threshold without bothering with a cache. Will try to test but not until later this week. diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 5aa43c3..c184712 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -132,7 +132,8 @@ static int vring_add_indirect(struct vring_virtqueue *vq, unsigned head; int i; - desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp); + desc = k
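[For point A above, the usual way to avoid false sharing is to round the allocation up to a whole cache line. The diff above is cut off, so the following is only an illustrative sketch; the kernel itself would use L1_CACHE_BYTES and kmalloc size classes rather than a hard-coded constant:]

#include <stddef.h>

#define CACHE_LINE_BYTES 64     /* assumed line size for illustration */

/* Size an indirect descriptor table so that neighbouring allocations
 * never share (and therefore never false-share) a cache line. */
size_t indirect_desc_bytes(unsigned int out, unsigned int in, size_t desc_size)
{
        size_t bytes = (out + in) * desc_size;

        return (bytes + CACHE_LINE_BYTES - 1) & ~((size_t)CACHE_LINE_BYTES - 1);
}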
Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible
Il 06/09/2012 07:02, Michael S. Tsirkin ha scritto: >> > It might be worth just unconditionally having a cache for the 2 >> > descriptor case. This is what I get with qemu tap, though for some >> > reason the device features don't have guest or host CSUM, so my setup is >> > probably screwed: > Yes without checksum net core always linearizes packets, so yes it is > screwed. > For -net, skb always allocates space for 17 frags + linear part so > it seems sane to do same in virtio core, and allocate, for -net, > up to max_frags + 1 from cache. > We can adjust it: no _SG -> 2 otherwise 18. > > Not sure about other drivers, maybe really use 2 there for now. 2 should also be good for virtio-blk and virtio-scsi 4KB random rw workloads. Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible
On Friday, September 07, 2012 09:19:04 AM Rusty Russell wrote: > "Michael S. Tsirkin" writes: > > On Thu, Sep 06, 2012 at 05:27:23PM +0930, Rusty Russell wrote: > >> "Michael S. Tsirkin" writes: > >> > Yes without checksum net core always linearizes packets, so yes it is > >> > screwed. > >> > For -net, skb always allocates space for 17 frags + linear part so > >> > it seems sane to do same in virtio core, and allocate, for -net, > >> > up to max_frags + 1 from cache. > >> > We can adjust it: no _SG -> 2 otherwise 18. > >> > >> But I thought it used individual buffers these days? > > > > Yes for receive, no for transmit. That's probably why > > we should have the threshold per vq, not per device, BTW. > > Can someone actually run with my histogram patch and see what the real > numbers are? > I ran some TCP_RR and TCP_STREAM sessions, both host-to-guest and guest-to-host, with a form of the histogram patch applied against a RHEL6.3 kernel. The histogram values were reset after each test. Here are the results: 60 session TCP_RR from host-to-guest with 256 byte request and 256 byte response for 60 seconds: Queue histogram for virtio1: Size distribution for input (max=7818456): 1: 7818456 Size distribution for output (max=7816698): 2: 149 3: 7816698 4: 2 5: 1 Size distribution for control (max=1): 0: 0 4 session TCP_STREAM from host-to-guest with 4K message size for 60 seconds: Queue histogram for virtio1: Size distribution for input (max=16050941): 1: 16050941 Size distribution for output (max=1877796): 2: 1877796 3: 5 Size distribution for control (max=1): 0: 0 4 session TCP_STREAM from host-to-guest with 16K message size for 60 seconds: Queue histogram for virtio1: Size distribution for input (max=16831151): 1: 16831151 Size distribution for output (max=1923965): 2: 1923965 3: 5 Size distribution for control (max=1): 0: 0 4 session TCP_STREAM from guest-to-host with 4K message size for 60 seconds: Queue histogram for virtio1: Size distribution for input (max=1316069): 1: 1316069 Size distribution for output (max=879213): 2: 24 3: 24097 # 4: 23176 # 5: 3412 6: 4446 7: 4663 8: 4195 9: 3772 10: 3388 11: 3666 12: 2885 13: 2759 14: 2997 15: 3060 16: 2651 17: 2235 18: 92721 ## 19: 879213 Size distribution for control (max=1): 0: 0 4 session TCP_STREAM from guest-to-host with 16K message size for 60 seconds: Queue histogram for virtio1: Size distribution for input (max=1428590): 1: 1428590 Size distribution for output (max=957774): 2: 20 3: 54955 ### 4: 34281 ## 5: 2967 6: 3394 7: 9400 8: 3061 9: 3397 10: 3258 11: 3275 12: 3147 13: 2876 14: 2747 15: 2832 16: 2013 17: 1670 18: 100369 ## 19: 957774 Size distribution for control (max=1): 0: 0 Thanks, Tom > I'm not convinced that the ideal 17-buffer case actually happens as much > as we think. And if it's not happening with this netperf test, we're > testing the wrong thing. > > Thanks, > Rusty. > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFCv3] virtio_console: Add support for virtio remoteproc serial
From: Sjur Brændeland Add a virtio remoteproc serial driver: VIRTIO_ID_RPROC_SERIAL (0xB) for communicating with a remote processor in an asymmetric multi-processing configuration. The virtio remoteproc serial driver reuses the existing virtio_console implementation, and adds support for DMA allocation of data buffers but disables support for tty console, mutiple ports, and the control queue. Signed-off-by: Sjur Brændeland cc: Rusty Russell cc: Michael S. Tsirkin cc: Amit Shah cc: Ohad Ben-Cohen cc: Linus Walleij cc: virtualization@lists.linux-foundation.org cc: linux-ker...@vger.kernel.org --- Hi Rusty, ... >Sorry for the back and forth, I've been pondering MST's points. >If we make a new dma-multiport device (eg. ID 11), how ugly is the code? I don't think it's too bad. It's a few more code lines - as I have added a new virtio_driver definition and feature table. The rest is just replacing the check on feature bits with driver type. I no longer need to check on feature bits in the probe function, as the match on device type is done automatically by the driver framework. I actually like this new approach better. It solves the issues Michael has pointed out, and we don't have to think through side effects of weired combination of feature bits. >It would be a virtio console with DMA buffers and no console, just the >multiport stuff. This would have no impact on the current spec for >virtio console. Yes, and this way it will also be easier to explain what this new feature does. Regards, Sjur drivers/char/virtio_console.c | 133 + include/linux/virtio_ids.h|1 + 2 files changed, 109 insertions(+), 25 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index cdf2f54..08593aa 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -35,6 +35,7 @@ #include #include #include +#include #include "../tty/hvc/hvc_console.h" /* @@ -323,6 +324,52 @@ static bool is_console_port(struct port *port) return false; } +#ifdef CONFIG_HAS_DMA +static inline bool is_rproc_serial(struct virtio_device *vdev) +{ + return vdev->id.device == VIRTIO_ID_RPROC_SERIAL; +} +#else +static inline bool is_rproc_serial(struct virtio_device *vdev) +{ + return false; +} +#endif + +/* Allocate data buffer from DMA memory if requested */ +static inline void * +alloc_databuf(struct virtio_device *vdev, size_t size, gfp_t flag) +{ + if (is_rproc_serial(vdev)) { + dma_addr_t dma; + struct device *dev = &vdev->dev; + /* +* Allocate DMA memory from ancestors. Finding the ancestor +* is a bit quirky when DMA_MEMORY_INCLUDES_CHILDREN is not +* implemented. +*/ + dev = dev->parent ? dev->parent : dev; + dev = dev->parent ? dev->parent : dev; + return dma_alloc_coherent(dev, size, &dma, flag); + } + return kzalloc(size, flag); +} + +static inline void +free_databuf(struct virtio_device *vdev, size_t size, void *cpu_addr) +{ + + if (is_rproc_serial(vdev)) { + struct device *dev = &vdev->dev; + dma_addr_t dma_handle = virt_to_bus(cpu_addr); + dev = dev->parent ? dev->parent : dev; + dev = dev->parent ? 
dev->parent : dev; + dma_free_coherent(dev, size, cpu_addr, dma_handle); + return; + } + kfree(cpu_addr); +} + static inline bool use_multiport(struct ports_device *portdev) { /* @@ -334,20 +381,21 @@ static inline bool use_multiport(struct ports_device *portdev) return portdev->vdev->features[0] & (1 << VIRTIO_CONSOLE_F_MULTIPORT); } -static void free_buf(struct port_buffer *buf) +static void +free_buf(struct virtqueue *vq, struct port_buffer *buf, size_t buf_size) { - kfree(buf->buf); + free_databuf(vq->vdev, buf_size, buf->buf); kfree(buf); } -static struct port_buffer *alloc_buf(size_t buf_size) +static struct port_buffer *alloc_buf(struct virtqueue *vq, size_t buf_size) { struct port_buffer *buf; buf = kmalloc(sizeof(*buf), GFP_KERNEL); if (!buf) goto fail; - buf->buf = kzalloc(buf_size, GFP_KERNEL); + buf->buf = alloc_databuf(vq->vdev, buf_size, GFP_KERNEL); if (!buf->buf) goto free_buf; buf->len = 0; @@ -414,7 +462,7 @@ static void discard_port_data(struct port *port) port->stats.bytes_discarded += buf->len - buf->offset; if (add_inbuf(port->in_vq, buf) < 0) { err++; - free_buf(buf); + free_buf(port->in_vq, buf, PAGE_SIZE); } port->inbuf = NULL; buf = get_inbuf(port); @@ -485,7 +533,7 @@ static void reclaim_consumed_buffers(struct port *port)
Re: [Xen-devel] [patch 1/3] xen/privcmd: check for integer overflow in ioctl
On Mon, Sep 10, 2012 at 11:35:11AM +0100, David Vrabel wrote: > On 08/09/12 10:52, Dan Carpenter wrote: > > If m.num is too large then the "m.num * sizeof(*m.arr)" multiplication > > could overflow and the access_ok() check wouldn't test the right size. > > m.num is range checked later on so it doesn't matter that the > access_ok() checks might be wrong. A bit subtle, perhaps. > Yeah. It's too subtle for my static checker but not so subtle for a human being. Laziness on my part. Please drop this patch. regards, dan carpenter ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCHv4] virtio-spec: virtio network device multiqueue support
On 09/10/2012 02:33 PM, Michael S. Tsirkin wrote: On Mon, Sep 10, 2012 at 09:27:38AM +0300, Michael S. Tsirkin wrote: On Mon, Sep 10, 2012 at 09:16:29AM +0300, Michael S. Tsirkin wrote: On Mon, Sep 10, 2012 at 11:42:25AM +0930, Rusty Russell wrote: OK, I read the spec (pasted below for easy of reading), but I'm still confused over how this will work. I thought normal net drivers have the hardware provide an rxhash for each packet, and we map that to CPU to queue the packet on[1]. We hope that the receiving process migrates to that CPU, so xmit queue matches. This ony works sometimes. For example it's common to pin netperf to a cpu to get consistent performance. Proper hardware must obey what applications want it to do, not the other way around. For virtio this would mean a new per-packet rxhash value, right? Why are we doing something different? What am I missing? Thanks, Rusty. [1] Everything I Know About Networking I Learned From LWN: https://lwn.net/Articles/362339/ I think you missed this: Some network interfaces can help with the distribution of incoming packets; they have multiple receive queues and multiple interrupt lines. Others, though, are equipped with a single queue, meaning that the driver for that hardware must deal with all incoming packets in a single, serialized stream. Parallelizing such a stream requires some intelligence on the part of the host operating system. In other words RPS is a hack to speed up networking on cheapo hardware, this is one of the reasons it is off by default. Good hardware has multiple receive queues. We can implement a good one so we do not need RPS. Also not all guest OS-es support RPS. Does this clarify? I would like to add that on many processors, sending IPCs between guest CPUs requires exits on sending *and* receiving path, making it very expensive. A final addition: what you suggest above would be "TX follows RX", right? It is in anticipation of something like that, that I made steering programming so generic. I think TX follows RX is more immediately useful for reasons above but we can add both to spec and let drivers and devices decide what they want to support. Pls let me know. AFAIK, ixgbe does "rx follows tx". The only differences between ixgbe and virtio-net is that ixgbe driver programs the flow director during packet transmission but we suggest to do it silently in the device for simplicity. Even with this, more co-operation is still needed for the driver ( e.g ixgbe try to use per-cpu queue by setting affinity hint and using cpuid to choose the txq which could be reused in virtio-net driver). --- Transmit Packet Steering When VIRTIO_NET_F_MULTIQUEUE feature bit is negotiated, guest can use any of multiple configured transmit queues to transmit a given packet. To avoid packet reordering by device (which generally leads to performance degradation) driver should attempt to utilize the same transmit virtqueue for all packets of a given transmit flow. For bi-directional protocols (in practice, TCP), a given network connection can utilize both transmit and receive queues. For best performance, packets from a single connection should utilize the paired transmit and receive queues from the same virtqueue pair; for example both transmitqN and receiveqN. This rule makes it possible to optimize processing on the device side, but this is not a hard requirement: devices should function correctly even when this rule is not followed. 
Driver selects an active steering rule using VIRTIO_NET_CTRL_STEERING command (this controls both which virtqueue is selected for a given packet for receive and notifies the device which virtqueues are about to be used for transmit). This command accepts a single out argument in the following format: #define VIRTIO_NET_CTRL_STEERING 4 The field rule specifies the function used to select transmit virtqueue for a given packet; the field param makes it possible to pass an extra parameter if appropriate. When rule is set to VIRTIO_NET_CTRL_STEERING_SINGLE (this is the default) all packets are steered to the default virtqueue transmitq (1); param is unused; this is the default. With any other rule, When rule is set to VIRTIO_NET_CTRL_STEERING_RX_FOLLOWS_TX packets are steered by driver to the first N=(param+1) multiqueue virtqueues transmitq1...transmitqN; the default transmitq is unused. Driver must have configured all these (param+1) virtqueues beforehand. Supported steering rules can be added and removed in the future. Driver should check that the request to change the steering rule was successful by checking ack values of the command. As selecting a specific steering is an optimization feature, drivers should avoid hard failure and fall back on using a supported steering rule if this command fails. The default steering rule is VIRTIO_NET_CTRL_STEERING_SINGLE. It will not be removed. When the steering rule
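[The structure layout that "the following format" refers to appears to have been dropped from this copy of the spec text, and the quoted message is cut off above. Judging from the rule and param fields it describes, the layout was presumably something close to the sketch below; the exact field widths are a guess, not taken from the posted patch:]

struct virtio_net_ctrl_steering {
        uint8_t rule;   /* VIRTIO_NET_CTRL_STEERING_SINGLE / _RX_FOLLOWS_TX */
        uint8_t param;  /* rule-specific parameter, e.g. N-1 for N queue pairs */
} __attribute__((packed));

#define VIRTIO_NET_CTRL_STEERING 4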
[patch 3/3 v2] xen/privcmd: add a __user annotation to a cast
Sparse complains that we lose the __user annotation in the cast here:

drivers/xen/privcmd.c:388:35: warning: cast removes address space of expression
drivers/xen/privcmd.c:388:32: warning: incorrect type in assignment (different address spaces)
drivers/xen/privcmd.c:388:32:    expected unsigned long [noderef] [usertype] *[addressable] [assigned] user_mfn
drivers/xen/privcmd.c:388:32:    got unsigned long [usertype] *

Signed-off-by: Dan Carpenter
---
v2: In v1 I removed the const from the declaration, but now I just remove it in the cast. This data can be either a v1 or a v2 type struct. The m.arr data is const in version 2 but not in version 1.

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 0ce006a..fceb83e 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -389,7 +389,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 
 	if (state.global_error && (version == 1)) {
 		/* Write back errors in second pass. */
-		state.user_mfn = (xen_pfn_t *)m.arr;
+		state.user_mfn = (xen_pfn_t __user *)m.arr;
 		state.err = err_array;
 		ret = traverse_pages(m.num, sizeof(xen_pfn_t),
 				     &pagelist, mmap_return_errors_v1, &state);
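To illustrate the annotation sparse is checking, here is a standalone example, not taken from privcmd: __user marks a pointer as living in the user address space, a cast that drops it produces exactly the "cast removes address space" warning quoted above, while a cast that keeps __user only discards const. The __user definition below roughly mirrors the kernel's, and example_state/set_user_mfn are made-up names.

/* Self-contained sketch that can be fed to sparse on its own. */
#ifdef __CHECKER__
# define __user __attribute__((noderef, address_space(1)))
#else
# define __user
#endif

typedef unsigned long xen_pfn_t;

struct example_state {
	xen_pfn_t __user *user_mfn;	/* points into userspace */
};

static void set_user_mfn(struct example_state *state,
			 const xen_pfn_t __user *arr)
{
	/* (xen_pfn_t *)arr would discard the address space and trigger the
	 * sparse warning; keeping __user in the cast only casts away const. */
	state->user_mfn = (xen_pfn_t __user *)arr;
}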
Re: [Xen-devel] [patch 3/3] xen/privcmd: remove const modifier from declaration
>>> On 08.09.12 at 11:58, Dan Carpenter wrote:
> When we use this pointer, we cast away the const modifier and modify the
> data. I think it was an accident to declare it as const.

NAK - the const is very valid here, as the v2 interface (as opposed to the v1 one) does _not_ modify this array (or if it does, it's a bug). This is a guarantee made to user mode, so it should also be expressed that way in the interface.

But of course the cast used before this patch isn't right either, as it indeed inappropriately discards the qualifier. Afaiu this was done to simplify the internal workings of the code, but I don't think it's desirable to sacrifice type safety for implementation simplicity.

Jan

> Signed-off-by: Dan Carpenter
>
> diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h
> index a853168..58ed953 100644
> --- a/include/xen/privcmd.h
> +++ b/include/xen/privcmd.h
> @@ -69,7 +69,7 @@ struct privcmd_mmapbatch_v2 {
>  	unsigned int num;		/* number of pages to populate */
>  	domid_t dom;			/* target domain */
>  	__u64 addr;			/* virtual address */
> -	const xen_pfn_t __user *arr;	/* array of mfns */
> +	xen_pfn_t __user *arr;		/* array of mfns */
>  	int __user *err;		/* array of error codes */
>  };
>
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 0ce006a..fceb83e 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -389,7 +389,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>
>  	if (state.global_error && (version == 1)) {
>  		/* Write back errors in second pass. */
> -		state.user_mfn = (xen_pfn_t *)m.arr;
> +		state.user_mfn = m.arr;
>  		state.err = err_array;
>  		ret = traverse_pages(m.num, sizeof(xen_pfn_t),
>  				     &pagelist, mmap_return_errors_v1, &state);
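The point turns on one ioctl handler serving two user-visible structures: the v1 interface writes per-frame errors back into arr, so its arr cannot be const, while the v2 interface reports errors through a separate err array and promises not to touch arr. A rough, paraphrased sketch of the two layouts follows; the typedefs are stand-ins and the v1 field layout is reconstructed from the discussion, not quoted from the real headers.

#include <stdint.h>

/* Stand-ins so the sketch is self-contained; in the kernel these come from
 * the Xen and privcmd headers. */
typedef uint16_t domid_t;
typedef unsigned long xen_pfn_t;
typedef uint64_t __u64;
#define __user		/* sparse annotation, empty outside the kernel */

/* v1: errors are written back into arr itself, so arr cannot be const. */
struct privcmd_mmapbatch_sketch {
	int num;			/* number of pages to populate */
	domid_t dom;			/* target domain */
	__u64 addr;			/* virtual address */
	xen_pfn_t __user *arr;		/* mfns in, per-frame errors out */
};

/* v2: arr is input-only (hence const); errors go to the separate err array. */
struct privcmd_mmapbatch_v2_sketch {
	unsigned int num;
	domid_t dom;
	__u64 addr;
	const xen_pfn_t __user *arr;	/* array of mfns, never modified */
	int __user *err;		/* array of per-frame error codes */
};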
Re: [PATCH] Add a page cache-backed balloon device driver.
On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> This implementation of a virtio balloon driver uses the page cache to
> "store" pages that have been released to the host. The communication
> (outside of target counts) is one way--the guest notifies the host when
> it adds a page to the page cache, allowing the host to madvise(2) with
> MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit
> (via the regular page reclaim). This means that inflating the balloon
> is similar to the existing balloon mechanism, but the deflate is
> different--it re-uses existing Linux kernel functionality to
> automatically reclaim.
>
> Signed-off-by: Frank Swiderski

I've been trying to understand this, and I have a question: what exactly is the benefit of this new device? Note that users could not care less about how a driver is implemented internally. Is there some workload where you see the VM working better with this than with the regular balloon? Any numbers?

Also, can't we just replace the existing balloon implementation with this one? Why is it so important to deflate silently? I guess the filesystem does not currently get a callback before a page is reclaimed, but this is an implementation detail - maybe this can be fixed?

Also, can you please answer Avi's question: how is overcommit managed?

> ---
>  drivers/virtio/Kconfig              |   13 +
>  drivers/virtio/Makefile             |    1 +
>  drivers/virtio/virtio_fileballoon.c |  636 +++
>  include/linux/virtio_balloon.h      |    9 +
>  include/linux/virtio_ids.h          |    1 +
>  5 files changed, 660 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/virtio/virtio_fileballoon.c
>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index f38b17a..cffa2a7 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -35,6 +35,19 @@ config VIRTIO_BALLOON
>
>  	  If unsure, say M.
>
> +config VIRTIO_FILEBALLOON
> +	tristate "Virtio page cache-backed balloon driver"
> +	select VIRTIO
> +	select VIRTIO_RING
> +	---help---
> +	  This driver supports decreasing and automatically reclaiming the
> +	  memory within a guest VM.  Unlike VIRTIO_BALLOON, this driver instead
> +	  tries to maintain a specific target balloon size using the page cache.
> +	  This allows the guest to implicitly deflate the balloon by flushing
> +	  pages from the cache and touching the page.
> +
> +	  If unsure, say N.
> +
>  config VIRTIO_MMIO
>  	tristate "Platform bus driver for memory mapped virtio devices (EXPERIMENTAL)"
>  	depends on HAS_IOMEM && EXPERIMENTAL
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 5a4c63c..7ca0a3f 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
>  obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
>  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> +obj-$(CONFIG_VIRTIO_FILEBALLOON) += virtio_fileballoon.o
> diff --git a/drivers/virtio/virtio_fileballoon.c b/drivers/virtio/virtio_fileballoon.c
> new file mode 100644
> index 000..ff252ec
> --- /dev/null
> +++ b/drivers/virtio/virtio_fileballoon.c
> @@ -0,0 +1,636 @@
> +/* Virtio file (page cache-backed) balloon implementation, inspired by
> + * Dor Loar and Marcelo Tosatti's implementations, and based on Rusty Russel's
> + * implementation.
> + *
> + * This implementation of the virtio balloon driver re-uses the page cache to
> + * allow memory consumed by inflating the balloon to be reclaimed by linux.  It
> + * creates and mounts a bare-bones filesystem containing a single inode.  When
> When > + * the host requests the balloon to inflate, it does so by "reading" pages at > + * offsets into the inode mapping's page_tree. The host is notified when the > + * pages are added to the page_tree, allowing it (the host) to madvise(2) the > + * corresponding host memory, reducing the RSS of the virtual machine. In > this > + * implementation, the host is only notified when a page is added to the > + * balloon. Reclaim happens under the existing TTFP logic, which flushes > unused > + * pages in the page cache. If the host used MADV_DONTNEED, then when the > guest > + * uses the page, the zero page will be mapped in, allowing automatic (and > fast, > + * compared to requiring a host notification via a virtio queue to get memory > + * back) reclaim. > + * > + * Copyright 2008 Rusty Russell IBM Corporation > + * Copyright 2011 Frank Swiderski Google Inc > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY W
Re: [RFCv2 1/2] virtio_console: Add support for DMA memory allocation
sjur.brandel...@stericsson.com writes:
> From: Sjur Brændeland
>
> Add the feature VIRTIO_CONSOLE_F_DMA_MEM. If the architecture has
> DMA support and this feature bit is set, the virtio data buffers
> will be allocated from DMA memory. If the device requests
> the feature VIRTIO_CONSOLE_F_DMA_MEM, but the architecture
> doesn't support DMA, the driver's probe function will fail.
>
> This is needed for using virtio_console from the remoteproc
> framework.

Sorry for the back and forth, I've been pondering MST's points.

If we make a new dma-multiport device (e.g. ID 11), how ugly is the code? It would be a virtio console with DMA buffers and no console, just the multiport stuff. This would have no impact on the current spec for virtio console.

Thanks, Rusty.
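Whichever way the spec question is settled, the driver-side change largely amounts to choosing where buffers come from. A hedged sketch of that choice, not taken from the patch: alloc_port_buf, free_port_buf and use_dma_mem are invented names, and only dma_alloc_coherent/dma_free_coherent and kmalloc/kfree are real kernel APIs. The idea is to allocate from coherent DMA memory when the proposed feature bit is negotiated and fall back to kmalloc otherwise.

#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <linux/types.h>

/* Allocate a data buffer for a port: from coherent DMA memory owned by the
 * underlying device when use_dma_mem is set, otherwise from the slab as the
 * driver does today. */
static void *alloc_port_buf(struct device *dev, bool use_dma_mem,
			    size_t buf_size, dma_addr_t *dma_handle)
{
	if (use_dma_mem)
		return dma_alloc_coherent(dev, buf_size, dma_handle,
					  GFP_KERNEL);

	*dma_handle = 0;
	return kmalloc(buf_size, GFP_KERNEL);
}

/* Free with the matching allocator. */
static void free_port_buf(struct device *dev, bool use_dma_mem, void *buf,
			  size_t buf_size, dma_addr_t dma_handle)
{
	if (use_dma_mem)
		dma_free_coherent(dev, buf_size, buf, dma_handle);
	else
		kfree(buf);
}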