Re: [PATCH] kvm: fast-path msi injection with irqfd

2010-11-19 Thread Gregory Haskins
 On 11/19/2010 at 10:54 AM, in message 20101119155427.ga20...@amt.cnet,
Marcelo Tosatti mtosa...@redhat.com wrote: 
 On Thu, Nov 18, 2010 at 07:09:08PM +0200, Michael S. Tsirkin wrote:
 Store irq routing table pointer in the irqfd object,
 and use that to inject MSI directly without bouncing out to
 a kernel thread.
 
 While we touch this structure, rearrange irqfd fields to make fastpath
 better packed for better cache utilization.
 
 This also adds some comments about locking rules and rcu usage in code.
 
 Some notes on the design:
 - Use pointer into the rt instead of copying an entry,
   to make it possible to use rcu, thus side-stepping
   locking complexities.  We also save some memory this way.
 - Old workqueue code is still used for level irqs.
   I don't think we DTRT with level anyway, however,
   it seems easier to keep the code around as
   it has been thought through and debugged, and fix level later than
   rip out and re-instate it later.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 OK, this seems to work fine for me. Tested with virtio-net in guest
 with and without vhost-net.  Pls review/apply if appropriate.
 
 Acked-by: Marcelo Tosatti mtosa...@redhat.com

Acked-by: Gregory Haskins ghask...@novell.com
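To make the fast path concrete, here is a rough sketch of the shape it takes (an illustration only, not the literal patch; the struct _irqfd field names irq_entry and kvm, the use of the existing kvm_set_msi() helper, and the fallback to the workqueue when the cached routing entry is gone are assumptions here):

static int
irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
	struct kvm_kernel_irq_routing_entry *irq;
	unsigned long flags = (unsigned long)key;

	if (flags & POLLIN) {
		rcu_read_lock();
		irq = rcu_dereference(irqfd->irq_entry);
		if (irq)
			/* fast path: inject the MSI directly from the wakeup */
			kvm_set_msi(irq, irqfd->kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1);
		else
			/* routing changed under us; fall back to the workqueue */
			schedule_work(&irqfd->inject);
		rcu_read_unlock();
	}

	/* POLLHUP/deassign handling omitted from this sketch */
	return 0;
}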





Re: [PATCH] kvm: fix irqfd assign/deassign race

2010-09-19 Thread Gregory Haskins
 On 9/19/2010 at 01:02 PM, in message 20100919170231.ga12...@redhat.com,
Michael S. Tsirkin m...@redhat.com wrote: 
 I think I see the following (theoretical) race:
 
 During irqfd assign, we drop irqfds lock before we
 schedule inject work. Therefore, deassign running
 on another CPU could cause shutdown and flush to run
 before inject, causing use after free in inject.
 
 A simple fix is to schedule inject under the lock.

I swear there was some reason why the schedule_work() was done outside of the 
lock, but I can't for the life of me remember why anymore (it obviously was a 
failing on my part not to _comment_ why, if there was such a reason).  So, short of 
recalling what that reason was, and given that Michael's theory seems 
rational and legit... 

Acked-by: Gregory Haskins ghask...@novell.com

 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 If the issue is real, this might be a 2.6.36 and -stable
 candidate. Comments?
 
  virt/kvm/eventfd.c |3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)
 
 diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
 index 66cf65b..c1f1e3c 100644
 --- a/virt/kvm/eventfd.c
 +++ b/virt/kvm/eventfd.c
 @@ -218,7 +218,6 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
  events = file->f_op->poll(file, &irqfd->pt);
 
  list_add_tail(&irqfd->list, &kvm->irqfds.items);
 - spin_unlock_irq(&kvm->irqfds.lock);
 
  /*
   * Check if there was an event already pending on the eventfd
 @@ -227,6 +226,8 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
  if (events & POLLIN)
  schedule_work(&irqfd->inject);
 
 + spin_unlock_irq(&kvm->irqfds.lock);
 +
   /*
* do not drop the file until the irqfd is fully initialized, otherwise
* we might race against the POLLHUP






Re: [PATCH] kvm: only allow one gsi per fd

2010-01-13 Thread Gregory Haskins
 On 1/13/2010 at 11:58 AM, in message 20100113165809.ga13...@redhat.com,
Michael S. Tsirkin m...@redhat.com wrote: 
 Looks like repeatedly binding same fd to multiple gsi's with irqfd can
 use up a ton of kernel memory for irqfd structures.
 
 A simple fix is to allow each fd to only trigger one gsi: triggering a
 storm of interrupts in guest is likely useless anyway, and we can do it
 by binding a single gsi to many interrupts if we really want to.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com

Seems reasonable to me.

Acked-by: Gregory Haskins ghask...@novell.com

 ---
 
 This patch is IMO a good candidate for 2.6.33 and 2.6.32.x.
 
  virt/kvm/eventfd.c |   15 +++++++++++++--
  1 files changed, 13 insertions(+), 2 deletions(-)
 
 diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
 index 30f70fd..62e4cd9 100644
 --- a/virt/kvm/eventfd.c
 +++ b/virt/kvm/eventfd.c
 @@ -166,7 +166,7 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
  static int
  kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
  {
 - struct _irqfd *irqfd;
 + struct _irqfd *irqfd, *tmp;
   struct file *file = NULL;
   struct eventfd_ctx *eventfd = NULL;
   int ret;
 @@ -203,9 +203,20 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
  init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
  init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
 
 + spin_lock_irq(&kvm->irqfds.lock);
 +
 + ret = 0;
 + list_for_each_entry(tmp, &kvm->irqfds.items, list) {
 + if (irqfd->eventfd != tmp->eventfd)
 + continue;
 + /* This fd is used for another irq already. */
 + ret = -EBUSY;
 + spin_unlock_irq(&kvm->irqfds.lock);
 + goto fail;
 + }
 +
  events = file->f_op->poll(file, &irqfd->pt);
 
 - spin_lock_irq(&kvm->irqfds.lock);
  list_add_tail(&irqfd->list, &kvm->irqfds.items);
  spin_unlock_irq(&kvm->irqfds.lock);
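
For illustration, a hedged userspace sketch of how the new check surfaces through the existing KVM_IRQFD ioctl; vm_fd is assumed to be an already-created KVM VM descriptor, and error handling is kept minimal:

#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vm_fd: an already-created KVM VM file descriptor (assumed) */
static void irqfd_demo(int vm_fd)
{
	struct kvm_irqfd irqfd;
	int efd = eventfd(0, 0);

	memset(&irqfd, 0, sizeof(irqfd));
	irqfd.fd  = efd;
	irqfd.gsi = 10;
	if (ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0)
		perror("first KVM_IRQFD");

	irqfd.gsi = 11;		/* same eventfd, different gsi */
	if (ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0)
		perror("second KVM_IRQFD");	/* with this patch: EBUSY */
}

With the patch applied, the second ioctl is expected to fail with EBUSY, while binding different eventfds to different gsis continues to work as before.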
  






Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-27 Thread Gregory Haskins
On 12/27/09 4:15 AM, Avi Kivity wrote:
 On 12/23/2009 11:21 PM, Gregory Haskins wrote:
 That said, you are still incorrect.  With what I proposed, the model
 will run as an in-kernel vbus device, and no longer run in userspace.
 It would therefore improve virtio-net as I stated, much in the same
 way vhost-net or venet-tap do today.

 
 That can't work.  virtio-net has its own ABI on top of virtio, for
 example it prepends a header for TSO information.  Maybe if you disable
 all features it becomes compatible with venet, but that cripples it.
 


You are confused.  The backend would be virtio-net specific, and would
therefore understand the virtio-net ABI.  It would support any feature
of virtio-net as long as it was implemented and negotiated by both sides
of the link.

-Greg





Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-27 Thread Gregory Haskins
On 12/27/09 4:33 AM, Avi Kivity wrote:
 On 12/24/2009 11:36 AM, Gregory Haskins wrote:
 As a twist on this, the VMware paravirt driver interface is so
 hardware-like that they're getting hardware vendors to supply cards that
 implement it.  Try that with a pure software approach.
  
 Any hardware engineer (myself included) will tell you that, generally
 speaking, what you can do in hardware you can do in software (think of
 what QEMU does today, for instance).  It's purely a cost/performance
 tradeoff.

 I can at least tell you that is true of vbus.  Anything on the vbus side
 would be equally eligible for a hardware implementation, though there is
 no reason to do this today since we have equivalent functionality in
 baremetal already.
 
 There's a huge difference in the probability of vmware getting cards to
 their spec, or x86 vendors improving interrupt delivery to guests,
 compared to vbus being implemented in hardware.

That's not relevant, however.  I said in the original quote that you
snipped that I made it a software design on purpose, and you tried to
somehow paint that as a negative because VMware made theirs
hardware-like, implying it could not be done with my approach with the
statement "try that with a pure software approach".  The bottom line is
that the statement is incorrect and/or misleading.

 
 The only motiviation is if you wanted to preserve
 ABI etc, which is what vmware is presumably after.  However, I am not
 advocating this as necessary at this juncture.

 
 Maybe AlacrityVM users don't care about compatibility, but my users do.

Again, not relevant to this thread.  Making your interface
hardware-like buys you nothing in the end, as you ultimately need to
load drivers in the guest either way, and any major OS lets you extend
both devices and buses with relative ease.  The only counter example
would be if you were truly hardware-exact, like e1000 emulation, but
we already know that this means it is hardware-centric, not
exit-rate aware, and would perform poorly.  Otherwise, "compatible" is
purely a point on the timeline (for instance, the moment the virtio-pci ABI
shipped), not an architectural description such as "hardware-like".



 






Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-27 Thread Gregory Haskins
On 12/27/09 8:27 AM, Avi Kivity wrote:
 On 12/27/2009 03:18 PM, Gregory Haskins wrote:
 On 12/27/09 4:15 AM, Avi Kivity wrote:
   
 On 12/23/2009 11:21 PM, Gregory Haskins wrote:
 
 That said, you are still incorrect.  With what I proposed, the model
 will run as an in-kernel vbus device, and no longer run in userspace.
 It would therefore improve virtio-net as I stated, much in the same
 way vhost-net or venet-tap do today.


 That can't work.  virtio-net has its own ABI on top of virtio, for
 example it prepends a header for TSO information.  Maybe if you disable
 all features it becomes compatible with venet, but that cripples it.

  
 You are confused.  The backend would be virtio-net specific, and would
 therefore understand the virtio-net ABI.  It would support any feature
 of virtio-net as long as it was implemented and negotiated by both sides
 of the link.

 
 Then we're back to square one.  A nice demonstration of vbus
 flexibility, but no help for virtio.
 

No, where we are is at the point where we demonstrate that your original
statement that I did nothing to improve virtio was wrong.

-Greg





Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-27 Thread Gregory Haskins
On 12/27/09 8:49 AM, Avi Kivity wrote:
 On 12/27/2009 03:34 PM, Gregory Haskins wrote:
 On 12/27/09 4:33 AM, Avi Kivity wrote:
   
 On 12/24/2009 11:36 AM, Gregory Haskins wrote:
 
 As a twist on this, the VMware paravirt driver interface is so
 hardware-like that they're getting hardware vendors to supply cards
 that
 implement it.  Try that with a pure software approach.

  
 Any hardware engineer (myself included) will tell you that, generally
 speaking, what you can do in hardware you can do in software (think of
 what QEMU does today, for instance).  It's purely a cost/performance
 tradeoff.

 I can at least tell you that is true of vbus.  Anything on the vbus
 side
 would be equally eligible for a hardware implementation, though
 there is
 no reason to do this today since we have equivalent functionality in
 baremetal already.

 There's a huge difference in the probability of vmware getting cards to
 their spec, or x86 vendors improving interrupt delivery to guests,
 compared to vbus being implemented in hardware.
  
 That's not relevant, however.  I said in the original quote that you
 snipped that I made it a software design on purpose, and you tried to
 somehow paint that as a negative because VMware made theirs
 hardware-like, implying it could not be done with my approach with the
 statement "try that with a pure software approach".  The bottom line is
 that the statement is incorrect and/or misleading.

 
 It's not incorrect.

At the very best it's misleading.

 VMware stuck to the pci specs and as a result they
 can have hardware implement their virtual NIC protocol.  For vbus this
 is much harder

Not really.

 to do since you need a side-channel between different
 cards to coordinate interrupt delivery.  In theory you can do everything
 if you don't consider practicalities.

PCI-based designs, such as VMware's and virtio-pci, aren't free of this
notion either.  They simply rely on APIC emulation for the irq-chip, and
it just so happens that vbus implements a different irq-chip (more
specifically, the connector that we employ between the guest and vbus
does).  On one hand, you have the advantage of the guest already
supporting the irq-chip ABI, and on the other, you have an optimized
(e.g. shared-memory based inject/ack) and feature-enhanced ABI
(interrupt priority, no IDT constraints, etc).  There are pros and cons to
either direction, but the vbus project charter is to go for maximum
performance and features, so that is acceptable to us.


 
 That's a digression, though, I'm not suggesting we'll see virtio
 hardware or that this is a virtio/pci advantage vs. vbus.  It's an
 anecdote showing that sticking with specs has its advantages.

It also has distinct disadvantages.  For instance, the PCI spec is
gigantic, yet almost none of it is needed to do the job here.  When you
are talking full-virt, you are left with no choice.  With para-virt, you
do have a choice, and the vbus-connector for AlacrityVM capitalizes on this.

As an example, think about all the work that went into emulating the PCI
chipset, the APIC chipset, MSI-X support, irq-routing, etc, when all you
needed was a simple event-queue to indicate that an event (e.g. an
interrupt) occurred.

This is an example connector in vbus:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/connectors/null.c;h=b6d16cb68b7e49e07528278bc9f5b73e1dac0c2f;hb=HEAD

It encapsulates all of hotplug, signal (interrupt) routing, and memory
routing for both sides of the link in 584 lines of code.  And that
also implicitly brings in device discovery and configuration since that
is covered by the vbus framework.  Try doing that with PCI, especially
when you are not already under the qemu umbrella, and the standards
based approach suddenly doesn't look very attractive.

 
 wrt pci vs vbus, the difference is in the ability to use improvements in
 interrupt delivery accelerations in virt hardware.

Most of which will also apply to the current vbus design, since at some
point I have to have an underlying IDT mechanism too, btw.

  If this happens,
 virtio/pci can immediately take advantage of it, while vbus has to stick
 with software delivery for backward compatibility, and all that code
 becomes a useless support burden.


The shared-memory path will always be the fastest anyway, so I am not
too worried about it.  But vbus supports feature negotiation, so we can
always phase that out if need be, same as anything else.

 As an example of what hardware can do when it really sets its mind to
 it, s390 can IPI from vcpu to vcpu without exiting to the host.

Great!  I am just not in the practice of waiting for hardware to cover
sloppy software.  There is a ton of impracticality in doing so, such as
the fact that the hardware, even once available, will not be ubiquitous
instantly.

 
 The only motivation is if you wanted to preserve
 ABI etc, which is what vmware is presumably

Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-27 Thread Gregory Haskins
On 12/27/09 8:49 AM, Avi Kivity wrote:
 On 12/27/2009 03:39 PM, Gregory Haskins wrote:
 No, where we are is at the point where we demonstrate that your original
 statement that I did nothing to improve virtio was wrong.


 
 I stand by it.  virtio + your patch does nothing without a ton more work
 (more or less equivalent to vhost-net).
 

Perhaps, but my work predates vhost-net by months and that has nothing
to do with what we are talking about anyway.  Since you snipped your
original comment that started the thread, here it is again:

On 12/23/09 5:22 AM, Avi Kivity wrote:
 
  There was no attempt by Gregory to improve virtio-net.

It's not a gray area, nor open to interpretation.  That statement was,
is, and will always be demonstrably false, so I'm sorry but you are
still wrong.

-Greg





Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-24 Thread Gregory Haskins
On 12/23/09 3:36 PM, Avi Kivity wrote:
 On 12/23/2009 06:44 PM, Gregory Haskins wrote:

   - Are a pure software concept
  
 By design.  In fact, I would describe it as software to software
 optimized as opposed to trying to shoehorn into something that was
 designed as a software-to-hardware interface (and therefore has
 assumptions about the constraints in that environment that are not
 applicable in software-only).


 
 And that's the biggest mistake you can make.

Sorry, that is just wrong or you wouldn't have virtio either.

  Look at Xen, for
 instance.  They paravirtualized the fork out of everything that moved in
 order to get x86 virt going.  And where are they now? x86_64 syscalls
 are slow since they have to trap to the hypervisor and (partially) flush
 the tlb.  With npt or ept capable hosts performance is better for many
 workloads on fullvirt.  And paravirt doesn't support Windows.  Their
 unsung hero Jeremy is still trying to upstream dom0 Xen support.  And
 they get to support it forever.

We are only talking about PV-IO here, so it's not apples to apples with what
Xen is going through.

 
 VMware stuck with the hardware defined interfaces.  Sure they had to
 implement binary translation to get there, but as a result, they only
 have to support one interface, all guests support it, and they can drop
 it on newer hosts where it doesn't give them anything.

Again, you are confusing this with PV-IO.  Not relevant here.  AFAICT,
VMware, KVM, Xen, etc., all still do PV-IO and likely will for the
foreseeable future.

-Greg





Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-24 Thread Gregory Haskins
On 12/23/09 4:01 PM, Avi Kivity wrote:
 On 12/23/2009 10:36 PM, Avi Kivity wrote:
 On 12/23/2009 06:44 PM, Gregory Haskins wrote:

   - Are a pure software concept
 By design.  In fact, I would describe it as software to software
 optimized as opposed to trying to shoehorn into something that was
 designed as a software-to-hardware interface (and therefore has
 assumptions about the constraints in that environment that are not
 applicable in software-only).


 And that's the biggest mistake you can make.  Look at Xen, for
 instance.  They paravirtualized the fork out of everything that moved
 in order to get x86 virt going.  And where are they now? x86_64
 syscalls are slow since they have to trap to the hypervisor and
 (partially) flush the tlb.  With npt or ept capable hosts performance
 is better for many workloads on fullvirt.  And paravirt doesn't
 support Windows.  Their unsung hero Jeremy is still trying to upstream
 dom0 Xen support.  And they get to support it forever.

 VMware stuck with the hardware defined interfaces.  Sure they had to
 implement binary translation to get there, but as a result, they only
 have to support one interface, all guests support it, and they can
 drop it on newer hosts where it doesn't give them anything.
 
 As a twist on this, the VMware paravirt driver interface is so
 hardware-like that they're getting hardware vendors to supply cards that
 implement it.  Try that with a pure software approach.

Any hardware engineer (myself included) will tell you that, generally
speaking, what you can do in hardware you can do in software (think of
what QEMU does today, for instance).  It's purely a cost/performance
tradeoff.

I can at least tell you that is true of vbus.  Anything on the vbus side
would be equally eligible for a hardware implementation, though there is
no reason to do this today since we have equivalent functionality in
baremetal already.  The only motivation is if you wanted to preserve
ABI etc, which is what vmware is presumably after.  However, I am not
advocating this as necessary at this juncture.

So sorry, your statement is not relevant.

-Greg




 






Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Gregory Haskins
On 12/23/09 1:51 AM, Ingo Molnar wrote:
 
 * Anthony Liguori anth...@codemonkey.ws wrote:
 
 On 12/22/2009 10:01 AM, Bartlomiej Zolnierkiewicz wrote:
 new e1000 driver is more superior in architecture and do the required
 work to make the new e1000 driver a full replacement for the old one.
 Right, like everyone actually does things this way..

 I wonder why do we have OSS, old Firewire and IDE stacks still around then?

 And it's always a source of pain, isn't it.
 
 Even putting aside the fact that such overlap sucks and is a pain to users 
 (and that 98% of driver and subsystem version transitions are done completely 
 seamlessly to users - the examples that were cited were the odd ones out of 
 150K commits in the past 4 years, 149K+ of which are seamless), the comparison 
 does not even apply really.
 
 e1000, OSS, old Firewire and IDE are hardware stacks, where hardware is a not 
 fully understood externality, with its inevitable set of compatibility woes. 
 There's often situations where one piece of hardware still works better with 
 the old driver, for some odd (or not so odd) reason.
 
 Also, note that the 'new' hw drivers are generally intended and are 
 maintained 
 as clear replacements for the old ones, and do so with minimal ABI changes - 
 or preferably with no ABI changes at all. Most driver developers just switch 
 from old to new and the old bits are left around and are phased out. We 
 phased 
 out old OSS recently.
 
 That is a very different situation from the AlacrityVM patches, which:
 
  - Are a pure software concept

By design.  In fact, I would describe it as software to software
optimized as opposed to trying to shoehorn into something that was
designed as a software-to-hardware interface (and therefore has
assumptions about the constraints in that environment that are not
applicable in software-only).

 and any compatibility mismatch is self-inflicted.

.. because the old model is not great for the intended use cases and has
issues.  I've already covered the reasons why ad nauseam.

  The patches are in fact breaking the ABI to KVM intentionally (for better or 
 worse).

No, at the very worst they are _augmenting_ the ABI, as evident from the
fact that AlacrityVM is a superset of the entire KVM ABI.

 
  - Gregory claims that the AlacrityVM patches are not a replacement for KVM.
I.e. there's no intention to phase out any 'old stuff' 

There's no reason to phase anything out, except perhaps the virtio-pci
transport.  This is one more transport, plugging into virtio underneath
(just like virtio-pci, virtio-lguest, and virtio-s390).  I am not even
suggesting that the old transport has to go away, per se.  It is the KVM
maintainers who insist on it being all or nothing.  For me, I do not see
the big deal in having one more model option in the qemu cmd-line, but
that is just my opinion.  If the maintainers were really so adamant that
choice is pure evil, I am not sure why we don't see patches for removing
everything but one model type in each IO category.  But I digress.

 and it splits the pool of driver developers.

..it is these dumb threads that are splitting driver developers, with
ignorant statements, irrelevant numbers, and dubious facts.  I
actually tried many, many times to ask others to join the effort, and
instead _they_ forked off and made vhost-net with a "sorry, not
interested in working with you" (and based the design largely on the
ideas proposed in my framework, I might add).  That's fine, and it's
their prerogative.  I can easily maintain my project out of tree if
upstream is not interested.  But do not turn around and try to blame me
for the current state of affairs.

 
 i.e. it has all the makings of a stupid, avoidable, permanent fork. The thing 
 is, if AlacrityVM is better, and KVM developers are not willing to fix 
 their 
 stuff, replace KVM with it.
 
 It's a bit as if someone found a performance problem with sys_open() and came 
 up with sys_open_v2() and claimed that he wants to work with the VFS 
 developers while not really doing so but advances sys_open_v2() all the time.

No, it's more like if I suggested sys_open_vbus() to augment
sys_open_pci(), sys_open_lguest(), sys_open_s390() because our
fictitious glibc natively supported modular open() implementations.  And
I tried to work for a very long time convincing the VFS developers that
this new transport was a good idea to add because it was optimized for
the problem space, made it easy for API callers to reuse the important
elements of the design (e.g. built in tx-mitigation instead of waiting
for someone to write it for each device), had new features like the
ability to prioritize signals, create/route signal paths arbitrarily,
implement raw shared memory for idempotent updates, and didn't require
the massive and useless PCI/APIC emulation logic to function like
sys_open_pci() (e.g. easier to port around).

Ultimately, the VFS developers said I know we let other transports in
in the past, but now 

Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Gregory Haskins
On 12/23/09 12:10 PM, Andi Kleen wrote:
 And it's moot, anyway, as I have already retracted my one outstanding
 pull request based on Linus' observation.  So at this time, I am not
 advocating _anything_ for upstream inclusion.  And I am contemplating
 _never_ doing so again.  It's not worth _this_.
 
 That certainly sounds like the wrong reaction. Out of tree drivers
 are typically a pain to use.

Well, to Linus' point, it shouldn't go in until a critical mass of users
have expressed desire to see it in tree, which seems reasonable to me.
For the admittedly small group that are using it today, modern tools
like the opensuse-build-service ease the deployment as a KMP, so that
can suffice for now.  It's actually what most of the alacrityvm community
uses today anyway (as opposed to using a merged tree in the guest).

 
 And upstream submission is not always like this!

I would think the process would come to a grinding halt if it were ;)

Thanks Andi,
-Greg






Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Gregory Haskins
On 12/23/09 1:15 AM, Kyle Moffett wrote:
 On Tue, Dec 22, 2009 at 12:36, Gregory Haskins
 gregory.hask...@gmail.com wrote:
 On 12/22/09 2:57 AM, Ingo Molnar wrote:
 * Gregory Haskins gregory.hask...@gmail.com wrote:
 Actually, these patches have nothing to do with the KVM folks. [...]

 That claim is curious to me - the AlacrityVM host

 It's quite simple, really.  These drivers support accessing vbus, and
 vbus is hypervisor agnostic.  In fact, vbus isn't necessarily even
 hypervisor related.  It may be used anywhere where a Linux kernel is the
 io backend, which includes hypervisors like AlacrityVM, but also
 userspace apps, and interconnected physical systems as well.

 The vbus-core on the backend, and the drivers on the frontend operate
 completely independent of the underlying hypervisor.  A glue piece
 called a connector ties them together, and any hypervisor specific
 details are encapsulated in the connector module.  In this case, the
 connector surfaces to the guest side as a pci-bridge, so even that is
 not hypervisor specific per se.  It will work with any pci-bridge that
 exposes a compatible ABI, which conceivably could be actual hardware.
 
 This is actually something that is of particular interest to me.  I
 have a few prototype boards right now with programmable PCI-E
 host/device links on them; one of my long-term plans is to finagle
 vbus into providing multiple virtual devices across that single
 PCI-E interface.
 
 Specifically, I want to be able to provide virtual NIC(s), serial
 ports and serial consoles, virtual block storage, and possibly other
 kinds of interfaces.  My big problem with existing virtio right now
 (although I would be happy to be proven wrong) is that it seems to
 need some sort of out-of-band communication channel for setting up
 devices, not to mention it seems to need one PCI device per virtual
 device.
 
 So I would love to be able to port something like vbus to my nify PCI
 hardware and write some backend drivers... then my PCI-E connected
 systems would dynamically provide a list of highly-efficient virtual
 devices to each other, with only one 4-lane PCI-E bus.

Hi Kyle,

We indeed have others that are doing something similar.  I have CC'd Ira
who may be able to provide you more details.  I would also point you at
the canonical example for what you would need to write to tie your
systems together.  It's the null connector, which you can find here:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/connectors/null.c;h=b6d16cb68b7e49e07528278bc9f5b73e1dac0c2f;hb=HEAD

Do not hesitate to ask any questions, though you may want to take the
conversation to the alacrityvm-devel list so as not to annoy the current CC
list any further than I already have ;)

Kind Regards,
-Greg





Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Gregory Haskins
On 12/23/09 5:22 AM, Avi Kivity wrote:

 
 There was no attempt by Gregory to improve virtio-net.

If you truly do not understand why your statement is utterly wrong at
this point in the discussion, I feel sorry for you.  If you are trying
to be purposely disingenuous, you should be ashamed of yourself.  In any
case, your statement is demonstrably bogus, but you should already know
this given that we talked about it at least several times.

To refresh your memory: http://patchwork.kernel.org/patch/17428/

In case it's not blatantly clear, which I would hope it would be to
anyone that understands the problem space:  What that patch would do is
allow an unmodified virtio-net to bridge to a vbus-based virtio-net
backend.  (Also note that this predates vhost-net by months (the date in
that thread is 4/9/2009) in case you are next going to try to argue that
it does nothing over vhost-net).

This would mean that virtio-net would gain most of the benefits I have
been advocating (fewer exits, cheaper exits, concurrent execution, etc).
 So this would very much improve virtio-net indeed, given how poorly the
current backend was performing.  I tried to convince the team to help me
build it out to completion on multiple occasions, but that request was
answered with "sorry, we are doing our own thing" instead.  You can say
that you didn't like my approach, since that is a subjective opinion.
But to say that I didn't attempt to improve it is flat out wrong, and
I do not appreciate it.

-Greg







Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Gregory Haskins
On 12/23/09 12:52 PM, Peter W. Morreale wrote:
 On Wed, 2009-12-23 at 13:14 +0100, Andi Kleen wrote:
 http://www.redhat.com/f/pdf/summit/cwright_11_open_source_virt.pdf

 See slide 32.  This is without vhost-net.

 Thanks. Do you also have latency numbers?

 It seems like there's definitely still potential for improvement
 with messages < 4K. But for the large messages they indeed 
 look rather good.

 It's unclear what message size the Alacrity numbers used, but I presume
 it was rather large.

 
 No.  It was 1500.  Please see: 
 
 http://developer.novell.com/wiki/index.php/AlacrityVM/Results
 

Note: 1500 was the L2 MTU, not necessarily the L3/L4 size which was
probably much larger (though I do not recall what exactly atm).

-Greg






Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Gregory Haskins
(Sorry for top post...on a mobile)

When someone repeatedly makes a claim you believe to be wrong and you
correct them, you start to wonder if that person has a less than
honorable agenda.  In any case, I overreacted.  For that, I apologize.

That said, you are still incorrect.  With what I proposed, the model
will run as an in-kernel vbus device, and no longer run in userspace.
It would therefore improve virtio-net as I stated, much in the same
way vhost-net or venet-tap do today.

FYI  I am about to log out for the long holiday, so will be
unresponsive for a bit.

Kind Regards,
-Greg

On 12/23/09, Avi Kivity a...@redhat.com wrote:
 On 12/23/2009 08:15 PM, Gregory Haskins wrote:
 On 12/23/09 5:22 AM, Avi Kivity wrote:


 There was no attempt by Gregory to improve virtio-net.

 If you truly do not understand why your statement is utterly wrong at
 this point in the discussion, I feel sorry for you.  If you are trying
 to be purposely disingenuous, you should be ashamed of yourself.  In any
 case, your statement is demonstrably bogus, but you should already know
 this given that we talked about it at least several times.


 There's no need to feel sorry for me, thanks.  There's no reason for me
 to be ashamed, either.  And there's no need to take the discussion to
 personal levels.  Please keep it technical.


 To refresh your memory: http://patchwork.kernel.org/patch/17428/


 This is not an attempt to improve virtio-net, it's an attempt to push
 vbus.  With this, virtio-net doesn't become any faster, since the
 greatest bottleneck is not removed, it remains in userspace.

 If you wanted to improve virtio-net, you would port venet-host to the
 virtio-net guest/host interface, and port any secret sauce in
 venet(-guest) to virtio-net.  After that we could judge what vbus'
 contribution to the equation is.

 In case it's not blatantly clear, which I would hope it would be to
 anyone that understands the problem space:  What that patch would do is
 allow an unmodified virtio-net to bridge to a vbus-based virtio-net
 backend.  (Also note that this predates vhost-net by months (the date in
 that thread is 4/9/2009) in case you are next going to try to argue that
 it does nothing over vhost-net).


 Without the backend, it is useless.  It demonstrates vbus' flexibility
 quite well, but does nothing for virtio-net or its users, at least
 without a lot more work.

 This would mean that virtio-net would gain most of the benefits I have
 been advocating (fewer exits, cheaper exits, concurrent execution, etc).
   So this would very much improve virtio-net indeed, given how poorly the
 current backend was performing.  I tried to convince the team to help me
 build it out to completion on multiple occasions, but that request was
 answered with "sorry, we are doing our own thing" instead.  You can say
 that you didn't like my approach, since that is a subjective opinion.
 But to say that I didn't attempt to improve it is flat out wrong, and
 I do not appreciate it.


 Cutting down on the rhetoric is more important than cutting down exits
 at this point in time.

 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.



-- 
Sent from my mobile device


Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-22 Thread Gregory Haskins
On 12/22/09 1:53 PM, Avi Kivity wrote:
 On 12/22/2009 07:36 PM, Gregory Haskins wrote:

 Gregory, it would be nice if you worked _much_ harder with the KVM folks
 before giving up.
  
 I think the 5+ months that I politely tried to convince the KVM folks
 that this was a good idea was pretty generous of my employer.  The KVM
 maintainers have ultimately made it clear they are not interested in
 directly supporting this concept (which is their prerogative), but are
 perhaps willing to support the peripheral logic needed to allow it to
 easily interface with KVM.  I can accept that, and thus AlacrityVM was
 born.

 
 Review pointed out locking issues with xinterface which I have not seen
 addressed.  I asked why the irqfd/ioeventfd mechanisms are insufficient,
 and you did not reply.
 

Yes, I understand.  I've been too busy to rework the code for an
upstream push.  I will certainly address those questions when I make the
next attempt, but they weren't relevant to the guest side.





Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-22 Thread Gregory Haskins
On 12/22/09 1:53 PM, Avi Kivity wrote:
 I asked why the irqfd/ioeventfd mechanisms are insufficient, and you did not 
 reply.
 

BTW: the ioeventfd issue just fell through the cracks, so sorry about
that.  Note that I have no specific issue with irqfd ever since the
lockless IRQ injection code was added.

ioeventfd turned out to be suboptimal for me in the fast path for two
reasons:

1) the underlying eventfd is called in atomic context.  I had posted
patches to Davide to address that limitation, but I believe he rejected
them on the grounds that they are only relevant to KVM.

2) it cannot retain the data field passed in the PIO.  I wanted to have
one vector that could tell me what value was written, and this cannot be
expressed in ioeventfd.

Based on this, it was a better decision to add an ioevent interface to
xinterface.  It neatly solves both problems.
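
To illustrate point (2): ioeventfd is a pure doorbell.  Below is a hedged sketch of a KVM_IOEVENTFD registration (vm_fd is assumed to be an existing VM descriptor).  Even with DATAMATCH you can only filter on a value; the consumer side only ever reads an eventfd counter, so the value the guest actually wrote is not recoverable.

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vm_fd: an existing KVM VM descriptor (assumed) */
static int pio_doorbell(int vm_fd, uint64_t port)
{
	struct kvm_ioeventfd io;
	int efd = eventfd(0, 0);

	memset(&io, 0, sizeof(io));
	io.addr  = port;
	io.len   = 2;
	io.fd    = efd;
	io.flags = KVM_IOEVENTFD_FLAG_PIO;	/* no DATAMATCH: any write fires */
	ioctl(vm_fd, KVM_IOEVENTFD, &io);

	/*
	 * The consumer only ever sees the eventfd counter; whatever value
	 * the guest wrote to the port is gone by the time we get here.
	 */
	uint64_t count;
	read(efd, &count, sizeof(count));
	return efd;
}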

Kind Regards,
-Greg





Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-22 Thread Gregory Haskins
On 12/22/09 2:32 PM, Gregory Haskins wrote:
 On 12/22/09 2:25 PM, Avi Kivity wrote:


 If you're not doing something pretty minor, you're better off waking up a
 thread (perhaps _sync if you want to keep on the same cpu).  With the
 new user return notifier thingie, that's pretty cheap.
 
 We have exploits that take advantage of IO heuristics.  When triggered
 they do more work in vcpu context than normal, which reduces latency
 under certain circumstances.  But you definitely do _not_ want to do
 them in-atomic ;)

And I almost forgot:  dev->call() is an RPC to the backend device.
Therefore, it must be synchronous, yet we don't want it locked either.  I
think that was actually the primary motivation for the change, now that
I think about it.

-Greg





Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-22 Thread Gregory Haskins
On 12/22/09 2:38 PM, Avi Kivity wrote:
 On 12/22/2009 09:32 PM, Gregory Haskins wrote:
 xinterface, as it turns out, is a great KVM interface for me and easy to
 extend, all without conflicting with the changes in upstream.  The old
 way was via the kvm ioctl interface, but that sucked as the ABI was
 always moving.  Where is the problem?  ioeventfd still works fine as
 it is.

 
 It means that kvm locking suddenly affects more of the kernel.
 

That's ok.  This would only be w.r.t. devices that are bound to the KVM
instance anyway, so they better know what they are doing (and they do).

-Greg





Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-22 Thread Gregory Haskins
On 12/22/09 2:43 PM, Avi Kivity wrote:
 On 12/22/2009 09:41 PM, Gregory Haskins wrote:

 It means that kvm locking suddenly affects more of the kernel.

  
 That's ok.  This would only be w.r.t. devices that are bound to the KVM
 instance anyway, so they better know what they are doing (and they do).


 
 It's okay to the author of that device.  It's not okay to the kvm
 developers who are still evolving the locking and have to handle all
 devices that use xinterface.

Perhaps, but like it or not, if you want to do in-kernel you need to
invoke backends.  And if you want to invoke backends, limiting it to
thread wakeups is, well, limiting.  For one, you miss out on that
exploit I mentioned earlier which can help sometimes.

Besides, the direction that Marcelo and I left the mmio/pio bus was that
it would go lockless eventually, not more lockful ;)

Has that changed?  I honestly haven't followed what's going on in the
io-bus code in a while.

-Greg






Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-22 Thread Gregory Haskins
On 12/22/09 2:39 PM, Davide Libenzi wrote:
 On Tue, 22 Dec 2009, Gregory Haskins wrote:
 
 On 12/22/09 1:53 PM, Avi Kivity wrote:
 I asked why the irqfd/ioeventfd mechanisms are insufficient, and you did 
 not reply.


 BTW: the ioeventfd issue just fell through the cracks, so sorry about
 that.  Note that I have no specific issue with irqfd ever since the
 lockless IRQ injection code was added.

 ioeventfd turned out to be suboptimal for me in the fast path for two
 reasons:

 1) the underlying eventfd is called in atomic context.  I had posted
 patches to Davide to address that limitation, but I believe he rejected
 them on the grounds that they are only relevant to KVM.
 
 I thought we addressed this already, in the few hundred emails we 
 exchanged back then :)

We addressed the race conditions, but not the atomic callbacks.  I can't
remember exactly what you said, but the effect was no, so I dropped it. ;)

This was the thread.

http://www.archivum.info/linux-ker...@vger.kernel.org/2009-06/08548/Re:_[KVM-RFC_PATCH_1_2]_eventfd:_add_an_explicit_srcu_based_notifier_interface

 
 
 
 2) it cannot retain the data field passed in the PIO.  I wanted to have
 one vector that could tell me what value was written, and this cannot be
 expressed in ioeventfd.
 
 Like might have hinted in his reply, couldn't you add data support to the 
 ioeventfd bits in KVM, instead of leaking them into mainline eventfd?
 

Perhaps, or even easier I could extend xinterface.  Which is what I did ;)

The problem with the first proposal is that you would no longer actually
have an eventfd based mechanism...so any code using ioeventfd (like
Michael Tsirkin's for instance) would break.

Kind Regards,
-Greg





Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-22 Thread Gregory Haskins
On 12/21/09 7:12 PM, Anthony Liguori wrote:
 On 12/21/2009 11:44 AM, Gregory Haskins wrote:
 Well, surely something like SR-IOV is moving in that direction, no?

 
 Not really, but that's a different discussion.

Ok, but my general point still stands.  At some level, some crafty
hardware engineer may invent something that obsoletes the
need for, say, PV 802.x drivers because it can hit 40GE line rate at the
same performance level of bare metal with some kind of pass-through
trick.  But I still do not see that as an excuse for sloppy software in
the meantime, as there will always be older platforms, older IO cards,
or different IO types that are not beneficiaries of said hw-based
optimizations.

 
 But let's focus on concrete data.  For a given workload,
 how many exits do you see due to EOI?
  
 It's of course highly workload dependent, and I've published these
 details in the past, I believe.  Off the top of my head, I recall that
 virtio-pci tends to throw about 65k exits per second, vs about 32k/s for
 venet on a 10GE box, but I don't recall what ratio of those exits are
 EOI.
 
 Was this userspace virtio-pci or was this vhost-net?

Both, actually, though userspace is obviously even worse.

  If it was the
 former, then were you using MSI-X?

MSI-X

  If you weren't, there would be an
 additional (rather heavy) exit per-interrupt to clear the ISR which
 would certainly account for a large portion of the additional exits.


Yep, if you don't use MSI it is significantly worse as expected.


To be perfectly honest, I don't care.  I do not discriminate
 against the exit type...I want to eliminate as many as possible,
 regardless of the type.  That's how you go fast and yet use less CPU.

 
 It's important to understand why one mechanism is better than another. 

Agreed, but note _I_ already understand why.  I've certainly spent
countless hours/emails trying to get others to understand as well, but
it seems most are too busy to actually listen.


 All I'm looking for is a set of bullet points that say, vbus does this,
 vhost-net does that, therefore vbus is better.  We would then either
 say, oh, that's a good idea, let's change vhost-net to do that, or we
 would say, hrm, well, we can't change vhost-net to do that because of
 some fundamental flaw, let's drop it and adopt vbus.
 
 It's really that simple :-)

This has all been covered ad nauseam, directly with yourself in many
cases.  Google is your friend.

Here are some tips while you research:  Do not fall into the trap of
vhost-net vs vbus, or venet vs virtio-net, or you miss the point
entirely.  Recall that venet was originally crafted to demonstrate the
virtues of my three performance objectives (kill exits, reduce exit
overhead, and run concurrently). Then there is all the stuff we are
laying on top, like qos, real-time, advanced fabrics, and easy adoption
for various environments (so it doesn't need to be redefined each time).

Therefore if you only look at the limited feature set of virtio-net, you
will miss the majority of the points of the framework.  virtio tried to
capture some of these ideas, but it missed the mark on several levels
and was only partially defined.  Incidentally, you can still run virtio
over vbus if desired, but so far no one has tried to use my transport.

 
 
   They should be relatively rare
 because obtaining good receive batching is pretty easy.
  
 Batching is poor man's throughput (it's easy when you don't care about
 latency), so we generally avoid it as much as possible.

 
 Fair enough.
 
 Considering
 these are lightweight exits (on the order of 1-2us),
  
 APIC EOIs on x86 are MMIO based, so they are generally much heavier than
 that.  I measure at least 4-5us just for the MMIO exit on my Woodcrest,
 never mind executing the locking/apic-emulation code.

 
 You won't like to hear me say this, but Woodcrests are pretty old and
 clunky as far as VT goes :-)

Fair enough.

 
 On a modern Nehalem, I would be surprised if an MMIO exit handled in the
 kernel was much more than 2us.  The hardware is getting very, very
 fast.  The trends here are very important to consider when we're looking
 at architectures that we potentially are going to support for a long time.

The exit you do not take will always be infinitely faster.

 
 you need an awfully
 large amount of interrupts before you get really significant performance
 impact.  You would think NAPI would kick in at this point anyway.

  
 Whether NAPI can kick in or not is workload dependent, and it also does
 not address coincident events.  But on that topic, you can think of
AlacrityVM's interrupt controller as "NAPI for interrupts", because it
operates on the same principle.  For what it's worth, it also operates on
a "NAPI for hypercalls" concept too.

 
 The concept of always batching hypercalls has certainly been explored
 within the context of Xen.

I am not talking about batching, which again is a poor man's throughput
trick

Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-21 Thread Gregory Haskins
On 12/18/09 4:51 PM, Ingo Molnar wrote:
 
 * Gregory Haskins gregory.hask...@gmail.com wrote:
 
 Hi Linus,

 Please pull AlacrityVM guest support for 2.6.33 from:

 git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git
 for-linus

 All of these patches have stewed in linux-next for quite a while now:

 Gregory Haskins (26):
 
 I think it would be fair to point out that these patches have been objected 
 to 
 by the KVM folks quite extensively,

Actually, these patches have nothing to do with the KVM folks.  You are
perhaps confusing this with the hypervisor-side discussion, of which
there is indeed much disagreement.

To that point, it's certainly fair to point out the controversy on the
host side.  It ultimately is what forced the creation of the AlacrityVM
project, after all.  However, it should also be pointed out that this
pull request is not KVM specific, nor even KVM related per se.  These
patches can (and in fact, do) work in other environments that do not use
KVM nor even AlacrityVM at all.

VBUS, the underlying technology here, is a framework for creating
optimized software-based device models using a Linux kernel as the
host/backend, along with their corresponding frontend driver resources.
AlacrityVM is the application of these technologies using KVM/Linux/Qemu
as a base, but that is an implementation detail.

For more details, please see the project wiki

http://developer.novell.com/wiki/index.php/AlacrityVM

This pull request is for drivers to support running a Linux kernel as a
guest in this environment, so it actually doesn't affect KVM in any way.
 They are just standard Linux drivers and in fact can load as
stand-alone KMPs in any modern vanilla distro.  I haven't even pushed
the host side code to linux-next yet specifically because of that
controversy you mention.


 on multiple technical grounds - as 
 basically this tree forks the KVM driver space for which no valid technical 
 reason could be offered by you in a 100+ mails long discussion.

You will have to be more specific on these technical grounds you
mention, because I believe I satisfactorily rebutted any issues raised.
 To say that there is no technical reason is, at best, a matter of
opinion.  I have in fact listed numerous reasons, on a technical,
feature, and architectural basis, as to what differentiates my approach, and
provided numbers which highlight their merits.  Given that they are all
recorded in the archives of said 100+ email thread as well as numerous
others, I won't rehash the entire list here.  Instead, I will post a
summary of the problem space from the performance perspective, since
that seems to be of most interest atm.

From my research, the reason why virt in general, and KVM in particular
suffers on the IO performance front is as follows: IOs
(traps+interrupts) are more expensive than bare-metal, and real hardware
is naturally concurrent (your hbas and nics are effectively parallel
execution engines, etc).

Assuming my observations are correct, in order to squeeze maximum
performance from a given guest, you need to do three things:  A)
eliminate as many IOs as you possibly can, B) reduce the cost of the
ones you can't avoid, and C) run your algorithms in parallel to emulate
concurrent silicon.

So to that front, we move the device models to the kernel (where they
are closest to the physical IO devices) and use cheap instructions
like PIOs/Hypercalls for (B), and exploit spare host-side SMP resources
via kthreads for (C).  For (A), part of the problem is that virtio-pci
is not designed optimally to address the problem space, and part of it
is a limitation of the PCI transport underneath it.

For example, PCI is somewhat of a unique bus design in that it wants to
map signals to interrupts 1:1.  This works fine for real hardware where
interrupts are relatively cheap, but is quite suboptimal on virt where
the window-exits, injection-exits, and MMIO-based EOIs hurt
substantially (multiple microseconds per).

One core observation is that we don't technically need 1:1 interrupts to
signals in order to function properly.  Ideally we will only bother the
CPU when work of a higher priority becomes ready.  So the alacrityvm
connector to vbus uses a model where we deploy a lockless shared-memory
queue to inject interrupts.  This means that temporal interrupts (of
both the intra- and inter-device variety) of similar priority can queue
without incurring any extra IO.  This means fewer exits, fewer EOIs, etc.

The end result is that I can demonstrate that even with a single stream
to a single device, I can reduce exit rate by over 45% and interrupt
rate by over 50% when compared to the equivalent virtio-pci ABI.  This scales
even higher when you add additional devices to the mix.  The bottom line
is that we use significantly less CPU while producing the highest
throughput and lowest latency.  In fact, to my knowledge vbus+venet is
still the highest-performing 802.x device for KVM, even
when turning off its

Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-21 Thread Gregory Haskins
On 12/21/09 10:43 AM, Avi Kivity wrote:
 On 12/21/2009 05:34 PM, Gregory Haskins wrote:

 I think it would be fair to point out that these patches have been
 objected to
 by the KVM folks quite extensively,
  
 Actually, these patches have nothing to do with the KVM folks.  You are
 perhaps confusing this with the hypervisor-side discussion, of which
 there is indeed much disagreement.

 
 This is true, though these drivers are fairly pointless for
 virtualization without the host side support.

The host side support is available in various forms (git tree, rpm, etc)
from our project page.  I would encourage any interested parties to
check it out:

Here is the git tree:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=summary

Here are some RPMs:

http://download.opensuse.org/repositories/devel://LLDC://alacrity/openSUSE_11.1/

And the main project site:

http://developer.novell.com/wiki/index.php/AlacrityVM

 
 I did have a few issues with the guest drivers:
 - the duplication of effort wrt virtio.  These drivers don't cover
 exactly the same problem space, but nearly so.

Virtio itself is more or less compatible with this effort, as we have
discussed (see my virtio-vbus transport, for instance).  I have issues
with some of the design decisions in the virtio device and ring models,
but they are minor in comparison to the beef I have with the virtio-pci
transport as a whole.

 - no effort at scalability - all interrupts are taken on one cpu

Addressed by the virtual-interrupt controller.  This will enable us to
route shm-signal messages to a core, under guidance from the standard
irq-balance facilities.

 - the patches introduce a new virtual interrupt controller for dubious
 (IMO) benefits

See above.  It's not fully plumbed yet, which is perhaps the reason for
the confusion as to its merits.  Eventually I will trap the affinity
calls and pass them to the host, too.  Today, it at least lets us see
the shm-signal statistics under /proc/interrupts, which is nice and is
consistent with other IO mechanisms.


 
  From my research, the reason why virt in general, and KVM in particular
 suffers on the IO performance front is as follows: IOs
 (traps+interrupts) are more expensive than bare-metal, and real hardware
 is naturally concurrent (your hbas and nics are effectively parallel
 execution engines, etc).

 Assuming my observations are correct, in order to squeeze maximum
 performance from a given guest, you need to do three things:  A)
 eliminate as many IOs as you possibly can, B) reduce the cost of the
 ones you can't avoid, and C) run your algorithms in parallel to emulate
 concurrent silicon.

 
 All these are addressed by vhost-net without introducing new drivers.

No, B and C definitely are, but A is lacking.  And the performance
suffers as a result in my testing (vhost-net still throws a ton of exits
as it's limited by virtio-pci and only adds about 1Gb/s to virtio-u, far
behind venet even with things like zero-copy turned off).

I will also point out that these performance aspects are only a subset
of the discussion, since we are also addressing things like
qos/priority, alternate fabric types, etc.  I do not expect you to
understand and agree where I am going per se.  We can have that
discussion when I once again ask you for merge consideration.  But if
you say they are the same I will call you on it, because they are
demonstrably unique capability sets.

Kind Regards,
-Greg








Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-21 Thread Gregory Haskins
On 12/21/09 11:37 AM, Anthony Liguori wrote:
 On 12/21/2009 10:04 AM, Gregory Haskins wrote:
 No, B and C definitely are, but A is lacking.  And the performance
 suffers as a result in my testing (vhost-net still throws a ton of exits
 as it's limited by virtio-pci and only adds about 1Gb/s to virtio-u, far
 behind venet even with things like zero-copy turned off).

 
 How does virtio-pci limit vhost-net?  The only time exits should occur
 are when the guest notifies the host that something has been placed on
 the ring.  Since vhost-net has no tx mitigation scheme right now, the
 result may be that it's taking an io exit on every single packet but
 this is orthogonal to virtio-pci.
 
 Since virtio-pci supports MSI-X, there should be no IO exits on
 host->guest notification other than EOI in the virtual APIC.

The very best you can hope to achieve is 1:1 EOI per signal (though
today virtio-pci is even worse than that).  As I indicated above, I can
eliminate more than 50% of even the EOIs in trivial examples, and even
more as we scale up the number of devices or the IO load (or both).

 This is a
 light weight exit today and will likely disappear entirely with newer
 hardware.

By that argument, this is all moot.  New hardware will likely obsolete
the need for venet or virtio-net anyway.  The goal of my work is to
provide an easy to use framework for maximizing the IO transport _in
lieu_ of hardware acceleration.  Software will always be leading here,
so we don't want to get into a pattern of waiting for new hardware to
cover poor software engineering.  It's simply not necessary in most
cases.  A little smart software design and a framework that allows it to
be easily exploited/reused is the best step forward, IMO.

Kind Regards,
-Greg








Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-21 Thread Gregory Haskins
On 12/21/09 11:40 AM, Avi Kivity wrote:
 On 12/21/2009 06:37 PM, Anthony Liguori wrote:
 Since virtio-pci supports MSI-X, there should be no IO exits on
 host-guest notification other than EOI in the virtual APIC.  This is
 a light weight exit today and will likely disappear entirely with
 newer hardware.
 
 I'm working on disappearing EOI exits on older hardware as well.  Same
 idea as the old TPR patching, without most of the magic.
 

While I applaud any engineering effort that results in more optimal
execution, if you are talking about what we have discussed in the past,
it's not quite in the same league as my proposal.

You are talking about the ability to optimize the final EOI if there are
no pending interrupts remaining, right?  The problem with this approach
is it addresses the wrong side of the curve: That is, it optimizes the
code as its about to go io-idle.  You still have to take an extra exit
for each injection during the heat of battle, which is when you actually
need it most.

To that front, what I have done is re-use the lockless shared-memory
concept even for interrupt injection.  Lockless shared-memory rings
have the property that both producer and consumer can simultaneously
manipulate the ring.  So what we do in alacrityvm is deliver shm-signals
(shm-signal == interrupt in vbus) over a ring so that the host can
inject a signal to a running vcpu and a vcpu can complete an
ack/re-inject cycle directly from vcpu context.  Therefore, we only need
a physical IDT injection when the vcpu is io-idle transitioning to
io-busy, and remain completely in parallel guest/host context until we
go idle again.
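
If it helps to picture it, here is a rough sketch of the general shape of the
idea (illustrative only: the names are made up and C11 atomics stand in for
the kernel primitives, so this is not the actual vbus/AlacrityVM code):

#include <stdatomic.h>
#include <stdint.h>

#define SHM_RING_SIZE 256			/* power of two */

struct shm_signal_ring {
	_Atomic uint32_t prod;			/* advanced by the host  */
	_Atomic uint32_t cons;			/* advanced by the guest */
	uint32_t         id[SHM_RING_SIZE];	/* pending shm-signal IDs */
};

/*
 * Host side: queue a signal.  Returns -1 if the ring is full, in which
 * case the host falls back to a conventional injection.  A real
 * injection is also only needed when the vcpu was io-idle and must be
 * kicked; otherwise the running vcpu will notice the new producer index.
 */
int shm_signal_publish(struct shm_signal_ring *r, uint32_t id)
{
	uint32_t prod = atomic_load_explicit(&r->prod, memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&r->cons, memory_order_acquire);

	if (prod - cons >= SHM_RING_SIZE)
		return -1;

	r->id[prod & (SHM_RING_SIZE - 1)] = id;
	/* release: publish the slot contents before the index update */
	atomic_store_explicit(&r->prod, prod + 1, memory_order_release);
	return 0;
}

/* Guest side: drain and ack everything pending without a single exit. */
void shm_signal_drain(struct shm_signal_ring *r, void (*handler)(uint32_t))
{
	uint32_t cons = atomic_load_explicit(&r->cons, memory_order_relaxed);
	uint32_t prod = atomic_load_explicit(&r->prod, memory_order_acquire);

	while (cons != prod) {
		handler(r->id[cons & (SHM_RING_SIZE - 1)]);
		cons++;
	}
	atomic_store_explicit(&r->cons, cons, memory_order_release);
}

The point is simply that producer and consumer never serialize against each
other, so the ack/re-inject cycle can run entirely from vcpu context while
the host keeps appending.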

That said, your suggestion would play nicely with the above mentioned
scheme, so I look forward to seeing it in the tree.  Feel free to send
me patches for testing.

Kind Regards,
-Greg





Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-21 Thread Gregory Haskins
On 12/21/09 12:05 PM, Avi Kivity wrote:
 On 12/21/2009 06:56 PM, Gregory Haskins wrote:
 I'm working on disappearing EOI exits on older hardware as well.  Same
 idea as the old TPR patching, without most of the magic.

  
 While I applaud any engineering effort that results in more optimal
 execution, if you are talking about what we have discussed in the past
 its not quite in the same league as my proposal.

 
 I don't doubt this for a minute.
 
 You are talking about the ability to optimize the final EOI if there are
 no pending interrupts remaining, right?  The problem with this approach
 is it addresses the wrong side of the curve: That is, it optimizes the
 code as its about to go io-idle.  You still have to take an extra exit
 for each injection during the heat of battle, which is when you actually
 need it most.

 
 No, it's completely orthogonal.  An interrupt is injected, the handler
 disables further interrupts and EOIs, then schedules the rest of the
 handling code.  So long as there as packets in the ring interrupts won't
 be enabled and hence there won't be any reinjections.

I meant inter-vector "next-interrupt" injects.  For lack of a better
term, I called it "reinject", but I realize in retrospect that this is
ambiguous.

 
 Different interrupt sources still need different interrupts, but as all
 of your tests have been single-interface, this can't be the reason for
 your performance.
 

Actually I have tested both single and multi-homed setups, but it
doesn't matter.  Even a single device can benefit, as even single
devices may have multiple vector sources that are highly probably to
generate coincident events.  For instance, consider that even a basic
ethernet may have separate vectors for rx and tx-complete.  A simple
ping is likely to generate both vectors at approximately the same time,
given how the host side resources often work.

Trying to condense multiple vectors into one means it's up to the driver
to implement any type of prioritization on its own (or worse, it just
suffers from PI).  Likewise, implementing them as unique vectors means
you are likely to have coincident events for certain workloads.

What alacrityvm tries to do is recognize these points and optimize for
both cases.  It means we still retain framework-managed prioritized
callbacks, yet optimize away extraneous IO for coincident signals.  IOW:
best of both worlds.

Kind Regards,
-Greg







Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-21 Thread Gregory Haskins
On 12/21/09 12:20 PM, Anthony Liguori wrote:
 On 12/21/2009 10:46 AM, Gregory Haskins wrote:
 The very best you can hope to achieve is 1:1 EOI per signal (though
 today virtio-pci is even worse than that).  As I indicated above, I can
 eliminate more than 50% of even the EOIs in trivial examples, and even
 more as we scale up the number of devices or the IO load (or both).

 
 If optimizing EOI is the main technical advantage of vbus, then surely
 we could paravirtualize EOI access and get that benefit in KVM without
 introducing a whole new infrastructure, no?

No, because I never claimed optimizing EOI was the main/only advantage.
 The feature set has been covered in extensive detail on the lists,
however, so I will defer you to google for the archives for your reading
pleasure.

 
 This is a
 light weight exit today and will likely disappear entirely with newer
 hardware.
  
 By that argument, this is all moot.  New hardware will likely obsolete
 the need for venet or virtio-net anyway.
 
 Not at all.

Well, surely something like SR-IOV is moving in that direction, no?

 But let's focus on concrete data.  For a given workload,
 how many exits do you see due to EOI?

It's of course highly workload dependent, and I've published these
details in the past, I believe.  Off the top of my head, I recall that
virtio-pci tends to throw about 65k exits per second, vs about 32k/s for
venet on a 10GE box, but I don't recall what ratio of those exits are
EOI.  To be perfectly honest, I don't care.  I do not discriminate
against the exit type...I want to eliminate as many as possible,
regardless of the type.  That's how you go fast and yet use less CPU.

  They should be relatively rare
 because obtaining good receive batching is pretty easy.

Batching is the poor man's throughput (it's easy when you don't care about
latency), so we generally avoid it as much as possible.

 Considering
 these are lightweight exits (on the order of 1-2us),

APIC EOIs on x86 are MMIO based, so they are generally much heavier than
that.  I measure at least 4-5us just for the MMIO exit on my Woodcrest,
never mind executing the locking/apic-emulation code.

 you need an awfully
 large amount of interrupts before you get really significant performance
 impact.  You would think NAPI would kick in at this point anyway.


Whether NAPI can kick in or not is workload dependent, and it also does
not address coincident events.  But on that topic, you can think of
AlacrityVM's interrupt controller as "NAPI for interrupts", because it
operates on the same principle.  For what it's worth, it also operates on
a "NAPI for hypercalls" concept.

 Do you have data demonstrating the advantage of EOI mitigation?

I have non-scientifically gathered numbers in my notebook that put it at
an average of about a 55%-60% reduction in EOIs for inbound netperf runs, for
instance.  I don't have time to gather more in the near term, but it's
typically in that range for a chatty enough workload, and it goes up as
you add devices.  I would certainly formally generate those numbers when
I make another merge request in the future, but I don't have them now.

Kind Regards,
-Greg





[ANNOUNCE] AlacrityVM v0.2 is released

2009-11-05 Thread Gregory Haskins
We (the AlacrityVM team) are pleased to announce the availability of the
v0.2 release.  There are numerous tweaks, fixes, and features that we
have added since the v0.1 days.  Here are a few of the key highlights.

*) VENET support:
   *) zero-copy transmits (guest memory is paged directly to the
physical adapter on the host, where applicable)
   *) SAR offloading (essentially bi-directional GSO with 64k MTU)
   *) pre-mapped descriptors to reduce latency
   *) pre-loading of mmu context to reduce latency

*) shm-signal's are now routable (e.g. future MQ support)

*) vbus is optionally managed in its entirety from the rebundled qemu
   *) -net venet-tap will fully initialize the interface
internally... no more scripting!

*) vbus components are automatically reaped when the guest exits, even
if the exit is not clean.

We will update the performance numbers ASAP and announce any changes.
For further details, please visit our wiki:

http://developer.novell.com/wiki/index.php/AlacrityVM

Kind Regards,
-Greg





Re: [PATCHv8 0/3] vhost: a kernel-level virtio server

2009-11-04 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 Ok, I think I've addressed all comments so far here.
 Rusty, I'd like this to go into linux-next, through your tree, and
 hopefully 2.6.33.  What do you think?

I think the benchmark data is a prerequisite for merge consideration, IMO.

Do you have anything for us to look at?  I think comparisons that show
the following are of interest:

throughput (e.g. netperf::TCP_STREAM): guest->host, guest->host->guest,
guest->host->remote, host->remote, remote->host->guest

latency (e.g. netperf::UDP_RR): same conditions as throughput

cpu-utilization

others?

Ideally, this should be at least between upstream virtio and vhost.
Bonus points if you include venet as well.

Kind regards,
-Greg





Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 On Wed, Nov 04, 2009 at 09:25:42AM -0800, Paul E. McKenney wrote:
 (Sorry, but, as always, I could not resist!)

  Thanx, Paul
 
 Thanks Paul!
 Jonathan: are you reading this?
 Another one for your quotes of the week collection :)
 

I second that.





Re: [PATCHv8 0/3] vhost: a kernel-level virtio server

2009-11-04 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 On Wed, Nov 04, 2009 at 11:02:15AM -0500, Gregory Haskins wrote:
 Michael S. Tsirkin wrote:
 Ok, I think I've addressed all comments so far here.
 Rusty, I'd like this to go into linux-next, through your tree, and
 hopefully 2.6.33.  What do you think?
 I think the benchmark data is a prerequisite for merge consideration, IMO.
 
 Shirley Ma was kind enough to send me some measurement results showing
 how kernel level acceleration helps speed up you can find them here:
 http://www.linux-kvm.org/page/VhostNet

Thanks for the pointers.  I will roll your latest v8 code into our test
matrix.  What kernel/qemu trees do they apply to?

-Greg





Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-03 Thread Gregory Haskins
Gregory Haskins wrote:
 Eric Dumazet wrote:
 Michael S. Tsirkin a écrit :
 +static void handle_tx(struct vhost_net *net)
 +{
 +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
 +	unsigned head, out, in, s;
 +	struct msghdr msg = {
 +		.msg_name = NULL,
 +		.msg_namelen = 0,
 +		.msg_control = NULL,
 +		.msg_controllen = 0,
 +		.msg_iov = vq->iov,
 +		.msg_flags = MSG_DONTWAIT,
 +	};
 +	size_t len, total_len = 0;
 +	int err, wmem;
 +	size_t hdr_size;
 +	struct socket *sock = rcu_dereference(vq->private_data);
 +	if (!sock)
 +		return;
 +
 +	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 +	if (wmem >= sock->sk->sk_sndbuf)
 +		return;
 +
 +	use_mm(net->dev.mm);
 +	mutex_lock(&vq->mutex);
 +	vhost_no_notify(vq);
 +
 using rcu_dereference() and mutex_lock() at the same time seems wrong, I 
 suspect
 that your use of RCU is not correct.

 1) rcu_dereference() should be done inside a read_rcu_lock() section, and
we are not allowed to sleep in such a section.
(Quoting Documentation/RCU/whatisRCU.txt :
  It is illegal to block while in an RCU read-side critical section, )

 2) mutex_lock() can sleep (ie block)

 
 
 Michael,
   I warned you that this needed better documentation ;)
 
 Eric,
   I think I flagged this once before, but Michael convinced me that it
 was indeed ok, if but perhaps a bit unconventional.  I will try to
 find the thread.
 
 Kind Regards,
 -Greg
 

Here it is:

http://lkml.org/lkml/2009/8/12/173

Kind Regards,
-Greg





Re: [PATCHv7 2/3] mm: export use_mm/unuse_mm to modules

2009-11-03 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 vhost net module wants to do copy to/from user from a kernel thread,
 which needs use_mm. Export it to modules.
 
 Acked-by: Andrea Arcangeli aarca...@redhat.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com

I need this too:

Acked-by: Gregory Haskins ghask...@novell.com

 ---
  mm/mmu_context.c |3 +++
  1 files changed, 3 insertions(+), 0 deletions(-)
 
 diff --git a/mm/mmu_context.c b/mm/mmu_context.c
 index ded9081..0777654 100644
 --- a/mm/mmu_context.c
 +++ b/mm/mmu_context.c
 @@ -5,6 +5,7 @@
  
  #include linux/mm.h
  #include linux/mmu_context.h
 +#include linux/module.h
  #include linux/sched.h
  
  #include asm/mmu_context.h
 @@ -37,6 +38,7 @@ void use_mm(struct mm_struct *mm)
   if (active_mm != mm)
   mmdrop(active_mm);
  }
 +EXPORT_SYMBOL_GPL(use_mm);
  
  /*
   * unuse_mm
 @@ -56,3 +58,4 @@ void unuse_mm(struct mm_struct *mm)
   enter_lazy_tlb(mm, tsk);
   task_unlock(tsk);
  }
 +EXPORT_SYMBOL_GPL(unuse_mm);






Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-03 Thread Gregory Haskins
Eric Dumazet wrote:
 Michael S. Tsirkin a écrit :
 +static void handle_tx(struct vhost_net *net)
 +{
 +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
 +	unsigned head, out, in, s;
 +	struct msghdr msg = {
 +		.msg_name = NULL,
 +		.msg_namelen = 0,
 +		.msg_control = NULL,
 +		.msg_controllen = 0,
 +		.msg_iov = vq->iov,
 +		.msg_flags = MSG_DONTWAIT,
 +	};
 +	size_t len, total_len = 0;
 +	int err, wmem;
 +	size_t hdr_size;
 +	struct socket *sock = rcu_dereference(vq->private_data);
 +	if (!sock)
 +		return;
 +
 +	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 +	if (wmem >= sock->sk->sk_sndbuf)
 +		return;
 +
 +	use_mm(net->dev.mm);
 +	mutex_lock(&vq->mutex);
 +	vhost_no_notify(vq);
 +
 
 using rcu_dereference() and mutex_lock() at the same time seems wrong, I 
 suspect
 that your use of RCU is not correct.
 
 1) rcu_dereference() should be done inside a read_rcu_lock() section, and
we are not allowed to sleep in such a section.
(Quoting Documentation/RCU/whatisRCU.txt :
  It is illegal to block while in an RCU read-side critical section, )
 
 2) mutex_lock() can sleep (ie block)
 


Michael,
  I warned you that this needed better documentation ;)

Eric,
  I think I flagged this once before, but Michael convinced me that it
was indeed OK, if perhaps a bit unconventional.  I will try to
find the thread.

Kind Regards,
-Greg





Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-03 Thread Gregory Haskins
Eric Dumazet wrote:
 Gregory Haskins a écrit :
 Gregory Haskins wrote:
 Eric Dumazet wrote:
 Michael S. Tsirkin a écrit :

 using rcu_dereference() and mutex_lock() at the same time seems wrong, I 
 suspect
 that your use of RCU is not correct.

 1) rcu_dereference() should be done inside a read_rcu_lock() section, and
we are not allowed to sleep in such a section.
(Quoting Documentation/RCU/whatisRCU.txt :
  It is illegal to block while in an RCU read-side critical section, )

 2) mutex_lock() can sleep (ie block)

 Michael,
   I warned you that this needed better documentation ;)

 Eric,
   I think I flagged this once before, but Michael convinced me that it
 was indeed ok, if but perhaps a bit unconventional.  I will try to
 find the thread.

 Kind Regards,
 -Greg

 Here it is:

 http://lkml.org/lkml/2009/8/12/173

 
 Yes, this doesnt convince me at all, and could be a precedent for a wrong RCU 
 use.
 People wanting to use RCU do a grep on kernel sources to find how to correctly
 use RCU.
 
 Michael, please use existing locking/barrier mechanisms, and not pretend to 
 use RCU.

Yes, I would tend to agree with you.  In fact, I think I suggested that
a normal barrier should be used instead of abusing rcu_dereference().

But as far as his code is concerned, I think it technically works
properly, and that was my main point.  Also note that the
rcu_dereference()+mutex_lock() usage is not necessarily broken, per se:  it
could be an srcu-based critical section created by the caller, for
instance.  It would be perfectly legal to sleep on the mutex if that
were the case.

To me, the bigger issue is that the rcu_dereference() without any
apparent hint of a corresponding RSCS is simply confusing to a reviewer.
 smp_rmb() (or whatever is proper in this case) is probably more
appropriate.

Kind Regards,
-Greg






Re: [Alacrityvm-devel] [KVM PATCH v2 1/2] KVM: export lockless GSI attribute

2009-10-28 Thread Gregory Haskins
Avi Kivity wrote:
 On 10/26/2009 05:38 PM, Gregory Haskins wrote:
 Instead of a lockless attribute, how about a -set_atomic() method. 
 For
 msi this can be the same as -set(), for non-msi it can be a function
 that schedules the work (which will eventually call -set()).

 The benefit is that we make a decision only once, when preparing the
 routing entry, and install that decision in the routing entry
 instead of
 making it again and again later.

 Yeah, I like this idea.  I think we can also get rid of the custom
 workqueue if we do this as well, TBD.
  
 So I looked into this.  It isn't straight forward because you need to
 retain some kind of state across the deferment on a per-request basis
 (not per-GSI).  Today, this state is neatly tracked into the irqfd
 object itself (e.g. it knows to toggle the GSI).

 
 Yes, and it also contains the work_struct.
 
 What if we make the work_struct (and any additional state) part of the
 set_atomic() argument list?  Does it simplify things?

Hmmm, that might not simplify it much, but we could do a kmalloc(GFP_ATOMIC)
for such parameters.  Considering this is just a safety net, perhaps this
would work fine.
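
Something like this is what I am picturing (hypothetical names, just to
illustrate carrying the per-request state to process context via an atomic
allocation; not actual code from the series):

#include <linux/kvm_host.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct deferred_inject {
	struct work_struct work;
	struct kvm *kvm;
	u32 gsi;
};

static void deferred_inject_fn(struct work_struct *work)
{
	struct deferred_inject *di = container_of(work, struct deferred_inject, work);

	kvm_set_irq(di->kvm, KVM_USERSPACE_IRQ_SOURCE_ID, di->gsi, 1);
	kvm_set_irq(di->kvm, KVM_USERSPACE_IRQ_SOURCE_ID, di->gsi, 0);
	kfree(di);
}

/* callable from the non-sleeping injection path */
static int set_atomic_deferred(struct kvm *kvm, u32 gsi)
{
	struct deferred_inject *di = kmalloc(sizeof(*di), GFP_ATOMIC);

	if (!di)
		return -ENOMEM;

	di->kvm = kvm;
	di->gsi = gsi;
	INIT_WORK(&di->work, deferred_inject_fn);
	schedule_work(&di->work);
	return 0;
}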

 
 So while generalizing this perhaps makes sense at some point, especially
 if irqfd-like interfaces get added, it probably doesn't make a ton of
 sense to expend energy on it ATM.  It is basically a generalization of
 the irqfd deferrment code.  Lets just wait until we have a user beyond
 irqfd for now.  Sound acceptable?

 
 I'll look at v3, but would really like to disentangle this.

Ok, I will see what I can do.  I need at least a v4 to get rid of the
dependency on the now defunct v3:1/3 patch per yesterday's discussion.

Kind Regards,
-Greg





Re: [KVM PATCH v3 3/3] KVM: Directly inject interrupts if they support lockless operation

2009-10-28 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 On Tue, Oct 27, 2009 at 02:54:40PM -0400, Gregory Haskins wrote:
 Michael S. Tsirkin wrote:
 On Mon, Oct 26, 2009 at 12:22:08PM -0400, Gregory Haskins wrote:
 IRQFD currently uses a deferred workqueue item to execute the injection
 operation.  It was originally designed this way because kvm_set_irq()
 required the caller to hold the irq_lock mutex, and the eventfd callback
 is invoked from within a non-preemptible critical section.

 With the advent of lockless injection support for certain GSIs, the
 deferment mechanism is no longer technically needed in all cases.
 Since context switching to the workqueue is a source of interrupt
 latency, lets switch to a direct method whenever possible.  Fortunately
 for us, the most common use of irqfd (MSI-based GSIs) readily support
 lockless injection.

 Signed-off-by: Gregory Haskins ghask...@novell.com
 This is a useful optimization I think.
 Some comments below.

 ---

  virt/kvm/eventfd.c |   31 +++
  1 files changed, 27 insertions(+), 4 deletions(-)

 diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
 index 30f70fd..e6cc958 100644
 --- a/virt/kvm/eventfd.c
 +++ b/virt/kvm/eventfd.c
 @@ -51,20 +51,34 @@ struct _irqfd {
wait_queue_t  wait;
	struct work_struct    inject;
	struct work_struct    shutdown;
 +  void (*execute)(struct _irqfd *);
  };
  
  static struct workqueue_struct *irqfd_cleanup_wq;
  
  static void
 -irqfd_inject(struct work_struct *work)
 +irqfd_inject(struct _irqfd *irqfd)
  {
 -  struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
	struct kvm *kvm = irqfd->kvm;
 
	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
  }
  
 +static void
 +irqfd_deferred_inject(struct work_struct *work)
 +{
 +  struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 +
 +  irqfd_inject(irqfd);
 +}
 +
 +static void
 +irqfd_schedule(struct _irqfd *irqfd)
 +{
 +	schedule_work(&irqfd->inject);
 +}
 +
  /*
   * Race-free decouple logic (ordering is critical)
   */
 @@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int 
 sync, void *key)
  
	if (flags & POLLIN)
		/* An event has been signaled, inject an interrupt */
 -		schedule_work(&irqfd->inject);
 +		irqfd->execute(irqfd);
 
	if (flags & POLLHUP) {
/* The eventfd is closing, detach from KVM */
 @@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
	irqfd->kvm = kvm;
	irqfd->gsi = gsi;
	INIT_LIST_HEAD(&irqfd->list);
 -	INIT_WORK(&irqfd->inject, irqfd_inject);
 +	INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
 
	file = eventfd_fget(fd);
 @@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
	list_add_tail(&irqfd->list, &kvm->irqfds.items);
	spin_unlock_irq(&kvm->irqfds.lock);
  
 +	ret = kvm_irq_check_lockless(kvm, gsi);
 +	if (ret < 0)
 +		goto fail;
 +
 +	if (ret)
 +		irqfd->execute = irqfd_inject;
 +	else
 +		irqfd->execute = irqfd_schedule;
 +
 Can't gsi get converted from lockless to non-lockless
 after it's checked (by the routing ioctl)?
 I think I protect against this in patch 2/3 by ensuring that any vectors
 that are added have to conform to the same locking rules.  The code
 doesn't support deleting routes, so we really only need to make sure
 that new routes do not change.
 
 What I refer to, is when userspace calls KVM_SET_GSI_ROUTING.
 I don't see how your patch helps here: can't a GSI formerly
 used for MSI become unused, and then reused for non-MSI?
 If not, it's a problem I think, because I think userspace currently does this
 sometimes.

I see your point.  I was thinking vectors could only be added, not
deleted, but I see upon further inspection that is not the case.

 
 Kernel will crash then.

 How about, each time we get event from eventfd, we implement
 kvm_irqfd_toggle_lockless, which does a single scan, and returns
 true/false status (and I really mean toggle, let's not do set 1 / set 0
 as well) telling us whether interrupts could be delivered in a lockless
 manner?
 I am not sure I like this idea in general given that I believe I already
 handle the error case you are concerned with.

 However, the concept of providing a toggle option so we can avoid
 scanning the list twice is a good one.  That can be done as a new patch
 series, but it would be a nice addition.

 Thanks Michael,
 -Greg

 
 






Re: [KVM PATCH v3 2/3] KVM: export lockless GSI attribute

2009-10-28 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 On Mon, Oct 26, 2009 at 12:22:03PM -0400, Gregory Haskins wrote:
 Certain GSI's support lockless injecton, but we have no way to detect
 which ones at the GSI level.  Knowledge of this attribute will be
 useful later in the series so that we can optimize irqfd injection
 paths for cases where we know the code will not sleep.  Therefore,
 we provide an API to query a specific GSI.

 Signed-off-by: Gregory Haskins ghask...@novell.com
 ---

  include/linux/kvm_host.h |2 ++
  virt/kvm/irq_comm.c  |   35 ++-
  2 files changed, 36 insertions(+), 1 deletions(-)

 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index 1fe135d..01151a6 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -119,6 +119,7 @@ struct kvm_memory_slot {
  struct kvm_kernel_irq_routing_entry {
  u32 gsi;
  u32 type;
 +bool lockless;
 
 So lockless is the same as type == MSI from below?

Yep, today anyway.

 If the idea is to make it extensible for the future,
 let's just add an inline function, we don't need a field for this.
 

This makes sense.

  int (*set)(struct kvm_kernel_irq_routing_entry *e,
 struct kvm *kvm, int irq_source_id, int level);
  union {
 @@ -420,6 +421,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic 
 *ioapic,
 unsigned long *deliver_bitmask);
  #endif
  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
 +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq);
  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
  void kvm_register_irq_ack_notifier(struct kvm *kvm,
 struct kvm_irq_ack_notifier *kian);
 diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
 index db2553f..a7fd487 100644
 --- a/virt/kvm/irq_comm.c
 +++ b/virt/kvm/irq_comm.c
 @@ -173,6 +173,35 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 
 irq, int level)
  return ret;
  }
  
 +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq)
 +{
 +	struct kvm_kernel_irq_routing_entry *e;
 +	struct kvm_irq_routing_table *irq_rt;
 +	struct hlist_node *n;
 +	int ret = -ENOENT;
 +	int idx;
 +
 +	idx = srcu_read_lock(&kvm->irq_routing.srcu);
 +	irq_rt = rcu_dereference(kvm->irq_routing.table);
 +	if (irq < irq_rt->nr_rt_entries)
 +		hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
 +			if (!e->lockless) {
 +				/*
 +				 * all destinations need to be lockless to
 +				 * declare that the GSI as a whole is also
 +				 * lockless
 +				 */
 +				ret = 0;
 +				break;
 +			}
 +
 +			ret = 1;
 +		}
 +	srcu_read_unlock(&kvm->irq_routing.srcu, idx);
 +
 +	return ret;
 +}
 +
  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
  {
  struct kvm_irq_ack_notifier *kian;
 @@ -310,18 +339,22 @@ static int setup_routing_entry(struct 
 kvm_irq_routing_table *rt,
  int delta;
  struct kvm_kernel_irq_routing_entry *ei;
  struct hlist_node *n;
 +	bool lockless = ue->type == KVM_IRQ_ROUTING_MSI;
  
  /*
   * Do not allow GSI to be mapped to the same irqchip more than once.
   * Allow only one to one mapping between GSI and MSI.
 + * Do not allow mixed lockless vs locked variants to coexist.
 
 Userspace has no idea which entries are lockless and which are not:
 this is an implementation detail - so it might not be able to avoid
 illegal combinations.
 Since this is called on an ioctl, can the rule be formulated in a way
 that makes sense for userspace?
 

I'm not sure.

   */
	hlist_for_each_entry(ei, n, &rt->map[ue->gsi], link)
		if (ei->type == KVM_IRQ_ROUTING_MSI ||
 -		    ue->u.irqchip.irqchip == ei->irqchip.irqchip)
 +		    ue->u.irqchip.irqchip == ei->irqchip.irqchip ||
 +		    ei->lockless != lockless)
 
 So this check seems like it does nothing, because lockless is same as
 MSI, and MSI is always 1:1? Intentional?
 

Yeah, it was more to guard against (and document) the dependency, in case the
1:1 mapping with MSI ever changes in the future.

Kind Regards,
-Greg





Re: [Alacrityvm-devel] [KVM PATCH v2 1/2] KVM: export lockless GSI attribute

2009-10-28 Thread Gregory Haskins
Avi Kivity wrote:
 On 10/28/2009 03:19 PM, Gregory Haskins wrote:
 Yes, and it also contains the work_struct.

 What if we make the work_struct (and any additional state) part of the
 set_atomic() argument list?  Does it simplify things?
  
 Hmmm, that might not, but we could do a kmalloc(GFP_ATOMIC) for such
 parameters.  Considering this is just a safety net, perhaps this would
 work fine.

 
 Can't you simply pass the same work_struct from irqfd as we use now?

Well, yes, of course, but I am not sure that buys us much in terms of
generalizing the code.  Unless I am misunderstanding, that would still
leave the impetus of the init/sync/cleanup to the irqfd code, at which
point we might as well just leave it entirely in irqfd anyway.  Or am I
misunderstanding you?

 
 So while generalizing this perhaps makes sense at some point,
 especially
 if irqfd-like interfaces get added, it probably doesn't make a ton of
 sense to expend energy on it ATM.  It is basically a generalization of
 the irqfd deferrment code.  Lets just wait until we have a user beyond
 irqfd for now.  Sound acceptable?


 I'll look at v3, but would really like to disentangle this.
  
 Ok, I will see what I can do.  I need at least a v4 to get rid of the
 dependency on the now defunct v3:1/3 patch per yesterdays discussion.

 
 There's another alternative - make ioapic and pic irq-safe by switching
 irq locking to spinlocks and using spin_lock_irqsave().
 
 I've long opposed this since the ioapic loops on all vcpus when
 injecting some irqs and this will increase irqoff times with large
 guests.  But we don't have large guests now, and we need irq-safe
 injection in three places now:
 
 - irqfd
 - pit - we now signal vcpu0 to handle the injection, but this has its
 problems
 - device assignment
 
 so it may be better to have irq-safe injection, and deal with the loop
 later (would be good to have an idea how exactly).

Ok, perhaps I should just hold off on this series for now, then.  I can
submit the original "assume atomic-safe" version once the path is fully lockless.

-Greg

 






Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

2009-10-27 Thread Gregory Haskins
Hi Paul,

Paul E. McKenney wrote:
 On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
 The current code suffers from the following race condition:

 thread-1thread-2
 ---

 kvm_set_irq() {
rcu_read_lock()
irq_rt = rcu_dereference(table);
rcu_read_unlock();

kvm_set_irq_routing() {
   mutex_lock();
   irq_rt = table;
   rcu_assign_pointer();
   mutex_unlock();
   synchronize_rcu();

   kfree(irq_rt);

irq_rt-entry-set(); /* bad */

 -

 Because the pointer is accessed outside of the read-side critical
 section.  There are two basic patterns we can use to fix this bug:

 1) Switch to sleeping-rcu and encompass the -set() access within the
read-side critical section,

OR

 2) Add reference counting to the irq_rt structure, and simply acquire
the reference from within the RSCS.

 This patch implements solution (1).
 
 Looks like a good transformation!  A few questions interspersed below.

Thanks for the review.  I would have CC'd you but I figured I pestered
you enough with my RCU reviews in the past, and didn't want to annoy you ;)

I will be sure to CC you in the future, unless you ask otherwise.

 
 Signed-off-by: Gregory Haskins ghask...@novell.com
 ---

  include/linux/kvm_host.h |6 +-
  virt/kvm/irq_comm.c  |   50 
 +++---
  virt/kvm/kvm_main.c  |1 +
  3 files changed, 35 insertions(+), 22 deletions(-)

 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index bd5a616..1fe135d 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -185,7 +185,10 @@ struct kvm {

  struct mutex irq_lock;
  #ifdef CONFIG_HAVE_KVM_IRQCHIP
 -struct kvm_irq_routing_table *irq_routing;
 +struct {
 +		struct srcu_struct            srcu;
 
 Each structure has its own SRCU domain.  This is OK, but just asking
 if that is the intent.  It does look like the SRCU primitives are
 passed a pointer to the correct structure, and that the return value
 from srcu_read_lock() gets passed into the matching srcu_read_unlock()
 like it needs to be, so that is good.

Yeah, it was intentional.  Technically the table is per-guest, and thus
the locking is too, which is the desired/intentional granularity.

On that note, I tried to denote that kvm->irq_routing.srcu and
kvm->irq_routing.table were related to one another, but then went ahead
and modified the hunks that touched kvm->irq_ack_notifier_list, too.  In
retrospect, this was probably a mistake.  I should leave the rcu usage
outside of ->irq_routing.table alone.

 
 +struct kvm_irq_routing_table *table;
 +} irq_routing;
  struct hlist_head mask_notifier_list;
  struct hlist_head irq_ack_notifier_list;
  #endif
 
 [ . . . ]
 
 @@ -155,21 +156,19 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, 
 u32 irq, int level)
   * IOAPIC.  So set the bit in both. The guest will ignore
   * writes to the unused one.
   */
 -	rcu_read_lock();
 -	irq_rt = rcu_dereference(kvm->irq_routing);
 +	idx = srcu_read_lock(&kvm->irq_routing.srcu);
 +	irq_rt = rcu_dereference(kvm->irq_routing.table);
 	if (irq < irq_rt->nr_rt_entries)
 -		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
 -			irq_set[i++] = *e;
 -	rcu_read_unlock();
 +		hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
 
 What prevents the above list from changing while we are traversing it?
 (Yes, presumably whatever was preventing it from changing before this
 patch, but what?)
 
 Mostly kvm-lock is held, but not always.  And if kvm-lock were held
 all the time, there would be no point in using SRCU.  ;-)

This is protected by kvm->irq_lock within kvm_set_irq_routing().
Entries are added to a copy of the list, and the top-level table pointer
is swapped (via rcu_assign_pointer(), as it should be) while holding the
lock.  Finally, we synchronize with the RSCS before deleting the old
copy.  It looks to me like the original author got this part right, so I
didn't modify it outside of converting to SRCU.
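
Paraphrased (a rough sketch rather than the literal kvm_set_irq_routing()
code, with the SRCU conversion from this patch applied and a hypothetical
wrapper name), the update side amounts to:

static int routing_update(struct kvm *kvm, struct kvm_irq_routing_table *new)
{
	struct kvm_irq_routing_table *old;

	mutex_lock(&kvm->irq_lock);
	old = kvm->irq_routing.table;
	rcu_assign_pointer(kvm->irq_routing.table, new);	/* publish */
	mutex_unlock(&kvm->irq_lock);

	synchronize_srcu(&kvm->irq_routing.srcu);	/* wait for readers */
	kfree(old);					/* now safe to free */

	return 0;
}

The only thing my patch really changes on this side is the flavor of the
grace-period wait.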

 
 +int r;

 -	while(i--) {
 -		int r;
 -		r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level);
 -		if (r < 0)
 -			continue;
 +			r = e->set(e, kvm, irq_source_id, level);
 +			if (r < 0)
 +				continue;
 
 -		ret = r + ((ret < 0) ? 0 : ret);
 -	}
 +			ret = r + ((ret < 0) ? 0 : ret);
 +		}
 +	srcu_read_unlock(&kvm->irq_routing.srcu, idx);

Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

2009-10-27 Thread Gregory Haskins
Gleb Natapov wrote:
 On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
 The current code suffers from the following race condition:

 thread-1thread-2
 ---

 kvm_set_irq() {
rcu_read_lock()
irq_rt = rcu_dereference(table);
rcu_read_unlock();

kvm_set_irq_routing() {
   mutex_lock();
   irq_rt = table;
   rcu_assign_pointer();
   mutex_unlock();
   synchronize_rcu();

   kfree(irq_rt);

irq_rt-entry-set(); /* bad */

 This is not what happens. irq_rt is never accessed outside read-side
 critical section.

Sorry, I was generalizing to keep the comments short.  I figured it
would be clear what I was actually saying, but realize in retrospect
that I was a little ambiguous.

Yes, irq_rt is not accessed outside the RSCS.  However, the entry
pointers stored in the irq_rt->map are, and this is equally problematic
afaict.

In this particular case we seem to never delete entries at run-time once
they are established.  Therefore, while perhaps sloppy, it's technically
safe to leave them unprotected from this perspective.  The issue is more
related to shutdown, since a kvm_set_irq() caller could be within the
aforementioned race-region and call entry->set() after the guest is
gone.  Or did I miss something?

 Data is copied from irq_rt onto the stack and this copy is accessed
 outside critical section.

As mentioned above, I do not believe this really protects us.  And even
if it did, the copy is just a work-around to avoid sleeping within the
standard RCU RSCS, which is what SRCU is designed for.  So rather than
inventing an awkward two-phased stack based solution, it's better to
reuse the provided tools, IMO.

To flip it around:  Is there any reason why an SRCU would not work here,
and thus we were forced to use something like the stack-copy approach?

Kind Regards,
-Greg





Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

2009-10-27 Thread Gregory Haskins
Gregory Haskins wrote:
 Gleb Natapov wrote:
 On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
 The current code suffers from the following race condition:

 thread-1thread-2
 ---

 kvm_set_irq() {
rcu_read_lock()
irq_rt = rcu_dereference(table);
rcu_read_unlock();

kvm_set_irq_routing() {
   mutex_lock();
   irq_rt = table;
   rcu_assign_pointer();
   mutex_unlock();
   synchronize_rcu();

   kfree(irq_rt);

irq_rt-entry-set(); /* bad */

 This is not what happens. irq_rt is never accessed outside read-side
 critical section.
 
 Sorry, I was generalizing to keep the comments short.  I figured it
 would be clear what I was actually saying, but realize in retrospect
 that I was a little ambiguous.

Here is a revised problem statement

thread-1thread-2
---

kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   entry_cache = get_entries(irq_rt);
   rcu_read_unlock();

invalidate_entries(irq_rt);

   for_each_entry(entry_cache)
   entry->set(); /* bad */

-


invalidate_entries() may be any operation that deletes an entry at
run-time (which doesn't exist today), or the teardown as the guest is shutting down.  As
far as I can tell, the current code does not protect us from either
condition, and my proposed patch protects us from both.  Did I miss
anything?

HTH
-Greg






Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

2009-10-27 Thread Gregory Haskins
Gleb Natapov wrote:
 On Tue, Oct 27, 2009 at 09:39:03AM -0400, Gregory Haskins wrote:
 Gleb Natapov wrote:
 On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
 The current code suffers from the following race condition:

 thread-1thread-2
 ---

 kvm_set_irq() {
rcu_read_lock()
irq_rt = rcu_dereference(table);
rcu_read_unlock();

kvm_set_irq_routing() {
   mutex_lock();
   irq_rt = table;
   rcu_assign_pointer();
   mutex_unlock();
   synchronize_rcu();

   kfree(irq_rt);

irq_rt-entry-set(); /* bad */

 This is not what happens. irq_rt is never accessed outside read-side
 critical section.
 Sorry, I was generalizing to keep the comments short.  I figured it
 would be clear what I was actually saying, but realize in retrospect
 that I was a little ambiguous.

 A little is underestimation :) There is not /* bad */ line in the code!

Sorry, that was my own highlighting, not trying to reflect actual code.

 
 Yes, irq_rt is not accessed outside the RSCS.  However, the entry
 pointers stored in the irq_rt-map are, and this is equally problematic
 afaict.
 The pointer is in text and can't disappear without kvm_set_irq()
 disappearing too.

No, the entry* pointer is .text _AND_ .data, and is subject to standard
synchronization rules like most other objects.

Unless I am misreading the code, the entry* pointers point to heap memory
within the irq_rt object.  Therefore, the kfree(irq_rt) I mention
above effectively invalidates the entire set of entry* pointers that you
are caching, and is thus an issue.

 
 In this particular case we seem to never delete entries at run-time once
 they are established.  Therefore, while perhaps sloppy, its technically
 safe to leave them unprotected from this perspective.

Note: I was wrong in this statement.  I forgot that it's not safe at
run-time either since the entry objects are part of irq_rt.

 The issue is more
 related to shutdown since a kvm_set_irq() caller could be within the
 aforementioned race-region and call entry-set() after the guest is
 gone.  Or did I miss something?

 The caller of kvm_set_irq() should hold reference to kvm instance, so it
 can't disappear while you are inside kvm_set_irq(). RCU protects only
 kvm-irq_routing not kvm structure itself.

Agreed, but this has nothing to do with protecting the entry* pointers.

 
 Data is copied from irq_rt onto the stack and this copy is accessed
 outside critical section.
 As mentioned above, I do not believe this really protect us.  And even
 I don't see the prove it doesn't, so I assume it does.

What would you like to see beyond what I've already provided you?  I can
show how the entry pointers are allocated as part of the irq_rt, and I
can show how the irq_rt (via entry->set) access is racy against
invalidation.

 
 if it did, the copy is just a work-around to avoid sleeping within the
 It is not a work-around. There was two solutions to the problem one is
 to call -set() outside rcu critical section

This is broken afaict without taking additional precautions, such as a
reference count on the irq_rt structure, but I mentioned this alternate
solution in my header.

 another is to use SRCU. I
 decided to use the first one. This way the code is much simpler

"Simpler" is debatable, but ok.  SRCU is an established pattern
available in the upstream kernel, so I don't think it's particularly
complicated or controversial to use.

 and I remember asking Paul what are the disadvantages of using SRCU and there
 was something.
 

The disadvantages to my knowledge are as follows:

1) rcu_read_lock is something like 4x faster than srcu_read_lock(), but
we are talking about nanoseconds on modern hardware (I think Paul quoted
me 10ns vs 45ns on his rig).  I don't think either overhead is something
to be concerned about in this case.

2) standard rcu supports deferred synchronization (call_rcu()), as well
as barriers (synchronize_rcu()), whereas SRCU only supports barriers
(synchronize_srcu()).  We only use the barrier type in this code path,
so that is not an issue.

3) SRCU requires explicit initialization/cleanup, whereas regular RCU
does not.  Trivially solved in my patch since KVM has plenty of
init/cleanup hook points.
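
For completeness, the whole lifecycle is only a handful of lines (a generic
sketch, not the KVM-specific hookup):

#include <linux/init.h>
#include <linux/srcu.h>

static struct srcu_struct my_srcu;

static int __init my_setup(void)
{
	return init_srcu_struct(&my_srcu);	/* explicit init ...        */
}

static void my_teardown(void)
{
	cleanup_srcu_struct(&my_srcu);		/* ... and explicit cleanup */
}

static void my_reader(void)
{
	int idx;

	idx = srcu_read_lock(&my_srcu);
	/* read side: blocking (e.g. mutex_lock()) is permitted here */
	srcu_read_unlock(&my_srcu, idx);
}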

 standard RCU RSCS, which is what SRCU is designed for.  So rather than
 inventing an awkward two-phased stack based solution, it's better to
 reuse the provided tools, IMO.

 To flip it around:  Is there any reason why an SRCU would not work here,
 and thus we were forced to use something like the stack-copy approach?

 If SRCU has no disadvantage comparing to RCU why not use it always? :)

No one is debating

Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

2009-10-27 Thread Gregory Haskins
Gleb Natapov wrote:
 On Tue, Oct 27, 2009 at 10:00:15AM -0400, Gregory Haskins wrote:
 Gregory Haskins wrote:
 Gleb Natapov wrote:
 On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
 The current code suffers from the following race condition:

 thread-1thread-2
 ---

 kvm_set_irq() {
rcu_read_lock()
irq_rt = rcu_dereference(table);
rcu_read_unlock();

kvm_set_irq_routing() {
   mutex_lock();
   irq_rt = table;
   rcu_assign_pointer();
   mutex_unlock();
   synchronize_rcu();

   kfree(irq_rt);

irq_rt-entry-set(); /* bad */

 This is not what happens. irq_rt is never accessed outside read-side
 critical section.
 Sorry, I was generalizing to keep the comments short.  I figured it
 would be clear what I was actually saying, but realize in retrospect
 that I was a little ambiguous.
 Here is a revised problem statement

 thread-1thread-2
 ---

 kvm_set_irq() {
rcu_read_lock()
irq_rt = rcu_dereference(table);
entry_cache = get_entries(irq_rt);
rcu_read_unlock();

 invalidate_entries(irq_rt);

for_each_entry(entry_cache)
entry-set(); /* bad */

 -


 invalidate_entries() may be any operation that deletes an entry at
 run-time (doesn't exist today), or as the guest is shutting down.  As
 far as I can tell, the current code does not protect us from either
 condition, and my proposed patch protects us from both.  Did I miss
 anything?

 Yes. What happened to irq_rt is completely irrelevant at the point you
 marked /* bad */.

kfree() happened to irq_rt, and thus to the objects behind the pointers
in entry_cache at the point I marked /* bad */.

That certainly isn't /* good */ ;)

-Greg





Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

2009-10-27 Thread Gregory Haskins
Thanks for this, Paul.

Some questions and statements below.

Paul E. McKenney wrote:
 On Tue, Oct 27, 2009 at 04:02:37PM +0200, Gleb Natapov wrote:
 On Tue, Oct 27, 2009 at 09:39:03AM -0400, Gregory Haskins wrote:
 
 [ . . . ]
 
 standard RCU RSCS, which is what SRCU is designed for.  So rather than
 inventing an awkward two-phased stack based solution, it's better to
 reuse the provided tools, IMO.

 To flip it around:  Is there any reason why an SRCU would not work here,
 and thus we were forced to use something like the stack-copy approach?

 If SRCU has no disadvantage comparing to RCU why not use it always? :)
 
 The disadvantages of SRCU compared to RCU include the following:
 
 1.SRCU requires that the return value of srcu_read_lock()
   be fed into srcu_read_unlock().  This is usually not a problem,
   but can be painful if there are multiple levels of function
   call separating the two.

Right, and this is simple/neat w.r.t. its usage in irq_routing, so no
problem there.

 
 2.SRCU's grace periods are about 4x slower than those of RCU.
   And they also don't scale all that well with extremely large
   numbers of CPUs (but this can be fixed when/if it becomes a
   real problem).

The irq_routing update path is extremely infrequent, so this should not
be an issue.

 
 3.SRCU's read-side primitives are also significantly slower than
   those of RCU.
 

Are the 10ns vs 45ns numbers that I mentioned in my last reply the
proper ballpark?  How do these compare to an atomic-op, say an
uncontended spinlock on modern hardware?  The assumption is that
srcu_read_lock() should be significantly cheaper than a read-lock().  If
its not, then we might as well use something else, I suppose.  But if
its not, I guess you probably wouldn't have bothered to invent it in the
first place ;)

 4.SRCU does not have a call_srcu().  One could be provided, but
   its semantics would be a bit strange due to the need to limit
   the number of callbacks, given that general blocking is
   permitted in SRCU read-side critical sections.  (And it would
   take some doing to convince me to supply an SRCU!)

This is not an issue in our design.

 
 5.The current SRCU has no reasonable way to implement read-side
   priority boosting, as there is no record of which task
   is read-holding which SRCU.

Given the infrequency of the update path, I do not see this as a problem.

Kind Regards,
-Greg





Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

2009-10-27 Thread Gregory Haskins
Gleb Natapov wrote:
 On Tue, Oct 27, 2009 at 10:50:45AM -0400, Gregory Haskins wrote:
 Gleb Natapov wrote:
 On Tue, Oct 27, 2009 at 10:00:15AM -0400, Gregory Haskins wrote:
 Gregory Haskins wrote:
 Gleb Natapov wrote:
 On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
 The current code suffers from the following race condition:

 thread-1thread-2
 ---

 kvm_set_irq() {
rcu_read_lock()
irq_rt = rcu_dereference(table);
rcu_read_unlock();

kvm_set_irq_routing() {
   mutex_lock();
   irq_rt = table;
   rcu_assign_pointer();
   mutex_unlock();
   synchronize_rcu();

   kfree(irq_rt);

irq_rt-entry-set(); /* bad */

 This is not what happens. irq_rt is never accessed outside read-side
 critical section.
 Sorry, I was generalizing to keep the comments short.  I figured it
 would be clear what I was actually saying, but realize in retrospect
 that I was a little ambiguous.
 Here is a revised problem statement

 thread-1thread-2
 ---

 kvm_set_irq() {
rcu_read_lock()
irq_rt = rcu_dereference(table);
entry_cache = get_entries(irq_rt);
rcu_read_unlock();

 invalidate_entries(irq_rt);

for_each_entry(entry_cache)
  entry-set(); /* bad */

 -


 invalidate_entries() may be any operation that deletes an entry at
 run-time (doesn't exist today), or as the guest is shutting down.  As
 far as I can tell, the current code does not protect us from either
 condition, and my proposed patch protects us from both.  Did I miss
 anything?

 Yes. What happened to irq_rt is completely irrelevant at the point you
 marked /* bad */.
 kfree() happened to irq_rt, and thus to the objects behind the pointers
 in entry_cache at the point I marked /* bad */.
 The entire entry is cached not a pointer to an entry! kfree().

*light bulb goes off*

Ah, I see.  I missed that detail that it was a structure copy, not a
pointer copy.
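
In other words (a tiny illustrative userspace sketch with simplified names,
not the kvm code itself), the difference is:

#include <stdio.h>
#include <stdlib.h>

struct entry {				/* stand-in for the routing entry */
	int gsi;
};

int main(void)
{
	struct entry *table = malloc(sizeof(*table));
	struct entry *cached_ptr;
	struct entry  cached_copy;

	if (!table)
		return 1;

	table->gsi = 42;

	cached_ptr  = table;		/* points into the table's memory   */
	cached_copy = *table;		/* full structure copy on the stack */

	free(table);			/* analogous to kfree(irq_rt)       */

	printf("%d\n", cached_copy.gsi);	/* fine: the copy is self-contained */
	/* cached_ptr->gsi here would be the use-after-free I was assuming */
	(void)cached_ptr;
	return 0;
}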

My bad.  You are right, and I am wrong.  I retract the 1/3 patch.

Kind Regards,
-Greg





Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

2009-10-27 Thread Gregory Haskins
Gleb Natapov wrote:

 1) rcu_read_lock is something like 4x faster than srcu_read_lock(), but
 we are talking about nanoseconds on modern hardware (I think Paul quoted
 me 10ns vs 45ns on his rig).  I don't think either overhead is something
 to be concerned about in this case.

 If we can avoid why not? Nanoseconds tend to add up.
 

BTW: I didn't mean to imply that we should be cavalier in adding
overhead.  My point was that adding overhead is sometimes _necessary_ to
prevent a race above and beyond an RCU pointer acquisition, and 35ns is
not _terrible_ IMO.

I was suggesting to solve this by switching to SRCU, but an alternative
is copying the structure (when permitted with immutable objects), which
seems to work in this particular case.  It should be noted that the copy
has its own unquantified overhead beyond basic RCU as well, so it's not
truly free (I'd guess it's >= the switch to SRCU without copies, though).

IOW: sync hurts, but its sometimes a necessary evil ;)

Kind Regards,
-Greg





Re: [KVM PATCH v3 3/3] KVM: Directly inject interrupts if they support lockless operation

2009-10-27 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 On Mon, Oct 26, 2009 at 12:22:08PM -0400, Gregory Haskins wrote:
 IRQFD currently uses a deferred workqueue item to execute the injection
 operation.  It was originally designed this way because kvm_set_irq()
 required the caller to hold the irq_lock mutex, and the eventfd callback
 is invoked from within a non-preemptible critical section.

 With the advent of lockless injection support for certain GSIs, the
 deferment mechanism is no longer technically needed in all cases.
 Since context switching to the workqueue is a source of interrupt
 latency, lets switch to a direct method whenever possible.  Fortunately
 for us, the most common use of irqfd (MSI-based GSIs) readily support
 lockless injection.

 Signed-off-by: Gregory Haskins ghask...@novell.com
 
 This is a useful optimization I think.
 Some comments below.
 
 ---

  virt/kvm/eventfd.c |   31 +++
  1 files changed, 27 insertions(+), 4 deletions(-)

 diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
 index 30f70fd..e6cc958 100644
 --- a/virt/kvm/eventfd.c
 +++ b/virt/kvm/eventfd.c
 @@ -51,20 +51,34 @@ struct _irqfd {
  wait_queue_t  wait;
	struct work_struct    inject;
	struct work_struct    shutdown;
 +void (*execute)(struct _irqfd *);
  };
  
  static struct workqueue_struct *irqfd_cleanup_wq;
  
  static void
 -irqfd_inject(struct work_struct *work)
 +irqfd_inject(struct _irqfd *irqfd)
  {
 -struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
	struct kvm *kvm = irqfd->kvm;
 
	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
  }
  
 +static void
 +irqfd_deferred_inject(struct work_struct *work)
 +{
 +struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 +
 +irqfd_inject(irqfd);
 +}
 +
 +static void
 +irqfd_schedule(struct _irqfd *irqfd)
 +{
 +	schedule_work(&irqfd->inject);
 +}
 +
  /*
   * Race-free decouple logic (ordering is critical)
   */
 @@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int 
 sync, void *key)
  
	if (flags & POLLIN)
		/* An event has been signaled, inject an interrupt */
 -		schedule_work(&irqfd->inject);
 +		irqfd->execute(irqfd);
 
	if (flags & POLLHUP) {
  /* The eventfd is closing, detach from KVM */
 @@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
	irqfd->kvm = kvm;
	irqfd->gsi = gsi;
	INIT_LIST_HEAD(&irqfd->list);
 -	INIT_WORK(&irqfd->inject, irqfd_inject);
 +	INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
 
	file = eventfd_fget(fd);
 @@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
	list_add_tail(&irqfd->list, &kvm->irqfds.items);
	spin_unlock_irq(&kvm->irqfds.lock);
  
 +	ret = kvm_irq_check_lockless(kvm, gsi);
 +	if (ret < 0)
 +		goto fail;
 +
 +	if (ret)
 +		irqfd->execute = irqfd_inject;
 +	else
 +		irqfd->execute = irqfd_schedule;
 +
 
 Can't gsi get converted from lockless to non-lockless
 after it's checked (by the routing ioctl)?

I think I protect against this in patch 2/3 by ensuring that any vectors
that are added have to conform to the same locking rules.  The code
doesn't support deleting routes, so we really only need to make sure
that new routes do not change.

 Kernel will crash then.
 
 How about, each time we get event from eventfd, we implement
 kvm_irqfd_toggle_lockless, which does a single scan, and returns
 true/false status (and I really mean toggle, let's not do set 1 / set 0
 as well) telling us whether interrupts could be delivered in a lockless
 manner?

I am not sure I like this idea in general given that I believe I already
handle the error case you are concerned with.

However, the concept of providing a toggle option so we can avoid
scanning the list twice is a good one.  That can be done as a new patch
series, but it would be a nice addition.

Thanks Michael,
-Greg





Re: [Alacrityvm-devel] [KVM PATCH v2 1/2] KVM: export lockless GSI attribute

2009-10-26 Thread Gregory Haskins
Avi Kivity wrote:
 On 10/23/2009 04:38 AM, Gregory Haskins wrote:
 Certain GSI's support lockless injecton, but we have no way to detect
 which ones at the GSI level.  Knowledge of this attribute will be
 useful later in the series so that we can optimize irqfd injection
 paths for cases where we know the code will not sleep.  Therefore,
 we provide an API to query a specific GSI.


 
 Instead of a lockless attribute, how about a -set_atomic() method.  For 
 msi this can be the same as -set(), for non-msi it can be a function 
 that schedules the work (which will eventually call -set()).
 
 The benefit is that we make a decision only once, when preparing the 
 routing entry, and install that decision in the routing entry instead of 
 making it again and again later.

Yeah, I like this idea.  I think we can also get rid of the custom
workqueue if we do this as well, TBD.

 
 +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq)

 
 bool kvm_irq_check_lockless(...)

We lose the ability to detect failure (such as ENOENT) if we do this,
but it's moot if we move to the ->set_atomic() model, since this
attribute is no longer necessary and this patch can be dropped.

Kind Regards,
-Greg





Re: [Alacrityvm-devel] [KVM PATCH v2 1/2] KVM: export lockless GSI attribute

2009-10-26 Thread Gregory Haskins
Gregory Haskins wrote:
 Avi Kivity wrote:
 On 10/23/2009 04:38 AM, Gregory Haskins wrote:
 Certain GSI's support lockless injecton, but we have no way to detect
 which ones at the GSI level.  Knowledge of this attribute will be
 useful later in the series so that we can optimize irqfd injection
 paths for cases where we know the code will not sleep.  Therefore,
 we provide an API to query a specific GSI.


 Instead of a lockless attribute, how about a -set_atomic() method.  For 
 msi this can be the same as -set(), for non-msi it can be a function 
 that schedules the work (which will eventually call -set()).

 The benefit is that we make a decision only once, when preparing the 
 routing entry, and install that decision in the routing entry instead of 
 making it again and again later.
 
 Yeah, I like this idea.  I think we can also get rid of the custom
 workqueue if we do this as well, TBD.

So I looked into this.  It isn't straightforward because you need to
retain some kind of state across the deferment on a per-request basis
(not per-GSI).  Today, this state is neatly tracked into the irqfd
object itself (e.g. it knows to toggle the GSI).

So while generalizing this perhaps makes sense at some point, especially
if irqfd-like interfaces get added, it probably doesn't make a ton of
sense to expend energy on it ATM.  It is basically a generalization of
the irqfd deferment code.  Let's just wait until we have a user beyond
irqfd for now.  Sound acceptable?

In the meantime, I found a bug in the irq_routing code, so I will submit
a v3 with this fix, as well as a few other things I improved in the v2
series.

Kind Regards,
-Greg





[KVM PATCH v3 0/3] irqfd enhancements, and irq_routing fixes

2009-10-26 Thread Gregory Haskins
(Applies to kvm.git/master:11b06403)

The following patches are cleanups/enhancements for IRQFD now that
we have lockless interrupt injection.  For more details, please see
the patch headers.

These patches pass checkpatch, and are fully tested.  Please consider
for merging.  Patch 1/3 is a fix for an issue that may exist upstream
and should be considered for a more timely push upstream.  Patches 2/3
- 3/3 are an enhancement only, so there is no urgency to push to
mainline until a suitable merge window presents itself.

Kind Regards,
-Greg

[ Change log:

  v3:
 *) Added patch 1/3 as a fix for a race condition
 *) Minor cleanup to 2/3 to ensure that all shared vectors conform
to a unified locking model.

  v2:
 *) dropped the original cleanup, which relied on the user registering
only MSI-based GSIs or we could crash at runtime.  Instead, we now
check at registration whether the GSI supports lockless
operation and dynamically adapt to either the original
deferred path for lock-based injections, or direct for lockless.

  v1:
 *) original release
]

---

Gregory Haskins (3):
  KVM: Directly inject interrupts if they support lockless operation
  KVM: export lockless GSI attribute
  KVM: fix race in irq_routing logic


 include/linux/kvm_host.h |8 
 virt/kvm/eventfd.c   |   31 +++--
 virt/kvm/irq_comm.c  |   85 ++
 virt/kvm/kvm_main.c  |1 +
 4 files changed, 98 insertions(+), 27 deletions(-)

-- 
Signature


[KVM PATCH v3 3/3] KVM: Directly inject interrupts if they support lockless operation

2009-10-26 Thread Gregory Haskins
IRQFD currently uses a deferred workqueue item to execute the injection
operation.  It was originally designed this way because kvm_set_irq()
required the caller to hold the irq_lock mutex, and the eventfd callback
is invoked from within a non-preemptible critical section.

With the advent of lockless injection support for certain GSIs, the
deferment mechanism is no longer technically needed in all cases.
Since context switching to the workqueue is a source of interrupt
latency, lets switch to a direct method whenever possible.  Fortunately
for us, the most common use of irqfd (MSI-based GSIs) readily support
lockless injection.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 virt/kvm/eventfd.c |   31 +++
 1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 30f70fd..e6cc958 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -51,20 +51,34 @@ struct _irqfd {
wait_queue_t  wait;
 	struct work_struct        inject;
 	struct work_struct        shutdown;
+   void (*execute)(struct _irqfd *);
 };
 
 static struct workqueue_struct *irqfd_cleanup_wq;
 
 static void
-irqfd_inject(struct work_struct *work)
+irqfd_inject(struct _irqfd *irqfd)
 {
-   struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 	struct kvm *kvm = irqfd->kvm;
 
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
 }
 
+static void
+irqfd_deferred_inject(struct work_struct *work)
+{
+   struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+
+   irqfd_inject(irqfd);
+}
+
+static void
+irqfd_schedule(struct _irqfd *irqfd)
+{
+	schedule_work(&irqfd->inject);
+}
+
 /*
  * Race-free decouple logic (ordering is critical)
  */
@@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, 
void *key)
 
 	if (flags & POLLIN)
 		/* An event has been signaled, inject an interrupt */
-		schedule_work(&irqfd->inject);
+		irqfd->execute(irqfd);
 
 	if (flags & POLLHUP) {
/* The eventfd is closing, detach from KVM */
@@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	irqfd->kvm = kvm;
 	irqfd->gsi = gsi;
 	INIT_LIST_HEAD(&irqfd->list);
-	INIT_WORK(&irqfd->inject, irqfd_inject);
+	INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
 	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
 
file = eventfd_fget(fd);
@@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
 	spin_unlock_irq(&kvm->irqfds.lock);
 
+	ret = kvm_irq_check_lockless(kvm, gsi);
+	if (ret < 0)
+		goto fail;
+
+	if (ret)
+		irqfd->execute = irqfd_inject;
+	else
+		irqfd->execute = irqfd_schedule;
+
/*
 * Check if there was an event already pending on the eventfd
 * before we registered, and trigger it as if we didn't miss it.



[KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

2009-10-26 Thread Gregory Haskins
The current code suffers from the following race condition:

thread-1thread-2
---

kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   rcu_read_unlock();

   kvm_set_irq_routing() {
  mutex_lock();
  irq_rt = table;
  rcu_assign_pointer();
  mutex_unlock();
  synchronize_rcu();

  kfree(irq_rt);

   irq_rt->entry->set(); /* bad */

-

Because the pointer is accessed outside of the read-side critical
section.  There are two basic patterns we can use to fix this bug:

1) Switch to sleeping-rcu and encompass the ->set() access within the
   read-side critical section,

   OR

2) Add reference counting to the irq_rt structure, and simply acquire
   the reference from within the RSCS.

This patch implements solution (1).

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 include/linux/kvm_host.h |6 +-
 virt/kvm/irq_comm.c  |   50 +++---
 virt/kvm/kvm_main.c  |1 +
 3 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bd5a616..1fe135d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -185,7 +185,10 @@ struct kvm {
 
struct mutex irq_lock;
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
-   struct kvm_irq_routing_table *irq_routing;
+	struct {
+		struct srcu_struct	      srcu;
+		struct kvm_irq_routing_table *table;
+	} irq_routing;
struct hlist_head mask_notifier_list;
struct hlist_head irq_ack_notifier_list;
 #endif
@@ -541,6 +544,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
const struct kvm_irq_routing_entry *entries,
unsigned nr,
unsigned flags);
+void kvm_init_irq_routing(struct kvm *kvm);
 void kvm_free_irq_routing(struct kvm *kvm);
 
 #else
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 00c68d2..db2553f 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -144,10 +144,11 @@ static int kvm_set_msi(struct 
kvm_kernel_irq_routing_entry *e,
  */
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 {
-   struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
-   int ret = -1, i = 0;
+   struct kvm_kernel_irq_routing_entry *e;
+   int ret = -1;
struct kvm_irq_routing_table *irq_rt;
struct hlist_node *n;
+   int idx;
 
trace_kvm_set_irq(irq, level, irq_source_id);
 
@@ -155,21 +156,19 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 
irq, int level)
 * IOAPIC.  So set the bit in both. The guest will ignore
 * writes to the unused one.
 */
-	rcu_read_lock();
-	irq_rt = rcu_dereference(kvm->irq_routing);
+	idx = srcu_read_lock(&kvm->irq_routing.srcu);
+	irq_rt = rcu_dereference(kvm->irq_routing.table);
 	if (irq < irq_rt->nr_rt_entries)
-		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
-			irq_set[i++] = *e;
-	rcu_read_unlock();
+		hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
+			int r;
 
-	while(i--) {
-		int r;
-		r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level);
-		if (r < 0)
-			continue;
+			r = e->set(e, kvm, irq_source_id, level);
+			if (r < 0)
+				continue;
 
-		ret = r + ((ret < 0) ? 0 : ret);
-	}
+			ret = r + ((ret < 0) ? 0 : ret);
+		}
+	srcu_read_unlock(&kvm->irq_routing.srcu, idx);
 
return ret;
 }
@@ -179,17 +178,18 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned 
irqchip, unsigned pin)
struct kvm_irq_ack_notifier *kian;
struct hlist_node *n;
int gsi;
+   int idx;
 
trace_kvm_ack_irq(irqchip, pin);
 
-	rcu_read_lock();
-	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	idx = srcu_read_lock(&kvm->irq_routing.srcu);
+	gsi = rcu_dereference(kvm->irq_routing.table)->chip[irqchip][pin];
 	if (gsi != -1)
 		hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
 					 link)
 			if (kian->gsi == gsi)
 				kian->irq_acked(kian);
-	rcu_read_unlock();
+	srcu_read_unlock(&kvm->irq_routing.srcu, idx);
 }
 
 void kvm_register_irq_ack_notifier(struct kvm *kvm,
@@ -287,11

[KVM PATCH v3 2/3] KVM: export lockless GSI attribute

2009-10-26 Thread Gregory Haskins
Certain GSIs support lockless injection, but we have no way to detect
which ones at the GSI level.  Knowledge of this attribute will be
useful later in the series so that we can optimize irqfd injection
paths for cases where we know the code will not sleep.  Therefore,
we provide an API to query a specific GSI.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 include/linux/kvm_host.h |2 ++
 virt/kvm/irq_comm.c  |   35 ++-
 2 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1fe135d..01151a6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -119,6 +119,7 @@ struct kvm_memory_slot {
 struct kvm_kernel_irq_routing_entry {
u32 gsi;
u32 type;
+   bool lockless;
int (*set)(struct kvm_kernel_irq_routing_entry *e,
   struct kvm *kvm, int irq_source_id, int level);
union {
@@ -420,6 +421,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic 
*ioapic,
   unsigned long *deliver_bitmask);
 #endif
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
+int kvm_irq_check_lockless(struct kvm *kvm, u32 irq);
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
 void kvm_register_irq_ack_notifier(struct kvm *kvm,
   struct kvm_irq_ack_notifier *kian);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index db2553f..a7fd487 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -173,6 +173,35 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 
irq, int level)
return ret;
 }
 
+int kvm_irq_check_lockless(struct kvm *kvm, u32 irq)
+{
+   struct kvm_kernel_irq_routing_entry *e;
+   struct kvm_irq_routing_table *irq_rt;
+   struct hlist_node *n;
+   int ret = -ENOENT;
+   int idx;
+
+	idx = srcu_read_lock(&kvm->irq_routing.srcu);
+	irq_rt = rcu_dereference(kvm->irq_routing.table);
+	if (irq < irq_rt->nr_rt_entries)
+		hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
+			if (!e->lockless) {
+				/*
+				 * all destinations need to be lockless to
+				 * declare that the GSI as a whole is also
+				 * lockless
+				 */
+				ret = 0;
+				break;
+			}
+
+			ret = 1;
+		}
+	srcu_read_unlock(&kvm->irq_routing.srcu, idx);
+
+   return ret;
+}
+
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
struct kvm_irq_ack_notifier *kian;
@@ -310,18 +339,22 @@ static int setup_routing_entry(struct 
kvm_irq_routing_table *rt,
int delta;
struct kvm_kernel_irq_routing_entry *ei;
struct hlist_node *n;
+	bool lockless = ue->type == KVM_IRQ_ROUTING_MSI;
 
 	/*
 	 * Do not allow GSI to be mapped to the same irqchip more than once.
 	 * Allow only one to one mapping between GSI and MSI.
+	 * Do not allow mixed lockless vs locked variants to coexist.
 	 */
 	hlist_for_each_entry(ei, n, &rt->map[ue->gsi], link)
 		if (ei->type == KVM_IRQ_ROUTING_MSI ||
-		    ue->u.irqchip.irqchip == ei->irqchip.irqchip)
+		    ue->u.irqchip.irqchip == ei->irqchip.irqchip ||
+		    ei->lockless != lockless)
 			return r;
 
 	e->gsi = ue->gsi;
 	e->type = ue->type;
+	e->lockless = lockless;
 	switch (ue->type) {
 	case KVM_IRQ_ROUTING_IRQCHIP:
 		delta = 0;



Re: [KVM PATCH 1/2] KVM: Directly inject interrupts via irqfd

2009-10-22 Thread Gregory Haskins
Avi Kivity wrote:
 On 10/21/2009 05:42 PM, Gregory Haskins wrote:
 I believe Avi, Michael, et. al. were in agreement with me on that design
 choice.  I believe the reason is that there is no good way to do EOI/ACK
 feedback within the constraints of an eventfd pipe which would be
 required for the legacy pin-type interrupts.  Therefore, we won't even
 bother trying.  High-performance subsystems will use irqfd/msi, and
 legacy emulation can use the existing injection code (which includes the
 necessary feedback for ack/eoi).


 
 Right.  But we don't actually prevent anyone using non-msi with irqfd,
 which can trigger the bad lock usage from irq context, with a nice boom
 afterwards.  So we need to either prevent it during registration, or to
 gracefully handle it afterwards.
 

Yeah, I was thinking about that after I initially responded to Gleb.

I am thinking something along these lines:

Provide a function that lets you query a GSI for whether it supports
LOCKLESS or not.  Then we can either do one of two things:

1) Check for the LOCKLESS attribute at irqfd registration, fail if not
present

2) Cache the LOCKLESS attribute in the irqfd structure, and either go
direct or defer to a workqueue depending on the flag.

Thoughts?
-Greg





Re: [KVM PATCH 1/2] KVM: Directly inject interrupts via irqfd

2009-10-22 Thread Gregory Haskins
Avi Kivity wrote:
 On 10/22/2009 05:14 PM, Gregory Haskins wrote:
 Yeah, I was thinking about that after I initially responded to Gleb.

 I am thinking something along these lines:

 Provide a function that lets you query a GSI for whether it supports
 LOCKLESS or not.  Then we can either do one of two things:

 1) Check for the LOCKLESS attribute at irqfd registration, fail if not
 present

 
 This is the most practical path and leads to the smallest code.  However
 it has the deficiency of exposing internal implementation details to
 userspace.  In theory userspace could use msi and edge-triggered
 pic/ioapic interrupts equally well, it shouldn't have to know that we
 didn't bother to lockfree ioapic/pic.
 
 2) Cache the LOCKLESS attribute in the irqfd structure, and either go
 direct or defer to a workqueue depending on the flag.

 
 While this leads to larger code, it is more consistent.
 

Yeah, I think you are right.  Consider these two patches retracted, and
I will rewrite it with this concept in place.

Kind Regards,
-Greg





[KVM PATCH v2 0/2] irqfd enhancements

2009-10-22 Thread Gregory Haskins
(Applies to kvm.git/master:11b06403)

The following patches are cleanups/enhancements for IRQFD now that
we have lockless interrupt injection.  For more details, please see
the patch headers.

These patches pass checkpatch, and are fully tested.  Please consider
for merging.  They are an enhancement only, so there is no urgency
to push to mainline until a suitable merge window presents itself.

Kind Regards,
-Greg

[ Change log:

  v2:
 *) dropped the original cleanup, which relied on the user registering
only MSI-based GSIs or we could crash at runtime.  Instead, we now
check at registration whether the GSI supports lockless
operation and dynamically adapt to either the original
deferred path for lock-based injections, or direct for lockless.

  v1:
 *) original release
]

---

Gregory Haskins (2):
  KVM: Directly inject interrupts if they support lockless operation
  KVM: export lockless GSI attribute


 include/linux/kvm_host.h |2 ++
 virt/kvm/eventfd.c   |   31 +++
 virt/kvm/irq_comm.c  |   19 +++
 3 files changed, 48 insertions(+), 4 deletions(-)

-- 
Signature


[KVM PATCH v2 1/2] KVM: export lockless GSI attribute

2009-10-22 Thread Gregory Haskins
Certain GSIs support lockless injection, but we have no way to detect
which ones at the GSI level.  Knowledge of this attribute will be
useful later in the series so that we can optimize irqfd injection
paths for cases where we know the code will not sleep.  Therefore,
we provide an API to query a specific GSI.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 include/linux/kvm_host.h |2 ++
 virt/kvm/irq_comm.c  |   19 +++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bd5a616..93393a4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -119,6 +119,7 @@ struct kvm_memory_slot {
 struct kvm_kernel_irq_routing_entry {
u32 gsi;
u32 type;
+   bool lockless;
int (*set)(struct kvm_kernel_irq_routing_entry *e,
   struct kvm *kvm, int irq_source_id, int level);
union {
@@ -417,6 +418,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic 
*ioapic,
   unsigned long *deliver_bitmask);
 #endif
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
+int kvm_irq_check_lockless(struct kvm *kvm, u32 irq);
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
 void kvm_register_irq_ack_notifier(struct kvm *kvm,
   struct kvm_irq_ack_notifier *kian);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 00c68d2..04f0134 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -174,6 +174,23 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 
irq, int level)
return ret;
 }
 
+int kvm_irq_check_lockless(struct kvm *kvm, u32 irq)
+{
+   struct kvm_kernel_irq_routing_entry *e;
+   struct kvm_irq_routing_table *irq_rt;
+   struct hlist_node *n;
+   int ret = -ENOENT;
+
+   rcu_read_lock();
+	irq_rt = rcu_dereference(kvm->irq_routing);
+	if (irq < irq_rt->nr_rt_entries)
+		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
+			ret = e->lockless ? 1 : 0;
+   rcu_read_unlock();
+
+   return ret;
+}
+
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
struct kvm_irq_ack_notifier *kian;
@@ -314,6 +331,7 @@ static int setup_routing_entry(struct kvm_irq_routing_table 
*rt,
 
 	e->gsi = ue->gsi;
 	e->type = ue->type;
+	e->lockless = false;
 	switch (ue->type) {
case KVM_IRQ_ROUTING_IRQCHIP:
delta = 0;
@@ -342,6 +360,7 @@ static int setup_routing_entry(struct kvm_irq_routing_table 
*rt,
 		e->msi.address_lo = ue->u.msi.address_lo;
 		e->msi.address_hi = ue->u.msi.address_hi;
 		e->msi.data = ue->u.msi.data;
+		e->lockless = true;
break;
default:
goto out;



[KVM PATCH v2 2/2] KVM: Directly inject interrupts if they support lockless operation

2009-10-22 Thread Gregory Haskins
IRQFD currently uses a deferred workqueue item to execute the injection
operation.  It was originally designed this way because kvm_set_irq()
required the caller to hold the irq_lock mutex, and the eventfd callback
is invoked from within a non-preemptible critical section.

With the advent of lockless injection support for certain GSIs, the
deferment mechanism is no longer technically needed in all cases.
Since context switching to the workqueue is a source of interrupt
latency, lets switch to a direct method whenever possible.  Fortunately
for us, the most common use of irqfd (MSI-based GSIs) readily support
lockless injection.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 virt/kvm/eventfd.c |   31 +++
 1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 30f70fd..e6cc958 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -51,20 +51,34 @@ struct _irqfd {
wait_queue_t  wait;
 	struct work_struct        inject;
 	struct work_struct        shutdown;
+   void (*execute)(struct _irqfd *);
 };
 
 static struct workqueue_struct *irqfd_cleanup_wq;
 
 static void
-irqfd_inject(struct work_struct *work)
+irqfd_inject(struct _irqfd *irqfd)
 {
-   struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 	struct kvm *kvm = irqfd->kvm;
 
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
 }
 
+static void
+irqfd_deferred_inject(struct work_struct *work)
+{
+   struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+
+   irqfd_inject(irqfd);
+}
+
+static void
+irqfd_schedule(struct _irqfd *irqfd)
+{
+	schedule_work(&irqfd->inject);
+}
+
 /*
  * Race-free decouple logic (ordering is critical)
  */
@@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, 
void *key)
 
 	if (flags & POLLIN)
 		/* An event has been signaled, inject an interrupt */
-		schedule_work(&irqfd->inject);
+		irqfd->execute(irqfd);
 
 	if (flags & POLLHUP) {
/* The eventfd is closing, detach from KVM */
@@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	irqfd->kvm = kvm;
 	irqfd->gsi = gsi;
 	INIT_LIST_HEAD(&irqfd->list);
-	INIT_WORK(&irqfd->inject, irqfd_inject);
+	INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
 	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
 
file = eventfd_fget(fd);
@@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
 	spin_unlock_irq(&kvm->irqfds.lock);
 
+	ret = kvm_irq_check_lockless(kvm, gsi);
+	if (ret < 0)
+		goto fail;
+
+	if (ret)
+		irqfd->execute = irqfd_inject;
+	else
+		irqfd->execute = irqfd_schedule;
+
/*
 * Check if there was an event already pending on the eventfd
 * before we registered, and trigger it as if we didn't miss it.



Re: [KVM PATCH v2 1/2] KVM: export lockless GSI attribute

2009-10-22 Thread Gregory Haskins
Gregory Haskins wrote:
 Certain GSIs support lockless injection, but we have no way to detect
 which ones at the GSI level.  Knowledge of this attribute will be
 useful later in the series so that we can optimize irqfd injection
 paths for cases where we know the code will not sleep.  Therefore,
 we provide an API to query a specific GSI.
 
 Signed-off-by: Gregory Haskins ghask...@novell.com
 ---
 
  include/linux/kvm_host.h |2 ++
  virt/kvm/irq_comm.c  |   19 +++
  2 files changed, 21 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index bd5a616..93393a4 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -119,6 +119,7 @@ struct kvm_memory_slot {
  struct kvm_kernel_irq_routing_entry {
   u32 gsi;
   u32 type;
 + bool lockless;
   int (*set)(struct kvm_kernel_irq_routing_entry *e,
  struct kvm *kvm, int irq_source_id, int level);
   union {
 @@ -417,6 +418,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic 
 *ioapic,
  unsigned long *deliver_bitmask);
  #endif
  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
 +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq);
  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
  void kvm_register_irq_ack_notifier(struct kvm *kvm,
  struct kvm_irq_ack_notifier *kian);
 diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
 index 00c68d2..04f0134 100644
 --- a/virt/kvm/irq_comm.c
 +++ b/virt/kvm/irq_comm.c
 @@ -174,6 +174,23 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 
 irq, int level)
   return ret;
  }
  
 +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq)
 +{
 + struct kvm_kernel_irq_routing_entry *e;
 + struct kvm_irq_routing_table *irq_rt;
 + struct hlist_node *n;
 + int ret = -ENOENT;
 +
 + rcu_read_lock();
 + irq_rt = rcu_dereference(kvm->irq_routing);
 + if (irq < irq_rt->nr_rt_entries)
 + 	hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
 + 		ret = e->lockless ? 1 : 0;

Sigh...  just noticed this as it hits the list.  I should probably break
out of the loop here (see the sketch after the quoted patch below).

 + rcu_read_unlock();
 +
 + return ret;
 +}
 +
  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
  {
   struct kvm_irq_ack_notifier *kian;
 @@ -314,6 +331,7 @@ static int setup_routing_entry(struct 
 kvm_irq_routing_table *rt,
  
  	e->gsi = ue->gsi;
  	e->type = ue->type;
 + 	e->lockless = false;
  	switch (ue->type) {
   case KVM_IRQ_ROUTING_IRQCHIP:
   delta = 0;
 @@ -342,6 +360,7 @@ static int setup_routing_entry(struct 
 kvm_irq_routing_table *rt,
  	e->msi.address_lo = ue->u.msi.address_lo;
  	e->msi.address_hi = ue->u.msi.address_hi;
  	e->msi.data = ue->u.msi.data;
 + 	e->lockless = true;
   break;
   default:
   goto out;
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/






[KVM PATCH 0/2] irqfd enhancements

2009-10-21 Thread Gregory Haskins
(Applies to kvm.git/master:11b06403)

The following patches are cleanups/enhancements for IRQFD now that
we have lockless interrupt injection.  For more details, please see
the patch headers.

These patches pass checkpatch, and are fully tested.  Please consider
for merging.  They are an enhancement only, so there is no urgency
to push to mainline until a suitable merge window presents itself.

Kind Regards,
-Greg

---

Gregory Haskins (2):
  KVM: Remove unnecessary irqfd-cleanup-wq
  KVM: Directly inject interrupts via irqfd


 virt/kvm/eventfd.c |   45 ++---
 1 files changed, 6 insertions(+), 39 deletions(-)

-- 
Signature


[KVM PATCH 1/2] KVM: Directly inject interrupts via irqfd

2009-10-21 Thread Gregory Haskins
IRQFD currently uses a deferred workqueue item to execute the injection
operation.  It was originally designed this way because kvm_set_irq()
required the caller to hold the irq_lock mutex, and the eventfd callback
is invoked from within a non-preemptible critical section.

With the advent of lockless injection support in kvm_set_irq, the deferment
mechanism is no longer technically needed. Since context switching to the
workqueue is a source of interrupt latency, lets switch to a direct
method.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 virt/kvm/eventfd.c |   15 +++
 1 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 30f70fd..1a529d4 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -49,16 +49,14 @@ struct _irqfd {
 	poll_table            pt;
 	wait_queue_head_t    *wqh;
wait_queue_t  wait;
-   struct work_structinject;
struct work_structshutdown;
 };
 
 static struct workqueue_struct *irqfd_cleanup_wq;
 
 static void
-irqfd_inject(struct work_struct *work)
+irqfd_inject(struct _irqfd *irqfd)
 {
-   struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 	struct kvm *kvm = irqfd->kvm;
 
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
@@ -80,12 +78,6 @@ irqfd_shutdown(struct work_struct *work)
 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
 
/*
-* We know no new events will be scheduled at this point, so block
-* until all previously outstanding events have completed
-*/
-	flush_work(&irqfd->inject);
-
-   /*
 * It is now safe to release the object's resources
 */
 	eventfd_ctx_put(irqfd->eventfd);
@@ -126,7 +118,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, 
void *key)
 
 	if (flags & POLLIN)
 		/* An event has been signaled, inject an interrupt */
-		schedule_work(&irqfd->inject);
+		irqfd_inject(irqfd);
 
 	if (flags & POLLHUP) {
/* The eventfd is closing, detach from KVM */
@@ -179,7 +171,6 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	irqfd->kvm = kvm;
 	irqfd->gsi = gsi;
 	INIT_LIST_HEAD(&irqfd->list);
-	INIT_WORK(&irqfd->inject, irqfd_inject);
 	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
 
file = eventfd_fget(fd);
@@ -214,7 +205,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 * before we registered, and trigger it as if we didn't miss it.
 */
 	if (events & POLLIN)
-		schedule_work(&irqfd->inject);
+		irqfd_inject(irqfd);
 
/*
 * do not drop the file until the irqfd is fully initialized, otherwise



[KVM PATCH 2/2] KVM: Remove unnecessary irqfd-cleanup-wq

2009-10-21 Thread Gregory Haskins
We originally created the irqfd-cleanup-wq so that we could safely
implement a shutdown that blocked on outstanding injection-requests
that may have been in flight.  We couldn't reuse something like kevent
to shutdown since the injection path was also using kevent and that may
have caused a deadlock if we tried.

Since the injection path is now no longer utilizing a work-item, it is
no longer necessary to maintain a separate cleanup WQ.  The standard
kevent queues should be sufficient, and thus we can eliminate an extra
kthread from the system.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 virt/kvm/eventfd.c |   30 +++---
 1 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 1a529d4..fb698f4 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -52,8 +52,6 @@ struct _irqfd {
 	struct work_struct        shutdown;
 };
 
-static struct workqueue_struct *irqfd_cleanup_wq;
-
 static void
 irqfd_inject(struct _irqfd *irqfd)
 {
@@ -104,7 +102,7 @@ irqfd_deactivate(struct _irqfd *irqfd)
 
 	list_del_init(&irqfd->list);
 
-	queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
+	schedule_work(&irqfd->shutdown);
 }
 
 /*
@@ -262,7 +260,7 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
 * so that we guarantee there will not be any more interrupts on this
 * gsi once this deassign function returns.
 */
-   flush_workqueue(irqfd_cleanup_wq);
+   flush_scheduled_work();
 
return 0;
 }
@@ -296,33 +294,11 @@ kvm_irqfd_release(struct kvm *kvm)
 * Block until we know all outstanding shutdown jobs have completed
 * since we do not take a kvm* reference.
 */
-   flush_workqueue(irqfd_cleanup_wq);
+   flush_scheduled_work();
 
 }
 
 /*
- * create a host-wide workqueue for issuing deferred shutdown requests
- * aggregated from all vm* instances. We need our own isolated single-thread
- * queue to prevent deadlock against flushing the normal work-queue.
- */
-static int __init irqfd_module_init(void)
-{
-	irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
-   if (!irqfd_cleanup_wq)
-   return -ENOMEM;
-
-   return 0;
-}
-
-static void __exit irqfd_module_exit(void)
-{
-   destroy_workqueue(irqfd_cleanup_wq);
-}
-
-module_init(irqfd_module_init);
-module_exit(irqfd_module_exit);
-
-/*
  * 
  * ioeventfd: translate a PIO/MMIO memory write to an eventfd signal.
  *



Re: [KVM PATCH 1/2] KVM: Directly inject interrupts via irqfd

2009-10-21 Thread Gregory Haskins
Gleb Natapov wrote:
 On Wed, Oct 21, 2009 at 10:34:53AM -0400, Gregory Haskins wrote:
 IRQFD currently uses a deferred workqueue item to execute the injection
 operation.  It was originally designed this way because kvm_set_irq()
 required the caller to hold the irq_lock mutex, and the eventfd callback
 is invoked from within a non-preemptible critical section.

 With the advent of lockless injection support in kvm_set_irq, the deferment
 mechanism is no longer technically needed. Since context switching to the
 workqueue is a source of interrupt latency, lets switch to a direct
 method.

 kvm_set_irq is fully lockless only in MSI case. IOAPIC/PIC has mutexes.

Right, but irqfd by design only works with MSI (or MSI like edge
triggers) anyway.  Legacy-type injections follow a different path.

In any case, I didn't change the locking (you did ;).  You recently
patched the irqfd code to remove the irq_lock, but we still had the
deferment mechanism in place to avoid the mutex_lock from within the
POLLIN callback.  Since the mutex_lock is now no longer acquired in this
path, the deferment technique is not needed either.  Its only adding
overhead for no purpose.  So I am simply cleaning that up to improve
interrupt performance.

HTH,
-Greg






Re: [KVM PATCH 1/2] KVM: Directly inject interrupts via irqfd

2009-10-21 Thread Gregory Haskins
Gleb Natapov wrote:
 On Wed, Oct 21, 2009 at 11:34:51AM -0400, Gregory Haskins wrote:
 Gleb Natapov wrote:
 On Wed, Oct 21, 2009 at 10:34:53AM -0400, Gregory Haskins wrote:
 IRQFD currently uses a deferred workqueue item to execute the injection
 operation.  It was originally designed this way because kvm_set_irq()
 required the caller to hold the irq_lock mutex, and the eventfd callback
 is invoked from within a non-preemptible critical section.

 With the advent of lockless injection support in kvm_set_irq, the deferment
 mechanism is no longer technically needed. Since context switching to the
 workqueue is a source of interrupt latency, lets switch to a direct
 method.

 kvm_set_irq is fully lockless only in MSI case. IOAPIC/PIC has mutexes.
 Right, but irqfd by design only works with MSI (or MSI like edge
 triggers) anyway.  Legacy-type injections follow a different path.

 Ah, If this the case and it will stay that way then the change looks OK
 to me.


I believe Avi, Michael, et. al. were in agreement with me on that design
choice.  I believe the reason is that there is no good way to do EOI/ACK
feedback within the constraints of an eventfd pipe which would be
required for the legacy pin-type interrupts.  Therefore, we won't even
bother trying.  High-performance subsystems will use irqfd/msi, and
legacy emulation can use the existing injection code (which includes the
necessary feedback for ack/eoi).

To that point, perhaps it should be better documented.  If we need a v2,
I will add a comment.

Kind Regards,
-Greg





Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-07 Thread Gregory Haskins
Avi Kivity wrote:
 On 10/06/2009 09:40 PM, Gregory Haskins wrote:
 Thinking about this some more over lunch, I think we (Avi and I) might
 both be wrong (and David is right).  Avi is right that we don't need
 rmb() or barrier() for the reasons already stated, but I think David is
 right that we need an smp_mb() to ensure the cpu doesn't do the
 reordering.  Otherwise a different cpu could invalidate the memory if it
 reuses the freed memory in the meantime, iiuc.  IOW: its not a compiler
 issue but a cpu issue.

 Or am I still confused?


 
 The sequence of operations is:
 
 v = p->v;
 f();
 // rmb() ?
 g(v);
 
 You are worried that the compiler

No

 or cpu will fetch p->v after f() has executed?

Yes.

 The compiler may not, since it can't tell whether f() might
 change p->v.

Right, you were correct to say my barrier() suggestion was wrong.

 If f() can cause another agent to write to p (by freeing
 it to a global list, for example), then it is its responsibility to
 issue the smp_rmb(), otherwise no calculation that took place before f()
 and accessed p is safe.
 

IOW: David is right.  You need a cpu-barrier one way or the other.  We
can either allow ->release() to imply one (and probably document it that
way, like we did for slow-work), or we can be explicit.  I chose to be
explicit since it is kind of self-documenting, and there is no need to
be worried about performance since the release is slow-path.

OTOH: If you feel strongly about it, we can take it out, knowing that
most anything that properly invalidates the memory will likely include an
implicit barrier of some kind.
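
[For concreteness, an editorial sketch of the release path under discussion,
with the explicit-barrier option spelled out; whether the barrier is really
needed, or is already implied by the indirect call and the eventual kfree(),
is exactly the open question in this thread.]

static inline void
_kvm_xinterface_release(struct kref *kref)
{
	struct kvm_xinterface *intf;
	struct module *owner;

	intf = container_of(kref, struct kvm_xinterface, kref);

	owner = intf->owner;
	smp_mb();		/* order the read of intf->owner before
				 * ->release() can free and recycle *intf */
	intf->ops->release(intf);
	module_put(owner);
}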

Kind Regards,
-Greg






Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-06 Thread Gregory Haskins
Avi Kivity wrote:
 On 10/06/2009 01:57 AM, Gregory Haskins wrote:
 Avi Kivity wrote:
   
 On 10/02/2009 10:19 PM, Gregory Haskins wrote:
 
 What: xinterface is a mechanism that allows kernel modules external to
 the kvm.ko proper to interface with a running guest.  It accomplishes
 this by creating an abstracted interface which does not expose any
 private details of the guest or its related KVM structures, and
 provides
 a mechanism to find and bind to this interface at run-time.


 If this is needed, it should be done as a virt_address_space to which
 kvm and other modules bind, instead of as something that kvm exports and
 other modules import.  The virt_address_space can be identified by an fd
 and passed around to kvm and other modules.
  
 IIUC, what you are proposing is something similar to generalizing the
 vbus::memctx object.  I had considered doing something like that in the
 early design phase of vbus, but then decided it would be a hard-sell to
 the mm crowd, and difficult to generalize.

 What do you propose as the interface to program the object?

 
 Something like the current kvm interfaces, de-warted.  It will be a hard
 sell indeed, for good reasons.

I am not convinced generalizing this at this point is the best step
forward, since we only have a KVM client.  Let me put some more thought
into it.


 
 So, under my suggestion above, you'd call
 sys_create_virt_address_space(), populate it, and pass the result to kvm
 and to foo.  This allows the use of virt_address_space without kvm and
 doesn't require foo to interact with kvm.
  
 The problem I see here is that the only way I can think to implement
 this generally is something that looks very kvm-esque (slots-to-pages
 kind of translation).  Is there a way you can think of that does not
 involve a kvm.ko originated vtable that is also not kvm centric?

 
 slots would be one implementation, if you can think of others then you'd
 add them.

I'm more interested in *how* you'd add them more than if we would add
them.  What I am getting at are the logistics of such a beast.

For instance, would I have /dev/slots-vas with ioctls for adding slots,
and /dev/foo-vas for adding foos?  And each one would instantiate a
different vas_struct object with its own vas_struct-ops?  Or were you
thinking of something different.

 
 If you can't, I think it indicates that the whole thing isn't necessary
 and we're better off with slots and virtual memory.

I'm not sure if we are talking about the same thing yet, but if we are,
there are uses of a generalized interface outside of slots/virtual
memory (Ira's physical box being a good example).

In any case, I think the best approach is what I already proposed.
KVM's arrangement of memory is going to tend to be KVM specific, and
what better place to implement the interface than close to the kvm.ko core.

 The only thing missing is dma, which you don't deal with anyway.
 

Afaict I do support dma in the generalized vbus::memctx, though I do not
use it on anything related to KVM or xinterface.  Can you elaborate on
the problem here?  Does the SG interface in 4/4 help get us closer to
what you envision?

 +struct kvm_xinterface_ops {
 +unsigned long (*copy_to)(struct kvm_xinterface *intf,
 + unsigned long gpa, const void *src,
 + unsigned long len);
 +unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst,
 +   unsigned long gpa, unsigned long len);
 +struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
 +  unsigned long gpa,
 +  unsigned long len);


 How would vmap() work with live migration?
  
 vmap represents shmem regions, and is a per-guest-instance resource.  So
 my plan there is that the new and old guest instance would each have the
 vmap region instated at the same GPA location (assumption: gpas are
 stable across migration), and any state relevant data local to the shmem
 (like ring head/tail position) is conveyed in the serialized stream for
 the device model.

 
 You'd have to copy the entire range since you don't know what the guest
 might put there.  I guess it's acceptable for small areas.

?  The vmap is presumably part of an ABI between guest and host, so the
host should always know what structure is present within the region, and
what is relevant from within that structure to migrate once that state
is frozen.

These regions (for vbus, anyway) equate to things like virtqueue
metadata, and presumably the same problem exists for virtio-net in
userspace as it does here, since that is another form of a vmap.  So
whatever solution works for virtio-net migrating its virtqueues in
userspace should be close to what will work here.  The primary
difference is the location of the serializer.

 
 +
 +static inline void
 +_kvm_xinterface_release(struct kref *kref)
 +{
 +struct kvm_xinterface *intf;
 +struct module *owner;
 +
 +intf = container_of

Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-06 Thread Gregory Haskins
Gregory Haskins wrote:
 Avi Kivity wrote:
 On 10/06/2009 01:57 AM, Gregory Haskins wrote:
 Avi Kivity wrote:
   
 On 10/02/2009 10:19 PM, Gregory Haskins wrote:

 +
 +static inline void
 +_kvm_xinterface_release(struct kref *kref)
 +{
 +struct kvm_xinterface *intf;
 +struct module *owner;
 +
 +intf = container_of(kref, struct kvm_xinterface, kref);
 +
 +	owner = intf->owner;
 +rmb();


 Why rmb?
  
 the intf->ops->release() line may invalidate the intf pointer, so we
 want to ensure that the read completes before the release() is called.

 TBH: I'm not 100% its needed, but I was being conservative.

 rmb()s are only needed if an external agent can issue writes, otherwise
 you'd need one after every statement.
 
 I was following lessons learned here:
 
 http://lkml.org/lkml/2009/7/7/175
 
 Perhaps mb() or barrier() are more appropriate than rmb()?  I'm CC'ing
 David Howells in case he has more insight.

BTW: In case it is not clear, the rationale as I understand it is we
worry about the case where one cpu reorders the read to be after the
 ->release(), and another cpu might grab the memory that was kfree()'d
 within the ->release() and scribble something else on it before the read
completes.

I know rmb() typically needs to be paired with wmb() to be correct, so
you are probably right to say that the rmb() itself is not appropriate.
 This problem in general makes my head hurt, which is why I said I am
not 100% sure of what is required.  As David mentions, perhaps
smp_mb() is more appropriate for this application.  I also speculate
barrier() may be all that we need.

Kind Regards,
-Greg





Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-06 Thread Gregory Haskins
Avi Kivity wrote:
 On 10/06/2009 03:31 PM, Gregory Haskins wrote:

 slots would be one implementation, if you can think of others then you'd
 add them.
  
 I'm more interested in *how* you'd add them more than if we would add
 them.  What I am getting at are the logistics of such a beast.

 
 Add alternative ioctls, or have one ioctl with a 'type' field.
 
 For instance, would I have /dev/slots-vas with ioctls for adding slots,
 and /dev/foo-vas for adding foos?  And each one would instantiate a
 different vas_struct object with its own vas_struct-ops?  Or were you
 thinking of something different.

 
 I think a single /dev/foo is sufficient, unless some of those address
 spaces are backed by real devices.
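
[Purely illustrative sketch of the "one ioctl with a 'type' field" idea being
discussed; none of these names exist in KVM or in the posted series.]

struct vas_region {
	__u32 type;		/* e.g. VAS_TYPE_SLOT, VAS_TYPE_DEV */
	__u32 flags;
	__u64 guest_addr;	/* where the region appears in the vas */
	__u64 size;
	union {
		struct {
			__u64 userspace_addr;	/* slot-backed memory */
		} slot;
		struct {
			__u32 fd;		/* device/foo-backed memory */
		} dev;
	};
};

#define VAS_ADD_REGION	_IOW('V', 0x00, struct vas_region)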
 
 If you can't, I think it indicates that the whole thing isn't necessary
 and we're better off with slots and virtual memory.
  
 I'm not sure if we are talking about the same thing yet, but if we are,
 there are uses of a generalized interface outside of slots/virtual
 memory (Ira's physical box being a good example).

 
 I'm not sure Ira's case is not best supported by virtual memory.

Perhaps, but there are surely some cases where the memory is not
pageable, but is accessible indirectly through some DMA controller.  So
if we require it to be pageable we will limit the utility of the
interface, though admittedly it will probably cover most cases.

 
 In any case, I think the best approach is what I already proposed.
 KVM's arrangement of memory is going to tend to be KVM specific, and
 what better place to implement the interface than close to the kvm.ko
 core.

   
 The only thing missing is dma, which you don't deal with anyway.

  
 Afaict I do support dma in the generalized vbus::memctx, though I do not
 use it on anything related to KVM or xinterface.  Can you elaborate on
 the problem here?  Does the SG interface in 4/4 help get us closer to
 what you envision?

 
 There's no dma method in there.  Mapping to physical addresses is
 equivalent to get_user_pages(), so it doesn't add anything over virtual
 memory slots.

I am not following you at all.  What kind of interface do we need for
DMA?  Since the KVM guest is pageable, the support for DMA would come
from building a scatterlist (patch 4/4) and passing the resulting pages
out using standard sg mechanisms, like sg_dma_address().  This is what I
do today to support zero-copy in AlacrityVM.

What more do we need?
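
[A minimal editorial sketch of the pattern Greg describes, assuming the
caller already holds an array of pinned guest pages; the function name and
device argument are placeholders, and the real series does this through its
xinterface/vbus code rather than this exact helper.]

#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/* Map a pinned page array for DMA and report the device-visible segments. */
static int sketch_map_guest_pages(struct device *dev, struct page **pages,
				  int npages, size_t len,
				  struct scatterlist *sgl)
{
	struct scatterlist *sg;
	int i, nents;

	sg_init_table(sgl, npages);
	for (i = 0; i < npages; i++) {
		size_t chunk = min_t(size_t, len, PAGE_SIZE);

		sg_set_page(&sgl[i], pages[i], chunk, 0);
		len -= chunk;
	}

	nents = dma_map_sg(dev, sgl, npages, DMA_BIDIRECTIONAL);
	if (!nents)
		return -ENOMEM;

	for_each_sg(sgl, sg, nents, i)
		pr_debug("seg %d: 0x%llx + %u\n", i,
			 (unsigned long long)sg_dma_address(sg),
			 sg_dma_len(sg));

	return nents;
}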

 
 You'd have to copy the entire range since you don't know what the guest
 might put there.  I guess it's acceptable for small areas.
  
 ?  The vmap is presumably part of an ABI between guest and host, so the
 host should always know what structure is present within the region, and
 what is relevant from within that structure to migrate once that state
 is frozen.

 
 I was thinking of the case where the page is shared with some other
 (guest-private) data.  But dirtying that data would be tracked by kvm,
 so it isn't a problem.

Ok.

 
 These regions (for vbus, anyway) equate to things like virtqueue
 metadata, and presumably the same problem exists for virtio-net in
 userspace as it does here, since that is another form of a vmap.  So
 whatever solution works for virtio-net migrating its virtqueues in
 userspace should be close to what will work here.  The primary
 difference is the location of the serializer.

 
 Actually, virtio doesn't serialize this data, instead it marks the page
 dirty and lets normal RAM migration move it.

Ok, so it's effectively serializing the entire region indirectly.  That
works too.

 
 
 rmb()s are only needed if an external agent can issue writes, otherwise
 you'd need one after every statement.
  
 I was following lessons learned here:

 http://lkml.org/lkml/2009/7/7/175

 Perhaps mb() or barrier() are more appropriate than rmb()?  I'm CC'ing
 David Howells in case he has more insight.

 
 I'm not sure I understand the reference.  Callers and callees don't need
 memory barriers since they're guaranteed program order.  You only need
 memory barriers when you have an external agent (a device, or another
 cpu).  What external agent can touch things during -release()?

We already have a different subthread started on this, so I'll pick this
up there.

 
 A simple per-vcpu cache (in struct kvm_vcpu) is likely to give better
 results.

  
 per-vcpu will not work well here, unfortunately, since this is an
 external interface mechanism.  The callers will generally be from a
 kthread or some other non-vcpu related context.  Even if we could
 figure
 out a vcpu to use as a basis, we would require some kind of
 heavier-weight synchronization which would not be as desirable.

 Therefore, I opted to go per-cpu and use the presumably lighterweight
 get_cpu/put_cpu() instead.


 This just assumes a low context switch rate.
  
 It primarily assumes a low _migration_ rate, since you do not typically
 have two contexts on the same cpu pounding

Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-06 Thread Gregory Haskins
Avi Kivity wrote:
 On 10/06/2009 04:22 PM, Gregory Haskins wrote:
 +
 +static inline void
 +_kvm_xinterface_release(struct kref *kref)
 +{
 +struct kvm_xinterface *intf;
 +struct module *owner;
 +
 +intf = container_of(kref, struct kvm_xinterface, kref);
 +
 +	owner = intf->owner;
 +rmb();


  
 Why rmb?


 the intf->ops->release() line may invalidate the intf pointer, so we
 want to ensure that the read completes before the release() is called.

 TBH: I'm not 100% its needed, but I was being conservative.

  
 rmb()s are only needed if an external agent can issue writes, otherwise
 you'd need one after every statement.

 I was following lessons learned here:

 http://lkml.org/lkml/2009/7/7/175

 Perhaps mb() or barrier() are more appropriate than rmb()?  I'm CC'ing
 David Howells in case he has more insight.
  
 BTW: In case it is not clear, the rationale as I understand it is we
 worry about the case where one cpu reorders the read to be after the
 ->release(), and another cpu might grab the memory that was kfree()'d
 within the ->release() and scribble something else on it before the read
 completes.

 I know rmb() typically needs to be paired with wmb() to be correct, so
 you are probably right to say that the rmb() itself is not appropriate.
   This problem in general makes my head hurt, which is why I said I am
 not 100% sure of what is required.  As David mentions, perhaps
 smp_mb() is more appropriate for this application.  I also speculate
 barrier() may be all that we need.

 
 barrier() is the operation for a compiler barrier.  But it's unneeded
 here - unless the compiler can prove that ->release(intf) will not
 modify intf->owner it is not allowed to move the access afterwards.  An
 indirect function call is generally a barrier() since the compiler can't
 assume memory has not been modified.
 

You're logic seems reasonable to me.  I will defer to David, since he
brought up the issue with the similar logic originally.

Kind Regards,
-Greg





Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-06 Thread Gregory Haskins
Gregory Haskins wrote:
 Avi Kivity wrote:
 On 10/06/2009 04:22 PM, Gregory Haskins wrote:
 +
 +static inline void
 +_kvm_xinterface_release(struct kref *kref)
 +{
 +struct kvm_xinterface *intf;
 +struct module *owner;
 +
 +intf = container_of(kref, struct kvm_xinterface, kref);
 +
 +	owner = intf->owner;
 +rmb();


  
 Why rmb?


 the intf->ops->release() line may invalidate the intf pointer, so we
 want to ensure that the read completes before the release() is called.

 TBH: I'm not 100% its needed, but I was being conservative.

  
 rmb()s are only needed if an external agent can issue writes, otherwise
 you'd need one after every statement.

 I was following lessons learned here:

 http://lkml.org/lkml/2009/7/7/175

 Perhaps mb() or barrier() are more appropriate than rmb()?  I'm CC'ing
 David Howells in case he has more insight.
  
 BTW: In case it is not clear, the rationale as I understand it is we
 worry about the case where one cpu reorders the read to be after the
 ->release(), and another cpu might grab the memory that was kfree()'d
 within the ->release() and scribble something else on it before the read
 completes.

 I know rmb() typically needs to be paired with wmb() to be correct, so
 you are probably right to say that the rmb() itself is not appropriate.
   This problem in general makes my head hurt, which is why I said I am
 not 100% sure of what is required.  As David mentions, perhaps
 smp_mb() is more appropriate for this application.  I also speculate
 barrier() may be all that we need.

 barrier() is the operation for a compiler barrier.  But it's unneeded
 here - unless the compiler can prove that ->release(intf) will not
 modify intf->owner it is not allowed to move the access afterwards.  An
 indirect function call is generally a barrier() since the compiler can't
 assume memory has not been modified.

 
 You're logic

gak.  or your logic even.

 seems reasonable to me.  I will defer to David, since he
 brought up the issue with the similar logic originally.
 
 Kind Regards,
 -Greg
 






Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-06 Thread Gregory Haskins
Gregory Haskins wrote:
 Gregory Haskins wrote:
 Avi Kivity wrote:
 On 10/06/2009 04:22 PM, Gregory Haskins wrote:
 +
 +static inline void
 +_kvm_xinterface_release(struct kref *kref)
 +{
 +struct kvm_xinterface *intf;
 +struct module *owner;
 +
 +intf = container_of(kref, struct kvm_xinterface, kref);
 +
 +	owner = intf->owner;
 +rmb();


  
 Why rmb?


 the intf->ops->release() line may invalidate the intf pointer, so we
 want to ensure that the read completes before the release() is called.

 TBH: I'm not 100% its needed, but I was being conservative.

  
 rmb()s are only needed if an external agent can issue writes, otherwise
 you'd need one after every statement.

 I was following lessons learned here:

 http://lkml.org/lkml/2009/7/7/175

 Perhaps mb() or barrier() are more appropriate than rmb()?  I'm CC'ing
 David Howells in case he has more insight.
  
 BTW: In case it is not clear, the rationale as I understand it is we
 worry about the case where one cpu reorders the read to be after the
 ->release(), and another cpu might grab the memory that was kfree()'d
 within the ->release() and scribble something else on it before the read
 completes.

 I know rmb() typically needs to be paired with wmb() to be correct, so
 you are probably right to say that the rmb() itself is not appropriate.
   This problem in general makes my head hurt, which is why I said I am
 not 100% sure of what is required.  As David mentions, perhaps
 smp_mb() is more appropriate for this application.  I also speculate
 barrier() may be all that we need.

 barrier() is the operation for a compiler barrier.  But it's unneeded
 here - unless the compiler can prove that ->release(intf) will not
 modify intf->owner it is not allowed to move the access afterwards.  An
 indirect function call is generally a barrier() since the compiler can't
 assume memory has not been modified.

 You're logic
 
 gak.  or your logic even.
 
 seems reasonable to me.  I will defer to David, since he
 brought up the issue with the similar logic originally.

 Kind Regards,
 -Greg

 
 

Thinking about this some more over lunch, I think we (Avi and I) might
both be wrong (and David is right).  Avi is right that we don't need
rmb() or barrier() for the reasons already stated, but I think David is
right that we need an smp_mb() to ensure the cpu doesn't do the
reordering.  Otherwise a different cpu could invalidate the memory if it
reuses the freed memory in the meantime, iiuc.  IOW: it's not a compiler
issue but a cpu issue.

Or am I still confused?

-Greg





Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-05 Thread Gregory Haskins
Hi Marcelo!

Marcelo Tosatti wrote:
 On Fri, Oct 02, 2009 at 04:19:27PM -0400, Gregory Haskins wrote:
 What: xinterface is a mechanism that allows kernel modules external to
 the kvm.ko proper to interface with a running guest.  It accomplishes
 this by creating an abstracted interface which does not expose any
 private details of the guest or its related KVM structures, and provides
 a mechanism to find and bind to this interface at run-time.

 Why: There are various subsystems that would like to interact with a KVM
 guest which are ideally suited to exist outside the domain of the kvm.ko
 core logic. For instance, external pci-passthrough, virtual-bus, and
 virtio-net modules are currently under development.  In order for these
 modules to successfully interact with the guest, they need, at the very
 least, various interfaces for signaling IO events, pointer translation,
 and possibly memory mapping.

 The signaling case is covered by the recent introduction of the
 irqfd/ioeventfd mechanisms.  This patch provides a mechanism to cover the
 other cases.  Note that today we only expose pointer-translation related
 functions, but more could be added at a future date as needs arise.

 Example usage: QEMU instantiates a guest, and an external module foo
 that desires the ability to interface with the guest (say via
 open(/dev/foo)).  QEMU may then pass the kvmfd to foo via an
 ioctl, such as: ioctl(foofd, FOO_SET_VMID, kvmfd).  Upon receipt, the
 foo module can issue kvm_xinterface_bind(kvmfd) to acquire
 the proper context.  Internally, the struct kvm* and associated
 struct module* will remain pinned at least until the foo module calls
 kvm_xinterface_put().
 
 --- /dev/null
 +++ b/virt/kvm/xinterface.c
 @@ -0,0 +1,409 @@
 +/*
 + * KVM module interface - Allows external modules to interface with a guest
 + *
 + * Copyright 2009 Novell.  All Rights Reserved.
 + *
 + * Author:
 + *  Gregory Haskins ghask...@novell.com
 + *
 + * This file is free software; you can redistribute it and/or modify
 + * it under the terms of version 2 of the GNU General Public License
 + * as published by the Free Software Foundation.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 + * GNU General Public License for more details.
 + *
 + * You should have received a copy of the GNU General Public License
 + * along with this program; if not, write to the Free Software Foundation,
 + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
 + */
 +
 +#include <linux/mm.h>
 +#include <linux/vmalloc.h>
 +#include <linux/highmem.h>
 +#include <linux/module.h>
 +#include <linux/mmu_context.h>
 +#include <linux/kvm_host.h>
 +#include <linux/kvm_xinterface.h>
 +
 +struct _xinterface {
 +struct kvm *kvm;
 +struct task_struct *task;
 +struct mm_struct   *mm;
 +struct kvm_xinterface   intf;
 +struct kvm_memory_slot *slotcache[NR_CPUS];
 +};
 +
 +struct _xvmap {
 +struct kvm_memory_slot*memslot;
 +unsigned long  npages;
 +struct kvm_xvmap   vmap;
 +};
 +
 +static struct _xinterface *
 +to_intf(struct kvm_xinterface *intf)
 +{
 +return container_of(intf, struct _xinterface, intf);
 +}
 +
 +#define _gfn_to_hva(gfn, memslot) \
 +(memslot->userspace_addr + (gfn - memslot->base_gfn) * PAGE_SIZE)
 +
 +/*
 + * gpa_to_hva() - translate a guest-physical to host-virtual using
 + * a per-cpu cache of the memslot.
 + *
 + * The gfn_to_memslot() call is relatively expensive, and the gpa access
 + * patterns exhibit a high degree of locality.  Therefore, lets cache
 + * the last slot used on a per-cpu basis to optimize the lookup
 + *
 + * assumes slots_lock held for read
 + */
 +static unsigned long
 +gpa_to_hva(struct _xinterface *_intf, unsigned long gpa)
 +{
 +int cpu = get_cpu();
 +unsigned long   gfn = gpa >> PAGE_SHIFT;
 +struct kvm_memory_slot *memslot = _intf->slotcache[cpu];
 +unsigned long   addr = 0;
 +
 +if (!memslot
 +|| gfn < memslot->base_gfn
 +|| gfn >= memslot->base_gfn + memslot->npages) {
 +
 +memslot = gfn_to_memslot(_intf->kvm, gfn);
 +if (!memslot)
 +goto out;
 +
 +_intf->slotcache[cpu] = memslot;
 +}
 +
 +addr = _gfn_to_hva(gfn, memslot) + offset_in_page(gpa);
 +
 +out:
 +put_cpu();
 +
 +return addr;
 
 Please optimize gfn_to_memslot() instead, so everybody benefits. It
 shows very often on profiles.

Yeah, its not a bad idea.  The reason why I did it here is because the
requirements for sync (kvm-vcpu) vs async (xinterface) access is
slightly different.  Sync is probably optimal with per-vcpu caching,
whereas async is optimal with per-cpu.

That said, we could probably build the entire algorithm to be per-cpu as
a compromise and still gain benefits
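
For illustration, the kind of generalization being suggested might look
roughly like this (hypothetical helper, not actual kvm.git code; it assumes
slots_lock is held for read and would still need invalidation when the
memslot layout changes):

#include <linux/percpu.h>
#include <linux/kvm_host.h>

struct slot_cache {
	struct kvm *kvm;
	struct kvm_memory_slot *slot;
};
static DEFINE_PER_CPU(struct slot_cache, slot_cache);

static struct kvm_memory_slot *gfn_to_memslot_cached(struct kvm *kvm, gfn_t gfn)
{
	struct slot_cache *c = &get_cpu_var(slot_cache);
	struct kvm_memory_slot *slot = c->slot;

	if (c->kvm != kvm || !slot || gfn < slot->base_gfn ||
	    gfn >= slot->base_gfn + slot->npages) {
		slot = gfn_to_memslot(kvm, gfn);	/* slow path */
		c->kvm  = kvm;
		c->slot = slot;
	}

	put_cpu_var(slot_cache);
	return slot;
}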

Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-05 Thread Gregory Haskins
Avi Kivity wrote:
 On 10/02/2009 10:19 PM, Gregory Haskins wrote:
 What: xinterface is a mechanism that allows kernel modules external to
 the kvm.ko proper to interface with a running guest.  It accomplishes
 this by creating an abstracted interface which does not expose any
 private details of the guest or its related KVM structures, and provides
 a mechanism to find and bind to this interface at run-time.

 
 If this is needed, it should be done as a virt_address_space to which
 kvm and other modules bind, instead of as something that kvm exports and
 other modules import.  The virt_address_space can be identified by an fd
 and passed around to kvm and other modules.

IIUC, what you are proposing is something similar to generalizing the
vbus::memctx object.  I had considered doing something like that in the
early design phase of vbus, but then decided it would be a hard-sell to
the mm crowd, and difficult to generalize.

What do you propose as the interface to program the object?

 
 Why: There are various subsystems that would like to interact with a KVM
 guest which are ideally suited to exist outside the domain of the kvm.ko
 core logic. For instance, external pci-passthrough, virtual-bus, and
 virtio-net modules are currently under development.  In order for these
 modules to successfully interact with the guest, they need, at the very
 least, various interfaces for signaling IO events, pointer translation,
 and possibly memory mapping.

 The signaling case is covered by the recent introduction of the
 irqfd/ioeventfd mechanisms.  This patch provides a mechanism to cover the
 other cases.  Note that today we only expose pointer-translation related
 functions, but more could be added at a future date as needs arise.

 Example usage: QEMU instantiates a guest, and an external module foo
 that desires the ability to interface with the guest (say via
 open(/dev/foo)).  QEMU may then pass the kvmfd to foo via an
 ioctl, such as: ioctl(foofd, FOO_SET_VMID,kvmfd).  Upon receipt, the
 foo module can issue kvm_xinterface_bind(kvmfd) to acquire
 the proper context.  Internally, the struct kvm* and associated
 struct module* will remain pinned at least until the foo module calls
 kvm_xinterface_put().


 
 So, under my suggestion above, you'd call
 sys_create_virt_address_space(), populate it, and pass the result to kvm
 and to foo.  This allows the use of virt_address_space without kvm and
 doesn't require foo to interact with kvm.

The problem I see here is that the only way I can think to implement
this generally is something that looks very kvm-esque (slots-to-pages
kind of translation).  Is there a way you can think of that does not
involve a kvm.ko originated vtable that is also not kvm centric?

 
 +struct kvm_xinterface_ops {
 +unsigned long (*copy_to)(struct kvm_xinterface *intf,
 + unsigned long gpa, const void *src,
 + unsigned long len);
 +unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst,
 +   unsigned long gpa, unsigned long len);
 +struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
 +  unsigned long gpa,
 +  unsigned long len);

 
 How would vmap() work with live migration?

vmap represents shmem regions, and is a per-guest-instance resource.  So
my plan there is that the new and old guest instance would each have the
vmap region instated at the same GPA location (assumption: gpas are
stable across migration), and any state relevant data local to the shmem
(like ring head/tail position) is conveyed in the serialized stream for
the device model.

 
 +
 +static inline void
 +_kvm_xinterface_release(struct kref *kref)
 +{
 +struct kvm_xinterface *intf;
 +struct module *owner;
 +
 +intf = container_of(kref, struct kvm_xinterface, kref);
 +
 +owner = intf->owner;
 +rmb();

 
 Why rmb?

the intf->ops->release() line may invalidate the intf pointer, so we
want to ensure that the read completes before the release() is called.

TBH: I'm not 100% its needed, but I was being conservative.

 
 +
 +intf->ops->release(intf);
 +module_put(owner);
 +}
 +

 +
 +/*
 + * gpa_to_hva() - translate a guest-physical to host-virtual using
 + * a per-cpu cache of the memslot.
 + *
 + * The gfn_to_memslot() call is relatively expensive, and the gpa access
 + * patterns exhibit a high degree of locality.  Therefore, lets cache
 + * the last slot used on a per-cpu basis to optimize the lookup
 + *
 + * assumes slots_lock held for read
 + */
 +static unsigned long
 +gpa_to_hva(struct _xinterface *_intf, unsigned long gpa)
 +{
 +int cpu = get_cpu();
 +unsigned long   gfn = gpa >> PAGE_SHIFT;
 +struct kvm_memory_slot *memslot = _intf->slotcache[cpu];
 +unsigned long   addr = 0;
 +
 +if (!memslot
 +|| gfn < memslot->base_gfn
 +|| gfn >= memslot->base_gfn + memslot->npages) {
 +
 +memslot

Re: [PATCH v2 4/4] KVM: add scatterlist support to xinterface

2009-10-05 Thread Gregory Haskins
Avi Kivity wrote:
 On 10/02/2009 10:19 PM, Gregory Haskins wrote:
 This allows a scatter-gather approach to IO, which will be useful for
 building high performance interfaces, like zero-copy and low-latency
 copy (avoiding multiple calls to copy_to/from).

 The interface is based on the existing scatterlist infrastructure.  The
 caller is expected to pass in a scatterlist with its dma field
 populated with valid GPAs.  The xinterface will then populate each
 entry by translating the GPA to a page*.

 The caller signifies completion by simply performing a put_page() on
 each page returned in the list.

 Signed-off-by: Gregory Haskinsghask...@novell.com
 ---

   include/linux/kvm_xinterface.h |4 ++
   virt/kvm/xinterface.c  |   72
 
   2 files changed, 76 insertions(+), 0 deletions(-)

 diff --git a/include/linux/kvm_xinterface.h
 b/include/linux/kvm_xinterface.h
 index 684b6f8..eefb575 100644
 --- a/include/linux/kvm_xinterface.h
 +++ b/include/linux/kvm_xinterface.h
 @@ -9,6 +9,7 @@
   #include <linux/kref.h>
   #include <linux/module.h>
   #include <linux/file.h>
 +#include <linux/scatterlist.h>

   struct kvm_xinterface;
   struct kvm_xvmap;
 @@ -36,6 +37,9 @@ struct kvm_xinterface_ops {
   u64 addr,
   unsigned long len,
   unsigned long flags);
 +unsigned long (*sgmap)(struct kvm_xinterface *intf,
 +   struct scatterlist *sgl, int nents,
 +   unsigned long flags);
   void (*release)(struct kvm_xinterface *);
   };

 diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
 index c356835..16729f6 100644
 --- a/virt/kvm/xinterface.c
 +++ b/virt/kvm/xinterface.c
 @@ -467,6 +467,77 @@ fail:

   }

 +static unsigned long
 +xinterface_sgmap(struct kvm_xinterface *intf,
 + struct scatterlist *sgl, int nents,
 + unsigned long flags)
 +{
 +struct _xinterface *_intf   = to_intf(intf);
 +struct task_struct *p   = _intf->task;
 +struct mm_struct   *mm  = _intf->mm;
 +struct kvm *kvm = _intf->kvm;
 +struct kvm_memory_slot *memslot = NULL;
 +bool kthread = !current->mm;
 +int ret;
 +struct scatterlist *sg;
 +int i;
 +
 +down_read(&kvm->slots_lock);
 +
 +if (kthread)
 +use_mm(_intf->mm);
 +
 +for_each_sg(sgl, sg, nents, i) {
 +unsigned long   gpa = sg_dma_address(sg);
 +unsigned long   len = sg_dma_len(sg);
 +unsigned long   gfn = gpa >> PAGE_SHIFT;
 +off_t   offset = offset_in_page(gpa);
 +unsigned long   hva;
 +struct page *pg;
 +
 +/* ensure that we do not have more than one page per entry */
 +if ((PAGE_ALIGN(len + offset) >> PAGE_SHIFT) != 1) {
 +ret = -EINVAL;
 +break;
 +}
 +
 +/* check for a memslot-cache miss */
 +if (!memslot
 +|| gfn < memslot->base_gfn
 +|| gfn >= memslot->base_gfn + memslot->npages) {
 +memslot = gfn_to_memslot(kvm, gfn);
 +if (!memslot) {
 +ret = -EFAULT;
 +break;
 +}
 +}
 +
 +hva = (memslot->userspace_addr +
 +   (gfn - memslot->base_gfn) * PAGE_SIZE);
 +
 +if (kthread || current->mm == mm)
 +ret = get_user_pages_fast(hva, 1, 1, &pg);
 +else
 +ret = get_user_pages(p, mm, hva, 1, 1, 0, &pg, NULL);

 
 One of these needs the mm semaphore.

Indeed.  Good catch.

 
 +
 +if (ret != 1) {
 +if (ret <= 0)
 +ret = -EFAULT;
 +break;
 +}
 +
 +sg_set_page(sg, pg, len, offset);
 +ret = 0;
 +}
 +
 +if (kthread)
 +unuse_mm(_intf->mm);
 +
 +up_read(&kvm->slots_lock);
 +
 +return ret;
 +}
 +
   static void
   xinterface_release(struct kvm_xinterface *intf)
   {
 @@ -483,6 +554,7 @@ struct kvm_xinterface_ops _xinterface_ops = {
   .copy_from   = xinterface_copy_from,
   .vmap= xinterface_vmap,
   .ioevent = xinterface_ioevent,
 +.sgmap   = xinterface_sgmap,
   .release = xinterface_release,
   };



 
 






[PATCH v2 0/4] KVM: xinterface

2009-10-02 Thread Gregory Haskins
(Applies to kvm.git/master:083e9e10)

For details, please read the patch headers.

[ Changelog:

 v2:
*) We now re-use the vmfd as the binding token, instead of creating
   a new separate namespace
*) Added support for switch_to(mm), which is much faster
*) Added support for memslot-cache for exploiting slot locality
*) Added support for scatter-gather access
*) Added support for xioevent interface

  v1:
*) Initial release
]

This series is included in upstream AlacrityVM and is well tested and
known to work properly.  Comments?

Kind Regards,
-Greg

---

Gregory Haskins (4):
  KVM: add scatterlist support to xinterface
  KVM: add io services to xinterface
  KVM: introduce xinterface API for external interaction with guests
  mm: export use_mm() and unuse_mm() to modules


 arch/x86/kvm/Makefile  |2 
 include/linux/kvm_host.h   |3 
 include/linux/kvm_xinterface.h |  165 +++
 kernel/fork.c  |1 
 mm/mmu_context.c   |3 
 virt/kvm/kvm_main.c|   24 ++
 virt/kvm/xinterface.c  |  587 
 7 files changed, 784 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/kvm_xinterface.h
 create mode 100644 virt/kvm/xinterface.c

-- 
Signature


[PATCH v2 1/4] mm: export use_mm() and unuse_mm() to modules

2009-10-02 Thread Gregory Haskins
We want to use these functions from within KVM, which may be built as
a module.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 mm/mmu_context.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index ded9081..f31ba20 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -6,6 +6,7 @@
 #include <linux/mm.h>
 #include <linux/mmu_context.h>
 #include <linux/sched.h>
+#include <linux/module.h>
 
 #include <asm/mmu_context.h>
 
@@ -37,6 +38,7 @@ void use_mm(struct mm_struct *mm)
if (active_mm != mm)
mmdrop(active_mm);
 }
+EXPORT_SYMBOL_GPL(use_mm);
 
 /*
  * unuse_mm
@@ -56,3 +58,4 @@ void unuse_mm(struct mm_struct *mm)
enter_lazy_tlb(mm, tsk);
task_unlock(tsk);
 }
+EXPORT_SYMBOL_GPL(unuse_mm);



[PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-02 Thread Gregory Haskins
What: xinterface is a mechanism that allows kernel modules external to
the kvm.ko proper to interface with a running guest.  It accomplishes
this by creating an abstracted interface which does not expose any
private details of the guest or its related KVM structures, and provides
a mechanism to find and bind to this interface at run-time.

Why: There are various subsystems that would like to interact with a KVM
guest which are ideally suited to exist outside the domain of the kvm.ko
core logic. For instance, external pci-passthrough, virtual-bus, and
virtio-net modules are currently under development.  In order for these
modules to successfully interact with the guest, they need, at the very
least, various interfaces for signaling IO events, pointer translation,
and possibly memory mapping.

The signaling case is covered by the recent introduction of the
irqfd/ioeventfd mechanisms.  This patch provides a mechanism to cover the
other cases.  Note that today we only expose pointer-translation related
functions, but more could be added at a future date as needs arise.

Example usage: QEMU instantiates a guest, and an external module foo
that desires the ability to interface with the guest (say via
open(/dev/foo)).  QEMU may then pass the kvmfd to foo via an
ioctl, such as: ioctl(foofd, FOO_SET_VMID, kvmfd).  Upon receipt, the
foo module can issue kvm_xinterface_bind(kvmfd) to acquire
the proper context.  Internally, the struct kvm* and associated
struct module* will remain pinned at least until the foo module calls
kvm_xinterface_put().

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 arch/x86/kvm/Makefile  |2 
 include/linux/kvm_host.h   |3 
 include/linux/kvm_xinterface.h |  114 +++
 kernel/fork.c  |1 
 virt/kvm/kvm_main.c|   24 ++
 virt/kvm/xinterface.c  |  409 
 6 files changed, 552 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/kvm_xinterface.h
 create mode 100644 virt/kvm/xinterface.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 31a7035..0449d6e 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I.
 
 kvm-y  += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
coalesced_mmio.o irq_comm.o eventfd.o \
-   assigned-dev.o)
+   assigned-dev.o xinterface.o)
 kvm-$(CONFIG_IOMMU_API)+= $(addprefix ../../../virt/kvm/, iommu.o)
 
 kvm-y  += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b985a29..7cc1afb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -362,6 +362,9 @@ void kvm_arch_sync_events(struct kvm *kvm);
 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
 
+struct kvm_xinterface *
+kvm_xinterface_alloc(struct kvm *kvm, struct module *owner);
+
 int kvm_is_mmio_pfn(pfn_t pfn);
 
 struct kvm_irq_ack_notifier {
diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h
new file mode 100644
index 000..01f092b
--- /dev/null
+++ b/include/linux/kvm_xinterface.h
@@ -0,0 +1,114 @@
+#ifndef __KVM_XINTERFACE_H
+#define __KVM_XINTERFACE_H
+
+/*
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/file.h>
+
+struct kvm_xinterface;
+struct kvm_xvmap;
+
+struct kvm_xinterface_ops {
+   unsigned long (*copy_to)(struct kvm_xinterface *intf,
+unsigned long gpa, const void *src,
+unsigned long len);
+   unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst,
+  unsigned long gpa, unsigned long len);
+   struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
+ unsigned long gpa,
+ unsigned long len);
+   void (*release)(struct kvm_xinterface *);
+};
+
+struct kvm_xinterface {
+   struct module   *owner;
+   struct kref  kref;
+   const struct kvm_xinterface_ops *ops;
+};
+
+static inline void
+kvm_xinterface_get(struct kvm_xinterface *intf)
+{
+   kref_get(&intf->kref);
+}
+
+static inline void
+_kvm_xinterface_release(struct kref *kref)
+{
+   struct kvm_xinterface *intf;
+   struct module *owner;
+
+   intf = container_of(kref, struct kvm_xinterface, kref);
+
+   owner = intf->owner;
+   rmb();
+
+   intf->ops->release(intf);
+   module_put(owner);
+}
+
+static inline void
+kvm_xinterface_put(struct kvm_xinterface *intf)
+{
+   kref_put(&intf->kref, _kvm_xinterface_release);
+}
+
+struct kvm_xvmap_ops {
+   void (*release)(struct

[PATCH v2 3/4] KVM: add io services to xinterface

2009-10-02 Thread Gregory Haskins
We want to add a more efficient way to get PIO signals out of the guest,
so we add an xioevent interface.  This allows a client to register
for notifications when a specific MMIO/PIO address is touched by
the guest.  This is an alternative interface to ioeventfd, which is
performance limited by io-bus scaling and eventfd wait-queue based
notification mechanism.  This also has the advantage of retaining
the full PIO data payload and passing it to the recipient.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 include/linux/kvm_xinterface.h |   47 ++
 virt/kvm/xinterface.c  |  106 
 2 files changed, 153 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h
index 01f092b..684b6f8 100644
--- a/include/linux/kvm_xinterface.h
+++ b/include/linux/kvm_xinterface.h
@@ -12,6 +12,16 @@
 
 struct kvm_xinterface;
 struct kvm_xvmap;
+struct kvm_xioevent;
+
+enum {
+   kvm_xioevent_flag_nr_pio,
+   kvm_xioevent_flag_nr_max,
+};
+
+#define KVM_XIOEVENT_FLAG_PIO   (1 << kvm_xioevent_flag_nr_pio)
+
+#define KVM_XIOEVENT_VALID_FLAG_MASK  ((1 << kvm_xioevent_flag_nr_max) - 1)
 
 struct kvm_xinterface_ops {
unsigned long (*copy_to)(struct kvm_xinterface *intf,
@@ -22,6 +32,10 @@ struct kvm_xinterface_ops {
struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
  unsigned long gpa,
  unsigned long len);
+   struct kvm_xioevent* (*ioevent)(struct kvm_xinterface *intf,
+   u64 addr,
+   unsigned long len,
+   unsigned long flags);
void (*release)(struct kvm_xinterface *);
 };
 
@@ -109,6 +123,39 @@ kvm_xvmap_put(struct kvm_xvmap *vmap)
 kref_put(&vmap->kref, _kvm_xvmap_release);
 }
 
+struct kvm_xioevent_ops {
+   void (*deassign)(struct kvm_xioevent *ioevent);
+};
+
+struct kvm_xioevent {
+   const struct kvm_xioevent_ops *ops;
+   struct kvm_xinterface *intf;
+   void (*signal)(struct kvm_xioevent *ioevent, const void *val);
+   void  *priv;
+};
+
+static inline void
+kvm_xioevent_init(struct kvm_xioevent *ioevent,
+ const struct kvm_xioevent_ops *ops,
+ struct kvm_xinterface *intf)
+{
+   memset(ioevent, 0, sizeof(*ioevent));
+   ioevent->ops = ops;
+   ioevent->intf = intf;
+
+   kvm_xinterface_get(intf);
+}
+
+static inline void
+kvm_xioevent_deassign(struct kvm_xioevent *ioevent)
+{
+   struct kvm_xinterface *intf = ioevent->intf;
+   rmb();
+
+   ioevent->ops->deassign(ioevent);
+   kvm_xinterface_put(intf);
+}
+
 struct kvm_xinterface *kvm_xinterface_bind(int fd);
 
 #endif /* __KVM_XINTERFACE_H */
diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
index 3b586c5..c356835 100644
--- a/virt/kvm/xinterface.c
+++ b/virt/kvm/xinterface.c
@@ -28,6 +28,8 @@
 #include <linux/kvm_host.h>
 #include <linux/kvm_xinterface.h>
 
+#include "iodev.h"
+
 struct _xinterface {
struct kvm *kvm;
struct task_struct *task;
@@ -42,6 +44,14 @@ struct _xvmap {
struct kvm_xvmap   vmap;
 };
 
+struct _ioevent {
+   u64   addr;
+   int   length;
+   struct kvm_io_bus *bus;
+   struct kvm_io_device  dev;
+   struct kvm_xioevent   ioevent;
+};
+
 static struct _xinterface *
 to_intf(struct kvm_xinterface *intf)
 {
@@ -362,6 +372,101 @@ fail:
return ERR_PTR(ret);
 }
 
+/* MMIO/PIO writes trigger an event if the addr/val match */
+static int
+ioevent_write(struct kvm_io_device *dev, gpa_t addr, int len, const void *val)
+{
+   struct _ioevent *p = container_of(dev, struct _ioevent, dev);
+   struct kvm_xioevent *ioevent = &p->ioevent;
+
+   if (!(addr == p->addr && len == p->length))
+   return -EOPNOTSUPP;
+
+   if (!ioevent->signal)
+   return 0;
+
+   ioevent->signal(ioevent, val);
+   return 0;
+}
+
+static const struct kvm_io_device_ops ioevent_device_ops = {
+   .write = ioevent_write,
+};
+
+static void
+ioevent_deassign(struct kvm_xioevent *ioevent)
+{
+   struct _ioevent*p = container_of(ioevent, struct _ioevent, ioevent);
+   struct _xinterface *_intf = to_intf(ioevent->intf);
+   struct kvm *kvm = _intf->kvm;
+
+   kvm_io_bus_unregister_dev(kvm, p->bus, &p->dev);
+   kfree(p);
+}
+
+static const struct kvm_xioevent_ops ioevent_intf_ops = {
+   .deassign = ioevent_deassign,
+};
+
+static struct kvm_xioevent*
+xinterface_ioevent(struct kvm_xinterface *intf,
+  u64 addr,
+  unsigned long len,
+  unsigned long flags)
+{
+   struct _xinterface *_intf = to_intf(intf);
+   struct kvm *kvm = _intf->kvm;
+   int pio = flags

[PATCH v2 4/4] KVM: add scatterlist support to xinterface

2009-10-02 Thread Gregory Haskins
This allows a scatter-gather approach to IO, which will be useful for
building high performance interfaces, like zero-copy and low-latency
copy (avoiding multiple calls to copy_to/from).

The interface is based on the existing scatterlist infrastructure.  The
caller is expected to pass in a scatterlist with its dma field
populated with valid GPAs.  The xinterface will then populate each
entry by translating the GPA to a page*.

The caller signifies completion by simply performing a put_page() on
each page returned in the list.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 include/linux/kvm_xinterface.h |4 ++
 virt/kvm/xinterface.c  |   72 
 2 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h
index 684b6f8..eefb575 100644
--- a/include/linux/kvm_xinterface.h
+++ b/include/linux/kvm_xinterface.h
@@ -9,6 +9,7 @@
 #include <linux/kref.h>
 #include <linux/module.h>
 #include <linux/file.h>
+#include <linux/scatterlist.h>
 
 struct kvm_xinterface;
 struct kvm_xvmap;
@@ -36,6 +37,9 @@ struct kvm_xinterface_ops {
u64 addr,
unsigned long len,
unsigned long flags);
+   unsigned long (*sgmap)(struct kvm_xinterface *intf,
+  struct scatterlist *sgl, int nents,
+  unsigned long flags);
void (*release)(struct kvm_xinterface *);
 };
 
diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
index c356835..16729f6 100644
--- a/virt/kvm/xinterface.c
+++ b/virt/kvm/xinterface.c
@@ -467,6 +467,77 @@ fail:
 
 }
 
+static unsigned long
+xinterface_sgmap(struct kvm_xinterface *intf,
+struct scatterlist *sgl, int nents,
+unsigned long flags)
+{
+   struct _xinterface *_intf   = to_intf(intf);
+   struct task_struct *p   = _intf->task;
+   struct mm_struct   *mm  = _intf->mm;
+   struct kvm *kvm = _intf->kvm;
+   struct kvm_memory_slot *memslot = NULL;
+   bool kthread = !current->mm;
+   int ret;
+   struct scatterlist *sg;
+   int i;
+
+   down_read(&kvm->slots_lock);
+
+   if (kthread)
+   use_mm(_intf->mm);
+
+   for_each_sg(sgl, sg, nents, i) {
+   unsigned long   gpa = sg_dma_address(sg);
+   unsigned long   len = sg_dma_len(sg);
+   unsigned long   gfn = gpa >> PAGE_SHIFT;
+   off_t   offset = offset_in_page(gpa);
+   unsigned long   hva;
+   struct page *pg;
+
+   /* ensure that we do not have more than one page per entry */
+   if ((PAGE_ALIGN(len + offset) >> PAGE_SHIFT) != 1) {
+   ret = -EINVAL;
+   break;
+   }
+
+   /* check for a memslot-cache miss */
+   if (!memslot
+   || gfn < memslot->base_gfn
+   || gfn >= memslot->base_gfn + memslot->npages) {
+   memslot = gfn_to_memslot(kvm, gfn);
+   if (!memslot) {
+   ret = -EFAULT;
+   break;
+   }
+   }
+
+   hva = (memslot->userspace_addr +
+  (gfn - memslot->base_gfn) * PAGE_SIZE);
+
+   if (kthread || current->mm == mm)
+   ret = get_user_pages_fast(hva, 1, 1, &pg);
+   else
+   ret = get_user_pages(p, mm, hva, 1, 1, 0, &pg, NULL);
+
+   if (ret != 1) {
+   if (ret <= 0)
+   ret = -EFAULT;
+   break;
+   }
+
+   sg_set_page(sg, pg, len, offset);
+   ret = 0;
+   }
+
+   if (kthread)
+   unuse_mm(_intf->mm);
+
+   up_read(&kvm->slots_lock);
+
+   return ret;
+}
+
 static void
 xinterface_release(struct kvm_xinterface *intf)
 {
@@ -483,6 +554,7 @@ struct kvm_xinterface_ops _xinterface_ops = {
.copy_from   = xinterface_copy_from,
.vmap= xinterface_vmap,
.ioevent = xinterface_ioevent,
+   .sgmap   = xinterface_sgmap,
.release = xinterface_release,
 };
 



Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-10-01 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/30/2009 10:04 PM, Gregory Haskins wrote:
 
 
 A 2.6.27 guest, or Windows guest with the existing virtio drivers,
 won't work
 over vbus.
  
 Binary compatibility with existing virtio drivers, while nice to have,
 is not a specific requirement nor goal.  We will simply load an updated
 KMP/MSI into those guests and they will work again.  As previously
 discussed, this is how more or less any system works today.  It's like
 we are removing an old adapter card and adding a new one to uprev the
 silicon.

 
 Virtualization is about not doing that.  Sometimes it's necessary (when
 you have made unfixable design mistakes), but just to replace a bus,
 with no advantages to the guest that has to be changed (other
 hypervisors or hypervisorless deployment scenarios aren't).

The problem is that your continued assertion that there is no advantage
to the guest is a completely unsubstantiated claim.  As it stands right
now, I have a public git tree that, to my knowledge, is the fastest KVM
PV networking implementation around.  It also has capabilities that are
demonstrably not found elsewhere, such as the ability to render generic
shared-memory interconnects (scheduling, timers), interrupt-priority
(qos), and interrupt-coalescing (exit-ratio reduction).  I designed each
of these capabilities after carefully analyzing where KVM was coming up
short.

Those are facts.

I can't easily prove which of my new features alone are what makes it
special per se, because I don't have unit tests for each part that
breaks it down.  What I _can_ state is that its the fastest and most
feature rich KVM-PV tree that I am aware of, and others may download and
test it themselves to verify my claims.

The disproof, on the other hand, would be in a counter example that
still meets all the performance and feature criteria under all the same
conditions while maintaining the existing ABI.  To my knowledge, this
doesn't exist.

Therefore, if you believe my work is irrelevant, show me a git tree that
accomplishes the same feats in a binary compatible way, and I'll rethink
my position.  Until then, complaining about lack of binary compatibility
is pointless since it is not an insurmountable proposition, and the one
and only available solution declares it a required casualty.

 
   Further, non-shmem virtio can't work over vbus.
  
 Actually I misspoke earlier when I said virtio works over non-shmem.
 Thinking about it some more, both virtio and vbus fundamentally require
 shared-memory, since sharing their metadata concurrently on both sides
 is their raison d'être.

 The difference is that virtio utilizes a pre-translation/mapping (via
 ->add_buf) from the guest side.  OTOH, vbus uses a post translation
 scheme (via memctx) from the host-side.  If anything, vbus is actually
 more flexible because it doesn't assume the entire guest address space
 is directly mappable.

 In summary, your statement is incorrect (though it is my fault for
 putting that idea in your head).

 
 Well, Xen requires pre-translation (since the guest has to give the host
 (which is just another guest) permissions to access the data).

Actually I am not sure that it does require pre-translation.  You might
be able to use the memctx->copy_to/copy_from scheme in post translation
as well, since those would be able to communicate to something like the
xen kernel.  But I suppose either method would result in extra exits, so
there is no distinct benefit using vbus there..as you say below they're
just different.

The biggest difference is that my proposed model gets around the notion
that the entire guest address space can be represented by an arbitrary
pointer.  For instance, the copy_to/copy_from routines take a GPA, but
may use something indirect like a DMA controller to access that GPA.  On
the other hand, virtio fully expects a viable pointer to come out of the
interface iiuc.  This is in part what makes vbus more adaptable to non-virt.
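
As a rough sketch of that distinction (the names below are only meant to
illustrate the memctx idea, not its actual definition in the vbus tree):

struct memctx;

struct memctx_ops {
	/* accessors take a GPA; the backing could be GUP, a DMA engine, etc. */
	unsigned long (*copy_to)(struct memctx *ctx, unsigned long gpa,
				 const void *src, unsigned long len);
	unsigned long (*copy_from)(struct memctx *ctx, void *dst,
				   unsigned long gpa, unsigned long len);
	void (*release)(struct memctx *ctx);
};

struct memctx {
	const struct memctx_ops *ops;
};

The point is simply that nothing in such an interface promises the caller a
host-virtual pointer for arbitrary guest memory.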

 So neither is a superset of the other, they're just different.
 
 It doesn't really matter since Xen is unlikely to adopt virtio.

Agreed.

 
 An interesting thing here is that you don't even need a fancy
 multi-homed setup to see the effects of my exit-ratio reduction work:
 even single port configurations suffer from the phenomenon since many
 devices have multiple signal-flows (e.g. network adapters tend to have
 at least 3 flows: rx-ready, tx-complete, and control-events (link-state,
 etc).  Whats worse, is that the flows often are indirectly related (for
 instance, many host adapters will free tx skbs during rx operations, so
 you tend to get bursts of tx-completes at the same time as rx-ready.  If
 the flows map 1:1 with IDT, they will suffer the same problem.

 
 You can simply use the same vector for both rx and tx and poll both at
 every interrupt.

Yes, but that has its own problems: e.g. additional exits or at least
additional overhead figuring out what happens each time.  This is even

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-30 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/26/2009 12:32 AM, Gregory Haskins wrote:

 I realize in retrospect that my choice of words above implies vbus _is_
 complete, but this is not what I was saying.  What I was trying to
 convey is that vbus is _more_ complete.  Yes, in either case some kind
 of glue needs to be written.  The difference is that vbus implements
 more of the glue generally, and leaves less required to be customized
 for each iteration.



 No argument there.  Since you care about non-virt scenarios and virtio
 doesn't, naturally vbus is a better fit for them as the code stands.
  
 Thanks for finally starting to acknowledge there's a benefit, at least.

 
 I think I've mentioned vbus' finer grained layers as helpful here,
 though I doubt the value of this.  Hypervisors are added rarely, while
 devices and drivers are added (and modified) much more often.  I don't
 buy the anything-to-anything promise.

The ease in which a new hypervisor should be able to integrate into the
stack is only one of vbus's many benefits.

 
 To be more precise, IMO virtio is designed to be a performance oriented
 ring-based driver interface that supports all types of hypervisors (e.g.
 shmem based kvm, and non-shmem based Xen).  vbus is designed to be a
 high-performance generic shared-memory interconnect (for rings or
 otherwise) framework for environments where linux is the underpinning
 host (physical or virtual).  They are distinctly different, but
 complementary (the former addresses the part of the front-end, and
 latter addresses the back-end, and a different part of the front-end).

 
 They're not truly complementary since they're incompatible.

No, that is incorrect.  Not to be rude, but for clarity:

  Complementary \Com`ple*menta*ry\, a.
 Serving to fill out or to complete; as, complementary
 numbers.
 [1913 Webster]

Citation: www.dict.org

IOW: Something being complementary has nothing to do with guest/host
binary compatibility.  virtio-pci and virtio-vbus are both equally
complementary to virtio since they fill in the bottom layer of the
virtio stack.

So yes, vbus is truly complementary to virtio afaict.

 A 2.6.27 guest, or Windows guest with the existing virtio drivers, won't work
 over vbus.

Binary compatibility with existing virtio drivers, while nice to have,
is not a specific requirement nor goal.  We will simply load an updated
KMP/MSI into those guests and they will work again.  As previously
discussed, this is how more or less any system works today.  It's like
we are removing an old adapter card and adding a new one to uprev the
silicon.

  Further, non-shmem virtio can't work over vbus.

Actually I misspoke earlier when I said virtio works over non-shmem.
Thinking about it some more, both virtio and vbus fundamentally require
shared-memory, since sharing their metadata concurrently on both sides
is their raison d'être.

The difference is that virtio utilizes a pre-translation/mapping (via
->add_buf) from the guest side.  OTOH, vbus uses a post translation
scheme (via memctx) from the host-side.  If anything, vbus is actually
more flexible because it doesn't assume the entire guest address space
is directly mappable.

In summary, your statement is incorrect (though it is my fault for
putting that idea in your head).

  Since
 virtio is guest-oriented and host-agnostic, it can't ignore
 non-shared-memory hosts (even though it's unlikely virtio will be
 adopted there)

Well, to be fair no one said it has to ignore them.  Either virtio-vbus
transport is present and available to the virtio stack, or it isn't.  If
its present, it may or may not publish objects for consumption.
Providing a virtio-vbus transport in no way limits or degrades the
existing capabilities of the virtio stack.  It only enhances them.

I digress.  The whole point is moot since I realized that the non-shmem
distinction isn't accurate anyway.  They both require shared-memory for
the metadata, and IIUC virtio requires the entire address space to be
mappable whereas vbus only assumes the metadata is.

 
 In addition, the kvm-connector used in AlacrityVM's design strives to
 add value and improve performance via other mechanisms, such as dynamic
   allocation, interrupt coalescing (thus reducing exit-ratio, which is a
 serious issue in KVM)
 
 Do you have measurements of inter-interrupt coalescing rates (excluding
 intra-interrupt coalescing).

I actually do not have a rig setup to explicitly test inter-interrupt
rates at the moment.  Once things stabilize for me, I will try to
re-gather some numbers here.  Last time I looked, however, there were
some decent savings for inter as well.

Inter rates are interesting because they are what tends to ramp up with
IO load more than intra since guest interrupt mitigation techniques like
NAPI often quell intra-rates naturally.  This is especially true for
data-center, cloud, hpc-grid, etc, kind of workloads (vs vanilla
desktops, etc) that tend to have multiple IO

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-25 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/24/2009 09:03 PM, Gregory Haskins wrote:

 I don't really see how vhost and vbus are different here.  vhost expects
 signalling to happen through a couple of eventfds and requires someone
 to supply them and implement kernel support (if needed).  vbus requires
 someone to write a connector to provide the signalling implementation.
 Neither will work out-of-the-box when implementing virtio-net over
 falling dominos, for example.
  
 I realize in retrospect that my choice of words above implies vbus _is_
 complete, but this is not what I was saying.  What I was trying to
 convey is that vbus is _more_ complete.  Yes, in either case some kind
 of glue needs to be written.  The difference is that vbus implements
 more of the glue generally, and leaves less required to be customized
 for each iteration.

 
 
 No argument there.  Since you care about non-virt scenarios and virtio
 doesn't, naturally vbus is a better fit for them as the code stands.

Thanks for finally starting to acknowledge there's a benefit, at least.

To be more precise, IMO virtio is designed to be a performance oriented
ring-based driver interface that supports all types of hypervisors (e.g.
shmem based kvm, and non-shmem based Xen).  vbus is designed to be a
high-performance generic shared-memory interconnect (for rings or
otherwise) framework for environments where linux is the underpinning
host (physical or virtual).  They are distinctly different, but
complementary (the former addresses the part of the front-end, and
latter addresses the back-end, and a different part of the front-end).

In addition, the kvm-connector used in AlacrityVM's design strives to
add value and improve performance via other mechanisms, such as dynamic
 allocation, interrupt coalescing (thus reducing exit-ratio, which is a
serious issue in KVM) and priortizable/nestable signals.

Today there is a large performance disparity between what a KVM guest
sees and what a native linux application sees on that same host.  Just
take a look at some of my graphs between virtio, and native, for
example:

http://developer.novell.com/wiki/images/b/b7/31-rc4_throughput.png

A dominant vbus design principle is to try to achieve the same IO
performance for all linux applications whether they be literally
userspace applications, or things like KVM vcpus or Ira's physical
boards.  It also aims to solve problems not previously expressible with
current technologies (even virtio), like nested real-time.

And even though you repeatedly insist otherwise, the neat thing here is
that the two technologies mesh (at least under certain circumstances,
like when virtio is deployed on a shared-memory friendly linux backend
like KVM).  I hope that my stack diagram below depicts that clearly.


 But that's not a strong argument for vbus; instead of adding vbus you
 could make virtio more friendly to non-virt

Actually, it _is_ a strong argument then because adding vbus is what
helps makes virtio friendly to non-virt, at least for when performance
matters.

 (there's a limit how far you
 can take this, not imposed by the code, but by virtio's charter as a
 virtual device driver framework).
 
 Going back to our stack diagrams, you could think of a vhost solution
 like this:

 --
 | virtio-net
 --
 | virtio-ring
 --
 | virtio-bus
 --
 | ? undefined-1 ?
 --
 | vhost
 --

 and you could think of a vbus solution like this

 --
 | virtio-net
 --
 | virtio-ring
 --
 | virtio-bus
 --
 | bus-interface
 --
 | ? undefined-2 ?
 --
 | bus-model
 --
 | virtio-net-device (vhost ported to vbus model? :)
 --


 So the difference between vhost and vbus in this particular context is
 that you need to have undefined-1 do device discovery/hotswap,
 config-space, address-decode/isolation, signal-path routing, memory-path
 routing, etc.  Today this function is filled by things like virtio-pci,
 pci-bus, KVM/ioeventfd, and QEMU for x86.  I am not as familiar with
 lguest, but presumably it is filled there by components like
 virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher.  And to use
 more contemporary examples, we might have virtio-domino, domino-bus,
 domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko,
 and ira-launcher.

 Contrast this to the vbus stack:  The bus-X components (when optionally
 employed by the connector designer) do device-discovery, hotswap,
 config-space, address-decode/isolation, signal-path and memory-path
 routing, etc in a general (and pv-centric) way. The undefined-2
 portion is the connector, and just needs to convey messages like
 DEVCALL and SHMSIGNAL.  The rest is handled in other parts

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-24 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/24/2009 12:15 AM, Gregory Haskins wrote:

 There are various aspects about designing high-performance virtual
 devices such as providing the shortest paths possible between the
 physical resources and the consumers.  Conversely, we also need to
 ensure that we meet proper isolation/protection guarantees at the same
 time.  What this means is there are various aspects to any
 high-performance PV design that require to be placed in-kernel to
 maximize the performance yet properly isolate the guest.

 For instance, you are required to have your signal-path (interrupts and
 hypercalls), your memory-path (gpa translation), and
 addressing/isolation model in-kernel to maximize performance.


 Exactly.  That's what vhost puts into the kernel and nothing more.
  
 Actually, no.  Generally, _KVM_ puts those things into the kernel, and
 vhost consumes them.  Without KVM (or something equivalent), vhost is
 incomplete.  One of my goals with vbus is to generalize the something
 equivalent part here.

 
 I don't really see how vhost and vbus are different here.  vhost expects
 signalling to happen through a couple of eventfds and requires someone
 to supply them and implement kernel support (if needed).  vbus requires
 someone to write a connector to provide the signalling implementation. 
 Neither will work out-of-the-box when implementing virtio-net over
 falling dominos, for example.

I realize in retrospect that my choice of words above implies vbus _is_
complete, but this is not what I was saying.  What I was trying to
convey is that vbus is _more_ complete.  Yes, in either case some kind
of glue needs to be written.  The difference is that vbus implements
more of the glue generally, and leaves less required to be customized
for each iteration.

Going back to our stack diagrams, you could think of a vhost solution
like this:

--
| virtio-net
--
| virtio-ring
--
| virtio-bus
--
| ? undefined-1 ?
--
| vhost
--

and you could think of a vbus solution like this

--
| virtio-net
--
| virtio-ring
--
| virtio-bus
--
| bus-interface
--
| ? undefined-2 ?
--
| bus-model
--
| virtio-net-device (vhost ported to vbus model? :)
--


So the difference between vhost and vbus in this particular context is
that you need to have undefined-1 do device discovery/hotswap,
config-space, address-decode/isolation, signal-path routing, memory-path
routing, etc.  Today this function is filled by things like virtio-pci,
pci-bus, KVM/ioeventfd, and QEMU for x86.  I am not as familiar with
lguest, but presumably it is filled there by components like
virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher.  And to use
more contemporary examples, we might have virtio-domino, domino-bus,
domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko,
and ira-launcher.

Contrast this to the vbus stack:  The bus-X components (when optionally
employed by the connector designer) do device-discovery, hotswap,
config-space, address-decode/isolation, signal-path and memory-path
routing, etc in a general (and pv-centric) way. The undefined-2
portion is the connector, and just needs to convey messages like
DEVCALL and SHMSIGNAL.  The rest is handled in other parts of the stack.

So to answer your question, the difference is that the part that has to
be customized in vbus should be a fraction of what needs to be
customized with vhost because it defines more of the stack.  And, as
eluded to in my diagram, both virtio-net and vhost (with some
modifications to fit into the vbus framework) are potentially
complementary, not competitors.

 
 Vbus accomplishes its in-kernel isolation model by providing a
 container concept, where objects are placed into this container by
 userspace.  The host kernel enforces isolation/protection by using a
 namespace to identify objects that is only relevant within a specific
 container's context (namely, a u32 dev-id).  The guest addresses the
 objects by its dev-id, and the kernel ensures that the guest can't
 access objects outside of its dev-id namespace.


 vhost manages to accomplish this without any kernel support.
  
 No, vhost manages to accomplish this because of KVMs kernel support
 (ioeventfd, etc).   Without a KVM-like in-kernel support, vhost is a
 merely a kind of tuntap-like clone signalled by eventfds.

 
 Without a vbus-connector-falling-dominos, vbus-venet can't do anything
 either.

Mostly covered above...

However, I was addressing your assertion that vhost somehow magically
accomplishes this container/addressing function without any specific
kernel support.  This is incorrect.  I contend that this kernel support

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-24 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/23/2009 10:37 PM, Avi Kivity wrote:

 Example: feature negotiation.  If it happens in userspace, it's easy
 to limit what features we expose to the guest.  If it happens in the
 kernel, we need to add an interface to let the kernel know which
 features it should expose to the guest.  We also need to add an
 interface to let userspace know which features were negotiated, if we
 want to implement live migration.  Something fairly trivial bloats
 rapidly.
 
 btw, we have this issue with kvm reporting cpuid bits to the guest. 
 Instead of letting kvm talk directly to the hardware and the guest, kvm
 gets the cpuid bits from the hardware, strips away features it doesn't
 support, exposes that to userspace, and expects userspace to program the
 cpuid bits it wants to expose to the guest (which may be different than
 what kvm exposed to userspace, and different from guest to guest).
 

This issue doesn't exist in the model I am referring to, as these are
all virtual-devices anyway.  See my last reply

-Greg





Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-23 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/22/2009 12:43 AM, Ira W. Snyder wrote:

 Sure, virtio-ira and he is on his own to make a bus-model under that, or
 virtio-vbus + vbus-ira-connector to use the vbus framework.  Either
 model can work, I agree.

  
 Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and
 virtio-s390. It isn't especially easy. I can steal lots of code from the
 lguest bus model, but sometimes it is good to generalize, especially
 after the fourth implemention or so. I think this is what GHaskins tried
 to do.

 
 Yes.  vbus is more finely layered so there is less code duplication.

To clarify, Ira was correct in stating that generalizing some of these
components was one of the goals for the vbus project: IOW vbus finely
layers and defines what's below virtio, not replaces it.

You can think of a virtio-stack like this:

--
| virtio-net
--
| virtio-ring
--
| virtio-bus
--
| ? undefined ?
--

IOW: The way I see it, virtio is a device interface model only.  The
rest of it is filled in by the virtio-transport and some kind of back-end.

So today, we can complete the ? undefined ? block like this for KVM:

--
| virtio-pci
--
 |
--
| kvm.ko
--
| qemu
--
| tuntap
--

In this case, kvm.ko and tuntap are providing plumbing, and qemu is
providing a backend device model (pci-based, etc).

You can, of course, plug a different stack in (such as virtio-lguest,
virtio-ira, etc) but you are more or less on your own to recreate many
of the various facilities contained in that stack (such as things
provided by QEMU, like discovery/hotswap/addressing), as Ira is discovering.

Vbus tries to commoditize more components in the stack (like the bus
model and backend-device model) so they don't need to be redesigned each
time we solve this virtio-transport problem.  IOW: stop the
proliferation of the need for pci-bus, lguest-bus, foo-bus underneath
virtio.  Instead, we can then focus on the value add on top, like the
models themselves or the simple glue between them.

So now you might have something like

--
| virtio-vbus
--
| vbus-proxy
--
| kvm-guest-connector
--
 |
--
| kvm.ko
--
| kvm-host-connector.ko
--
| vbus.ko
--
| virtio-net-backend.ko
--

so now we don't need to worry about the bus-model or the device-model
framework.  We only need to implement the connector, etc.  This is handy
when you find yourself in an environment that doesn't support PCI (such
as Ira's rig, or userspace containers), or when you want to add features
that PCI doesn't have (such as fluid event channels for things like IPC
services, or priortizable interrupts, etc).

 
 The virtio layering was more or less dictated by Xen which doesn't have
 shared memory (it uses grant references instead).  As a matter of fact
 lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that
 part is duplicated.  It's probably possible to add a virtio-shmem.ko
 library that people who do have shared memory can reuse.

Note that I do not believe the Xen folk use virtio, so while I can
appreciate the foresight that went into that particular aspect of the
design of the virtio model, I am not sure if its a realistic constraint.

The reason why I decided to not worry about that particular model is
twofold:

1) Trying to support non shared-memory designs is prohibitively high for
my performance goals (for instance, requiring an exit on each
->add_buf() in addition to the ->kick()).

2) The Xen guys are unlikely to diverge from something like
xenbus/xennet anyway, so it would be for naught.

Therefore, I just went with a device model optimized for shared-memory
outright.

That said, I believe we can refactor what is called the
vbus-proxy-device into this virtio-shmem interface that you and
Anthony have described.  We could make the feature optional and only
support on architectures where this makes sense.

snip

Kind Regards,
-Greg





Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-23 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/23/2009 05:26 PM, Gregory Haskins wrote:

   
 Yes, I'm having to create my own bus model, a-la lguest, virtio-pci,
 and
 virtio-s390. It isn't especially easy. I can steal lots of code from
 the
 lguest bus model, but sometimes it is good to generalize, especially
 after the fourth implemention or so. I think this is what GHaskins
 tried
 to do.


 Yes.  vbus is more finely layered so there is less code duplication.
  
 To clarify, Ira was correct in stating that generalizing some of these
 components was one of the goals for the vbus project: IOW vbus finely
 layers and defines what's below virtio, not replaces it.

 You can think of a virtio-stack like this:

 --
 | virtio-net
 --
 | virtio-ring
 --
 | virtio-bus
 --
 | ? undefined ?
 --

 IOW: The way I see it, virtio is a device interface model only.  The
 rest of it is filled in by the virtio-transport and some kind of
 back-end.

 So today, we can complete the ? undefined ? block like this for KVM:

 --
 | virtio-pci
 --
   |
 --
 | kvm.ko
 --
 | qemu
 --
 | tuntap
 --

 In this case, kvm.ko and tuntap are providing plumbing, and qemu is
 providing a backend device model (pci-based, etc).

 You can, of course, plug a different stack in (such as virtio-lguest,
 virtio-ira, etc) but you are more or less on your own to recreate many
 of the various facilities contained in that stack (such as things
 provided by QEMU, like discovery/hotswap/addressing), as Ira is
 discovering.

 Vbus tries to commoditize more components in the stack (like the bus
 model and backend-device model) so they don't need to be redesigned each
 time we solve this virtio-transport problem.  IOW: stop the
 proliferation of the need for pci-bus, lguest-bus, foo-bus underneath
 virtio.  Instead, we can then focus on the value add on top, like the
 models themselves or the simple glue between them.

 So now you might have something like

 --
 | virtio-vbus
 --
 | vbus-proxy
 --
 | kvm-guest-connector
 --
   |
 --
 | kvm.ko
 --
 | kvm-host-connector.ko
 --
 | vbus.ko
 --
 | virtio-net-backend.ko
 --

 so now we don't need to worry about the bus-model or the device-model
 framework.  We only need to implement the connector, etc.  This is handy
 when you find yourself in an environment that doesn't support PCI (such
 as Ira's rig, or userspace containers), or when you want to add features
 that PCI doesn't have (such as fluid event channels for things like IPC
 services, or priortizable interrupts, etc).

 
 Well, vbus does more, for example it tunnels interrupts instead of
 exposing them 1:1 on the native interface if it exists.

As I've previously explained, that trait is a function of the
kvm-connector I've chosen to implement, not of the overall design of vbus.

The reason why my kvm-connector is designed that way is because my early
testing/benchmarking shows one of the issues in KVM performance is the
ratio of exits per IO operation are fairly high, especially as your
scale io-load.  Therefore, the connector achieves a substantial
reduction in that ratio by treating interrupts to the same kind of
benefits that NAPI brought to general networking: That is, we enqueue
interrupt messages into a lockless ring and only hit the IDT for the
first occurrence.  Subsequent interrupts are injected in a
parallel/lockless manner, without hitting the IDT nor incurring an extra
EOI.  This pays dividends as the IO rate increases, which is when the
guest needs the most help.
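
In sketch form, the host side of that scheme is roughly (illustrative only,
with an assumed ring helper; the real kvm-connector code lives in the
AlacrityVM tree):

#include <linux/types.h>

struct shm_ring;
/* assumed helper: lockless enqueue, returns true if the ring was empty */
extern bool shm_ring_push(struct shm_ring *ring, u32 msg);

static void connector_signal(struct shm_ring *ring, u32 msg,
			     void (*inject_irq)(void))
{
	/*
	 * Only an empty->non-empty transition raises a real guest
	 * interrupt; subsequent messages just ride along in the ring
	 * and are drained by the guest's handler without extra
	 * IDT/EOI round trips.
	 */
	if (shm_ring_push(ring, msg))
		inject_irq();
}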

OTOH, it is entirely possible to design the connector such that we
maintain a 1:1 ratio of signals to traditional IDT interrupts.  It is
also possible to design a connector which surfaces as something else,
such as PCI devices (by terminating the connector in QEMU and utilizing
its PCI emulation facilities), which would naturally employ 1:1 mapping.

So if 1:1 mapping is a critical feature (I would argue to the contrary),
vbus can support it.

 It also pulls parts of the device model into the host kernel.

That is the point.  Most of it needs to be there for performance.  And
what doesn't need to be there for performance can either be:

a) skipped at the discretion of the connector/device-model designer

OR

b) included because it's a trivially small subset of the model (e.g. a
mac-addr attribute) and it's nice to have a cohesive solution instead of
requiring a separate binary blob that can get out of sync, etc.

The example I've provided to date (venet on kvm) utilizes (b), but it
certainly

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-23 Thread Gregory Haskins
Gregory Haskins wrote:
 Avi Kivity wrote:
 On 09/23/2009 05:26 PM, Gregory Haskins wrote:
   
 Yes, I'm having to create my own bus model, a-la lguest, virtio-pci,
 and
 virtio-s390. It isn't especially easy. I can steal lots of code from
 the
 lguest bus model, but sometimes it is good to generalize, especially
 after the fourth implementation or so. I think this is what GHaskins
 tried
 to do.


 Yes.  vbus is more finely layered so there is less code duplication.
  
 To clarify, Ira was correct in stating that generalizing some of these
 components was one of the goals for the vbus project: IOW, vbus finely
 layers and defines what's below virtio rather than replacing it.

 You can think of a virtio-stack like this:

 --
 | virtio-net
 --
 | virtio-ring
 --
 | virtio-bus
 --
 | ? undefined ?
 --

 IOW: The way I see it, virtio is a device interface model only.  The
 rest of it is filled in by the virtio-transport and some kind of
 back-end.

 So today, we can complete the ? undefined ? block like this for KVM:

 --
 | virtio-pci
 --
   |
 --
 | kvm.ko
 --
 | qemu
 --
 | tuntap
 --

 In this case, kvm.ko and tuntap are providing plumbing, and qemu is
 providing a backend device model (pci-based, etc).

 You can, of course, plug a different stack in (such as virtio-lguest,
 virtio-ira, etc) but you are more or less on your own to recreate many
 of the various facilities contained in that stack (such as things
 provided by QEMU, like discovery/hotswap/addressing), as Ira is
 discovering.

 Vbus tries to commoditize more components in the stack (like the bus
 model and backend-device model) so they don't need to be redesigned each
 time we solve this virtio-transport problem.  IOW: stop the
 proliferation of the need for pci-bus, lguest-bus, foo-bus underneath
 virtio.  Instead, we can then focus on the value add on top, like the
 models themselves or the simple glue between them.

 So now you might have something like

 --
 | virtio-vbus
 --
 | vbus-proxy
 --
 | kvm-guest-connector
 --
   |
 --
 | kvm.ko
 --
 | kvm-host-connector.ko
 --
 | vbus.ko
 --
 | virtio-net-backend.ko
 --

 so now we don't need to worry about the bus-model or the device-model
 framework.  We only need to implement the connector, etc.  This is handy
 when you find yourself in an environment that doesn't support PCI (such
 as Ira's rig, or userspace containers), or when you want to add features
 that PCI doesn't have (such as fluid event channels for things like IPC
 services, or prioritizable interrupts, etc).

 Well, vbus does more, for example it tunnels interrupts instead of
 exposing them 1:1 on the native interface if it exists.
 
 As I've previously explained, that trait is a function of the
 kvm-connector I've chosen to implement, not of the overall design of vbus.
 
 The reason my kvm-connector is designed that way is that my early
 testing/benchmarking shows one of the issues in KVM performance is that the
 ratio of exits per IO operation is fairly high, especially as you
 scale io-load.  Therefore, the connector achieves a substantial
 reduction in that ratio by giving interrupts the same kind of
 benefits that NAPI brought to general networking: that is, we enqueue
 interrupt messages into a lockless ring and only hit the IDT for the
 first occurrence.  Subsequent interrupts are injected in a
 parallel/lockless manner, without hitting the IDT nor incurring an extra
 EOI.  This pays dividends as the IO rate increases, which is when the
 guest needs the most help.
 
 OTOH, it is entirely possible to design the connector such that we
 maintain a 1:1 ratio of signals to traditional IDT interrupts.  It is
 also possible to design a connector which surfaces as something else,
 such as PCI devices (by terminating the connector in QEMU and utilizing
 its PCI emulation facilities), which would naturally employ 1:1 mapping.
 
 So if 1:1 mapping is a critical feature (I would argue to the contrary),
 vbus can support it.
 
 It also pulls parts of the device model into the host kernel.
 
 That is the point.  Most of it needs to be there for performance.

To clarify this point:

There are various aspects about designing high-performance virtual
devices such as providing the shortest paths possible between the
physical resources and the consumers.  Conversely, we also need to
ensure that we meet proper isolation/protection guarantees at the same
time.  What this means is there are various aspects to any
high-performance PV

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-23 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/23/2009 08:58 PM, Gregory Haskins wrote:

 It also pulls parts of the device model into the host kernel.

 That is the point.  Most of it needs to be there for performance.
  
 To clarify this point:

 There are various aspects about designing high-performance virtual
 devices such as providing the shortest paths possible between the
 physical resources and the consumers.  Conversely, we also need to
 ensure that we meet proper isolation/protection guarantees at the same
 time.  What this means is there are various aspects to any
 high-performance PV design that need to be placed in-kernel to
 maximize performance yet properly isolate the guest.

 For instance, you are required to have your signal-path (interrupts and
 hypercalls), your memory-path (gpa translation), and
 addressing/isolation model in-kernel to maximize performance.

 
 Exactly.  That's what vhost puts into the kernel and nothing more.

Actually, no.  Generally, _KVM_ puts those things into the kernel, and
vhost consumes them.  Without KVM (or something equivalent), vhost is
incomplete.  One of my goals with vbus is to generalize the something
equivalent part here.

I know you may not care about non-kvm use cases, and that's fine.  No one
says you have to.  However, note that some of us do care about these
non-kvm cases, and thus it's a distinction I am making here as a benefit
of the vbus framework.

 
 Vbus accomplishes its in-kernel isolation model by providing a
 container concept, where objects are placed into this container by
 userspace.  The host kernel enforces isolation/protection by using a
 namespace to identify objects that is only relevant within a specific
 container's context (namely, a u32 dev-id).  The guest addresses the
 objects by its dev-id, and the kernel ensures that the guest can't
 access objects outside of its dev-id namespace.

 
 vhost manages to accomplish this without any kernel support.

No, vhost manages to accomplish this because of KVM's kernel support
(ioeventfd, etc).   Without KVM-like in-kernel support, vhost is
merely a kind of tuntap-like clone signalled by eventfds.

vbus, on the other hand, generalizes one more piece of the puzzle
(namely, the function of pio+ioeventfd and userspace's programming of
it) by presenting the devid namespace and container concept.

This goes directly to my rebuttal of your claim that vbus places too
much in the kernel.  I state that, one way or the other, address decode
and isolation _must_ be in the kernel for performance.  Vbus does this
with a devid/container scheme.  vhost+virtio-pci+kvm does it with
pci+pio+ioeventfd.
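
To illustrate the decode+execute side I keep referring to, here is a
hedged sketch (hypothetical names and ops, not the actual vbus code): the
verb carries a u32 dev-id, and the host resolves it only within the
calling container's namespace, so a guest can never reach an object it
does not own.

#include <linux/types.h>
#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
#include <linux/errno.h>

struct vbus_device;                          /* opaque device-model object  */

struct vbus_container {
	struct radix_tree_root devices;      /* u32 dev-id -> device object */
};

static int container_devcall(struct vbus_container *c, u32 devid,
			     u32 func, void *data, size_t len)
{
	struct vbus_device *dev;
	int ret = -ENOENT;

	rcu_read_lock();
	dev = radix_tree_lookup(&c->devices, devid);  /* decode: container scope only */
	if (dev) {
		/* execute: dispatch the verb to the device model, e.g.
		 * ret = dev->ops->call(dev, func, data, len);  (hypothetical op) */
		ret = 0;
	}
	rcu_read_unlock();
	return ret;                                   /* -ENOENT: outside the namespace */
}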


  The guest
 simply has no access to any vhost resources other than the guest->host
 doorbell, which is handed to the guest outside vhost (so it's somebody
 else's problem, in userspace).

You mean _controlled_ by userspace, right?  Obviously, the other side of
the kernel still needs to be programmed (ioeventfd, etc).  Otherwise,
vhost would be pointless: e.g. just use vanilla tuntap if you don't need
fast in-kernel decoding.

 
 All that is required is a way to transport a message with a devid
 attribute as an address (such as DEVCALL(devid)) and the framework
 provides the rest of the decode+execute function.

 
 vhost avoids that.

No, it doesn't avoid it.  It just doesn't specify how it's done, and
relies on something else to do it on its behalf.

Conversely, vbus specifies how it's done, but not how to transport the
verb across the wire.  That is the role of the vbus-connector abstraction.

 
 Contrast this to vhost+virtio-pci (called simply vhost from here).

 
 It's the wrong name.  vhost implements only the data path.

Understood, but vhost+virtio-pci is what I am contrasting, and I use
vhost for short from that point on because I am too lazy to type the
whole name over and over ;)

 
 It is not immune to requiring in-kernel addressing support either, but
 rather it just does it differently (and it's not as you might expect via
 qemu).

 Vhost relies on QEMU to render PCI objects to the guest, which the guest
 assigns resources (such as BARs, interrupts, etc).
 
 vhost does not rely on qemu.  It relies on its user to handle
 configuration.  In one important case it's qemu+pci.  It could just as
 well be the lguest launcher.

I meant vhost=vhost+virtio-pci here.  Sorry for the confusion.

The point I am making specifically is that vhost in general relies on
other in-kernel components to function.  I.e. It cannot function without
having something like the PCI model to build an IO namespace.  That
namespace (in this case, pio address+data tuples) is used for the
in-kernel addressing function under KVM + virtio-pci.
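
For reference, this is roughly how that pio address+data tuple gets wired
up in the KVM case (a hedged userspace sketch; the port and data values
are placeholders): the ioeventfd registration tells the kernel to decode
a guest PIO write and signal an eventfd, which vhost can then use as its
kick.

#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* returns the eventfd that vhost would poll as its "kick", or -1 on error */
static int wire_pio_doorbell(int vm_fd, uint16_t port, uint16_t queue_id)
{
	int efd = eventfd(0, EFD_CLOEXEC);
	struct kvm_ioeventfd io = {
		.datamatch = queue_id,            /* data written by the guest   */
		.addr      = port,                /* PIO address of the doorbell */
		.len       = 2,
		.fd        = efd,
		.flags     = KVM_IOEVENTFD_FLAG_PIO | KVM_IOEVENTFD_FLAG_DATAMATCH,
	};

	if (efd < 0 || ioctl(vm_fd, KVM_IOEVENTFD, &io) < 0)
		return -1;

	return efd;
}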

The case of the lguest launcher is a good one to highlight.  Yes, you
can presumably also use lguest with vhost, if the requisite facilities
are exposed to lguest-bus, and some eventfd based thing like ioeventfd
is written for the host (if it doesn't exist already

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-16 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/15/2009 11:08 PM, Gregory Haskins wrote:

 There's virtio-console, virtio-blk etc.  None of these have kernel-mode
 servers, but these could be implemented if/when needed.
  
 IIUC, Ira already needs at least ethernet and console capability.


 
 He's welcome to pick up the necessary code from qemu.

The problem isn't where to find the models...the problem is how to
aggregate multiple models to the guest.

 
 b) what do you suppose this protocol to aggregate the connections would
 look like? (hint: this is what a vbus-connector does).


 You mean multilink?  You expose the device as a multiqueue.
  
 No, what I mean is how do you surface multiple ethernet and consoles to
 the guests?  For Ira's case, I think he needs at minimum at least one of
 each, and he mentioned possibly having two unique ethernets at one point.

 
 You instantiate multiple vhost-nets.  Multiple ethernet NICs is a
 supported configuration for kvm.

But this is not KVM.

 
 His slave boards surface themselves as PCI devices to the x86
 host.  So how do you use that to make multiple vhost-based devices (say
 two virtio-nets, and a virtio-console) communicate across the transport?

 
 I don't really see the difference between 1 and N here.

A KVM surfaces N virtio-devices as N pci-devices to the guest.  What do
we do in Ira's case where the entire guest represents itself as a PCI
device to the host, and nothing the other way around?


 
 There are multiple ways to do this, but what I am saying is that
 whatever is conceived will start to look eerily like a vbus-connector,
 since this is one of its primary purposes ;)

 
 I'm not sure if you're talking about the configuration interface or data
 path here.

I am talking about how we would tunnel the config space for N devices
across his transport.

As an aside, the vbus-kvm connector makes them one and the same, but
they do not have to be.  It's all in the connector design.

 
 c) how do you manage the configuration, especially on a per-board
 basis?


 pci (for kvm/x86).
  
 Ok, for kvm understood (and I would also add qemu to that mix).  But
 we are talking about vhost's application in a non-kvm environment here,
 right?

 So if the vhost-X devices are in the guest,
 
 They aren't in the guest.  The best way to look at it is
 
 - a device side, with a dma engine: vhost-net
 - a driver side, only accessing its own memory: virtio-net
 
 Given that Ira's config has the dma engine in the ppc boards, that's
 where vhost-net would live (the ppc boards acting as NICs to the x86
 board, essentially).

That sounds convenient given his hardware, but it has its own set of
problems.  For one, the configuration/inventory of these boards is now
driven by the wrong side and has to be addressed.  Second, the role
reversal will likely not work for many models other than ethernet (e.g.
virtio-console or virtio-blk drivers running on the x86 board would be
naturally consuming services from the slave boards...virtio-net is an
exception because 802.x is generally symmetrical).

IIUC, vbus would support having the device models live properly on the
x86 side, solving both of these problems.  It would be impossible to
reverse vhost given its current design.

 
 and the x86 board is just
 a slave...How do you tell each ppc board how many devices and what
 config (e.g. MACs, etc) to instantiate?  Do you assume that they should
 all be symmetric and based on positional (e.g. slot) data?  What if you
 want asymmetric configurations (if not here, perhaps in a different
 environment)?

 
 I have no idea, that's for Ira to solve.

Bingo.  Thus my statement that the vhost proposal is incomplete.  You
have the virtio-net and vhost-net pieces covering the fast-path
end-points, but nothing in the middle (transport, aggregation,
config-space), and nothing on the management-side.  vbus provides most
of the other pieces, and can even support the same virtio-net protocol
on top.  The remaining part would be something like a udev script to
populate the vbus with devices on board-insert events.

 If he could fake the PCI
 config space as seen by the x86 board, he would just show the normal pci
 config and use virtio-pci (multiple channels would show up as a
 multifunction device).  Given he can't, he needs to tunnel the virtio
 config space some other way.

Right, and note that vbus was designed to solve this.  This tunneling
can, of course, be done without vbus using some other design.  However,
whatever solution is created will look incredibly close to what I've
already done, so my point is why reinvent it?

 
 Yes.  virtio is really virtualization oriented.
  
 I would say that it's vhost in particular that is virtualization
 oriented.  virtio, as a concept, generally should work in physical
 systems, if perhaps with some minor modifications.  The biggest limit
 is having virt in its name ;)

 
 Let me rephrase.  The virtio developers

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-16 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/16/2009 02:44 PM, Gregory Haskins wrote:
 The problem isn't where to find the models...the problem is how to
 aggregate multiple models to the guest.

 
 You mean configuration?
 
 You instantiate multiple vhost-nets.  Multiple ethernet NICs is a
 supported configuration for kvm.
  
 But this is not KVM.


 
 If kvm can do it, others can.

The problem is that you seem to either hand-wave over details like this,
or you give details that are pretty much exactly what vbus does already.
 My point is that I've already sat down and thought about these issues
and solved them in a freely available GPL'ed software package.

So the question is: is your position that vbus is all wrong and you wish
to create a new bus-like thing to solve the problem?  If so, how is it
different from what I've already done?  More importantly, what specific
objections do you have to what I've done, as perhaps they can be fixed
instead of starting over?

 
 His slave boards surface themselves as PCI devices to the x86
 host.  So how do you use that to make multiple vhost-based devices (say
 two virtio-nets, and a virtio-console) communicate across the
 transport?


 I don't really see the difference between 1 and N here.
  
 A KVM surfaces N virtio-devices as N pci-devices to the guest.  What do
 we do in Ira's case where the entire guest represents itself as a PCI
 device to the host, and nothing the other way around?

 
 There is no guest and host in this scenario.  There's a device side
 (ppc) and a driver side (x86).  The driver side can access configuration
 information on the device side.  How to multiplex multiple devices is an
 interesting exercise for whoever writes the virtio binding for that setup.

Bingo.  So now it's a question of whether you want to write this layer from
scratch or re-use my framework.

 
 There are multiple ways to do this, but what I am saying is that
 whatever is conceived will start to look eerily like a vbus-connector,
 since this is one of its primary purposes ;)


 I'm not sure if you're talking about the configuration interface or data
 path here.
  
 I am talking about how we would tunnel the config space for N devices
 across his transport.

 
 Sounds trivial.

No one said it was rocket science.  But it does need to be designed and
implemented end-to-end, much of which I've already done in what I hope is
an extensible way.

  Write an address containing the device number and
 register number to one location, read or write data from another.

You mean like the u64 devh and u32 func fields I have here for the
vbus-kvm connector?

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l64

 Just
 like the PCI cf8/cfc interface.
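
For anyone following along, that cf8/cfc style select-then-access pattern
looks roughly like this (illustrative userspace sketch only; it needs
iopl() and root, and the helper name is made up).  The same mux pattern is
what any single-window transport ends up using to address many
devices/registers:

#include <stdint.h>
#include <sys/io.h>          /* outl()/inl(); requires iopl(3) and root */

#define PCI_CONFIG_ADDR 0xcf8
#define PCI_CONFIG_DATA 0xcfc

static uint32_t pci_cfg_read(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg)
{
	uint32_t addr = (1u << 31)                 /* enable bit              */
	              | ((uint32_t)bus << 16)
	              | ((uint32_t)dev << 11)
	              | ((uint32_t)fn  << 8)
	              | (reg & 0xfc);              /* dword-aligned register  */

	outl(addr, PCI_CONFIG_ADDR);               /* select device/register  */
	return inl(PCI_CONFIG_DATA);               /* access the data         */
}
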
 
 They aren't in the guest.  The best way to look at it is

 - a device side, with a dma engine: vhost-net
 - a driver side, only accessing its own memory: virtio-net

 Given that Ira's config has the dma engine in the ppc boards, that's
 where vhost-net would live (the ppc boards acting as NICs to the x86
 board, essentially).
  
 That sounds convenient given his hardware, but it has its own set of
 problems.  For one, the configuration/inventory of these boards is now
 driven by the wrong side and has to be addressed.
 
 Why is it the wrong side?

Wrong is probably too harsh a word when looking at ethernet.  It's
certainly odd, and possibly inconvenient.  It would be like having
vhost in a KVM guest, and virtio-net running on the host.  You could do
it, but it's weird and awkward.  Where it really falls apart and enters
the wrong category is for non-symmetric devices, like disk-io.

 
 Second, the role
 reversal will likely not work for many models other than ethernet (e.g.
 virtio-console or virtio-blk drivers running on the x86 board would be
 naturally consuming services from the slave boards...virtio-net is an
 exception because 802.x is generally symmetrical).

 
 There is no role reversal.

So if I have virtio-blk driver running on the x86 and vhost-blk device
running on the ppc board, I can use the ppc board as a block-device.
What if I really wanted to go the other way?

 The side doing dma is the device, the side
 accessing its own memory is the driver.  Just like those other 1e12
 driver/device pairs out there.

IIUC, his ppc boards really can be seen as guests (they are linux
instances that are utilizing services from the x86, not the other way
around).  vhost forces the model to have the ppc boards act as IO-hosts,
whereas vbus would likely work in either direction due to its more
refined abstraction layer.

 
 I have no idea, that's for Ira to solve.
  
 Bingo.  Thus my statement that the vhost proposal is incomplete.  You
 have the virtio-net and vhost-net pieces covering the fast-path
 end-points, but nothing in the middle (transport, aggregation

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-16 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/16/2009 05:10 PM, Gregory Haskins wrote:

 If kvm can do it, others can.
  
 The problem is that you seem to either hand-wave over details like this,
 or you give details that are pretty much exactly what vbus does already.
   My point is that I've already sat down and thought about these issues
 and solved them in a freely available GPL'ed software package.

 
 In the kernel.  IMO that's the wrong place for it.

In conversations with Ira, he indicated he needs kernel-to-kernel
ethernet for performance, and needs at least an ethernet and console
connectivity.  You could conceivably build a solution for this system 3
basic ways:

1) completely in userspace: use things like tuntap on the ppc boards,
and tunnel packets across a custom point-to-point connection formed over
the pci link to a userspace app on the x86 board.  This app then
reinjects the packets into the x86 kernel as a raw socket or tuntap,
etc.  Pretty much vanilla tuntap/vpn kind of stuff.  Advantage: very
little kernel code.  Problem: performance (citation: hopefully obvious).

2) partially in userspace: have an in-kernel virtio-net driver talk to
a userspace based virtio-net backend.  This is the (current, non-vhost
oriented) KVM/qemu model.  Advantage, re-uses existing kernel-code.
Problem: performance (citation: see alacrityvm numbers).

3) in-kernel: You can do something like virtio-net to vhost to
potentially meet some of the requirements, but not all.

In order to fully meet (3), you would need to do some of that stuff you
mentioned in the last reply with muxing device-nr/reg-nr.  In addition,
we need to have a facility for mapping eventfds and establishing a
signaling mechanism (like PIO+qid), etc. KVM does this with
IRQFD/IOEVENTFD, but we don't have KVM in this case so it needs to be
invented.
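
For comparison, this is all the signaling half amounts to on the KVM side
(a hedged sketch; the gsi value is a placeholder): KVM_IRQFD binds an
eventfd to a guest GSI, so the backend injects an interrupt with a single
eventfd write.  It is exactly this kind of glue that Ira's rig would have
to grow an equivalent of.

#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int wire_guest_irq(int vm_fd, uint32_t gsi)
{
	int efd = eventfd(0, EFD_CLOEXEC);
	struct kvm_irqfd req = {
		.fd  = efd,
		.gsi = gsi,             /* guest interrupt to raise */
	};

	if (efd < 0 || ioctl(vm_fd, KVM_IRQFD, &req) < 0)
		return -1;

	eventfd_write(efd, 1);          /* e.g. the backend kicks the guest */
	return efd;
}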

To meet performance, this stuff has to be in kernel and there has to be
a way to manage it.  Since vbus was designed to do exactly that, this is
what I would advocate.  You could also reinvent these concepts and put
your own mux and mapping code in place, in addition to all the other
stuff that vbus does.  But I am not clear why anyone would want to.

So no, the kernel is not the wrong place for it.  It's the _only_ place
for it.  Otherwise, just use (1) and be done with it.

  Further, if we adopt
 vbus, we drop compatibility with existing guests or have to support both
 vbus and virtio-pci.

We already need to support both (at least to support Ira).  virtio-pci
doesn't work here.  Something else (vbus, or vbus-like) is needed.

 
 So the question is: is your position that vbus is all wrong and you wish
 to create a new bus-like thing to solve the problem?
 
 I don't intend to create anything new, I am satisfied with virtio.  If
 it works for Ira, excellent.  If not, too bad.

I think that about sums it up, then.


  I believe it will work without too much trouble.

AFAICT it won't for the reasons I mentioned.

 
 If so, how is it
 different from what Ive already done?  More importantly, what specific
 objections do you have to what Ive done, as perhaps they can be fixed
 instead of starting over?

 
 The two biggest objections are:
 - the host side is in the kernel

As it needs to be.

 - the guest side is a new bus instead of reusing pci (on x86/kvm),
 making Windows support more difficult

That's a function of the vbus-connector, which is different from
vbus-core.  If you don't like it (and I know you don't), we can write
one that interfaces to qemu's pci system.  I just don't like the
limitations that imposes, nor do I think we need the complexity of
dealing with a split PCI model, so I chose to not implement vbus-kvm
this way.

With all due respect, based on all of your comments in aggregate I
really do not think you are truly grasping what I am actually building here.

 
 I guess these two are exactly what you think are vbus' greatest
 advantages, so we'll probably have to extend our agree-to-disagree on
 this one.
 
 I also had issues with using just one interrupt vector to service all
 events, but that's easily fixed.

Again, function of the connector.

 
 There is no guest and host in this scenario.  There's a device side
 (ppc) and a driver side (x86).  The driver side can access configuration
 information on the device side.  How to multiplex multiple devices is an
 interesting exercise for whoever writes the virtio binding for that
 setup.
  
 Bingo.  So now it's a question of whether you want to write this layer from
 scratch or re-use my framework.

 
 You will have to implement a connector or whatever for vbus as well. 
 vbus has more layers so it's probably smaller for vbus.

Bingo! That is precisely the point.

All the stuff for how to map eventfds, handle signal mitigation, demux
device/function pointers, isolation, etc, is built in.  All the
connector has to do is transport the 4-6 verbs and provide a memory
mapping/copy function, and the rest is reusable.  The device models
would

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-16 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/16/2009 10:22 PM, Gregory Haskins wrote:
 Avi Kivity wrote:
   
 On 09/16/2009 05:10 PM, Gregory Haskins wrote:
 
 If kvm can do it, others can.

  
 The problem is that you seem to either hand-wave over details like
 this,
 or you give details that are pretty much exactly what vbus does
 already.
My point is that I've already sat down and thought about these
 issues
 and solved them in a freely available GPL'ed software package.


 In the kernel.  IMO that's the wrong place for it.
  
 3) in-kernel: You can do something like virtio-net to vhost to
 potentially meet some of the requirements, but not all.

 In order to fully meet (3), you would need to do some of that stuff you
 mentioned in the last reply with muxing device-nr/reg-nr.  In addition,
 we need to have a facility for mapping eventfds and establishing a
 signaling mechanism (like PIO+qid), etc. KVM does this with
 IRQFD/IOEVENTFD, but we dont have KVM in this case so it needs to be
 invented.

 
 irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted.

Not per se, but it needs to be interfaced.  How do I register that
eventfd with the fastpath in Ira's rig? How do I signal the eventfd
(x86->ppc, and ppc->x86)?

To take it to the next level, how do I organize that mechanism so that
it works for more than one IO-stream (e.g. address the various queues
within ethernet or a different device like the console)?  KVM has
IOEVENTFD and IRQFD managed with MSI and PIO.  This new rig does not
have the luxury of an established IO paradigm.

Is vbus the only way to implement a solution?  No.  But it is _a_ way,
and it's one that was specifically designed to solve this very problem
(as well as others).

(As an aside, note that you generally will want an abstraction on top of
irqfd/eventfd like shm-signal or virtqueues to do shared-memory based
event mitigation, but I digress.  That is a separate topic).

 
 To meet performance, this stuff has to be in kernel and there has to be
 a way to manage it.
 
 and management belongs in userspace.

vbus does not dictate where the management must be.  It's an extensible
framework, governed by what you plug into it (ala connectors and devices).

For instance, the vbus-kvm connector in alacrityvm chooses to put DEVADD
and DEVDROP hotswap events into the interrupt stream, because they are
simple and we already needed the interrupt stream anyway for fast-path.

As another example: venet chose to put ->call(MACQUERY) config-space
into its call namespace because it's simple, and we already need
->calls() for fastpath.  It therefore exports an attribute to sysfs that
allows the management app to set it.

I could likewise have designed the connector or device-model differently
so as to keep the mac-address and hotswap-events somewhere else (QEMU/PCI
userspace) but this seems silly to me when they are so trivial, so I didn't.

 
 Since vbus was designed to do exactly that, this is
 what I would advocate.  You could also reinvent these concepts and put
 your own mux and mapping code in place, in addition to all the other
 stuff that vbus does.  But I am not clear why anyone would want to.

 
 Maybe they like their backward compatibility and Windows support.

This is really not relevant to this thread, since we are talking about
Ira's hardware.  But if you must bring this up, then I will reiterate
that you just design the connector to interface with QEMU+PCI and you
have that too if that was important to you.

But on that topic: Since you could consider KVM a motherboard
manufacturer of sorts (it just happens to be virtual hardware), I don't
know why KVM seems to consider itself the only motherboard manufacturer
in the world that has to make everything look legacy.  If a company like
ASUS wants to add some cutting edge IO controller/bus, they simply do
it.  Pretty much every product release may contain a different array of
devices, many of which are not backwards compatible with any prior
silicon.  The guy/gal installing Windows on that system may see a ? in
device-manager until they load a driver that supports the new chip, and
subsequently it works.  It is certainly not a requirement to make said
chip somehow work with existing drivers/facilities on bare metal, per
se.  Why should virtual systems be different?

So, yeah, the current design of the vbus-kvm connector means I have to
provide a driver.  This is understood, and I have no problem with that.

The only thing that I would agree has to be backwards compatible is the
BIOS/boot function.  If you can't support running an image like the
Windows installer, you are hosed.  If you can't use your ethernet until
you get a chance to install a driver after the install completes, it's
just like most other systems in existence.  IOW: It's not a big deal.

For cases where the IO system is needed as part of the boot/install, you
provide BIOS and/or an install-disk support for it.

 
 So no, the kernel

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-16 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 On Wed, Sep 16, 2009 at 10:10:55AM -0400, Gregory Haskins wrote:
 There is no role reversal.
 So if I have virtio-blk driver running on the x86 and vhost-blk device
 running on the ppc board, I can use the ppc board as a block-device.
 What if I really wanted to go the other way?
 
 It seems ppc is the only one that can initiate DMA to an arbitrary
 address, so you can't do this really, or you can by tunneling each
 request back to ppc, or doing an extra data copy, but it's unlikely to
 work well.
 
 The limitation comes from hardware, not from the API we use.

Understood, but presumably it can be exposed as a sub-function of the
ppc board's register file as a DMA-controller service to the x86.
This would fall into the tunnel requests back category you mention
above, though I think tunnel implies a heavier protocol than it would
actually require.  This would look more like a PIO cycle to a DMA
controller than some higher layer protocol.

You would then utilize that DMA service inside the memctx, and the
rest of vbus would work transparently with the existing devices/drivers.
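
As a purely hypothetical sketch of what I mean (the copy_to/copy_from op
names mirror the memctx paradigm mentioned elsewhere in this thread; the
rest is made up for illustration): the connector supplies a memctx whose
copy ops are backed by the board's DMA engine instead of hva dereferences,
and the device models above it never know the difference.

#include <linux/types.h>

struct memctx;                                   /* opaque per-connector context */

struct memctx_ops {
	int (*copy_to)(struct memctx *ctx, u64 gpa,
		       const void *src, size_t len);
	int (*copy_from)(struct memctx *ctx, void *dst,
			 u64 gpa, size_t len);
};

/* Hypothetical ppc-DMA backed implementation: instead of dereferencing an
 * hva, program the board's DMA controller through its register file. */
static int ppcdma_copy_to(struct memctx *ctx, u64 gpa,
			  const void *src, size_t len)
{
	/* kick_board_dma(ctx, gpa, src, len);   -- placeholder for the PIO
	 *                                          cycles described above  */
	return 0;
}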

I do agree it would require some benchmarking to determine its
feasibility, which is why I was careful to say things like may work
;).  I also do not even know if it's possible to expose the service this
way on his system.  If this design is not possible or performs poorly, I
admit vbus is just as hosed as vhost in regard to the role correction
benefit.

Kind Regards,
-Greg





Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-15 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/14/2009 10:14 PM, Gregory Haskins wrote:
 To reiterate, as long as the model is such that the ppc boards are
 considered the owner (direct access, no translation needed) I believe
 it will work.  If the pointers are expected to be owned by the host,
 then my model doesn't work well either.

 
 In this case the x86 is the owner and the ppc boards use translated
 access.  Just switch drivers and device and it falls into place.
 

You could switch vbus roles as well, I suppose.  Another potential
option is that he can stop mapping host memory on the guest so that it
follows the more traditional model.  As a bus-master device, the ppc
boards should have access to any host memory at least in the GFP_DMA
range, which would include all relevant pointers here.

I digress:  I was primarily addressing the concern that Ira would need
to manage the host side of the link using hvas mapped from userspace
(even if host side is the ppc boards).  vbus abstracts that access so as
to allow something other than userspace/hva mappings.  OTOH, having each
ppc board run a userspace app to do the mapping on its behalf and feed
it to vhost is probably not a huge deal either.  Where vhost might
really fall apart is when any assumptions about pageable memory occur,
if any.

As an aside: a bigger issue is that, iiuc, Ira wants more than a single
ethernet channel in his design (multiple ethernets, consoles, etc).  A
vhost solution in this environment is incomplete.

Note that Ira's architecture highlights that vbus's explicit management
interface is more valuable here than it is in KVM, since KVM already has
its own management interface via QEMU.

Kind Regards,
-Greg





Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-15 Thread Gregory Haskins
Avi Kivity wrote:
 On 09/15/2009 04:03 PM, Gregory Haskins wrote:

 In this case the x86 is the owner and the ppc boards use translated
 access.  Just switch drivers and device and it falls into place.

  
 You could switch vbus roles as well, I suppose.
 
 Right, there's not real difference in this regard.
 
 Another potential
 option is that he can stop mapping host memory on the guest so that it
 follows the more traditional model.  As a bus-master device, the ppc
 boards should have access to any host memory at least in the GFP_DMA
 range, which would include all relevant pointers here.

 I digress:  I was primarily addressing the concern that Ira would need
 to manage the host side of the link using hvas mapped from userspace
 (even if host side is the ppc boards).  vbus abstracts that access so as
 to allow something other than userspace/hva mappings.  OTOH, having each
 ppc board run a userspace app to do the mapping on its behalf and feed
 it to vhost is probably not a huge deal either.  Where vhost might
 really fall apart is when any assumptions about pageable memory occur,
 if any.

 
 Why?  vhost will call get_user_pages() or copy_*_user() which ought to
 do the right thing.

I was speaking generally, not specifically to Ira's architecture.  What
I mean is that vbus was designed to work without assuming that the
memory is pageable.  There are environments in which the host is not
capable of mapping hvas/*page, but the memctx->copy_to/copy_from
paradigm could still work (think rdma, for instance).

 
 As an aside: a bigger issue is that, iiuc, Ira wants more than a single
 ethernet channel in his design (multiple ethernets, consoles, etc).  A
 vhost solution in this environment is incomplete.

 
 Why?  Instantiate as many vhost-nets as needed.

a) what about non-ethernets?
b) what do you suppose this protocol to aggregate the connections would
look like? (hint: this is what a vbus-connector does).
c) how do you manage the configuration, especially on a per-board basis?

 
 Note that Ira's architecture highlights that vbus's explicit management
 interface is more valuable here than it is in KVM, since KVM already has
 its own management interface via QEMU.

 
 vhost-net and vbus both need management, vhost-net via ioctls and vbus
 via configfs.

Actually I have patches queued to allow vbus to be managed via ioctls as
well, per your feedback (and it solves the permissions/lifetime
criticisms in alacrityvm-v0.1).

  The only difference is the implementation.  vhost-net
 leaves much more to userspace, that's the main difference.

Also,

*) vhost is virtio-net specific, whereas vbus is a more generic device
model where things like virtio-net or venet ride on top.

*) vhost is only designed to work with environments that look very
similar to a KVM guest (slot/hva translatable).  vbus can bridge various
environments by abstracting the key components (such as memory access).

*) vhost requires an active userspace management daemon, whereas vbus
can be driven by transient components, like scripts (ala udev)

Kind Regards,
-Greg






Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-15 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
 No, what I mean is how do you surface multiple ethernet and consoles to
 the guests?  For Ira's case, I think he needs at minimum at least one of
 each, and he mentioned possibly having two unique ethernets at one point.

 His slave boards surface themselves as PCI devices to the x86
 host.  So how do you use that to make multiple vhost-based devices (say
 two virtio-nets, and a virtio-console) communicate across the transport?

 There are multiple ways to do this, but what I am saying is that
 whatever is conceived will start to look eerily like a vbus-connector,
 since this is one of its primary purposes ;)
 
 Can't all this be in userspace?

Can you outline your proposal?

-Greg





Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-15 Thread Gregory Haskins
Michael S. Tsirkin wrote:
 On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote:
 Michael S. Tsirkin wrote:
 On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
 No, what I mean is how do you surface multiple ethernet and consoles to
 the guests?  For Ira's case, I think he needs at minimum at least one of
 each, and he mentioned possibly having two unique ethernets at one point.

 His slave boards surface themselves as PCI devices to the x86
 host.  So how do you use that to make multiple vhost-based devices (say
 two virtio-nets, and a virtio-console) communicate across the transport?

 There are multiple ways to do this, but what I am saying is that
 whatever is conceived will start to look eerily like a vbus-connector,
 since this is one of its primary purposes ;)
 Can't all this be in userspace?
 Can you outline your proposal?

 -Greg

 
 Userspace in x86 maps a PCI region, uses it for communication with ppc?
 

And what do you propose this communication to look like?

-Greg




