Re: [PATCH] deal with interrupt shadow state for emulated instruction

2009-05-07 Thread Glauber Costa
On Wed, May 06, 2009 at 01:51:04PM +0300, Avi Kivity wrote:
> Gleb Natapov wrote:
>> On Tue, May 05, 2009 at 02:40:11PM -0400, Glauber Costa wrote:
>>   
>>> diff --git a/arch/x86/include/asm/kvm_host.h 
>>> b/arch/x86/include/asm/kvm_host.h
>>> index 8e680c3..a49d07b 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -510,6 +510,8 @@ struct kvm_x86_ops {
>>> void (*run)(struct kvm_vcpu *vcpu, struct kvm_run *run);
>>> int (*handle_exit)(struct kvm_run *run, struct kvm_vcpu *vcpu);
>>> void (*skip_emulated_instruction)(struct kvm_vcpu *vcpu);
>>> +   void (*set_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
>>> 
>> There is .drop_interrupt_shadow() callback. The patch should remove it and
>> replace its use by set_interrupt_shadow().
>>   
>
> That would be [PATCH 1/2].
[PATCH 2/2]. Otherwise we will break bisectability, as the pure removal of this
function would lead us to a non-functioning kernel for no reason.

Avi: if this patch is okay, please apply. I'll send another one later that 
replaces
the existing .drop_interrupt_shadow by the (then) in tree set_interrupt_shadow.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Davide Libenzi
On Thu, 7 May 2009, Avi Kivity wrote:

> What's your take on adding irq context safe callbacks to irqfd?
> 
> To give some background here, we would like to use eventfd as a generic
> connector between components, so the components do not know about each other.
> So far eventfd successfully abstracts among components in the same process, in
> different processes, and in the kernel.
> 
> eventfd_signal() can be safely called from irq context, and will wake up a
> waiting task.  But in some cases, if the consumer is in the kernel, it may be
> able to consume the event from irq context, saving a context switch.
> 
> So, will you consider patches adding this capability to eventfd?

Since I received this one after your ACK on the capability of eventfd of 
triggering callbacks, I assume we're clear on this point, aren't we?


- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Marcelo Tosatti wrote:
> On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote:
>   
>> Also for PIO/MMIO you're adding this unoptimized lookup to the 
>> measurement:
>>
>> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>> if (pio_dev) {
>> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>> complete_pio(vcpu); 
>> return 1;
>> }
>>
>> Whereas for hypercall measurement you don't. I believe a fair comparison
>> would be have a shared guest/host memory area where you store guest/host
>> TSC values and then do, on guest:
>>
>>  rdtscll(&shared_area->guest_tsc);
>>  pio/mmio/hypercall
>>  ... back to host
>>  rdtscll(&shared_area->host_tsc);
>>
>> And then calculate the difference (minus guests TSC_OFFSET of course)?
>> 
>
> Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest"
> Core2 Xeon 5130 @2.00Ghz, 4GB RAM.
>
> Also it would be interesting to see the MMIO comparison with EPT/NPT,
> it probably sucks much less than what you're seeing.
>
>   

Agreed.  If you or someone on this thread has such a beast, please fire
up my test and post the numbers.

-Greg



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Marcelo Tosatti wrote:
> On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote:
>   
>> Chris Wright wrote:
>> 
>>> * Gregory Haskins (ghask...@novell.com) wrote:
>>>   
>>>   
 Chris Wright wrote:
 
 
> VF drivers can also have this issue (and typically use mmio).
> I at least have a better idea what your proposal is, thanks for
> explanation.  Are you able to demonstrate concrete benefit with it yet
> (improved latency numbers for example)?
>   
>   
 I had a test-harness/numbers for this kind of thing, but its a bit
 crufty since its from ~1.5 years ago.  I will dig it up, update it, and
 generate/post new numbers.
 
 
>>> That would be useful, because I keep coming back to pio and shared
>>> page(s) when think of why not to do this.  Seems I'm not alone in that.
>>>
>>> thanks,
>>> -chris
>>>   
>>>   
>> I completed the resurrection of the test and wrote up a little wiki on
>> the subject, which you can find here:
>>
>> http://developer.novell.com/wiki/index.php/WhyHypercalls
>>
>> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
>> reinvent the wheel?" questions.
>>
>> I will include this information when I publish the updated v2 series
>> with the s/hypercall/dynhc changes.
>>
>> Let me know if you have any questions.
>> 
>
> Greg,
>
> I think comparison is not entirely fair. You're using
> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
> (on Intel) to only one register read:
>
> nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>
> Whereas in a real hypercall for (say) PIO you would need the address,
> size, direction and data.
>   

Hi Marcelo,
   I'll have to respectfully disagree with you here.  What you are
proposing is actually a different test: a 4th type I would call  "PIO
over HC".  It is distinctly different than the existing MMIO, PIO, and
HC tests already present.

I assert that the current HC test remains valid because for pure
hypercalls, the "nr" *is* the address.  It identifies the function to be
executed (e.g. VAPIC_POLL_IRQ = null), just like the PIO address of my
nullio device identifies the function to be executed (i.e.
nullio_write() = null)

My argument is that the HC test emulates the "dynhc()" concept I have
been talking about, whereas the PIOoHC is more like the
pv_io_ops->iowrite approach.

That said, your 4th test type would actually be a very interesting
data-point to add to the suite (especially since we are still kicking
around the notion of doing something like this).  I will update the
patches.


> Also for PIO/MMIO you're adding this unoptimized lookup to the 
> measurement:
>
> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
> if (pio_dev) {
> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
> complete_pio(vcpu); 
> return 1;
> }
>
> Whereas for hypercall measurement you don't.

In theory they should both share about the same algorithmic complexity
in the decode-stage, but due to the possible optimization you mention
you may have a point.  I need to take some steps to ensure the HC path
isn't artificially simplified by GCC (like making the execute stage do
some trivial work like you mention below).

>  I believe a fair comparison
> would be have a shared guest/host memory area where you store guest/host
> TSC values and then do, on guest:
>
>   rdtscll(&shared_area->guest_tsc);
>   pio/mmio/hypercall
>   ... back to host
>   rdtscll(&shared_area->host_tsc);
>
> And then calculate the difference (minus guests TSC_OFFSET of course)?
>
>   
I'm not sure I need that much complexity.  I can probably just change
the test harness to generate an ioread32(), and have the functions
return the TSC value as a return parameter for all test types.  The
important thing is that we pick something extremely cheap (yet dynamic)
to compute so the execution time doesn't invalidate the measurement
granularity with a large constant.

Regards,
-Greg




signature.asc
Description: OpenPGP digital signature


Re: SCSI block device timeouts post-migration on x86_64 linux guests

2009-05-07 Thread Charles Duffy

Marcelo Tosatti wrote:

There's no savevm handler at all:


Ahh! Thank you for the explanation.


There was an incomplete patch posted to qemu-devel a while ago, which I
can't find.


I'm guessing that would be "[PATCH] save/restore for LSI SCSI", by Nolan 
: 
http://permalink.gmane.org/gmane.comp.emulators.qemu/36045


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SCSI block device timeouts post-migration on x86_64 linux guests

2009-05-07 Thread Marcelo Tosatti
Charles,

There's no savevm handler at all:

$ grep register_savevm hw/lsi53c895a.c

There was an incomplete patch posted to qemu-devel a while ago, which I
can't find.

On Tue, Apr 28, 2009 at 04:24:02PM -0500, Charles Duffy wrote:
> Per subject. Observed on kvm-83 through kvm-85; haven't traced back  
> further. No longer occurs when switching to IDE. Manifests with messages  
> similar to the following on restore:
>
> sd 0:0:0:0: [sda] ABORT operation started
> sd 0:0:0:0: ABORT operation timed-out.
> sd 0:0:0:0: [sda] ABORT operation started
> sd 0:0:0:0: ABORT operation timed-out.
> sd 0:0:0:0: [sda] ABORT operation started
> sd 0:0:0:0: ABORT operation timed-out.
> sd 0:0:0:0: [sda] ABORT operation started
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Michael Ellerman
On Fri, 2009-05-08 at 09:25 +0930, Rusty Russell wrote:
> On Thu, 7 May 2009 07:49:53 pm Sheng Yang wrote:
> > On Thursday 07 May 2009 17:53:02 Matthew Wilcox wrote:
> > > Here's a good example.  Let's suppose you have a driver which supports
> > > two different models of cards, one has 16 MSI-X interrupts, the other
> > > has 10.  You can call pci_enable_msix() asking for 16 vectors.  If your
> > > card is model A, you get 16 interrupts.  If your card is model B, it says
> > > "you can have 10".
> 
> Sheng is absolutely right, that's a horrid API.

It's not horrid, though it is tricky - but only for drivers that care,
you can still do:

if (pci_enable_msix(..))
bail;

> If it actually enabled that number and returned it, it might make sense (cf. 
> write() returning less bytes than you give it).  

It could do that, but I think that would be worse. The driver, on
finding out it can only get a certain number of MSIs might need to make
a decision about how it allocates those - eg. in a network driver,
sharing them between TX/RX/management.

And in practice most of the drivers just bail if they can't get what
they asked for, so enabling less than they wanted would just mean they
have to go and disable them.

> But overloading the return 
> value to save an explicit call is just ugly; it's not worth saving a few 
> lines 
> of code at cost of making all the drivers subtle and tricksy.

Looking at just this patch, I would agree, but unfortunately it's not
that simple. The first limitation on the number of MSIs the driver can
have is the number the device supports, that's what this patch does. But
there are others, and they come out of the arch code, or even the
firmware. So to implement pci_how_many_msis(), we'd need a parallel API
all the way down to the arch code, or a flag to all the existing
routines saying "don't really allocate, just find out". That would be
horrid IMHO.

cheers


signature.asc
Description: This is a digitally signed message part


Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Matthew Wilcox
On Fri, May 08, 2009 at 09:25:00AM +0930, Rusty Russell wrote:
> Sheng is absolutely right, that's a horrid API.

But the API already exists, and is in use by 27 drivers.  If this were a
change to the API, I'd be against it, but it is the existing API.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Rusty Russell
On Thu, 7 May 2009 07:49:53 pm Sheng Yang wrote:
> On Thursday 07 May 2009 17:53:02 Matthew Wilcox wrote:
> > Here's a good example.  Let's suppose you have a driver which supports
> > two different models of cards, one has 16 MSI-X interrupts, the other
> > has 10.  You can call pci_enable_msix() asking for 16 vectors.  If your
> > card is model A, you get 16 interrupts.  If your card is model B, it says
> > "you can have 10".

Sheng is absolutely right, that's a horrid API.

If it actually enabled that number and returned it, it might make sense (cf. 
write() returning less bytes than you give it).  But overloading the return 
value to save an explicit call is just ugly; it's not worth saving a few lines 
of code at cost of making all the drivers subtle and tricksy.

Fail with -ENOSPC or something.

Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Marcelo Tosatti
On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote:
> Also for PIO/MMIO you're adding this unoptimized lookup to the 
> measurement:
> 
> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
> if (pio_dev) {
> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
> complete_pio(vcpu); 
> return 1;
> }
> 
> Whereas for hypercall measurement you don't. I believe a fair comparison
> would be have a shared guest/host memory area where you store guest/host
> TSC values and then do, on guest:
> 
>   rdtscll(&shared_area->guest_tsc);
>   pio/mmio/hypercall
>   ... back to host
>   rdtscll(&shared_area->host_tsc);
> 
> And then calculate the difference (minus guests TSC_OFFSET of course)?

Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest"
Core2 Xeon 5130 @2.00Ghz, 4GB RAM.

Also it would be interesting to see the MMIO comparison with EPT/NPT,
it probably sucks much less than what you're seeing.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bridges

2009-05-07 Thread Matthew Palmer
On Thu, May 07, 2009 at 08:57:03AM -0700, Ross Boylan wrote:
> I'm trying to understand bridging with KVM, but am still puzzled.
> I think that the recommended bridging with TAP means that packets from
> the VM will end up going out the host card attached to the default
> gateway.  But it looks to me as if their IP address is unchanged, which
> means replies will never reach me.  Is that correct?  Do I need to NAT
> the packets, or is something already doing that?
> 
> Some documents indicate that I need to bring the interfaces (e.g., eth0)
> down before I bring the bridge up, and that afterwards only the bridge
> will have an IP address.  Is that right?

Here's how I think of a Linux "soft" bridge: the bridge consists of an
Ethernet switch, and a regular interface (named after the bridge) that is
connected to that switch.  This is why you "give an IP address to the
bridge", because "the bridge" is also a NIC of it's own.

If you attach any physical interfaces (eg ethN) to the bridge, they aren't
NICs any more, they're just network cables you plug into the switch to pass
traffic to other switches.  Attaching VMs to the switch is just hooking up
more cables between the switch and the VMs.

If you want your host to do NAT for your VMs, then you do as you would for
any other firewall -- you have one switch (the bridge, in this case) with
all of your VMs and the "internal" interface of the host (in this case, the
bridge as well) all plugged in, and then a second interface to the outside
world (the physical NIC).

> Some documents, e.g.,
> http://ebtables.sourceforge.net/br_fw_ia/br_fw_ia.html, indicate
> iptables should "just work" with bridging.

Yes, iptables *does* "just work" with bridging, in the sense that iptables
can still filter IP packets passing through it's interfaces.  What you
*can't* do, though, is have some sort of magic iptables filter deep in the
bridge that plays with all traffic as it traverses.  For that, there's
ebtables, which is iptables but for Ethernet (rather than IP) traffic. 
Personally, I've never used ebtables in my life.

> However, I've seen someone
> with a 2.6.15 kernel ask about firewalling and be told they needed to
> patch the kernel to get it work (don't have the reference handy).
> Should it just work?

It should Just Work, and if you've got to patch any 2.6 (or even probably
2.4) kernel then you're doing something *very* esoteric.

- Matt
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Marcelo Tosatti
On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote:
> Chris Wright wrote:
> > * Gregory Haskins (ghask...@novell.com) wrote:
> >   
> >> Chris Wright wrote:
> >> 
> >>> VF drivers can also have this issue (and typically use mmio).
> >>> I at least have a better idea what your proposal is, thanks for
> >>> explanation.  Are you able to demonstrate concrete benefit with it yet
> >>> (improved latency numbers for example)?
> >>>   
> >> I had a test-harness/numbers for this kind of thing, but its a bit
> >> crufty since its from ~1.5 years ago.  I will dig it up, update it, and
> >> generate/post new numbers.
> >> 
> >
> > That would be useful, because I keep coming back to pio and shared
> > page(s) when think of why not to do this.  Seems I'm not alone in that.
> >
> > thanks,
> > -chris
> >   
> 
> I completed the resurrection of the test and wrote up a little wiki on
> the subject, which you can find here:
> 
> http://developer.novell.com/wiki/index.php/WhyHypercalls
> 
> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
> reinvent the wheel?" questions.
> 
> I will include this information when I publish the updated v2 series
> with the s/hypercall/dynhc changes.
> 
> Let me know if you have any questions.

Greg,

I think comparison is not entirely fair. You're using
KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
(on Intel) to only one register read:

nr = kvm_register_read(vcpu, VCPU_REGS_RAX);

Whereas in a real hypercall for (say) PIO you would need the address,
size, direction and data.

Also for PIO/MMIO you're adding this unoptimized lookup to the 
measurement:

pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
if (pio_dev) {
kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
complete_pio(vcpu); 
return 1;
}

Whereas for hypercall measurement you don't. I believe a fair comparison
would be have a shared guest/host memory area where you store guest/host
TSC values and then do, on guest:

rdtscll(&shared_area->guest_tsc);
pio/mmio/hypercall
... back to host
rdtscll(&shared_area->host_tsc);

And then calculate the difference (minus guests TSC_OFFSET of course)?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: Use a bitmap for tracking used GSIs

2009-05-07 Thread Alex Williamson
We're currently using a counter to track the most recent GSI we've
handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
assignment with a driver that regularly toggles the MSI enable bit.
This can mean only a few minutes of usable run time.  Instead, track
used GSIs in a bitmap.

Signed-off-by: Alex Williamson 
---

 Applies on top of "kvm: device-assignment: Catch GSI overflow"

 hw/device-assignment.c  |4 ++-
 kvm/libkvm/kvm-common.h |3 +-
 kvm/libkvm/libkvm.c |   68 +--
 kvm/libkvm/libkvm.h |   10 +++
 4 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index e06dd08..5bdae24 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -561,8 +561,10 @@ static void free_dev_irq_entries(AssignedDevice *dev)
 {
 int i;
 
-for (i = 0; i < dev->irq_entries_nr; i++)
+for (i = 0; i < dev->irq_entries_nr; i++) {
 kvm_del_routing_entry(kvm_context, &dev->entry[i]);
+kvm_free_irq_route_gsi(kvm_context, dev->entry[i].gsi);
+}
 free(dev->entry);
 dev->entry = NULL;
 dev->irq_entries_nr = 0;
diff --git a/kvm/libkvm/kvm-common.h b/kvm/libkvm/kvm-common.h
index 591fb53..94f86e5 100644
--- a/kvm/libkvm/kvm-common.h
+++ b/kvm/libkvm/kvm-common.h
@@ -66,8 +66,9 @@ struct kvm_context {
 #ifdef KVM_CAP_IRQ_ROUTING
struct kvm_irq_routing *irq_routes;
int nr_allocated_irq_routes;
+   void *used_gsi_bitmap;
+   int max_gsi;
 #endif
-   int max_used_gsi;
 };
 
 int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory,
diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c
index 2a4165a..43abc7d 100644
--- a/kvm/libkvm/libkvm.c
+++ b/kvm/libkvm/libkvm.c
@@ -1298,8 +1298,6 @@ int kvm_add_routing_entry(kvm_context_t kvm,
new->flags = entry->flags;
new->u = entry->u;
 
-   if (entry->gsi > kvm->max_used_gsi)
-   kvm->max_used_gsi = entry->gsi;
return 0;
 #else
return -ENOSYS;
@@ -1404,20 +1402,72 @@ int kvm_commit_irq_routes(kvm_context_t kvm)
 #endif
 }
 
+#ifdef KVM_CAP_IRQ_ROUTING
+static inline void set_bit(unsigned int *buf, int bit)
+{
+   buf[bit >> 5] |= (1U << (bit & 0x1f));
+}
+
+static inline void clear_bit(unsigned int *buf, int bit)
+{
+   buf[bit >> 5] &= ~(1U << (bit & 0x1f));
+}
+
+static int kvm_find_free_gsi(kvm_context_t kvm)
+{
+   int i, bit, gsi;
+   unsigned int *buf = kvm->used_gsi_bitmap;
+
+   for (i = 0; i < (kvm->max_gsi >> 5); i++) {
+   if (buf[i] != ~0U)
+   break;
+   }
+
+   if (i == kvm->max_gsi >> 5)
+   return -ENOSPC;
+
+   bit = ffs(~buf[i]);
+   if (!bit)
+   return -EAGAIN;
+
+   gsi = (bit - 1) | (i << 5);
+   set_bit(buf, gsi);
+   return gsi;
+}
+#endif
+
 int kvm_get_irq_route_gsi(kvm_context_t kvm)
 {
 #ifdef KVM_CAP_IRQ_ROUTING
-   if (kvm->max_used_gsi >= KVM_IOAPIC_NUM_PINS)  {
-   if (kvm->max_used_gsi + 1 < kvm_get_gsi_count(kvm))
-return kvm->max_used_gsi + 1;
-else
-return -ENOSPC;
-} else
-return KVM_IOAPIC_NUM_PINS;
+   if (!kvm->max_gsi) {
+   int i;
+
+   /* Round the number of GSIs supported to a 4 byte
+* value so we can search it using ints and ffs */
+   i = kvm_get_gsi_count(kvm) & ~0x1f;
+   kvm->used_gsi_bitmap = malloc(i >> 3);
+   if (!kvm->used_gsi_bitmap)
+   return -ENOMEM;
+   memset(kvm->used_gsi_bitmap, 0, i >> 3);
+   kvm->max_gsi = i;
+
+   /* Mark all the IOAPIC pin GSIs as already used */
+   for (i = 0; i <= KVM_IOAPIC_NUM_PINS; i++)
+   set_bit(kvm->used_gsi_bitmap, i);
+   }
+
+   return kvm_find_free_gsi(kvm);
 #else
return -ENOSYS;
 #endif
 }
+ 
+void kvm_free_irq_route_gsi(kvm_context_t kvm, int gsi)
+{
+#ifdef KVM_CAP_IRQ_ROUTING
+   clear_bit(kvm->used_gsi_bitmap, gsi);
+#endif
+}
 
 #ifdef KVM_CAP_DEVICE_MSIX
 int kvm_assign_set_msix_nr(kvm_context_t kvm,
diff --git a/kvm/libkvm/libkvm.h b/kvm/libkvm/libkvm.h
index c23d37b..4e9344c 100644
--- a/kvm/libkvm/libkvm.h
+++ b/kvm/libkvm/libkvm.h
@@ -856,6 +856,16 @@ int kvm_commit_irq_routes(kvm_context_t kvm);
  */
 int kvm_get_irq_route_gsi(kvm_context_t kvm);
 
+/*!
+ * \brief Free used GSI number
+ *
+ * Free used GSI number acquired from kvm_get_irq_route_gsi()
+ *
+ * \param kvm Pointer to the current kvm_context
+ * \param gsi GSI number to free
+ */
+void kvm_free_irq_route_gsi(kvm_context_t kvm, int gsi);
+
 #ifdef KVM_CAP_DEVICE_MSIX
 int kvm_assign_set_msix_nr(kvm_context_t kvm,
   struct kvm_assigned_msix_nr *msix_nr);

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Arnd Bergmann
On Thursday 07 May 2009, Chris Wright wrote:
> 
> > Chris, is that issue with the non ioread/iowrite access of a mangled
> > pointer still an issue here?  I would think so, but I am a bit fuzzy on
> > whether there is still an issue of non-wrapped MMIO ever occuring.
> 
> Arnd was saying it's a bug for other reasons, so perhaps it would work
> out fine.

Well, maybe. I only said that __raw_writel and pointer dereference is
bad, but not writel.

IIRC when we had that discussion about io-workarounds on powerpc,
the outcome was that passing an IORESOURCE_MEM resource into pci_iomap
must still result in something that can be passed into writel in addition
to iowrite32, while an IORESOURCE_IO resource may or may not be valid for
writel and/or outl.

Unfortunately, this means that either readl/writel needs to be adapted
in some way (e.g. the address also ioremapped to the mangled pointer)
or the mechanism will be limited to I/O space accesses.

Maybe BenH remembers the details better than me.

Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright
* Gregory Haskins (gregory.hask...@gmail.com) wrote:
> Arnd Bergmann wrote:
> > pci_iomap could look at the bus device that the PCI function sits on.
> > If it detects a PCI bridge that has a certain property (config space
> > setting, vendor/device ID, ...), it assumes that the device itself
> > will be emulated and it should set the address flag for IO_COND.
> >
> > This implies that all pass-through devices need to be on a different
> > PCI bridge from the emulated devices, which should be fairly
> > straightforward to enforce.

Hmm, this gets to the grey area of the ABI.  I think this would mean an
upgrade of the host would suddenly break when the mgmt tool does:

(qemu) pci_add pci_addr=0:6 host host=01:10.0

> Thats actually a pretty good idea.
> 
> Chris, is that issue with the non ioread/iowrite access of a mangled
> pointer still an issue here?  I would think so, but I am a bit fuzzy on
> whether there is still an issue of non-wrapped MMIO ever occuring.

Arnd was saying it's a bug for other reasons, so perhaps it would work
out fine.

thanks,
-chris
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Arnd Bergmann wrote:
> On Thursday 07 May 2009, Gregory Haskins wrote:
>   
>> What I am not clear on is how you would know to flag the address to
>> begin with.
>> 
>
> pci_iomap could look at the bus device that the PCI function sits on.
> If it detects a PCI bridge that has a certain property (config space
> setting, vendor/device ID, ...), it assumes that the device itself
> will be emulated and it should set the address flag for IO_COND.
>
> This implies that all pass-through devices need to be on a different
> PCI bridge from the emulated devices, which should be fairly
> straightforward to enforce.
>   

Thats actually a pretty good idea.

Chris, is that issue with the non ioread/iowrite access of a mangled
pointer still an issue here?  I would think so, but I am a bit fuzzy on
whether there is still an issue of non-wrapped MMIO ever occuring.

-Greg




signature.asc
Description: OpenPGP digital signature


[patch 3/4] KVM: introduce kvm_arch_can_free_memslot, disallow slot deletion if cached cr3

2009-05-07 Thread mtosatti
Disallow the deletion of memory slots (and aliases, for x86 case), if a
vcpu contains a cr3 that points to such slot/alias.

This complements commit 6c20e1442bb1c62914bb85b7f4a38973d2a423ba.

v2:
- set KVM_REQ_TRIPLE_FAULT
- use __KVM_HAVE_ARCH_CAN_FREE_MEMSLOT to avoid duplication of stub

Signed-off-by: Marcelo Tosatti 

Index: kvm-pending/arch/x86/kvm/x86.c
===
--- kvm-pending.orig/arch/x86/kvm/x86.c
+++ kvm-pending/arch/x86/kvm/x86.c
@@ -1636,6 +1636,29 @@ gfn_t unalias_gfn(struct kvm *kvm, gfn_t
return gfn;
 }
 
+static int kvm_root_gfn_in_range(struct kvm *kvm, gfn_t base_gfn,
+gfn_t end_gfn, bool unalias)
+{
+   struct kvm_vcpu *vcpu;
+   gfn_t root_gfn;
+   int i;
+
+   for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+   vcpu = kvm->vcpus[i];
+   if (!vcpu)
+   continue;
+   root_gfn = vcpu->arch.cr3 >> PAGE_SHIFT;
+   if (unalias)
+   root_gfn = unalias_gfn(kvm, root_gfn);
+   if (root_gfn >= base_gfn && root_gfn <= end_gfn) {
+   set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
+   return 1;
+   }
+   }
+
+   return 0;
+}
+
 /*
  * Set a new alias region.  Aliases map a portion of physical memory into
  * another portion.  This is useful for memory windows, for example the PC
@@ -1666,6 +1689,19 @@ static int kvm_vm_ioctl_set_memory_alias
spin_lock(&kvm->mmu_lock);
 
p = &kvm->arch.aliases[alias->slot];
+
+   /* FIXME: either disallow shrinking alias slots or disable
+* size changes as done with memslots
+*/
+   if (!alias->memory_size) {
+   r = -EBUSY;
+   if (kvm_root_gfn_in_range(kvm, p->base_gfn,
+  p->base_gfn + p->npages - 1,
+  false))
+   goto out_unlock;
+   }
+
+
p->base_gfn = alias->guest_phys_addr >> PAGE_SHIFT;
p->npages = alias->memory_size >> PAGE_SHIFT;
p->target_gfn = alias->target_phys_addr >> PAGE_SHIFT;
@@ -1682,6 +1718,9 @@ static int kvm_vm_ioctl_set_memory_alias
 
return 0;
 
+out_unlock:
+   spin_unlock(&kvm->mmu_lock);
+   up_write(&kvm->slots_lock);
 out:
return r;
 }
@@ -4552,6 +4591,15 @@ void kvm_arch_flush_shadow(struct kvm *k
kvm_mmu_zap_all(kvm);
 }
 
+int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+   int ret;
+
+   ret = kvm_root_gfn_in_range(kvm, slot->base_gfn,
+   slot->base_gfn + slot->npages - 1, true);
+   return !ret;
+}
+
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 {
return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
Index: kvm-pending/include/linux/kvm_host.h
===
--- kvm-pending.orig/include/linux/kvm_host.h
+++ kvm-pending/include/linux/kvm_host.h
@@ -200,6 +200,7 @@ int kvm_arch_set_memory_region(struct kv
struct kvm_memory_slot old,
int user_alloc);
 void kvm_arch_flush_shadow(struct kvm *kvm);
+int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn);
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn);
 unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
Index: kvm-pending/virt/kvm/kvm_main.c
===
--- kvm-pending.orig/virt/kvm/kvm_main.c
+++ kvm-pending/virt/kvm/kvm_main.c
@@ -1061,6 +1061,13 @@ static int kvm_vm_release(struct inode *
return 0;
 }
 
+#ifndef __KVM_HAVE_ARCH_CAN_FREE_MEMSLOT
+int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+   return 1;
+}
+#endif
+
 /*
  * Allocate some memory and give it an address in the guest physical address
  * space.
@@ -1179,8 +1186,13 @@ int __kvm_set_memory_region(struct kvm *
}
 #endif /* not defined CONFIG_S390 */
 
-   if (!npages)
+   if (!npages) {
kvm_arch_flush_shadow(kvm);
+   if (!kvm_arch_can_free_memslot(kvm, memslot)) {
+   r = -EBUSY;
+   goto out_free;
+   }
+   }
 
spin_lock(&kvm->mmu_lock);
if (mem->slot >= kvm->nmemslots)
Index: kvm-pending/arch/x86/include/asm/kvm.h
===
--- kvm-pending.orig/arch/x86/include/asm/kvm.h
+++ kvm-pending/arch/x86/include/asm/kvm.h
@@ -18,6 +18,8 @@
 #define __KVM_HAVE_GUEST_DEBUG
 #define __KVM_HAVE_MSIX
 
+#define __KVM_HAVE_ARCH_CAN_FREE_MEMSLOT
+
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the 

[patch 3/4] KVM: use smp_send_reschedule in kvm_vcpu_kick

2009-05-07 Thread mtosatti
KVM uses a function call IPI to cause the exit of a guest running on a
physical cpu. For virtual interrupt notification there is no need to
wait on IPI receival, or to execute any function.

This is exactly what the reschedule IPI does, without the overhead
of function IPI. So use it instead of smp_call_function_single in
kvm_vcpu_kick.

Also change the "guest_mode" variable to a bit in vcpu->requests, and
use that to collapse multiple IPI's that would be issued between the
first one and zeroing of guest mode.

This allows kvm_vcpu_kick to called with interrupts disabled.

Signed-off-by: Marcelo Tosatti 

Index: kvm-pending/arch/ia64/kernel/irq_ia64.c
===
--- kvm-pending.orig/arch/ia64/kernel/irq_ia64.c
+++ kvm-pending/arch/ia64/kernel/irq_ia64.c
@@ -610,6 +610,9 @@ static struct irqaction ipi_irqaction = 
.name = "IPI"
 };
 
+/*
+ * KVM uses this interrupt to force a cpu out of guest mode
+ */
 static struct irqaction resched_irqaction = {
.handler =  dummy_handler,
.flags =IRQF_DISABLED,
Index: kvm-pending/arch/ia64/kvm/kvm-ia64.c
===
--- kvm-pending.orig/arch/ia64/kvm/kvm-ia64.c
+++ kvm-pending/arch/ia64/kvm/kvm-ia64.c
@@ -668,7 +668,7 @@ again:
host_ctx = kvm_get_host_context(vcpu);
guest_ctx = kvm_get_guest_context(vcpu);
 
-   vcpu->guest_mode = 1;
+   clear_bit(KVM_REQ_KICK, &vcpu->requests);
 
r = kvm_vcpu_pre_transition(vcpu);
if (r < 0)
@@ -685,7 +685,7 @@ again:
kvm_vcpu_post_transition(vcpu);
 
vcpu->arch.launched = 1;
-   vcpu->guest_mode = 0;
+   set_bit(KVM_REQ_KICK, &vcpu->requests);
local_irq_enable();
 
/*
@@ -1879,24 +1879,18 @@ void kvm_arch_hardware_unsetup(void)
 {
 }
 
-static void vcpu_kick_intr(void *info)
-{
-#ifdef DEBUG
-   struct kvm_vcpu *vcpu = (struct kvm_vcpu *)info;
-   printk(KERN_DEBUG"vcpu_kick_intr %p \n", vcpu);
-#endif
-}
-
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 {
-   int ipi_pcpu = vcpu->cpu;
-   int cpu = get_cpu();
+   int me;
+   int cpu = vcpu->cpu;
 
if (waitqueue_active(&vcpu->wq))
wake_up_interruptible(&vcpu->wq);
 
-   if (vcpu->guest_mode && cpu != ipi_pcpu)
-   smp_call_function_single(ipi_pcpu, vcpu_kick_intr, vcpu, 0);
+   me = get_cpu();
+   if (cpu != me && (unsigned) cpu < nr_cpu_ids && cpu_online(cpu))
+   if (!test_and_set_bit(KVM_REQ_KICK, &vcpu->requests))
+   smp_send_reschedule(cpu);
put_cpu();
 }
 
Index: kvm-pending/arch/x86/kernel/smp.c
===
--- kvm-pending.orig/arch/x86/kernel/smp.c
+++ kvm-pending/arch/x86/kernel/smp.c
@@ -172,6 +172,9 @@ void smp_reschedule_interrupt(struct pt_
 {
ack_APIC_irq();
inc_irq_stat(irq_resched_count);
+   /*
+* KVM uses this interrupt to force a cpu out of guest mode
+*/
 }
 
 void smp_call_function_interrupt(struct pt_regs *regs)
Index: kvm-pending/arch/x86/kvm/x86.c
===
--- kvm-pending.orig/arch/x86/kvm/x86.c
+++ kvm-pending/arch/x86/kvm/x86.c
@@ -3259,6 +3259,9 @@ static int vcpu_enter_guest(struct kvm_v
 
local_irq_disable();
 
+   clear_bit(KVM_REQ_KICK, &vcpu->requests);
+   smp_mb__after_clear_bit();
+
if (vcpu->requests || need_resched() || signal_pending(current)) {
local_irq_enable();
preempt_enable();
@@ -3266,13 +3269,6 @@ static int vcpu_enter_guest(struct kvm_v
goto out;
}
 
-   vcpu->guest_mode = 1;
-   /*
-* Make sure that guest_mode assignment won't happen after
-* testing the pending IRQ vector bitmap.
-*/
-   smp_wmb();
-
if (vcpu->arch.exception.pending)
__queue_exception(vcpu);
else
@@ -3317,7 +3313,7 @@ static int vcpu_enter_guest(struct kvm_v
set_debugreg(vcpu->arch.host_dr6, 6);
set_debugreg(vcpu->arch.host_dr7, 7);
 
-   vcpu->guest_mode = 0;
+   set_bit(KVM_REQ_KICK, &vcpu->requests);
local_irq_enable();
 
++vcpu->stat.exits;
@@ -4611,30 +4607,20 @@ int kvm_arch_vcpu_runnable(struct kvm_vc
   || vcpu->arch.nmi_pending;
 }
 
-static void vcpu_kick_intr(void *info)
-{
-#ifdef DEBUG
-   struct kvm_vcpu *vcpu = (struct kvm_vcpu *)info;
-   printk(KERN_DEBUG "vcpu_kick_intr %p \n", vcpu);
-#endif
-}
-
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 {
-   int ipi_pcpu = vcpu->cpu;
-   int cpu;
+   int me;
+   int cpu = vcpu->cpu;
 
if (waitqueue_active(&vcpu->wq)) {
wake_up_interruptible(&vcpu->wq);
++vcpu->stat.halt_wakeup;
}
-   /*
-* We may be called synchronously with irqs disabled in guest m

[patch 0/4] smp_send_reschedule / assigned dev host intx race v2

2009-05-07 Thread mtosatti
Addressing comments.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 2/4] kvm-kmod: smp_send_reschedule compat

2009-05-07 Thread mtosatti
smp_send_reschedule was exported (via smp_ops) in v2.6.24.
Create a compat function which schedules the IPI to keventd context,
in case interrupts are disabled, for kernels < 2.6.24.

Signed-off-by: Marcelo Tosatti 

Index: kvm-kmod/ia64/hack-module.awk
===
--- kvm-kmod.orig/ia64/hack-module.awk
+++ kvm-kmod/ia64/hack-module.awk
@@ -1,4 +1,4 @@
-BEGIN { split("INIT_WORK on_each_cpu smp_call_function " \
+BEGIN { split("INIT_WORK on_each_cpu smp_call_function smp_send_reschedule " \
  "hrtimer_add_expires_ns hrtimer_get_expires " \
  "hrtimer_get_expires_ns hrtimer_start_expires " \
  "hrtimer_expires_remaining " \
Index: kvm-kmod/x86/hack-module.awk
===
--- kvm-kmod.orig/x86/hack-module.awk
+++ kvm-kmod/x86/hack-module.awk
@@ -1,7 +1,7 @@
 BEGIN { split("INIT_WORK desc_struct ldttss_desc64 desc_ptr " \
  "hrtimer_add_expires_ns hrtimer_get_expires " \
  "hrtimer_get_expires_ns hrtimer_start_expires " \
- "hrtimer_expires_remaining " \
+ "hrtimer_expires_remaining smp_send_reschedule " \
  "on_each_cpu relay_open request_irq free_irq" , compat_apis); }
 
 /^int kvm_init\(/ { anon_inodes = 1 }
Index: kvm-kmod/external-module-compat-comm.h
===
--- kvm-kmod.orig/external-module-compat-comm.h
+++ kvm-kmod/external-module-compat-comm.h
@@ -224,6 +224,10 @@ int kvm_smp_call_function_mask(cpumask_t
 
 #define smp_call_function_mask kvm_smp_call_function_mask
 
+void kvm_smp_send_reschedule(int cpu);
+
+#define smp_send_reschedule kvm_smp_send_reschedule
+
 #endif
 
 /* empty_zero_page isn't exported in all kernels */
Index: kvm-kmod/external-module-compat.c
===
--- kvm-kmod.orig/external-module-compat.c
+++ kvm-kmod/external-module-compat.c
@@ -221,6 +221,58 @@ out:
return 0;
 }
 
+#include 
+
+static void vcpu_kick_intr(void *info)
+{
+}
+
+struct kvm_kick {
+   int cpu;
+   struct work_struct work;
+};
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,20)
+static void kvm_do_smp_call_function(void *data)
+{
+   int me;
+   struct kvm_kick *kvm_kick = data;
+#else
+static void kvm_do_smp_call_function(struct work_struct *work)
+{
+   int me;
+   struct kvm_kick *kvm_kick = container_of(work, struct kvm_kick, work);
+#endif
+   me = get_cpu();
+
+   if (kvm_kick->cpu != me)
+   smp_call_function_single(kvm_kick->cpu, vcpu_kick_intr,
+NULL, 0);
+   kfree(kvm_kick);
+   put_cpu();
+}
+
+void kvm_queue_smp_call_function(int cpu)
+{
+   struct kvm_kick *kvm_kick = kmalloc(sizeof(struct kvm_kick), 
GFP_ATOMIC);
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,20)
+   INIT_WORK(&kvm_kick->work, kvm_do_smp_call_function, kvm_kick);
+#else
+   INIT_WORK(&kvm_kick->work, kvm_do_smp_call_function);
+#endif
+
+   schedule_work(&kvm_kick->work);
+}
+
+void kvm_smp_send_reschedule(int cpu)
+{
+   if (irqs_disabled()) {
+   kvm_queue_smp_call_function(cpu);
+   return;
+   }
+   smp_call_function_single(cpu, vcpu_kick_intr, NULL, 0);
+}
 #endif
 
 /* manually export hrtimer_init/start/cancel */


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 4/4] KVM: x86: disallow changing a slots size

2009-05-07 Thread mtosatti
Support to shrinking aliases complicates kernel code unnecessarily,
while userspace can do the same with two operations, delete an alias,
and create a new alias.

Signed-off-by: Marcelo Tosatti 

Index: kvm-pending/arch/x86/kvm/x86.c
===
--- kvm-pending.orig/arch/x86/kvm/x86.c
+++ kvm-pending/arch/x86/kvm/x86.c
@@ -1669,6 +1669,7 @@ static int kvm_vm_ioctl_set_memory_alias
 {
int r, n;
struct kvm_mem_alias *p;
+   unsigned long npages;
 
r = -EINVAL;
/* General sanity checks */
@@ -1690,9 +1691,12 @@ static int kvm_vm_ioctl_set_memory_alias
 
p = &kvm->arch.aliases[alias->slot];
 
-   /* FIXME: either disallow shrinking alias slots or disable
-* size changes as done with memslots
-*/
+   /* Disallow changing an alias slot's size. */
+   npages = alias->memory_size >> PAGE_SHIFT;
+   r = -EINVAL;
+   if (npages && p->npages && npages != p->npages)
+   goto out_unlock;
+
if (!alias->memory_size) {
r = -EBUSY;
if (kvm_root_gfn_in_range(kvm, p->base_gfn,


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 0/4] set_memory_region locking fixes / cr3 vs removal of memslots v2

2009-05-07 Thread mtosatti
Addressing comments.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 4/4] KVM: protect assigned dev workqueue, int handler and irq acker

2009-05-07 Thread mtosatti
kvm_assigned_dev_ack_irq is vulnerable to a race condition with the
interrupt handler function. It does:

if (dev->host_irq_disabled) {
enable_irq(dev->host_irq);
dev->host_irq_disabled = false;
}

If an interrupt triggers before the host->dev_irq_disabled assignment,
it will disable the interrupt and set dev->host_irq_disabled to true.

On return to kvm_assigned_dev_ack_irq, dev->host_irq_disabled is set to
false, and the next kvm_assigned_dev_ack_irq call will fail to reenable
it.

Other than that, having the interrupt handler and work handlers run in
parallel sounds like asking for trouble (could not spot any obvious
problem, but better not have to, its fragile).

CC: sheng.y...@intel.com
Signed-off-by: Marcelo Tosatti 

Index: kvm-pending/include/linux/kvm_host.h
===
--- kvm-pending.orig/include/linux/kvm_host.h
+++ kvm-pending/include/linux/kvm_host.h
@@ -346,6 +346,7 @@ struct kvm_assigned_dev_kernel {
int flags;
struct pci_dev *dev;
struct kvm *kvm;
+   spinlock_t assigned_dev_lock;
 };
 
 struct kvm_irq_mask_notifier {
Index: kvm-pending/virt/kvm/kvm_main.c
===
--- kvm-pending.orig/virt/kvm/kvm_main.c
+++ kvm-pending/virt/kvm/kvm_main.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -130,6 +131,7 @@ static void kvm_assigned_dev_interrupt_w
 * finer-grained lock, update this
 */
mutex_lock(&kvm->lock);
+   spin_lock_irq(&assigned_dev->assigned_dev_lock);
if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
struct kvm_guest_msix_entry *guest_entries =
assigned_dev->guest_msix_entries;
@@ -156,18 +158,21 @@ static void kvm_assigned_dev_interrupt_w
}
}
 
+   spin_unlock_irq(&assigned_dev->assigned_dev_lock);
mutex_unlock(&assigned_dev->kvm->lock);
 }
 
 static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
 {
+   unsigned long flags;
struct kvm_assigned_dev_kernel *assigned_dev =
(struct kvm_assigned_dev_kernel *) dev_id;
 
+   spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
int index = find_index_from_host_irq(assigned_dev, irq);
if (index < 0)
-   return IRQ_HANDLED;
+   goto out;
assigned_dev->guest_msix_entries[index].flags |=
KVM_ASSIGNED_MSIX_PENDING;
}
@@ -177,6 +182,8 @@ static irqreturn_t kvm_assigned_dev_intr
disable_irq_nosync(irq);
assigned_dev->host_irq_disabled = true;
 
+out:
+   spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags);
return IRQ_HANDLED;
 }
 
@@ -184,6 +191,7 @@ static irqreturn_t kvm_assigned_dev_intr
 static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
 {
struct kvm_assigned_dev_kernel *dev;
+   unsigned long flags;
 
if (kian->gsi == -1)
return;
@@ -196,10 +204,12 @@ static void kvm_assigned_dev_ack_irq(str
/* The guest irq may be shared so this ack may be
 * from another device.
 */
+   spin_lock_irqsave(&dev->assigned_dev_lock, flags);
if (dev->host_irq_disabled) {
enable_irq(dev->host_irq);
dev->host_irq_disabled = false;
}
+   spin_unlock_irqrestore(&dev->assigned_dev_lock, flags);
 }
 
 static void deassign_guest_irq(struct kvm *kvm,
@@ -615,6 +625,7 @@ static int kvm_vm_ioctl_assign_device(st
match->host_devfn = assigned_dev->devfn;
match->flags = assigned_dev->flags;
match->dev = dev;
+   spin_lock_init(&match->assigned_dev_lock);
match->irq_source_id = -1;
match->kvm = kvm;
match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/4] kvm-kmod: nr_cpu_ids compat

2009-05-07 Thread mtosatti
Signed-off-by: Marcelo Tosatti 

Index: kvm-kmod/external-module-compat-comm.h
===
--- kvm-kmod.orig/external-module-compat-comm.h
+++ kvm-kmod/external-module-compat-comm.h
@@ -116,6 +116,10 @@ int kvm_smp_call_function_single(int cpu
 
 #endif
 
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,28)
+#define nr_cpu_ids NR_CPUS
+#endif
+
 #include 
 #ifndef KVM_MINOR
 #define KVM_MINOR 232


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/4] KVM: MMU: protect kvm_mmu_change_mmu_pages with mmu_lock

2009-05-07 Thread mtosatti
kvm_handle_hva, called by MMU notifiers, manipulates mmu data only with
the protection of mmu_lock.

Update kvm_mmu_change_mmu_pages callers to take mmu_lock, thus protecting
against kvm_handle_hva.

CC: Andrea Arcangeli 
Signed-off-by: Marcelo Tosatti 

Index: kvm-pending/arch/x86/kvm/mmu.c
===
--- kvm-pending.orig/arch/x86/kvm/mmu.c
+++ kvm-pending/arch/x86/kvm/mmu.c
@@ -2723,7 +2723,6 @@ void kvm_mmu_slot_remove_write_access(st
 {
struct kvm_mmu_page *sp;
 
-   spin_lock(&kvm->mmu_lock);
list_for_each_entry(sp, &kvm->arch.active_mmu_pages, link) {
int i;
u64 *pt;
@@ -2738,7 +2737,6 @@ void kvm_mmu_slot_remove_write_access(st
pt[i] &= ~PT_WRITABLE_MASK;
}
kvm_flush_remote_tlbs(kvm);
-   spin_unlock(&kvm->mmu_lock);
 }
 
 void kvm_mmu_zap_all(struct kvm *kvm)
Index: kvm-pending/arch/x86/kvm/x86.c
===
--- kvm-pending.orig/arch/x86/kvm/x86.c
+++ kvm-pending/arch/x86/kvm/x86.c
@@ -1607,10 +1607,12 @@ static int kvm_vm_ioctl_set_nr_mmu_pages
return -EINVAL;
 
down_write(&kvm->slots_lock);
+   spin_lock(&kvm->mmu_lock);
 
kvm_mmu_change_mmu_pages(kvm, kvm_nr_mmu_pages);
kvm->arch.n_requested_mmu_pages = kvm_nr_mmu_pages;
 
+   spin_unlock(&kvm->mmu_lock);
up_write(&kvm->slots_lock);
return 0;
 }
@@ -1786,7 +1788,9 @@ int kvm_vm_ioctl_get_dirty_log(struct kv
 
/* If nothing is dirty, don't bother messing with page tables. */
if (is_dirty) {
+   spin_lock(&kvm->mmu_lock);
kvm_mmu_slot_remove_write_access(kvm, log->slot);
+   spin_unlock(&kvm->mmu_lock);
kvm_flush_remote_tlbs(kvm);
memslot = &kvm->memslots[log->slot];
n = ALIGN(memslot->npages, BITS_PER_LONG) / 8;
@@ -4530,12 +4534,14 @@ int kvm_arch_set_memory_region(struct kv
}
}
 
+   spin_lock(&kvm->mmu_lock);
if (!kvm->arch.n_requested_mmu_pages) {
unsigned int nr_mmu_pages = kvm_mmu_calculate_mmu_pages(kvm);
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
}
 
kvm_mmu_slot_remove_write_access(kvm, mem->slot);
+   spin_unlock(&kvm->mmu_lock);
kvm_flush_remote_tlbs(kvm);
 
return 0;


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 2/4] KVM: take mmu_lock when updating a deleted slot

2009-05-07 Thread mtosatti
kvm_handle_hva relies on mmu_lock protection to safely access 
the memslot structures.

Signed-off-by: Marcelo Tosatti 

Index: kvm-pending/virt/kvm/kvm_main.c
===
--- kvm-pending.orig/virt/kvm/kvm_main.c
+++ kvm-pending/virt/kvm/kvm_main.c
@@ -1199,8 +1199,10 @@ int __kvm_set_memory_region(struct kvm *
 
kvm_free_physmem_slot(&old, npages ? &new : NULL);
/* Slot deletion case: we have to update the current slot */
+   spin_lock(&kvm->mmu_lock);
if (!npages)
*memslot = old;
+   spin_unlock(&kvm->mmu_lock);
 #ifdef CONFIG_DMAR
/* map the pages in iommu page table */
r = kvm_iommu_map_pages(kvm, base_gfn, npages);


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Arnd Bergmann
On Thursday 07 May 2009, Gregory Haskins wrote:
> What I am not clear on is how you would know to flag the address to
> begin with.

pci_iomap could look at the bus device that the PCI function sits on.
If it detects a PCI bridge that has a certain property (config space
setting, vendor/device ID, ...), it assumes that the device itself
will be emulated and it should set the address flag for IO_COND.

This implies that all pass-through devices need to be on a different
PCI bridge from the emulated devices, which should be fairly
straightforward to enforce.

Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright
* Gregory Haskins (gregory.hask...@gmail.com) wrote:
> After posting my numbers today, what I *can* tell you definitively that
> its significantly slower to VMEXIT via MMIO.  I guess I do not really
> know the reason for sure. :)

there's certainly more work, including insn decoding
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Arnd Bergmann
On Thursday 07 May 2009, Arnd Bergmann wrote:
> An easy way to deal with the pass-through case might be to actually use
> __raw_writel there. In guest-to-guest communication, the two sides are
> known to have the same endianess (I assume) and you can still add the
> appropriate smp_mb() and such into the code.

Ok, that was nonsense. I thought you meant pass-through to a memory range
on the host that is potentially shared with other processes or guests.
For pass-through to a real device, it obviously would not work.

Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Arnd Bergmann
On Thursday 07 May 2009, Gregory Haskins wrote:
> Arnd Bergmann wrote:
> > An mmio that goes through a PF is a bug, it's certainly broken on
> > a number of platforms, so performance should not be an issue there.
> >   
> 
> This may be my own ignorance, but I thought a VMEXIT of type "PF" was
> how MMIO worked in VT/SVM. 

You are right that all MMIOs (and PIO on most non-x86 architectures)
are handled this way in the end. What I meant was that an MMIO that
traps because of a simple pointer dereference as in __raw_writel
is a bug, while any actual writel() call could be diverted to
do an hcall and therefore not cause a PF once the infrastructure
is there.

> I guess the problem that was later pointed out is that we cannot discern
> which devices might be pass-through and therefore should not be
> revectored through a HC.  But I am even less knowledgeable about how
> pass-through works than I am about the MMIO traps, so I might be
> completely off here.

An easy way to deal with the pass-through case might be to actually use
__raw_writel there. In guest-to-guest communication, the two sides are
known to have the same endianess (I assume) and you can still add the
appropriate smp_mb() and such into the code.

Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Chris Wright wrote:
> * Gregory Haskins (gregory.hask...@gmail.com) wrote:
>   
>> What I am not clear on is how you would know to flag the address to
>> begin with.
>> 
>
> That's why I mentioned pv_io_ops->iomap() earlier.  Something I'd expect
> would get called on IORESOURCE_PVIO type.

Yeah, this wasn't clear at the time, but I totally get what you meant
now in retrospect.

-Greg





signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Arnd Bergmann wrote:
> On Thursday 07 May 2009, Gregory Haskins wrote:
>   
>> I guess technically mmio can just be a simple access of the page which
>> would be problematic to trap locally without a PF.  However it seems
>> that most mmio always passes through a ioread()/iowrite() call so this
>> is perhaps the hook point.  If we set the stake in the ground that mmios
>> that go through some other mechanism like PFs can just hit the "slow
>> path" are an acceptable casualty, I think we can make that work.
>>
>> Thoughts?
>> 
>
> An mmio that goes through a PF is a bug, it's certainly broken on
> a number of platforms, so performance should not be an issue there.
>   

This may be my own ignorance, but I thought a VMEXIT of type "PF" was
how MMIO worked in VT/SVM.  I didn't mean to imply that the guest nor
the host took a traditional PF exception in their respective IDT, if
that is what you thought I meant here.  Rather, the mmio region is
unmapped in the guest MMU, access causes a VMEXIT to host-side KVM of
type PF, and the host side code then consults the guest page-table to
see if its an MMIO or not.  I could very well be mistaken as I have only
a cursory understanding of what happens in KVM today with this path.

After posting my numbers today, what I *can* tell you definitively that
its significantly slower to VMEXIT via MMIO.  I guess I do not really
know the reason for sure. :)
> Note that are four commonly used interface classes for PIO/MMIO:
>
> 1. readl/writel: little-endian MMIO
> 2. inl/outl: little-endian PIO
> 3. ioread32/iowrite32: converged little-endian PIO/MMIO
> 4. __raw_readl/__raw_writel: native-endian MMIO without checks
>
> You don't need to worry about the __raw_* stuff, as this should never
> be used in device drivers.
>
> As a simplification, you could mandate that all drivers that want to
> use this get converted to the ioread/iowrite class of interfaces and
> leave the others slow.
>   

I guess the problem that was later pointed out is that we cannot discern
which devices might be pass-through and therefore should not be
revectored through a HC.  But I am even less knowledgeable about how
pass-through works than I am about the MMIO traps, so I might be
completely off here.

In any case, thank you kindly for the suggestions.

Regards,
-Greg



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>>> Oh yes.  But don't call it dynhc - like Chris says it's the wrong
>>> semantic.
>>>
>>> Since we want to connect it to an eventfd, call it HC_NOTIFY or
>>> HC_EVENT or something along these lines.  You won't be able to pass
>>> any data, but that's fine.  Registers are saved to memory anyway.
>>> 
>> Ok, but how would you access the registers since you would presumably
>> only be getting a waitq::func callback on the eventfd.  Or were you
>> saying that more data, if required, is saved in a side-band memory
>> location?  I can see the latter working. 
>
> Yeah.  You basically have that side-band in vbus shmem (or the virtio
> ring).

Ok, got it.
>
>>  I can't wrap my head around
>> the former.
>>   
>
> I only meant that registers aren't faster than memory, since they are
> just another memory location.
>
> In fact registers are accessed through a function call (not that that
> takes any time these days).
>
>
>>> Just to make sure we have everything plumbed down, here's how I see
>>> things working out (using qemu and virtio, use sed to taste):
>>>
>>> 1. qemu starts up, sets up the VM
>>> 2. qemu creates virtio-net-server
>>> 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx
>>> ring, one set for rx ring)
>>> 4. qemu connects the six eventfd to the data-available,
>>> data-not-available, and kick ports of virtio-net-server
>>> 5. the guest starts up and configures virtio-net in pci pin mode
>>> 6. qemu notices and decides it will manage interrupts in user space
>>> since this is complicated (shared level triggered interrupts)
>>> 7. the guest OS boots, loads device driver
>>> 8. device driver switches virtio-net to msix mode
>>> 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the
>>> notify fds as notifyfd
>>> 10. look ma, no hands.
>>>
>>> Under the hood, the following takes place.
>>>
>>> kvm wires the irqfds to schedule a work item which fires the
>>> interrupt.  One day the kvm developers get their act together and
>>> change it to inject the interrupt directly when the irqfd is signalled
>>> (which could be from the net softirq or somewhere similarly nasty).
>>>
>>> virtio-net-server wires notifyfd according to its liking.  It may
>>> schedule a thread, or it may execute directly.
>>>
>>> And they all lived happily ever after.
>>> 
>>
>> Ack.  I hope when its all said and done I can convince you that the
>> framework to code up those virtio backends in the kernel is vbus ;)
>
> If vbus doesn't bring significant performance advantages, I'll prefer
> virtio because of existing investment.

Just to clarify: vbus is just the container/framework for the in-kernel
models.  You can implement and deploy virtio devices inside the
container (tho I haven't had a chance to sit down and implement one
yet).  Note that I did publish a virtio transport in the last few series
to demonstrate how that might work, so its just ripe for the picking if
someone is so inclined.

So really the question is whether you implement the in-kernel virtio
backend in vbus, in some other framework, or just do it standalone.

-Greg





signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright
* Gregory Haskins (gregory.hask...@gmail.com) wrote:
> What I am not clear on is how you would know to flag the address to
> begin with.

That's why I mentioned pv_io_ops->iomap() earlier.  Something I'd expect
would get called on IORESOURCE_PVIO type.  This isn't really transparent
though (only virtio devices basically), kind of like you're saying below.

> Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it
> would never be a real device).  This means we route any io requests from
> virtio-pci though pv_io_ops->mmio(), but not unflagged addresses.  This
> is not as slick as boosting *everyones* mmio speed as Avi's original
> idea would have, but it is perhaps a good tradeoff between the entirely
> new namespace created by my original dynhc() proposal and leaving them
> all PF based.
>
> This way, its just like using my dynhc() proposal except the mmio-addr
> is the substitute address-token (instead of the dynhc-vector). 
> Additionally, if you do not PV the kernel the IO_COND/pv_io_op is
> ignored and it just slow-paths through the PF as it does today.  Dynhc()
> would be dependent  on pv_ops.
> 
> Thoughts?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

Oh yes.  But don't call it dynhc - like Chris says it's the wrong
semantic.

Since we want to connect it to an eventfd, call it HC_NOTIFY or
HC_EVENT or something along these lines.  You won't be able to pass
any data, but that's fine.  Registers are saved to memory anyway.


Ok, but how would you access the registers since you would presumably
only be getting a waitq::func callback on the eventfd.  Or were you
saying that more data, if required, is saved in a side-band memory
location?  I can see the latter working. 


Yeah.  You basically have that side-band in vbus shmem (or the virtio ring).


 I can't wrap my head around
the former.
  


I only meant that registers aren't faster than memory, since they are 
just another memory location.


In fact registers are accessed through a function call (not that that 
takes any time these days).




Just to make sure we have everything plumbed down, here's how I see
things working out (using qemu and virtio, use sed to taste):

1. qemu starts up, sets up the VM
2. qemu creates virtio-net-server
3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx
ring, one set for rx ring)
4. qemu connects the six eventfd to the data-available,
data-not-available, and kick ports of virtio-net-server
5. the guest starts up and configures virtio-net in pci pin mode
6. qemu notices and decides it will manage interrupts in user space
since this is complicated (shared level triggered interrupts)
7. the guest OS boots, loads device driver
8. device driver switches virtio-net to msix mode
9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the
notify fds as notifyfd
10. look ma, no hands.

Under the hood, the following takes place.

kvm wires the irqfds to schedule a work item which fires the
interrupt.  One day the kvm developers get their act together and
change it to inject the interrupt directly when the irqfd is signalled
(which could be from the net softirq or somewhere similarly nasty).

virtio-net-server wires notifyfd according to its liking.  It may
schedule a thread, or it may execute directly.

And they all lived happily ever after.



Ack.  I hope when its all said and done I can convince you that the
framework to code up those virtio backends in the kernel is vbus ;)


If vbus doesn't bring significant performance advantages, I'll prefer 
virtio because of existing investment.



  But
even if not, this should provide enough plumbing that we can all coexist
together peacefully.
  


Yes, vbus and virtio can compete on their merits without bias from some 
maintainer getting in the way.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>>> Don't - it's broken.  It will also catch device assignment mmio and
>>> hypercall them.
>>>
>>> 
>> Ah.  Crap.
>>
>> Would you be conducive if I continue along with the dynhc() approach
>> then?
>>   
>
> Oh yes.  But don't call it dynhc - like Chris says it's the wrong
> semantic.
>
> Since we want to connect it to an eventfd, call it HC_NOTIFY or
> HC_EVENT or something along these lines.  You won't be able to pass
> any data, but that's fine.  Registers are saved to memory anyway.
Ok, but how would you access the registers since you would presumably
only be getting a waitq::func callback on the eventfd.  Or were you
saying that more data, if required, is saved in a side-band memory
location?  I can see the latter working.  I can't wrap my head around
the former.

>
> And btw, given that eventfd and the underlying infrastructure are so
> flexible, it's probably better to go back to your original "irqfd gets
> fd from userspace" just to be consistent everywhere.
>
> (no, I'm not deliberately making you rewrite that patch again and
> again... it's going to be a key piece of infrastructure so I want to
> get it right)

Ok, np.  Actually now that Davide showed me the waitq::func trick, the
fd technically doesn't even need to be an eventfd per se.  We can just
plain-old "fget()" it and attach via the f_ops->poll() as I do in v5. 
Ill submit this later today.

>
>
> Just to make sure we have everything plumbed down, here's how I see
> things working out (using qemu and virtio, use sed to taste):
>
> 1. qemu starts up, sets up the VM
> 2. qemu creates virtio-net-server
> 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx
> ring, one set for rx ring)
> 4. qemu connects the six eventfd to the data-available,
> data-not-available, and kick ports of virtio-net-server
> 5. the guest starts up and configures virtio-net in pci pin mode
> 6. qemu notices and decides it will manage interrupts in user space
> since this is complicated (shared level triggered interrupts)
> 7. the guest OS boots, loads device driver
> 8. device driver switches virtio-net to msix mode
> 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the
> notify fds as notifyfd
> 10. look ma, no hands.
>
> Under the hood, the following takes place.
>
> kvm wires the irqfds to schedule a work item which fires the
> interrupt.  One day the kvm developers get their act together and
> change it to inject the interrupt directly when the irqfd is signalled
> (which could be from the net softirq or somewhere similarly nasty).
>
> virtio-net-server wires notifyfd according to its liking.  It may
> schedule a thread, or it may execute directly.
>
> And they all lived happily ever after.

Ack.  I hope when its all said and done I can convince you that the
framework to code up those virtio backends in the kernel is vbus ;)  But
even if not, this should provide enough plumbing that we can all coexist
together peacefully.

Thanks,
-Greg



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Arnd Bergmann
On Thursday 07 May 2009, Gregory Haskins wrote:
> I guess technically mmio can just be a simple access of the page which
> would be problematic to trap locally without a PF.  However it seems
> that most mmio always passes through a ioread()/iowrite() call so this
> is perhaps the hook point.  If we set the stake in the ground that mmios
> that go through some other mechanism like PFs can just hit the "slow
> path" are an acceptable casualty, I think we can make that work.
> 
> Thoughts?

An mmio that goes through a PF is a bug, it's certainly broken on
a number of platforms, so performance should not be an issue there.

Note that are four commonly used interface classes for PIO/MMIO:

1. readl/writel: little-endian MMIO
2. inl/outl: little-endian PIO
3. ioread32/iowrite32: converged little-endian PIO/MMIO
4. __raw_readl/__raw_writel: native-endian MMIO without checks

You don't need to worry about the __raw_* stuff, as this should never
be used in device drivers.

As a simplification, you could mandate that all drivers that want to
use this get converted to the ioread/iowrite class of interfaces and
leave the others slow.

Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Avi Kivity wrote:


I think we just past the "too complicated" threshold.



And the "can't spel" threshold in the same sentence.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

Don't - it's broken.  It will also catch device assignment mmio and
hypercall them.



Ah.  Crap.

Would you be conducive if I continue along with the dynhc() approach then?
  


Oh yes.  But don't call it dynhc - like Chris says it's the wrong semantic.

Since we want to connect it to an eventfd, call it HC_NOTIFY or HC_EVENT 
or something along these lines.  You won't be able to pass any data, but 
that's fine.  Registers are saved to memory anyway.


And btw, given that eventfd and the underlying infrastructure are so 
flexible, it's probably better to go back to your original "irqfd gets 
fd from userspace" just to be consistent everywhere.


(no, I'm not deliberately making you rewrite that patch again and 
again... it's going to be a key piece of infrastructure so I want to get 
it right)


Just to make sure we have everything plumbed down, here's how I see 
things working out (using qemu and virtio, use sed to taste):


1. qemu starts up, sets up the VM
2. qemu creates virtio-net-server
3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx 
ring, one set for rx ring)
4. qemu connects the six eventfd to the data-available, 
data-not-available, and kick ports of virtio-net-server

5. the guest starts up and configures virtio-net in pci pin mode
6. qemu notices and decides it will manage interrupts in user space 
since this is complicated (shared level triggered interrupts)

7. the guest OS boots, loads device driver
8. device driver switches virtio-net to msix mode
9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the 
notify fds as notifyfd

10. look ma, no hands.

Under the hood, the following takes place.

kvm wires the irqfds to schedule a work item which fires the interrupt.  
One day the kvm developers get their act together and change it to 
inject the interrupt directly when the irqfd is signalled (which could 
be from the net softirq or somewhere similarly nasty).


virtio-net-server wires notifyfd according to its liking.  It may 
schedule a thread, or it may execute directly.


And they all lived happily ever after.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Chris Wright wrote:
> * Gregory Haskins (ghask...@novell.com) wrote:
>   
>> Chris Wright wrote:
>> 
>>> * Avi Kivity (a...@redhat.com) wrote:
>>>   
 Gregory Haskins wrote:
 
> Cool,  I will code this up and submit it.  While Im at it, Ill run it
> through the "nullio" ringer, too. ;)  It would be cool to see the
> pv-mmio hit that 2.07us number.  I can't think of any reason why this
> will not be the case.
>   
>   
 Don't - it's broken.  It will also catch device assignment mmio and  
 hypercall them.
 
>>> Not necessarily.  It just needs to be creative w/ IO_COND
>>>   
>> Hi Chris,
>>Could you elaborate?  How would you know which pages to hypercall and
>> which to let PF?
>> 
>
> Was just thinking of some ugly mangling of the addr (I'm not entirely
> sure what would work best).
>   
Right, I get the part about flagging the address and then keying off
that flag in IO_COND (like we do for PIO vs MMIO).

What I am not clear on is how you would know to flag the address to
begin with.

Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it
would never be a real device).  This means we route any io requests from
virtio-pci though pv_io_ops->mmio(), but not unflagged addresses.  This
is not as slick as boosting *everyones* mmio speed as Avi's original
idea would have, but it is perhaps a good tradeoff between the entirely
new namespace created by my original dynhc() proposal and leaving them
all PF based.

This way, its just like using my dynhc() proposal except the mmio-addr
is the substitute address-token (instead of the dynhc-vector). 
Additionally, if you do not PV the kernel the IO_COND/pv_io_op is
ignored and it just slow-paths through the PF as it does today.  Dynhc()
would be dependent  on pv_ops.

Thoughts?

-Greg



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Chris Wright wrote:

* Gregory Haskins (ghask...@novell.com) wrote:
  

Chris Wright wrote:


* Avi Kivity (a...@redhat.com) wrote:
  

Gregory Haskins wrote:


Cool,  I will code this up and submit it.  While Im at it, Ill run it
through the "nullio" ringer, too. ;)  It would be cool to see the
pv-mmio hit that 2.07us number.  I can't think of any reason why this
will not be the case.
  
  
Don't - it's broken.  It will also catch device assignment mmio and  
hypercall them.


Not necessarily.  It just needs to be creative w/ IO_COND
  

Hi Chris,
   Could you elaborate?  How would you know which pages to hypercall and
which to let PF?



Was just thinking of some ugly mangling of the addr (I'm not entirely
sure what would work best).
  


I think we just past the "too complicated" threshold.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright
* Gregory Haskins (ghask...@novell.com) wrote:
> Chris Wright wrote:
> > * Avi Kivity (a...@redhat.com) wrote:
> >> Gregory Haskins wrote:
> >>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
> >>> through the "nullio" ringer, too. ;)  It would be cool to see the
> >>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
> >>> will not be the case.
> >>>   
> >> Don't - it's broken.  It will also catch device assignment mmio and  
> >> hypercall them.
> >
> > Not necessarily.  It just needs to be creative w/ IO_COND
> 
> Hi Chris,
>Could you elaborate?  How would you know which pages to hypercall and
> which to let PF?

Was just thinking of some ugly mangling of the addr (I'm not entirely
sure what would work best).
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bridges

2009-05-07 Thread Cam Macdonell

Ross Boylan wrote:

On Thu, 2009-05-07 at 11:13 -0600, Cam Macdonell wrote:

Ross Boylan wrote:

I'm trying to understand bridging with KVM, but am still puzzled.
I think that the recommended bridging with TAP means that packets

from

the VM will end up going out the host card attached to the default
gateway.  But it looks to me as if their IP address is unchanged,

which

means replies will never reach me.  Is that correct?  Do I need to

NAT

the packets, or is something already doing that?

Hi Ross,

This is the place to start

http://www.linux-kvm.org/page/Networking.  

I saw that; it gives some recipes but I wasn't sure what their effect
was.


You want a public bridge.

I'm not sure what "their" and "me" mean in your email.  In short,
with 
bridging each VM has its own IP and that VM can be accessed directly 
from the network.

"their" = the VM.
"me" = my host machine.

So if the VM's are running on their own subnet, 


VMs do not run on their own subnet with bridged networking.


e.g., 10.0.2.* (I've
been assuming the subnet with TAP is like the one with the User Mode
Network stack in 3.7.3 of http://www.nongnu.org/qemu/qemu-doc.html) and
my host machine is on another net, e.g., 10.0.8.* then I think the
packet will go out with an IP of 10.0.2.2 (say).  When some other
machine tries to reply to 10.0.2.2, the packet gets lost because the
outside network thinks 10.0.2.* is not for it.  At least that's my
concern.  If the return IP address on the packet were 10.0.8.44
(supposing that's the IP of my host machine) then the packets could find
their way back.


Using bridged networking is very different from the user stack.  The 
user stack is extremely limited and slow.




My host machine is on an internal network with a 10.* IP.  The example
might be clearer if one supposed that the VM's were on a 192.168.*
network.

I am perhaps being influenced by the fact that I don't want to ask for
more IP's, so I don't want to configure the VM's to use an IP on our
10.0.8 network.


Then you probably want to use a NAT network.  A NAT setup puts all the 
VMs on their own network within the host machine.  iptables is necessary 
to forward the subnet packets out to the world and back.


Here is some older documentation, but not much has changed.  Look at the 
first entry under "Advanced Networking".


https://help.ubuntu.com/community/KVMFeisty


Does the TAP networking setup a whole subnet like the user mode network
stack (e.g., running a DHCP server), or is the idea that I would just
give the VM an IP on my subnet (10.0.8.*) in this example?


No, bridge networking using taps (one tap per VM) and effectively sits 
all the VMs on the same network your host is on.  You would need to get 
IPs from sysadmin for each VM.



If the latter is the case (I'm now suspecting it is) then I think the
solution is clear.  I just stick the VM's on a private (to my machine)
subnet, like 192.168.*, and I do NAT on the packets as they go out.


NAT is a very common solution.  Use VDE (vde.sourceforget.net) to create 
a virtual switch on your host for the VMs.  dnsmasq can serve dynamic 
IPs to the VMs on their own subnet that doesn't bother your sysadmin at 
all.  Use iptables to forward and receive packets through your host's 
NIC.


Cam
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Chris Wright wrote:
> * Avi Kivity (a...@redhat.com) wrote:
>   
>> Gregory Haskins wrote:
>> 
>>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
>>> through the "nullio" ringer, too. ;)  It would be cool to see the
>>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
>>> will not be the case.
>>>   
>> Don't - it's broken.  It will also catch device assignment mmio and  
>> hypercall them.
>> 
>
> Not necessarily.  It just needs to be creative w/ IO_COND
>   

Hi Chris,
   Could you elaborate?  How would you know which pages to hypercall and
which to let PF?

-Greg



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright
* Avi Kivity (a...@redhat.com) wrote:
> Gregory Haskins wrote:
>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
>> through the "nullio" ringer, too. ;)  It would be cool to see the
>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
>> will not be the case.
>
> Don't - it's broken.  It will also catch device assignment mmio and  
> hypercall them.

Not necessarily.  It just needs to be creative w/ IO_COND
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>
 I guess technically mmio can just be a simple access of the page which
 would be problematic to trap locally without a PF.  However it seems
 that most mmio always passes through a ioread()/iowrite() call so this
 is perhaps the hook point.  If we set the stake in the ground that
 mmios
 that go through some other mechanism like PFs can just hit the "slow
 path" are an acceptable casualty, I think we can make that work.
 
>>> That's my thinking exactly.
>>> 
>>
>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
>> through the "nullio" ringer, too. ;)  It would be cool to see the
>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
>> will not be the case.
>>   
>
> Don't - it's broken.  It will also catch device assignment mmio and
> hypercall them.
>
Ah.  Crap.

Would you be conducive if I continue along with the dynhc() approach then?

-Greg



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

Avi Kivity wrote:
  

Gregory Haskins wrote:


I guess technically mmio can just be a simple access of the page which
would be problematic to trap locally without a PF.  However it seems
that most mmio always passes through a ioread()/iowrite() call so this
is perhaps the hook point.  If we set the stake in the ground that mmios
that go through some other mechanism like PFs can just hit the "slow
path" are an acceptable casualty, I think we can make that work.
  
  

That's my thinking exactly.



Cool,  I will code this up and submit it.  While Im at it, Ill run it
through the "nullio" ringer, too. ;)  It would be cool to see the
pv-mmio hit that 2.07us number.  I can't think of any reason why this
will not be the case.
  


Don't - it's broken.  It will also catch device assignment mmio and 
hypercall them.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 3/4] KVM: introduce kvm_arch_can_free_memslot, disallow slot deletion if cached cr3

2009-05-07 Thread Marcelo Tosatti
On Thu, May 07, 2009 at 05:16:35PM +0300, Avi Kivity wrote:
> mtosa...@redhat.com wrote:
>> Disallow the deletion of memory slots (and aliases, for x86 case), if a
>> vcpu contains a cr3 that points to such slot/alias.
>>   
>
> That allows the guest to induce failures in the host.  

I don't understand what you mean. What is the problem with returning
errors in the ioctl handlers?

The guest can cause an overflow in qemu, overwrite the parameters to
KVM_GET_MSR_INDEX_LIST in an attempt to read kernel data, and get
-E2BIG. Or pick your combination.

> Better to triple-fault the guest instead.

Sure can additionally triple fault it, but the kernel might attempt to
access the non-existant slot which cr3 points to before TRIPLE_FAULT is
processed. So you have to avoid that possibility in the first place,
thats why the patch modifies the ioctls to fail.

>>  +int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot 
>> *slot)
>> +{
>> +return 1;
>> +}
>> +
>>   
>
> In general, instead of stubs in every arch, have x86 say  
> KVM_HAVE_ARCH_CAN_FREE_MEMSLOT and define the stub in generic code when  
> that define is not present.

Will fix that.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


qemu: kvm: avoid harmless unhandled wrmsr 0xc0010117 messages

2009-05-07 Thread Marcelo Tosatti

Olders kernel which don't contain kvm.git commit
61a6bd672bda3b9468bf5895c1be085c4e481138 display the following message:

kvm: 32301: cpu0 unhandled wrmsr: 0xc0010117 data 0 

When kvm_arch_load_regs is called. This is confusing in bug reports.

Avoid it by checking whether the host advertises the MSR, similarly to
how MSR_STAR is handled.

Signed-off-by: Marcelo Tosatti 


diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 06ef775..5d19705 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -25,6 +25,7 @@
 static struct kvm_msr_list *kvm_msr_list;
 extern unsigned int kvm_shadow_memory;
 static int kvm_has_msr_star;
+static int kvm_has_vm_hsave_pa;
 
 static int lm_capable_kernel;
 
@@ -54,10 +55,14 @@ int kvm_arch_qemu_create_context(void)
 kvm_msr_list = kvm_get_msr_list(kvm_context);
 if (!kvm_msr_list)
return -1;
-for (i = 0; i < kvm_msr_list->nmsrs; ++i)
+for (i = 0; i < kvm_msr_list->nmsrs; ++i) {
if (kvm_msr_list->indices[i] == MSR_STAR)
kvm_has_msr_star = 1;
-   return 0;
+if (kvm_msr_list->indices[i] == MSR_VM_HSAVE_PA)
+kvm_has_vm_hsave_pa = 1;
+}
+
+return 0;
 }
 
 static void set_msr_entry(struct kvm_msr_entry *entry, uint32_t index,
@@ -260,7 +265,8 @@ void kvm_arch_load_regs(CPUState *env)
 set_msr_entry(&msrs[n++], MSR_IA32_SYSENTER_EIP, env->sysenter_eip);
 if (kvm_has_msr_star)
set_msr_entry(&msrs[n++], MSR_STAR,  env->star);
-set_msr_entry(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
+if (kvm_has_vm_hsave_pa)
+set_msr_entry(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 set_msr_entry(&msrs[n++], MSR_CSTAR, env->cstar);
@@ -435,7 +441,8 @@ void kvm_arch_save_regs(CPUState *env)
 if (kvm_has_msr_star)
msrs[n++].index = MSR_STAR;
 msrs[n++].index = MSR_IA32_TSC;
-msrs[n++].index = MSR_VM_HSAVE_PA;
+if (kvm_has_vm_hsave_pa)
+msrs[n++].index = MSR_VM_HSAVE_PA;
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 msrs[n++].index = MSR_CSTAR;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> I guess technically mmio can just be a simple access of the page which
>> would be problematic to trap locally without a PF.  However it seems
>> that most mmio always passes through a ioread()/iowrite() call so this
>> is perhaps the hook point.  If we set the stake in the ground that mmios
>> that go through some other mechanism like PFs can just hit the "slow
>> path" are an acceptable casualty, I think we can make that work.
>>   
>
> That's my thinking exactly.

Cool,  I will code this up and submit it.  While Im at it, Ill run it
through the "nullio" ringer, too. ;)  It would be cool to see the
pv-mmio hit that 2.07us number.  I can't think of any reason why this
will not be the case.

>
> Note we can cheat further.  kvm already has a "coalesced mmio" feature
> where side-effect-free mmios are collected in the kernel and passed to
> userspace only when some other significant event happens.  We could
> pass those addresses to the guest and let it queue those writes
> itself, avoiding the hypercall completely.
>
> Though it's probably pointless: if the guest is paravirtualized enough
> to have the mmio hypercall, then it shouldn't be using e1000.

Yeah...plus at least for my vbus purposes, all my my guest->host
transitions are explicitly to cause side-effects, or I wouldn't be doing
them in the first place ;)  I suspect virtio-pci is exactly the same. 
I.e. the coalescing has already been done at a higher layer for
platforms running "PV" code.

Still a cool feature, tho.

-Greg




signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

I guess technically mmio can just be a simple access of the page which
would be problematic to trap locally without a PF.  However it seems
that most mmio always passes through a ioread()/iowrite() call so this
is perhaps the hook point.  If we set the stake in the ground that mmios
that go through some other mechanism like PFs can just hit the "slow
path" are an acceptable casualty, I think we can make that work.
  


That's my thinking exactly.

Note we can cheat further.  kvm already has a "coalesced mmio" feature 
where side-effect-free mmios are collected in the kernel and passed to 
userspace only when some other significant event happens.  We could pass 
those addresses to the guest and let it queue those writes itself, 
avoiding the hypercall completely.


Though it's probably pointless: if the guest is paravirtualized enough 
to have the mmio hypercall, then it shouldn't be using e1000.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Intel-IOMMU, intr-remap: source-id checking

2009-05-07 Thread Suresh Siddha
On Wed, 2009-05-06 at 23:16 -0700, Han, Weidong wrote:
> @@ -634,6 +694,44 @@ static int ir_parse_ioapic_scope(struct acpi_dmar_header 
> *header,
>  " 0x%Lx\n", scope->enumeration_id,
>  drhd->address);
>  
> + bus = pci_find_bus(drhd->segment, scope->bus);
> + path = (struct acpi_dmar_pci_path *)(scope + 1);
> + count = (scope->length -
> +  sizeof(struct acpi_dmar_device_scope))
> + / sizeof(struct acpi_dmar_pci_path);
> +
> + while (count) {
> + if (pdev)
> + pci_dev_put(pdev);
> +
> + if (!bus)
> + break;
> +
> + pdev = pci_get_slot(bus,
> + PCI_DEVFN(path->dev, path->fn));
> + if (!pdev)
> + break;

ir_parse_ioapic_scope() happens very early in the boot. So, I don't
think we can do the pci related discovery here.

thanks,
suresh

> +
> + path++;
> + count--;
> + bus = pdev->subordinate;
> + }
> +
> + if (pdev) { /* PCI discoverable IOAPIC*/
> + ir_ioapic[ir_ioapic_num].bus =
> + pdev->bus->number;
> + ir_ioapic[ir_ioapic_num].devfn = pdev->devfn;
> + } else { /* Not PCI discoverable IOAPIC */
> + if (!bus)
> + ir_ioapic[ir_ioapic_num].bus =
> + scope->bus;
> + else
> + ir_ioapic[ir_ioapic_num].bus =
> + bus->number;
> + ir_ioapic[ir_ioapic_num].devfn =
> + PCI_DEVFN(path->dev, path->fn);
> + }
> +
>   ir_ioapic[ir_ioapic_num].iommu = iommu;
>   ir_ioapic[ir_ioapic_num].id = scope->enumeration_id;
>   ir_ioapic_num++;


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>>> What do you think of my mmio hypercall?  That will speed up all mmio
>>> to be as fast as a hypercall, and then we can use ordinary mmio/pio
>>> writes to trigger things.
>>>
>>> 
>> I like it!
>>
>> Bigger question is what kind of work goes into making mmio a pv_op (or
>> is this already done)?
>>
>>   
>
> Looks like it isn't there.  But it isn't any different than set_pte -
> convert a write into a hypercall.
>
>

I guess technically mmio can just be a simple access of the page which
would be problematic to trap locally without a PF.  However it seems
that most mmio always passes through a ioread()/iowrite() call so this
is perhaps the hook point.  If we set the stake in the ground that mmios
that go through some other mechanism like PFs can just hit the "slow
path" are an acceptable casualty, I think we can make that work.

Thoughts?

-Greg



signature.asc
Description: OpenPGP digital signature


Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Avi Kivity

Davide Libenzi wrote:

On Thu, 7 May 2009, Avi Kivity wrote:

  

Davide Libenzi wrote:

  
  

What's your take on adding irq context safe callbacks to irqfd?

To give some background here, we would like to use eventfd as a generic
connector between components, so the components do not know about each
other.
So far eventfd successfully abstracts among components in the same
process, in
different processes, and in the kernel.

eventfd_signal() can be safely called from irq context, and will wake up a
waiting task.  But in some cases, if the consumer is in the kernel, it may
be
able to consume the event from irq context, saving a context switch.

So, will you consider patches adding this capability to eventfd?



Maybe I got lost in the thread, but inside the kernel we have callback-based
wakeup since long time. This is what epoll uses, when hooking into the file*
f_op->poll() subsystem.
Did you mean something else?
  
  

Do you mean wait_queue_t::func?



Yes, it is since long time ago that a wakeup in Linux can lead either to a 
real wakeup (in scheduler terms), or to a simple function call.

Isn't that enough?

  


It's perfect.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

What do you think of my mmio hypercall?  That will speed up all mmio
to be as fast as a hypercall, and then we can use ordinary mmio/pio
writes to trigger things.



I like it!

Bigger question is what kind of work goes into making mmio a pv_op (or
is this already done)?

  


Looks like it isn't there.  But it isn't any different than set_pte - 
convert a write into a hypercall.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> I completed the resurrection of the test and wrote up a little wiki on
>> the subject, which you can find here:
>>
>> http://developer.novell.com/wiki/index.php/WhyHypercalls
>>
>> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
>> reinvent the wheel?" questions.
>>
>> I will include this information when I publish the updated v2 series
>> with the s/hypercall/dynhc changes.
>>
>> Let me know if you have any questions.
>>   
>
> Well, 420 ns is not to be sneezed at.
>
> What do you think of my mmio hypercall?  That will speed up all mmio
> to be as fast as a hypercall, and then we can use ordinary mmio/pio
> writes to trigger things.
>
I like it!

Bigger question is what kind of work goes into making mmio a pv_op (or
is this already done)?

-Greg



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

I completed the resurrection of the test and wrote up a little wiki on
the subject, which you can find here:

http://developer.novell.com/wiki/index.php/WhyHypercalls

Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
reinvent the wheel?" questions.

I will include this information when I publish the updated v2 series
with the s/hypercall/dynhc changes.

Let me know if you have any questions.
  


Well, 420 ns is not to be sneezed at.

What do you think of my mmio hypercall?  That will speed up all mmio to 
be as fast as a hypercall, and then we can use ordinary mmio/pio writes 
to trigger things.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bridges

2009-05-07 Thread Ross Boylan
On Thu, 2009-05-07 at 11:13 -0600, Cam Macdonell wrote:
> 
> Ross Boylan wrote:
> > I'm trying to understand bridging with KVM, but am still puzzled.
> > I think that the recommended bridging with TAP means that packets
> from
> > the VM will end up going out the host card attached to the default
> > gateway.  But it looks to me as if their IP address is unchanged,
> which
> > means replies will never reach me.  Is that correct?  Do I need to
> NAT
> > the packets, or is something already doing that?
> 
> Hi Ross,
> 
> This is the place to start
> 
> http://www.linux-kvm.org/page/Networking.  
I saw that; it gives some recipes but I wasn't sure what their effect
was.

> You want a public bridge.
> 
> I'm not sure what "their" and "me" mean in your email.  In short,
> with 
> bridging each VM has its own IP and that VM can be accessed directly 
> from the network.
"their" = the VM.
"me" = my host machine.

So if the VM's are running on their own subnet, e.g., 10.0.2.* (I've
been assuming the subnet with TAP is like the one with the User Mode
Network stack in 3.7.3 of http://www.nongnu.org/qemu/qemu-doc.html) and
my host machine is on another net, e.g., 10.0.8.* then I think the
packet will go out with an IP of 10.0.2.2 (say).  When some other
machine tries to reply to 10.0.2.2, the packet gets lost because the
outside network thinks 10.0.2.* is not for it.  At least that's my
concern.  If the return IP address on the packet were 10.0.8.44
(supposing that's the IP of my host machine) then the packets could find
their way back.

My host machine is on an internal network with a 10.* IP.  The example
might be clearer if one supposed that the VM's were on a 192.168.*
network.

I am perhaps being influenced by the fact that I don't want to ask for
more IP's, so I don't want to configure the VM's to use an IP on our
10.0.8 network.

Does the TAP networking setup a whole subnet like the user mode network
stack (e.g., running a DHCP server), or is the idea that I would just
give the VM an IP on my subnet (10.0.8.*) in this example?

If the latter is the case (I'm now suspecting it is) then I think the
solution is clear.  I just stick the VM's on a private (to my machine)
subnet, like 192.168.*, and I do NAT on the packets as they go out.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: device-assignment: Catch GSI overflow

2009-05-07 Thread Alex Williamson
On Thu, 2009-05-07 at 11:09 -0600, Alex Williamson wrote:
> Fix the index at which we return -ENOSPC since the kernel side will
> reject a GSI >= KVM_MAX_IRQ_ROUTES.  Also, mask as a signed int before
> testing for error.

Even with this, there still seems to be a fundamental problem with our
consumption of GSIs in kvm.  For example, every time a guest writes to
the MSI capabilities area and enables MSI support, we call
kvm_get_irq_route_gsi() to get a new max_used_gsi + 1 value, then call
kvm_add_routing_entry(), which updates max_used_gsi.  It doesn't take
too long before we exhaust the GSI space and the device no longer works.
This seems to happen within a minute or two of booting a guest with an
e1000e device sitting idle on a busy network.  Do we need to keep a
bitmap of used GSIs or maybe just attempt to reuse the GSI we've gotten
previously for the device?  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: device-assignment: Catch GSI overflow

2009-05-07 Thread Alex Williamson
Fix the index at which we return -ENOSPC since the kernel side will
reject a GSI >= KVM_MAX_IRQ_ROUTES.  Also, mask as a signed int before
testing for error.

Signed-off-by: Alex Williamson 
---

 hw/device-assignment.c |2 +-
 kvm/libkvm/libkvm.c|2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index a7365c8..e06dd08 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -796,7 +796,7 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, 
unsigned int ctrl_pos)
 pci_dev->cap.start + PCI_MSI_DATA_32);
 assigned_dev->entry->type = KVM_IRQ_ROUTING_MSI;
 assigned_dev->entry->gsi = kvm_get_irq_route_gsi(kvm_context);
-if (assigned_dev->entry->gsi < 0) {
+if ((int)(assigned_dev->entry->gsi) < 0) {
 perror("assigned_dev_update_msi: kvm_get_irq_route_gsi");
 return;
 }
diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c
index ba0a5d1..2a4165a 100644
--- a/kvm/libkvm/libkvm.c
+++ b/kvm/libkvm/libkvm.c
@@ -1408,7 +1408,7 @@ int kvm_get_irq_route_gsi(kvm_context_t kvm)
 {
 #ifdef KVM_CAP_IRQ_ROUTING
if (kvm->max_used_gsi >= KVM_IOAPIC_NUM_PINS)  {
-   if (kvm->max_used_gsi <= kvm_get_gsi_count(kvm))
+   if (kvm->max_used_gsi + 1 < kvm_get_gsi_count(kvm))
 return kvm->max_used_gsi + 1;
 else
 return -ENOSPC;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Chris Wright wrote:
> * Gregory Haskins (ghask...@novell.com) wrote:
>   
>> Chris Wright wrote:
>> 
>>> VF drivers can also have this issue (and typically use mmio).
>>> I at least have a better idea what your proposal is, thanks for
>>> explanation.  Are you able to demonstrate concrete benefit with it yet
>>> (improved latency numbers for example)?
>>>   
>> I had a test-harness/numbers for this kind of thing, but its a bit
>> crufty since its from ~1.5 years ago.  I will dig it up, update it, and
>> generate/post new numbers.
>> 
>
> That would be useful, because I keep coming back to pio and shared
> page(s) when think of why not to do this.  Seems I'm not alone in that.
>
> thanks,
> -chris
>   

I completed the resurrection of the test and wrote up a little wiki on
the subject, which you can find here:

http://developer.novell.com/wiki/index.php/WhyHypercalls

Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
reinvent the wheel?" questions.

I will include this information when I publish the updated v2 series
with the s/hypercall/dynhc changes.

Let me know if you have any questions.

-Greg



signature.asc
Description: OpenPGP digital signature


Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Davide Libenzi
On Thu, 7 May 2009, Avi Kivity wrote:

> Davide Libenzi wrote:
> >   
> > > What's your take on adding irq context safe callbacks to irqfd?
> > > 
> > > To give some background here, we would like to use eventfd as a generic
> > > connector between components, so the components do not know about each
> > > other.
> > > So far eventfd successfully abstracts among components in the same
> > > process, in
> > > different processes, and in the kernel.
> > > 
> > > eventfd_signal() can be safely called from irq context, and will wake up a
> > > waiting task.  But in some cases, if the consumer is in the kernel, it may
> > > be
> > > able to consume the event from irq context, saving a context switch.
> > > 
> > > So, will you consider patches adding this capability to eventfd?
> > > 
> > 
> > Maybe I got lost in the thread, but inside the kernel we have callback-based
> > wakeup since long time. This is what epoll uses, when hooking into the file*
> > f_op->poll() subsystem.
> > Did you mean something else?
> >   
> 
> Do you mean wait_queue_t::func?

Yes, it is since long time ago that a wakeup in Linux can lead either to a 
real wakeup (in scheduler terms), or to a simple function call.
Isn't that enough?


- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] BIOS changes for configuring irq0->inti2 override(v2)

2009-05-07 Thread Beth Kon

Avi Kivity wrote:

Beth Kon wrote:
These patches resolve the irq0->inti2 override issue, and get the 
hpet working

on kvm.
They are dependent on Jes Sorensen's recent 
0006-qemu-kvm-irq-routing.patch.


Override and HPET changes are sent as a series because HPET depends 
on the override. Win2k8 expects the HPET interrupt on inti2, 
regardless of whether an override exists in the BIOS. And the HPET 
spec states that in legacy mode, timer interrupt is on inti2.


The irq0->inti2 override will always be used unless the kernel cannot 
do irq routing (i.e., compatibility with old kernels). So if the 
kernel is capable, userspace sets up irq0->inti2 via the irq routing 
interface, and adds the irq0->inti2 override to the MADT interrupt 
source override table, and the mp table (for the no-acpi case).


A couple of months ago, Marcelo was seeing RHEL5 guests complain of 
invalid
checksum with these patches, but later he couldn't reproduce it, and 
I'm not seeing it now. While all guests still need to be fully 
tested, everything appears to be in order.  I've tested on win2k864, 
win2k832, RHEL5.3 32 bit, and ubuntu 8.10 64 bit.
  


What are the changes relative to v1?
Just merge issues with the changes you put in when moving to the newer 
bios. I submitted prematurely, incorrectly thinking I was done testing. 
When I finished, some problems surfaced.



@@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val)
 #define QEMU_CFG_SIGNATURE  0x00
 #define QEMU_CFG_ID 0x01
 #define QEMU_CFG_UUID   0x02
+#define QEMU_CFG_IRQ0_OVERRIDE 0x0e
  


As noted, this should be in the arch local space.

The base changes were not in the code yet. As we discussed on IRC, I'll 
resubmit once they're there.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Driver for Inter-VM shared memory device for KVM supporting interrupts.

2009-05-07 Thread Cam Macdonell
Driver for inter-VM shared memory device that now supports interrupts between 
two guests.  The driver defines a counting semaphore and wait_event queue for 
different synchronization needs of users.  Initializing the semaphore count, 
sending interrupts and waiting are implemented via ioctl calls. 

The synchronization mechanisms are simple and rely on existing kernel 
primitives, but I think they're flexible for synchronization between guests.  
I'm contemplating more complicated designs that would use the shared memory to 
store synchronization variables, but thought I would get this initial patch out 
to get some feedback. 

Cheers,
Cam 

---
 drivers/char/Kconfig   |8 +
 drivers/char/Makefile  |2 +
 drivers/char/kvm_ivshmem.c |  430 
 3 files changed, 440 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/kvm_ivshmem.c

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..afa7cb8 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,14 @@ config DEVPORT
depends on ISA || PCI
default y
 
+config KVM_IVSHMEM
+tristate "Inter-VM Shared Memory Device"
+depends on PCI
+default m
+help
+  This device maps a region of shared memory between the host OS and any
+  number of virtual machines.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..021f06b 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -111,6 +111,8 @@ obj-$(CONFIG_PS3_FLASH) += ps3flash.o
 obj-$(CONFIG_JS_RTC)   += js-rtc.o
 js-rtc-y = rtc.o
 
+obj-$(CONFIG_KVM_IVSHMEM)  += kvm_ivshmem.o
+
 # Files generated that shall be removed upon make clean
 clean-files := consolemap_deftbl.c defkeymap.c
 
diff --git a/drivers/char/kvm_ivshmem.c b/drivers/char/kvm_ivshmem.c
new file mode 100644
index 000..a20a224
--- /dev/null
+++ b/drivers/char/kvm_ivshmem.c
@@ -0,0 +1,430 @@
+/*
+ * drivers/char/kvm_ivshmem.c - driver for KVM Inter-VM shared memory PCI 
device
+ *
+ * Copyright 2009 Cam Macdonell 
+ *
+ * Based on cirrusfb.c and 8139cp.c:
+ * Copyright 1999-2001 Jeff Garzik
+ * Copyright 2001-2004 Jeff Garzik
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define TRUE 1
+#define FALSE 0
+#define KVM_IVSHMEM_DEVICE_MINOR_NUM 0
+
+enum {
+/* KVM Inter-VM shared memory device register offsets */
+IntrMask= 0x00,/* Interrupt Mask */
+IntrStatus  = 0x10,/* Interrupt Status */
+Doorbell= 0x20,/* Doorbell */
+ShmOK = 1/* Everything is OK */
+};
+
+typedef struct kvm_ivshmem_device {
+void __iomem * regs;
+
+void * base_addr;
+
+unsigned int regaddr;
+unsigned int reg_size;
+
+unsigned int ioaddr;
+unsigned int ioaddr_size;
+unsigned int irq;
+
+bool enabled;
+
+} kvm_ivshmem_device;
+
+static int event_num;
+static struct semaphore sema;
+static wait_queue_head_t wait_queue;
+
+static kvm_ivshmem_device kvm_ivshmem_dev;
+
+static int device_major_nr;
+
+static int kvm_ivshmem_ioctl(struct inode *, struct file *, unsigned int, 
unsigned long);
+static int kvm_ivshmem_mmap(struct file *, struct vm_area_struct *);
+static int kvm_ivshmem_open(struct inode *, struct file *);
+static int kvm_ivshmem_release(struct inode *, struct file *);
+static ssize_t kvm_ivshmem_read(struct file *, char *, size_t, loff_t *);
+static ssize_t kvm_ivshmem_write(struct file *, const char *, size_t, loff_t 
*);
+static loff_t kvm_ivshmem_lseek(struct file * filp, loff_t offset, int origin);
+
+enum ivshmem_ioctl { set_sema, down_sema, sema_irq, wait_event, wait_event_irq 
};
+
+static const struct file_operations kvm_ivshmem_ops = {
+.owner   = THIS_MODULE,
+.open= kvm_ivshmem_open,
+.mmap= kvm_ivshmem_mmap,
+.read= kvm_ivshmem_read,
+.ioctl   = kvm_ivshmem_ioctl,
+.write   = kvm_ivshmem_write,
+.llseek  = kvm_ivshmem_lseek,
+.release = kvm_ivshmem_release,
+};
+
+static struct pci_device_id kvm_ivshmem_id_table[] = {
+{ 0x1af4, 0x1110, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
+{ 0 },
+};
+MODULE_DEVICE_TABLE (pci, kvm_ivshmem_id_table);
+
+static void kvm_ivshmem_remove_device(struct pci_dev* pdev);
+static int kvm_ivshmem_probe_device (struct pci_dev *pdev,
+const struct pci_device_id * ent);
+
+static struct pci_driver kvm_ivshmem_pci_driver = {
+.name= "kvm-shmem",
+.id_table= kvm_ivshmem_id_table,
+.probe   = kvm_ivshmem_probe_device,
+.remove  = kvm_ivshmem_remove_device,
+};
+
+static int kvm_ivshmem_ioctl(struct inode * ino, struct file * filp,
+unsigned int cmd, unsigned long arg)
+{
+
+int rv;
+
+switch (cmd) {
+case set_sema:
+printk("KVM_IVSHMEM: initialize semaphore\n")

bridges

2009-05-07 Thread Ross Boylan
I'm trying to understand bridging with KVM, but am still puzzled.
I think that the recommended bridging with TAP means that packets from
the VM will end up going out the host card attached to the default
gateway.  But it looks to me as if their IP address is unchanged, which
means replies will never reach me.  Is that correct?  Do I need to NAT
the packets, or is something already doing that?

Some documents indicate that I need to bring the interfaces (e.g., eth0)
down before I bring the bridge up, and that afterwards only the bridge
will have an IP address.  Is that right?

Some documents, e.g.,
http://ebtables.sourceforge.net/br_fw_ia/br_fw_ia.html, indicate
iptables should "just work" with bridging.  However, I've seen someone
with a 2.6.15 kernel ask about firewalling and be told they needed to
patch the kernel to get it work (don't have the reference handy).
Should it just work?

I'm running a 2.6.29 kernel on Debian Lenny with kvm 72+dfsg-5~lenny1.
Version 84+dfsg-2 is available in experimental.  Is there much to be
gained by going with the more recent version?

Please cc me; I'm not on the list.

Thanks.
Ross Boylan


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] bios: Use a different mask to size the option ROM BAR

2009-05-07 Thread Alex Williamson
Bit 0 is the enable bit, which we not only don't want to set, but
it will stick and make us think it's an I/O port resource.

Signed-off-by: Alex Williamson 
---

 bios/rombios32.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/bios/rombios32.c b/bios/rombios32.c
index 269f175..616e70a 100644
--- a/bios/rombios32.c
+++ b/bios/rombios32.c
@@ -948,11 +948,13 @@ static void pci_bios_init_device(PCIDevice *d)
 int ofs;
 uint32_t val, size ;
 
-if (i == PCI_ROM_SLOT)
+if (i == PCI_ROM_SLOT) {
 ofs = 0x30;
-else
+pci_config_writel(d, ofs, 0xfffe);
+} else {
 ofs = 0x10 + i * 4;
-pci_config_writel(d, ofs, 0x);
+pci_config_writel(d, ofs, 0x);
+}
 val = pci_config_readl(d, ofs);
 if (val != 0) {
 size = (~(val & ~0xf)) + 1;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Shared memory device with interrupt support

2009-05-07 Thread Cam Macdonell
Support an inter-vm shared memory device that maps a shared-memory object 
as a PCI device in the guest.  This patch also supports interrupts between 
guest by communicating over a unix domain socket.  This patch applies to the 
qemu-kvm repository. 

This device now creates a qemu character device and sends 1-bytes messages to 
trigger interrupts.  Writes are trigger by writing to the "Doorbell" register 
on the shared memory PCI device.  The lower 8-bits of the value written to this 
register are sent as the 1-byte message so different meanings of interrupts can 
be supported.

Interrupts are only supported between 2 VMs currently.  One VM must act as the 
server by adding "server" to the command-line argument.  Shared memory devices 
are created with the following command-line:

-ivhshmem ,,[unix:][,server] 

Interrupts can also be used between host and guest as well by implementing a 
listener on the host.

Cam

---
 Makefile.target |3 +
 hw/ivshmem.c|  421 +++
 hw/pc.c |6 +
 hw/pc.h |3 +
 qemu-options.hx |   14 ++
 sysemu.h|8 +
 vl.c|   14 ++
 7 files changed, 469 insertions(+), 0 deletions(-)
 create mode 100644 hw/ivshmem.c

diff --git a/Makefile.target b/Makefile.target
index b68a689..3190bba 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -643,6 +643,9 @@ OBJS += pcnet.o
 OBJS += rtl8139.o
 OBJS += e1000.o
 
+# Inter-VM PCI shared memory
+OBJS += ivshmem.o
+
 # Generic watchdog support and some watchdog devices
 OBJS += watchdog.o
 OBJS += wdt_ib700.o wdt_i6300esb.o
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
new file mode 100644
index 000..95e2268
--- /dev/null
+++ b/hw/ivshmem.c
@@ -0,0 +1,421 @@
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *  Cam Macdonell 
+ *
+ * Based On: cirrus_vga.c and rtl8139.c
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "qemu-common.h"
+#include 
+
+#define PCI_COMMAND_IOACCESS0x0001
+#define PCI_COMMAND_MEMACCESS   0x0002
+#define PCI_COMMAND_BUSMASTER   0x0004
+
+//#define DEBUG_IVSHMEM
+
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)\
+do {printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+typedef struct IVShmemState {
+uint16_t intrmask;
+uint16_t intrstatus;
+uint16_t doorbell;
+uint8_t *ivshmem_ptr;
+unsigned long ivshmem_offset;
+unsigned int ivshmem_size;
+unsigned long bios_offset;
+unsigned int bios_size;
+target_phys_addr_t base_ctrl;
+int it_shift;
+PCIDevice *pci_dev;
+CharDriverState * chr;
+unsigned long map_addr;
+unsigned long map_end;
+int ivshmem_mmio_io_addr;
+} IVShmemState;
+
+typedef struct PCI_IVShmemState {
+PCIDevice dev;
+IVShmemState ivshmem_state;
+} PCI_IVShmemState;
+
+typedef struct IVShmemDesc {
+char name[1024];
+char * chrdev;
+int size;
+} IVShmemDesc;
+
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+IntrMask = 0,
+IntrStatus = 16,
+Doorbell = 32
+};
+
+static int num_ivshmem_devices = 0;
+static IVShmemDesc ivshmem_desc;
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+uint32_t addr, uint32_t size, int type)
+{
+PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev;
+IVShmemState *s = &d->ivshmem_state;
+
+IVSHMEM_DPRINTF("addr = %u size = %u\n", addr, size);
+cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset);
+
+}
+
+void ivshmem_init(const char * optarg) {
+
+char * temp;
+char * ivshmem_sz;
+int size;
+
+num_ivshmem_devices++;
+
+/* currently we only support 1 device */
+if (num_ivshmem_devices > MAX_IVSHMEM_DEVICES) {
+return;
+}
+
+temp = strdup(optarg);
+snprintf(ivshmem_desc.name, 1024, "/%s", strsep(&temp,","));
+ivshmem_sz=strsep(&temp,",");
+if (ivshmem_sz != NULL){
+size = atol(ivshmem_sz);
+} else {
+size = -1;
+}
+
+ivshmem_desc.chrdev = strsep(&temp,"\0");
+
+if ( size == -1) {
+ivshmem_desc.size = TARGET_PAGE_SIZE;
+} else {
+ivshmem_desc.size = size*1024*1024;
+}
+IVSHMEM_DPRINTF("optarg is %s, name is %s, size is %d, chrdev is %s\n",
+optarg, ivshmem_desc.name,
+ivshmem_desc.size, 
ivshmem_desc.chrdev);
+}
+
+int ivshmem_get_size(void) {
+return ivshmem_desc.size;
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s)
+{
+int isr;
+isr = (s->intrstatus & s->intrmask) & 0x;
+
+/* don't print ISR resets */
+if (isr) {
+IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
+   isr ? 1 : 0, s

Re: [PATCH 1/4] BIOS changes for configuring irq0->inti2 override (v2)

2009-05-07 Thread Avi Kivity

Beth Kon wrote:

These patches resolve the irq0->inti2 override issue, and get the hpet working
on kvm. 


They are dependent on Jes Sorensen's recent 0006-qemu-kvm-irq-routing.patch.

Override and HPET changes are sent as a series because HPET depends on the 
override. Win2k8 expects the HPET interrupt on inti2, regardless of whether 
an override exists in the BIOS. And the HPET spec states that in legacy mode, 
timer interrupt is on inti2.


The irq0->inti2 override will always be used unless the kernel cannot do irq 
routing (i.e., compatibility with old kernels). So if the kernel is capable, 
userspace sets up irq0->inti2 via the irq routing interface, and adds the 
irq0->inti2 override to the MADT interrupt source override table, 
and the mp table (for the no-acpi case).


A couple of months ago, Marcelo was seeing RHEL5 guests complain of invalid
checksum with these patches, but later he couldn't reproduce it, and I'm not 
seeing it now. While all guests still need to be fully tested, everything 
appears to be in order.  I've tested on win2k864, win2k832, RHEL5.3 32 bit, 
and ubuntu 8.10 64 bit. 

  


What are the changes relative to v1?


@@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val)
 #define QEMU_CFG_SIGNATURE  0x00
 #define QEMU_CFG_ID 0x01
 #define QEMU_CFG_UUID   0x02
+#define QEMU_CFG_IRQ0_OVERRIDE 0x0e
  


As noted, this should be in the arch local space.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Avi Kivity

Davide Libenzi wrote:
  

What's your take on adding irq context safe callbacks to irqfd?

To give some background here, we would like to use eventfd as a generic
connector between components, so the components do not know about each other.
So far eventfd successfully abstracts among components in the same process, in
different processes, and in the kernel.

eventfd_signal() can be safely called from irq context, and will wake up a
waiting task.  But in some cases, if the consumer is in the kernel, it may be
able to consume the event from irq context, saving a context switch.

So, will you consider patches adding this capability to eventfd?



Maybe I got lost in the thread, but inside the kernel we have 
callback-based wakeup since long time. This is what epoll uses, when 
hooking into the file* f_op->poll() subsystem.

Did you mean something else?
  


Do you mean wait_queue_t::func?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][KVM][retry 1] Add support for Pause Filtering to AMD SVM

2009-05-07 Thread Avi Kivity

(copying Ingo)

Mark Langsdorf wrote:

commit 01813db8627e74018c8cec90df7e345839351f23
Author: root 
Date:   Thu May 7 09:44:10 2009 -0500

New AMD processors will support the Pause Filter Feature.
This feature creates a new field in the VMCB called Pause
Filter Count.  If Pause Filter Count is greater than 0 and
intercepting PAUSEs is enabled, the processor will increment
an internal counter when a PAUSE instruction occurs instead
of intercepting.  When the internal counter reaches the
Pause Filter Count value, a PAUSE intercept will occur.

This feature can be used to detect contended spinlocks,

especially when the lock holding VCPU is not scheduled.
Rescheduling another VCPU prevents the VCPU seeking the
lock from wasting its quantum by spinning idly.

Experimental results show that most spinlocks are held

for less than 1000 PAUSE cycles or more than a few
thousand.  Default the Pause Filter Counter to 3000 to
detect the contended spinlocks.

Processor support for this feature is indicated by a CPUID

bit.

On a 24 core system running 4 guests each with 16 VCPUs,

this patch improved overall performance of each guest's
32 job kernbench by approximately 1%.  Further performance
improvement may be possible with a more sophisticated
yield algorithm.

-Mark Langsdorf

Operating System Research Center
AMD

Signed-off-by: Mark Langsdorf 
  


(please use git format-patch rather than git show, and set up user.name 
and user.email properly)


 
 	svm->nested_vmcb = 0;

svm->vcpu.arch.hflags = HF_GIF_MASK;
+
+   if (svm_has(SVM_FEATURE_PAUSE_FILTER)) {
+   control->pause_filter_count = 5000;
+   control->intercept |= (1ULL << INTERCEPT_PAUSE);
+   }
+
 }


3000 or 5000?

 
+static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)

+{
+   /* Simple yield */
+   vcpu_put(&svm->vcpu);
+   schedule();
+   vcpu_load(&svm->vcpu);
+   return 1;
+


Ingo, will this do anything under CFS, or will CFS note that nothing has 
changed in the accounting and reschedule us immediately?




--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

 


This is my preferred option.  For a virtio-net-server in the kernel,
we'd service its eventfd in qemu, raising and lowering the pci
interrupt in the traditional way.

But we'd still need to know when to lower the interrupt.  How?



IIUC, isn't that  usually device/subsystem specific, and out of scope of
the GSI delivery vehicle?  For instance, most devices I have seen with
level ints have a register in their device register namespace for acking
the int.  


Yes it is.


As an aside, this is what causes some of the grief in dealing
with shared interrupts like KVM pass-through and/or threaded-isrs:  
There isn't a standardized way to ACK them.
  


So we'd need a side channel to tell userspace to lower the irq.  Another 
eventfd likely.


Note we don't support device assignment for devices with shared interrupts.


You may also see some generalization of masking/acking in things like
the MSI-X table.  But again, this would be out of scope of the general
GSI delivery path IIUC.

I understand that there is a feedback mechanism in the ioapic model for
calling back on acknowledgment of the interrupt.  But I am not sure what
is how the real hardware works normally, and therefore I am not
convinced that is something we need to feed all the way back (i.e. via
irqfd or whatever).  In the interest of full disclosure, its been a few
years since I studied the xAPIC docs, so I might be out to lunch on that
assertion. ;)
  


Right, that ack thing is completely internal, used for catching up on 
time drift, and for shutting down level triggered assigned interrupts.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][KVM][retry 1] Add support for Pause Filtering to AMD SVM

2009-05-07 Thread Mark Langsdorf
commit 01813db8627e74018c8cec90df7e345839351f23
Author: root 
Date:   Thu May 7 09:44:10 2009 -0500

New AMD processors will support the Pause Filter Feature.
This feature creates a new field in the VMCB called Pause
Filter Count.  If Pause Filter Count is greater than 0 and
intercepting PAUSEs is enabled, the processor will increment
an internal counter when a PAUSE instruction occurs instead
of intercepting.  When the internal counter reaches the
Pause Filter Count value, a PAUSE intercept will occur.

This feature can be used to detect contended spinlocks,
especially when the lock holding VCPU is not scheduled.
Rescheduling another VCPU prevents the VCPU seeking the
lock from wasting its quantum by spinning idly.

Experimental results show that most spinlocks are held
for less than 1000 PAUSE cycles or more than a few
thousand.  Default the Pause Filter Counter to 3000 to
detect the contended spinlocks.

Processor support for this feature is indicated by a CPUID
bit.

On a 24 core system running 4 guests each with 16 VCPUs,
this patch improved overall performance of each guest's
32 job kernbench by approximately 1%.  Further performance
improvement may be possible with a more sophisticated
yield algorithm.

-Mark Langsdorf
Operating System Research Center
AMD

Signed-off-by: Mark Langsdorf 

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 85574b7..1fecb7e 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -57,7 +57,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u16 intercept_dr_write;
u32 intercept_exceptions;
u64 intercept;
-   u8 reserved_1[44];
+   u8 reserved_1[42];
+   u16 pause_filter_count;
u64 iopm_base_pa;
u64 msrpm_base_pa;
u64 tsc_offset;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ef43a18..4279141 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -45,6 +45,7 @@ MODULE_LICENSE("GPL");
 #define SVM_FEATURE_NPT  (1 << 0)
 #define SVM_FEATURE_LBRV (1 << 1)
 #define SVM_FEATURE_SVML (1 << 2)
+#define SVM_FEATURE_PAUSE_FILTER (1 << 10)
 
 #define DEBUGCTL_RESERVED_BITS (~(0x3fULL))
 
@@ -575,6 +576,12 @@ static void init_vmcb(struct vcpu_svm *svm)
 
svm->nested_vmcb = 0;
svm->vcpu.arch.hflags = HF_GIF_MASK;
+
+   if (svm_has(SVM_FEATURE_PAUSE_FILTER)) {
+   control->pause_filter_count = 5000;
+   control->intercept |= (1ULL << INTERCEPT_PAUSE);
+   }
+
 }
 
 static int svm_vcpu_reset(struct kvm_vcpu *vcpu)
@@ -2087,6 +2094,15 @@ static int interrupt_window_interception(struct vcpu_svm 
*svm,
return 1;
 }
 
+static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+{
+   /* Simple yield */
+   vcpu_put(&svm->vcpu);
+   schedule();
+   vcpu_load(&svm->vcpu);
+   return 1;
+}
+
 static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
  struct kvm_run *kvm_run) = {
[SVM_EXIT_READ_CR0] = emulate_on_interception,
@@ -2123,6 +2139,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
[SVM_EXIT_CPUID]= cpuid_interception,
[SVM_EXIT_IRET] = iret_interception,
[SVM_EXIT_INVD] = emulate_on_interception,
+   [SVM_EXIT_PAUSE]= pause_interception,
[SVM_EXIT_HLT]  = halt_interception,
[SVM_EXIT_INVLPG]   = invlpg_interception,
[SVM_EXIT_INVLPGA]  = invalid_op_interception,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2b73e19..e2b730d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -710,6 +710,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_load(vcpu, cpu);
put_cpu();
 }
+EXPORT_SYMBOL_GPL(vcpu_load);
 
 void vcpu_put(struct kvm_vcpu *vcpu)
 {
@@ -719,6 +720,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
preempt_enable();
mutex_unlock(&vcpu->mutex);
 }
+EXPORT_SYMBOL_GPL(vcpu_put);
 
 static void ack_flush(void *_completed)
 {

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] BIOS changes for KVM HPET (v2)

2009-05-07 Thread Beth Kon

Just a note here... 
The number of table_offset_entry entries for the non BX_QEMU case doesn't make  
sense here. There are only 2 entries. I left it as is, since it does not 
impact HPET's interraction with it. Actually it seems like dead code since 
this is in kvm code but with BX_QEMU undefined. It doesn't seem to be a
problem. 


Signed-off-by: Beth Kon 


diff --git a/kvm/bios/acpi-dsdt.dsl b/kvm/bios/acpi-dsdt.dsl
index c756fed..0e142be 100755
--- a/kvm/bios/acpi-dsdt.dsl
+++ b/kvm/bios/acpi-dsdt.dsl
@@ -308,7 +308,6 @@ DefinitionBlock (
 })
 }
 #ifdef BX_QEMU
-#ifdef HPET_WORKS_IN_KVM
 Device(HPET) {
 Name(_HID,  EISAID("PNP0103"))
 Name(_UID, 0)
@@ -328,7 +327,6 @@ DefinitionBlock (
 })
 }
 #endif
-#endif
 }
 
 Scope(\_SB.PCI0) {
diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c
index ddfa828..7441cd7 100755
--- a/kvm/bios/rombios32.c
+++ b/kvm/bios/rombios32.c
@@ -1293,7 +1293,7 @@ struct rsdt_descriptor_rev1
 {
ACPI_TABLE_HEADER_DEF   /* ACPI common table 
header */
 #ifdef BX_QEMU
-   uint32_t table_offset_entry [2]; /* Array 
of pointers to other */
+   uint32_t table_offset_entry [3]; /* Array 
of pointers to other */
 // uint32_t table_offset_entry [4]; /* Array 
of pointers to other */
 #else
uint32_t table_offset_entry [3]; /* Array 
of pointers to other */
@@ -1450,8 +1450,8 @@ struct acpi_20_generic_address {
 } __attribute__((__packed__));
 
 /*
- *  * HPET Description Table
- *   */
+ *  HPET Description Table
+ */
 struct acpi_20_hpet {
 ACPI_TABLE_HEADER_DEF   /* ACPI common table 
header */
 uint32_t   timer_block_id;
@@ -1591,13 +1591,11 @@ void acpi_bios_init(void)
 #endif
 addr += madt_size;
 #ifdef BX_QEMU
-#ifdef HPET_WORKS_IN_KVM
 addr = (addr + 7) & ~7;
 hpet_addr = addr;
 hpet = (void *)(addr);
 addr += sizeof(*hpet);
 #endif
-#endif
 
 acpi_tables_size = addr - base_addr;
 
@@ -1620,10 +1618,10 @@ void acpi_bios_init(void)
 memset(rsdt, 0, sizeof(*rsdt));
 rsdt->table_offset_entry[0] = cpu_to_le32(fadt_addr);
 rsdt->table_offset_entry[1] = cpu_to_le32(madt_addr);
-//rsdt->table_offset_entry[2] = cpu_to_le32(ssdt_addr);
 #ifdef BX_QEMU
-//rsdt->table_offset_entry[3] = cpu_to_le32(hpet_addr);
+rsdt->table_offset_entry[2] = cpu_to_le32(hpet_addr);
 #endif
+//rsdt->table_offset_entry[3] = cpu_to_le32(ssdt_addr);
 acpi_build_table_header((struct acpi_table_header *)rsdt,
 "RSDT", sizeof(*rsdt), 1);
 
@@ -1723,7 +1721,6 @@ void acpi_bios_init(void)
 
 #ifdef BX_QEMU
 /* HPET */
-#ifdef HPET_WORKS_IN_KVM
 memset(hpet, 0, sizeof(*hpet));
 /* Note timer_block_id value must be kept in sync with value advertised by
  * emulated hpet
@@ -1733,7 +1730,6 @@ void acpi_bios_init(void)
 acpi_build_table_header((struct  acpi_table_header *)hpet,
  "HPET", sizeof(*hpet), 1);
 #endif
-#endif
 
 }
 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] Userspace changes for KVM HPET (v2)

2009-05-07 Thread Beth Kon

Signed-off-by: Beth Kon 


diff --git a/hw/hpet.c b/hw/hpet.c
index c7945ec..47c9f89 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -30,6 +30,7 @@
 #include "console.h"
 #include "qemu-timer.h"
 #include "hpet_emul.h"
+#include "qemu-kvm.h"
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -48,6 +49,43 @@ uint32_t hpet_in_legacy_mode(void)
 return 0;
 }
 
+static void hpet_kpit_enable(void)
+{
+struct kvm_pit_state ps;
+kvm_get_pit(kvm_context, &ps);
+kvm_set_pit(kvm_context, &ps);
+}
+
+static void hpet_kpit_disable(void)
+{
+struct kvm_pit_state ps;
+kvm_get_pit(kvm_context, &ps);
+ps.channels[0].mode = 0xff;
+kvm_set_pit(kvm_context, &ps);
+}
+
+static void hpet_legacy_enable(void)
+{
+if (qemu_kvm_pit_in_kernel()) {
+   hpet_kpit_disable();
+   dprintf("qemu: hpet disabled kernel pit\n");
+} else {
+   hpet_pit_disable();
+   dprintf("qemu: hpet disabled userspace pit\n");
+}
+}
+
+static void hpet_legacy_disable(void)
+{
+if (qemu_kvm_pit_in_kernel()) {
+   hpet_kpit_enable();
+   dprintf("qemu: hpet enabled kernel pit\n");
+} else {
+   hpet_pit_enable();
+   dprintf("qemu: hpet enabled userspace pit\n");
+}
+}
+
 static uint32_t timer_int_route(struct HPETTimer *timer)
 {
 uint32_t route;
@@ -475,9 +513,9 @@ static void hpet_ram_writel(void *opaque, 
target_phys_addr_t addr,
 }
 /* i8254 and RTC are disabled when HPET is in legacy mode */
 if (activating_bit(old_val, new_val, HPET_CFG_LEGACY)) {
-hpet_pit_disable();
+hpet_legacy_enable();
 } else if (deactivating_bit(old_val, new_val, 
HPET_CFG_LEGACY)) {
-hpet_pit_enable();
+hpet_legacy_disable();
 }
 break;
 case HPET_CFG + 4:
@@ -560,7 +598,7 @@ static void hpet_reset(void *opaque) {
  * hpet_reset is called due to system reset. At this point control must
  * be returned to pit until SW reenables hpet.
  */
-hpet_pit_enable();
+hpet_legacy_disable();
 count = 1;
 }
 
diff --git a/vl.c b/vl.c
index f9a72b3..b860b82 100644
--- a/vl.c
+++ b/vl.c
@@ -6138,10 +6138,15 @@ int main(int argc, char **argv, char **envp)
 }
 
 if (kvm_enabled()) {
-   kvm_init_ap();
+kvm_init_ap();
 #ifdef USE_KVM
 if (kvm_irqchip && !qemu_kvm_has_gsi_routing()) {
 irq0override = 0;
+/* if kernel can't do irq routing, interrupt source
+ * override 0->2 can not be set up as required by hpet,
+ * so disable hpet.
+ */
+no_hpet=1;
 }
 #endif
 }
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] Userspace changes for configuring irq0->inti2 override (v2)

2009-05-07 Thread Beth Kon
Signed-off-by: Beth Kon 

diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c
index e1b19d7..bb74f38 100644
--- a/hw/fw_cfg.c
+++ b/hw/fw_cfg.c
@@ -279,6 +279,7 @@ void *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
 fw_cfg_add_bytes(s, FW_CFG_UUID, qemu_uuid, 16);
 fw_cfg_add_i16(s, FW_CFG_NOGRAPHIC, (uint16_t)nographic);
 fw_cfg_add_i16(s, FW_CFG_NB_CPUS, (uint16_t)smp_cpus);
+fw_cfg_add_i16(s, FW_CFG_IRQ0_OVERRIDE, (uint16_t)irq0override);
 
 register_savevm("fw_cfg", -1, 1, fw_cfg_save, fw_cfg_load, s);
 qemu_register_reset(fw_cfg_reset, s);
diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h
index f616ed2..498c1e3 100644
--- a/hw/fw_cfg.h
+++ b/hw/fw_cfg.h
@@ -15,6 +15,7 @@
 #define FW_CFG_INITRD_SIZE  0x0b
 #define FW_CFG_BOOT_DEVICE  0x0c
 #define FW_CFG_NUMA 0x0d
+#define FW_CFG_IRQ0_OVERRIDE0x0e
 #define FW_CFG_MAX_ENTRY0x10
 
 #define FW_CFG_WRITE_CHANNEL0x4000
diff --git a/hw/ioapic.c b/hw/ioapic.c
index 0b70cf6..2d77a2c 100644
--- a/hw/ioapic.c
+++ b/hw/ioapic.c
@@ -23,6 +23,7 @@
 
 #include "hw.h"
 #include "pc.h"
+#include "sysemu.h"
 #include "qemu-timer.h"
 #include "host-utils.h"
 
@@ -95,14 +96,13 @@ void ioapic_set_irq(void *opaque, int vector, int level)
 {
 IOAPICState *s = opaque;
 
-#if 0
 /* ISA IRQs map to GSI 1-1 except for IRQ0 which maps
  * to GSI 2.  GSI maps to ioapic 1-1.  This is not
  * the cleanest way of doing it but it should work. */
 
-if (vector == 0)
+if (vector == 0 && irq0override) {
 vector = 2;
-#endif
+}
 
 if (vector >= 0 && vector < IOAPIC_NUM_PINS) {
 uint32_t mask = 1 << vector;
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 8cb6faa..2e52c87 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -879,7 +879,11 @@ int kvm_arch_init_irq_routing(void)
 return r;
 }
 for (i = 0; i < 24; ++i) {
-r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, i);
+if (i == 0) {
+r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, 2);
+} else if (i != 2) {
+r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, i);
+}
 if (r < 0)
 return r;
 }
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 8226001..c64718c 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -165,6 +165,7 @@ void qemu_kvm_cpu_stop(CPUState *env);
 #define kvm_enabled() (kvm_allowed)
 #define qemu_kvm_irqchip_in_kernel() kvm_irqchip_in_kernel(kvm_context)
 #define qemu_kvm_pit_in_kernel() kvm_pit_in_kernel(kvm_context)
+#define qemu_kvm_has_gsi_routing() kvm_has_gsi_routing(kvm_context)
 #define kvm_has_sync_mmu() qemu_kvm_has_sync_mmu()
 void kvm_init_vcpu(CPUState *env);
 void kvm_load_tsc(CPUState *env);
@@ -173,6 +174,7 @@ void kvm_load_tsc(CPUState *env);
 #define kvm_nested 0
 #define qemu_kvm_irqchip_in_kernel() (0)
 #define qemu_kvm_pit_in_kernel() (0)
+#define qemu_kvm_has_gsi_routing() (0)
 #define kvm_has_sync_mmu() (0)
 #define kvm_load_registers(env) do {} while(0)
 #define kvm_save_registers(env) do {} while(0)
diff --git a/sysemu.h b/sysemu.h
index 1f45fd6..292bbc3 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -93,6 +93,7 @@ extern int graphic_width;
 extern int graphic_height;
 extern int graphic_depth;
 extern int nographic;
+extern int irq0override;
 extern const char *keyboard_layout;
 extern int win2k_install_hack;
 extern int rtc_td_hack;
diff --git a/vl.c b/vl.c
index 6b4b7d2..f9a72b3 100644
--- a/vl.c
+++ b/vl.c
@@ -207,6 +207,7 @@ static int vga_ram_size;
 enum vga_retrace_method vga_retrace_method = VGA_RETRACE_DUMB;
 static DisplayState *display_state;
 int nographic;
+int irq0override;
 static int curses;
 static int sdl;
 const char* keyboard_layout = NULL;
@@ -5039,6 +5040,7 @@ int main(int argc, char **argv, char **envp)
 vga_ram_size = VGA_RAM_SIZE;
 snapshot = 0;
 nographic = 0;
+irq0override = 1;
 curses = 0;
 kernel_filename = NULL;
 kernel_cmdline = "";
@@ -6135,8 +6137,14 @@ int main(int argc, char **argv, char **envp)
 }
 }
 
-if (kvm_enabled())
-   kvm_init_ap();
+if (kvm_enabled()) {
+   kvm_init_ap();
+#ifdef USE_KVM
+if (kvm_irqchip && !qemu_kvm_has_gsi_routing()) {
+irq0override = 0;
+}
+#endif
+}
 
 machine->init(ram_size, vga_ram_size, boot_devices,
   kernel_filename, kernel_cmdline, initrd_filename, cpu_model);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] BIOS changes for configuring irq0->inti2 override (v2)

2009-05-07 Thread Beth Kon
These patches resolve the irq0->inti2 override issue, and get the hpet working
on kvm. 

They are dependent on Jes Sorensen's recent 0006-qemu-kvm-irq-routing.patch.

Override and HPET changes are sent as a series because HPET depends on the 
override. Win2k8 expects the HPET interrupt on inti2, regardless of whether 
an override exists in the BIOS. And the HPET spec states that in legacy mode, 
timer interrupt is on inti2.

The irq0->inti2 override will always be used unless the kernel cannot do irq 
routing (i.e., compatibility with old kernels). So if the kernel is capable, 
userspace sets up irq0->inti2 via the irq routing interface, and adds the 
irq0->inti2 override to the MADT interrupt source override table, 
and the mp table (for the no-acpi case).

A couple of months ago, Marcelo was seeing RHEL5 guests complain of invalid
checksum with these patches, but later he couldn't reproduce it, and I'm not 
seeing it now. While all guests still need to be fully tested, everything 
appears to be in order.  I've tested on win2k864, win2k832, RHEL5.3 32 bit, 
and ubuntu 8.10 64 bit. 

Signed-off-by: Beth Kon 

diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c
index 8684987..07dda73 100755
--- a/kvm/bios/rombios32.c
+++ b/kvm/bios/rombios32.c
@@ -445,6 +445,9 @@ uint32_t cpuid_ext_features;
 unsigned long ram_size;
 uint64_t ram_end;
 uint8_t bios_uuid[16];
+#ifdef BX_QEMU
+uint8_t irq0_override;
+#endif
 #ifdef BX_USE_EBDA_TABLES
 unsigned long ebda_cur_addr;
 #endif
@@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val)
 #define QEMU_CFG_SIGNATURE  0x00
 #define QEMU_CFG_ID 0x01
 #define QEMU_CFG_UUID   0x02
+#define QEMU_CFG_IRQ0_OVERRIDE 0x0e
 
 int qemu_cfg_port;
 
@@ -518,6 +522,18 @@ void uuid_probe(void)
 memset(bios_uuid, 0, 16);
 }
 
+#ifdef BX_QEMU
+void irq0_override_probe(void)
+{
+if(qemu_cfg_port) {
+qemu_cfg_select(QEMU_CFG_IRQ0_OVERRIDE);
+qemu_cfg_read(&irq0_override, 1);
+return;
+}
+memset(&irq0_override, 0, 1);
+}
+#endif
+
 void cpu_probe(void)
 {
 uint32_t eax, ebx, ecx, edx;
@@ -1160,6 +1176,13 @@ static void mptable_init(void)
 
 /* irqs */
 for(i = 0; i < 16; i++) {
+#ifdef BX_QEMU
+/* One entry per ioapic interrupt destination. Destination 2 is covered
+ * by irq0->inti2 override (i == 0). Source IRQ 2 is unused 
+ */
+if (irq0_override && i == 2)
+continue;
+#endif
 putb(&q, 3); /* entry type = I/O interrupt */
 putb(&q, 0); /* interrupt type = vectored interrupt */
 putb(&q, 0); /* flags: po=0, el=0 */
@@ -1167,7 +1190,12 @@ static void mptable_init(void)
 putb(&q, 0); /* source bus ID = ISA */
 putb(&q, i); /* source bus IRQ */
 putb(&q, ioapic_id); /* dest I/O APIC ID */
-putb(&q, i); /* dest I/O APIC interrupt in */
+#ifdef BX_QEMU
+if (irq0_override && i == 0)
+putb(&q, 2); /* dest I/O APIC interrupt in */
+else
+#endif
+putb(&q, i); /* dest I/O APIC interrupt in */
 }
 /* patch length */
 len = q - mp_config_table;
@@ -1550,16 +1578,18 @@ void acpi_bios_init(void)
 
 addr = (addr + 7) & ~7;
 madt_addr = addr;
+madt = (void *)(addr);
 madt_size = sizeof(*madt) +
 sizeof(struct madt_processor_apic) * MAX_CPUS +
-#ifdef BX_QEMU
-sizeof(struct madt_io_apic) /* + sizeof(struct madt_int_override) */;
-#else
 sizeof(struct madt_io_apic);
+#ifdef BX_QEMU
+for (i = 0; i < 16; i++)
+if (PCI_ISA_IRQ_MASK & (1U << i))
+madt_size += sizeof(struct madt_int_override);
+if (irq0_override)
+madt_size += sizeof(struct madt_int_override);
 #endif
-madt = (void *)(addr);
 addr += madt_size;
-
 #ifdef BX_QEMU
 #ifdef HPET_WORKS_IN_KVM
 addr = (addr + 7) & ~7;
@@ -1660,23 +1690,20 @@ void acpi_bios_init(void)
 io_apic->io_apic_id = smp_cpus;
 io_apic->address = cpu_to_le32(0xfec0);
 io_apic->interrupt = cpu_to_le32(0);
+int_override = (struct madt_int_override*)(io_apic + 1);
 #ifdef BX_QEMU
-#ifdef HPET_WORKS_IN_KVM
-io_apic++;
-
-int_override = (void *)io_apic;
-int_override->type = APIC_XRUPT_OVERRIDE;
-int_override->length = sizeof(*int_override);
-int_override->bus = cpu_to_le32(0);
-int_override->source = cpu_to_le32(0);
-int_override->gsi = cpu_to_le32(2);
-int_override->flags = cpu_to_le32(0);
-#endif
+if (irq0_override) {
+memset(int_override, 0, sizeof(*int_override));
+int_override->type = APIC_XRUPT_OVERRIDE;
+int_override->length = sizeof(*int_override);
+int_override->source = 0;
+int_override->gsi = 2;
+int_override->flags = 0; /* conforms to bus specifications */
+int_override++;
+}
 #endif
-
-int_override = (struct madt_int_override*)(io_apic + 1);
 for

Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> One thing I was thinking here was that I could create a flag for the
>> kvm_irqfd() function for something like "KVM_IRQFD_MODE_CLEAR".  This
>> flag when specified at creation time will cause the event to execute a
>> clear operation instead of a set when triggered.  That way, the default
>> mode is an edge-triggered set.  The non-default mode is to trigger a
>> clear.  Level-triggered ints could therefore create two irqfds, one for
>> raising, the other for clearing.
>>   
>
> That's my second choice option.
>
>> An alternative is to abandon the use of eventfd, and allow the irqfd to
>> be a first-class anon-fd.  The parameters passed to the write/signal()
>> function could then indicate the desired level.  The disadvantage would
>> be that it would not be compatible with eventfd, so we would need to
>> decide if the tradeoff is worth it.
>>   
>
> I would really like to keep using eventfd.  Which is why I asked
> Davide about the prospects of direct callbacks (vs wakeups).

I saw that request.  That would be ideal.

>
>> OTOH, I suspect level triggered interrupts will be primarily in the
>> legacy domain, so perhaps we do not need to worry about it too much.
>> Therefore, another option is that we *could* simply set the stake in the
>> ground that legacy/level cannot use irqfd.
>>   
>
> This is my preferred option.  For a virtio-net-server in the kernel,
> we'd service its eventfd in qemu, raising and lowering the pci
> interrupt in the traditional way.
>
> But we'd still need to know when to lower the interrupt.  How?

IIUC, isn't that  usually device/subsystem specific, and out of scope of
the GSI delivery vehicle?  For instance, most devices I have seen with
level ints have a register in their device register namespace for acking
the int.  As an aside, this is what causes some of the grief in dealing
with shared interrupts like KVM pass-through and/or threaded-isrs:  
There isn't a standardized way to ACK them.

You may also see some generalization of masking/acking in things like
the MSI-X table.  But again, this would be out of scope of the general
GSI delivery path IIUC.

I understand that there is a feedback mechanism in the ioapic model for
calling back on acknowledgment of the interrupt.  But I am not sure what
is how the real hardware works normally, and therefore I am not
convinced that is something we need to feed all the way back (i.e. via
irqfd or whatever).  In the interest of full disclosure, its been a few
years since I studied the xAPIC docs, so I might be out to lunch on that
assertion. ;)

-Greg





signature.asc
Description: OpenPGP digital signature


Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Davide Libenzi
On Thu, 7 May 2009, Avi Kivity wrote:

> What's your take on adding irq context safe callbacks to irqfd?
> 
> To give some background here, we would like to use eventfd as a generic
> connector between components, so the components do not know about each other.
> So far eventfd successfully abstracts among components in the same process, in
> different processes, and in the kernel.
> 
> eventfd_signal() can be safely called from irq context, and will wake up a
> waiting task.  But in some cases, if the consumer is in the kernel, it may be
> able to consume the event from irq context, saving a context switch.
> 
> So, will you consider patches adding this capability to eventfd?

Maybe I got lost in the thread, but inside the kernel we have 
callback-based wakeup since long time. This is what epoll uses, when 
hooking into the file* f_op->poll() subsystem.
Did you mean something else?


- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

One thing I was thinking here was that I could create a flag for the
kvm_irqfd() function for something like "KVM_IRQFD_MODE_CLEAR".  This
flag when specified at creation time will cause the event to execute a
clear operation instead of a set when triggered.  That way, the default
mode is an edge-triggered set.  The non-default mode is to trigger a
clear.  Level-triggered ints could therefore create two irqfds, one for
raising, the other for clearing.
  


That's my second choice option.


An alternative is to abandon the use of eventfd, and allow the irqfd to
be a first-class anon-fd.  The parameters passed to the write/signal()
function could then indicate the desired level.  The disadvantage would
be that it would not be compatible with eventfd, so we would need to
decide if the tradeoff is worth it.
  


I would really like to keep using eventfd.  Which is why I asked Davide 
about the prospects of direct callbacks (vs wakeups).



OTOH, I suspect level triggered interrupts will be primarily in the
legacy domain, so perhaps we do not need to worry about it too much. 
Therefore, another option is that we *could* simply set the stake in the

ground that legacy/level cannot use irqfd.
  


This is my preferred option.  For a virtio-net-server in the kernel, 
we'd service its eventfd in qemu, raising and lowering the pci interrupt 
in the traditional way.


But we'd still need to know when to lower the interrupt.  How?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 3/4] KVM: introduce kvm_arch_can_free_memslot, disallow slot deletion if cached cr3

2009-05-07 Thread Avi Kivity

mtosa...@redhat.com wrote:

Disallow the deletion of memory slots (and aliases, for x86 case), if a
vcpu contains a cr3 that points to such slot/alias.
  


That allows the guest to induce failures in the host.  Better to 
triple-fault the guest instead.


 
+int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)

+{
+   return 1;
+}
+
  


In general, instead of stubs in every arch, have x86 say 
KVM_HAVE_ARCH_CAN_FREE_MEMSLOT and define the stub in generic code when 
that define is not present.



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] virtio_pci: optional MSI-X support

2009-05-07 Thread Michael S. Tsirkin
This implements optional MSI-X support in virtio_pci.
MSI-X is used whenever the host supports at least 2 MSI-X
vectors: 1 for configuration changes and 1 for virtqueues.
Per-virtqueue vectors are allocated if enough vectors
available.

Signed-off-by: Michael S. Tsirkin 
---
 drivers/virtio/virtio_pci.c |  209 +--
 include/linux/virtio_pci.h  |8 ++-
 2 files changed, 190 insertions(+), 27 deletions(-)

diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
index cd0fd5d..4a0275b 100644
--- a/include/linux/virtio_pci.h
+++ b/include/linux/virtio_pci.h
@@ -47,9 +47,15 @@
 /* The bit of the ISR which indicates a device configuration change. */
 #define VIRTIO_PCI_ISR_CONFIG  0x2
 
+/* MSI-X registers: only enabled if MSI-X is enabled. */
+/* A 16-bit vector for configuration changes. */
+#define VIRTIO_MSI_CONFIG_VECTOR20
+/* A 16-bit vector for selected queue notifications. */
+#define VIRTIO_MSI_QUEUE_VECTOR 22
+
 /* The remaining space is defined by each driver as the per-driver
  * configuration space */
-#define VIRTIO_PCI_CONFIG  20
+#define VIRTIO_PCI_CONFIG(dev) ((dev)->msix_enabled ? 24 : 20)
 
 /* Virtio ABI version, this must match exactly */
 #define VIRTIO_PCI_ABI_VERSION 0
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index f7b79a2..2b6333c 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -42,6 +42,28 @@ struct virtio_pci_device
/* a list of queues so we can dispatch IRQs */
spinlock_t lock;
struct list_head virtqueues;
+
+   /* MSI-X support */
+   int msix_enabled;
+   struct msix_entry *msix_entries;
+   /* Name strings for interrupts. This size should be enough,
+* and I'm too lazy to allocate each name separately. */
+   char (*msix_names)[256];
+   /* Number of vectors configured at startup (excludes per-virtqueue
+* vectors if any) */
+   unsigned msix_preset_vectors;
+   /* Number of per-virtqueue vectors if any. */
+   unsigned msix_per_vq_vectors;
+};
+
+/* Constants for MSI-X */
+/* Use first vector for configuration changes, second and the rest for
+ * virtqueues Thus, we need at least 2 vectors for MSI. */
+enum {
+   VP_MSIX_CONFIG_VECTOR = 0,
+   VP_MSIX_VQ_VECTOR = 1,
+   VP_MSIX_MIN_VECTORS = 2,
+   VP_MSIX_NO_VECTOR = 0x,
 };
 
 struct virtio_pci_vq_info
@@ -60,6 +82,9 @@ struct virtio_pci_vq_info
 
/* the list node for the virtqueues list */
struct list_head node;
+
+   /* MSI-X vector (or none) */
+   unsigned vector;
 };
 
 /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
@@ -109,7 +134,8 @@ static void vp_get(struct virtio_device *vdev, unsigned 
offset,
   void *buf, unsigned len)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
-   void __iomem *ioaddr = vp_dev->ioaddr + VIRTIO_PCI_CONFIG + offset;
+   void __iomem *ioaddr = vp_dev->ioaddr +
+   VIRTIO_PCI_CONFIG(vp_dev) + offset;
u8 *ptr = buf;
int i;
 
@@ -123,7 +149,8 @@ static void vp_set(struct virtio_device *vdev, unsigned 
offset,
   const void *buf, unsigned len)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
-   void __iomem *ioaddr = vp_dev->ioaddr + VIRTIO_PCI_CONFIG + offset;
+   void __iomem *ioaddr = vp_dev->ioaddr +
+  VIRTIO_PCI_CONFIG(vp_dev) + offset;
const u8 *ptr = buf;
int i;
 
@@ -221,17 +248,110 @@ static irqreturn_t vp_interrupt(int irq, void *opaque)
return vp_vring_interrupt(irq, opaque);
 }
 
-   spin_lock_irqsave(&vp_dev->lock, flags);
-   list_for_each_entry(info, &vp_dev->virtqueues, node) {
-   if (vring_interrupt(irq, info->vq) == IRQ_HANDLED)
-   ret = IRQ_HANDLED;
+/* the config->free_vqs() implementation */
+static void vp_free_vectors(struct virtio_device *vdev) {
+   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+   int i;
+
+   /* Disable the vector used for configuration */
+   iowrite16(VP_MSIX_NO_VECTOR,
+ vp_dev->ioaddr + VIRTIO_MSI_CONFIG_VECTOR);
+
+   for (i = 0; i < vp_dev->msix_preset_vectors; ++i)
+   free_irq(vp_dev->msix_entries[i].vector, vp_dev);
+
+   if (!vp_dev->msix_preset_vectors)
+   free_irq(vp_dev->pci_dev->irq, vp_dev);
+
+   if (vp_dev->msix_enabled) {
+   vp_dev->msix_enabled = 0;
+   pci_disable_msix(vp_dev->pci_dev);
}
-   spin_unlock_irqrestore(&vp_dev->lock, flags);
+}
 
-   return ret;
+static int vp_request_vectors(struct virtio_device *vdev, unsigned max_vqs)
+{
+   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+   const char *name = dev_name(&vp_dev->vdev.dev);
+   unsigned i, vectors;
+   int err

[PATCH 2/3] virtio_pci: split up vp_interrupt

2009-05-07 Thread Michael S. Tsirkin
This reorganizes virtio-pci code in vp_interrupt slightly, so that
it's easier to add per-vq MSI support on top.

Signed-off-by: Michael S. Tsirkin 
---
 drivers/virtio/virtio_pci.c |   45 +-
 1 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 3671c42..f7b79a2 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -164,6 +164,37 @@ static void vp_notify(struct virtqueue *vq)
iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }
 
+/* Handle a configuration change: Tell driver if it wants to know. */
+static irqreturn_t vp_config_changed(int irq, void *opaque)
+{
+   struct virtio_pci_device *vp_dev = opaque;
+   struct virtio_driver *drv;
+   drv = container_of(vp_dev->vdev.dev.driver,
+  struct virtio_driver, driver);
+
+   if (drv && drv->config_changed)
+   drv->config_changed(&vp_dev->vdev);
+   return IRQ_HANDLED;
+}
+
+/* Notify all virtqueues on an interrupt. */
+static irqreturn_t vp_vring_interrupt(int irq, void *opaque)
+{
+   struct virtio_pci_device *vp_dev = opaque;
+   struct virtio_pci_vq_info *info;
+   irqreturn_t ret = IRQ_NONE;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vp_dev->lock, flags);
+   list_for_each_entry(info, &vp_dev->virtqueues, node) {
+   if (vring_interrupt(irq, info->vq) == IRQ_HANDLED)
+   ret = IRQ_HANDLED;
+   }
+   spin_unlock_irqrestore(&vp_dev->lock, flags);
+
+   return ret;
+}
+
 /* A small wrapper to also acknowledge the interrupt when it's handled.
  * I really need an EIO hook for the vring so I can ack the interrupt once we
  * know that we'll be handling the IRQ but before we invoke the callback since
@@ -173,9 +204,6 @@ static void vp_notify(struct virtqueue *vq)
 static irqreturn_t vp_interrupt(int irq, void *opaque)
 {
struct virtio_pci_device *vp_dev = opaque;
-   struct virtio_pci_vq_info *info;
-   irqreturn_t ret = IRQ_NONE;
-   unsigned long flags;
u8 isr;
 
/* reading the ISR has the effect of also clearing it so it's very
@@ -187,14 +215,11 @@ static irqreturn_t vp_interrupt(int irq, void *opaque)
return IRQ_NONE;
 
/* Configuration change?  Tell driver if it wants to know. */
-   if (isr & VIRTIO_PCI_ISR_CONFIG) {
-   struct virtio_driver *drv;
-   drv = container_of(vp_dev->vdev.dev.driver,
-  struct virtio_driver, driver);
+   if (isr & VIRTIO_PCI_ISR_CONFIG)
+   vp_config_changed(irq, opaque);
 
-   if (drv && drv->config_changed)
-   drv->config_changed(&vp_dev->vdev);
-   }
+   return vp_vring_interrupt(irq, opaque);
+}
 
spin_lock_irqsave(&vp_dev->lock, flags);
list_for_each_entry(info, &vp_dev->virtqueues, node) {
-- 
1.6.3.rc3.1.g830204

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] virtio: find_vqs/del_vqs virtio operations

2009-05-07 Thread Michael S. Tsirkin
This replaces find_vq/del_vq with find_vqs/del_vqs virtio operations,
and updates all drivers. This is needed for MSI support, because MSI
needs to know the total number of vectors upfront.

Signed-off-by: Michael S. Tsirkin 
---
 drivers/block/virtio_blk.c  |   11 +++---
 drivers/char/hw_random/virtio-rng.c |   11 +++---
 drivers/char/virtio_console.c   |   27 ++
 drivers/lguest/lguest_device.c  |   49 +-
 drivers/net/virtio_net.c|   47 +++---
 drivers/s390/kvm/kvm_virtio.c   |   64 +-
 drivers/virtio/virtio_balloon.c |   31 -
 drivers/virtio/virtio_pci.c |   43 +++
 include/linux/virtio_config.h   |   29 +++-
 9 files changed, 222 insertions(+), 90 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 5d34764..7b7435d 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -197,6 +197,7 @@ static int virtblk_probe(struct virtio_device *vdev)
u64 cap;
u32 v;
u32 blk_size, sg_elems;
+   virtqueue_callback *callback[] = { blk_done };
 
if (index_to_minor(index) >= 1 << MINORBITS)
return -ENOSPC;
@@ -224,11 +225,9 @@ static int virtblk_probe(struct virtio_device *vdev)
sg_init_table(vblk->sg, vblk->sg_elems);
 
/* We expect one virtqueue, for output. */
-   vblk->vq = vdev->config->find_vq(vdev, 0, blk_done);
-   if (IS_ERR(vblk->vq)) {
-   err = PTR_ERR(vblk->vq);
+   err = vdev->config->find_vqs(vdev, 1, &vblk->vq, callback);
+   if (err)
goto out_free_vblk;
-   }
 
vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
if (!vblk->pool) {
@@ -323,7 +322,7 @@ out_put_disk:
 out_mempool:
mempool_destroy(vblk->pool);
 out_free_vq:
-   vdev->config->del_vq(vblk->vq);
+   vdev->config->del_vqs(vdev);
 out_free_vblk:
kfree(vblk);
 out:
@@ -344,7 +343,7 @@ static void virtblk_remove(struct virtio_device *vdev)
blk_cleanup_queue(vblk->disk->queue);
put_disk(vblk->disk);
mempool_destroy(vblk->pool);
-   vdev->config->del_vq(vblk->vq);
+   vdev->config->del_vqs(vdev);
kfree(vblk);
 }
 
diff --git a/drivers/char/hw_random/virtio-rng.c 
b/drivers/char/hw_random/virtio-rng.c
index 86e83f8..18eabe4 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -91,16 +91,17 @@ static struct hwrng virtio_hwrng = {
 
 static int virtrng_probe(struct virtio_device *vdev)
 {
+   virtqueue_callback *callback[] = { random_recv_done };
int err;
 
/* We expect a single virtqueue. */
-   vq = vdev->config->find_vq(vdev, 0, random_recv_done);
-   if (IS_ERR(vq))
-   return PTR_ERR(vq);
+   err = vdev->config->find_vqs(vdev, 1, &vq, callback);
+   if (err)
+   return err;
 
err = hwrng_register(&virtio_hwrng);
if (err) {
-   vdev->config->del_vq(vq);
+   vdev->config->del_vqs(vdev);
return err;
}
 
@@ -112,7 +113,7 @@ static void virtrng_remove(struct virtio_device *vdev)
 {
vdev->config->reset(vdev);
hwrng_unregister(&virtio_hwrng);
-   vdev->config->del_vq(vq);
+   vdev->config->del_vqs(vdev);
 }
 
 static struct virtio_device_id id_table[] = {
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index ff6f5a4..1fd5376 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -188,6 +188,8 @@ static void hvc_handle_input(struct virtqueue *vq)
  * Finally we put our input buffer in the input queue, ready to receive. */
 static int __devinit virtcons_probe(struct virtio_device *dev)
 {
+   struct virtqueue *vqs[2];
+   virtqueue_callback *callbacks[2];
int err;
 
vdev = dev;
@@ -199,20 +201,17 @@ static int __devinit virtcons_probe(struct virtio_device 
*dev)
goto fail;
}
 
-   /* Find the input queue. */
+   /* Find the queues. */
/* FIXME: This is why we want to wean off hvc: we do nothing
 * when input comes in. */
-   in_vq = vdev->config->find_vq(vdev, 0, hvc_handle_input);
-   if (IS_ERR(in_vq)) {
-   err = PTR_ERR(in_vq);
+   callbacks[0] = hvc_handle_input;
+   callbacks[1] = NULL;
+   err = vdev->config->find_vqs(vdev, 2, vqs, callbacks);
+   if (err)
goto free;
-   }
 
-   out_vq = vdev->config->find_vq(vdev, 1, NULL);
-   if (IS_ERR(out_vq)) {
-   err = PTR_ERR(out_vq);
-   goto free_in_vq;
-   }
+   in_vq = vqs[0];
+   out_vq = vqs[1];
 
/* Start using the new console output. */
virtio_cons.get_chars = get_chars;
@@ -233,17 +232,15 @@ static int __devinit virtcons_probe(

[PATCHv2 0/3] virtio: add guest MSI-X support

2009-05-07 Thread Michael S. Tsirkin
Add optional MSI-X support: use a vector per virtqueue with
fallback to a common vector and finally to regular interrupt.
Teach all drivers to use it.

Signed-off-by: Michael S. Tsirkin 

---

Here's a draft set of patches for MSI-X support in the guest.  It still
needs to be tested properly, and performance impact measured, but I
thought I'd share it here in the hope of getting some very early
feedback/flames.

Changelog since v1:
- Per Avi's suggestion, let guest configure virtqueue to vector mapping
- Per Rusty's suggestion, replace API with find_vqs/del_vqs.

Michael S. Tsirkin (3):
  virtio: find_vqs/del_vqs virtio operations
  virtio_pci: split up vp_interrupt
  virtio_pci: optional MSI-X support

 drivers/block/virtio_blk.c  |   11 +-
 drivers/char/hw_random/virtio-rng.c |   11 +-
 drivers/char/virtio_console.c   |   27 ++--
 drivers/lguest/lguest_device.c  |   49 ++-
 drivers/net/virtio_net.c|   47 +++
 drivers/s390/kvm/kvm_virtio.c   |   64 -
 drivers/virtio/virtio_balloon.c |   31 ++--
 drivers/virtio/virtio_pci.c |  283 ++-
 include/linux/virtio_config.h   |   29 +++-
 include/linux/virtio_pci.h  |8 +-
 10 files changed, 440 insertions(+), 120 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Gregory Haskins
Marcelo Tosatti wrote:
> On Thu, May 07, 2009 at 12:48:21PM +0300, Avi Kivity wrote:
>   
>> Davide Libenzi wrote:
>> 
>>> On Wed, 6 May 2009, Gregory Haskins wrote:
>>>
>>>   
>>>   
 I think we are ok in this regard (at least in v5) without the 
 callback. kvm holds irqfd, which holds eventfd.  In a normal 
 situation, we will
 have eventfd with 2 references.  If userspace closes the eventfd, it
 will drop 1 of the 2 eventfd file references, but the object should
 remain intact as long as kvm still holds it as well.  When the kvm-fd is
 released, we will then decouple from the eventfd->wqh and drop the last
 fput(), officially freeing it.

 Likewise, if kvm is closed before the eventfd, we will simply decouple
 from the wqh and fput(eventfd), leaving the last reference held by
 userspace until it closes as well.

 Let me know if you see any holes in that.
 
 
>>> Looks OK, modulo my knowledge of KVM internals.
>>>   
>>>   
>> What's your take on adding irq context safe callbacks to irqfd?
>>
>> To give some background here, we would like to use eventfd as a generic  
>> connector between components, so the components do not know about each  
>> other.  So far eventfd successfully abstracts among components in the  
>> same process, in different processes, and in the kernel.
>>
>> eventfd_signal() can be safely called from irq context, and will wake up  
>> a waiting task.  But in some cases, if the consumer is in the kernel, it  
>> may be able to consume the event from irq context, saving a context 
>> switch.
>>
>> So, will you consider patches adding this capability to eventfd?
>> 
>
> (pasting from a separate thread)
>
>   
>> That's my thinking.  PCI interrupts don't work because we need to do  
>> some hacky stuff in there, but MSI should.  Oh, and we could improve
>> UIO  
>> support for interrupts when using MSI, since there's no need to  
>> acknowledge the interrupt.
>> 
>
> Ok, so for INTx assigned devices all you need to do on the ACK handler
> is to re-enable the host interrupt (and set the guest interrupt line to
> low).
>
> Right now the ack comes through a kvm internal irq ack callback.
>
> AFAICS there is no mechanism in irqfd for ACK notification, and
> interrupt injection is edge triggered.
>
> So for PCI INTx assigned devices (or any INTx level), you'd want to keep
> the guest interrupt high, with some way to notify the ACK.
>
> Avi mentioned a separate irqfd to notify the ACK. For assigned devices,
> you could register a fd wakeup function in that fd, which replaces the
> current irq ACK callback?
>   

One thing I was thinking here was that I could create a flag for the
kvm_irqfd() function for something like "KVM_IRQFD_MODE_CLEAR".  This
flag when specified at creation time will cause the event to execute a
clear operation instead of a set when triggered.  That way, the default
mode is an edge-triggered set.  The non-default mode is to trigger a
clear.  Level-triggered ints could therefore create two irqfds, one for
raising, the other for clearing.

An alternative is to abandon the use of eventfd, and allow the irqfd to
be a first-class anon-fd.  The parameters passed to the write/signal()
function could then indicate the desired level.  The disadvantage would
be that it would not be compatible with eventfd, so we would need to
decide if the tradeoff is worth it.

OTOH, I suspect level triggered interrupts will be primarily in the
legacy domain, so perhaps we do not need to worry about it too much. 
Therefore, another option is that we *could* simply set the stake in the
ground that legacy/level cannot use irqfd.

Thoughts?

-Greg




signature.asc
Description: OpenPGP digital signature


Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface

2009-05-07 Thread Marcelo Tosatti
On Thu, May 07, 2009 at 12:48:21PM +0300, Avi Kivity wrote:
> Davide Libenzi wrote:
>> On Wed, 6 May 2009, Gregory Haskins wrote:
>>
>>   
>>> I think we are ok in this regard (at least in v5) without the 
>>> callback. kvm holds irqfd, which holds eventfd.  In a normal 
>>> situation, we will
>>> have eventfd with 2 references.  If userspace closes the eventfd, it
>>> will drop 1 of the 2 eventfd file references, but the object should
>>> remain intact as long as kvm still holds it as well.  When the kvm-fd is
>>> released, we will then decouple from the eventfd->wqh and drop the last
>>> fput(), officially freeing it.
>>>
>>> Likewise, if kvm is closed before the eventfd, we will simply decouple
>>> from the wqh and fput(eventfd), leaving the last reference held by
>>> userspace until it closes as well.
>>>
>>> Let me know if you see any holes in that.
>>> 
>>
>> Looks OK, modulo my knowledge of KVM internals.
>>   
>
> What's your take on adding irq context safe callbacks to irqfd?
>
> To give some background here, we would like to use eventfd as a generic  
> connector between components, so the components do not know about each  
> other.  So far eventfd successfully abstracts among components in the  
> same process, in different processes, and in the kernel.
>
> eventfd_signal() can be safely called from irq context, and will wake up  
> a waiting task.  But in some cases, if the consumer is in the kernel, it  
> may be able to consume the event from irq context, saving a context 
> switch.
>
> So, will you consider patches adding this capability to eventfd?

(pasting from a separate thread)

> That's my thinking.  PCI interrupts don't work because we need to do  
> some hacky stuff in there, but MSI should.  Oh, and we could improve
> UIO  
> support for interrupts when using MSI, since there's no need to  
> acknowledge the interrupt.

Ok, so for INTx assigned devices all you need to do on the ACK handler
is to re-enable the host interrupt (and set the guest interrupt line to
low).

Right now the ack comes through a kvm internal irq ack callback.

AFAICS there is no mechanism in irqfd for ACK notification, and
interrupt injection is edge triggered.

So for PCI INTx assigned devices (or any INTx level), you'd want to keep
the guest interrupt high, with some way to notify the ACK.

Avi mentioned a separate irqfd to notify the ACK. For assigned devices,
you could register a fd wakeup function in that fd, which replaces the
current irq ACK callback?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jes Sorensen

Jan Kiszka wrote:

I'm not asking for full gdbstub support at this point, just for a build
workaround at arch-level (or in arch-specific blocks, in gdbstub.c e.g.)
instead of patching generic code with global risk of regressions.


Oh, I see.

I already submitted that to Avi, making gdb_set_stop_cpu() a noop on
ia64. Since there's no target-ia64/ directory in mainline qemu.git yet,
that fix will have to live in qemu-kvm for the time being.

I am fine if you wish to remove all the #ifdef CONFIG_GDBSTUB bits.

Cheers,
Jes

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/4] qemu: external module: smp_send_reschedule compat

2009-05-07 Thread Avi Kivity

mtosa...@redhat.com wrote:

smp_send_reschedule was exported (via smp_ops) in v2.6.24.
Create a compat function which schedules the IPI to keventd context, 
in case interrupts are disabled, for kernels < 2.6.24.


 
 #endif
 
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,28)

+#define nr_cpu_ids NR_CPUS
+#endif
+
  


Doesn't seem related?  Please send as a separate patch.


diff --git a/kvm/kernel/x86/hack-module.awk b/kvm/kernel/x86/hack-module.awk
index 260eeef..f4d14da 100644
--- a/kvm/kernel/x86/hack-module.awk
+++ b/kvm/kernel/x86/hack-module.awk
@@ -35,6 +35,8 @@ BEGIN { split("INIT_WORK desc_struct ldttss_desc64 desc_ptr " 
\
 
 { sub(/match->dev->msi_enabled/, "kvm_pcidev_msi_enabled(match->dev)") }
 
+{ sub(/smp_send_reschedule/, "kvm_smp_send_reschedule") }

+
 /^static void __vmx_load_host_state/ {
 vmx_load_host_state = 1
 }

  


There's a bit on the top of hack-module.awk that does this slightly 
simpler.  Please also adjust ia64.


This now resides in kvm-kmod.git, please patch against that.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/4] use smp_send_reschedule in vcpu_kick / assigned dev host intx race fix

2009-05-07 Thread Avi Kivity

Sheng Yang wrote:

Is there any issue blocked this patchset?
  


Yes, a slow maintainer.  I'll go review it now.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jan Kiszka
Jes Sorensen wrote:
> Jan Kiszka wrote:
>> As the original change this patch (correctly) fixes broke gdbstub
>> support in qemu-kvm for all archs + it is still a deviation from
>> upstream, please take the chance and go for ia64 stubs.
> 
> Jan,
> 
> There's far more to do in upstream than just gdbstubs for ia64 to be
> usable :-(
> 
> We'll get to gdbstubs at some point, but there's too many other higher
> priority items to deal with first.

I'm not asking for full gdbstub support at this point, just for a build
workaround at arch-level (or in arch-specific blocks, in gdbstub.c e.g.)
instead of patching generic code with global risk of regressions.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jes Sorensen

Jan Kiszka wrote:

As the original change this patch (correctly) fixes broke gdbstub
support in qemu-kvm for all archs + it is still a deviation from
upstream, please take the chance and go for ia64 stubs.


Jan,

There's far more to do in upstream than just gdbstubs for ia64 to be
usable :-(

We'll get to gdbstubs at some point, but there's too many other higher
priority items to deal with first.

Cheers,
Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jan Kiszka
Jes Sorensen wrote:
> Avi Kivity wrote:
>> I agree, unless
>>
>> - we want to make gdbstub support configurable (don't see any
>> overwhelming reason for this, but maybe others do)
>> - we want to merge ia64 kvm support upstream, and don't want to impose
>> gdbstub support (though I'd recommend properly implementing gdbstub)
>>
>> In any case, I'm okay with dropping the check upstream and applying
>> the local fixup.
> 
> Hi,
> 
> Here's a patch that fixes the #ifndef to make it the #ifdef as it was
> intended.
> 
> I am quite fine with us trying to drop all the #ifdefs and introduce
> noop wrappers for archs that do not provide the gdbstubs (ie. ia64), but
> to start with we better just fix the #ifndef to make it behave like it
> was originally intended.

As the original change this patch (correctly) fixes broke gdbstub
support in qemu-kvm for all archs + it is still a deviation from
upstream, please take the chance and go for ia64 stubs.

BTW, I'm planning to submit a cleanup patch for CONFIG_GDBSTUB to
upstream soon.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix qemu-kvm to build when gdbstub is disabled

2009-05-07 Thread Jes Sorensen

Avi Kivity wrote:

I agree, unless

- we want to make gdbstub support configurable (don't see any 
overwhelming reason for this, but maybe others do)
- we want to merge ia64 kvm support upstream, and don't want to impose 
gdbstub support (though I'd recommend properly implementing gdbstub)


In any case, I'm okay with dropping the check upstream and applying the 
local fixup.


Hi,

Here's a patch that fixes the #ifndef to make it the #ifdef as it was
intended.

I am quite fine with us trying to drop all the #ifdefs and introduce
noop wrappers for archs that do not provide the gdbstubs (ie. ia64), but
to start with we better just fix the #ifndef to make it behave like it
was originally intended.

Cheers,
Jes



Fix incorrect #ifndef for CONFIG_GETSTUB, which should have been an
#ifdef.

Signed-off-by: Jes Sorensen 

---
 vl.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: qemu/vl.c
===
--- qemu.orig/vl.c
+++ qemu/vl.c
@@ -4350,7 +4350,7 @@
 }
 if (cpu_can_run(env))
 ret = qemu_cpu_exec(env);
-#ifndef CONFIG_GDBSTUB
+#ifdef CONFIG_GDBSTUB
 if (ret == EXCP_DEBUG) {
 gdb_set_stop_cpu(env);
 debug_requested = 1;


Re: [PATCH] KVM: Fix assigned device with no irq

2009-05-07 Thread Sheng Yang
On Thursday 07 May 2009 20:28:12 Marcelo Tosatti wrote:
> On Thu, May 07, 2009 at 03:24:15PM +0300, Avi Kivity wrote:
> > Sheng Yang wrote:
> >> Some device like VF of SRIOV only support MSI-X.
> >>
> >> With this patch, SRIOV can be enabled with KVM assigned device(all
> >> kernel space patches are ready).
> >
> > Just noticed I missed this patch.  Is it still needed or is it obsolete?
>
> I think its obsolete.

Yeah, it have been fixed from another direct in userspace: Interrupt pin =0 
means it haven't been initialized.

-- 
regards
Yang, Sheng

>
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 4d2be16..ce80f3a 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
> >>kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id);
> >>assigned_dev->irq_source_id = -1;
> >>  - if (!assigned_dev->irq_requested_type)
> >> +  if (!assigned_dev->irq_requested_type) {
> >> +  if (assigned_dev->dev->msi_enabled)
> >> +  pci_disable_msi(assigned_dev->dev);
> >>return;
> >> +  }
> >>/*
> >> * In kvm_free_device_irq, cancel_work_sync return true if:
> >> @@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm
> >> *kvm, if (adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_INTX)
> >>return 0;
> >>  + /* IRQ 0 means uninitialized here */
> >> +  if (airq->host_irq == 0)
> >> +  return 0;
> >> +
> >>if (irqchip_in_kernel(kvm)) {
> >>if (!msi2intx &&
> >>(adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_MSI)) {
> >> @@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm,
> >>r = 0;
> >>} else {
> >>printk(KERN_WARNING
> >> - "kvm: failed to enable MSI device!\n");
> >> -  r = -ENOTTY;
> >> -  goto out_release;
> >> + "kvm: device didn't support INTx or MSI!\n");
> >> +  r = 0;
> >>}
> >>} else {
> >>/* Non-sharing INTx mode */
> >
> > --
> > error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

Chris Wright wrote:
  

* Gregory Haskins (ghask...@novell.com) wrote:
  


Chris Wright wrote:

  

But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count)
means hypercall number and arg list must be the same in order for code
to call hypercall() in a hypervisor agnostic way.
  


Yes, and that is exactly the intention.  I think its perhaps the point
you are missing.

  

Yes, I was reading this as purely any hypercall, but it seems a bit
more like:
 pv_io_ops->iomap()
 pv_io_ops->ioread()
 pv_io_ops->iowrite()
  



Right.
  


Hmm, reminds me of something I thought of a while back.

We could implement an 'mmio hypercall' that does mmio reads/writes via a 
hypercall instead of an mmio operation.  That will speed up mmio for 
emulated devices (say, e1000).  It's easy to hook into Linux 
(readl/writel), is pci-friendly, non-x86 friendly, etc.


It also makes the device work when hypercall support is not available 
(qemu/tcg); you simply fall back on mmio.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: Fix assigned device with no irq

2009-05-07 Thread Marcelo Tosatti
On Thu, May 07, 2009 at 03:24:15PM +0300, Avi Kivity wrote:
> Sheng Yang wrote:
>> Some device like VF of SRIOV only support MSI-X.
>>
>> With this patch, SRIOV can be enabled with KVM assigned device(all kernel 
>> space
>> patches are ready).
>>
>>   
>
> Just noticed I missed this patch.  Is it still needed or is it obsolete?

I think its obsolete. 


>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 4d2be16..ce80f3a 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
>>  kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id);
>>  assigned_dev->irq_source_id = -1;
>>  -   if (!assigned_dev->irq_requested_type)
>> +if (!assigned_dev->irq_requested_type) {
>> +if (assigned_dev->dev->msi_enabled)
>> +pci_disable_msi(assigned_dev->dev);
>>  return;
>> +}
>>  /*
>>   * In kvm_free_device_irq, cancel_work_sync return true if:
>> @@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm *kvm,
>>  if (adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_INTX)
>>  return 0;
>>  +   /* IRQ 0 means uninitialized here */
>> +if (airq->host_irq == 0)
>> +return 0;
>> +
>>  if (irqchip_in_kernel(kvm)) {
>>  if (!msi2intx &&
>>  (adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_MSI)) {
>> @@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm,
>>  r = 0;
>>  } else {
>>  printk(KERN_WARNING
>> -   "kvm: failed to enable MSI device!\n");
>> -r = -ENOTTY;
>> -goto out_release;
>> +   "kvm: device didn't support INTx or MSI!\n");
>> +r = 0;
>>  }
>>  } else {
>>  /* Non-sharing INTx mode */
>>   
>
>
> -- 
> error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: Fix assigned device with no irq

2009-05-07 Thread Avi Kivity

Sheng Yang wrote:

Some device like VF of SRIOV only support MSI-X.

With this patch, SRIOV can be enabled with KVM assigned device(all kernel space
patches are ready).

  


Just noticed I missed this patch.  Is it still needed or is it obsolete?



diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4d2be16..ce80f3a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id);
assigned_dev->irq_source_id = -1;
 
-	if (!assigned_dev->irq_requested_type)

+   if (!assigned_dev->irq_requested_type) {
+   if (assigned_dev->dev->msi_enabled)
+   pci_disable_msi(assigned_dev->dev);
return;
+   }
 
 	/*

 * In kvm_free_device_irq, cancel_work_sync return true if:
@@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm *kvm,
if (adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_INTX)
return 0;
 
+	/* IRQ 0 means uninitialized here */

+   if (airq->host_irq == 0)
+   return 0;
+
if (irqchip_in_kernel(kvm)) {
if (!msi2intx &&
(adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_MSI)) {
@@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm,
r = 0;
} else {
printk(KERN_WARNING
-  "kvm: failed to enable MSI device!\n");
-   r = -ENOTTY;
-   goto out_release;
+  "kvm: device didn't support INTx or MSI!\n");
+   r = 0;
}
} else {
/* Non-sharing INTx mode */
  



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] kvm: bios: Use a different mask to size the option ROM BAR

2009-05-07 Thread Avi Kivity

Alex Williamson wrote:

Bit 0 is the enable bit, which we not only don't want to set, but
it will stick and make us think it's an I/O port resource.

Signed-off-by: Alex Williamson 
---

 kvm/bios/rombios32.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c
index 8684987..6502e63 100755
--- a/kvm/bios/rombios32.c
+++ b/kvm/bios/rombios32.c
@@ -958,11 +958,13 @@ static void pci_bios_init_device(PCIDevice *d)
 int ofs;
 uint32_t val, size ;
 
-if (i == PCI_ROM_SLOT)

+if (i == PCI_ROM_SLOT) {
 ofs = 0x30;
-else
+pci_config_writel(d, ofs, 0xfffe);
+} else {
 ofs = 0x10 + i * 4;
-pci_config_writel(d, ofs, 0x);
+pci_config_writel(d, ofs, 0x);
+}
 val = pci_config_readl(d, ofs);
 if (val != 0) {
 size = (~(val & ~0xf)) + 1;
  


Looks good.  I think it applies upstream (or even upstream's upstream) 
though, in which case please send it qemu-devel.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] msi-x: let drivers retry when not enough vectors

2009-05-07 Thread Matthew Wilcox
On Thu, May 07, 2009 at 01:44:38PM +0300, Avi Kivity wrote:
> I imagine this loop is present in many drivers.  So why not add a helper

Let's look!

arch/x86/kernel/amd_iommu_init.c : Needs an exact number of vectors.
drivers/block/cciss.c : If it doesn't get all 4 vectors, falls back to pin mode 
(instead of MSI mode -- bug!)
drivers/dma/ioat_dma.c : Falls back down to 1 vector if it can't get nvec
drivers/infiniband/hw/mthca/mthca_main.c : Reverts to MSI if it can't get 3.
drivers/net/benet/be_main.c : Falls back to MSI if it can't get 2.
drivers/net/bnx2.c : Falls back to MSI if it can't get 9.
drivers/net/bnx2x_main.c : Falls back to MSI if it can't get N.
drivers/net/e1000e/netdev.c: Falls back to MSI if it can't get N.
drivers/net/enic/enic_main.c: Falls back to MSI if it can't get N.
drivers/net/forcedeth.c: Falls back to MSI if it can't get N.
drivers/net/igb/igb_main.c: Falls back to MSI if it can't get N.
drivers/net/igbvf/netdev.c: Falls back to MSI if it can't get 3.
drivers/net/myri10ge/myri10ge.c: Falls back to Pin if it can't get N.
drivers/net/netxen/netxen_nic_main.c: Falls back to MSI if it can't get N.
drivers/net/qlge/qlge_main.c: Falls back to MSI if it can't get N.
drivers/net/s2io.c: Falls back to MSI if it can't get N.
drivers/net/vxge/vxge-main.c: Falls back once to a second number.
drivers/pci/pcie/portdrv_core.c: Falls back to MSI if it can't get all of them.
drivers/scsi/lpfc/lpfc_init.c: Falls back to MSI if it can't get N.
drivers/scsi/mpt2sas/mpt2sas_base.c: Only allocates 1.

drivers/net/cxgb3/cxgb3_main.c: Actually falls back ... in a bit of a weird 
way.  This one could definitely do with the proposed loop.
drivers/net/ixgbe/ixgbe_main.c: Could also be improved with this loop.
drivers/net/mlx4/main.c: Nasty goto-based loop.
drivers/net/niu.c: Ditto
drivers/net/sfc/efx.c: Only falls back once.  Would benefit from loop.
drivers/scsi/qla2xxx/qla_isr.c: Only falls back once.  Would benefit from loop.
drivers/staging/sxg/sxg.c: /*Should try with less vector returned.*/

so ... 7 drivers would benefit from this loop.  20 seem like they wouldn't.

What a lot of drivers seem to do is fall back either to a single MSI or
just pin-based interrupts when they can't get as many interrupts as they
would like.  They don't try to get a single MSI-X interrupt.  I feel an
update to the MSI-HOWTO coming on.

-- 
Matthew Wilcox  Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix irq routing by moving it to the correct place

2009-05-07 Thread Avi Kivity

Avi Kivity wrote:

Jes Sorensen wrote:

Hi,

Some freak :-) put x86 specific irq routing into the generic code path,
which obviously won't work on non-x86 systems.

This patch creates kvm_arch_init_irq_routing() and moves the x86
gibberish to it's correct location :-)



Applied, thanks.



Argh, that didn't build on x86.  I fixed it up.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >