Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-28 Thread Takuya Yoshikawa
Avi Kivity  wrote:

> > The key part of the implementation is the use of xchg() operation for
> > clearing dirty bits atomically.  Since this allows us to update only
> > BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
> > until every dirty bit is cleared again for the next call.
> 
> What about using cmpxchg16b?  That should reduce locked ops by a factor
> of 2 (but note it needs 16 bytes alignment).

I tried cmpxchg16b first: the implementation could not be naturally
combined with the for loop over the unsigned long array.

It needed an extra "if not zero" test, an alignment check and so on: it was
ugly and I guessed it would be slow.

Given that cmpxchg16b takes more cycles than the other operations,
I think this should be tried carefully, with more measurements, later.

How about concentrating on xchg now?

The implementation is simple and gives us enough improvement for now.
At least, I want to see whether xchg-based implementation works well
for one release.

GET_DIRTY_LOG can be easily tuned to one particular case and
it is really hard to check whether the implementation works well
for every important case.  I really want feedback from users
before adding non-obvious optimization.

In addition, we should take the new API into account.  It has not been
decided what kind of range can be requested.  I think restricting the range
to be long size aligned is natural.  Do you have any plan?

> > Another point to note is that we do not use for_each_set_bit() to check
> > which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
> > simply use __ffs() and __fls() and pass the range in between the two
> > positions found by them to kvm_mmu_write_protect_pt_range().
> 
> This seems artificial.

OK, then I want to pass the bits (unsigned long) as a mask.

Non-NPT machines may gain something from this.

> > Even though the passed range may include clean pages, it is much faster
> > than repeatedly call find_next_bit() due to the locality of dirty pages.
> 
> Perhaps this is due to the implementation of find_next_bit()?  would
> using bsf improve things?

I need to explain what I did in the past.

Before the srcu-less work, I had already noticed the slowness of
for_each_set_bit() and replaced it with a simple for loop like the current
one: the improvement was significant.

Yes, find_next_bit() is for generic use and not at all good when there
are many consecutive bits set: it cannot assume anything so needs to check
a lot of cases - we have long size aligned bitmap and "bits" is already
known to be non-zero after the first check of the for loop.

Of course, doing 64 function calls alone should be avoided in our case.
I also do not want to call kvm_mmu_* for each bit.

So, above, I proposed just passing "bits" to kvm_mmu_*: we can check
each bit i in a register before using rmap[i] if needed.

__ffs is really fast compared to other APIs.
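
As a rough illustration of the loop described above (a userspace sketch, not
the kernel code): each word of the bitmap is fetched and cleared with one
atomic exchange, and the resulting mask is handed to a callback, which walks
the set bits with count-trailing-zeros in the way __ffs() would be used.
The names harvest_dirty(), mask_fn and struct visit are hypothetical, C11
atomic_exchange() stands in for xchg(), and __builtin_ctzl() and
__builtin_popcountl() are GCC/Clang builtins.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for kvm_mmu_*: gets the page offset of a word
 * and the mask of pages that were dirty in that word. */
typedef void (*mask_fn)(size_t base, unsigned long mask, void *opaque);

/* Fetch-and-clear every word of the bitmap; returns the number of dirty pages. */
static size_t harvest_dirty(_Atomic unsigned long *bitmap, size_t nlongs,
                            unsigned long *snapshot, mask_fn fn, void *opaque)
{
    size_t dirty = 0;

    assert(bitmap != NULL && snapshot != NULL && fn != NULL);
    for (size_t i = 0; i < nlongs; i++) {
        unsigned long bits = 0;

        /* Plain read first, so clean words cost no locked operation. */
        if (atomic_load_explicit(&bitmap[i], memory_order_relaxed))
            bits = atomic_exchange(&bitmap[i], 0UL);
        snapshot[i] = bits;
        if (!bits)
            continue;
        fn(i * BITS_PER_LONG, bits, opaque);
        dirty += (size_t)__builtin_popcountl(bits);
    }
    return dirty;
}

struct visit {
    size_t count;      /* pages seen so far */
    size_t last_page;  /* index of the last page seen */
};

/* Example consumer: walk the set bits of the mask held in a register. */
static void visit_mask(size_t base, unsigned long mask, void *opaque)
{
    struct visit *v = opaque;

    while (mask) {
        unsigned int bit = (unsigned int)__builtin_ctzl(mask); /* like __ffs() */

        v->last_page = base + bit;
        v->count++;
        mask &= mask - 1; /* clear the lowest set bit */
    }
}
```

The consumer only touches pages whose bit is set, so no clean page is visited
even when the dirty bits within a word are sparse.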

One note is that we will lose in cases like bits = 0x..

2271171.4   12064.9   138.6    16K   -45
3375866.2   14743.3   103.0    32K   -55
4408395.6   10720.0    67.2    64K   -51
5915336.2   26538.1    45.1   128K   -44
8497356.4   16441.0    32.4   256K   -29

So the last one will become worse.  For the other 4 patterns I am not sure.

I thought that we should tune for the last case to gain a lot from
the locality of WWS.  What do you think about this point?

> > -   } else {
> > -   r = -EFAULT;
> > -   if (clear_user(log->dirty_bitmap, n))
> > -   goto out;
> > +   kvm_mmu_write_protect_pt_range(kvm, memslot, start, end);
> 
> If indeed the problem is find_next_bit(), then we could have
> kvm_mmu_write_protect_slot_masked() which would just take the bitmap as
> a parameter.  This would allow covering just this function with the
> spinlock, not the xchg loop.

We may see partial display updates if we do not hold the mmu_lock during
the xchg loop: it is possible that only pages near the end of the framebuffer
get updated sometimes - I noticed this problem when I fixed the TLB flush
issue.

This is not a big problem, but it is still a possibly noticeable change, so
I think we should do it separately, with some comments if needed.

In addition, we do not want to scan the dirty bitmap twice.  Using the
bits value soon after it is read into a register seems to be the fastest.


BTW, I also want to decide the design of the new API on this occasion.

Takuya
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] KVM: Resize kvm_io_range array dynamically

2012-02-28 Thread Amos Kong
kvm_io_bus devices are used for ioevent, pit, pic, ioapic,
coalesced_mmio.

Currently QEMU emulates only one PCI bus. It contains 32 slots and
each slot contains 8 functions, so the maximum number of supported
PCI devices is 1 * 32 * 8 = 256. The maximum number of coalesced
mmio zones is 100, and each zone has an iobus device.

This patch makes the kvm_io_range array dynamically resizable.
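
The resizing idea can be sketched in userspace C as follows. This is only an
illustration under assumptions, not the kernel code: struct io_bus,
struct io_range, bus_size(), bus_add() and bus_remove() are hypothetical
names, and malloc() stands in for the kernel allocators. The bus ends in a
flexible array member, and every add or remove builds a new, exactly sized
copy while leaving the old bus untouched.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct io_range {
    unsigned long addr;
    int len;
    void *dev;
};

struct io_bus {
    int dev_count;
    struct io_range range[]; /* flexible array member, sized at allocation */
};

/* Bytes needed for a bus holding n ranges. */
static size_t bus_size(int n)
{
    return sizeof(struct io_bus) + (size_t)n * sizeof(struct io_range);
}

/* Return a new bus with r appended; the old bus is not modified.
 * Returns NULL if the allocation fails. */
static struct io_bus *bus_add(const struct io_bus *old, struct io_range r)
{
    struct io_bus *new_bus;

    assert(old != NULL);
    new_bus = malloc(bus_size(old->dev_count + 1));
    if (!new_bus)
        return NULL;
    memcpy(new_bus, old, bus_size(old->dev_count));
    new_bus->range[new_bus->dev_count++] = r;
    return new_bus;
}

/* Return a new bus without the entry for dev; the old bus is not modified.
 * Returns NULL if dev is not present or the allocation fails. */
static struct io_bus *bus_remove(const struct io_bus *old, const void *dev)
{
    struct io_bus *new_bus;
    int i, idx = -1;

    assert(old != NULL);
    for (i = 0; i < old->dev_count; i++) {
        if (old->range[i].dev == dev) {
            idx = i;
            break;
        }
    }
    if (idx < 0)
        return NULL;

    new_bus = malloc(bus_size(old->dev_count - 1));
    if (!new_bus)
        return NULL;
    new_bus->dev_count = old->dev_count - 1;
    /* Copy the entries before and after the removed one, keeping order. */
    memcpy(new_bus->range, old->range, (size_t)idx * sizeof(struct io_range));
    memcpy(new_bus->range + idx, old->range + idx + 1,
           (size_t)(old->dev_count - idx - 1) * sizeof(struct io_range));
    return new_bus;
}
```

Note that this sketch keeps the remaining entries in their original order on
removal, whereas the patch moves the last entry into the freed slot and sorts
the array again.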

Changes from v1:
- fix typo: kvm_io_bus_range -> kvm_io_range

Signed-off-by: Amos Kong 
CC: Alex Williamson 
---
 include/linux/kvm_host.h |3 +--
 virt/kvm/kvm_main.c  |   24 +++-
 2 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 355e445..0e6d9d2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -69,8 +69,7 @@ struct kvm_io_range {
 
 struct kvm_io_bus {
int   dev_count;
-#define NR_IOBUS_DEVS 300
-   struct kvm_io_range range[NR_IOBUS_DEVS];
+   struct kvm_io_range range[];
 };
 
 enum kvm_bus {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e4431ad..a6b9445 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2389,9 +2389,6 @@ int kvm_io_bus_sort_cmp(const void *p1, const void *p2)
 int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev,
  gpa_t addr, int len)
 {
-   if (bus->dev_count == NR_IOBUS_DEVS)
-   return -ENOSPC;
-
bus->range[bus->dev_count++] = (struct kvm_io_range) {
.addr = addr,
.len = len,
@@ -2491,10 +2488,12 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum 
kvm_bus bus_idx, gpa_t addr,
struct kvm_io_bus *new_bus, *bus;
 
bus = kvm->buses[bus_idx];
-   if (bus->dev_count > NR_IOBUS_DEVS-1)
-   return -ENOSPC;
 
-   new_bus = kmemdup(bus, sizeof(struct kvm_io_bus), GFP_KERNEL);
+   new_bus = kzalloc(sizeof(*bus) + ((bus->dev_count + 1) *
+ sizeof(struct kvm_io_range)), GFP_KERNEL);
+   if (new_bus)
+   memcpy(new_bus, bus, sizeof(struct kvm_io_bus) +
+  (bus->dev_count * sizeof(struct kvm_io_range)));
if (!new_bus)
return -ENOMEM;
kvm_io_bus_insert_dev(new_bus, dev, addr, len);
@@ -2514,16 +2513,23 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
 
bus = kvm->buses[bus_idx];
 
-   new_bus = kmemdup(bus, sizeof(*bus), GFP_KERNEL);
+   new_bus = kmemdup(bus, sizeof(*bus) + ((bus->dev_count - 1) *
+ sizeof(struct kvm_io_range)), GFP_KERNEL);
if (!new_bus)
return -ENOMEM;
 
r = -ENOENT;
for (i = 0; i < new_bus->dev_count; i++)
-   if (new_bus->range[i].dev == dev) {
+   if (i == bus->dev_count - 1) {
+   /* dev is the last item of bus->range array,
+  and new_bus->range doesn't have this item. */
+   r = 0;
+   new_bus->dev_count--;
+   break;
+   } else if (new_bus->range[i].dev == dev) {
r = 0;
new_bus->dev_count--;
-   new_bus->range[i] = new_bus->range[new_bus->dev_count];
+   new_bus->range[i] = bus->range[new_bus->dev_count];
sort(new_bus->range, new_bus->dev_count,
 sizeof(struct kvm_io_range),
 kvm_io_bus_sort_cmp, NULL);



[PATCH] KVM: Resize kvm_io_bus_range array dynamically

2012-02-28 Thread Amos Kong
kvm_io_bus devices are used for ioevent, pit, pic, ioapic,
coalesced_mmio.

Currently QEMU emulates only one PCI bus. It contains 32 slots and
each slot contains 8 functions, so the maximum number of supported
PCI devices is 1 * 32 * 8 = 256. The maximum number of coalesced
mmio zones is 100, and each zone has an iobus device.

This patch makes the kvm_io_bus_range array dynamically resizable.

Signed-off-by: Amos Kong 
CC: Alex Williamson 
---
 include/linux/kvm_host.h |3 +--
 virt/kvm/kvm_main.c  |   24 +++-
 2 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 355e445..0e6d9d2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -69,8 +69,7 @@ struct kvm_io_range {
 
 struct kvm_io_bus {
int   dev_count;
-#define NR_IOBUS_DEVS 300
-   struct kvm_io_range range[NR_IOBUS_DEVS];
+   struct kvm_io_range range[];
 };
 
 enum kvm_bus {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e4431ad..a6b9445 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2389,9 +2389,6 @@ int kvm_io_bus_sort_cmp(const void *p1, const void *p2)
 int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev,
  gpa_t addr, int len)
 {
-   if (bus->dev_count == NR_IOBUS_DEVS)
-   return -ENOSPC;
-
bus->range[bus->dev_count++] = (struct kvm_io_range) {
.addr = addr,
.len = len,
@@ -2491,10 +2488,12 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum 
kvm_bus bus_idx, gpa_t addr,
struct kvm_io_bus *new_bus, *bus;
 
bus = kvm->buses[bus_idx];
-   if (bus->dev_count > NR_IOBUS_DEVS-1)
-   return -ENOSPC;
 
-   new_bus = kmemdup(bus, sizeof(struct kvm_io_bus), GFP_KERNEL);
+   new_bus = kzalloc(sizeof(*bus) + ((bus->dev_count + 1) *
+ sizeof(struct kvm_io_range)), GFP_KERNEL);
+   if (new_bus)
+   memcpy(new_bus, bus, sizeof(struct kvm_io_bus) +
+  (bus->dev_count * sizeof(struct kvm_io_range)));
if (!new_bus)
return -ENOMEM;
kvm_io_bus_insert_dev(new_bus, dev, addr, len);
@@ -2514,16 +2513,23 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
 
bus = kvm->buses[bus_idx];
 
-   new_bus = kmemdup(bus, sizeof(*bus), GFP_KERNEL);
+   new_bus = kmemdup(bus, sizeof(*bus) + ((bus->dev_count - 1) *
+ sizeof(struct kvm_io_range)), GFP_KERNEL);
if (!new_bus)
return -ENOMEM;
 
r = -ENOENT;
for (i = 0; i < new_bus->dev_count; i++)
-   if (new_bus->range[i].dev == dev) {
+   if (i == bus->dev_count - 1) {
+   /* dev is the last item of bus->range array,
+  and new_bus->range doesn't have this item. */
+   r = 0;
+   new_bus->dev_count--;
+   break;
+   } else if (new_bus->range[i].dev == dev) {
r = 0;
new_bus->dev_count--;
-   new_bus->range[i] = new_bus->range[new_bus->dev_count];
+   new_bus->range[i] = bus->range[new_bus->dev_count];
sort(new_bus->range, new_bus->dev_count,
 sizeof(struct kvm_io_range),
 kvm_io_bus_sort_cmp, NULL);



Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-28 Thread John Fastabend
On 2/28/2012 8:40 PM, John Fastabend wrote:
> On 2/18/2012 4:41 AM, jamal wrote:
>> On Fri, 2012-02-17 at 09:10 -0800, John Fastabend wrote:
>>
>>> Yes I agree that is the goal.
>>>
>>>> One last comment:
>>>> With synchronization there are other challenges when the entry in the
>>>> hardware conflicts with the entry in software when you intend the
>>>> behavior to be the same. This is not such a big deal with bridging but
>>>> becomes more apparent when you start offloading ACLs etc.
>>>>
>>>
>>> OK and these sorts of conflicts certainly don't need to be resolved
>>> by kernel code. So I think this is a reasonable reason to drive the
>>> synchronization into a user space daemon.
>>
>>
>> Yep. 
>> Thanks for listening John. Waiting to see them patches.
>>
>> cheers,
>> jamal
>>
>>
>>
> 
> +Lennert
> 
> OK back to this. The last piece is where to put these messages...
> we could take PF_ROUTE:RTM_*NEIGH
> 
>  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
>  switch.
>  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offloaded
>  switch.
>  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table
> 
> The neighbor code is using the PF_UNSPEC protocol type so we won't
> collide with these unless someone was using PF_ROUTE and relying on
> falling back to PF_UNSPEC. However, I couldn't find any programs that
> do this; iproute2 certainly doesn't. And the bridge pieces are using
> PF_BRIDGE so no collision there.
> 
> I briefly thought about trying to pull the PF_BRIDGE protocol out
> and use this for both types but I think it's better to leave the
> bridge code alone, and there is also the issue of disambiguating a msg
> at a port which has both an embedded switch and a SW bridge for a
> master.

Maybe I gave up too quickly here. I could use a bit in the ndm_flags to
specify embedded or sw bridge, but that would require having the bridge
module loaded.

> 
> Also if there are embedded switches with learning capabilities they
> might want to trigger events to user space. In this case having
> a protocol type makes user space a bit easier to manage. I've
> added Lennert so maybe he can comment; I think the Marvell chipsets
> might support something along these lines. The SR-IOV chipsets I'm
> aware of _today_ don't do learning. Learning makes the event model
> more plausible.
> 

Just checked: it looks like the DSA infrastructure has commands to enable
STP, so I guess it is doing learning.

> The other mechanism would be to embed some more attributes into the
> PF_UNSPEC:RTM_XXXLINK msg however I'm thinking that if we want to
> support learning and triggering events then we likely also don't
> want to send these events to every app with RTNLGRP_LINK set.
> 
> Plus there is already a proliferation of LINK attributes and dumping
> the FDB out of this seems a bit much but could be done with some
> bitmasks. Although the current ext_filter_mask u32 doesn't seem to
> be sufficient for events to trigger this.
> 
> so much for a short note...
> 
> Thanks
> .John
> 
> 
> 
> 



Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-28 Thread John Fastabend
On 2/18/2012 4:41 AM, jamal wrote:
> On Fri, 2012-02-17 at 09:10 -0800, John Fastabend wrote:
> 
>> Yes I agree that is the goal.
>>
>>> One last comment:
>>> With synchronization there are other challenges when the entry in the
>>> hardware conflicts with the entry in software when you intend the
>>> behavior to be the same. This is not such a big deal with bridging but
>>> becomes more apparent when you start offloading ACLs etc.
>>>
>>
>> OK and these sorts of conflicts certainly don't need to be resolved
>> by kernel code. So I think this is a reasonable reason to drive the
>> synchronization into a user space daemon.
> 
> 
> Yep. 
> Thanks for listening John. Waiting to see them patches.
> 
> cheers,
> jamal
> 
> 
> 

+Lennert

OK back to this. The last piece is where to put these messages...
we could take PF_ROUTE:RTM_*NEIGH

 PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
 switch.
 PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offloaded
 switch.
 PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table

The neighbor code is using the PF_UNSPEC protocol type so we won't
collide with these unless someone was using PF_ROUTE and relying on
falling back to PF_UNSPEC. However, I couldn't find any programs that
do this; iproute2 certainly doesn't. And the bridge pieces are using
PF_BRIDGE so no collision there.

I briefly thought about trying to pull the PF_BRIDGE protocol out
and use this for both types but I think it's better to leave the
bridge code alone, and there is also the issue of disambiguating a msg
at a port which has both an embedded switch and a SW bridge for a
master.

Also if there are embedded switches with learning capabilities they
might want to trigger events to user space. In this case having
a protocol type makes user space a bit easier to manage. I've
added Lennert so maybe he can comment; I think the Marvell chipsets
might support something along these lines. The SR-IOV chipsets I'm
aware of _today_ don't do learning. Learning makes the event model
more plausible.

The other mechanism would be to embed some more attributes into the
PF_UNSPEC:RTM_XXXLINK msg however I'm thinking that if we want to
support learning and triggering events then we likely also don't
want to send these events to every app with RTNLGRP_LINK set.

Plus there is already a proliferation of LINK attributes and dumping
the FDB out of this seems a bit much but could be done with some
bitmasks. Although the current ext_filter_mask u32 doesn't seem to
be sufficient for events to trigger this.

so much for a short note...

Thanks
.John






Re: [PATCH 0/4] KVM: srcu-less dirty logging

2012-02-28 Thread Takuya Yoshikawa
Avi Kivity  wrote:

> There will be an inversion for sure, if __put_user() faults and triggers
> an mmu notifier (perhaps directly, perhaps through an allocation that
> triggers a swap).

Ah, I did not notice that possibility.

Thanks,
Takuya


[PATCH V2] Quirk for IVB graphics FLR errata

2012-02-28 Thread Hao, Xudong
On the IvyBridge Mobile platform, a system hang may occur if an FLR (Function
Level Reset) is asserted to the internal graphics.

This quirk patch is a workaround for the IVB FLR erratum.
It disables the FLR reset handshake between the PCH and the CPU display,
then manually powers down the panel power sequencing and resets the PCH
display.

Signed-off-by: Xudong Hao 
Signed-off-by: Kay, Allen M 
---
 drivers/pci/quirks.c |   49 +
 1 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 6476547..5223b80 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -29,6 +29,11 @@
 #include <asm/dma.h>	/* isa_dma_bridge_buggy */
 #include "pci.h"
 
+#include "../gpu/drm/i915/i915_reg.h"
+#include 
+/* 10 seconds */
+#define IGD_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000)
+
 /*
  * This quirk function disables memory decoding and releases memory resources
  * of the device specified by kernel's boot parameter 
'pci=resource_alignment='.
@@ -3069,11 +3074,55 @@ static int reset_intel_82599_sfp_virtfn(struct pci_dev 
*dev, int probe)
return 0;
 }
 
+#define MSG_CTL0x45010
+
+static int reset_ivb_igd(struct pci_dev *dev, int probe) {
+   u8 *mmio_base;
+   u32 val;
+
+   if (probe)
+   return 0;
+
+   mmio_base = ioremap_nocache(pci_resource_start(dev, 0),
+pci_resource_len(dev, 0));
+   if (!mmio_base)
+   return -ENOMEM;
+
+   /* Work Around */
+   *((u32 *)(mmio_base + MSG_CTL)) = 0x0002;
+   *((u32 *)(mmio_base + SOUTH_CHICKEN2)) = 0x0005;
+   val = *((u32 *)(mmio_base + PCH_PP_CONTROL)) & 0xfffe;
+   *((u32 *)(mmio_base + PCH_PP_CONTROL)) = val;
+   do {
+   cycles_t start_time = get_cycles();
+   while (1) {
+   val = *((u32 *)(mmio_base + PCH_PP_STATUS));
+   if (((val & 0x8000) == 0)
+   && ((val & 0x3000) == 0))
+   break;
+   if (IGD_OPERATION_TIMEOUT < (get_cycles() - start_time))
+   break;
+   cpu_relax();
+   }
+   } while (0);
+   *((u32 *)(mmio_base + 0xd0100)) = 0x0002;
+
+   iounmap(pci_resource_start(dev, 0));
+   return 0;
+}
+
 #define PCI_DEVICE_ID_INTEL_82599_SFP_VF   0x10ed
+#define PCI_DEVICE_ID_INTEL_IVB_M_VGA  0x0156
+#define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166
 
 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
 reset_intel_82599_sfp_virtfn },
+   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M_VGA,
+   reset_ivb_igd },
+   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
+   reset_ivb_igd },
{ PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
reset_intel_generic_dev },
{ 0 }
--
1.6.0.rc1




[PATCH] KVM: PPC: Don't sync timebase when inside KVM

2012-02-28 Thread Alexander Graf
When we know that we're running inside of a KVM guest, we don't have to
worry about synchronizing timebases between different CPUs, since the
host already took care of that.

This fixes CPU overcommit scenarios where vCPUs could hang forever trying
to sync each other while not being scheduled.

Reported-by: Stuart Yoder 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/kernel/smp.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 46695fe..670b453 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -49,6 +49,8 @@
 #ifdef CONFIG_PPC64
 #include 
 #endif
+#include 
+#include 
 
 #ifdef DEBUG
 #include 
@@ -541,7 +543,7 @@ int __cpuinit __cpu_up(unsigned int cpu)
 
DBG("Processor %u found.\n", cpu);
 
-   if (smp_ops->give_timebase)
+   if (!kvm_para_available() && smp_ops->give_timebase)
smp_ops->give_timebase();
 
/* Wait until cpu puts itself in the online map */
@@ -626,7 +628,7 @@ void __devinit start_secondary(void *unused)
 
if (smp_ops->setup_cpu)
smp_ops->setup_cpu(cpu);
-   if (smp_ops->take_timebase)
+   if (!kvm_para_available() && smp_ops->take_timebase)
smp_ops->take_timebase();
 
secondary_cpu_time_init();
-- 
1.6.0.2



Re: [PATCH] kvm: notify host when guest paniced

2012-02-28 Thread Wen Congyang
At 02/28/2012 07:23 PM, Avi Kivity Wrote:
> On 02/27/2012 05:01 AM, Wen Congyang wrote:
>> We can know the guest is paniced when the guest runs on xen.
>> But we do not have such feature on kvm. This patch implemnts
>> this feature, and the implementation is the same as xen:
>> register panic notifier, and call hypercall when the guest
>> is paniced.
> 
> What's the motivation for this?  "Xen does this" is insufficient.

Another purpose is that a management app (for example, libvirt) can take
a dump automatically when the guest has crashed. If the management app does
not do that, the guest's user can take a dump by hand when he sees that the
guest has panicked.

I am thinking about another status: dumping. This status tells
the guest's user that the guest has panicked and that the OS's dump function
is working.

These two statuses tell the guest's user whether the guest has panicked,
and what he should do if it has.

Thanks
Wen Congyang


Re: [PATCH] kvm: notify host when guest paniced

2012-02-28 Thread Wen Congyang
At 02/28/2012 06:45 PM, Gleb Natapov Wrote:
> On Tue, Feb 28, 2012 at 11:19:47AM +0100, Jan Kiszka wrote:
>> On 2012-02-28 10:42, Wen Congyang wrote:
>>> At 02/28/2012 05:34 PM, Jan Kiszka Wrote:
>>>> On 2012-02-28 09:23, Wen Congyang wrote:
>>>>> At 02/27/2012 11:08 PM, Jan Kiszka Wrote:
>>>>>> On 2012-02-27 04:01, Wen Congyang wrote:
>>>>>>> We can know the guest is paniced when the guest runs on xen.
>>>>>>> But we do not have such feature on kvm. This patch implemnts
>>>>>>> this feature, and the implementation is the same as xen:
>>>>>>> register panic notifier, and call hypercall when the guest
>>>>>>> is paniced.
>>>>>>>
>>>>>>> Signed-off-by: Wen Congyang 
>>>>>>> ---
>>>>>>>  arch/x86/kernel/kvm.c|   12 
>>>>>>>  arch/x86/kvm/svm.c   |8 ++--
>>>>>>>  arch/x86/kvm/vmx.c   |8 ++--
>>>>>>>  arch/x86/kvm/x86.c   |   13 +++--
>>>>>>>  include/linux/kvm.h  |1 +
>>>>>>>  include/linux/kvm_para.h |1 +
>>>>>>>  6 files changed, 37 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>>>>>>> index f0c6fd6..b928d1d 100644
>>>>>>> --- a/arch/x86/kernel/kvm.c
>>>>>>> +++ b/arch/x86/kernel/kvm.c
>>>>>>> @@ -331,6 +331,17 @@ static struct notifier_block kvm_pv_reboot_nb = {
>>>>>>> .notifier_call = kvm_pv_reboot_notify,
>>>>>>>  };
>>>>>>>  
>>>>>>> +static int
>>>>>>> +kvm_pv_panic_notify(struct notifier_block *nb, unsigned long code, 
>>>>>>> void *unused)
>>>>>>> +{
>>>>>>> +   kvm_hypercall0(KVM_HC_GUEST_PANIC);
>>>>>>> +   return NOTIFY_DONE;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static struct notifier_block kvm_pv_panic_nb = {
>>>>>>> +   .notifier_call = kvm_pv_panic_notify,
>>>>>>> +};
>>>>>>> +
>>>>>>
>>>>>> You should split up host and guest-side changes.
>>>>>>
>>>>>>>  static u64 kvm_steal_clock(int cpu)
>>>>>>>  {
>>>>>>> u64 steal;
>>>>>>> @@ -417,6 +428,7 @@ void __init kvm_guest_init(void)
>>>>>>>  
>>>>>>> paravirt_ops_setup();
>>>>>>> register_reboot_notifier(&kvm_pv_reboot_nb);
>>>>>>> +   atomic_notifier_chain_register(&panic_notifier_list, 
>>>>>>> &kvm_pv_panic_nb);
>>>>>>> for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
>>>>>>> spin_lock_init(&async_pf_sleepers[i].lock);
>>>>>>> if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
>>>>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>>>>> index 0b7690e..38b4705 100644
>>>>>>> --- a/arch/x86/kvm/svm.c
>>>>>>> +++ b/arch/x86/kvm/svm.c
>>>>>>> @@ -1900,10 +1900,14 @@ static int halt_interception(struct vcpu_svm 
>>>>>>> *svm)
>>>>>>>  
>>>>>>>  static int vmmcall_interception(struct vcpu_svm *svm)
>>>>>>>  {
>>>>>>> +   int ret;
>>>>>>> +
>>>>>>> svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
>>>>>>> skip_emulated_instruction(&svm->vcpu);
>>>>>>> -   kvm_emulate_hypercall(&svm->vcpu);
>>>>>>> -   return 1;
>>>>>>> +   ret = kvm_emulate_hypercall(&svm->vcpu);
>>>>>>> +
>>>>>>> +   /* Ignore the error? */
>>>>>>> +   return ret == 0 ? 0 : 1;
>>>>>>
>>>>>> Why can't kvm_emulate_hypercall return the right value?
>>>>>
>>>>> kvm_emulate_hypercall() will call kvm_hv_hypercall(), and
>>>>> kvm_hv_hypercall() will return 0 when vcpu's CPL > 0.
>>>>> If vcpu's CPL > 0, does kvm need to exit and tell it to
>>>>> qemu?
>>>>
>>>> No, there is currently no exit to userspace due to hypercalls, neither
>>>> of HV nor KVM kind.
>>>>
>>>> The point is that the return code of kvm_emulate_hypercall is unused so
>>>> far, so you can easily redefine it to encode continue vs. exit to
>>>> userspace. Once someone has different needs, this could still be
>>>> refactored again.
>>>
>>> So, it is OK to change the return value of kvm_hv_hypercall() if vcpu's
>>> CPL > 0?
>>
>> Yes, change it to encode what vendor modules need to return to their
>> callers.
>>
> Better introduce new request flag and set it in your hypercall emulation. See
> how triple fault is handled.

A triple fault sets KVM_EXIT_SHUTDOWN and exits to userspace. Do you mean
introducing a new value (like KVM_EXIT_SHUTDOWN)?

Thanks
Wen Congyang

> 
> --
>   Gleb.



[PATCH 07/38] KVM: PPC: e500: merge into arch/powerpc/kvm/e500.h

2012-02-28 Thread Alexander Graf
From: Scott Wood 

Keeping two separate headers for e500-specific things was a
pain, and wasn't even organized along any logical boundary.

There was TLB stuff in <asm/kvm_e500.h> despite the existence of
arch/powerpc/kvm/e500_tlb.h, and nothing in <asm/kvm_e500.h> needed
to be referenced from outside arch/powerpc/kvm.

Signed-off-by: Scott Wood 
[agraf: fix bisectability]
Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/kvm_e500.h |   96 ---
 arch/powerpc/kvm/e500.c |1 -
 arch/powerpc/kvm/e500.h |   82 --
 arch/powerpc/kvm/e500_emulate.c |1 -
 arch/powerpc/kvm/e500_tlb.c |1 -
 5 files changed, 78 insertions(+), 103 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/kvm_e500.h

diff --git a/arch/powerpc/include/asm/kvm_e500.h 
b/arch/powerpc/include/asm/kvm_e500.h
deleted file mode 100644
index 8cd50a5..000
--- a/arch/powerpc/include/asm/kvm_e500.h
+++ /dev/null
@@ -1,96 +0,0 @@
-/*
- * Copyright (C) 2008-2011 Freescale Semiconductor, Inc. All rights reserved.
- *
- * Author: Yu Liu, 
- *
- * Description:
- * This file is derived from arch/powerpc/include/asm/kvm_44x.h,
- * by Hollis Blanchard .
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License, version 2, as
- * published by the Free Software Foundation.
- */
-
-#ifndef __ASM_KVM_E500_H__
-#define __ASM_KVM_E500_H__
-
-#include 
-
-#define BOOKE_INTERRUPT_SIZE 36
-
-#define E500_PID_NUM   3
-#define E500_TLB_NUM   2
-
-#define E500_TLB_VALID 1
-#define E500_TLB_DIRTY 2
-
-struct tlbe_ref {
-   pfn_t pfn;
-   unsigned int flags; /* E500_TLB_* */
-};
-
-struct tlbe_priv {
-   struct tlbe_ref ref; /* TLB0 only -- TLB1 uses tlb_refs */
-};
-
-struct vcpu_id_table;
-
-struct kvmppc_e500_tlb_params {
-   int entries, ways, sets;
-};
-
-struct kvmppc_vcpu_e500 {
-   /* Unmodified copy of the guest's TLB -- shared with host userspace. */
-   struct kvm_book3e_206_tlb_entry *gtlb_arch;
-
-   /* Starting entry number in gtlb_arch[] */
-   int gtlb_offset[E500_TLB_NUM];
-
-   /* KVM internal information associated with each guest TLB entry */
-   struct tlbe_priv *gtlb_priv[E500_TLB_NUM];
-
-   struct kvmppc_e500_tlb_params gtlb_params[E500_TLB_NUM];
-
-   unsigned int gtlb_nv[E500_TLB_NUM];
-
-   /*
-* information associated with each host TLB entry --
-* TLB1 only for now.  If/when guest TLB1 entries can be
-* mapped with host TLB0, this will be used for that too.
-*
-* We don't want to use this for guest TLB0 because then we'd
-* have the overhead of doing the translation again even if
-* the entry is still in the guest TLB (e.g. we swapped out
-* and back, and our host TLB entries got evicted).
-*/
-   struct tlbe_ref *tlb_refs[E500_TLB_NUM];
-   unsigned int host_tlb1_nv;
-
-   u32 host_pid[E500_PID_NUM];
-   u32 pid[E500_PID_NUM];
-   u32 svr;
-
-   /* vcpu id table */
-   struct vcpu_id_table *idt;
-
-   u32 l1csr0;
-   u32 l1csr1;
-   u32 hid0;
-   u32 hid1;
-   u32 tlb0cfg;
-   u32 tlb1cfg;
-   u64 mcar;
-
-   struct page **shared_tlb_pages;
-   int num_shared_tlb_pages;
-
-   struct kvm_vcpu vcpu;
-};
-
-static inline struct kvmppc_vcpu_e500 *to_e500(struct kvm_vcpu *vcpu)
-{
-   return container_of(vcpu, struct kvmppc_vcpu_e500, vcpu);
-}
-
-#endif /* __ASM_KVM_E500_H__ */
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 5c450ba..76b35d8 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -20,7 +20,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include "booke.h"
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 02ecde2..51d13bd 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -1,11 +1,12 @@
 /*
  * Copyright (C) 2008-2011 Freescale Semiconductor, Inc. All rights reserved.
  *
- * Author: Yu Liu, yu@freescale.com
+ * Author: Yu Liu 
  *
  * Description:
- * This file is based on arch/powerpc/kvm/44x_tlb.h,
- * by Hollis Blanchard .
+ * This file is based on arch/powerpc/kvm/44x_tlb.h and
+ * arch/powerpc/include/asm/kvm_44x.h by Hollis Blanchard ,
+ * Copyright IBM Corp. 2007-2008
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License, version 2, as
@@ -18,7 +19,80 @@
 #include 
 #include 
 #include 
-#include 
+
+#define E500_PID_NUM   3
+#define E500_TLB_NUM   2
+
+#define E500_TLB_VALID 1
+#define E500_TLB_DIRTY 2
+
+struct tlbe_ref {
+   pfn_t pfn;
+   unsigned int flags; /* E500_TLB_* */
+};
+
+struct tlbe_priv {
+   struct tlbe_ref ref; /* TLB0 only -- TLB1 uses tlb_refs */
+};
+
+struct vcpu_id_table;
+
+struct kvmppc_e500_tlb_params {
+   int entries, ways, sets;
+};
+
+struct kvmppc_vc

[PATCH 00/38] KVM: PPC: e500mc support v3

2012-02-28 Thread Alexander Graf
This is Scott's e500mc RFC patch set, rebased, stripped of its pt_regs
parts, and fixed for bisectability. On top of that, I addressed all the
comments that I had on the code and those that came up in his code as FIXMEs.

I verified that this patch set works just fine on e500mc and doesn't
break e500v2, so I would say it's good to go as it is, unless someone
has strong objections to how things are done. Everything hereafter
I would prefer to do based on a working upstream version rather than
a downstream fork, as that way exposure is a lot higher.

v1 -> v2:

  - ESR -> GESR
  - introduce and use constants for doorbell
  - drop e500mc ifdefs for doorbell
  - fix whitespace
  - use explicit preempt counts in inst fixup
  - rework e500v2 kconfig patch
  - add patches 31-37

v2 -> v3:

  - add patch 38
  - check for signals earlier
  - also remove "lwz r9, VCPU_KVM(r4)", which was superfluous as well
  - sync host state instead of guest state to pt_regs
  - optimize reinject code path to get out fast when not reinjecting

Alexander Graf (23):
  KVM: PPC: e500mc: Add doorbell emulation support
  KVM: PPC: e500mc: implicitly set MSR_GS
  KVM: PPC: e500mc: Move r1/r2 restoration very early
  KVM: PPC: e500mc: add load inst fixup
  KVM: PPC: rename CONFIG_KVM_E500 -> CONFIG_KVM_E500V2
  KVM: PPC: make e500v2 kvm and e500mc cpu mutually exclusive
  KVM: PPC: booke: remove leftover debugging
  KVM: PPC: booke: deliver program int on emulation failure
  KVM: PPC: booke: rework rescheduling checks
  KVM: PPC: booke: BOOKE_IRQPRIO_MAX is n+1
  KVM: PPC: bookehv: fix exit timing
  KVM: PPC: bookehv: remove negation for CONFIG_64BIT
  KVM: PPC: bookehv: remove SET_VCPU
  KVM: PPC: bookehv: disable MAS register updates early
  KVM: PPC: bookehv: add comment about shadow_msr
  KVM: PPC: booke: Readd debug abort code for machine check
  KVM: PPC: booke: add GS documentation for program interrupt
  KVM: PPC: bookehv: remove unused code
  KVM: PPC: e500: fix typo in tlb code
  KVM: PPC: booke: Support perfmon interrupts
  KVM: PPC: booke: expose good state on irq reinject
  KVM: PPC: booke: Reinject performance monitor interrupts
  KVM: PPC: Booke: only prepare to enter when we enter

Scott Wood (15):
  powerpc/booke: Set CPU_FTR_DEBUG_LVL_EXC on 32-bit
  powerpc/e500: split CPU_FTRS_ALWAYS/CPU_FTRS_POSSIBLE
  KVM: PPC: factor out lpid allocator from book3s_64_mmu_hv
  KVM: PPC: booke: add booke-level vcpu load/put
  KVM: PPC: booke: Move vm core init/destroy out of booke.c
  KVM: PPC: e500: rename e500_tlb.h to e500.h
  KVM: PPC: e500: merge <asm/kvm_e500.h> into arch/powerpc/kvm/e500.h
  KVM: PPC: e500: clean up arch/powerpc/kvm/e500.h
  KVM: PPC: e500: refactor core-specific TLB code
  KVM: PPC: e500: Track TLB1 entries with a bitmap
  KVM: PPC: e500: emulate tlbilx
  powerpc/booke: Provide exception macros with interrupt name
  KVM: PPC: booke: category E.HV (GS-mode) support
  KVM: PPC: booke: standard PPC floating point support
  KVM: PPC: e500mc support

 arch/powerpc/include/asm/cputable.h |   21 +-
 arch/powerpc/include/asm/dbell.h|3 +
 arch/powerpc/include/asm/hw_irq.h   |1 +
 arch/powerpc/include/asm/kvm.h  |1 +
 arch/powerpc/include/asm/kvm_asm.h  |8 +
 arch/powerpc/include/asm/kvm_book3s.h   |3 +
 arch/powerpc/include/asm/kvm_booke.h|3 +
 arch/powerpc/include/asm/kvm_booke_hv_asm.h |   49 +++
 arch/powerpc/include/asm/kvm_e500.h |   96 -
 arch/powerpc/include/asm/kvm_host.h |   22 +-
 arch/powerpc/include/asm/kvm_ppc.h  |   10 +-
 arch/powerpc/include/asm/mmu-book3e.h   |6 +
 arch/powerpc/include/asm/processor.h|3 +
 arch/powerpc/include/asm/reg.h  |2 +
 arch/powerpc/include/asm/reg_booke.h|   34 ++
 arch/powerpc/include/asm/system.h   |1 +
 arch/powerpc/kernel/asm-offsets.c   |   15 +-
 arch/powerpc/kernel/cpu_setup_fsl_booke.S   |1 +
 arch/powerpc/kernel/head_44x.S  |   23 +-
 arch/powerpc/kernel/head_booke.h|   69 ++-
 arch/powerpc/kernel/head_fsl_booke.S|   98 -
 arch/powerpc/kvm/44x.c  |   12 +
 arch/powerpc/kvm/Kconfig|   28 +-
 arch/powerpc/kvm/Makefile   |   15 +-
 arch/powerpc/kvm/book3s.c   |4 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |   26 +-
 arch/powerpc/kvm/booke.c|  469 +
 arch/powerpc/kvm/booke.h|   57 +++-
 arch/powerpc/kvm/booke_emulate.c|   23 +-
 arch/powerpc/kvm/bookehv_interrupts.S   |  602 +++
 arch/powerpc/kvm/e500.c |  372 ++---
 arch/powerpc/kvm/e500.h |  302 ++
 arch/powerpc/kvm/e500_emulate.c |  110 +-
 arch/powerpc/kvm/e500_tlb.c |  588 +++---
 arch/powerpc/kvm/e500_tlb.h |  174 
 

[PATCH 04/38] KVM: PPC: booke: add booke-level vcpu load/put

2012-02-28 Thread Alexander Graf
From: Scott Wood 

This gives us a place to put load/put actions that correspond to
code that is booke-specific but not specific to a particular core.

Signed-off-by: Scott Wood 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/44x.c   |3 +++
 arch/powerpc/kvm/booke.c |8 
 arch/powerpc/kvm/booke.h |3 +++
 arch/powerpc/kvm/e500.c  |3 +++
 4 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/44x.c b/arch/powerpc/kvm/44x.c
index 7b612a7..879a1a7 100644
--- a/arch/powerpc/kvm/44x.c
+++ b/arch/powerpc/kvm/44x.c
@@ -29,15 +29,18 @@
 #include 
 
 #include "44x_tlb.h"
+#include "booke.h"
 
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+   kvmppc_booke_vcpu_load(vcpu, cpu);
kvmppc_44x_tlb_load(vcpu);
 }
 
 void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu)
 {
kvmppc_44x_tlb_put(vcpu);
+   kvmppc_booke_vcpu_put(vcpu);
 }
 
 int kvmppc_core_check_processor_compat(void)
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index ee9e1ee..a2456c7 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -968,6 +968,14 @@ void kvmppc_decrementer_func(unsigned long data)
kvmppc_set_tsr_bits(vcpu, TSR_DIS);
 }
 
+void kvmppc_booke_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
+void kvmppc_booke_vcpu_put(struct kvm_vcpu *vcpu)
+{
+}
+
 int __init kvmppc_booke_init(void)
 {
unsigned long ivor[16];
diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
index 2fe2027..05d1d99 100644
--- a/arch/powerpc/kvm/booke.h
+++ b/arch/powerpc/kvm/booke.h
@@ -71,4 +71,7 @@ void kvmppc_save_guest_spe(struct kvm_vcpu *vcpu);
 /* high-level function, manages flags, host state */
 void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu);
 
+void kvmppc_booke_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void kvmppc_booke_vcpu_put(struct kvm_vcpu *vcpu);
+
 #endif /* __KVM_BOOKE_H__ */
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index ddcd896..2d5fe04 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -36,6 +36,7 @@ void kvmppc_core_load_guest_debugstate(struct kvm_vcpu *vcpu)
 
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+   kvmppc_booke_vcpu_load(vcpu, cpu);
kvmppc_e500_tlb_load(vcpu, cpu);
 }
 
@@ -47,6 +48,8 @@ void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu)
if (vcpu->arch.shadow_msr & MSR_SPE)
kvmppc_vcpu_disable_spe(vcpu);
 #endif
+
+   kvmppc_booke_vcpu_put(vcpu);
 }
 
 int kvmppc_core_check_processor_compat(void)
-- 
1.6.0.2
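
For readers following the call order in this patch, here is a minimal
userspace sketch of the layering. It is not the kernel code: struct trace,
record() and the *_load/*_put stand-ins are hypothetical names invented for
illustration. It only shows that the booke-level hook runs first on load and
last on put, as in the 44x.c and e500.c hunks above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event log so the call order can be observed. */
enum hook { BOOKE_LOAD, CORE_LOAD, CORE_PUT, BOOKE_PUT };

struct trace {
	enum hook events[8];
	size_t n;
};

static void record(struct trace *t, enum hook h)
{
	assert(t->n < sizeof(t->events) / sizeof(t->events[0]));
	t->events[t->n++] = h;
}

/* Stand-ins for kvmppc_booke_vcpu_load()/put(): booke-wide work. */
static void booke_vcpu_load(struct trace *t) { record(t, BOOKE_LOAD); }
static void booke_vcpu_put(struct trace *t)  { record(t, BOOKE_PUT); }

/* Stand-ins for core-specific work such as the TLB load/put calls. */
static void core_tlb_load(struct trace *t) { record(t, CORE_LOAD); }
static void core_tlb_put(struct trace *t)  { record(t, CORE_PUT); }

/* Mirrors kvmppc_core_vcpu_load(): booke layer first, then the core. */
void core_vcpu_load(struct trace *t)
{
	booke_vcpu_load(t);
	core_tlb_load(t);
}

/* Mirrors kvmppc_core_vcpu_put(): core first, booke layer last. */
void core_vcpu_put(struct trace *t)
{
	core_tlb_put(t);
	booke_vcpu_put(t);
}
```

The put path unwinds in the reverse order of the load path, so state set up
by the booke layer is still present while the core-specific code tears down.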

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/38] powerpc/booke: Set CPU_FTR_DEBUG_LVL_EXC on 32-bit

2012-02-28 Thread Alexander Graf
From: Scott Wood 

Currently 32-bit only cares about this for choice of exception
vector, which is done in core-specific code.  However, KVM will
want to distinguish as well.

Signed-off-by: Scott Wood 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/cputable.h |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index ad55a1c..6a034a2 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -376,7 +376,8 @@ extern const char *powerpc_base_platform;
 #define CPU_FTRS_47X   (CPU_FTRS_440x6)
 #define CPU_FTRS_E200  (CPU_FTR_USE_TB | CPU_FTR_SPE_COMP | \
CPU_FTR_NODSISRALIGN | CPU_FTR_COHERENT_ICACHE | \
-   CPU_FTR_UNIFIED_ID_CACHE | CPU_FTR_NOEXECUTE)
+   CPU_FTR_UNIFIED_ID_CACHE | CPU_FTR_NOEXECUTE | \
+   CPU_FTR_DEBUG_LVL_EXC)
 #define CPU_FTRS_E500  (CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | \
CPU_FTR_SPE_COMP | CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_NODSISRALIGN | \
CPU_FTR_NOEXECUTE)
@@ -385,7 +386,7 @@ extern const char *powerpc_base_platform;
CPU_FTR_NODSISRALIGN | CPU_FTR_NOEXECUTE)
 #define CPU_FTRS_E500MC(CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN | \
CPU_FTR_L2CSR | CPU_FTR_LWSYNC | CPU_FTR_NOEXECUTE | \
-   CPU_FTR_DBELL)
+   CPU_FTR_DBELL | CPU_FTR_DEBUG_LVL_EXC)
 #define CPU_FTRS_E5500 (CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN | \
CPU_FTR_L2CSR | CPU_FTR_LWSYNC | CPU_FTR_NOEXECUTE | \
CPU_FTR_DBELL | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
-- 
1.6.0.2
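
As a rough illustration of how such a feature bit can be tested once it is
part of a core's mask, here is a small sketch. The bit positions and the
reduced CPU_FTRS_* masks are made up for the example (the real definitions
are in asm/cputable.h), and has_feature() is a hypothetical helper, not the
kernel's own accessor.

```c
#include <assert.h>

/* Hypothetical bit positions; the real values are in asm/cputable.h. */
#define CPU_FTR_USE_TB		(1ul << 0)
#define CPU_FTR_DBELL		(1ul << 1)
#define CPU_FTR_DEBUG_LVL_EXC	(1ul << 2)

/* Reduced masks for illustration; e500mc gains the new bit in this patch. */
#define CPU_FTRS_E500	(CPU_FTR_USE_TB)
#define CPU_FTRS_E500MC	(CPU_FTR_USE_TB | CPU_FTR_DBELL | CPU_FTR_DEBUG_LVL_EXC)

/* Return nonzero when the given feature bit is set in the mask. */
int has_feature(unsigned long cpu_features, unsigned long feature)
{
	assert(feature != 0);
	return (cpu_features & feature) != 0;
}
```

With the bit in the 32-bit masks, code outside the exception vector setup
can make the same distinction with a plain mask test.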



[PATCH 09/38] KVM: PPC: e500: refactor core-specific TLB code

2012-02-28 Thread Alexander Graf
From: Scott Wood 

The PID handling is e500v1/v2-specific, and is moved to e500.c.

The MMU sregs code and kvmppc_core_vcpu_translate will be shared with
e500mc, and are moved from e500.c to e500_tlb.c.

Partially based on patches from Liu Yu.

Signed-off-by: Scott Wood 
[agraf: fix bisectability]
Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/kvm_host.h |2 +
 arch/powerpc/kvm/e500.c |  357 +++
 arch/powerpc/kvm/e500.h |   62 -
 arch/powerpc/kvm/e500_emulate.c |6 +-
 arch/powerpc/kvm/e500_tlb.c |  460 +--
 5 files changed, 473 insertions(+), 414 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 52eb9c1..47612cc 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -426,6 +426,8 @@ struct kvm_vcpu_arch {
ulong fault_esr;
ulong queued_dear;
ulong queued_esr;
+   u32 tlbcfg[4];
+   u32 mmucfg;
 #endif
gpa_t paddr_accessed;
 
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 76b35d8..b479ed7 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -22,9 +22,281 @@
 #include 
 #include 
 
+#include "../mm/mmu_decl.h"
 #include "booke.h"
 #include "e500.h"
 
+struct id {
+   unsigned long val;
+   struct id **pentry;
+};
+
+#define NUM_TIDS 256
+
+/*
+ * This table provide mappings from:
+ * (guestAS,guestTID,guestPR) --> ID of physical cpu
+ * guestAS [0..1]
+ * guestTID[0..255]
+ * guestPR [0..1]
+ * ID  [1..255]
+ * Each vcpu keeps one vcpu_id_table.
+ */
+struct vcpu_id_table {
+   struct id id[2][NUM_TIDS][2];
+};
+
+/*
+ * This table provide reversed mappings of vcpu_id_table:
+ * ID --> address of vcpu_id_table item.
+ * Each physical core has one pcpu_id_table.
+ */
+struct pcpu_id_table {
+   struct id *entry[NUM_TIDS];
+};
+
+static DEFINE_PER_CPU(struct pcpu_id_table, pcpu_sids);
+
+/* This variable keeps last used shadow ID on local core.
+ * The valid range of shadow ID is [1..255] */
+static DEFINE_PER_CPU(unsigned long, pcpu_last_used_sid);
+
+/*
+ * Allocate a free shadow id and setup a valid sid mapping in given entry.
+ * A mapping is only valid when vcpu_id_table and pcpu_id_table are match.
+ *
+ * The caller must have preemption disabled, and keep it that way until
+ * it has finished with the returned shadow id (either written into the
+ * TLB or arch.shadow_pid, or discarded).
+ */
+static inline int local_sid_setup_one(struct id *entry)
+{
+   unsigned long sid;
+   int ret = -1;
+
+   sid = ++(__get_cpu_var(pcpu_last_used_sid));
+   if (sid < NUM_TIDS) {
+   __get_cpu_var(pcpu_sids).entry[sid] = entry;
+   entry->val = sid;
+   entry->pentry = &__get_cpu_var(pcpu_sids).entry[sid];
+   ret = sid;
+   }
+
+   /*
+* If sid == NUM_TIDS, we've run out of sids.  We return -1, and
+* the caller will invalidate everything and start over.
+*
+* sid > NUM_TIDS indicates a race, which we disable preemption to
+* avoid.
+*/
+   WARN_ON(sid > NUM_TIDS);
+
+   return ret;
+}
+
+/*
+ * Check if given entry contain a valid shadow id mapping.
+ * An ID mapping is considered valid only if
+ * both vcpu and pcpu know this mapping.
+ *
+ * The caller must have preemption disabled, and keep it that way until
+ * it has finished with the returned shadow id (either written into the
+ * TLB or arch.shadow_pid, or discarded).
+ */
+static inline int local_sid_lookup(struct id *entry)
+{
+   if (entry && entry->val != 0 &&
+   __get_cpu_var(pcpu_sids).entry[entry->val] == entry &&
+   entry->pentry == &__get_cpu_var(pcpu_sids).entry[entry->val])
+   return entry->val;
+   return -1;
+}
+
+/* Invalidate all id mappings on local core -- call with preempt disabled */
+static inline void local_sid_destroy_all(void)
+{
+   __get_cpu_var(pcpu_last_used_sid) = 0;
+   memset(&__get_cpu_var(pcpu_sids), 0, sizeof(__get_cpu_var(pcpu_sids)));
+}
+
+static void *kvmppc_e500_id_table_alloc(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+   vcpu_e500->idt = kzalloc(sizeof(struct vcpu_id_table), GFP_KERNEL);
+   return vcpu_e500->idt;
+}
+
+static void kvmppc_e500_id_table_free(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+   kfree(vcpu_e500->idt);
+   vcpu_e500->idt = NULL;
+}
+
+/* Map guest pid to shadow.
+ * We use PID to keep shadow of current guest non-zero PID,
+ * and use PID1 to keep shadow of guest zero PID.
+ * So that guest tlbe with TID=0 can be accessed at any time */
+static void kvmppc_e500_recalc_shadow_pid(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+   preempt_disable();
+   vcpu_e500->vcpu.arch.shadow_pid = kvmppc_e500_get_sid(vcpu_e500,
+   get_cur_as(&vcpu_e500->vcpu),
+   get
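
The shadow ID bookkeeping quoted above can be modelled in userspace as
follows. This is a sketch under simplifying assumptions: a single core (plain
globals instead of per-CPU variables), no preemption, and the function names
sid_setup_one(), sid_lookup() and sid_destroy_all() are shortened stand-ins
for the local_sid_* helpers. The range check in sid_lookup() is an addition
for safety in this model and is not in the patch.

```c
#include <assert.h>
#include <string.h>

#define NUM_TIDS 256

struct id {
	unsigned long val;
	struct id **pentry;
};

/* Single-core stand-ins for the per-CPU variables in the patch. */
static struct id *pcpu_entry[NUM_TIDS];
static unsigned long last_used_sid;

/* Hand out the next shadow ID, or -1 once IDs 1..255 are used up. */
int sid_setup_one(struct id *entry)
{
	unsigned long sid = ++last_used_sid;

	/* Counterpart of WARN_ON(sid > NUM_TIDS) in the patch. */
	assert(sid <= NUM_TIDS);
	if (sid >= NUM_TIDS)
		return -1;

	pcpu_entry[sid] = entry;
	entry->val = sid;
	entry->pentry = &pcpu_entry[sid];
	return (int)sid;
}

/* A mapping is valid only if both sides point at each other. */
int sid_lookup(struct id *entry)
{
	if (entry && entry->val != 0 && entry->val < NUM_TIDS &&
	    pcpu_entry[entry->val] == entry &&
	    entry->pentry == &pcpu_entry[entry->val])
		return (int)entry->val;
	return -1;
}

/* Forget every mapping; stale entry->val fields then fail the lookup. */
void sid_destroy_all(void)
{
	last_used_sid = 0;
	memset(pcpu_entry, 0, sizeof(pcpu_entry));
}
```

Note that sid_destroy_all() never touches the vcpu-side entries. They are
invalidated lazily because the back pointer in the per-core table no longer
matches, which is why the lookup checks both directions.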

[PATCH 06/38] KVM: PPC: e500: rename e500_tlb.h to e500.h

2012-02-28 Thread Alexander Graf
From: Scott Wood 

This is in preparation for merging in the contents of
arch/powerpc/include/asm/kvm_e500.h.

Signed-off-by: Scott Wood 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/e500.c |2 +-
 arch/powerpc/kvm/{e500_tlb.h => e500.h} |6 +++---
 arch/powerpc/kvm/e500_emulate.c |2 +-
 arch/powerpc/kvm/e500_tlb.c |2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)
 rename arch/powerpc/kvm/{e500_tlb.h => e500.h} (98%)

diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index ac6c9ae..5c450ba 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -24,7 +24,7 @@
 #include 
 
 #include "booke.h"
-#include "e500_tlb.h"
+#include "e500.h"
 
 void kvmppc_core_load_host_debugstate(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/e500_tlb.h b/arch/powerpc/kvm/e500.h
similarity index 98%
rename from arch/powerpc/kvm/e500_tlb.h
rename to arch/powerpc/kvm/e500.h
index 5c6d2d7..02ecde2 100644
--- a/arch/powerpc/kvm/e500_tlb.h
+++ b/arch/powerpc/kvm/e500.h
@@ -12,8 +12,8 @@
  * published by the Free Software Foundation.
  */
 
-#ifndef __KVM_E500_TLB_H__
-#define __KVM_E500_TLB_H__
+#ifndef KVM_E500_H
+#define KVM_E500_H
 
 #include 
 #include 
@@ -171,4 +171,4 @@ static inline int tlbe_is_host_safe(const struct kvm_vcpu *vcpu,
return 1;
 }
 
-#endif /* __KVM_E500_TLB_H__ */
+#endif /* KVM_E500_H */
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index 6d0b2bd..2a1a228 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -17,7 +17,7 @@
 #include 
 
 #include "booke.h"
-#include "e500_tlb.h"
+#include "e500.h"
 
 #define XOP_TLBIVAX 786
 #define XOP_TLBSX   914
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 6e53e41..1d623a0 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -29,7 +29,7 @@
 #include 
 
 #include "../mm/mmu_decl.h"
-#include "e500_tlb.h"
+#include "e500.h"
 #include "trace.h"
 #include "timing.h"
 
-- 
1.6.0.2



[PATCH 08/38] KVM: PPC: e500: clean up arch/powerpc/kvm/e500.h

2012-02-28 Thread Alexander Graf
From: Scott Wood 

Move vcpu to the beginning of vcpu_e500 to give it appropriate
prominence, especially if more fields end up getting added to the
end of vcpu_e500 (and vcpu ends up in the middle).

Remove gratuitous "extern" and add parameter names to prototypes.

Signed-off-by: Scott Wood 
[agraf: fix bisectability]
Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/e500.h |   25 ++---
 1 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 51d13bd..a48af00 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -42,6 +42,8 @@ struct kvmppc_e500_tlb_params {
 };
 
 struct kvmppc_vcpu_e500 {
+   struct kvm_vcpu vcpu;
+
/* Unmodified copy of the guest's TLB -- shared with host userspace. */
struct kvm_book3e_206_tlb_entry *gtlb_arch;
 
@@ -85,8 +87,6 @@ struct kvmppc_vcpu_e500 {
 
struct page **shared_tlb_pages;
int num_shared_tlb_pages;
-
-   struct kvm_vcpu vcpu;
 };
 
 static inline struct kvmppc_vcpu_e500 *to_e500(struct kvm_vcpu *vcpu)
@@ -113,19 +113,22 @@ static inline struct kvmppc_vcpu_e500 *to_e500(struct kvm_vcpu *vcpu)
  (MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3 \
   | E500_TLB_USER_PERM_MASK | E500_TLB_SUPER_PERM_MASK)
 
-extern void kvmppc_dump_tlbs(struct kvm_vcpu *);
-extern int kvmppc_e500_emul_mt_mmucsr0(struct kvmppc_vcpu_e500 *, ulong);
-extern int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *);
-extern int kvmppc_e500_emul_tlbre(struct kvm_vcpu *);
-extern int kvmppc_e500_emul_tlbivax(struct kvm_vcpu *, int, int);
-extern int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *, int);
-extern int kvmppc_e500_tlb_search(struct kvm_vcpu *, gva_t, unsigned int, int);
 extern void kvmppc_e500_tlb_put(struct kvm_vcpu *);
 extern void kvmppc_e500_tlb_load(struct kvm_vcpu *, int);
-extern int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *);
-extern void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *);
 extern void kvmppc_e500_tlb_setup(struct kvmppc_vcpu_e500 *);
 extern void kvmppc_e500_recalc_shadow_pid(struct kvmppc_vcpu_e500 *);
+int kvmppc_e500_emul_mt_mmucsr0(struct kvmppc_vcpu_e500 *vcpu_e500,
+   ulong value);
+int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu);
+int kvmppc_e500_emul_tlbre(struct kvm_vcpu *vcpu);
+int kvmppc_e500_emul_tlbivax(struct kvm_vcpu *vcpu, int ra, int rb);
+int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb);
+int kvmppc_e500_tlb_search(struct kvm_vcpu *, gva_t, unsigned int, int);
+int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500);
+void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *vcpu_e500);
+
+void kvmppc_get_sregs_e500_tlb(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
+int kvmppc_set_sregs_e500_tlb(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
 
 /* TLB helper functions */
 static inline unsigned int
-- 
1.6.0.2
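
To illustrate why moving vcpu within the struct is safe for to_e500(), here
is a userspace sketch. The struct definitions are cut-down stand-ins for
struct kvm_vcpu and struct kvmppc_vcpu_e500, and the container_of() macro is
a simplified version of the kernel's without its type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(), without type checks. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct vcpu {			/* stand-in for struct kvm_vcpu */
	int id;
};

struct vcpu_e500 {		/* stand-in for struct kvmppc_vcpu_e500 */
	struct vcpu vcpu;	/* first member, as after this patch */
	unsigned int svr;
};

/* Recover the enclosing e500 structure from the embedded vcpu. */
struct vcpu_e500 *to_e500(struct vcpu *vcpu)
{
	struct vcpu_e500 *outer = container_of(vcpu, struct vcpu_e500, vcpu);

	assert(&outer->vcpu == vcpu);
	return outer;
}
```

Since container_of() subtracts the member offset, it works wherever vcpu
sits. With vcpu as the first member the offset is simply zero.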



[PATCH 05/38] KVM: PPC: booke: Move vm core init/destroy out of booke.c

2012-02-28 Thread Alexander Graf
From: Scott Wood 

e500mc will want to do lpid allocation/deallocation here.

Signed-off-by: Scott Wood 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/44x.c   |9 +
 arch/powerpc/kvm/booke.c |9 -
 arch/powerpc/kvm/e500.c  |9 +
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/44x.c b/arch/powerpc/kvm/44x.c
index 879a1a7..50e7dbc 100644
--- a/arch/powerpc/kvm/44x.c
+++ b/arch/powerpc/kvm/44x.c
@@ -163,6 +163,15 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
kmem_cache_free(kvm_vcpu_cache, vcpu_44x);
 }
 
+int kvmppc_core_init_vm(struct kvm *kvm)
+{
+   return 0;
+}
+
+void kvmppc_core_destroy_vm(struct kvm *kvm)
+{
+}
+
 static int __init kvmppc_44x_init(void)
 {
int r;
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index a2456c7..2ee9bae 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -932,15 +932,6 @@ void kvmppc_core_commit_memory_region(struct kvm *kvm,
 {
 }
 
-int kvmppc_core_init_vm(struct kvm *kvm)
-{
-   return 0;
-}
-
-void kvmppc_core_destroy_vm(struct kvm *kvm)
-{
-}
-
 void kvmppc_set_tcr(struct kvm_vcpu *vcpu, u32 new_tcr)
 {
vcpu->arch.tcr = new_tcr;
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 2d5fe04..ac6c9ae 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -226,6 +226,15 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
kmem_cache_free(kvm_vcpu_cache, vcpu_e500);
 }
 
+int kvmppc_core_init_vm(struct kvm *kvm)
+{
+   return 0;
+}
+
+void kvmppc_core_destroy_vm(struct kvm *kvm)
+{
+}
+
 static int __init kvmppc_e500_init(void)
 {
int r, i;
-- 
1.6.0.2



[PATCH 11/38] KVM: PPC: e500: emulate tlbilx

2012-02-28 Thread Alexander Graf
From: Scott Wood 

tlbilx is the new, preferred invalidation instruction.  It is not
found on e500 prior to e500mc, but there should be no harm in
supporting it on all e500.

Based on code from Ashish Kalra.

Signed-off-by: Scott Wood 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/e500.h |1 +
 arch/powerpc/kvm/e500_emulate.c |9 ++
 arch/powerpc/kvm/e500_tlb.c |   52 +++
 3 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index f4dee55..ce3f163 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -124,6 +124,7 @@ int kvmppc_e500_emul_mt_mmucsr0(struct kvmppc_vcpu_e500 *vcpu_e500,
 int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu);
 int kvmppc_e500_emul_tlbre(struct kvm_vcpu *vcpu);
 int kvmppc_e500_emul_tlbivax(struct kvm_vcpu *vcpu, int ra, int rb);
+int kvmppc_e500_emul_tlbilx(struct kvm_vcpu *vcpu, int rt, int ra, int rb);
 int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb);
 int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500);
 void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *vcpu_e500);
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index c80794d..af02c18 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -22,6 +22,7 @@
 #define XOP_TLBSX   914
 #define XOP_TLBRE   946
 #define XOP_TLBWE   978
+#define XOP_TLBILX  18
 
 int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
unsigned int inst, int *advance)
@@ -29,6 +30,7 @@ int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
int emulated = EMULATE_DONE;
int ra;
int rb;
+   int rt;
 
switch (get_op(inst)) {
case 31:
@@ -47,6 +49,13 @@ int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
emulated = kvmppc_e500_emul_tlbsx(vcpu,rb);
break;
 
+   case XOP_TLBILX:
+   ra = get_ra(inst);
+   rb = get_rb(inst);
+   rt = get_rt(inst);
+   emulated = kvmppc_e500_emul_tlbilx(vcpu, rt, ra, rb);
+   break;
+
case XOP_TLBIVAX:
ra = get_ra(inst);
rb = get_rb(inst);
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index c8ce51d..6eb5d65 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -631,6 +631,58 @@ int kvmppc_e500_emul_tlbivax(struct kvm_vcpu *vcpu, int ra, int rb)
return EMULATE_DONE;
 }
 
+static void tlbilx_all(struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel,
+  int pid, int rt)
+{
+   struct kvm_book3e_206_tlb_entry *tlbe;
+   int tid, esel;
+
+   /* invalidate all entries */
+   for (esel = 0; esel < vcpu_e500->gtlb_params[tlbsel].entries; esel++) {
+   tlbe = get_entry(vcpu_e500, tlbsel, esel);
+   tid = get_tlb_tid(tlbe);
+   if (rt == 0 || tid == pid) {
+   inval_gtlbe_on_host(vcpu_e500, tlbsel, esel);
+   kvmppc_e500_gtlbe_invalidate(vcpu_e500, tlbsel, esel);
+   }
+   }
+}
+
+static void tlbilx_one(struct kvmppc_vcpu_e500 *vcpu_e500, int pid,
+  int ra, int rb)
+{
+   int tlbsel, esel;
+   gva_t ea;
+
+   ea = kvmppc_get_gpr(&vcpu_e500->vcpu, rb);
+   if (ra)
+   ea += kvmppc_get_gpr(&vcpu_e500->vcpu, ra);
+
+   for (tlbsel = 0; tlbsel < 2; tlbsel++) {
+   esel = kvmppc_e500_tlb_index(vcpu_e500, ea, tlbsel, pid, -1);
+   if (esel >= 0) {
+   inval_gtlbe_on_host(vcpu_e500, tlbsel, esel);
+   kvmppc_e500_gtlbe_invalidate(vcpu_e500, tlbsel, esel);
+   break;
+   }
+   }
+}
+
+int kvmppc_e500_emul_tlbilx(struct kvm_vcpu *vcpu, int rt, int ra, int rb)
+{
+   struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+   int pid = get_cur_spid(vcpu);
+
+   if (rt == 0 || rt == 1) {
+   tlbilx_all(vcpu_e500, 0, pid, rt);
+   tlbilx_all(vcpu_e500, 1, pid, rt);
+   } else if (rt == 3) {
+   tlbilx_one(vcpu_e500, pid, ra, rb);
+   }
+
+   return EMULATE_DONE;
+}
+
 int kvmppc_e500_emul_tlbre(struct kvm_vcpu *vcpu)
 {
struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-- 
1.6.0.2
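
The dispatch on the T field in this patch can be sketched in userspace as
below. This is a simplified model, not the emulation code: struct gtlbe is a
hypothetical flat entry with an exact-match address, there is a single TLB
array instead of TLB0/TLB1, and the address lookup does not reproduce what
kvmppc_e500_tlb_index() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct gtlbe {			/* hypothetical, simplified guest TLB entry */
	bool valid;
	int tid;
	unsigned long ea;	/* exact-match address in this model */
};

/* T = 0: drop everything; T = 1: drop entries whose TID equals pid. */
static void tlbilx_all(struct gtlbe *tlb, size_t n, int pid, int t)
{
	for (size_t i = 0; i < n; i++)
		if (t == 0 || tlb[i].tid == pid)
			tlb[i].valid = false;
}

/* T = 3: drop the first valid entry matching both pid and address. */
static void tlbilx_one(struct gtlbe *tlb, size_t n, int pid, unsigned long ea)
{
	for (size_t i = 0; i < n; i++) {
		if (tlb[i].valid && tlb[i].tid == pid && tlb[i].ea == ea) {
			tlb[i].valid = false;
			break;
		}
	}
}

/* Dispatch on T the same way kvmppc_e500_emul_tlbilx() does on rt. */
void emul_tlbilx(struct gtlbe *tlb, size_t n, int pid, int t, unsigned long ea)
{
	assert(tlb != NULL || n == 0);

	if (t == 0 || t == 1)
		tlbilx_all(tlb, n, pid, t);
	else if (t == 3)
		tlbilx_one(tlb, n, pid, ea);
	/* Other T values fall through without effect, as in the patch. */
}
```

As in the patch, T = 2 is silently ignored rather than reported as an
emulation failure.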



[PATCH 17/38] KVM: PPC: e500mc: implicitly set MSR_GS

2012-02-28 Thread Alexander Graf
When setting MSR for an e500mc guest, we implicitly always set MSR_GS
to make sure the guest is in guest state. Since we have this implicit
rule there, we don't need to explicitly pass MSR_GS to set_msr().

Remove all explicit setters of MSR_GS.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/booke.c |   11 +--
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 85bd5b8..fcbe928 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -280,7 +280,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 unsigned int priority)
 {
int allowed = 0;
-   ulong uninitialized_var(msr_mask);
+   ulong msr_mask = 0;
bool update_esr = false, update_dear = false;
ulong crit_raw = vcpu->arch.shared->critical;
ulong crit_r1 = kvmppc_get_gpr(vcpu, 1);
@@ -322,20 +322,19 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
case BOOKE_IRQPRIO_AP_UNAVAIL:
case BOOKE_IRQPRIO_ALIGNMENT:
allowed = 1;
-   msr_mask = MSR_GS | MSR_CE | MSR_ME | MSR_DE;
+   msr_mask = MSR_CE | MSR_ME | MSR_DE;
int_class = INT_CLASS_NONCRIT;
break;
case BOOKE_IRQPRIO_CRITICAL:
case BOOKE_IRQPRIO_DBELL_CRIT:
allowed = vcpu->arch.shared->msr & MSR_CE;
allowed = allowed && !crit;
-   msr_mask = MSR_GS | MSR_ME;
+   msr_mask = MSR_ME;
int_class = INT_CLASS_CRIT;
break;
case BOOKE_IRQPRIO_MACHINE_CHECK:
allowed = vcpu->arch.shared->msr & MSR_ME;
allowed = allowed && !crit;
-   msr_mask = MSR_GS;
int_class = INT_CLASS_MC;
break;
case BOOKE_IRQPRIO_DECREMENTER:
@@ -346,13 +345,13 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
case BOOKE_IRQPRIO_DBELL:
allowed = vcpu->arch.shared->msr & MSR_EE;
allowed = allowed && !crit;
-   msr_mask = MSR_GS | MSR_CE | MSR_ME | MSR_DE;
+   msr_mask = MSR_CE | MSR_ME | MSR_DE;
int_class = INT_CLASS_NONCRIT;
break;
case BOOKE_IRQPRIO_DEBUG:
allowed = vcpu->arch.shared->msr & MSR_DE;
allowed = allowed && !crit;
-   msr_mask = MSR_GS | MSR_ME;
+   msr_mask = MSR_ME;
int_class = INT_CLASS_CRIT;
break;
}
-- 
1.6.0.2
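
A small sketch of the rule this patch relies on: if the MSR setter always
ORs in MSR_GS, callers can pass masks without it. The struct vcpu, set_msr()
and deliver_irq() below are hypothetical userspace stand-ins, and the bit
values are given for illustration only (the authoritative definitions are in
the powerpc reg headers).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values; see the powerpc reg headers for the real ones. */
#define MSR_GS (1u << 28)
#define MSR_CE (1u << 17)
#define MSR_ME (1u << 12)
#define MSR_DE (1u << 9)

struct vcpu {			/* hypothetical stand-in for the vcpu state */
	uint32_t msr;
};

/* The setter enforces guest state, so callers never need to pass MSR_GS. */
void set_msr(struct vcpu *vcpu, uint32_t new_msr)
{
	vcpu->msr = new_msr | MSR_GS;
	assert(vcpu->msr & MSR_GS);
}

/* Interrupt delivery keeps only the bits named in msr_mask. */
void deliver_irq(struct vcpu *vcpu, uint32_t msr_mask)
{
	set_msr(vcpu, vcpu->msr & msr_mask);
}
```

This also shows why msr_mask is now initialized to 0 in the patch: the
machine check case no longer assigns it, and a zero mask still ends up with
MSR_GS set by the setter.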



[PATCH 12/38] powerpc/booke: Provide exception macros with interrupt name

2012-02-28 Thread Alexander Graf
From: Scott Wood 

DO_KVM will need to identify the particular exception type.

There is an existing set of arbitrary numbers that Linux passes,
but it's an undocumented mess that sort of corresponds to server/classic
exception vectors but not really.

Signed-off-by: Scott Wood 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/kernel/head_44x.S   |   23 +--
 arch/powerpc/kernel/head_booke.h |   41 ++
 arch/powerpc/kernel/head_fsl_booke.S |   52 +-
 3 files changed, 68 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/kernel/head_44x.S b/arch/powerpc/kernel/head_44x.S
index 7dd2981..d1192c5 100644
--- a/arch/powerpc/kernel/head_44x.S
+++ b/arch/powerpc/kernel/head_44x.S
@@ -248,10 +248,11 @@ _ENTRY(_start);
 
 interrupt_base:
/* Critical Input Interrupt */
-   CRITICAL_EXCEPTION(0x0100, CriticalInput, unknown_exception)
+   CRITICAL_EXCEPTION(0x0100, CRITICAL, CriticalInput, unknown_exception)
 
/* Machine Check Interrupt */
-   CRITICAL_EXCEPTION(0x0200, MachineCheck, machine_check_exception)
+   CRITICAL_EXCEPTION(0x0200, MACHINE_CHECK, MachineCheck, \
+  machine_check_exception)
MCHECK_EXCEPTION(0x0210, MachineCheckA, machine_check_exception)
 
/* Data Storage Interrupt */
@@ -261,7 +262,8 @@ interrupt_base:
INSTRUCTION_STORAGE_EXCEPTION
 
/* External Input Interrupt */
-   EXCEPTION(0x0500, ExternalInput, do_IRQ, EXC_XFER_LITE)
+   EXCEPTION(0x0500, BOOKE_INTERRUPT_EXTERNAL, ExternalInput, \
+ do_IRQ, EXC_XFER_LITE)
 
/* Alignment Interrupt */
ALIGNMENT_EXCEPTION
@@ -273,29 +275,32 @@ interrupt_base:
 #ifdef CONFIG_PPC_FPU
FP_UNAVAILABLE_EXCEPTION
 #else
-   EXCEPTION(0x2010, FloatingPointUnavailable, unknown_exception, EXC_XFER_EE)
+   EXCEPTION(0x2010, BOOKE_INTERRUPT_FP_UNAVAIL, \
+ FloatingPointUnavailable, unknown_exception, EXC_XFER_EE)
 #endif
/* System Call Interrupt */
START_EXCEPTION(SystemCall)
-   NORMAL_EXCEPTION_PROLOG
+   NORMAL_EXCEPTION_PROLOG(BOOKE_INTERRUPT_SYSCALL)
EXC_XFER_EE_LITE(0x0c00, DoSyscall)
 
/* Auxiliary Processor Unavailable Interrupt */
-   EXCEPTION(0x2020, AuxillaryProcessorUnavailable, unknown_exception, EXC_XFER_EE)
+   EXCEPTION(0x2020, BOOKE_INTERRUPT_AP_UNAVAIL, \
+ AuxillaryProcessorUnavailable, unknown_exception, EXC_XFER_EE)
 
/* Decrementer Interrupt */
DECREMENTER_EXCEPTION
 
/* Fixed Internal Timer Interrupt */
/* TODO: Add FIT support */
-   EXCEPTION(0x1010, FixedIntervalTimer, unknown_exception, EXC_XFER_EE)
+   EXCEPTION(0x1010, BOOKE_INTERRUPT_FIT, FixedIntervalTimer, \
+ unknown_exception, EXC_XFER_EE)
 
/* Watchdog Timer Interrupt */
/* TODO: Add watchdog support */
 #ifdef CONFIG_BOOKE_WDT
-   CRITICAL_EXCEPTION(0x1020, WatchdogTimer, WatchdogException)
+   CRITICAL_EXCEPTION(0x1020, WATCHDOG, WatchdogTimer, WatchdogException)
 #else
-   CRITICAL_EXCEPTION(0x1020, WatchdogTimer, unknown_exception)
+   CRITICAL_EXCEPTION(0x1020, WATCHDOG, WatchdogTimer, unknown_exception)
 #endif
 
/* Data TLB Error Interrupt */
diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index fc921bf..06ab353 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -2,6 +2,8 @@
 #define __HEAD_BOOKE_H__
 
 #include /* for STACK_FRAME_REGS_MARKER */
+#include 
+
 /*
  * Macros used for common Book-e exception handling
  */
@@ -28,7 +30,7 @@
  */
 #define THREAD_NORMSAVE(offset)(THREAD_NORMSAVES + (offset * 4))
 
-#define NORMAL_EXCEPTION_PROLOG                                             \
+#define NORMAL_EXCEPTION_PROLOG(intno)                                      \
mtspr   SPRN_SPRG_WSCRATCH0, r10;   /* save one register */  \
mfspr   r10, SPRN_SPRG_THREAD;   \
stw r11, THREAD_NORMSAVE(0)(r10);\
@@ -113,7 +115,7 @@
  * registers as the normal prolog above. Instead we use a portion of the
  * critical/machine check exception stack at low physical addresses.
  */
-#define EXC_LEVEL_EXCEPTION_PROLOG(exc_level, exc_level_srr0, exc_level_srr1) \
+#define EXC_LEVEL_EXCEPTION_PROLOG(exc_level, intno, exc_level_srr0, exc_level_srr1) \
mtspr   SPRN_SPRG_WSCRATCH_##exc_level,r8;   \
BOOKE_LOAD_EXC_LEVEL_STACK(exc_level);/* r8 points to the exc_level stack*/ \
stw r9,GPR9(r8);/* save various registers  */\
@@ -162,12 +164,13 @@
SAVE_4GPRS(3, r11);  \
SAVE_2GPRS(7, r11)
 
-#define CRITICAL_EXCEPTION_PROLOG \
-   EXC_LEVEL_EXCEPTION_PROLOG(CRIT, S

[PATCH 16/38] KVM: PPC: e500mc: Add doorbell emulation support

2012-02-28 Thread Alexander Graf
When one vcpu wants to kick another, it can issue a special IPI instruction
called msgsnd. This patch emulates this instruction, its clearing counterpart
and the infrastructure required to actually trigger that interrupt inside
a guest vcpu.

With this patch, SMP guests on e500mc work.
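
The decoding that the emulation relies on can be sketched in isolation as
below. This is a minimal standalone sketch, not the kernel code: the type
shift and the PIR mask follow the macros in this patch, while the broadcast
bit value, the SKETCH_* priority numbers and the sketch_* function names are
assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Type shift and PIR mask follow the macros in the patch.
 * The broadcast bit value is an assumption for this sketch. */
#define SKETCH_DBELL_TYPE(x)     (((uint32_t)(x) & 0xf) << (63 - 36))
#define SKETCH_DBELL_TYPE_MASK   SKETCH_DBELL_TYPE(0xf)
#define SKETCH_DBELL_MSG_BRDCAST UINT32_C(0x04000000)
#define SKETCH_DBELL_PIR_MASK    UINT32_C(0x3fff)

enum { SKETCH_DBELL = 0, SKETCH_DBELL_CRIT = 1 };

/* Hypothetical priority numbers; the real ones live in the KVM headers. */
enum { SKETCH_PRIO_DBELL = 10, SKETCH_PRIO_DBELL_CRIT = 11 };

/* Map the message type field to an interrupt priority, -1 if unsupported. */
static int sketch_dbell2prio(uint32_t param)
{
    uint32_t type = param & SKETCH_DBELL_TYPE_MASK;

    if (type == SKETCH_DBELL_TYPE(SKETCH_DBELL))
        return SKETCH_PRIO_DBELL;
    if (type == SKETCH_DBELL_TYPE(SKETCH_DBELL_CRIT))
        return SKETCH_PRIO_DBELL_CRIT;
    return -1;
}

/* A vcpu is a target if the message is a broadcast or its PIR matches. */
static bool sketch_dbell_targets(uint32_t param, uint32_t cpir)
{
    return (param & SKETCH_DBELL_MSG_BRDCAST) ||
           (param & SKETCH_DBELL_PIR_MASK) == cpir;
}
```

Note that the priority is derived from the message value held in the
register, not from the register number.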

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - introduce and use constants
  - drop e500mc ifdefs
---
 arch/powerpc/include/asm/dbell.h |2 +
 arch/powerpc/kvm/booke.c |2 +
 arch/powerpc/kvm/e500_emulate.c  |   68 ++
 3 files changed, 72 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/dbell.h b/arch/powerpc/include/asm/dbell.h
index d7365b0..154c067 100644
--- a/arch/powerpc/include/asm/dbell.h
+++ b/arch/powerpc/include/asm/dbell.h
@@ -19,7 +19,9 @@
 
 #define PPC_DBELL_MSG_BRDCAST  (0x0400)
 #define PPC_DBELL_TYPE(x)  (((x) & 0xf) << (63-36))
+#define PPC_DBELL_TYPE_MASK PPC_DBELL_TYPE(0xf)
 #define PPC_DBELL_LPID(x)  ((x) << (63 - 49))
+#define PPC_DBELL_PIR_MASK 0x3fff
 enum ppc_dbell {
PPC_DBELL = 0,  /* doorbell */
PPC_DBELL_CRIT = 1, /* critical doorbell */
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 0b77be1..85bd5b8 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -326,6 +326,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
int_class = INT_CLASS_NONCRIT;
break;
case BOOKE_IRQPRIO_CRITICAL:
+   case BOOKE_IRQPRIO_DBELL_CRIT:
allowed = vcpu->arch.shared->msr & MSR_CE;
allowed = allowed && !crit;
msr_mask = MSR_GS | MSR_ME;
@@ -342,6 +343,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
keep_irq = true;
/* fall through */
case BOOKE_IRQPRIO_EXTERNAL:
+   case BOOKE_IRQPRIO_DBELL:
allowed = vcpu->arch.shared->msr & MSR_EE;
allowed = allowed && !crit;
msr_mask = MSR_GS | MSR_CE | MSR_ME | MSR_DE;
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index 98b6c1c..99155f8 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -14,16 +14,74 @@
 
 #include 
 #include 
+#include 
 
 #include "booke.h"
 #include "e500.h"
 
+#define XOP_MSGSND  206
+#define XOP_MSGCLR  238
 #define XOP_TLBIVAX 786
 #define XOP_TLBSX   914
 #define XOP_TLBRE   946
 #define XOP_TLBWE   978
 #define XOP_TLBILX  18
 
+#ifdef CONFIG_KVM_E500MC
+static int dbell2prio(ulong param)
+{
+   int msg = param & PPC_DBELL_TYPE_MASK;
+   int prio = -1;
+
+   switch (msg) {
+   case PPC_DBELL_TYPE(PPC_DBELL):
+   prio = BOOKE_IRQPRIO_DBELL;
+   break;
+   case PPC_DBELL_TYPE(PPC_DBELL_CRIT):
+   prio = BOOKE_IRQPRIO_DBELL_CRIT;
+   break;
+   default:
+   break;
+   }
+
+   return prio;
+}
+
+static int kvmppc_e500_emul_msgclr(struct kvm_vcpu *vcpu, int rb)
+{
+   ulong param = vcpu->arch.gpr[rb];
+   int prio = dbell2prio(param);
+
+   if (prio < 0)
+   return EMULATE_FAIL;
+
+   clear_bit(prio, &vcpu->arch.pending_exceptions);
+   return EMULATE_DONE;
+}
+
+static int kvmppc_e500_emul_msgsnd(struct kvm_vcpu *vcpu, int rb)
+{
+   ulong param = vcpu->arch.gpr[rb];
+   int prio = dbell2prio(param);
+   int pir = param & PPC_DBELL_PIR_MASK;
+   int i;
+   struct kvm_vcpu *cvcpu;
+
+   if (prio < 0)
+   return EMULATE_FAIL;
+
+   kvm_for_each_vcpu(i, cvcpu, vcpu->kvm) {
+   int cpir = cvcpu->arch.shared->pir;
+   if ((param & PPC_DBELL_MSG_BRDCAST) || (cpir == pir)) {
+   set_bit(prio, &cvcpu->arch.pending_exceptions);
+   kvm_vcpu_kick(cvcpu);
+   }
+   }
+
+   return EMULATE_DONE;
+}
+#endif
+
 int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
unsigned int inst, int *advance)
 {
@@ -36,6 +94,16 @@ int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
case 31:
switch (get_xop(inst)) {
 
+#ifdef CONFIG_KVM_E500MC
+   case XOP_MSGSND:
+   emulated = kvmppc_e500_emul_msgsnd(vcpu, get_rb(inst));
+   break;
+
+   case XOP_MSGCLR:
+   emulated = kvmppc_e500_emul_msgclr(vcpu, get_rb(inst));
+   break;
+#endif
+
case XOP_TLBRE:
emulated = kvmppc_e500_emul_tlbre(vcpu);
break;
-- 
1.6.0.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 24/38] KVM: PPC: booke: rework rescheduling checks

2012-02-28 Thread Alexander Graf
Instead of checking whether we should reschedule only when we exited
due to an interrupt, let's always check before re-entering the guest.
This brings the target more in line with the other archs.

Also while at it, generalize the whole thing so that eventually we could
have a single kvmppc_prepare_to_enter function for all ppc targets that
does signal and reschedule checking for us.
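
The control flow of such a common helper can be sketched outside the kernel
as below. This is only a simulation under stated assumptions: struct
sketch_state and the sketch_* functions are hypothetical stand-ins for
need_resched(), signal_pending() and the core prepare hook, and no interrupt
state is modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state the real helper inspects. */
struct sketch_state {
    int  resched_requests; /* how often a reschedule is still needed */
    bool signal_pending;   /* a signal is pending for the task */
    int  core_reenabled;   /* how often core prep re-enabled interrupts */
    int  iterations;       /* loop passes, recorded for inspection */
};

/* Returns true if interrupts got enabled in between (restart needed). */
static bool sketch_core_prepare(struct sketch_state *s)
{
    if (s->core_reenabled > 0) {
        s->core_reenabled--;
        return true;
    }
    return false;
}

/* Returns 1 if a signal is pending and check_signal is true, else 0. */
static int sketch_prepare_to_enter(struct sketch_state *s, bool check_signal)
{
    for (;;) {
        s->iterations++;

        if (s->resched_requests > 0) {
            s->resched_requests--; /* stands in for cond_resched() */
            continue;
        }

        if (check_signal && s->signal_pending)
            return 1;

        if (sketch_core_prepare(s))
            continue; /* back at square one */

        return 0;
    }
}
```

Every path that may have opened an interrupt window restarts the loop, so
the reschedule and signal checks are always repeated before the guest is
entered.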

Signed-off-by: Alexander Graf 

---

v2 -> v3:

  - check for signals earlier
---
 arch/powerpc/include/asm/kvm_ppc.h |2 +-
 arch/powerpc/kvm/book3s.c  |4 +-
 arch/powerpc/kvm/booke.c   |   72 +---
 3 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e709975..7f0a3da 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -95,7 +95,7 @@ extern int kvmppc_core_vcpu_translate(struct kvm_vcpu *vcpu,
 extern void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 extern void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu);
 
-extern void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu);
+extern int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu);
 extern int kvmppc_core_pending_dec(struct kvm_vcpu *vcpu);
 extern void kvmppc_core_queue_program(struct kvm_vcpu *vcpu, ulong flags);
 extern void kvmppc_core_queue_dec(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 7d54f4e..c8ead7b 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -258,7 +258,7 @@ static bool clear_irqprio(struct kvm_vcpu *vcpu, unsigned int priority)
return true;
 }
 
-void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
+int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 {
unsigned long *pending = &vcpu->arch.pending_exceptions;
unsigned long old_pending = vcpu->arch.pending_exceptions;
@@ -283,6 +283,8 @@ void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 
/* Tell the guest about our interrupt status */
kvmppc_update_int_pending(vcpu, *pending, old_pending);
+
+   return 0;
 }
 
 pfn_t kvmppc_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 9979be1..3da0e42 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -439,8 +439,9 @@ static void kvmppc_core_check_exceptions(struct kvm_vcpu *vcpu)
 }
 
 /* Check pending exceptions and deliver one, if possible. */
-void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
+int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 {
+   int r = 0;
WARN_ON_ONCE(!irqs_disabled());
 
kvmppc_core_check_exceptions(vcpu);
@@ -451,8 +452,46 @@ void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
local_irq_disable();
 
kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
-   kvmppc_core_check_exceptions(vcpu);
+   r = 1;
};
+
+   return r;
+}
+
+/*
+ * Common checks before entering the guest world.  Call with interrupts
+ * disabled.
+ *
+ * returns !0 if a signal is pending and check_signal is true
+ */
+static int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu, bool check_signal)
+{
+   int r = 0;
+
+   WARN_ON_ONCE(!irqs_disabled());
+   while (true) {
+   if (need_resched()) {
+   local_irq_enable();
+   cond_resched();
+   local_irq_disable();
+   continue;
+   }
+
+   if (check_signal && signal_pending(current)) {
+   r = 1;
+   break;
+   }
+
+   if (kvmppc_core_prepare_to_enter(vcpu)) {
+   /* interrupts got enabled in between, so we
+  are back at square 1 */
+   continue;
+   }
+
+   break;
+   }
+
+   return r;
 }
 
 int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
@@ -470,10 +509,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
}
 
local_irq_disable();
-
-   kvmppc_core_prepare_to_enter(vcpu);
-
-   if (signal_pending(current)) {
+   if (kvmppc_prepare_to_enter(vcpu, true)) {
kvm_run->exit_reason = KVM_EXIT_INTR;
ret = -EINTR;
goto out;
@@ -598,25 +634,21 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 
switch (exit_nr) {
case BOOKE_INTERRUPT_MACHINE_CHECK:
-   kvm_resched(vcpu);
r = RESUME_GUEST;
break;
 
case BOOKE_INTERRUPT_EXTERNAL:
kvmppc_account_exit(vcpu, EXT_INTR_EXITS);
-   kvm_resched(vcpu);
r = RESUME_GUEST;
break;
 
case BOOKE_INTERRUPT_DECREMENTER:

[PATCH 25/38] KVM: PPC: booke: BOOKE_IRQPRIO_MAX is n+1

2012-02-28 Thread Alexander Graf
The semantics of BOOKE_IRQPRIO_MAX changed to denote the highest available
irqprio + 1, so let's reflect that in the code too.
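
As a small illustration of the exclusive bound, assuming a hypothetical set
of three priorities (the SKETCH_* names are not from the kernel):

```c
/* Hypothetical priorities; SKETCH_IRQPRIO_MAX is highest valid prio + 1. */
enum {
    SKETCH_IRQPRIO_A,
    SKETCH_IRQPRIO_B,
    SKETCH_IRQPRIO_C,
    SKETCH_IRQPRIO_MAX
};

/* Return the lowest pending priority, or -1 if no valid one is pending. */
static int sketch_first_pending(unsigned long pending)
{
    /* '<' rather than '<=': bit SKETCH_IRQPRIO_MAX is not a valid prio. */
    for (unsigned int prio = 0; prio < SKETCH_IRQPRIO_MAX; prio++) {
        if (pending & (1UL << prio))
            return (int)prio;
    }
    return -1;
}
```

With an inclusive bound the loop would also treat bit SKETCH_IRQPRIO_MAX as
deliverable, which is the off-by-one this patch removes.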

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/booke.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 3da0e42..11b0625 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -425,7 +425,7 @@ static void kvmppc_core_check_exceptions(struct kvm_vcpu *vcpu)
}
 
priority = __ffs(*pending);
-   while (priority <= BOOKE_IRQPRIO_MAX) {
+   while (priority < BOOKE_IRQPRIO_MAX) {
if (kvmppc_booke_irqprio_deliver(vcpu, priority))
break;
 
-- 
1.6.0.2



[PATCH 26/38] KVM: PPC: bookehv: fix exit timing

2012-02-28 Thread Alexander Graf
When using exit timing stats, we clobber r9 in the NEED_EMU case,
so better move that part down a few lines and fix it that way.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/bookehv_interrupts.S |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S
index f7dc3f6..215381e 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -83,10 +83,6 @@
stw r10, VCPU_GUEST_PID(r4)
mtspr   SPRN_PID, r8
 
-   .if \flags & NEED_EMU
-   lwz r9, VCPU_KVM(r4)
-   .endif
-
 #ifdef CONFIG_KVM_EXIT_TIMING
/* save exit time */
 1: mfspr   r7, SPRN_TBRU
@@ -98,6 +94,10 @@
PPC_STL r9, VCPU_TIMING_EXIT_TBU(r4)
 #endif
 
+   .if \flags & NEED_EMU
+   lwz r9, VCPU_KVM(r4)
+   .endif
+
oris    r8, r6, MSR_CE@h
 #ifndef CONFIG_64BIT
stw r6, (VCPU_SHARED_MSR + 4)(r11)
-- 
1.6.0.2



[PATCH 20/38] KVM: PPC: rename CONFIG_KVM_E500 -> CONFIG_KVM_E500V2

2012-02-28 Thread Alexander Graf
The CONFIG_KVM_E500 option really indicates that we're running on a V2 machine,
not on a machine of the generic E500 class. So indicate that properly and
change the config name accordingly.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/Kconfig|8 
 arch/powerpc/kvm/Makefile   |4 ++--
 arch/powerpc/kvm/booke.c|2 +-
 arch/powerpc/kvm/e500.h |6 +++---
 arch/powerpc/kvm/e500_tlb.c |2 +-
 arch/powerpc/kvm/powerpc.c  |8 
 6 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 58f6e68..44a998d 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -109,7 +109,7 @@ config KVM_440
 
 config KVM_EXIT_TIMING
bool "Detailed exit timing"
-   depends on KVM_440 || KVM_E500 || KVM_E500MC
+   depends on KVM_440 || KVM_E500V2 || KVM_E500MC
---help---
  Calculate elapsed time for every exit/enter cycle. A per-vcpu
  report is available in debugfs kvm/vm#_vcpu#_timing.
@@ -118,14 +118,14 @@ config KVM_EXIT_TIMING
 
  If unsure, say N.
 
-config KVM_E500
-   bool "KVM support for PowerPC E500 processors"
+config KVM_E500V2
+   bool "KVM support for PowerPC E500v2 processors"
depends on EXPERIMENTAL && E500
select KVM
select KVM_MMIO
---help---
  Support running unmodified E500 guest kernels in virtual machines on
- E500 host processors.
+ E500v2 host processors.
 
  This module provides access to the hardware capabilities through
  a character device node named /dev/kvm.
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 62febd7..25225ae 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -36,7 +36,7 @@ kvm-e500-objs := \
e500.o \
e500_tlb.o \
e500_emulate.o
-kvm-objs-$(CONFIG_KVM_E500) := $(kvm-e500-objs)
+kvm-objs-$(CONFIG_KVM_E500V2) := $(kvm-e500-objs)
 
 kvm-e500mc-objs := \
$(common-objs-y) \
@@ -98,7 +98,7 @@ kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
 kvm-objs := $(kvm-objs-m) $(kvm-objs-y)
 
 obj-$(CONFIG_KVM_440) += kvm.o
-obj-$(CONFIG_KVM_E500) += kvm.o
+obj-$(CONFIG_KVM_E500V2) += kvm.o
 obj-$(CONFIG_KVM_E500MC) += kvm.o
 obj-$(CONFIG_KVM_BOOK3S_64) += kvm.o
 obj-$(CONFIG_KVM_BOOK3S_32) += kvm.o
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index fcbe928..9fcc760 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -762,7 +762,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
gpa_t gpaddr;
gfn_t gfn;
 
-#ifdef CONFIG_KVM_E500
+#ifdef CONFIG_KVM_E500V2
if (!(vcpu->arch.shared->msr & MSR_PR) &&
(eaddr & PAGE_MASK) == vcpu->arch.magic_page_ea) {
kvmppc_map_magic(vcpu);
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 3143085..7967f3f 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -39,7 +39,7 @@ struct tlbe_priv {
struct tlbe_ref ref; /* TLB0 only -- TLB1 uses tlb_refs */
 };
 
-#ifdef CONFIG_KVM_E500
+#ifdef CONFIG_KVM_E500V2
 struct vcpu_id_table;
 #endif
 
@@ -89,7 +89,7 @@ struct kvmppc_vcpu_e500 {
u64 *g2h_tlb1_map;
unsigned int *h2g_tlb1_rmap;
 
-#ifdef CONFIG_KVM_E500
+#ifdef CONFIG_KVM_E500V2
u32 pid[E500_PID_NUM];
 
/* vcpu id table */
@@ -136,7 +136,7 @@ void kvmppc_get_sregs_e500_tlb(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
 int kvmppc_set_sregs_e500_tlb(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
 
 
-#ifdef CONFIG_KVM_E500
+#ifdef CONFIG_KVM_E500V2
 unsigned int kvmppc_e500_get_sid(struct kvmppc_vcpu_e500 *vcpu_e500,
 unsigned int as, unsigned int gid,
 unsigned int pr, int avoid_recursion);
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index e232bb4..279e10a 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -156,7 +156,7 @@ static inline void write_host_tlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
}
 }
 
-#ifdef CONFIG_KVM_E500
+#ifdef CONFIG_KVM_E500V2
 void kvmppc_map_magic(struct kvm_vcpu *vcpu)
 {
struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 58a084f..26c6a8d 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -74,7 +74,7 @@ int kvmppc_kvm_pv(struct kvm_vcpu *vcpu)
}
case HC_VENDOR_KVM | KVM_HC_FEATURES:
r = HC_EV_SUCCESS;
-#if defined(CONFIG_PPC_BOOK3S) || defined(CONFIG_KVM_E500)
+#if defined(CONFIG_PPC_BOOK3S) || defined(CONFIG_KVM_E500V2)
/* XXX Missing magic page on 44x */
r2 |= (1 << KVM_FEATURE_MAGIC_PAGE);
 #endif
@@ -230,7 +230,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_

[PATCH 28/38] KVM: PPC: bookehv: remove SET_VCPU

2012-02-28 Thread Alexander Graf
The SET_VCPU macro is a leftover from times when the vcpu struct wasn't
stored in the thread on vcpu_load/put. It's not needed anymore. Remove it.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/bookehv_interrupts.S |8 
 1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S
index c5a0796..469bd3f 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -35,9 +35,6 @@
 #define GET_VCPU(vcpu, thread) \
PPC_LL  vcpu, THREAD_KVM_VCPU(thread)
 
-#define SET_VCPU(vcpu) \
-   PPC_STL vcpu, (THREAD + THREAD_KVM_VCPU)(r2)
-
 #define LONGBYTES  (BITS_PER_LONG / 8)
 
 #define VCPU_GPR(n) (VCPU_GPRS + (n * LONGBYTES))
@@ -517,11 +514,6 @@ lightweight_exit:
lwz r3, VCPU_GUEST_PID(r4)
mtspr   SPRN_PID, r3
 
-   /* Save vcpu pointer for the exception handlers
-* must be done before loading guest r2.
-*/
-// SET_VCPU(r4)
-
PPC_LL  r11, VCPU_SHARED(r4)
/* Save host mas4 and mas6 and load guest MAS registers */
mfspr   r3, SPRN_MAS4
-- 
1.6.0.2



[PATCH 27/38] KVM: PPC: bookehv: remove negation for CONFIG_64BIT

2012-02-28 Thread Alexander Graf
Instead of doing

  #ifndef CONFIG_64BIT
  ...
  #else
  ...
  #endif

we should rather do

  #ifdef CONFIG_64BIT
  ...
  #else
  ...
  #endif

which is a lot easier to read. Change the bookehv implementation to
stick with this rule.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/bookehv_interrupts.S |   24 
 1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S
index 215381e..c5a0796 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -99,10 +99,10 @@
.endif
 
oris    r8, r6, MSR_CE@h
-#ifndef CONFIG_64BIT
-   stw r6, (VCPU_SHARED_MSR + 4)(r11)
-#else
+#ifdef CONFIG_64BIT
std r6, (VCPU_SHARED_MSR)(r11)
+#else
+   stw r6, (VCPU_SHARED_MSR + 4)(r11)
 #endif
ori r8, r8, MSR_ME | MSR_RI
PPC_STL r5, VCPU_PC(r4)
@@ -344,10 +344,10 @@ _GLOBAL(kvmppc_resume_host)
stw r5, VCPU_SHARED_MAS0(r11)
mfspr   r7, SPRN_MAS2
stw r6, VCPU_SHARED_MAS1(r11)
-#ifndef CONFIG_64BIT
-   stw r7, (VCPU_SHARED_MAS2 + 4)(r11)
-#else
+#ifdef CONFIG_64BIT
std r7, (VCPU_SHARED_MAS2)(r11)
+#else
+   stw r7, (VCPU_SHARED_MAS2 + 4)(r11)
 #endif
mfspr   r5, SPRN_MAS3
mfspr   r6, SPRN_MAS4
@@ -530,10 +530,10 @@ lightweight_exit:
stw r3, VCPU_HOST_MAS6(r4)
lwz r3, VCPU_SHARED_MAS0(r11)
lwz r5, VCPU_SHARED_MAS1(r11)
-#ifndef CONFIG_64BIT
-   lwz r6, (VCPU_SHARED_MAS2 + 4)(r11)
-#else
+#ifdef CONFIG_64BIT
ld  r6, (VCPU_SHARED_MAS2)(r11)
+#else
+   lwz r6, (VCPU_SHARED_MAS2 + 4)(r11)
 #endif
lwz r7, VCPU_SHARED_MAS7_3+4(r11)
lwz r8, VCPU_SHARED_MAS4(r11)
@@ -572,10 +572,10 @@ lightweight_exit:
PPC_LL  r6, VCPU_CTR(r4)
PPC_LL  r7, VCPU_CR(r4)
PPC_LL  r8, VCPU_PC(r4)
-#ifndef CONFIG_64BIT
-   lwz r9, (VCPU_SHARED_MSR + 4)(r11)
-#else
+#ifdef CONFIG_64BIT
ld  r9, (VCPU_SHARED_MSR)(r11)
+#else
+   lwz r9, (VCPU_SHARED_MSR + 4)(r11)
 #endif
PPC_LL  r0, VCPU_GPR(r0)(r4)
PPC_LL  r1, VCPU_GPR(r1)(r4)
-- 
1.6.0.2



[PATCH 15/38] KVM: PPC: e500mc support

2012-02-28 Thread Alexander Graf
From: Scott Wood 

Add processor support for e500mc, using hardware virtualization support
(GS-mode).

Current issues include:
 - No support for external proxy (coreint) interrupt mode in the guest.

Includes work by Ashish Kalra ,
Varun Sethi , and
Liu Yu .

Signed-off-by: Scott Wood 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/cputable.h   |6 +-
 arch/powerpc/include/asm/kvm.h|1 +
 arch/powerpc/kernel/cpu_setup_fsl_booke.S |1 +
 arch/powerpc/kernel/head_fsl_booke.S  |   46 
 arch/powerpc/kvm/Kconfig  |   17 ++-
 arch/powerpc/kvm/Makefile |   11 +
 arch/powerpc/kvm/e500.h   |   13 +-
 arch/powerpc/kvm/e500_emulate.c   |   24 ++-
 arch/powerpc/kvm/e500_tlb.c   |   21 ++-
 arch/powerpc/kvm/e500mc.c |  342 +
 arch/powerpc/kvm/powerpc.c|6 +-
 11 files changed, 476 insertions(+), 12 deletions(-)
 create mode 100644 arch/powerpc/kvm/e500mc.c

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index 2022f2d..598cd24 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -168,6 +168,7 @@ extern const char *powerpc_base_platform;
 #define CPU_FTR_LWSYNC ASM_CONST(0x0800)
 #define CPU_FTR_NOEXECUTE  ASM_CONST(0x1000)
 #define CPU_FTR_INDEXED_DCR ASM_CONST(0x2000)
+#define CPU_FTR_EMB_HV ASM_CONST(0x4000)
 
 /*
  * Add the 64-bit processor unique features in the top half of the word;
@@ -386,11 +387,11 @@ extern const char *powerpc_base_platform;
CPU_FTR_NODSISRALIGN | CPU_FTR_NOEXECUTE)
 #define CPU_FTRS_E500MC (CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN | \
CPU_FTR_L2CSR | CPU_FTR_LWSYNC | CPU_FTR_NOEXECUTE | \
-   CPU_FTR_DBELL | CPU_FTR_DEBUG_LVL_EXC)
+   CPU_FTR_DBELL | CPU_FTR_DEBUG_LVL_EXC | CPU_FTR_EMB_HV)
 #define CPU_FTRS_E5500 (CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN | \
CPU_FTR_L2CSR | CPU_FTR_LWSYNC | CPU_FTR_NOEXECUTE | \
CPU_FTR_DBELL | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
-   CPU_FTR_DEBUG_LVL_EXC)
+   CPU_FTR_DEBUG_LVL_EXC | CPU_FTR_EMB_HV)
 #define CPU_FTRS_GENERIC_32 (CPU_FTR_COMMON | CPU_FTR_NODSISRALIGN)
 
 /* 64-bit CPUs */
@@ -535,6 +536,7 @@ enum {
 #ifdef CONFIG_PPC_E500MC
CPU_FTRS_E500MC & CPU_FTRS_E5500 &
 #endif
+   ~CPU_FTR_EMB_HV &   /* can be removed at runtime */
CPU_FTRS_POSSIBLE,
 };
 #endif /* __powerpc64__ */
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index b921c3f..1bea4d8 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -277,6 +277,7 @@ struct kvm_sync_regs {
 #define KVM_CPU_E500V2 2
 #define KVM_CPU_3S_32  3
 #define KVM_CPU_3S_64  4
+#define KVM_CPU_E500MC 5
 
 /* for KVM_CAP_SPAPR_TCE */
 struct kvm_create_spapr_tce {
diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
index 8053db0..69fdd23 100644
--- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
+++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
@@ -73,6 +73,7 @@ _GLOBAL(__setup_cpu_e500v2)
mtlr    r4
blr
 _GLOBAL(__setup_cpu_e500mc)
+   mr  r5, r4
mflr    r4
bl  __e500_icache_setup
bl  __e500_dcache_setup
diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index 418931f..88c0a35 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -380,10 +380,16 @@ interrupt_base:
mtspr   SPRN_SPRG_WSCRATCH0, r10 /* Save some working registers */
mfspr   r10, SPRN_SPRG_THREAD
stw r11, THREAD_NORMSAVE(0)(r10)
+#ifdef CONFIG_KVM_BOOKE_HV
+BEGIN_FTR_SECTION
+   mfspr   r11, SPRN_SRR1
+END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
+#endif
stw r12, THREAD_NORMSAVE(1)(r10)
stw r13, THREAD_NORMSAVE(2)(r10)
mfcr    r13
stw r13, THREAD_NORMSAVE(3)(r10)
+   DO_KVM  BOOKE_INTERRUPT_DTLB_MISS SPRN_SRR1
mfspr   r10, SPRN_DEAR  /* Get faulting address */
 
/* If we are faulting a kernel address, we have to use the
@@ -468,10 +474,16 @@ interrupt_base:
mtspr   SPRN_SPRG_WSCRATCH0, r10 /* Save some working registers */
mfspr   r10, SPRN_SPRG_THREAD
stw r11, THREAD_NORMSAVE(0)(r10)
+#ifdef CONFIG_KVM_BOOKE_HV
+BEGIN_FTR_SECTION
+   mfspr   r11, SPRN_SRR1
+END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
+#endif
stw r12, THREAD_NORMSAVE(1)(r10)
stw r13, THREAD_NORMSAVE(2)(r10)
mfcr    r13
stw r13, THREAD_NORMSAVE(3)(r10)
+   DO_KVM  BOOKE_INTERRUPT_ITLB_MISS SPRN_SRR1
mfspr   r10, SPRN_SRR0  /* Get faulting address */
 
/* 

[PATCH 13/38] KVM: PPC: booke: category E.HV (GS-mode) support

2012-02-28 Thread Alexander Graf
From: Scott Wood 

Chips such as e500mc that implement category E.HV in Power ISA 2.06
provide hardware virtualization features, including a new MSR mode for
guest state.  The guest OS can perform many operations without trapping
into the hypervisor, including transitions to and from guest userspace.

Since we can use SRR1[GS] to reliably tell whether an exception came from
guest state, instead of messing around with IVPR, we use DO_KVM similarly
to book3s.
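
In C terms, the test that DO_KVM performs on the saved SRR1 value amounts to
the sketch below. It is an illustration only: the bit position used for
SKETCH_MSR_GS is an assumption of this sketch, and the sketch_* names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the guest-state bit in the saved MSR image. */
#define SKETCH_MSR_GS (UINT32_C(1) << 28)

enum sketch_handler { SKETCH_HOST_HANDLER, SKETCH_KVM_HANDLER };

/* True if the exception was taken while the CPU was in guest state. */
static bool sketch_from_guest(uint32_t srr1)
{
    return (srr1 & SKETCH_MSR_GS) != 0;
}

/* Route the exception: guest-state exceptions go to KVM, others to host. */
static enum sketch_handler sketch_route(uint32_t srr1)
{
    return sketch_from_guest(srr1) ? SKETCH_KVM_HANDLER : SKETCH_HOST_HANDLER;
}
```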

Current issues include:
 - Machine checks from guest state are not routed to the host handler.
 - The guest can cause a host oops by executing an emulated instruction
   in a page that lacks read permission.  Existing e500/4xx support has
   the same problem.

Includes work by Ashish Kalra ,
Varun Sethi , and
Liu Yu .

Signed-off-by: Scott Wood 
[agraf: remove pt_regs usage]
Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - ESR -> GESR
---
 arch/powerpc/include/asm/dbell.h|1 +
 arch/powerpc/include/asm/kvm_asm.h  |8 +
 arch/powerpc/include/asm/kvm_booke_hv_asm.h |   49 +++
 arch/powerpc/include/asm/kvm_host.h |   19 +-
 arch/powerpc/include/asm/kvm_ppc.h  |3 +
 arch/powerpc/include/asm/mmu-book3e.h   |6 +
 arch/powerpc/include/asm/processor.h|3 +
 arch/powerpc/include/asm/reg.h  |2 +
 arch/powerpc/include/asm/reg_booke.h|   34 ++
 arch/powerpc/kernel/asm-offsets.c   |   15 +-
 arch/powerpc/kernel/head_booke.h|   28 ++-
 arch/powerpc/kvm/Kconfig|3 +
 arch/powerpc/kvm/booke.c|  309 ---
 arch/powerpc/kvm/booke.h|   24 +-
 arch/powerpc/kvm/booke_emulate.c|   23 +-
 arch/powerpc/kvm/bookehv_interrupts.S   |  587 +++
 arch/powerpc/kvm/powerpc.c  |5 +
 arch/powerpc/kvm/timing.h   |6 +
 18 files changed, 1058 insertions(+), 67 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_booke_hv_asm.h
 create mode 100644 arch/powerpc/kvm/bookehv_interrupts.S

diff --git a/arch/powerpc/include/asm/dbell.h b/arch/powerpc/include/asm/dbell.h
index efa74ac..d7365b0 100644
--- a/arch/powerpc/include/asm/dbell.h
+++ b/arch/powerpc/include/asm/dbell.h
@@ -19,6 +19,7 @@
 
 #define PPC_DBELL_MSG_BRDCAST  (0x0400)
 #define PPC_DBELL_TYPE(x)  (((x) & 0xf) << (63-36))
+#define PPC_DBELL_LPID(x)  ((x) << (63 - 49))
 enum ppc_dbell {
PPC_DBELL = 0,  /* doorbell */
PPC_DBELL_CRIT = 1, /* critical doorbell */
diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h
index 7b1f0e0..0978152 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -48,6 +48,14 @@
 #define BOOKE_INTERRUPT_SPE_FP_DATA 33
 #define BOOKE_INTERRUPT_SPE_FP_ROUND 34
 #define BOOKE_INTERRUPT_PERFORMANCE_MONITOR 35
+#define BOOKE_INTERRUPT_DOORBELL 36
+#define BOOKE_INTERRUPT_DOORBELL_CRITICAL 37
+
+/* booke_hv */
+#define BOOKE_INTERRUPT_GUEST_DBELL 38
+#define BOOKE_INTERRUPT_GUEST_DBELL_CRIT 39
+#define BOOKE_INTERRUPT_HV_SYSCALL 40
+#define BOOKE_INTERRUPT_HV_PRIV 41
 
 /* book3s */
 
diff --git a/arch/powerpc/include/asm/kvm_booke_hv_asm.h b/arch/powerpc/include/asm/kvm_booke_hv_asm.h
new file mode 100644
index 000..30a600f
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_booke_hv_asm.h
@@ -0,0 +1,49 @@
+/*
+ * Copyright 2010-2011 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef ASM_KVM_BOOKE_HV_ASM_H
+#define ASM_KVM_BOOKE_HV_ASM_H
+
+#ifdef __ASSEMBLY__
+
+/*
+ * All exceptions from guest state must go through KVM
+ * (except for those which are delivered directly to the guest) --
+ * there are no exceptions for which we fall through directly to
+ * the normal host handler.
+ *
+ * Expected inputs (normal exceptions):
+ *   SCRATCH0 = saved r10
+ *   r10 = thread struct
+ *   r11 = appropriate SRR1 variant (currently used as scratch)
+ *   r13 = saved CR
+ *   *(r10 + THREAD_NORMSAVE(0)) = saved r11
+ *   *(r10 + THREAD_NORMSAVE(2)) = saved r13
+ *
+ * Expected inputs (crit/mcheck/debug exceptions):
+ *   appropriate SCRATCH = saved r8
+ *   r8 = exception level stack frame
+ *   r9 = *(r8 + _CCR) = saved CR
+ *   r11 = appropriate SRR1 variant (currently used as scratch)
+ *   *(r8 + GPR9) = saved r9
+ *   *(r8 + GPR10) = saved r10 (r10 not yet clobbered)
+ *   *(r8 + GPR11) = saved r11
+ */
+.macro DO_KVM intno srr1
+#ifdef CONFIG_KVM_BOOKE_HV
+BEGIN_FTR_SECTION
+   mtocrf  0x80, r11   /* check MSR[GS] without clobbering reg */
+   bf  3, kvmppc_resume_\intno\()_\srr1
+   b   kvmppc_handler_\intno\()_\srr1
+kvmppc_resume_\intno\()_\srr1:
+END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
+#endif
+.endm
+
+#endif /*__ASSEMB

[PATCH 29/38] KVM: PPC: bookehv: disable MAS register updates early

2012-02-28 Thread Alexander Graf
We need to make sure that no MAS updates happen automatically while we
have the guest MAS registers loaded. So move the disabling code a bit
higher up so that it covers the full time we have guest values in MAS
registers.

The race this patch fixes should never occur, but it makes the code a
bit more logical to do it this way around.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/bookehv_interrupts.S |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S
index 469bd3f..021d087 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -358,6 +358,7 @@ _GLOBAL(kvmppc_resume_host)
mtspr   SPRN_MAS4, r6
stw r5, VCPU_SHARED_MAS7_3+0(r11)
mtspr   SPRN_MAS6, r8
+   /* Enable MAS register updates via exception */
mfspr   r3, SPRN_EPCR
rlwinm  r3, r3, 0, ~SPRN_EPCR_DMIUH
mtspr   SPRN_EPCR, r3
@@ -515,6 +516,11 @@ lightweight_exit:
mtspr   SPRN_PID, r3
 
PPC_LL  r11, VCPU_SHARED(r4)
+   /* Disable MAS register updates via exception */
+   mfspr   r3, SPRN_EPCR
+   oris    r3, r3, SPRN_EPCR_DMIUH@h
+   mtspr   SPRN_EPCR, r3
+   isync
/* Save host mas4 and mas6 and load guest MAS registers */
mfspr   r3, SPRN_MAS4
stw r3, VCPU_HOST_MAS4(r4)
@@ -538,10 +544,6 @@ lightweight_exit:
lwz r5, VCPU_SHARED_MAS7_3+0(r11)
mtspr   SPRN_MAS6, r3
mtspr   SPRN_MAS7, r5
-   /* Disable MAS register updates via exception */
-   mfspr   r3, SPRN_EPCR
-   oris    r3, r3, SPRN_EPCR_DMIUH@h
-   mtspr   SPRN_EPCR, r3
 
/*
 * Host interrupt handlers may have clobbered these guest-readable
-- 
1.6.0.2



[PATCH 32/38] KVM: PPC: booke: add GS documentation for program interrupt

2012-02-28 Thread Alexander Graf
The comment for program interrupts triggered when using bookehv was
misleading. Update it to mention why MSR_GS indicates that we have
to inject an interrupt into the guest again, not emulate it.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/booke.c |   10 --
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index af02d9d..7df3f3a 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -685,8 +685,14 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 
case BOOKE_INTERRUPT_PROGRAM:
if (vcpu->arch.shared->msr & (MSR_PR | MSR_GS)) {
-   /* Program traps generated by user-level software must be handled
-* by the guest kernel. */
+   /*
+* Program traps generated by user-level software must
+* be handled by the guest kernel.
+*
+* In GS mode, hypervisor privileged instructions trap
+* on BOOKE_INTERRUPT_HV_PRIV, not here, so these are
+* actual program interrupts, handled by the guest.
+*/
kvmppc_core_queue_program(vcpu, vcpu->arch.fault_esr);
r = RESUME_GUEST;
kvmppc_account_exit(vcpu, USR_PR_INST);
-- 
1.6.0.2



[PATCH 02/38] powerpc/e500: split CPU_FTRS_ALWAYS/CPU_FTRS_POSSIBLE

2012-02-28 Thread Alexander Graf
From: Scott Wood 

Split e500 (v1/v2) and e500mc/e5500 to allow optimization of feature
checks that differ between the two.
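
The effect of the split can be sketched with hypothetical feature bits (the
SKETCH_* names and values below are made up for illustration): the
"possible" mask ORs together every configured CPU family, the "always" mask
ANDs them, and a feature that is in "always" or missing from "possible" can
be resolved at compile time.

```c
#include <stdbool.h>

/* Hypothetical feature bits, not the real CPU_FTR_* values. */
#define SKETCH_FTR_USE_TB 0x1u
#define SKETCH_FTR_SPE    0x2u
#define SKETCH_FTR_DBELL  0x4u

#define SKETCH_FTRS_E500   (SKETCH_FTR_USE_TB | SKETCH_FTR_SPE)
#define SKETCH_FTRS_E500MC (SKETCH_FTR_USE_TB | SKETCH_FTR_DBELL)

/* Features that at least one configured CPU family may have. */
static unsigned int sketch_possible(bool cfg_e500, bool cfg_e500mc)
{
    unsigned int mask = 0;

    if (cfg_e500)
        mask |= SKETCH_FTRS_E500;
    if (cfg_e500mc)
        mask |= SKETCH_FTRS_E500MC;
    return mask;
}

/* Features that every configured CPU family is guaranteed to have. */
static unsigned int sketch_always(bool cfg_e500, bool cfg_e500mc)
{
    unsigned int mask = ~0u;

    if (cfg_e500)
        mask &= SKETCH_FTRS_E500;
    if (cfg_e500mc)
        mask &= SKETCH_FTRS_E500MC;
    return mask & sketch_possible(cfg_e500, cfg_e500mc);
}
```

With only the e500mc family configured, the doorbell bit ends up in the
"always" mask in this sketch, so a check for it needs no runtime test.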

Signed-off-by: Scott Wood 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/cputable.h |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index 6a034a2..2022f2d 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -483,8 +483,10 @@ enum {
CPU_FTRS_E200 |
 #endif
 #ifdef CONFIG_E500
-   CPU_FTRS_E500 | CPU_FTRS_E500_2 | CPU_FTRS_E500MC |
-   CPU_FTRS_E5500 |
+   CPU_FTRS_E500 | CPU_FTRS_E500_2 |
+#endif
+#ifdef CONFIG_PPC_E500MC
+   CPU_FTRS_E500MC | CPU_FTRS_E5500 |
 #endif
0,
 };
@@ -528,8 +530,10 @@ enum {
CPU_FTRS_E200 &
 #endif
 #ifdef CONFIG_E500
-   CPU_FTRS_E500 & CPU_FTRS_E500_2 & CPU_FTRS_E500MC &
-   CPU_FTRS_E5500 &
+   CPU_FTRS_E500 & CPU_FTRS_E500_2 &
+#endif
+#ifdef CONFIG_PPC_E500MC
+   CPU_FTRS_E500MC & CPU_FTRS_E5500 &
 #endif
CPU_FTRS_POSSIBLE,
 };
-- 
1.6.0.2
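
For readers unfamiliar with the cputable.h idiom, the effect of the split
can be sketched in plain C. This is only an illustration: the feature bits
and per-core sets are invented (they are not the real CPU_FTR_* values), and
the CONFIG_ symbol is an ordinary define standing in for Kconfig. The check
in has_feature() is modelled on the usual "always / possible / runtime"
pattern and is an assumption, not code from this patch.

```c
#include <stdbool.h>

/* Hypothetical feature bits; the real CPU_FTR_* values differ. */
#define FTR_BASE  0x1u
#define FTR_SPE   0x2u	/* stand-in for an e500v1/v2-only feature */
#define FTR_FPU   0x4u	/* stand-in for an e500mc/e5500-only feature */

#define FTRS_E500    (FTR_BASE | FTR_SPE)
#define FTRS_E500_2  (FTR_BASE | FTR_SPE)
#define FTRS_E500MC  (FTR_BASE | FTR_FPU)
#define FTRS_E5500   (FTR_BASE | FTR_FPU)

/* Pretend only the e500mc family is configured in. */
#define CONFIG_PPC_E500MC 1

/* Union of the features of every core type built into this kernel. */
enum {
	FTRS_POSSIBLE =
#ifdef CONFIG_E500
		FTRS_E500 | FTRS_E500_2 |
#endif
#ifdef CONFIG_PPC_E500MC
		FTRS_E500MC | FTRS_E5500 |
#endif
		0,
};

/* Intersection: features present on every core type built in. */
enum {
	FTRS_ALWAYS =
#ifdef CONFIG_E500
		FTRS_E500 & FTRS_E500_2 &
#endif
#ifdef CONFIG_PPC_E500MC
		FTRS_E500MC & FTRS_E5500 &
#endif
		FTRS_POSSIBLE,
};

/* The first two tests are compile-time constants for a constant feature. */
bool has_feature(unsigned int runtime_features, unsigned int feature)
{
	if (FTRS_ALWAYS & feature)
		return true;
	if (!(FTRS_POSSIBLE & feature))
		return false;
	return (runtime_features & feature) != 0;
}
```

With both families lumped under one config symbol, FTR_SPE and FTR_FPU
would each be "possible but not always" and fall through to the runtime
test. With the split, a kernel built for only one family resolves both at
compile time.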



[PATCH 31/38] KVM: PPC: booke: Readd debug abort code for machine check

2012-02-28 Thread Alexander Graf
When we get a machine check interrupt during guest execution, we don't
know how to handle it yet. So let's add back the error printing code that
we accidentally dropped earlier and tell user space that something went
really wrong.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/booke.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 11b0625..af02d9d 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -634,7 +634,12 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
 
switch (exit_nr) {
case BOOKE_INTERRUPT_MACHINE_CHECK:
-   r = RESUME_GUEST;
+   printk("MACHINE CHECK: %lx\n", mfspr(SPRN_MCSR));
+   kvmppc_dump_vcpu(vcpu);
+   /* For debugging, send invalid exit reason to user space */
+   run->hw.hardware_exit_reason = ~1ULL << 32;
+   run->hw.hardware_exit_reason |= mfspr(SPRN_MCSR);
+   r = RESUME_HOST;
break;
 
case BOOKE_INTERRUPT_EXTERNAL:
-- 
1.6.0.2
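
As a rough illustration of what user space sees, the debug value built
above can be taken apart again as follows. This is a sketch, not an ABI:
the helper names and the tag constant are made up for the example, and it
assumes the MCSR value fits in 32 bits.

```c
#include <stdint.h>

/* Upper 32 bits of (~1ULL << 32); hypothetical name for this example. */
#define EXIT_TAG_MACHINE_CHECK 0xfffffffeu

/* Build the value the way the patch does: tag on top, MCSR below. */
uint64_t encode_machine_check(uint32_t mcsr)
{
	uint64_t reason = ~1ULL << 32;	/* 0xfffffffe00000000 */

	reason |= mcsr;
	return reason;
}

/* Upper word: identifies which debug exit this is. */
uint32_t exit_tag(uint64_t reason)
{
	return (uint32_t)(reason >> 32);
}

/* Lower word: the payload (here, the MCSR contents). */
uint32_t exit_payload(uint64_t reason)
{
	return (uint32_t)reason;
}
```

The emulation-failure path in booke.c uses the same layout with ~0ULL as
the tag, so the two cases can be told apart by the upper word.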



[PATCH 33/38] KVM: PPC: bookehv: remove unused code

2012-02-28 Thread Alexander Graf
There was some unused code in the exit code path that must have been
a leftover from earlier iterations. While it did no harm, it's superfluous
and thus should be removed.

Signed-off-by: Alexander Graf 

---

v2 -> v3:

  - fix commit message
  - also remove "lwz r9, VCPU_KVM(r4)" which was just as superfluous
---
 arch/powerpc/kvm/bookehv_interrupts.S |7 ---
 1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S 
b/arch/powerpc/kvm/bookehv_interrupts.S
index 021d087..63fc5f0 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -91,10 +91,6 @@
PPC_STL r9, VCPU_TIMING_EXIT_TBU(r4)
 #endif
 
-   .if \flags & NEED_EMU
-   lwz r9, VCPU_KVM(r4)
-   .endif
-
oris    r8, r6, MSR_CE@h
 #ifdef CONFIG_64BIT
std r6, (VCPU_SHARED_MSR)(r11)
@@ -112,9 +108,6 @@
 * appropriate for the exception type).
 */
cmpw    r6, r8
-   .if \flags & NEED_EMU
-   lwz r9, KVM_LPID(r9)
-   .endif
beq 1f
mfmsr   r7
.if \srr0 != SPRN_MCSRR0 && \srr0 != SPRN_CSRR0
-- 
1.6.0.2



[PATCH 30/38] KVM: PPC: bookehv: add comment about shadow_msr

2012-02-28 Thread Alexander Graf
For BookE HV the guest visible MSR is shared->msr and is identical to
the MSR that is in use while the guest is running, because we can't trap
reads from or writes to the MSR.

So shadow_msr is unused there. Indicate that with a comment.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/kvm_host.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index ed95f53..633d68f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -386,6 +386,7 @@ struct kvm_vcpu_arch {
 #endif
u32 vrsave; /* also USPRG0 */
u32 mmucr;
+   /* shadow_msr is unused for BookE HV */
ulong shadow_msr;
ulong csrr0;
ulong csrr1;
-- 
1.6.0.2



[PATCH 36/38] KVM: PPC: booke: expose good state on irq reinject

2012-02-28 Thread Alexander Graf
When reinjecting an interrupt into the host interrupt handler after we're
back in host kernel land, we need to tell the kernel where the interrupt
happened. We can't tell it that we were in guest state, because that might
lead to random code walking host addresses. So instead, we tell it that
we came from the interrupt reinject code.

This helps to get reasonable numbers out of perf.

Signed-off-by: Alexander Graf 

---

v2 -> v3:

  - actually sync host state
  - no need for vcpu in sync
---
 arch/powerpc/kvm/booke.c |   56 +
 1 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index ee39c8a..488936b 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -595,37 +595,63 @@ static int emulation_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu)
}
 }
 
-/**
- * kvmppc_handle_exit
- *
- * Return value is in the form (errcode<<2 | RESUME_FLAG_HOST | RESUME_FLAG_NV)
- */
-int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
-   unsigned int exit_nr)
+static void kvmppc_fill_pt_regs(struct pt_regs *regs)
 {
-   int r = RESUME_HOST;
+   ulong r1, ip, msr, lr;
+
+   asm("mr %0, 1" : "=r"(r1));
+   asm("mflr %0" : "=r"(lr));
+   asm("mfmsr %0" : "=r"(msr));
+   asm("bl 1f; 1: mflr %0" : "=r"(ip));
+
+   memset(regs, 0, sizeof(*regs));
+   regs->gpr[1] = r1;
+   regs->nip = ip;
+   regs->msr = msr;
+   regs->link = lr;
+}
 
-   /* update before a new last_exit_type is rewritten */
-   kvmppc_update_timing_stats(vcpu);
+static void kvmppc_restart_interrupt(struct kvm_vcpu *vcpu,
+unsigned int exit_nr)
+{
+   struct pt_regs regs;
 
switch (exit_nr) {
case BOOKE_INTERRUPT_EXTERNAL:
-   do_IRQ(current->thread.regs);
+   kvmppc_fill_pt_regs(&regs);
+   do_IRQ(&regs);
break;
-
case BOOKE_INTERRUPT_DECREMENTER:
-   timer_interrupt(current->thread.regs);
+   kvmppc_fill_pt_regs(&regs);
+   timer_interrupt(&regs);
break;
-
 #if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_BOOK3E_64)
case BOOKE_INTERRUPT_DOORBELL:
-   doorbell_exception(current->thread.regs);
+   kvmppc_fill_pt_regs(&regs);
+   doorbell_exception(&regs);
break;
 #endif
case BOOKE_INTERRUPT_MACHINE_CHECK:
/* FIXME */
break;
}
+}
+
+/**
+ * kvmppc_handle_exit
+ *
+ * Return value is in the form (errcode<<2 | RESUME_FLAG_HOST | RESUME_FLAG_NV)
+ */
+int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
+   unsigned int exit_nr)
+{
+   int r = RESUME_HOST;
+
+   /* update before a new last_exit_type is rewritten */
+   kvmppc_update_timing_stats(vcpu);
+
+   /* restart interrupts if they were meant for the host */
+   kvmppc_restart_interrupt(vcpu, exit_nr);
 
local_irq_enable();
 
-- 
1.6.0.2



[PATCH 34/38] KVM: PPC: e500: fix typo in tlb code

2012-02-28 Thread Alexander Graf
The tlbncfg registers should be populated with their respective TLB's
values. Fix the obvious typo.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/e500_tlb.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 279e10a..e05232b 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -1268,8 +1268,8 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 
*vcpu_e500)
 
vcpu->arch.tlbcfg[1] = mfspr(SPRN_TLB1CFG) &
 ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
-   vcpu->arch.tlbcfg[0] |= vcpu_e500->gtlb_params[1].entries;
-   vcpu->arch.tlbcfg[0] |=
+   vcpu->arch.tlbcfg[1] |= vcpu_e500->gtlb_params[1].entries;
+   vcpu->arch.tlbcfg[1] |=
vcpu_e500->gtlb_params[1].ways << TLBnCFG_ASSOC_SHIFT;
 
return 0;
-- 
1.6.0.2
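
The intent of the fixed lines can be sketched as a small helper that
derives one guest TLBnCFG value from the host register and the guest
geometry. The field masks below are assumed values for illustration and
should be checked against the kernel headers; guest_tlbncfg() is a
hypothetical name.

```c
#include <stdint.h>

/* Assumed field layout; verify against the real headers before use. */
#define TLBnCFG_N_ENTRY      0x00000fffu
#define TLBnCFG_ASSOC        0xff000000u
#define TLBnCFG_ASSOC_SHIFT  24

/*
 * Keep the host's other configuration bits, but advertise the guest's
 * own entry count and associativity for this TLB.
 */
uint32_t guest_tlbncfg(uint32_t host_cfg, uint32_t entries, uint32_t ways)
{
	uint32_t cfg = host_cfg & ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);

	cfg |= entries;
	cfg |= ways << TLBnCFG_ASSOC_SHIFT;
	return cfg;
}
```

Computing each tlbcfg[n] through one helper like this, called once per
TLB index, would make the index mix-up fixed by this patch harder to
write in the first place.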



[PATCH 35/38] KVM: PPC: booke: Support perfmon interrupts

2012-02-28 Thread Alexander Graf
When we get a performance monitor interrupt while in guest context, we
currently bail out and oops. Let's route it to its correct handler
instead.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/booke.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 7df3f3a..ee39c8a 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -679,6 +679,10 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
r = RESUME_GUEST;
break;
 
+   case BOOKE_INTERRUPT_PERFORMANCE_MONITOR:
+   r = RESUME_GUEST;
+   break;
+
case BOOKE_INTERRUPT_HV_PRIV:
r = emulation_exit(run, vcpu);
break;
-- 
1.6.0.2



[PATCH 37/38] KVM: PPC: booke: Reinject performance monitor interrupts

2012-02-28 Thread Alexander Graf
When we get a performance monitor interrupt, we need to make sure that
the host receives it. So reinject it like we reinject the other
host-destined interrupts.

Signed-off-by: Alexander Graf 

---

v2 -> v3:

  - call regs sync directly
---
 arch/powerpc/include/asm/hw_irq.h |1 +
 arch/powerpc/kvm/booke.c  |4 
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index bb712c9..904e66c 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -12,6 +12,7 @@
 #include 
 
 extern void timer_interrupt(struct pt_regs *);
+extern void performance_monitor_exception(struct pt_regs *regs);
 
 #ifdef CONFIG_PPC64
 #include 
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 488936b..8e8aa4c 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -634,6 +634,10 @@ static void kvmppc_restart_interrupt(struct kvm_vcpu *vcpu,
case BOOKE_INTERRUPT_MACHINE_CHECK:
/* FIXME */
break;
+   case BOOKE_INTERRUPT_PERFORMANCE_MONITOR:
+   kvmppc_fill_pt_regs(&regs);
+   performance_monitor_exception(&regs);
+   break;
}
 }
 
-- 
1.6.0.2



[PATCH 38/38] KVM: PPC: Booke: only prepare to enter when we enter

2012-02-28 Thread Alexander Graf
So far, we've always called prepare_to_enter even when all we did was return
to the host. This patch changes that semantic to only call prepare_to_enter
when we actually want to get back into the guest.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/booke.c |   18 ++
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 8e8aa4c..9f27258 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -464,7 +464,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
  *
  * returns !0 if a signal is pending and check_signal is true
  */
-static int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu, bool check_signal)
+static int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
 {
int r = 0;
 
@@ -477,7 +477,7 @@ static int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu, 
bool check_signal)
continue;
}
 
-   if (check_signal && signal_pending(current)) {
+   if (signal_pending(current)) {
r = 1;
break;
}
@@ -509,7 +509,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
}
 
local_irq_disable();
-   if (kvmppc_prepare_to_enter(vcpu, true)) {
+   if (kvmppc_prepare_to_enter(vcpu)) {
kvm_run->exit_reason = KVM_EXIT_INTR;
ret = -EINTR;
goto out;
@@ -946,11 +946,13 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
 * To avoid clobbering exit_reason, only check for signals if we
 * aren't already exiting to userspace for some other reason.
 */
-   local_irq_disable();
-   if (kvmppc_prepare_to_enter(vcpu, !(r & RESUME_HOST))) {
-   run->exit_reason = KVM_EXIT_INTR;
-   r = (-EINTR << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
-   kvmppc_account_exit(vcpu, SIGNAL_EXITS);
+   if (!(r & RESUME_HOST)) {
+   local_irq_disable();
+   if (kvmppc_prepare_to_enter(vcpu)) {
+   run->exit_reason = KVM_EXIT_INTR;
+   r = (-EINTR << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
+   kvmppc_account_exit(vcpu, SIGNAL_EXITS);
+   }
}
 
return r;
-- 
1.6.0.2



[PATCH 14/38] KVM: PPC: booke: standard PPC floating point support

2012-02-28 Thread Alexander Graf
From: Scott Wood 

e500mc has a normal PPC FPU, rather than SPE which is found
on e500v1/v2.

Based on code from Liu Yu.

Signed-off-by: Scott Wood 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/system.h |1 +
 arch/powerpc/kvm/booke.c  |   44 +
 arch/powerpc/kvm/booke.h  |   30 +
 3 files changed, 75 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/system.h 
b/arch/powerpc/include/asm/system.h
index c377457..73eee86 100644
--- a/arch/powerpc/include/asm/system.h
+++ b/arch/powerpc/include/asm/system.h
@@ -140,6 +140,7 @@ extern void via_cuda_init(void);
 extern void read_rtc_time(void);
 extern void pmac_find_display(void);
 extern void giveup_fpu(struct task_struct *);
+extern void load_up_fpu(void);
 extern void disable_kernel_fp(void);
 extern void enable_kernel_fp(void);
 extern void flush_fp_to_thread(struct task_struct *);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 75dbaeb..0b77be1 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -457,6 +457,11 @@ void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 {
int ret;
+#ifdef CONFIG_PPC_FPU
+   unsigned int fpscr;
+   int fpexc_mode;
+   u64 fpr[32];
+#endif
 
if (!vcpu->arch.sane) {
kvm_run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
@@ -479,7 +484,46 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
}
 
kvm_guest_enter();
+
+#ifdef CONFIG_PPC_FPU
+   /* Save userspace FPU state in stack */
+   enable_kernel_fp();
+   memcpy(fpr, current->thread.fpr, sizeof(current->thread.fpr));
+   fpscr = current->thread.fpscr.val;
+   fpexc_mode = current->thread.fpexc_mode;
+
+   /* Restore guest FPU state to thread */
+   memcpy(current->thread.fpr, vcpu->arch.fpr, sizeof(vcpu->arch.fpr));
+   current->thread.fpscr.val = vcpu->arch.fpscr;
+
+   /*
+* Since we can't trap on MSR_FP in GS-mode, we consider the guest
+* as always using the FPU.  Kernel usage of FP (via
+* enable_kernel_fp()) in this thread must not occur while
+* vcpu->fpu_active is set.
+*/
+   vcpu->fpu_active = 1;
+
+   kvmppc_load_guest_fp(vcpu);
+#endif
+
ret = __kvmppc_vcpu_run(kvm_run, vcpu);
+
+#ifdef CONFIG_PPC_FPU
+   kvmppc_save_guest_fp(vcpu);
+
+   vcpu->fpu_active = 0;
+
+   /* Save guest FPU state from thread */
+   memcpy(vcpu->arch.fpr, current->thread.fpr, sizeof(vcpu->arch.fpr));
+   vcpu->arch.fpscr = current->thread.fpscr.val;
+
+   /* Restore userspace FPU state from stack */
+   memcpy(current->thread.fpr, fpr, sizeof(current->thread.fpr));
+   current->thread.fpscr.val = fpscr;
+   current->thread.fpexc_mode = fpexc_mode;
+#endif
+
kvm_guest_exit();
 
 out:
diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
index d53bcf2..3bf5eda 100644
--- a/arch/powerpc/kvm/booke.h
+++ b/arch/powerpc/kvm/booke.h
@@ -96,4 +96,34 @@ enum int_class {
 
 void kvmppc_set_pending_interrupt(struct kvm_vcpu *vcpu, enum int_class type);
 
+/*
+ * Load up guest vcpu FP state if it's needed.
+ * It also sets MSR_FP in the thread so that the host knows
+ * we're holding the FPU, and the host can then help to save
+ * guest vcpu FP state if other threads need to use the FPU.
+ * This simulates an FP unavailable fault.
+ *
+ * It must be called with preemption disabled.
+ */
+static inline void kvmppc_load_guest_fp(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_PPC_FPU
+   if (vcpu->fpu_active && !(current->thread.regs->msr & MSR_FP)) {
+   load_up_fpu();
+   current->thread.regs->msr |= MSR_FP;
+   }
+#endif
+}
+
+/*
+ * Save guest vcpu FP state into thread.
+ * It must be called with preemption disabled.
+ */
+static inline void kvmppc_save_guest_fp(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_PPC_FPU
+   if (vcpu->fpu_active && (current->thread.regs->msr & MSR_FP))
+   giveup_fpu(current);
+#endif
+}
 #endif /* __KVM_BOOKE_H__ */
-- 
1.6.0.2
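
The save/swap/restore sequence around __kvmppc_vcpu_run() can be modelled
in user space as follows. All names here are stand-ins: thread_fp plays
the role of current->thread, guest_fp the role of vcpu->arch, and
run_guest() just scribbles on the live state. fpexc_mode and the actual
register load/store are left out of the sketch.

```c
#include <stdint.h>
#include <string.h>

struct fp_state {
	uint64_t fpr[32];
	unsigned int fpscr;
};

/* Stand-ins for current->thread and vcpu->arch; purely illustrative. */
struct fp_state thread_fp;
struct fp_state guest_fp;

/* Stand-in for the guest run: the guest modifies the live FP state. */
void run_guest(void)
{
	thread_fp.fpr[0] += 1;
	thread_fp.fpscr = 0x4;
}

void vcpu_run_with_fp(void)
{
	struct fp_state saved_host;

	/* Save userspace FP state on the stack. */
	memcpy(&saved_host, &thread_fp, sizeof(saved_host));
	/* Load guest FP state into the thread. */
	memcpy(&thread_fp, &guest_fp, sizeof(thread_fp));

	run_guest();

	/* Store guest FP state back, then restore userspace state. */
	memcpy(&guest_fp, &thread_fp, sizeof(guest_fp));
	memcpy(&thread_fp, &saved_host, sizeof(thread_fp));
}
```

After the call, the thread sees its own FP state unchanged, while the
guest's modifications are kept in the vcpu for the next entry.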



[PATCH 22/38] KVM: PPC: booke: remove leftover debugging

2012-02-28 Thread Alexander Graf
The e500mc patches left some debug code in that we don't need. Remove it.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/booke.c |5 -
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 9fcc760..17d5318 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -469,11 +469,6 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
return -EINVAL;
}
 
-   if (!current->thread.kvm_vcpu) {
-   WARN(1, "no vcpu\n");
-   return -EPERM;
-   }
-
local_irq_disable();
 
kvmppc_core_prepare_to_enter(vcpu);
-- 
1.6.0.2



[PATCH 19/38] KVM: PPC: e500mc: add load inst fixup

2012-02-28 Thread Alexander Graf
There's always a chance we're unable to read a guest instruction. The guest
could have a page mapped executable but not readable, or something odd could
happen and our TLB could get flushed. So it's a good idea to be prepared and
have a fallback that allows us to fix things up in that case.

Add fixup code that keeps guest code from potentially crashing our host kernel.

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - fix whitespace
  - use explicit preempt counts
---
 arch/powerpc/kvm/bookehv_interrupts.S |   30 +-
 1 files changed, 29 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S 
b/arch/powerpc/kvm/bookehv_interrupts.S
index 63023ae..f7dc3f6 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../kernel/head_booke.h" /* for THREAD_NORMSAVE() */
 
@@ -171,9 +172,36 @@
PPC_STL r30, VCPU_GPR(r30)(r4)
PPC_STL r31, VCPU_GPR(r31)(r4)
mtspr   SPRN_EPLC, r8
+
+   /* disable preemption, so we are sure we hit the fixup handler */
+#ifdef CONFIG_PPC64
+   clrrdi  r8,r1,THREAD_SHIFT
+#else
+   rlwinm  r8,r1,0,0,31-THREAD_SHIFT   /* current thread_info */
+#endif
+   li  r7, 1
+   stw r7, TI_PREEMPT(r8)
+
isync
-   lwepx   r9, 0, r5
+
+   /*
+* In case the read goes wrong, we catch it and write an invalid value
+* in LAST_INST instead.
+*/
+1: lwepx   r9, 0, r5
+2:
+.section .fixup, "ax"
+3: li  r9, KVM_INST_FETCH_FAILED
+   b   2b
+.previous
+.section __ex_table,"a"
+   PPC_LONG_ALIGN
+   PPC_LONG 1b,3b
+.previous
+
mtspr   SPRN_EPLC, r3
+   li  r7, 0
+   stw r7, TI_PREEMPT(r8)
stw r9, VCPU_LAST_INST(r4)
.endif
 
-- 
1.6.0.2
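
The idea behind the __ex_table entry can be sketched in C. When the load
at label 1 faults, the fault handler looks up the faulting address and, if
it finds an entry, resumes at the paired fixup address (label 3) instead
of oopsing. The structure and function below are hypothetical and use a
linear scan for brevity; they do not reproduce the kernel's own exception
table search.

```c
#include <stddef.h>
#include <stdint.h>

/* One (faulting instruction, fixup handler) pair, as in "PPC_LONG 1b,3b". */
struct extable_entry {
	uintptr_t insn;
	uintptr_t fixup;
};

/* Returns the fixup address for fault_addr, or 0 if there is no entry. */
uintptr_t find_fixup(const struct extable_entry *table, size_t n,
		     uintptr_t fault_addr)
{
	for (size_t i = 0; i < n; i++) {
		if (table[i].insn == fault_addr)
			return table[i].fixup;
	}
	return 0;
}
```

A return value of 0 corresponds to the case where nothing registered a
fixup for that instruction and the fault has to be treated as fatal.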



[PATCH 23/38] KVM: PPC: booke: deliver program int on emulation failure

2012-02-28 Thread Alexander Graf
When we fail to emulate an instruction for the guest, we had better tell
it so by raising an illegal instruction exception.

Please be aware that we basically never get around to telling the guest
that we failed, thanks to the debugging code right above it. However, if
user space decides to ignore the debug exit, we at least do "the right
thing" afterwards.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/booke.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 17d5318..9979be1 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -545,13 +545,13 @@ static int emulation_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu)
return RESUME_HOST;
 
case EMULATE_FAIL:
-   /* XXX Deliver Program interrupt to guest. */
printk(KERN_CRIT "%s: emulation at %lx failed (%08x)\n",
   __func__, vcpu->arch.pc, vcpu->arch.last_inst);
/* For debugging, encode the failing instruction and
 * report it to userspace. */
run->hw.hardware_exit_reason = ~0ULL << 32;
run->hw.hardware_exit_reason |= vcpu->arch.last_inst;
+   kvmppc_core_queue_program(vcpu, ESR_PIL);
return RESUME_HOST;
 
default:
-- 
1.6.0.2



[PATCH 21/38] KVM: PPC: make e500v2 kvm and e500mc cpu mutually exclusive

2012-02-28 Thread Alexander Graf
We can't run e500v2 kvm on e500mc kernels, so indicate that by
making the 2 options mutually exclusive in kconfig.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/Kconfig |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 44a998d..f4dacb9 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -120,7 +120,7 @@ config KVM_EXIT_TIMING
 
 config KVM_E500V2
bool "KVM support for PowerPC E500v2 processors"
-   depends on EXPERIMENTAL && E500
+   depends on EXPERIMENTAL && E500 && !PPC_E500MC
select KVM
select KVM_MMIO
---help---
-- 
1.6.0.2



[PATCH 18/38] KVM: PPC: e500mc: Move r1/r2 restoration very early

2012-02-28 Thread Alexander Graf
If we hit any exception whatsoever in the restore path and r1/r2 aren't the
host registers, we don't get a working oops. So it's always a good idea to
restore them as early as possible.

This time there is also a practical reason to do so: we need the host page
fault handler to fix up our guest instruction read code, and for that to
work we need r1/r2 restored.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/bookehv_interrupts.S |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S 
b/arch/powerpc/kvm/bookehv_interrupts.S
index 9eaeebd..63023ae 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -67,6 +67,12 @@
  * saved in vcpu: cr, ctr, r3-r13
  */
 .macro kvm_handler_common intno, srr0, flags
+   /* Restore host stack pointer */
+   PPC_STL r1, VCPU_GPR(r1)(r4)
+   PPC_STL r2, VCPU_GPR(r2)(r4)
+   PPC_LL  r1, VCPU_HOST_STACK(r4)
+   PPC_LL  r2, HOST_R2(r1)
+
mfspr   r10, SPRN_PID
lwz r8, VCPU_HOST_PID(r4)
PPC_LL  r11, VCPU_SHARED(r4)
@@ -290,10 +296,8 @@ _GLOBAL(kvmppc_resume_host)
/* Save remaining volatile guest register state to vcpu. */
mfspr   r3, SPRN_VRSAVE
PPC_STL r0, VCPU_GPR(r0)(r4)
-   PPC_STL r1, VCPU_GPR(r1)(r4)
mflr    r5
mfspr   r6, SPRN_SPRG4
-   PPC_STL r2, VCPU_GPR(r2)(r4)
PPC_STL r5, VCPU_LR(r4)
mfspr   r7, SPRN_SPRG5
PPC_STL r3, VCPU_VRSAVE(r4)
@@ -334,10 +338,6 @@ _GLOBAL(kvmppc_resume_host)
mtspr   SPRN_EPCR, r3
isync
 
-   /* Restore host stack pointer */
-   PPC_LL  r1, VCPU_HOST_STACK(r4)
-   PPC_LL  r2, HOST_R2(r1)
-
/* Switch to kernel stack and jump to handler. */
PPC_LL  r3, HOST_RUN(r1)
mr  r5, r14 /* intno */
-- 
1.6.0.2



[PATCH 10/38] KVM: PPC: e500: Track TLB1 entries with a bitmap

2012-02-28 Thread Alexander Graf
From: Scott Wood 

Rather than invalidate everything when a TLB1 entry needs to be
taken down, keep track of which host TLB1 entries are used for
a given guest TLB1 entry, and invalidate just those entries.

Based on code from Ashish Kalra
and Liu Yu.

Signed-off-by: Scott Wood 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/e500.h |5 +++
 arch/powerpc/kvm/e500_tlb.c |   72 ---
 2 files changed, 72 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 34cef08..f4dee55 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -2,6 +2,7 @@
  * Copyright (C) 2008-2011 Freescale Semiconductor, Inc. All rights reserved.
  *
  * Author: Yu Liu 
+ * Ashish Kalra 
  *
  * Description:
  * This file is based on arch/powerpc/kvm/44x_tlb.h and
@@ -25,6 +26,7 @@
 
 #define E500_TLB_VALID 1
 #define E500_TLB_DIRTY 2
+#define E500_TLB_BITMAP 4
 
 struct tlbe_ref {
pfn_t pfn;
@@ -82,6 +84,9 @@ struct kvmppc_vcpu_e500 {
struct page **shared_tlb_pages;
int num_shared_tlb_pages;
 
+   u64 *g2h_tlb1_map;
+   unsigned int *h2g_tlb1_rmap;
+
 #ifdef CONFIG_KVM_E500
u32 pid[E500_PID_NUM];
 
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 9925fc6..c8ce51d 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -2,6 +2,7 @@
  * Copyright (C) 2008-2011 Freescale Semiconductor, Inc. All rights reserved.
  *
  * Author: Yu Liu, yu@freescale.com
+ * Ashish Kalra, ashish.ka...@freescale.com
  *
  * Description:
  * This file is based on arch/powerpc/kvm/44x_tlb.c,
@@ -175,8 +176,28 @@ static void inval_gtlbe_on_host(struct kvmppc_vcpu_e500 
*vcpu_e500,
struct kvm_book3e_206_tlb_entry *gtlbe =
get_entry(vcpu_e500, tlbsel, esel);
 
-   if (tlbsel == 1) {
-   kvmppc_e500_tlbil_all(vcpu_e500);
+   if (tlbsel == 1 &&
+   vcpu_e500->gtlb_priv[1][esel].ref.flags & E500_TLB_BITMAP) {
+   u64 tmp = vcpu_e500->g2h_tlb1_map[esel];
+   int hw_tlb_indx;
+   unsigned long flags;
+
+   local_irq_save(flags);
+   while (tmp) {
+   hw_tlb_indx = __ilog2_u64(tmp & -tmp);
+   mtspr(SPRN_MAS0,
+ MAS0_TLBSEL(1) |
+ MAS0_ESEL(to_htlb1_esel(hw_tlb_indx)));
+   mtspr(SPRN_MAS1, 0);
+   asm volatile("tlbwe");
+   vcpu_e500->h2g_tlb1_rmap[hw_tlb_indx] = 0;
+   tmp &= tmp - 1;
+   }
+   mb();
+   vcpu_e500->g2h_tlb1_map[esel] = 0;
+   vcpu_e500->gtlb_priv[1][esel].ref.flags &= ~E500_TLB_BITMAP;
+   local_irq_restore(flags);
+
return;
}
 
@@ -282,6 +303,16 @@ static inline void kvmppc_e500_ref_release(struct tlbe_ref 
*ref)
}
 }
 
+static void clear_tlb1_bitmap(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+   if (vcpu_e500->g2h_tlb1_map)
+   memset(vcpu_e500->g2h_tlb1_map, 0,
+  sizeof(u64) * vcpu_e500->gtlb_params[1].entries);
+   if (vcpu_e500->h2g_tlb1_rmap)
+   memset(vcpu_e500->h2g_tlb1_rmap, 0,
+  sizeof(unsigned int) * host_tlb_params[1].entries);
+}
+
 static void clear_tlb_privs(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
int tlbsel = 0;
@@ -511,7 +542,7 @@ static void kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 
*vcpu_e500,
 /* XXX for both one-one and one-to-many , for now use TLB1 */
 static int kvmppc_e500_tlb1_map(struct kvmppc_vcpu_e500 *vcpu_e500,
u64 gvaddr, gfn_t gfn, struct kvm_book3e_206_tlb_entry *gtlbe,
-   struct kvm_book3e_206_tlb_entry *stlbe)
+   struct kvm_book3e_206_tlb_entry *stlbe, int esel)
 {
struct tlbe_ref *ref;
unsigned int victim;
@@ -524,6 +555,14 @@ static int kvmppc_e500_tlb1_map(struct kvmppc_vcpu_e500 
*vcpu_e500,
ref = &vcpu_e500->tlb_refs[1][victim];
kvmppc_e500_shadow_map(vcpu_e500, gvaddr, gfn, gtlbe, 1, stlbe, ref);
 
+   vcpu_e500->g2h_tlb1_map[esel] |= (u64)1 << victim;
+   vcpu_e500->gtlb_priv[1][esel].ref.flags |= E500_TLB_BITMAP;
+   if (vcpu_e500->h2g_tlb1_rmap[victim]) {
+   unsigned int idx = vcpu_e500->h2g_tlb1_rmap[victim];
+   vcpu_e500->g2h_tlb1_map[idx] &= ~(1ULL << victim);
+   }
+   vcpu_e500->h2g_tlb1_rmap[victim] = esel;
+
return victim;
 }
 
@@ -728,7 +767,7 @@ int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
 * are mapped on the fly. */
stlbsel = 1;
sesel = kvmppc_e500_tlb1_map(vcpu_e500, eaddr,
-   raddr >> PAGE_SHIFT, gtlbe, &stlbe);
+   raddr >> PAGE_SHIFT, gtlbe, &stlbe, esel);
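
The invalidation loop in inval_gtlbe_on_host() walks the set bits of a
64-bit map using two bit tricks: "tmp & -tmp" isolates the lowest set bit
and "tmp &= tmp - 1" clears it. A stand-alone sketch of that walk is given
here; the function names are made up, and a simple loop replaces
__ilog2_u64() so the example compiles outside the kernel.

```c
#include <stdint.h>

/* Index of the lowest set bit; v must be nonzero. */
int lowest_set_bit(uint64_t v)
{
	int idx = 0;

	while (!(v & 1)) {
		v >>= 1;
		idx++;
	}
	return idx;
}

/*
 * Store the index of every set bit in map into out[], lowest first,
 * and return how many were found. Each index would be one host TLB1
 * entry to invalidate.
 */
int set_bits(uint64_t map, int out[64])
{
	int n = 0;

	while (map) {
		uint64_t lowest = map & -map;	/* isolate lowest set bit */

		out[n++] = lowest_set_bit(lowest);
		map &= map - 1;			/* clear that bit */
	}
	return n;
}
```

The loop runs once per set bit rather than once per bit position, which
is what lets the patch invalidate only the affected host entries.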

[PATCH 03/38] KVM: PPC: factor out lpid allocator from book3s_64_mmu_hv

2012-02-28 Thread Alexander Graf
From: Scott Wood 

We'll use it on e500mc as well.

Signed-off-by: Scott Wood 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/kvm_book3s.h |3 ++
 arch/powerpc/include/asm/kvm_booke.h  |3 ++
 arch/powerpc/include/asm/kvm_ppc.h|5 
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |   26 +---
 arch/powerpc/kvm/powerpc.c|   34 +
 5 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index aa795cc..046041f 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -452,4 +452,7 @@ static inline bool kvmppc_critical_section(struct kvm_vcpu 
*vcpu)
 
 #define INS_DCBZ   0x7c0007ec
 
+/* LPIDs we support with this build -- runtime limit may be lower */
+#define KVMPPC_NR_LPIDS (LPID_RSVD + 1)
+
 #endif /* __ASM_KVM_BOOK3S_H__ */
diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index a90e091..b7cd335 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -23,6 +23,9 @@
 #include 
 #include 
 
+/* LPIDs we support with this build -- runtime limit may be lower */
+#define KVMPPC_NR_LPIDS 64
+
 static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, ulong val)
 {
vcpu->arch.gpr[num] = val;
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 9d6dee0..731e920 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -204,4 +204,9 @@ int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
 struct kvm_dirty_tlb *cfg);
 
+long kvmppc_alloc_lpid(void);
+void kvmppc_claim_lpid(long lpid);
+void kvmppc_free_lpid(long lpid);
+void kvmppc_init_lpid(unsigned long nr_lpids);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index ddc485a..d031ce1 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -36,13 +36,11 @@
 
 /* POWER7 has 10-bit LPIDs, PPC970 has 6-bit LPIDs */
 #define MAX_LPID_970   63
-#define NR_LPIDS   (LPID_RSVD + 1)
-unsigned long lpid_inuse[BITS_TO_LONGS(NR_LPIDS)];
 
 long kvmppc_alloc_hpt(struct kvm *kvm)
 {
unsigned long hpt;
-   unsigned long lpid;
+   long lpid;
struct revmap_entry *rev;
struct kvmppc_linear_info *li;
 
@@ -72,14 +70,9 @@ long kvmppc_alloc_hpt(struct kvm *kvm)
}
kvm->arch.revmap = rev;
 
-   /* Allocate the guest's logical partition ID */
-   do {
-   lpid = find_first_zero_bit(lpid_inuse, NR_LPIDS);
-   if (lpid >= NR_LPIDS) {
-   pr_err("kvm_alloc_hpt: No LPIDs free\n");
-   goto out_freeboth;
-   }
-   } while (test_and_set_bit(lpid, lpid_inuse));
+   lpid = kvmppc_alloc_lpid();
+   if (lpid < 0)
+   goto out_freeboth;
 
kvm->arch.sdr1 = __pa(hpt) | (HPT_ORDER - 18);
kvm->arch.lpid = lpid;
@@ -96,7 +89,7 @@ long kvmppc_alloc_hpt(struct kvm *kvm)
 
 void kvmppc_free_hpt(struct kvm *kvm)
 {
-   clear_bit(kvm->arch.lpid, lpid_inuse);
+   kvmppc_free_lpid(kvm->arch.lpid);
vfree(kvm->arch.revmap);
if (kvm->arch.hpt_li)
kvm_release_hpt(kvm->arch.hpt_li);
@@ -171,8 +164,7 @@ int kvmppc_mmu_hv_init(void)
if (!cpu_has_feature(CPU_FTR_HVMODE))
return -EINVAL;
 
-   memset(lpid_inuse, 0, sizeof(lpid_inuse));
-
+   /* POWER7 has 10-bit LPIDs, PPC970 and e500mc have 6-bit LPIDs */
if (cpu_has_feature(CPU_FTR_ARCH_206)) {
host_lpid = mfspr(SPRN_LPID);   /* POWER7 */
rsvd_lpid = LPID_RSVD;
@@ -181,9 +173,11 @@ int kvmppc_mmu_hv_init(void)
rsvd_lpid = MAX_LPID_970;
}
 
-   set_bit(host_lpid, lpid_inuse);
+   kvmppc_init_lpid(rsvd_lpid + 1);
+
+   kvmppc_claim_lpid(host_lpid);
/* rsvd_lpid is reserved for use in partition switching */
-   set_bit(rsvd_lpid, lpid_inuse);
+   kvmppc_claim_lpid(rsvd_lpid);
 
return 0;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 00d7e34..9806ea5 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -808,6 +808,40 @@ out:
return r;
 }
 
+static unsigned long lpid_inuse[BITS_TO_LONGS(KVMPPC_NR_LPIDS)];
+static unsigned long nr_lpids;
+
+long kvmppc_alloc_lpid(void)
+{
+   long lpid;
+
+   do {
+   lpid = find_first_zero_bit(lpid_inuse, KVMPPC_NR_LPIDS);
+   if (lpid >= nr_lpids) {
+   pr_err("%s: No LPIDs free\n", __func__);
+   return -ENOMEM;
+   }
+
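
The loop removed from kvmppc_alloc_hpt() and the new kvmppc_alloc_lpid()
share one pattern: search the bitmap for a clear bit, then claim it with an
atomic test-and-set, retrying if another caller got there first.  A minimal
userspace sketch of that pattern follows.  It assumes C11 atomics in place
of the kernel bitops; the lpid_* names, NR_LPIDS_MAX and the -1 error
return are illustrative and are not part of the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define NR_LPIDS_MAX  64	/* compile-time capacity (assumed value) */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS      ((NR_LPIDS_MAX + BITS_PER_WORD - 1) / BITS_PER_WORD)

static _Atomic unsigned long lpid_inuse[NR_WORDS];
static unsigned long nr_lpids;	/* runtime limit, <= NR_LPIDS_MAX */

/* Atomically set bit nr; return nonzero if it was already set. */
static int test_and_set(unsigned long nr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_WORD);
	unsigned long old = atomic_fetch_or(&lpid_inuse[nr / BITS_PER_WORD], mask);

	return (old & mask) != 0;
}

/* Return the index of the first clear bit, or NR_LPIDS_MAX if none. */
static unsigned long first_zero(void)
{
	for (unsigned long nr = 0; nr < NR_LPIDS_MAX; nr++) {
		unsigned long word = atomic_load(&lpid_inuse[nr / BITS_PER_WORD]);

		if (!(word & (1UL << (nr % BITS_PER_WORD))))
			return nr;
	}
	return NR_LPIDS_MAX;
}

/* Clear the bitmap and record the runtime limit. */
void lpid_init(unsigned long n)
{
	assert(n <= NR_LPIDS_MAX);
	for (unsigned long i = 0; i < NR_WORDS; i++)
		atomic_store(&lpid_inuse[i], 0);
	nr_lpids = n;
}

/* Mark an ID as permanently taken (host ID, reserved ID). */
void lpid_claim(unsigned long lpid)
{
	assert(lpid < NR_LPIDS_MAX);
	(void)test_and_set(lpid);
}

void lpid_free(unsigned long lpid)
{
	unsigned long mask = 1UL << (lpid % BITS_PER_WORD);

	assert(lpid < NR_LPIDS_MAX);
	atomic_fetch_and(&lpid_inuse[lpid / BITS_PER_WORD], ~mask);
}

/* Return a free ID below the runtime limit, or -1 if none is left. */
long lpid_alloc(void)
{
	unsigned long lpid;

	do {
		lpid = first_zero();
		if (lpid >= nr_lpids)
			return -1;
	} while (test_and_set(lpid));	/* lost a race: search again */

	return (long)lpid;
}
```

As in the patch, the search covers the full compile-time bitmap and the
result is then compared against the runtime limit, so IDs at or above
nr_lpids are never handed out.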

Re: Reconciling qemu-kvm and qemu's PIT

2012-02-28 Thread Jan Kiszka
On 2012-02-28 20:50, Avi Kivity wrote:
> On 02/28/2012 04:42 PM, Avi Kivity wrote:
> 
> 
> 
> I'm getting a crash now:
> 
> #0  0x74ea0285 in raise() from /lib64/libc.so.6
> #1  0x74ea1b9b in abort() from /lib64/libc.so.6
> #2  0x74e98e9e in __assert_fail_base ()from /lib64/libc.so.6
> #3  0x74e98f42 in __assert_fail() from /lib64/libc.so.6
> #4  0x5571c69b in qdev_connect_gpio_out(dev=0x56913a70,
> n=0, pin=0x568ef770) at /home/tlv/akivity/qemu/hw/qdev.c:297
> #5  0x5582beae in pit_init (bus=0x568d5610,base=64,
> isa_irq=-1, alt_irq=0x568ef770) at /home/tlv/akivity/qemu/hw/i8254.h:85
> #6  0x5582ebbc in pc_basic_device_init
> (isa_bus=0x568d5610,gsi=0x568cadb0,
> rtc_state=0x7fffdea0, floppy=0x7fffdea8, no_vmport=false)
> at /home/tlv/akivity/qemu/hw/pc.c:1182
> #7  0x5582f454 in pc_init1 (system_memory=0x5646f270,
> system_io=0x5646f360, ram_size=1073741824,
> boot_device=0x7fffe300 "cad", kernel_filename=0x0,   
> kernel_cmdline=0x558baa12 "", initrd_filename=0x0,
> cpu_model=0x0, pci_enabled=1, kvmclock_enabled=1) at
> /home/tlv/akivity/qemu/hw/pc_piix.c:256
> #8  0x5582f8d7 in pc_init_pci (ram_size=1073741824,
> boot_device=0x7fffe300 "cad", kernel_filename=0x0,
> kernel_cmdline=0x558baa12 "", initrd_filename=0x0,
> cpu_model=0x0) at /home/tlv/akivity/qemu/hw/pc_piix.c:335
> #9  0x556f1349 in main (argc=6,argv=0x7fffe428,
> envp=0x7fffe460) at /home/tlv/akivity/qemu/vl.c:3431
> 
> Looks like isa-pit has zero gpio pins, so it fails when crashing.  I
> must have mismerged it, but where is the gpio pin count set?

In pit_initfn. Does -no-kvm-irqchip work fine? It's a bit tricky to
discuss this without seeing your code.

Jan





Re: Reconciling qemu-kvm and qemu's PIT

2012-02-28 Thread Avi Kivity
On 02/28/2012 04:42 PM, Avi Kivity wrote:



I'm getting a crash now:

#0  0x74ea0285 in raise() from /lib64/libc.so.6
#1  0x74ea1b9b in abort() from /lib64/libc.so.6
#2  0x74e98e9e in __assert_fail_base ()from /lib64/libc.so.6
#3  0x74e98f42 in __assert_fail() from /lib64/libc.so.6
#4  0x5571c69b in qdev_connect_gpio_out(dev=0x56913a70,
n=0, pin=0x568ef770) at /home/tlv/akivity/qemu/hw/qdev.c:297
#5  0x5582beae in pit_init (bus=0x568d5610,base=64,
isa_irq=-1, alt_irq=0x568ef770) at /home/tlv/akivity/qemu/hw/i8254.h:85
#6  0x5582ebbc in pc_basic_device_init
(isa_bus=0x568d5610,gsi=0x568cadb0,
rtc_state=0x7fffdea0, floppy=0x7fffdea8, no_vmport=false)
at /home/tlv/akivity/qemu/hw/pc.c:1182
#7  0x5582f454 in pc_init1 (system_memory=0x5646f270,
system_io=0x5646f360, ram_size=1073741824,
boot_device=0x7fffe300 "cad", kernel_filename=0x0,   
kernel_cmdline=0x558baa12 "", initrd_filename=0x0,
cpu_model=0x0, pci_enabled=1, kvmclock_enabled=1) at
/home/tlv/akivity/qemu/hw/pc_piix.c:256
#8  0x5582f8d7 in pc_init_pci (ram_size=1073741824,
boot_device=0x7fffe300 "cad", kernel_filename=0x0,
kernel_cmdline=0x558baa12 "", initrd_filename=0x0,
cpu_model=0x0) at /home/tlv/akivity/qemu/hw/pc_piix.c:335
#9  0x556f1349 in main (argc=6,argv=0x7fffe428,
envp=0x7fffe460) at /home/tlv/akivity/qemu/vl.c:3431

Looks like isa-pit has zero gpio pins, so it fails when crashing.  I
must have mismerged it, but where is the gpio pin count set?


-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] i387: split up into exported and internal interfaces

2012-02-28 Thread Avi Kivity
On 02/28/2012 09:26 PM, Linus Torvalds wrote:
> On Tue, Feb 28, 2012 at 11:06 AM, Avi Kivity  wrote:
> >
> > No, the scheduler saves the state into task_struct.  I need it saved
> > into the vcpu structure.  We have two fpu states, the user state, and
> > the guest state.  APIs that take a task_struct as a parameter, or
> > reference current implicitly, aren't going to work.
>
> As far as I can tell, you should do the saving into the vcpu structure
> when you actually switch the thing around.
>
> In fact, you can do it these days by just playing around with the
> "tsk->thread.fpu.state" pointer, I guess.

Good idea.  I can't say I like poking into struct fpu's internals, but
we can treat it as an opaque structure and copy it around.

We can also do this in kernel_fpu_begin(), and allow it to be preemptible.

> But it all boils down to the fact that your code is not just ugly,
> it's *buggy*. If you play around with setting TS, you *will* be hit by
> interrupts etc that will start to use the FP code that you "don't
> use".
>
> And there is no excuse for you touching the host TS. The kernel does
> that for you, and does it better. And caches the end result in
> TS_USEDFPU (old) or in some variable that you shouldn't look at but
> can access with the user_has_fpu() helpers.

Again, I can't avoid touching it.  I can try to get the hardware to
always preserve its value, but that comes with a cost.

btw, I think the current code is safe wrt kvm.  If the guest fpu has
been loaded, then we know that that TS_USEDFPU is set, since we will
have saved the user fpu earlier.  Yes it's "accidental" and needs to be
improved, but I don't think it's a data corruptor waiting to happen.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2] i387: split up into exported and internal interfaces

2012-02-28 Thread Linus Torvalds
On Tue, Feb 28, 2012 at 11:06 AM, Avi Kivity  wrote:
>
> No, the scheduler saves the state into task_struct.  I need it saved
> into the vcpu structure.  We have two fpu states, the user state, and
> the guest state.  APIs that take a task_struct as a parameter, or
> reference current implicitly, aren't going to work.

As far as I can tell, you should do the saving into the vcpu structure
when you actually switch the thing around.

In fact, you can do it these days by just playing around with the
"tsk->thread.fpu.state" pointer, I guess.

But it all boils down to the fact that your code is not just ugly,
it's *buggy*. If you play around with setting TS, you *will* be hit by
interrupts etc that will start to use the FP code that you "don't
use".

And there is no excuse for you touching the host TS. The kernel does
that for you, and does it better. And caches the end result in
TS_USEDFPU (old) or in some variable that you shouldn't look at but
can access with the user_has_fpu() helpers.

Linus


Re: [KVM-AUTOTEST] [KVM-autotest] Cgroup-kvm rework

2012-02-28 Thread Lucas Meneghel Rodrigues

On 02/27/2012 03:42 PM, Lukas Doktor wrote:

> Hi,
>
> This is a complete rework of the cgroup test from subtests to
> single-test execution.  It improves stability of testing and allows
> better test customisation.  The speed is similar/faster in single
> variant execution and a bit slower in all-variants execution compared
> to the previous version.
>
> It also contains a lot of important bugfixes and some cool enhancements
> described in the patch.
>
> Check out the current version at:
> https://github.com/autotest/autotest/pull/209

Patchset applied, thanks Lukas!

> Regards,
> Lukáš





Re: [PATCH 2/2] i387: split up into exported and internal interfaces

2012-02-28 Thread Avi Kivity
On 02/28/2012 08:34 PM, Linus Torvalds wrote:
> On Tue, Feb 28, 2012 at 10:09 AM, Avi Kivity  wrote:
> >
> > This is done by preempt notifiers.  Whenever a task switch happens we
> > push the guest fpu state into memory (if loaded) and let the normal
> > stuff happen.  So if we had a task switch during instruction
> > emulation, for example, then we'd get the "glacial and stupid path" to fire.
>
> Oh christ.
>
> This is exactly what the scheduler has ALWAYS ALREADY DONE FOR YOU.

No, the scheduler saves the state into task_struct.  I need it saved
into the vcpu structure.  We have two fpu states, the user state, and
the guest state.  APIs that take a task_struct as a parameter, or
reference current implicitly, aren't going to work.

> That's what the i387 save-and-restore code is all about. What's the
> advantage of just re-implementing it in non-obvious ways?
>
> Stop doing it. You get *zero* advantages from just doing what the
> scheduler natively does for you, and the scheduler does it *better*.

The scheduler does something different.

What I'd ideally want is

  struct fpu {
      int cpu;  /* -1 = not loaded */
      union thread_xstate *state;
  };

Perhaps with a struct fpu_ops *ops if needed.  We could then let various
users' fpus float around freely and only save/load them at the last moment.
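
A minimal userspace sketch of how such a structure could be used, assuming
a simulated register file per CPU.  Everything here apart from the two
struct fpu fields is hypothetical (fpu_load, fpu_save, hw_regs, hw_owner
and the counters), and a real kernel version would need an IPI or similar
to pull state back from another CPU.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS  2
#define FPU_REGS 8

struct fpu_state { double regs[FPU_REGS]; };

struct fpu {
	int cpu;			/* -1 = not loaded on any CPU */
	struct fpu_state *state;	/* in-memory save area */
};

/* Simulated hardware: one register file per CPU and its current owner. */
static struct fpu_state hw_regs[NR_CPUS];
static struct fpu *hw_owner[NR_CPUS];
static int nr_saves, nr_loads;		/* counters that make the laziness visible */

/* Write the live registers back to memory and release the CPU. */
static void fpu_save(struct fpu *f)
{
	if (f->cpu < 0)
		return;			/* nothing live, nothing to save */
	*f->state = hw_regs[f->cpu];
	hw_owner[f->cpu] = NULL;
	f->cpu = -1;
	nr_saves++;
}

/* Make f the live FPU context on cpu, doing work only when needed. */
void fpu_load(struct fpu *f, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (f->cpu == cpu)
		return;			/* already live here */
	fpu_save(f);			/* live elsewhere: pull it back first */
	if (hw_owner[cpu])
		fpu_save(hw_owner[cpu]);	/* evict the current holder */
	hw_regs[cpu] = *f->state;
	hw_owner[cpu] = f;
	f->cpu = cpu;
	nr_loads++;
}
```

With a user context and a guest context both described by struct fpu,
repeated fpu_load() calls for the same context on the same CPU cost
nothing, and a save happens only when a different context actually needs
the registers.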


-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2] i387: split up into exported and internal interfaces

2012-02-28 Thread Linus Torvalds
On Tue, Feb 28, 2012 at 10:09 AM, Avi Kivity  wrote:
>
> This is done by preempt notifiers.  Whenever a task switch happens we
> push the guest fpu state into memory (if loaded) and let the normal
> stuff happen.  So if we had a task switch during instruction
> emulation, for example, then we'd get the "glacial and stupid path" to fire.

Oh christ.

This is exactly what the scheduler has ALWAYS ALREADY DONE FOR YOU.

That's what the i387 save-and-restore code is all about. What's the
advantage of just re-implementing it in non-obvious ways?

Stop doing it. You get *zero* advantages from just doing what the
scheduler natively does for you, and the scheduler does it *better*.

   Linus


Re: [PATCH 2/2] i387: split up into exported and internal interfaces

2012-02-28 Thread Avi Kivity
On 02/28/2012 08:08 PM, Linus Torvalds wrote:
> On Tue, Feb 28, 2012 at 9:37 AM, Linus Torvalds
>  wrote:
> >
> > So where's the comment about why you actually own and control CR0.TS,
> > and nobody else does?
>
> So what I think KVM should strive for (but I really don't know the
> code, so maybe there are good reasons why it is impossible) is to just
> never touch TS at all, and let the core kernel code do it all for you.

Which TS?  With kvm (restricting ourselves to vmx at this moment) there
are three versions lying around: CR0.TS, GUEST_CR0.TS (which is loaded
by the cpu during entry into the guest) and HOST_CR0.TS (which is loaded
by the cpu during guest exit).  GUEST_CR0.TS is actually a combination
of the guest's virtualized CR0.TS, and a flag that says whether the
guest fpu is loaded or not.  HOST_CR0 is basically a cached CR0, but as
it's expensive to change it, we don't want to reflect CR0.TS into
HOST_CR0.TS.

> When you need access to the FPU, let the core code just handle it for
> you. Let it trap and restore the state. When you get scheduled away,
> let the core code just set TS, because you really can't touch the FP
> state again.
>
> IOW, just do the FP operations you do within the thread you are. Never
> touch TS at all, just don't worry about it. Worry about your own
> internal FP state machine, but don't interact with the "global" kernel
> TS state machine.

I can't avoid touching it.  On exit vmx will set it for me.  I can
atomically copy CR0.TS into HOST_CR0.TS, but that's expensive.

Maybe we should just virtualize it into a percpu variable.  Should speed
up the non-kvm case as well since read_cr0() is likely not very fast.

> You can't do a lot better than that, I think. Especially now that we
> do the lazy restore, we can schedule between two tasks and if only one
> of them actually uses the FPU, we won't bother with extraneous state
> restores.

Ah, this is a new bit, I'll have to study it.

> The one exception I can think of is that if you are loading totally
> *new* FP state, and you think that TS is likely to be set, instead of
> trapping (and loading the old state in the trap handling) only to
> return to load the *new* state, we could expose a helper for that
> situation. It would look something like
>
>    user_fpu_begin();
>    fpu_restore_checking(newfpustate);
>
> and it would avoid the trap when loading the new state.
>

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2] i387: split up into exported and internal interfaces

2012-02-28 Thread Avi Kivity
On 02/28/2012 07:37 PM, Linus Torvalds wrote:
> On Tue, Feb 28, 2012 at 9:21 AM, Avi Kivity  wrote:
> >
> > What you described is the slow path.
>
> Indeed. I'd even call it the "glacial and stupid" path.

Right.  It won't be offended, since it's almost never called.

> >The fast path is
> >
> >  void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
> >  {
> >      if (vcpu->guest_fpu_loaded)
> >          return;
> >
> > If we're emulating an fpu instruction, it's very likely that we have the
> > guest fpu loaded into the cpu.  If we do take that path, we have the
> > right fpu state loaded, but CR0.TS is set by the recent exit, so we need
> > to clear it (the comment is in fact correct, except that it misspelled
> > "set").
>
> So why the hell do you put the clts in the wrong place, then?

You mean, why not in kvm_load_guest_fpu()?  Most of the uses of
kvm_load_guest_fpu() are just before guest entry, so the clts() would be
immediately overwritten by the loading of the GUEST_CR0 register (which
may have TS set or clear).  So putting it here would be wasting cycles.

>
> Dammit, the code is crap.
>
> The clts's are in random places, they don't make any sense, the
> comments are *wrong*, and the only reason they exist in the first
> place is exactly the fact that the code does what it does in the wrong
> place.
>
> There's a reason I called the code crap. It makes no sense. Your
> explanation only explains what it does (badly) when what *should* have
> happened is you saying "ok, that makes no sense, let's fix it up".
>
> So let me re-iterate: it makes ZERO SENSE to clear clts *after*
> restoring the state. Don't do it. Don't make excuses for doing it.
> It's WRONG. Whether it even happens to work by random chance isn't
> even the issue.
>
> Where it *can* make sense to clear TS is in your code that says
>
> >      if (vcpu->guest_fpu_loaded)
> >          return;
>
> where you could have done it like this:
>
>     /* If we already have the FPU loaded, just clear TS - it was set
>        by a recent exit */
>     if (vcpu->guest_fpu_loaded) {
>         clts();
>         return;
>     }
>
> And then at least the *placement* of clts would make sense. 

True, it's cleaner, but as noted above, it's wasteful.

> HOWEVER.
> Even if you did that, what guarantees that the most recent FP usage
> was by *your* kvm process? Sure, the "recent exit" may have set TS,
> but have you had preemption disabled for the whole time? Guaranteed?

Both the vcpu_load() and emulation paths happen with preemption disabled.

> Because TS may be set because something else rescheduled too.
>
> So where's the comment about why you actually own and control CR0.TS,
> and nobody else does?

The code is poorly commented, yes.

> Finally, how does this all interact with the fact that the task
> switching now keeps the FPU state around in the FPU and caches what
> state it is? I have no idea, because the kvm code is so impenetrable
> due to all these totally unexplained things.

This is done by preempt notifiers.  Whenever a task switch happens we
push the guest fpu state into memory (if loaded) and let the normal
stuff happen.  So if we had a task switch during instruction
emulation, for example, then we'd get the "glacial and stupid path" to fire.
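
A minimal sketch of that flow, assuming a single simulated FPU register and
a hand-called scheduler hook.  The struct layout, put_guest_fpu(),
vcpu_sched_out() and hw_fpu are stand-ins chosen for illustration, and host
state handling is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vcpu {
	bool guest_fpu_loaded;
	int guest_fpu_mem;	/* saved guest FPU image (one int stands in) */
};

static int hw_fpu;		/* simulated FPU register contents */

/* Fast path first: most callers find the guest state already loaded. */
void load_guest_fpu(struct vcpu *v)
{
	assert(v != NULL);
	if (v->guest_fpu_loaded)
		return;
	hw_fpu = v->guest_fpu_mem;	/* slow path: reload from memory */
	v->guest_fpu_loaded = true;
}

/* Push the guest state back to memory if it is live in the registers. */
void put_guest_fpu(struct vcpu *v)
{
	assert(v != NULL);
	if (!v->guest_fpu_loaded)
		return;
	v->guest_fpu_mem = hw_fpu;
	v->guest_fpu_loaded = false;
}

/* Stand-in for the preempt notifier's sched-out callback. */
void vcpu_sched_out(struct vcpu *v)
{
	put_guest_fpu(v);
}
```

After vcpu_sched_out() has run, guest_fpu_loaded is false, so the next
load_guest_fpu() call takes the slow path and reloads from memory.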

> Btw, don't get me wrong - the core FPU state save/restore was a mess
> of random "TS_USEDFPU" and clts() crap too. We had several bugs there,
> partly exactly because the core FPU restore code also had "clts()"
> separated from the logic that actually set or cleared the TS_USEDFPU
> bit, and it was not at all clear at a "local" setting what the F was
> going on.
>
> Most of the recent i387 changes were actually to clean up and make
> sense of that thing, and making sure that the clts() was paired with
> the action of actually giving the FPU to the thread etc. So at least
> now the core FPU handling is reasonably sane, and the clts's and
> stts's are paired with the things that take control of the FPU, and we
> have a few helper functions and some abstraction in place.
>
> The kvm code definitely needs the same kind of cleanup. Because as it
> is now, it's just random code junk, and there is no obvious reason why
> it wouldn't interact horribly badly with an interrupt doing
> "irq_fpu_usable()" + "kernel_fpu_begin/end()" for example.

Well, interrupted_kernel_fpu_idle() does look like it returns the wrong
result when kvm is active.  Has the semantics of that changed in the
recent round?  The kvm fpu code is quite old and we haven't had any
reports of bad interactions with RAID/encryption since it was stabilized.

> Seriously.

I agree a refactoring is needed.  We may need to replace read_cr0() in

  static inline bool interrupted_kernel_fpu_idle(void)
  {
      return !(current_thread_info()->status & TS_USEDFPU) &&
             (read_cr0() & X86_CR0_TS);
  }

with some percpu variable since CR0.TS is not reliable in interrupt
context while kvm is running.
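
A minimal sketch of that idea under one reading of the problem: the
hardware bit gets set by a guest exit without the core kernel knowing, so
a check based on the hardware bit and a check based on a software shadow
can disagree.  The arrays stand in for percpu variables and for the real
register, and every name below is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static bool hw_cr0_ts[NR_CPUS];		/* what the hardware bit holds */
static bool shadow_ts[NR_CPUS];		/* what the core kernel last set */
static bool task_used_fpu[NR_CPUS];	/* stand-in for TS_USEDFPU of current */

/* Core-kernel updates go through helpers that keep the shadow in step. */
void core_stts(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	hw_cr0_ts[cpu] = true;
	shadow_ts[cpu] = true;
}

void core_clts(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	hw_cr0_ts[cpu] = false;
	shadow_ts[cpu] = false;
}

/* A guest exit sets the hardware bit without touching the shadow. */
void guest_exit_sets_ts(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	hw_cr0_ts[cpu] = true;
}

/* Variant that trusts the hardware bit, like the read_cr0() check. */
bool fpu_idle_hw(int cpu)
{
	return !task_used_fpu[cpu] && hw_cr0_ts[cpu];
}

/* Variant that consults the software shadow instead. */
bool fpu_idle_shadow(int cpu)
{
	return !task_used_fpu[cpu] && shadow_ts[cpu];
}
```

If guest FPU state is live in the registers and a guest exit has set the
hardware bit, fpu_idle_hw() reports the FPU as idle while
fpu_idle_shadow() does not, which is the disagreement a percpu variable
would be meant to remove.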

-- 
error compiling committee.c: too many arguments to function

Re: [PATCH 2/2] i387: split up into exported and internal interfaces

2012-02-28 Thread Linus Torvalds
On Tue, Feb 28, 2012 at 9:37 AM, Linus Torvalds
 wrote:
>
> So where's the comment about why you actually own and control CR0.TS,
> and nobody else does?

So what I think KVM should strive for (but I really don't know the
code, so maybe there are good reasons why it is impossible) is to just
never touch TS at all, and let the core kernel code do it all for you.

When you need access to the FPU, let the core code just handle it for
you. Let it trap and restore the state. When you get scheduled away,
let the core code just set TS, because you really can't touch the FP
state again.

IOW, just do the FP operations you do within the thread you are. Never
touch TS at all, just don't worry about it. Worry about your own
internal FP state machine, but don't interact with the "global" kernel
TS state machine.

You can't do a lot better than that, I think. Especially now that we
do the lazy restore, we can schedule between two tasks and if only one
of them actually uses the FPU, we won't bother with extraneous state
restores.

The one exception I can think of is that if you are loading totally
*new* FP state, and you think that TS is likely to be set, instead of
trapping (and loading the old state in the trap handling) only to
return to load the *new* state, we could expose a helper for that
situation. It would look something like

   user_fpu_begin();
   fpu_restore_checking(newfpustate);

and it would avoid the trap when loading the new state.

   Linus


Re: [PATCH] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-28 Thread Avi Kivity
On 02/28/2012 07:36 PM, David Ahern wrote:
> On 2/28/12 10:24 AM, Avi Kivity wrote:
>> On 02/28/2012 05:55 PM, Joerg Roedel wrote:
>>>
>>>   __init int amd_pmu_init(void)
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index 5fa553b..773fee2 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -29,6 +29,7 @@
>>>   #include
>>>   #include
>>>
>>> +#include
>>>   #include
>>>   #include
>>>   #include
>>> @@ -575,6 +576,8 @@ static void svm_hardware_disable(void *garbage)
>>>   wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT);
>>>
>>>   cpu_svm_disable();
>>> +
>>> +x86_pmu_disable_virt();
>>>   }
>>>
>>>   static int svm_hardware_enable(void *garbage)
>>> @@ -622,6 +625,8 @@ static int svm_hardware_enable(void *garbage)
>>>
>>>   svm_init_erratum_383();
>>>
>>> +x86_pmu_enable_virt();
>>> +
>>>   return 0;
>>>   }
>>>
>>
>> These should go into x86.c.  If the functions later gain meaning on
>> Intel, we want them to be called (and nothing in the name suggests
>> they're AMD specific).
>>
>
> I was going to suggest the reverse: since this patch addresses an AMD bug,
> why not push those functions into perf_event_amd.c and make them
> dependent on CONFIG_CPU_SUP_AMD as well.

It depends on which direction you expect the code to grow.  These hooks
seem reasonable, so I think they should be generic.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2] i387: split up into exported and internal interfaces

2012-02-28 Thread Linus Torvalds
On Tue, Feb 28, 2012 at 9:21 AM, Avi Kivity  wrote:
>
> What you described is the slow path.

Indeed. I'd even call it the "glacial and stupid" path.

>The fast path is
>
>  void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
>  {
>      if (vcpu->guest_fpu_loaded)
>          return;
>
> If we're emulating an fpu instruction, it's very likely that we have the
> guest fpu loaded into the cpu.  If we do take that path, we have the
> right fpu state loaded, but CR0.TS is set by the recent exit, so we need
> to clear it (the comment is in fact correct, except that it misspelled
> "set").

So why the hell do you put the clts in the wrong place, then?

Dammit, the code is crap.

The clts's are in random places, they don't make any sense, the
comments are *wrong*, and the only reason they exist in the first
place is exactly the fact that the code does what it does in the wrong
place.

There's a reason I called the code crap. It makes no sense. Your
explanation only explains what it does (badly) when what *should* have
happened is you saying "ok, that makes no sense, let's fix it up".

So let me re-iterate: it makes ZERO SENSE to clear clts *after*
restoring the state. Don't do it. Don't make excuses for doing it.
It's WRONG. Whether it even happens to work by random chance isn't
even the issue.

Where it *can* make sense to clear TS is in your code that says

>      if (vcpu->guest_fpu_loaded)
>          return;

where you could have done it like this:

    /* If we already have the FPU loaded, just clear TS - it was set
       by a recent exit */
    if (vcpu->guest_fpu_loaded) {
        clts();
        return;
    }

And then at least the *placement* of clts would make sense. HOWEVER.
Even if you did that, what guarantees that the most recent FP usage
was by *your* kvm process? Sure, the "recent exit" may have set TS,
but have you had preemption disabled for the whole time? Guaranteed?

Because TS may be set because something else rescheduled too.

So where's the comment about why you actually own and control CR0.TS,
and nobody else does?

Finally, how does this all interact with the fact that the task
switching now keeps the FPU state around in the FPU and caches what
state it is? I have no idea, because the kvm code is so impenetrable
due to all these totally unexplained things.

Btw, don't get me wrong - the core FPU state save/restore was a mess
of random "TS_USEDFPU" and clts() crap too. We had several bugs there,
partly exactly because the core FPU restore code also had "clts()"
separated from the logic that actually set or cleared the TS_USEDFPU
bit, and it was not at all clear at a "local" setting what the F was
going on.

Most of the recent i387 changes were actually to clean up and make
sense of that thing, and making sure that the clts() was paired with
the action of actually giving the FPU to the thread etc. So at least
now the core FPU handling is reasonably sane, and the clts's and
stts's are paired with the things that take control of the FPU, and we
have a few helper functions and some abstraction in place.

The kvm code definitely needs the same kind of cleanup. Because as it
is now, it's just random code junk, and there is no obvious reason why
it wouldn't interact horribly badly with an interrupt doing
"irq_fpu_usable()" + "kernel_fpu_begin/end()" for example.

Seriously.

  Linus


Re: [PATCH] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-28 Thread David Ahern

On 2/28/12 10:24 AM, Avi Kivity wrote:

On 02/28/2012 05:55 PM, Joerg Roedel wrote:


  __init int amd_pmu_init(void)
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5fa553b..773fee2 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -29,6 +29,7 @@
  #include
  #include

+#include
  #include
  #include
  #include
@@ -575,6 +576,8 @@ static void svm_hardware_disable(void *garbage)
wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT);

cpu_svm_disable();
+
+   x86_pmu_disable_virt();
  }

  static int svm_hardware_enable(void *garbage)
@@ -622,6 +625,8 @@ static int svm_hardware_enable(void *garbage)

svm_init_erratum_383();

+   x86_pmu_enable_virt();
+
return 0;
  }



These should go into x86.c.  If the functions later gain meaning on
Intel, we want them to be called (and nothing in the name suggests
they're AMD specific).



I was going to suggest the reverse: since this patch addresses an AMD bug, why
not push those functions into perf_event_amd.c and make them dependent 
on CONFIG_CPU_SUP_AMD as well.


David


Re: one out of four existing kvm guest's not starting after system upgrade

2012-02-28 Thread Thomas Fjellstrom
On Tue Feb 28, 2012, you wrote:
> On 2012-02-19 21:13, Thomas Fjellstrom wrote:
> > I'm pretty much stumped on this. So I decided to try re-creating the vm
> > through virt-manager. It's up and running now. The only two major
> > differences I can see in the old and new config is the machine (-M
> > pc-0.12 vs -M pc-1.0) parameter, and the uuid. The rest of the
> > parameters I played with a lot trying to get it to work by starting up
> > the vm manually from the cli. I can't really see how those two changes
> > would do much of anything considering the other three VM's still are
> > configured to use -M pc-0.12, and they work fine.
> 
> To pick up this topic again: The trace contains no clear indication what
> is going on. Now I'm trying to understand what works and what not.
> Please correct / extend as required:
> 
>  - qemu-kvm-0.12 problematic-vm.img   [OK]
>  - qemu-kvm-1.0 -M pc-0.12 problematic-vm.img [HANG]
>  - qemu-kvm-1.0 -M pc-1.0 problematic-vm.img  [OK]
> 
> In all cases, the image is the same, never reinstalled?

Right. Same exact disk image.

> 
> BTW, what is your guest again? What is your VM configuration?

Guest is debian squeeze. (with a trace of sid, but not a whole lot)
 
> Those IOCTL error messages you find in the kernel log likely relate to
> direct cdrom access from the qemu process. Do you pass a host drive
> through?

Not a CDROM drive no. The host doesn't even have a cdrom drive. There are some 
virtio lvm disk images passed through.
 
> Jan


-- 
Thomas Fjellstrom
tho...@fjellstrom.ca


[Bug 42812] kvm module has bogus kernel version check

2012-02-28 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42812


Jan Kiszka  changed:

   What|Removed |Added

 CC||jan.kis...@web.de




--- Comment #1 from Jan Kiszka   2012-02-28 17:29:13 ---
Thanks for reporting, addressed in latest kvm-kmod.git.

Jan

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


Re: [PATCH] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-28 Thread Avi Kivity
On 02/28/2012 05:55 PM, Joerg Roedel wrote:
>  
>  __init int amd_pmu_init(void)
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 5fa553b..773fee2 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -575,6 +576,8 @@ static void svm_hardware_disable(void *garbage)
>   wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT);
>  
>   cpu_svm_disable();
> +
> + x86_pmu_disable_virt();
>  }
>  
>  static int svm_hardware_enable(void *garbage)
> @@ -622,6 +625,8 @@ static int svm_hardware_enable(void *garbage)
>  
>   svm_init_erratum_383();
>  
> + x86_pmu_enable_virt();
> +
>   return 0;
>  }
>  

These should go into x86.c.  If the functions later gain meaning on
Intel, we want them to be called (and nothing in the name suggests
they're AMD specific).

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2] i387: split up into exported and internal interfaces

2012-02-28 Thread Avi Kivity
On 02/28/2012 06:05 PM, Linus Torvalds wrote:
> On Tue, Feb 28, 2012 at 3:21 AM, Avi Kivity  wrote:
> >
> > Can you elaborate on what you don't like in the kvm code (apart from "it
> > does virtualization")?
>
> It doesn't fit any of the patterns of the x87 save/restore code, and I
> don't know what it does.

It tries to do two things: first, keep the guest fpu loaded while
running kernel code, and second, allow the instruction emulator to
access the guest fpu.

> It does clts on its own, in random places without actually restoring
> the FPU state. Why is that ok? I don't know. 

The way we use vmx, it does an implicit stts() after an exit from a
guest (it's not required, but it's expensive to play with the value of
the host cr0, so we set it to a safe value and clear it when needed). 
So sometimes we need these random clts()s.

> And I don't think it is,
> but I didn't change any of it. Why doesn't that thing corrupt the lazy
> state save of some other process, for example?
>
> Doing a "clts()" without restoring the FPU state immediately
> afterwards is fundamentally *wrong*. It's crazy. Insane. You can now
> use the FPU, but with whatever random state that is in it that caused
> TS to be set to begin with.

There are two cases.  In one of them, we do restore the guest fpu
immediately afterwards.  In the other, we're just clearing a CR0.TS that
was set spuriously.

> And if you don't have any FPU state to restore, because you want to
> use your own kernel state, you should use the
> "kernel_fpu_begin()/end()" things that we have had forever.

We do have state - the guest state.

> Here's an example of the kind of UTTER AND ABSOLUTE SHIT that kvm FPU
> state restore is:
>
>   static void emulator_get_fpu(struct x86_emulate_ctxt *ctxt)
>   {
>           preempt_disable();
>           kvm_load_guest_fpu(emul_to_vcpu(ctxt));
>           /*
>            * CR0.TS may reference the host fpu state, not the guest fpu state,
>            * so it may be clear at this point.
>            */
>           clts();
>   }
>
> that whole "comment" says nothing at all. And clearing CR0.TS *after*
> loading the FPU state is a f*cking joke, since you need it clear to
> load the FPU state to begin with. So as far as I can tell,
> kvm_load_guest_fpu() will have cleared the FPU state already, but *it*
> did it by:
>
>         unlazy_fpu(current);
>         fpu_restore_checking(&vcpu->arch.guest_fpu);
>
> where "unlazy_fpu()" will have *set* TS if it wasn't set before, so
> fpu_restore_checking() will now TAKE A FAULT, and in that fault
> handler it will clear TS so that it can reload the state we just saved
> (yes, really), only to then return to fpu_restore_checking() and
> reload yet *another* state.
>
> The code is crap. It's insane. It may work, but if it does, it does so
> by pure chance and happenstance. The code is CLEARLY INSANE.

What you described is the slow path.  The fast path is

  void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
  {
          if (vcpu->guest_fpu_loaded)
                  return;

If we're emulating an fpu instruction, it's very likely that we have the
guest fpu loaded into the cpu.  If we do take that path, we have the
right fpu state loaded, but CR0.TS is set by the recent exit, so we need
to clear it (the comment is in fact correct, except that it misspelled
"set").

> I wasn't going to touch it. It had been written by a
> random-code-generator that had strung the various FPU accessor
> functions up in random order until it compiled.

The tried and tested way, yes.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 24/37] KVM: PPC: booke: rework rescheduling checks

2012-02-28 Thread Scott Wood
On 02/28/2012 05:03 AM, Alexander Graf wrote:
> 
> On 27.02.2012, at 20:28, Scott Wood wrote:
> 
>> If there is a signal pending and MSR[WE] is set, we'll loop forever
>> without reaching this check.
> 
> Good point. How about something like this on top (will fold in later)?
> 
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index 430055e..9f27258 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -477,15 +477,17 @@ static int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
>                         continue;
>                 }
>  
> +               if (signal_pending(current)) {
> +                       r = 1;
> +                       break;
> +               }
> +
>                 if (kvmppc_core_prepare_to_enter(vcpu)) {
>                         /* interrupts got enabled in between, so we
>                            are back at square 1 */
>                         continue;
>                 }
>  
> -               if (signal_pending(current))
> -                       r = 1;
> -
>                 break;
>         }

Looks OK.

-Scott
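
For readers following the ordering issue, the effect of moving the signal check can be modelled in user space. The sketch below is not the booke.c code: `sketch_vcpu`, `sketch_core_prepare_to_enter()` and the iteration cap are hypothetical names introduced for illustration, and it assumes, as a simplification, that the core hook keeps asking for a retry for as long as MSR[WE] is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the vcpu state relevant to this discussion. */
struct sketch_vcpu {
    bool msr_we;          /* guest has set MSR[WE] (wait state) */
    bool signal_pending;  /* a signal is pending for the vcpu thread */
};

/* Assumption: with MSR[WE] set, the core hook always requests another pass. */
static inline bool sketch_core_prepare_to_enter(const struct sketch_vcpu *v)
{
    return v->msr_we;
}

/*
 * Model of the retry loop.  Returns 1 when the loop exits because of a
 * pending signal, 0 when the vcpu is ready to enter the guest, and -1 when
 * max_iters passes were used up (the model's stand-in for "loops forever").
 */
static inline int sketch_prepare_to_enter(const struct sketch_vcpu *v,
                                          bool check_signal_first,
                                          int max_iters)
{
    assert(max_iters > 0);

    for (int i = 0; i < max_iters; i++) {
        /* Ordering proposed in the diff: test the signal before the hook. */
        if (check_signal_first && v->signal_pending)
            return 1;

        if (sketch_core_prepare_to_enter(v))
            continue;   /* back at square 1 */

        /* Original ordering: the check sits after the hook. */
        if (!check_signal_first && v->signal_pending)
            return 1;

        return 0;
    }
    return -1;
}
```

Under these assumptions, a vcpu with both `msr_we` and `signal_pending` set never reaches the late check, while the early check returns 1 on the first pass.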



Re: virtio-blk performance regression and qemu-kvm

2012-02-28 Thread Martin Mailand

Hi Stefan,
I was bisecting qemu-kvm.git.

 git remote show origin
* remote origin
  Fetch URL: git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
  Push  URL: git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git

The bisect log is:

git bisect start
# good: [b8095f24f24e50a7d4be33d8a79474aff3324295] Bump version to reflect v0.15.0-rc0
git bisect good b8095f24f24e50a7d4be33d8a79474aff3324295
# bad: [e072ea2fd8fdceef64159b9596d3c15ce01bea91] Bump version to 1.0-rc0
git bisect bad e072ea2fd8fdceef64159b9596d3c15ce01bea91
# bad: [7d4b4ba5c2bae99d44f265884b567ae63947bb4a] block: New change_media_cb() parameter load
git bisect bad 7d4b4ba5c2bae99d44f265884b567ae63947bb4a
# good: [baaa86d9f5d516d423d34af92e0c15b56e06ac4b] hw/9pfs: Update v9fs_create to use coroutines
git bisect good baaa86d9f5d516d423d34af92e0c15b56e06ac4b
# bad: [9aed1e036dc0de49d08d713f9e5c4655e94acb56] Rename qemu -> qemu-system-i386
git bisect bad 9aed1e036dc0de49d08d713f9e5c4655e94acb56
# good: [8ef9ea85a2cc1007eaefa53e6871f1f83bcef22d] Merge remote-tracking branch 'qemu-kvm/memory/batch' into staging
git bisect good 8ef9ea85a2cc1007eaefa53e6871f1f83bcef22d
# good: [9f4bd6baf64b8139cf2d7f8f53a98b27531da13c] Merge remote-tracking branch 'kwolf/for-anthony' into staging
git bisect good 9f4bd6baf64b8139cf2d7f8f53a98b27531da13c
# good: [09001ee7b27b9b5f049362efc427d03e2186a431] trace: [make] replace 'ifeq' with values in CONFIG_TRACE_*
git bisect good 09001ee7b27b9b5f049362efc427d03e2186a431
# good: [d8e8ef4ee05bfee0df84e2665d9196c4a954c095] simpletrace: fix process() argument count
git bisect good d8e8ef4ee05bfee0df84e2665d9196c4a954c095
# good: [a952c570c865d5eae6c148716f2cb585a0d3a2ee] Merge remote-tracking branch 'qemu-kvm-tmp/memory/core' into staging
git bisect good a952c570c865d5eae6c148716f2cb585a0d3a2ee
# good: [625f9e1f54cd78ee98ac22030da527c9a1cc9d2b] Merge remote-tracking branch 'stefanha/trivial-patches' into staging
git bisect good 625f9e1f54cd78ee98ac22030da527c9a1cc9d2b
# good: [d9cd446b4f6ff464f9520898116534de988d9bc1] trace: fix out-of-tree builds
git bisect good d9cd446b4f6ff464f9520898116534de988d9bc1
# bad: [12d4536f7d911b6d87a766ad7300482ea663cea2] main: force enabling of I/O thread
git bisect bad 12d4536f7d911b6d87a766ad7300482ea663cea2

-martin

On 28.02.2012 18:05, Stefan Hajnoczi wrote:
> On Tue, Feb 28, 2012 at 4:39 PM, Martin Mailand  wrote:
>> I could reproduce it and I bisected it down to this commit.
>>
>> 12d4536f7d911b6d87a766ad7300482ea663cea2 is the first bad commit
>> commit 12d4536f7d911b6d87a766ad7300482ea663cea2
>> Author: Anthony Liguori
>> Date:   Mon Aug 22 08:24:58 2011 -0500
>
> This seems strange to me.
>
> What commit 12d4536f7 did was to switch to a threading model in
> *qemu.git* that is similar to what *qemu-kvm.git* has been doing all
> along.
>
> That means the qemu-kvm binaries already use the iothread model.  The
> only explanation I have is that your bisect went down a qemu.git path
> and you therefore tripped over this - but in practice it should not
> account for a difference between qemu-kvm 0.14.1 and 1.0.
>
> Can you please confirm that you are bisecting qemu-kvm.git and not qemu.git?
>
> Stefan




Re: virtio-blk performance regression and qemu-kvm

2012-02-28 Thread Stefan Hajnoczi
On Tue, Feb 28, 2012 at 4:39 PM, Martin Mailand  wrote:
> I could reproduce it and I bisected it down to this commit.
>
> 12d4536f7d911b6d87a766ad7300482ea663cea2 is the first bad commit
> commit 12d4536f7d911b6d87a766ad7300482ea663cea2
> Author: Anthony Liguori 
> Date:   Mon Aug 22 08:24:58 2011 -0500

This seems strange to me.

What commit 12d4536f7 did was to switch to a threading model in
*qemu.git* that is similar to what *qemu-kvm.git* has been doing all
along.

That means the qemu-kvm binaries already use the iothread model.  The
only explanation I have is that your bisect went down a qemu.git path
and you therefore tripped over this - but in practice it should not
account for a difference between qemu-kvm 0.14.1 and 1.0.

Can you please confirm that you are bisecting qemu-kvm.git and not qemu.git?

Stefan


Re: virtio-blk performance regression and qemu-kvm

2012-02-28 Thread Martin Mailand

Hi,
I could reproduce it and I bisected it down to this commit.

12d4536f7d911b6d87a766ad7300482ea663cea2 is the first bad commit
commit 12d4536f7d911b6d87a766ad7300482ea663cea2
Author: Anthony Liguori 
Date:   Mon Aug 22 08:24:58 2011 -0500


-martin


On 22.02.2012 20:53, Stefan Hajnoczi wrote:
> On Wed, Feb 22, 2012 at 4:48 PM, Dongsu Park wrote:
>>> Try turning ioeventfd off for the virtio-blk device:
>>>
>>> -device virtio-blk-pci,ioeventfd=off,...
>>>
>>> You might see better performance since ramdisk I/O should be very
>>> low-latency.  The overhead of using ioeventfd might not make it
>>> worthwhile.  The ioeventfd feature was added post-0.14 IIRC.  Normally
>>> it helps avoid stealing vcpu time and also causing lock contention
>>> inside the guest - but if host I/O latency is extremely low it might
>>> be faster to issue I/O from the vcpu thread.
>>
>> Thanks for the tip. I tried that too, but no success.
>
> My guesses have all been wrong.  Maybe it's time to git bisect this instead :).
>
> Stefan


Re: blockdev operations [was: [Qemu-devel] KVM call agenda for Tuesday 28th]

2012-02-28 Thread Paolo Bonzini
On 28/02/2012 17:07, Eric Blake wrote:
> { 'enum': 'BlockdevOp',
>   'data': [ 'snapshot', 'snapshot-mirror', 'reopen' ] }
> { 'type': 'BlockdevAction',
>   'data': {'device': 'str', 'op': 'BlockdevOp',
>            'file': 'str', '*format': 'str', '*reuse': 'bool',
>            '*mirror': 'str', '*mirror-format': 'str' } }
> { 'command': 'blkdev-group-action-sync',
>   'data': { 'actionlist': [ 'BlockdevAction' ] } }
> 
> 
> The overall command is atomic - either all operations will succeed, or
> the command returns an error pointing to the name of the device that
> failed leaving all devices in their pre-command state.  Then, for each
> requested operation:
> 
> If op is 'snapshot', then 'file' names the new snapshot file; 'reuse' is
> optional (defaults to false) to say whether qemu creates the file from
> scratch, or opens an existing file with the backing file already
> populated correctly.  'format' gives the format of 'file', defaulting to
> qcow2.  'mirror' and 'mirror-format' must not be given.
> 
> If op is 'snapshot-mirror', then 'mirror' is mandatory; and both 'file'
> and 'mirror' are opened as a new mirrored snapshot.  Again, 'reuse'
> affects whether qemu creates the new files from scratch or trusts oVirt
> to pre-create both files with backing file information; and 'format' and
> 'mirror-format' allow control over the image format being opened.

Could snapshot-mirror be done as two separate commands for snapshot (or
reopen) and mirror?  This removes the need for mirror and mirror-format.

> If op is 'reopen', then 'file' is the name of the file to be opened to
> replace the current file tied to the blockdev, with type given by
> 'format'.  'reuse', 'mirror', and 'mirror-format' must not be given.

Otherwise looks good.

Paolo


blockdev operations [was: [Qemu-devel] KVM call agenda for Tuesday 28th]

2012-02-28 Thread Eric Blake
On 02/28/2012 07:58 AM, Stefan Hajnoczi wrote:
> On Tue, Feb 28, 2012 at 2:47 PM, Paolo Bonzini  wrote:
>> On 28/02/2012 15:39, Stefan Hajnoczi wrote:
>>> I'm not a fan of transactions or freeze/thaw (if used to atomically
>>> perform other commands).
>>>
>>> We should not export low-level block device operations so that
>>> external software can micromanage via QMP.  I don't think this is a
>>> good idea because it takes the block device offline and possibly
>>> blocks the VM.  We're reaching a level comparable to an HTTP interface
>>> for acquiring pthread mutex, doing some operations, and then another
>>> HTTP request to unlock it.  This is micromanagement; it will create
>>> more problems because we will have to support lots of little API
>>> functions.
>>
>> So you're for extending Jeff's patches to group mirroring etc.?
>>
>> That's also my favorite one, assuming we can do it in time for 1.1.
> 
> Yes, that's the approach I like the most.  It's relatively clean and
> leaves us space to develop -blockdev.

Here's the idea I was forming based on today's call:

Jeff's idea of a group operation can be extended to allow multiple
operations while reusing the framework.  For oVirt, we need the ability
to open a mirror (by passing the mirror file alongside the name of the
new external snapshot), as well as reopening a blockdev (to pivot to the
other side of an already-open mirror).

Is there a way to express a designated union in QMP?  I'm thinking
something along the lines of having the overall group command take a
list of operations, where each operation can either be 'create a
snapshot', 'create a snapshot and mirror', or 'reopen a mirror'.

I'm thinking it might look something like:

{ 'enum': 'BlockdevOp',
  'data': [ 'snapshot', 'snapshot-mirror', 'reopen' ] }
{ 'type': 'BlockdevAction',
  'data': {'device': 'str', 'op': 'BlockdevOp',
           'file': 'str', '*format': 'str', '*reuse': 'bool',
           '*mirror': 'str', '*mirror-format': 'str' } }
{ 'command': 'blkdev-group-action-sync',
  'data': { 'actionlist': [ 'BlockdevAction' ] } }


The overall command is atomic - either all operations will succeed, or
the command returns an error pointing to the name of the device that
failed leaving all devices in their pre-command state.  Then, for each
requested operation:

If op is 'snapshot', then 'file' names the new snapshot file; 'reuse' is
optional (defaults to false) to say whether qemu creates the file from
scratch, or opens an existing file with the backing file already
populated correctly.  'format' gives the format of 'file', defaulting to
qcow2.  'mirror' and 'mirror-format' must not be given.

If op is 'snapshot-mirror', then 'mirror' is mandatory; and both 'file'
and 'mirror' are opened as a new mirrored snapshot.  Again, 'reuse'
affects whether qemu creates the new files from scratch or trusts oVirt
to pre-create both files with backing file information; and 'format' and
'mirror-format' allow control over the image format being opened.

If op is 'reopen', then 'file' is the name of the file to be opened to
replace the current file tied to the blockdev, with type given by
'format'.  'reuse', 'mirror', and 'mirror-format' must not be given.
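
The per-operation rules above can be sketched as a validation pass. This is only an illustration of the proposal, not qemu code: the struct layout, `action_is_valid()` and `first_invalid_action()` are hypothetical names, and absent optional members are assumed to be represented as NULL (or `has_reuse == false`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the proposed 'BlockdevOp' enum. */
enum blockdev_op { OP_SNAPSHOT, OP_SNAPSHOT_MIRROR, OP_REOPEN };

/* Mirrors the proposed 'BlockdevAction' type; optional members may be unset. */
struct blockdev_action {
    const char *device;         /* mandatory */
    enum blockdev_op op;        /* mandatory */
    const char *file;           /* mandatory */
    const char *format;         /* optional, NULL if absent */
    bool has_reuse;             /* true if the optional 'reuse' was given */
    bool reuse;
    const char *mirror;         /* optional, NULL if absent */
    const char *mirror_format;  /* optional, NULL if absent */
};

/* Apply the "must be given" / "must not be given" rules for one action. */
static inline bool action_is_valid(const struct blockdev_action *a)
{
    if (!a->device || !a->file)
        return false;

    switch (a->op) {
    case OP_SNAPSHOT:
        /* 'mirror' and 'mirror-format' must not be given. */
        return !a->mirror && !a->mirror_format;
    case OP_SNAPSHOT_MIRROR:
        /* 'mirror' is mandatory. */
        return a->mirror != NULL;
    case OP_REOPEN:
        /* 'reuse', 'mirror' and 'mirror-format' must not be given. */
        return !a->has_reuse && !a->mirror && !a->mirror_format;
    }
    return false;
}

/*
 * Check the whole list before touching any device, in the spirit of the
 * all-or-nothing semantics.  Returns the index of the first invalid action,
 * or -1 if every action passes.
 */
static inline long first_invalid_action(const struct blockdev_action *list,
                                        size_t n)
{
    assert(list != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!action_is_valid(&list[i]))
            return (long)i;
    }
    return -1;
}
```

Validating up front covers only the argument checks; rolling back a failure that happens while files are being opened would need separate handling that this sketch does not attempt.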

-- 
Eric Blake   ebl...@redhat.com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 2/2] i387: split up into exported and internal interfaces

2012-02-28 Thread Linus Torvalds
On Tue, Feb 28, 2012 at 3:21 AM, Avi Kivity  wrote:
>
> Can you elaborate on what you don't like in the kvm code (apart from "it
> does virtualization")?

It doesn't fit any of the patterns of the x87 save/restore code, and I
don't know what it does.

It does clts on its own, in random places without actually restoring
the FPU state. Why is that ok? I don't know. And I don't think it is,
but I didn't change any of it. Why doesn't that thing corrupt the lazy
state save of some other process, for example?

Doing a "clts()" without restoring the FPU state immediately
afterwards is fundamentally *wrong*. It's crazy. Insane. You can now
use the FPU, but with whatever random state that is in it that caused
TS to be set to begin with.

And if you don't have any FPU state to restore, because you want to
use your own kernel state, you should use the
"kernel_fpu_begin()/end()" things that we have had forever.

Here's an example of the kind of UTTER AND ABSOLUTE SHIT that kvm FPU
state restore is:

  static void emulator_get_fpu(struct x86_emulate_ctxt *ctxt)
  {
          preempt_disable();
          kvm_load_guest_fpu(emul_to_vcpu(ctxt));
          /*
           * CR0.TS may reference the host fpu state, not the guest fpu state,
           * so it may be clear at this point.
           */
          clts();
  }

that whole "comment" says nothing at all. And clearing CR0.TS *after*
loading the FPU state is a f*cking joke, since you need it clear to
load the FPU state to begin with. So as far as I can tell,
kvm_load_guest_fpu() will have cleared the FPU state already, but *it*
did it by:

        unlazy_fpu(current);
        fpu_restore_checking(&vcpu->arch.guest_fpu);

where "unlazy_fpu()" will have *set* TS if it wasn't set before, so
fpu_restore_checking() will now TAKE A FAULT, and in that fault
handler it will clear TS so that it can reload the state we just saved
(yes, really), only to then return to fpu_restore_checking() and
reload yet *another* state.

The code is crap. It's insane. It may work, but if it does, it does so
by pure chance and happenstance. The code is CLEARLY INSANE.

I wasn't going to touch it. It had been written by a
random-code-generator that had strung the various FPU accessor
functions up in random order until it compiled.

  Linus


[PATCH] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-28 Thread Joerg Roedel
It turned out that a performance counter on AMD does not
count at all when the GO or HO bit is set in the control
register and SVM is disabled in EFER.

This patch works around this issue by masking out the HO bit
in the performance counter control register when SVM is not
enabled.

The GO bit is not touched because it is only set when the
user wants to count in guest-mode only. So when SVM is
disabled the counter should not run at all and the
not-counting is the intended behaviour.

Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Avi Kivity 
Cc: Stephane Eranian 
Cc: David Ahern 
Cc: Gleb Natapov 
Cc: Robert Richter 
Cc: sta...@vger.kernel.org # 3.2
Signed-off-by: Joerg Roedel 
---
 arch/x86/include/asm/perf_event.h    |    5 +++++
 arch/x86/kernel/cpu/perf_event.c     |   30 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event.h     |    6 +++++-
 arch/x86/kernel/cpu/perf_event_amd.c |    6 ++++--
 arch/x86/kvm/svm.c                   |    5 +++++
 5 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 096c975e..e0a4ad4 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -227,6 +227,8 @@ struct perf_guest_switch_msr {
 
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
+extern void x86_pmu_enable_virt(void);
+extern void x86_pmu_disable_virt(void);
 #else
 static inline perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 {
@@ -240,6 +242,9 @@ static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 }
 
 static inline void perf_events_lapic_init(void){ }
+
+static inline void x86_pmu_enable_virt(void) { }
+static inline void x86_pmu_disable_virt(void) { }
 #endif
 
 #endif /* _ASM_X86_PERF_EVENT_H */
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 5adce10..f1ba9bf 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -477,6 +477,36 @@ void x86_pmu_enable_all(int added)
}
 }
 
+void x86_pmu_enable_virt(void)
+{
+   struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+   cpuc->perf_ctr_virt_mask = 0;
+
+   /* Reload all events */
+   x86_pmu_disable_all();
+   x86_pmu_enable_all(0);
+}
+EXPORT_SYMBOL_GPL(x86_pmu_enable_virt);
+
+void x86_pmu_disable_virt(void)
+{
+   struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+   /*
+* We only mask out the Host-only bit so that host-only counting works
+* when SVM is disabled. If someone sets up a guest-only counter when
+* SVM is disabled the Guest-only bits still gets set and the counter
+* will not count anything.
+*/
+   cpuc->perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY;
+
+   /* Reload all events */
+   x86_pmu_disable_all();
+   x86_pmu_enable_all(0);
+}
+EXPORT_SYMBOL_GPL(x86_pmu_disable_virt);
+
 static struct pmu pmu;
 
 static inline int is_x86_event(struct perf_event *event)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 8944062..2c581b9 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -148,6 +148,8 @@ struct cpu_hw_events {
 * AMD specific bits
 */
struct amd_nb   *amd_nb;
+   /* Inverted mask of bits to clear in the perf_ctr ctrl registers */
+   u64 perf_ctr_virt_mask;
 
void*kfree_on_online;
 };
@@ -417,9 +419,11 @@ void x86_pmu_disable_all(void);
 static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc,
  u64 enable_mask)
 {
+   u64 disable_mask = __this_cpu_read(cpu_hw_events.perf_ctr_virt_mask);
+
if (hwc->extra_reg.reg)
wrmsrl(hwc->extra_reg.reg, hwc->extra_reg.config);
-   wrmsrl(hwc->config_base, hwc->config | enable_mask);
+   wrmsrl(hwc->config_base, (hwc->config | enable_mask) & ~disable_mask);
 }
 
 void x86_pmu_enable_all(int added);
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 0397b23..0487b12 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -357,7 +357,9 @@ static void amd_pmu_cpu_starting(int cpu)
struct amd_nb *nb;
int i, nb_id;
 
-   if (boot_cpu_data.x86_max_cores < 2)
+   cpuc->perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY;
+
+   if (boot_cpu_data.x86_max_cores < 2 || boot_cpu_data.x86 == 0x15)
return;
 
nb_id = amd_get_nb_id(cpu);
@@ -587,9 +589,9 @@ static __initconst const struct x86_pmu amd_pmu_f15h = {
.put_event_constraints  = amd_put_event_constraints,
 
.cpu_prepare= amd_pmu_cpu_prepare,
-   .cpu_starting   = amd_pmu_cpu_starting,
.
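
The masking described in the commit message can be modelled in user space as follows. This is a sketch under stated assumptions, not the kernel code: the `SKETCH_*` bit positions are assumed values (the mail only names AMD_PERFMON_EVENTSEL_HOSTONLY and does not show its value), and `sketch_counts_with_svm_off()` is a hypothetical model of the hardware behaviour reported above, covering only the case where SVM is disabled in EFER.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define SKETCH_EVENTSEL_ENABLE    (1ULL << 22)
#define SKETCH_EVENTSEL_GUESTONLY (1ULL << 40)
#define SKETCH_EVENTSEL_HOSTONLY  (1ULL << 41)

/* Per-cpu mask as set by x86_pmu_enable_virt() / x86_pmu_disable_virt(). */
static inline uint64_t sketch_virt_mask(bool svm_enabled)
{
    return svm_enabled ? 0 : SKETCH_EVENTSEL_HOSTONLY;
}

/* Same expression as the patched __x86_pmu_enable_event() writes. */
static inline uint64_t sketch_ctrl_value(uint64_t config, uint64_t enable_mask,
                                         uint64_t virt_mask)
{
    return (config | enable_mask) & ~virt_mask;
}

/* Reported behaviour with SVM off: any GO or HO bit stops the counter. */
static inline bool sketch_counts_with_svm_off(uint64_t ctrl)
{
    if (!(ctrl & SKETCH_EVENTSEL_ENABLE))
        return false;
    return !(ctrl & (SKETCH_EVENTSEL_GUESTONLY | SKETCH_EVENTSEL_HOSTONLY));
}

/* Walk through the two cases the commit message distinguishes. */
static inline void sketch_demo(void)
{
    uint64_t host_only  = 0x76 | SKETCH_EVENTSEL_HOSTONLY;
    uint64_t guest_only = 0x76 | SKETCH_EVENTSEL_GUESTONLY;
    uint64_t mask = sketch_virt_mask(false);   /* SVM disabled */

    /* Unmasked, a host-only event would silently not count. */
    assert(!sketch_counts_with_svm_off(host_only | SKETCH_EVENTSEL_ENABLE));
    /* With the HO bit masked out, it counts. */
    assert(sketch_counts_with_svm_off(
               sketch_ctrl_value(host_only, SKETCH_EVENTSEL_ENABLE, mask)));
    /* Guest-only events keep the GO bit and stay silent, as intended. */
    assert(!sketch_counts_with_svm_off(
               sketch_ctrl_value(guest_only, SKETCH_EVENTSEL_ENABLE, mask)));
}
```

Keeping the mask per cpu and applying it at write time means the stored event configuration is never modified, so the HO bit reappears in the register as soon as the mask is cleared again.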

Re: [Qemu-devel] KVM call agenda for Tuesday 28th

2012-02-28 Thread Stefan Hajnoczi
On Tue, Feb 28, 2012 at 2:47 PM, Paolo Bonzini  wrote:
> On 28/02/2012 15:39, Stefan Hajnoczi wrote:
>> I'm not a fan of transactions or freeze/thaw (if used to atomically
>> perform other commands).
>>
>> We should not export low-level block device operations so that
>> external software can micromanage via QMP.  I don't think this is a
>> good idea because it takes the block device offline and possibly
>> blocks the VM.  We're reaching a level comparable to an HTTP interface
>> for acquiring pthread mutex, doing some operations, and then another
>> HTTP request to unlock it.  This is micromanagement; it will create
>> more problems because we will have to support lots of little API
>> functions.
>
> So you're for extending Jeff's patches to group mirroring etc.?
>
> That's also my favorite one, assuming we can do it in time for 1.1.

Yes, that's the approach I like the most.  It's relatively clean and
leaves us space to develop -blockdev.

Stefan


Re: [Qemu-devel] KVM call agenda for Tuesday 28th

2012-02-28 Thread Paolo Bonzini
On 28/02/2012 15:39, Stefan Hajnoczi wrote:
> I'm not a fan of transactions or freeze/thaw (if used to atomically
> perform other commands).
> 
> We should not export low-level block device operations so that
> external software can micromanage via QMP.  I don't think this is a
> good idea because it takes the block device offline and possibly
> blocks the VM.  We're reaching a level comparable to an HTTP interface
> for acquiring pthread mutex, doing some operations, and then another
> HTTP request to unlock it.  This is micromanagement; it will create
> more problems because we will have to support lots of little API
> functions.

So you're for extending Jeff's patches to group mirroring etc.?

That's also my favorite one, assuming we can do it in time for 1.1.

Paolo


Re: Reconciling qemu-kvm and qemu's PIT

2012-02-28 Thread Avi Kivity
On 02/28/2012 04:40 PM, Jan Kiszka wrote:
> On 2012-02-28 15:32, Avi Kivity wrote:
> > 
> > VMStateDescription vmstate_pit = {
> >     .name = "i8254",
> >     .version_id = 3,
> >     .minimum_version_id = 2,
> >     .minimum_version_id_old = 1,
> >     .load_state_old = pit_load_old,
> >     .fields      = (VMStateField []) {
> > <<<<<<< HEAD
> >         VMSTATE_UINT32(flags, PITState),
> > ||||||| merged common ancestors
> > =======
> >         VMSTATE_UINT32_V(channels[0].irq_disabled, PITState, 3),
> > >>>>>>> ce967e2f33861b0e17753f97fa4527b5943c94b6
> >         VMSTATE_STRUCT_ARRAY(channels, PITState, 3, 2,
> >                              vmstate_pit_channel, PITChannelState),
> >         VMSTATE_TIMER(channels[0].irq_timer, PITState),
> >         VMSTATE_END_OF_LIST()
> >     }
> > };
> > 
> > I'm guessing that flags and irq_disabled are equivalent, but do they
> > have the same sense (that is, do the "1" values have the same meaning)? 
>
> Yes. qemu-kvm sets flags to 0 or PIT_FLAGS_HPET_LEGACY, which is 1. And
> the latter means "irq_disabled".
>
> > If not, we have a migration problem.
> > 
> > Is it safe to just adopt the new version and drop the old one?
>
> The new upstream code was designed to match qemu-kvm's migration format,
> so you can switch. 

Not entirely unexpected, but I wanted to make sure.

> Of course, the result needs a careful check.
>

Thanks!

-- 
error compiling committee.c: too many arguments to function



Re: Reconciling qemu-kvm and qemu's PIT

2012-02-28 Thread Jan Kiszka
On 2012-02-28 15:32, Avi Kivity wrote:
> 
> VMStateDescription vmstate_pit = {
>     .name = "i8254",
>     .version_id = 3,
>     .minimum_version_id = 2,
>     .minimum_version_id_old = 1,
>     .load_state_old = pit_load_old,
>     .fields      = (VMStateField []) {
> <<<<<<< HEAD
>         VMSTATE_UINT32(flags, PITState),
> ||||||| merged common ancestors
> =======
>         VMSTATE_UINT32_V(channels[0].irq_disabled, PITState, 3),
> >>>>>>> ce967e2f33861b0e17753f97fa4527b5943c94b6
>         VMSTATE_STRUCT_ARRAY(channels, PITState, 3, 2,
>                              vmstate_pit_channel, PITChannelState),
>         VMSTATE_TIMER(channels[0].irq_timer, PITState),
>         VMSTATE_END_OF_LIST()
>     }
> };
> 
> I'm guessing that flags and irq_disabled are equivalent, but do they
> have the same sense (that is, do the "1" values have the same meaning)? 

Yes. qemu-kvm sets flags to 0 or PIT_FLAGS_HPET_LEGACY, which is 1. And
the latter means "irq_disabled".

> If not, we have a migration problem.
> 
> Is it safe to just adopt the new version and drop the old one?

The new upstream code was designed to match qemu-kvm's migration format,
so you can switch. Of course, the result needs a careful check.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] KVM call agenda for Tuesday 28th

2012-02-28 Thread Stefan Hajnoczi
On Mon, Feb 27, 2012 at 10:06 PM, Anthony Liguori  wrote:
> On 02/27/2012 03:58 PM, Paolo Bonzini wrote:
>>
>> On 27/02/2012 18:21, Eric Blake wrote:
>
> Please send in any agenda items you are interested in covering.
>>>
>>> Given all the threads on snapshot/mirror/migrate/reopen in the blockdev
>>> layer, that sounds like a worthwhile topic to discuss on a phone call.
>>
>>
>> I put a description of the existing proposals here:
>>
>> http://wiki.qemu.org/Features/SnapshotsMultipleDevices/CommandSetProposals
>
>
> Thanks!  One thing I'm having trouble following on your proposal: What
> commands are valid within
> blockdev-start-transaction/blockdev-commit-transaction?
>
> If I do:
>
> blockdev-start-transaction
> stop
> drive-reopen
> drive-mirror
> blockdev-end-transaction
>
> What state should I expect that my guest is in (paused or running)?

I'm not a fan of transactions or freeze/thaw (if used to atomically
perform other commands).

We should not export low-level block device operations so that
external software can micromanage via QMP.  I don't think this is a
good idea because it takes the block device offline and possibly
blocks the VM.  We're reaching a level comparable to an HTTP interface
for acquiring pthread mutex, doing some operations, and then another
HTTP request to unlock it.  This is micromanagement; it will create
more problems because we will have to support lots of little API
functions.

I think we're only exposing low level operations because:
1. We haven't designed a block model that works.
2. Therefore, upper layers of the management stack have felt forced to
implement these operations on our behalf.  They want a micromanagement
interface in order to do that.

What we should really do is design the block device model for QEMU:

* What responsibilities does QEMU have for handling image files?  We
seem to go back and forth between file descriptor passing for security
and reopening images while QEMU is running.

* What user-visible operations does it need to support (snapshotting
groups of images, eject/insert media, hotplug disk, etc)?

We can look at existing hypervisors and virtualization APIs as inspiration.

Let's provide high-level commands via QMP and let's do it with -blockdev.

Or if we decide that QEMU shouldn't be in the business of doing these
operations then we need to radically simplify to a model that just
passes file descriptors and freezes/thaws I/O but doesn't do any of
the high-level operations at all.  Right now we have a half-way house
and adding more snapshot/transaction APIs isn't the answer.

Stefan


Reconciling qemu-kvm and qemu's PIT

2012-02-28 Thread Avi Kivity

VMStateDescription vmstate_pit = {
    .name = "i8254",
    .version_id = 3,
    .minimum_version_id = 2,
    .minimum_version_id_old = 1,
    .load_state_old = pit_load_old,
    .fields      = (VMStateField []) {
<<<<<<< HEAD
        VMSTATE_UINT32(flags, PITState),
||||||| merged common ancestors
=======
        VMSTATE_UINT32_V(channels[0].irq_disabled, PITState, 3),
>>>>>>> ce967e2f33861b0e17753f97fa4527b5943c94b6
        VMSTATE_STRUCT_ARRAY(channels, PITState, 3, 2,
                             vmstate_pit_channel, PITChannelState),
        VMSTATE_TIMER(channels[0].irq_timer, PITState),
        VMSTATE_END_OF_LIST()
    }
};

I'm guessing that flags and irq_disabled are equivalent, but do they
have the same sense (that is, do the "1" values have the same meaning)? 
If not, we have a migration problem.

Is it safe to just adopt the new version and drop the old one?

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] vhost: don't forget to schedule()

2012-02-28 Thread Avi Kivity
On 02/28/2012 04:00 PM, Nadav Har'El wrote:
> On Tue, Feb 28, 2012, Avi Kivity wrote about "Re: [PATCH] vhost: don't forget 
> to schedule()":
> > > + if (need_resched())
> > > + schedule();
> > 
> > This is cond_resched(), no?
>
> It indeed looks similar, but it appears there are some slightly
> different things happening in both cases, especially for a preemptive
> kernel... Unfortunately, I am not astute (or experienced) enough to tell
> which of the two idioms is better or more appropriate for this case.

I'd have expected that cond_resched() is a no-op with preemptible
kernels, but I see this is not the case.

>
> The idiom that I used seemed right, and seemed to work in my tests.
> Moreover I also noticed it was used in vmx.c. Also, vhost.c was already
> calling schedule(), not cond_resched(), so I thought it made sense to
> call the same thing...
>
> But I now see that in kvm_main.c, there's also this:
>
> 	if (!need_resched())
> 		return;
> 	cond_resched();
>
> Which seems to combine both idioms ;-) Can anybody shed light on
> the right way to do it?
>

It's bogus.  Look at commit 3fca03653010:

Author: Yaozu Dong
Date:   Wed Apr 25 16:49:19 2007 +0300

    KVM: VMX: Avoid unnecessary vcpu_load()/vcpu_put() cycles

    By checking if a reschedule is needed, we avoid dropping the vcpu.

    [With changes by me, based on Anthony Liguori's observations]

    Signed-off-by: Avi Kivity

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 03c0ee7..f535635 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1590,6 +1590,8 @@ static int set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 
 void kvm_resched(struct kvm_vcpu *vcpu)
 {
+	if (!need_resched())
+		return;
 	vcpu_put(vcpu);
 	cond_resched();
 	vcpu_load(vcpu);

At that time, it made sense to do the extra check to avoid the expensive
vcpu_put/vcpu_load.  Later, preempt notifiers made those calls redundant
(15ad71460d75) and they were removed, but the extra check remained.
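
To make the redundancy concrete, here is a small userspace model; it is a
sketch, not kernel code.  The stubs need_resched(), schedule() and
cond_resched() below are hypothetical stand-ins that only mimic the control
flow discussed in this thread (a flag check followed by a yield); the real
kernel primitives do more, particularly on preemptible kernels.  The helper
count_yields() is likewise invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the kernel primitives. */
static bool resched_flag;     /* models a pending reschedule request */
static int schedule_calls;    /* counts how often we actually yield */

static bool need_resched(void)
{
    return resched_flag;
}

static void schedule(void)
{
    schedule_calls++;
    resched_flag = false;     /* yielding clears the pending request */
}

/* Simplified model: yield only if a reschedule is pending. */
static int cond_resched(void)
{
    if (need_resched()) {
        schedule();
        return 1;
    }
    return 0;
}

/* kvm_resched() with the leftover check still in place. */
void kvm_resched_with_check(void)
{
    if (!need_resched())
        return;
    cond_resched();
}

/* The same function with the leftover check dropped. */
void kvm_resched_plain(void)
{
    cond_resched();
}

/* Run `iters` iterations; a "timer tick" raises the flag every `period`. */
int count_yields(void (*resched)(void), int iters, int period)
{
    assert(period > 0);
    schedule_calls = 0;
    resched_flag = false;
    for (int i = 1; i <= iters; i++) {
        if (i % period == 0)
            resched_flag = true;
        resched();
    }
    return schedule_calls;
}
```

In this model both variants yield exactly as often for the same tick
pattern, which is the sense in which the extra check is redundant once
vcpu_put()/vcpu_load() are gone.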

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] vhost: don't forget to schedule()

2012-02-28 Thread Nadav Har'El
On Tue, Feb 28, 2012, Avi Kivity wrote about "Re: [PATCH] vhost: don't forget 
to schedule()":
> > +   if (need_resched())
> > +   schedule();
> 
> This is cond_resched(), no?

It indeed looks similar, but it appears there are some slightly
different things happening in both cases, especially for a preemptive
kernel... Unfortunately, I am not astute (or experienced) enough to tell
which of the two idioms is better or more appropriate for this case.

The idiom that I used seemed right, and seemed to work in my tests.
Moreover I also noticed it was used in vmx.c. Also, vhost.c was already
calling schedule(), not cond_resched(), so I thought it made sense to
call the same thing...

But I now see that in kvm_main.c, there's also this:

	if (!need_resched())
		return;
	cond_resched();

Which seems to combine both idioms ;-) Can anybody shed light on
the right way to do it?

-- 
Nadav Har'El|   Tuesday, Feb 28 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I am logged in, therefore I am.
http://nadav.harel.org.il   |


Re: linux guests and ksm performance

2012-02-28 Thread Avi Kivity
On 02/28/2012 03:20 PM, Peter Lieven wrote:
> On 28.02.2012 14:16, Avi Kivity wrote:
>> On 02/24/2012 08:41 AM, Stefan Hajnoczi wrote:
>>>> I don't think that it is cpu intense. All user pages are zeroed
>>>> anyway, but at allocation time it shouldn't be a big difference in
>>>> terms of cpu power.
>>> It's easy to find a scenario where eagerly zeroing pages is wasteful.
>>> Imagine a process that uses all of physical memory.  Once it
>>> terminates the system is going to run processes that only use a small
>>> set of pages.  It's pointless zeroing all those pages if we're not
>>> going to use them anymore.
>> In the long term, we will use them, except if the guest is completely
>> idle.
>>
>> The scenario in which zeroing is expensive is when the page is refilled
>> through DMA.  In that case the zeroing was wasted.  This is a pretty
>> common scenario in pagecache intensive workloads.
>>
> Avi, what do you think of the proposal to give the guest VM a hint
> that the host is running KSM? In that case the administrator
> has already decided that saving physical memory is more important
> to him than performance.

It makes some sense.  Perhaps through the balloon device, a flag that
indicates that voluntary ballooning will be gratefully accepted.

-- 
error compiling committee.c: too many arguments to function



[Bug 42829] KVM Guest with virtio network driver loses network connectivity

2012-02-28 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42829





--- Comment #1 from Steve   2012-02-28 13:49:24 ---
These sites also describe similar problems:

https://bugzilla.redhat.com/show_bug.cgi?id=520119

http://www.mail-archive.com/scientific-linux-users@listserv.fnal.gov/msg10661.html

Let me know how I can help solve this issue.

Thank you for your time.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


RE: Hosts crashed

2012-02-28 Thread Martin Maurer
You are using old kernels on our Proxmox VE 1.9 - upgrade to the latest stable kernel, see
http://forum.proxmox.com/threads/8399-New-2-6-32-Kernel-for-Proxmox-VE-1-9-stable
If you still have issues, post in our Proxmox forum.

Martin

> -Original Message-
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
> Behalf Of Germain Maurice
> Sent: Dienstag, 28. Februar 2012 12:51
> To: kvm@vger.kernel.org
> Subject: Hosts crashed
> 
> Hello everybody,
> 
> I don't know if someone has already seen my request...
> 
> My KVM hosts have been crashing. They are in a 6-node Proxmox VE (1.9)
> cluster; the VM disks are LVs accessed through AoE, and I use Coraid SAN
> storage (all the devices inside the guests are virtio: /dev/vdX, network).
> 
> One of them (Node3) differed from the other servers (Node2 and Node6) in
> that it was running the 2.6.32-47 kernel (ksm module loaded but inactive),
> while the others were running the 2.6.32-33 kernel (ksm module not loaded).
> On Node3, I switched back to the 2.6.32-33 kernel, and things seemed to be
> better.
> 
> However, a few weeks ago, two hosts of the same cluster running the
> 2.6.32-33 kernel crashed too (Node2 and Node6).
> 
> Pictures of the stack traces I took are available at
> http://imgur.com/a/FTfuM .
> 
> Does anyone have an idea of what could have happened?
> Answers and questions are all welcome.
> 
> Thanks in advance.
> Germain




Re: linux guests and ksm performance

2012-02-28 Thread Peter Lieven

On 28.02.2012 14:16, Avi Kivity wrote:
> On 02/24/2012 08:41 AM, Stefan Hajnoczi wrote:
>>> I don't think that it is cpu intense. All user pages are zeroed
>>> anyway, but at allocation time it shouldn't be a big difference in
>>> terms of cpu power.
>> It's easy to find a scenario where eagerly zeroing pages is wasteful.
>> Imagine a process that uses all of physical memory.  Once it
>> terminates the system is going to run processes that only use a small
>> set of pages.  It's pointless zeroing all those pages if we're not
>> going to use them anymore.
> In the long term, we will use them, except if the guest is completely
> idle.
>
> The scenario in which zeroing is expensive is when the page is refilled
> through DMA.  In that case the zeroing was wasted.  This is a pretty
> common scenario in pagecache intensive workloads.

Avi, what do you think of the proposal to give the guest VM a hint
that the host is running KSM? In that case the administrator
has already decided that saving physical memory is more important
to him than performance.

Peter


[PATCH v4] KVM: Allow host IRQ sharing for assigned PCI 2.3 devices

2012-02-28 Thread Jan Kiszka
PCI 2.3 allows IRQ sources to be disabled generically at the device level.
This enables us to share legacy IRQs of such devices with other host devices
when passing them to a guest.

The new IRQ sharing feature introduced here is optional; user space has
to request it explicitly. Moreover, user space can inform us about its
view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the
interrupt and signaling it if the guest masked it via the virtualized
PCI config space.

Signed-off-by: Jan Kiszka 
---

Changes in v4:
 - Integrated doc changes as proposed by Alex
 - Fixed deassign_host_irq /wrt MSI
 - Fixed kvm_vm_ioctl_set_pci_irq_mask /wrt INTx unmasking of non-2.3
   devices

 Documentation/virtual/kvm/api.txt |   41 +++
 arch/x86/kvm/x86.c                |    1 +
 include/linux/kvm.h               |    6 +
 include/linux/kvm_host.h          |    2 +
 virt/kvm/assigned-dev.c           |  209 +++-
 5 files changed, 230 insertions(+), 29 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 59a3826..6386f8c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1169,6 +1169,14 @@ following flags are specified:
 
 /* Depends on KVM_CAP_IOMMU */
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
+/* The following two depend on KVM_CAP_PCI_2_3 */
+#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX   (1 << 2)
+
+If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
+via the PCI-2.3-compliant device-level mask, thus enable IRQ sharing with other
+assigned devices or host devices. KVM_DEV_ASSIGN_MASK_INTX specifies the
+guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
 
 The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
 isolation of the device.  Usages not specifying this flag are deprecated.
@@ -1441,6 +1449,39 @@ The "num_dirty" field is a performance hint for KVM to determine whether it
 should skip processing the bitmap and just invalidate everything.  It must
 be set to the number of set bits in the bitmap.
 
+4.60 KVM_ASSIGN_SET_INTX_MASK
+
+Capability: KVM_CAP_PCI_2_3
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_pci_dev (in)
+Returns: 0 on success, -1 on error
+
+Allows userspace to mask PCI INTx interrupts from the assigned device.  The
+kernel will not deliver INTx interrupts to the guest between setting and
+clearing of KVM_ASSIGN_SET_INTX_MASK via this interface.  This enables use of
+and emulation of PCI 2.3 INTx disable command register behavior.
+
+This may be used for both PCI 2.3 devices supporting INTx disable natively and
+older devices lacking this support. Userspace is responsible for emulating the
+read value of the INTx disable bit in the guest visible PCI command register.
+When modifying the INTx disable state, userspace should precede updating the
+physical device command register by calling this ioctl to inform the kernel of
+the new intended INTx mask state.
+
+Note that the kernel uses the device INTx disable bit to internally manage the
+device interrupt state for PCI 2.3 devices.  Reads of this register may
+therefore not match the expected value.  Writes should always use the guest
+intended INTx disable value rather than attempting to read-copy-update the
+current physical device state.  Races between user and kernel updates to the
+INTx disable bit are handled lazily in the kernel.  It's possible the device
+may generate unintended interrupts, but they will not be injected into the
+guest.
+
+See KVM_ASSIGN_DEV_IRQ for the data structure.  The target device is specified
+by assigned_dev_id.  In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
+evaluated.
+
 4.62 KVM_CREATE_SPAPR_TCE
 
 Capability: KVM_CAP_SPAPR_TCE
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c9d99e5..4e2088a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2143,6 +2143,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_XSAVE:
 	case KVM_CAP_ASYNC_PF:
 	case KVM_CAP_GET_TSC_KHZ:
+	case KVM_CAP_PCI_2_3:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index acbe429..6c322a9 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -588,6 +588,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_TSC_DEADLINE_TIMER 72
 #define KVM_CAP_S390_UCONTROL 73
 #define KVM_CAP_SYNC_REGS 74
+#define KVM_CAP_PCI_2_3 75
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -784,6 +785,9 @@ struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_TSC_CONTROL */
 #define KVM_SET_TSC_KHZ   _IO(KVMIO,  0xa2)
 #define KVM_GET_TSC_KHZ   _IO(KVMIO,  0xa3)
+/* Available with KVM_CAP_PCI_2_3 */
+#define KVM_ASSIGN_SET_INTX_MASK  _IOW(KVMIO,  0xa4, \
+  struct kvm_assigned_pci_dev)
 
 /*
  * ioctls for vcpu fds
@
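
The api.txt text in this patch asks userspace to emulate the guest-visible
INTx disable bit and to base writes on the guest's intended value rather
than on a read-modify-write of the physical register.  A minimal sketch of
that bookkeeping follows.  The 0x400 value is the standard INTx disable bit
(bit 10) of the PCI command register; struct intx_shadow and both helper
functions are hypothetical names invented for this illustration and are not
part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 10 of the PCI command register: INTx disable (PCI 2.3). */
#define PCI_COMMAND_INTX_DISABLE 0x400u

/* Hypothetical per-device state kept by a userspace device model. */
struct intx_shadow {
    bool guest_intx_disabled;   /* the guest's view of the INTx mask */
};

/*
 * Value the guest should see when it reads the command register: the
 * physical INTx disable bit is hidden (the kernel may be toggling it)
 * and replaced by the guest's own view.
 */
uint16_t guest_read_command(const struct intx_shadow *s, uint16_t phys_cmd)
{
    uint16_t v = phys_cmd & (uint16_t)~PCI_COMMAND_INTX_DISABLE;

    if (s->guest_intx_disabled)
        v |= PCI_COMMAND_INTX_DISABLE;
    return v;
}

/*
 * Record a guest write.  Returns true when the mask state changed, i.e.
 * when the caller would inform the kernel (KVM_ASSIGN_SET_INTX_MASK)
 * before updating the physical command register.
 */
bool guest_write_command(struct intx_shadow *s, uint16_t guest_cmd)
{
    bool disabled = (guest_cmd & PCI_COMMAND_INTX_DISABLE) != 0;
    bool changed = disabled != s->guest_intx_disabled;

    s->guest_intx_disabled = disabled;
    return changed;
}
```

The actual ioctl call is left out on purpose, since it needs an assigned
device and a VM file descriptor; only the bit handling is shown.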

Re: linux guests and ksm performance

2012-02-28 Thread Avi Kivity
On 02/24/2012 08:41 AM, Stefan Hajnoczi wrote:
> >
> > I dont think that it is cpu intense. All user pages are zeroed anyway, but 
> > at allocation time it shouldnt be a big difference in terms of cpu power.
>
> It's easy to find a scenario where eagerly zeroing pages is wasteful.
> Imagine a process that uses all of physical memory.  Once it
> terminates the system is going to run processes that only use a small
> set of pages.  It's pointless zeroing all those pages if we're not
> going to use them anymore.

In the long term, we will use them, except if the guest is completely idle.

The scenario in which zeroing is expensive is when the page is refilled
through DMA.  In that case the zeroing was wasted.  This is a pretty
common scenario in pagecache intensive workloads.

-- 
error compiling committee.c: too many arguments to function



Re: linux guests and ksm performance

2012-02-28 Thread Avi Kivity
On 02/23/2012 06:42 PM, Stefan Hajnoczi wrote:
> On Thu, Feb 23, 2012 at 3:40 PM, Peter Lieven  wrote:
> > However, in a virtual machine I have not observed the above slow down to
> > that extent, while the benefit of zero after free in a virtualisation
> > environment is obvious:
> >
> > 1) zero pages can easily be merged by ksm or other technique.
> > 2) zero (dup) pages are a lot faster to transfer in case of migration.
>
> The other approach is a memory page "discard" mechanism - which
> obviously requires more code changes than zeroing freed pages.
>
> The advantage is that we don't take the brute-force and CPU intensive
> approach of zeroing pages.  It would be like a fine-grained ballooning
> feature.
>
> I hope someone will follow up saying this has already been done or
> prototyped :).

It already exists - that's the balloon code.  Right now it's host
driven, but maybe we can modify it to allow the guest to initiate
balloon inflations.
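
Both benefits quoted above (merging by KSM, cheaper migration) depend on a
freed page actually containing only zero bytes.  The sketch below shows the
two halves of that idea in plain C.  The names zero_after_free() and
page_is_zero() and the 4 KiB page size are assumptions made for the
example; the real guest and host code paths are considerably more
elaborate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE_ASSUMED 4096u   /* assumption: 4 KiB pages */

/* Guest side: scrub a page when it is freed. */
void zero_after_free(void *page)
{
    memset(page, 0, PAGE_SIZE_ASSUMED);
}

/* Host side: detect a page that consists only of zero bytes. */
bool page_is_zero(const void *page)
{
    const unsigned char *p = page;

    for (size_t i = 0; i < PAGE_SIZE_ASSUMED; i += sizeof(uint64_t)) {
        uint64_t word;

        /* memcpy avoids any alignment assumption about `page`. */
        memcpy(&word, p + i, sizeof(word));
        if (word != 0)
            return false;
    }
    return true;
}
```

The memset() in zero_after_free() is the CPU cost debated in this thread:
it is paid on every free, whether or not the page is ever merged or
migrated.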

-- 
error compiling committee.c: too many arguments to function



[Bug 42829] KVM Guest with virtio network driver loses network connectivity

2012-02-28 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42829


Steve  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ru...@rustcorp.com.au
     Kernel Version|3.3.0-rc5                   |3.2






Re: [PATCH] vhost: don't forget to schedule()

2012-02-28 Thread Avi Kivity
On 02/27/2012 03:07 PM, Nadav Har'El wrote:
> This is a tiny, but important, patch to vhost.
>
> Vhost's worker thread only called schedule() when it had no work to do, and
> it wanted to go to sleep. But if there's always work to do, e.g., the guest
> is running a network-intensive program like netperf with small message sizes,
> schedule() was *never* called. This had several negative implications (on
> non-preemptive kernels):
>
>  1. Passing time was not properly accounted to the "vhost" process (ps and
> top would wrongly show it using zero CPU time).
>
>  2. Sometimes error messages about RCU timeouts would be printed, if the
> core running the vhost thread didn't schedule() for a very long time.
>
>  3. Worst of all, a vhost thread would "hog" the core. If several vhost
> threads need to share the same core, typically one would get most of the
> CPU time (and its associated guest most of the performance), while the
> others hardly get any work done.
>
> The trivial solution is to add
>
>   if (need_resched())
>   schedule();
>
> after doing every piece of work. This will not do the heavy schedule() all
> the time, just when the timer interrupt decided a reschedule is warranted
> (so need_resched returns true).
>
> Thanks to Abel Gordon for this patch.
>
> Signed-off-by: Nadav Har'El 
> ---
>  vhost.c |2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c14c42b..ae66278 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -222,6 +222,8 @@ static int vhost_worker(void *data)
>   if (work) {
>   __set_current_state(TASK_RUNNING);
>   work->fn(work);
> + if (need_resched())
> + schedule();
>

This is cond_resched(), no?
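
The "hogging" effect described in the quoted patch can be shown with a toy
model of two always-busy workers sharing one core on a non-preemptive
kernel.  This is an illustration only: simulate(), struct sim_result and
the tick period are hypothetical, and the model says nothing about whether
schedule() or cond_resched() is the better call.

```c
#include <assert.h>
#include <stdbool.h>

#define NWORKERS 2

/* Toy model, not kernel code: work items completed by each worker. */
struct sim_result {
    long done[NWORKERS];
};

/*
 * Two workers that always have work share one core.  A "timer tick"
 * requests a reschedule every `tick_period` items.  The running worker
 * gives up the core only if `yield_when_asked` is set, which models the
 * need_resched()/schedule() pair added by the patch.
 */
struct sim_result simulate(long total_items, long tick_period,
                           bool yield_when_asked)
{
    struct sim_result r = { {0} };
    bool need_resched = false;
    int current = 0;

    assert(tick_period > 0);
    for (long t = 1; t <= total_items; t++) {
        r.done[current]++;                   /* work->fn(work) */
        if (t % tick_period == 0)
            need_resched = true;             /* tick asks for a switch */
        if (yield_when_asked && need_resched) {
            need_resched = false;            /* schedule() */
            current = (current + 1) % NWORKERS;
        }
    }
    return r;
}
```

Without the yield, the first worker in this model completes every item and
the second none; with it, the items are split evenly between the two.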

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor

2012-02-28 Thread Stefano Stabellini
On Tue, 28 Feb 2012, Ian Campbell wrote:
> On Tue, 2012-02-28 at 10:20 +, Dave Martin wrote:
> > On Mon, Feb 27, 2012 at 07:33:39PM +, Ian Campbell wrote:
> > > On Mon, 2012-02-27 at 18:03 +, Dave Martin wrote:
> > > > > Since we support only ARMv7+ there are "T2" and "T3" encodings
> > > > > available which do allow direct mov of an immediate into R12, but
> > > > > are 32 bit Thumb instructions.
> > > > > 
> > > > > Should we use r7 instead to maximise instruction density for Thumb
> > > > > code?
> > > > 
> > > > The difference seems trivial when put into context, even if you code a
> > > > special Thumb version of the code to maximise density (the Thumb-2 code
> > > > which gets built from assembler in the kernel is very suboptimal in
> > > > size, but there simply isn't a high proportion of asm code in the kernel
> > > > anyway.)  I wouldn't consider the ARM/Thumb differences as an important
> > > > factor when deciding on a register.
> > > 
> > > OK, that's useful information. thanks.
> > > 
> > > > One argument for _not_ using r12 for this purpose is that it is then
> > > > harder to put a generic "HVC" function (analogous to the "syscall"
> > > > syscall) out-of-line, since r12 could get destroyed by the call.
> > > 
> > > For an out of line syscall(2) wouldn't the syscall number either be in a
> > > standard C calling convention argument register or on the stack when the
> > > function was called, since it is just a normal argument at that point?
> > > As you point out it cannot be passed in r12 (and could never be, due to
> > > the clobbering).
> > > 
> > > The syscall function itself would have to move the arguments and syscall
> > > nr etc around before issuing the syscall.
> > > 
> > > I think the same is true of a similar hypercall(2)
> > > 
> > > > If you don't think you will ever care about putting HVC out of line
> > > > though, it may not matter.
> > 
> > If you have both inline and out-of-line hypercalls, it's hard to ensure
> > that you never have to shuffle the registers in either case.
> 
> Agreed.
> 
> I think we want to optimise for the inline case since those are the
> majority.

They are not just the majority, all of them are static inline at the
moment, even on x86 (where the number of hypercalls is much higher).

So yes, we should optimize for the inline case.


> The only non-inline case is the special "privcmd ioctl" which is the
> mechanism that allows the Xen toolstack to make hypercalls. It's
> somewhat akin to syscall(2). By the time you get to it you will already
> have done a system call for the ioctl, pulled the arguments from the
> ioctl argument structure etc, plus such hypercalls are not really
> performance critical.

Even the privcmd hypercall (privcmd_call) is a static inline function,
it is just that at the moment there is only one caller :)


> > Shuffling can be reduced but only at the expense of strange argument
> > ordering in some cases when calling from C -- the complexity is probably
> > not worth it.  Linux doesn't bother for its own syscalls.
> > 
> > Note that even in assembler, a branch from one section to a label in
> > another section may cause r12 to get destroyed, so you will need to be
> > careful about how you code the hypervisor trap handler.  However, this
> > is not different from coding exception handlers in general, so I don't
> > know that it constitutes a conclusive argument on its own.
> 
> We are happy to arrange that this doesn't occur on our trap entry paths,
> at least until the guest register state has been saved. Currently the
> hypercall dispatcher is in C and gets r12 from the on-stack saved state.
> We will likely eventually optimise the hypercall path directly in ASM
> and in that case we are happy to take steps to ensure we don't clobber
> r12 before we need it.

Yes, I don't think this should be an issue.


> > My instinctive preference would therefore be for r7 (which also seems to
> > be good enough for Linux syscalls) -- but it really depends how many
> > arguments you expect to need to support.
> 
> Apparently r7 is the frame pointer for gcc in thumb mode which I think
> is a good reason to avoid it.
> 
> We currently have some 5 argument hypercalls and there have been
> occasional suggestions for interfaces which use 6 -- although none of
> them have come to reality.
 
I don't have a very strong opinion on which register we should use, but
I would like to avoid r7 if it is already actively used by gcc.

The fact that r12 can be destroyed so easily is actually a good argument
for using it because it means it is less likely to contain useful data
that needs to be saved/restored by gcc.


Re: [Qemu-devel] linux guests and ksm performance

2012-02-28 Thread Peter Lieven

On 28.02.2012 13:05, Stefan Hajnoczi wrote:
> On Tue, Feb 28, 2012 at 11:46 AM, Peter Lieven wrote:
>> On 24.02.2012 08:23, Stefan Hajnoczi wrote:
>>> On Fri, Feb 24, 2012 at 6:53 AM, Stefan Hajnoczi wrote:
>>>> On Fri, Feb 24, 2012 at 6:41 AM, Stefan Hajnoczi wrote:
>>>>> On Thu, Feb 23, 2012 at 7:08 PM, peter.lie...@gmail.com wrote:
>>>>>> Stefan Hajnoczi wrote:
>>>>>>> On Thu, Feb 23, 2012 at 3:40 PM, Peter Lieven wrote:
>>>>>>>> However, in a virtual machine I have not observed the above slow
>>>>>>>> down to that extent, while the benefit of zero after free in a
>>>>>>>> virtualisation environment is obvious:
>>>>>>>>
>>>>>>>> 1) zero pages can easily be merged by ksm or other technique.
>>>>>>>> 2) zero (dup) pages are a lot faster to transfer in case of
>>>>>>>> migration.
>>>>>>>
>>>>>>> The other approach is a memory page "discard" mechanism - which
>>>>>>> obviously requires more code changes than zeroing freed pages.
>>>>>>>
>>>>>>> The advantage is that we don't take the brute-force and CPU intensive
>>>>>>> approach of zeroing pages.  It would be like a fine-grained ballooning
>>>>>>> feature.
>>>>>>
>>>>>> I don't think that it is cpu intense. All user pages are zeroed anyway,
>>>>>> but at allocation time it shouldn't be a big difference in terms of cpu
>>>>>> power.
>>>>>
>>>>> It's easy to find a scenario where eagerly zeroing pages is wasteful.
>>>>> Imagine a process that uses all of physical memory.  Once it
>>>>> terminates the system is going to run processes that only use a small
>>>>> set of pages.  It's pointless zeroing all those pages if we're not
>>>>> going to use them anymore.
>>>>
>>>> Perhaps the middle path is to zero pages but do it after a grace
>>>> timeout.  I wonder if this helps eliminate the 2-3% slowdown you
>>>> noticed when compiling.
>>>
>>> Gah, it's too early in the morning.  I don't think this timer actually
>>> makes sense.
>>
>> Do you think it then makes sense to make a patchset/proposal to notify a
>> guest kernel about the presence of KSM in the host and switch to zero
>> after free?
>
> I think your idea is interesting - whether or not people are happy
> with it will depend on the performance impact.  It seems reasonable to
> me.

Could you support/help me in implementing and publishing this approach?

Peter


Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor

2012-02-28 Thread Stefano Stabellini
On Tue, 28 Feb 2012, Dave Martin wrote:
> > Given that Stefano is proposing to make the ISS a (per-hypervisor)
> > constant we could consider just defining the Thumb and non-Thumb
> > constants instead of doing all the construction with the __HVC_IMM stuff
> > -- that would remove a big bit of the macroization.
> 
> It's not quite as simple as that -- emitting instructions using data
> directives is not endianness safe, and even in the cases where .long gives
> the right result for ARM, it gives the wrong result for 32-bit Thumb
> instructions if the opcode is given in human-readable order.
> 
> I was trying to solve the same problem for the kvm guys with some global
> macros -- I'm aiming to get a patch posted soon, so I'll make sure
> you're on CC.
 
That would be great, thanks!
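
The halfword-ordering problem Dave describes can be shown with a few lines
of C.  The sketch assumes a little-endian target and an opcode written in
"human-readable" order, first halfword in the upper 16 bits.  The two
function names are hypothetical, and 0x11223344 in the note below is an
arbitrary bit pattern, not a real instruction.

```c
#include <stdint.h>

/* Byte layout produced by a little-endian `.long value` data directive. */
void emit_as_long_le(uint32_t value, uint8_t out[4])
{
    out[0] = (uint8_t)(value);
    out[1] = (uint8_t)(value >> 8);
    out[2] = (uint8_t)(value >> 16);
    out[3] = (uint8_t)(value >> 24);
}

/*
 * Byte layout of a 32-bit Thumb instruction in little-endian memory:
 * the first halfword comes first, and each halfword is little-endian.
 */
void emit_as_thumb32_le(uint32_t opcode, uint8_t out[4])
{
    uint16_t hw1 = (uint16_t)(opcode >> 16);   /* leading halfword */
    uint16_t hw2 = (uint16_t)(opcode);         /* trailing halfword */

    out[0] = (uint8_t)(hw1);
    out[1] = (uint8_t)(hw1 >> 8);
    out[2] = (uint8_t)(hw2);
    out[3] = (uint8_t)(hw2 >> 8);
}
```

For 0x11223344 the first function yields the bytes 44 33 22 11 and the
second 22 11 44 33, so a plain .long places the two halfwords in the wrong
order for Thumb.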


Re: Hosts crashed

2012-02-28 Thread Stefan Hajnoczi
On Tue, Feb 28, 2012 at 11:50 AM, Germain Maurice
 wrote:
> I don't know if someone has already seen my request…

If you're not getting responses it could be because you are using
Proxmox VE, so the kernel and qemu-kvm might have patches.  Have you
asked the Proxmox community/company for support?  They might be better
able to look into this.

Stefan


Re: [Qemu-devel] linux guests and ksm performance

2012-02-28 Thread Stefan Hajnoczi
On Tue, Feb 28, 2012 at 11:46 AM, Peter Lieven  wrote:
> On 24.02.2012 08:23, Stefan Hajnoczi wrote:
>> On Fri, Feb 24, 2012 at 6:53 AM, Stefan Hajnoczi wrote:
>>> On Fri, Feb 24, 2012 at 6:41 AM, Stefan Hajnoczi wrote:
>>>> On Thu, Feb 23, 2012 at 7:08 PM, peter.lie...@gmail.com wrote:
>>>>> Stefan Hajnoczi wrote:
>>>>>> On Thu, Feb 23, 2012 at 3:40 PM, Peter Lieven wrote:
>>>>>>> However, in a virtual machine I have not observed the above slow
>>>>>>> down to that extent, while the benefit of zero after free in a
>>>>>>> virtualisation environment is obvious:
>>>>>>>
>>>>>>> 1) zero pages can easily be merged by ksm or other technique.
>>>>>>> 2) zero (dup) pages are a lot faster to transfer in case of
>>>>>>> migration.
>>>>>>
>>>>>> The other approach is a memory page "discard" mechanism - which
>>>>>> obviously requires more code changes than zeroing freed pages.
>>>>>>
>>>>>> The advantage is that we don't take the brute-force and CPU intensive
>>>>>> approach of zeroing pages.  It would be like a fine-grained ballooning
>>>>>> feature.
>>>>>
>>>>> I don't think that it is cpu intense. All user pages are zeroed anyway,
>>>>> but at allocation time it shouldn't be a big difference in terms of cpu
>>>>> power.
>>>>
>>>> It's easy to find a scenario where eagerly zeroing pages is wasteful.
>>>> Imagine a process that uses all of physical memory.  Once it
>>>> terminates the system is going to run processes that only use a small
>>>> set of pages.  It's pointless zeroing all those pages if we're not
>>>> going to use them anymore.
>>>
>>> Perhaps the middle path is to zero pages but do it after a grace
>>> timeout.  I wonder if this helps eliminate the 2-3% slowdown you
>>> noticed when compiling.
>>
>> Gah, it's too early in the morning.  I don't think this timer actually
>> makes sense.
>
> Do you think it then makes sense to make a patchset/proposal to notify a
> guest kernel about the presence of KSM in the host and switch to zero
> after free?

I think your idea is interesting - whether or not people are happy
with it will depend on the performance impact.  It seems reasonable to
me.

Stefan


Hosts crashed

2012-02-28 Thread Germain Maurice
Hello everybody,

I don't know if someone has already seen my request…

My KVM hosts have been crashing. They are in a 6-node Proxmox VE (1.9)
cluster; the VM disks are LVs accessed through AoE, and I use Coraid SAN
storage (all the devices inside the guests are virtio: /dev/vdX, network).

One of them (Node3) differed from the other servers (Node2 and Node6) in
that it was running the 2.6.32-47 kernel (ksm module loaded but inactive),
while the others were running the 2.6.32-33 kernel (ksm module not loaded).
On Node3, I switched back to the 2.6.32-33 kernel, and things seemed to be
better.

However, a few weeks ago, two hosts of the same cluster running the
2.6.32-33 kernel crashed too (Node2 and Node6).

Pictures of the stack traces I took are available at http://imgur.com/a/FTfuM .

Does anyone have an idea of what could have happened?
Answers and questions are all welcome.

Thanks in advance.
Germain


Re: qemu-kvm-1.0 crashes with threaded vnc server?

2012-02-28 Thread Peter Lieven

On 28.02.2012 09:37, Corentin Chary wrote:

On Mon, Feb 13, 2012 at 10:24 AM, Peter Lieven  wrote:

Am 11.02.2012 um 09:55 schrieb Corentin Chary:


On Thu, Feb 9, 2012 at 7:08 PM, Peter Lieven  wrote:

Hi,

is anyone aware if there are still problems when enabling the threaded vnc
server?
I saw some VMs crashing when using a qemu-kvm build with
--enable-vnc-thread.

qemu-kvm-1.0[22646]: segfault at 0 ip 7fec1ca7ea0b sp 7fec19d056d0
error 6 in libz.so.1.2.3.3[7fec1ca75000+16000]
qemu-kvm-1.0[26056]: segfault at 7f06d8d6e010 ip 7f06e0a30d71 sp
7f06df035748 error 6 in libc-2.11.1.so[7f06e09aa000+17a000]

I had no time to debug further. It seems to happen shortly after migrating,
but that's uncertain. At least the segfault in libz seems to point to VNC,
since I cannot imagine any other part of qemu-kvm using libz except for the
VNC server.

Thanks,
Peter



Hi Peter,
I found two patches in my git tree that I sent long ago but that somehow
got lost on the mailing list. I rebased the tree but have not had the
time (yet) to test them.
http://git.iksaif.net/?p=qemu.git;a=shortlog;h=refs/heads/wip
Feel free to try them. If QEMU segfaults again, please send a full gdb
backtrace / valgrind trace / way to reproduce :).
Thanks,

Hi Corentin,

thanks for rebasing those patches. I remember seeing them the last time
I noticed (about a year ago) that the threaded VNC server was crashing.
I'm on vacation this week, but I will test them next week
and let you know if I can force a crash with them applied. If not, we should
consider including them ASAP.

Hi Peter, any news on that?

Sorry, I had a lot of trouble debugging nasty slow Windows VM problems over
the last 2 weeks.

But it's still on my list. I'll keep you all posted.

peter








Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-28 Thread Avi Kivity
On 02/23/2012 01:35 PM, Takuya Yoshikawa wrote:
> We have seen some problems of the current implementation of
> get_dirty_log() which uses synchronize_srcu_expedited() for updating
> dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms
> order of latency when we use VGA displays.
>
> Furthermore the recent discussion on the following thread
> "srcu: Implement call_srcu()"
> http://lkml.org/lkml/2012/1/31/211
> also motivated us to implement get_dirty_log() without SRCU.
>
> This patch achieves this goal without sacrificing the performance of
> both VGA and live migration: in practice the new code is much faster
> than the old one unless we have too many dirty pages.
>
> Implementation:
>
> The key part of the implementation is the use of xchg() operation for
> clearing dirty bits atomically.  Since this allows us to update only
> BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
> until every dirty bit is cleared again for the next call.

What about using cmpxchg16b?  That should reduce locked ops by a factor
of 2 (but note it needs 16 bytes alignment).
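
For reference, the xchg()-based loop described in the quoted text could be
sketched in userspace roughly as follows. This is only an illustration, not
the code from the patch: C11 atomic_exchange() stands in for the kernel's
xchg(), and harvest_dirty_bitmap() with its parameters is a hypothetical
name chosen for the example.

```c
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch): atomically take a snapshot of
 * each word of the dirty bitmap and clear it, so that a bit set
 * concurrently by another thread is either captured now or left for the
 * next call, never lost.  Returns the number of dirty bits harvested.
 */
static size_t harvest_dirty_bitmap(_Atomic unsigned long *dirty,
                                   unsigned long *snapshot, size_t nwords)
{
    size_t total = 0;

    for (size_t i = 0; i < nwords; i++) {
        snapshot[i] = 0;

        /* Skip the locked operation when the word is already clean. */
        if (atomic_load_explicit(&dirty[i], memory_order_relaxed) == 0)
            continue;

        /* One atomic exchange covers one long's worth of pages. */
        snapshot[i] = atomic_exchange(&dirty[i], 0UL);

        /* Count the set bits in the snapshot (clears lowest bit each pass). */
        for (unsigned long w = snapshot[i]; w != 0; w &= w - 1)
            total++;
    }
    return total;
}
```

A 16-byte compare-and-exchange would halve the number of locked operations
in this loop, at the cost of the alignment handling mentioned above.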

>
> Although some people may worry about the problem of using the atomic
> memory instruction many times to the concurrently accessible bitmap,
> it is usually accessed with mmu_lock held and we rarely see concurrent
> accesses: so what we need to care about is the pure xchg() overheads.
>
> Another point to note is that we do not use for_each_set_bit() to check
> which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
> simply use __ffs() and __fls() and pass the range in between the two
> positions found by them to kvm_mmu_write_protect_pt_range().

This seems artificial.

> Even though the passed range may include clean pages, it is much faster
> than repeatedly calling find_next_bit() due to the locality of dirty pages.

Perhaps this is due to the implementation of find_next_bit()?  Would
using bsf improve things?
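
To make the trade-off concrete, here is a rough userspace sketch of the
first-to-last-bit range from the quoted description. It is an assumption-laden
illustration: the GCC/Clang builtins __builtin_ctzl() and __builtin_clzl()
stand in for the kernel's __ffs() and __fls(), and dirty_span() and
count_dirty() are hypothetical names, not functions from the patch.

```c
#include <limits.h>

#define BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Inclusive range of bit positions, from lowest to highest set bit. */
struct dirty_range {
    int first;
    int last;
};

/*
 * Hypothetical helper: return the span between the lowest and highest set
 * bits of one bitmap word.  The word must be non-zero, because the
 * builtins are undefined for a zero argument.
 */
static struct dirty_range dirty_span(unsigned long word)
{
    struct dirty_range r;

    r.first = __builtin_ctzl(word);                     /* like __ffs() */
    r.last = BITS_PER_LONG - 1 - __builtin_clzl(word);  /* like __fls() */
    return r;
}

/* Number of pages that are actually dirty in the word. */
static int count_dirty(unsigned long word)
{
    int n = 0;

    for (; word != 0; word &= word - 1)
        n++;
    return n;
}
```

For a word such as 0x101 the span covers bits 0 to 8, that is nine pages,
although only two are dirty; the range approach only pays off when dirty
pages are clustered.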

> Performance:
>
> The dirty-log-perf unit test showed nice improvement, some times faster
> than before, when the number of dirty pages was below 8K.  For other
> cases we saw a bit of regression, but it was still fast enough compared
> to the processing of these dirty pages in userspace.
>
> For real workloads, both VGA and live migration, we have observed pure
> improvement: when the guest was reading a file, we originally saw a few
> ms of latency, but with the new method the latency was 50us to 300us.
>
>  
>  /**
> - * write_protect_slot - write protect a slot for dirty logging
> - * @kvm: the kvm instance
> - * @memslot: the slot we protect
> - * @dirty_bitmap: the bitmap indicating which pages are dirty
> - * @nr_dirty_pages: the number of dirty pages
> + * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
> + * @kvm: kvm instance
> + * @log: slot id and address to which we copy the log
>   *
> - * We have two ways to find all sptes to protect:
> - * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
> - *checks ones that have a spte mapping a page in the slot.
> - * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
> + * We need to keep it in mind that VCPU threads can write to the bitmap
> + * concurrently.  So, to avoid losing data, we keep the following order for
> + * each bit:
>   *
> - * Generally speaking, if there are not so many dirty pages compared to the
> - * number of shadow pages, we should use the latter.
> + *   1. Take a snapshot of the bit and clear it if needed.
> + *   2. Write protect the corresponding page.
> + *   3. Flush TLB's if needed.
> + *   4. Copy the snapshot to the userspace.
>   *
> - * Note that letting others write into a page marked dirty in the old bitmap
> - * by using the remaining tlb entry is not a problem.  That page will become
> - * write protected again when we flush the tlb and then be reported dirty to
> - * the user space by copying the old bitmap.
> + * Between 2 and 3, the guest may write to the page using the remaining TLB
> + * entry.  This is not a problem because the page will be reported dirty at
> + * step 4 using the snapshot taken before and step 3 ensures that successive
> + * writes will be logged for the next call.
>   */
> -static void write_protect_slot(struct kvm *kvm,
> -			       struct kvm_memory_slot *memslot,
> -			       unsigned long *dirty_bitmap,
> -			       unsigned long nr_dirty_pages)
> -{
> -	spin_lock(&kvm->mmu_lock);
> -
> -	/* Not many dirty pages compared to # of shadow pages. */
> -	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
> -		gfn_t offset;
> -
> -		for_each_set_bit(offset, dirty_bitmap, memslot->npages)
> -			kvm_mmu_write_protect_pt_range(kvm, memslot, offset, offset);
> -
> -		kvm_flush_remote_tlbs(kvm);
> -	} else
> -		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
> -
> 

Re: [Qemu-devel] linux guests and ksm performance

2012-02-28 Thread Peter Lieven

On 24.02.2012 08:23, Stefan Hajnoczi wrote:

On Fri, Feb 24, 2012 at 6:53 AM, Stefan Hajnoczi  wrote:

On Fri, Feb 24, 2012 at 6:41 AM, Stefan Hajnoczi  wrote:

On Thu, Feb 23, 2012 at 7:08 PM, peter.lie...@gmail.com  wrote:

Stefan Hajnoczi  schrieb:


On Thu, Feb 23, 2012 at 3:40 PM, Peter Lieven  wrote:

However, in a virtual machine I have not observed the above slowdown to
that extent, while the benefit of zero after free in a virtualisation
environment is obvious:

1) zero pages can easily be merged by ksm or other technique.
2) zero (dup) pages are a lot faster to transfer in case of migration.
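
Both points rest on zero pages being cheap to recognise. A minimal sketch of
such a check, assuming a 4096-byte page; page_is_zero() and
SKETCH_PAGE_SIZE are hypothetical names for this example and are not the
KSM or QEMU implementation:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096  /* assumed page size for the example */

/*
 * Hypothetical helper: report whether a page contains only zero bytes.
 * If the first byte is zero and every byte equals its successor, then
 * all bytes are zero; the overlapping memcmp() checks the latter.
 */
static bool page_is_zero(const unsigned char *page)
{
    return page[0] == 0 &&
           memcmp(page, page + 1, SKETCH_PAGE_SIZE - 1) == 0;
}
```

A page that passes such a check can be deduplicated against a single shared
zero page, or sent as a short marker instead of its full contents.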

The other approach is a memory page "discard" mechanism - which
obviously requires more code changes than zeroing freed pages.

The advantage is that we don't take the brute-force and CPU intensive
approach of zeroing pages.  It would be like a fine-grained ballooning
feature.


I don't think that it is CPU intensive. All user pages are zeroed anyway,
just at allocation time, so it shouldn't make a big difference in terms of
CPU power.

It's easy to find a scenario where eagerly zeroing pages is wasteful.
Imagine a process that uses all of physical memory.  Once it
terminates the system is going to run processes that only use a small
set of pages.  It's pointless zeroing all those pages if we're not
going to use them anymore.

Perhaps the middle path is to zero pages but do it after a grace
timeout.  I wonder if this helps eliminate the 2-3% slowdown you
noticed when compiling.

Gah, it's too early in the morning.  I don't think this timer actually
makes sense.


Do you think it then makes sense to make a patchset/proposal to notify the
guest kernel about the presence of KSM in the host and switch to zero after
free?

peter



