Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 08:08 -0700, David Evensky wrote:
> Adding in the rest of what ivshmem does shouldn't affect our use, *I
> think*.  I hadn't intended this to do everything that ivshmem does,
> but I can see how that would be useful. It would be cool if it could
> grow into that. 

David,

I've added most of ivshmem on top of your driver (still working on fully
understanding the client-server protocol).

The changes that might affect your use have been simple:
 * The shared memory BAR is now 2 instead of 0.
 * Vendor and device IDs changed.
 * The device now has MSI-X capability in the header and supporting code
to run it.

If these points won't affect your use, I think there shouldn't be any
other issues.
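
For reference, the capability entry looks roughly like this. A minimal
sketch only: the layout follows the PCI spec (capability ID 0x11), while
the table/PBA placement the patch actually uses may differ:

	#include <linux/types.h>

	struct msix_cap {
		u8  cap_id;	/* 0x11 = PCI_CAP_ID_MSIX */
		u8  next;	/* offset of the next capability, 0 if last */
		u16 ctrl;	/* bits 0-10: table size - 1, bit 15: enable */
		u32 table;	/* bits 0-2: BAR of the table, rest: offset */
		u32 pba;	/* bits 0-2: BAR of the PBA, rest: offset */
	} __attribute__((packed));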

-- 

Sasha.



Re: [Qemu-devel] Guest kernel device compatibility auto-detection

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 16:25 +, Decker, Schorschi wrote:
> I would ask two things be done in the design if it goes forward, 1)
> have an explicit way to disable this feature, where the hypervisor
> cannot interact with the guest OS directly in any way if disablement
> is selected.

I doubt that this (or anything similar), if introduced, will even be
enabled by default. It has the potential to break setups that would
otherwise work (that's why the default boot uses the safest
configuration possible).

On Thu, 2011-08-25 at 16:25 +, Decker, Schorschi wrote:
> 2) implement the feature as an agent in the guest OS where the
> hypervisor can only query the guest OS agent, using a standard TCP/IP
> methodology.

I was planning to implement it by probing the image before actually
booting it. This process is completely offline and doesn't require any
interaction with the guest; the guest isn't even running at that point.
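
To sketch the offline probe, assuming the guest kernel was built with
CONFIG_IKCONFIG: the image then embeds its gzipped .config between the
"IKCFG_ST" and "IKCFG_ED" markers (this is what scripts/extract-ikconfig
searches for; a bzImage would have to be decompressed first). A slow but
simple scanner:

	#include <stdio.h>
	#include <string.h>

	/* Byte-by-byte scan of a vmlinux for a marker string. */
	static long find_marker(FILE *f, const char *marker)
	{
		size_t n = strlen(marker);
		char buf[16];
		long off;

		for (off = 0; ; off++) {
			if (fseek(f, off, SEEK_SET) ||
			    fread(buf, 1, n, f) != n)
				return -1;	/* EOF: marker not found */
			if (!memcmp(buf, marker, n))
				return off;
		}
	}

	int main(int argc, char **argv)
	{
		FILE *f = argc > 1 ? fopen(argv[1], "rb") : NULL;

		if (!f)
			return 1;
		/* The gzipped config starts right after the 8-byte marker. */
		printf("config at offset %ld\n", find_marker(f, "IKCFG_ST"));
		fclose(f);
		return 0;
	}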

-- 

Sasha.



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 16:35 -0500, Anthony Liguori wrote:
> On 08/24/2011 05:25 PM, David Evensky wrote:
> >
> >
> > This patch adds a PCI device that provides PCI device memory to the
> > guest. This memory in the guest exists as a shared memory segment in
> > the host. This is similar to the memory sharing capability of Nahanni
> > (ivshmem) available in QEMU. In this case, the shared memory segment
> > is exposed as a PCI BAR only.
> >
> > A new command line argument is added as:
> >  --shmem pci:0xc800:16MB:handle=/newmem:create
> >
> > diff -uprN -X linux-kvm/Documentation/dontdiff 
> > linux-kvm/tools/kvm/include/kvm/pci-shmem.h 
> > linux-kvm_pci_shmem/tools/kvm/include/kvm/pci-shmem.h
> > --- linux-kvm/tools/kvm/include/kvm/pci-shmem.h 1969-12-31 
> > 16:00:00.0 -0800
> > +++ linux-kvm_pci_shmem/tools/kvm/include/kvm/pci-shmem.h   2011-08-13 
> > 15:43:01.067953711 -0700
> > @@ -0,0 +1,13 @@
> > +#ifndef KVM__PCI_SHMEM_H
> > +#define KVM__PCI_SHMEM_H
> > +
> > +#include
> > +#include
> > +
> > +struct kvm;
> > +struct shmem_info;
> > +
> > +int pci_shmem__init(struct kvm *self);
> > +int pci_shmem__register_mem(struct shmem_info *si);
> > +
> > +#endif
> > diff -uprN -X linux-kvm/Documentation/dontdiff 
> > linux-kvm/tools/kvm/include/kvm/virtio-pci-dev.h 
> > linux-kvm_pci_shmem/tools/kvm/include/kvm/virtio-pci-dev.h
> > --- linux-kvm/tools/kvm/include/kvm/virtio-pci-dev.h2011-08-09 
> > 15:38:48.760120973 -0700
> > +++ linux-kvm_pci_shmem/tools/kvm/include/kvm/virtio-pci-dev.h  
> > 2011-08-18 10:06:12.171539230 -0700
> > @@ -15,10 +15,13 @@
> >   #define PCI_DEVICE_ID_VIRTIO_BLN  0x1005
> >   #define PCI_DEVICE_ID_VIRTIO_P9   0x1009
> >   #define PCI_DEVICE_ID_VESA0x2000
> > +#define PCI_DEVICE_ID_PCI_SHMEM0x0001
> >
> >   #define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4
> > +#define PCI_VENDOR_ID_PCI_SHMEM0x0001
> >   #define PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET   0x1af4
> 
> FYI, that's not a valid vendor and device ID.
> 
> Perhaps the RH folks would be willing to reserve a portion of the device 
> ID space in their vendor ID for ya'll to play around with.

I'm working on a patch on top of David's patch to turn it into an
ivshmem device. Once it's ready we would use the same vendor/device IDs.
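
For reference, QEMU's ivshmem registers under the Red Hat/Qumranet
vendor ID, so "same IDs" would mean something like the following
(worth double-checking against qemu's ivshmem source before relying
on it):

	#define PCI_VENDOR_ID_IVSHMEM	0x1af4	/* Red Hat/Qumranet */
	#define PCI_DEVICE_ID_IVSHMEM	0x1110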

-- 

Sasha.



Re: Guest kernel device compatibility auto-detection

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 16:48 -0500, Anthony Liguori wrote:
> On 08/25/2011 12:21 AM, Sasha Levin wrote:
> > Hi,
> >
> > Currently when we run the guest we treat it as a black box, we're not
> > quite sure what it's going to start and whether it supports the same
> > features we expect it to support when running it from the host.
> >
> > This forces us to start the guest with the safest defaults possible, for
> > example: '-drive file=my_image.qcow2' will be started with slow IDE
> > emulation even though the guest is capable of virtio.
> >
> > I'm currently working on a method to try and detect whether the guest
> > kernel has specific configurations enabled and either warn the user if
> > we know the kernel is not going to properly work or use better defaults
> > if we know some advanced features are going to work.
> >
> > How am I planning to do it? First, we'll try finding which kernel the
> > guest is going to boot (easy when user does '-kernel', less easy when
> > the user boots an image). For simplicity's sake I'll stick with the
> > '-kernel' option for now.
> 
> Is the problem you're trying to solve determine whether the guest kernel 
> is going to work well under kvm tool or trying to choose the right 
> hardware profile to expose to the guest?
> 
> If it's the former, I think the path you're heading down is the most 
> likely to succeed (trying to guess based on what you can infer about the 
> kernel).
> 
> If it's the later, there's some interesting possibilities we never fully 
> explored in QEMU.
> 

I was thinking about both. I've considered kvm tool to be the 'easy'
case, where we only say whether it would work or not, and QEMU the hard
one, where we need to build a working configuration.

> One would be exposing a well supported device (like IDE emulation) and 
> having a magic mode that allowed you to basically promote the device 
> from IDE emulation to virtio-blk.  Likewise, you could do something like 
> that to promote from the e1000 to virtio-net.
> 
> It might require some special support in the guest kernel and would 
> likely be impossible to do in Windows, but if you primarily care about 
> Linux guests, it ought to be possible.

You're thinking about trying to expose all interfaces during boot and
seeing which ones the kernel bites?

Another thing that comes to mind is that we could start this project
with a script that, given a kernel, finds the optimal hardware
configuration for it (and the matching QEMU command line).

It would simply work the other way around: try booting with the best
devices first; if the kernel doesn't boot, 'demote' them one at a time
until it does.
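
Something like this rough sketch, with an invented profile list and a
deliberately naive success test (a real harness would need a timeout
and would watch the serial console for signs of life):

	#include <stdio.h>
	#include <stdlib.h>

	/* Device profiles, ordered best to worst. */
	static const char *disk_if[] = { "virtio", "ide" };
	static const char *nic_model[] = { "virtio", "e1000" };

	static int try_boot(const char *disk, const char *nic)
	{
		char cmd[512];

		/* panic=1 plus -no-reboot makes qemu exit on a failed
		 * boot; treating a zero exit status as success is the
		 * naive part. */
		snprintf(cmd, sizeof(cmd),
			 "qemu-kvm -kernel bzImage "
			 "-drive file=guest.img,if=%s "
			 "-net nic,model=%s -append 'panic=1' "
			 "-no-reboot -nographic >/dev/null 2>&1",
			 disk, nic);
		return system(cmd) == 0;
	}

	int main(void)
	{
		unsigned int d, n;

		/* For each disk choice, demote the NIC first, then
		 * demote the disk itself. */
		for (d = 0; d < 2; d++)
			for (n = 0; n < 2; n++)
				if (try_boot(disk_if[d], nic_model[n])) {
					printf("if=%s model=%s\n",
					       disk_if[d], nic_model[n]);
					return 0;
				}
		return 1;
	}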

-- 

Sasha.



Re: windows workload: many ept_violation and mmio exits

2011-08-25 Thread ya su
hi,Avi:

    I met the same problem: tons of hpet vm_exits (vector 209, with the
fault address in the guest VM's hpet mmio range). Even after I disable
the hpet device in the win7 guest VM, trace-cmd still shows a large
number of vm_exits; and even when I add -no-hpet when starting the VM,
the guest still has an HPET device inside.

    Does that mean the HPET device in the VM does not depend on the
emulated hpet device in qemu-kvm? Is there any way to disable the VM's
HPET device to prevent so many vm_exits?  Thanks.

Regards.

Suya.

2009/12/3 Avi Kivity :
> On 12/03/2009 03:46 PM, Andrew Theurer wrote:
>>
>> I am running a windows workload which has 26 windows VMs running many
>> instances of a J2EE workload.  There are 13 pairs of an application server
>> VM and database server VM.  There seem to be quite a bit of vm_exits, and it
>> looks over a third of them are mmio_exit:
>>
>>> efer_relo  0
>>> exits      337139
>>> fpu_reloa  247321
>>> halt_exit  19092
>>> halt_wake  18611
>>> host_stat  247332
>>> hypercall  0
>>> insn_emul  184265
>>> insn_emul  184265
>>> invlpg     0
>>> io_exits   69184
>>> irq_exits  52953
>>> irq_injec  48115
>>> irq_windo  2411
>>> largepage  19
>>> mmio_exit  123554
>>
>> I collected a kvmtrace, and below is a very small portion of that.  Is
>> there a way I can figure out what device the mmio's are for?
>
> We want 'info physical_address_space' in the monitor.
>
>> Also, is it normal to have lots of ept_violations?  This is a 2 socket
>> Nehalem system with SMT on.
>
> So long as pf_fixed is low, these are all mmio or apic accesses.
>
>>
>>
>>> qemu-system-x86-19673 [014] 213577.939624: kvm_page_fault: address
>>> fed000f0 error_code 181
>>>  qemu-system-x86-19673 [014] 213577.939627: kvm_mmio: mmio
>>> unsatisfied-read len 4 gpa 0xfed000f0 val 0x0
>>>  qemu-system-x86-19673 [014] 213577.939629: kvm_mmio: mmio read len 4 gpa
>>> 0xfed000f0 val 0xfb8f214d
>
> hpet
>
>>>  qemu-system-x86-19673 [014] 213577.939631: kvm_entry: vcpu 0
>>>  qemu-system-x86-19673 [014] 213577.939633: kvm_exit: reason
>>> ept_violation rip 0xf8000160ef8e
>>>  qemu-system-x86-19673 [014] 213577.939634: kvm_page_fault: address
>>> fed000f0 error_code 181
>
> hpet - was this the same exit? we ought to skip over the emulated
> instruction.
>
>>>  qemu-system-x86-19673 [014] 213577.939693: kvm_page_fault: address
>>> fed000f0 error_code 181
>>>  qemu-system-x86-19673 [014] 213577.939696: kvm_mmio: mmio
>>> unsatisfied-read len 4 gpa 0xfed000f0 val 0x0
>
> hpet
>
>>>  qemu-system-x86-19332 [008] 213577.939699: kvm_exit: reason
>>> ept_violation rip 0xf80001b3af8e
>>>  qemu-system-x86-19332 [008] 213577.939700: kvm_page_fault: address
>>> fed000f0 error_code 181
>>>  qemu-system-x86-19673 [014] 213577.939702: kvm_mmio: mmio read len 4 gpa
>>> 0xfed000f0 val 0xfb8f3da6
>
> hpet
>
>>>  qemu-system-x86-19332 [008] 213577.939706: kvm_mmio: mmio
>>> unsatisfied-read len 4 gpa 0xfed000f0 val 0x0
>>>  qemu-system-x86-19563 [010] 213577.939707: kvm_ioapic_set_irq: pin 11
>>> dst 1 vec=130 (LowPrio|logical|level)
>>>  qemu-system-x86-19332 [008] 213577.939713: kvm_mmio: mmio read len 4 gpa
>>> 0xfed000f0 val 0x29a105de
>
> hpet ...
>
>>>  qemu-system-x86-19673 [014] 213577.939908: kvm_ioapic_set_irq: pin 11
>>> dst 1 vec=130 (LowPrio|logical|level)
>>>  qemu-system-x86-19673 [014] 213577.939910: kvm_entry: vcpu 0
>>>  qemu-system-x86-19673 [014] 213577.939912: kvm_exit: reason apic_access
>>> rip 0xf800016a050c
>>>  qemu-system-x86-19673 [014] 213577.939914: kvm_mmio: mmio write len 4
>>> gpa 0xfee000b0 val 0x0
>
> apic eoi
>
>>>  qemu-system-x86-19332 [008] 213577.939958: kvm_mmio: mmio write len 4
>>> gpa 0xfee000b0 val 0x0
>>>  qemu-system-x86-19673 [014] 213577.939958: kvm_pic_set_irq: chip 1 pin 3
>>> (level|masked)
>>>  qemu-system-x86-19332 [008] 213577.939958: kvm_apic: apic_write APIC_EOI
>>> = 0x0
>
> apic eoi
>
>>>  qemu-system-x86-19673 [014] 213577.940010: kvm_exit: reason cr_access
>>> rip 0xf800016ee2b2
>>>  qemu-system-x86-19673 [014] 213577.940011: kvm_cr: cr_write 4 = 0x678
>>>  qemu-system-x86-19673 [014] 213577.940017: kvm_entry: vcpu 0
>>>  qemu-system-x86-19673 [014] 213577.940019: kvm_exit: reason cr_access
>>> rip 0xf800016ee2b5
>>>  qemu-system-x86-19673 [014] 213577.940019: kvm_cr: cr_write 4 = 0x6f8
>
> toggling global pages, we can avoid that with CR4_GUEST_HOST_MASK.
>
> So, tons of hpet and eois.  We can accelerate both by using the Hyper-V
> accelerations; we already have some (unmerged) code for eoi, so this should
> be improved soon.
>
>>
>> Here is oprofile:
>>
>>> 4117817  62.2029  kvm-intel.ko             kvm-intel.ko
>>> vmx_vcpu_run
>>> 338198    5.1087  qemu-system-x86_64       qemu-system-x86_64
>>> /usr/local/qemu/48bb360cc687b89b74dfb1cac0f6e8812b64841c/bin/qemu-system-x86_64
>>> 62449     0.9433  kvm.ko                   kvm.ko
>>> kvm_arch_vcpu_ioctl_run
>>> 56512     0.8537
>>>  vmlinux-2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1
>>> vmlinux-2.6.32-rc7-5e8cb552cb8b48

Re: kvm PCI assignment & VFIO ramblings

2011-08-25 Thread David Gibson
On Thu, Aug 25, 2011 at 08:25:45AM -0500, Alexander Graf wrote:
> 
> On 25.08.2011, at 07:31, Roedel, Joerg wrote:
> 
> > On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote:
> >> On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
> > 
> 
> [...]
> 
> >> We need to try the polite method of attempting to hot unplug the device
> >> from qemu first, which the current vfio code already implements.  We can
> >> then escalate if it doesn't respond.  The current code calls abort in
> >> qemu if the guest doesn't respond, but I agree we should also be
> >> enforcing this at the kernel interface.  I think the problem with the
> >> hard-unplug is that we don't have a good revoke mechanism for the mmio
> >> mmaps.
> > 
> > For mmio we could stop the guest and replace the mmio region with a
> > region that is filled with 0xff, no?
> 
> Sure, but that happens in user space. The question is how does
> kernel space enforce an MMIO region to not be mapped after the
> hotplug event occurred? Keep in mind that user space is pretty much
> untrusted here - it doesn't have to be QEMU. It could just as well
> be a generic user space driver. And that can just ignore hotplug
> events.

We're saying you hard yank the mapping from the userspace process.
That is, you invalidate all its PTEs mapping the MMIO space, and don't
let it fault them back in.

As I see it there are two options: (a) make subsequent accesses from
userspace or the guest result in a SIGBUS that userspace must either
handle or die from, or (b) replace the mapping with a dummy RO mapping
containing 0xff, with any trapped writes emulated as nops.
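
For option (a), a minimal kernel-side sketch, assuming the mmap is
backed by a file whose address_space the driver controls (the structure
and field names here are invented for illustration):

	#include <linux/mm.h>
	#include <linux/fs.h>

	struct mmio_region {
		struct address_space *mapping;	/* backs the mmap()ed BAR */
		bool revoked;
	};

	/* Hard-yank: zap every PTE in every VMA backed by this mapping
	 * (holelen 0 means "through to the end"), so the next access
	 * faults into mmio_fault() instead of being silently refilled. */
	static void mmio_revoke(struct mmio_region *r)
	{
		r->revoked = true;
		unmap_mapping_range(r->mapping, 0, 0, 1);
	}

	static int mmio_fault(struct vm_area_struct *vma,
			      struct vm_fault *vmf)
	{
		struct mmio_region *r = vma->vm_private_data;

		if (r->revoked)
			return VM_FAULT_SIGBUS;	/* deal with it or die */
		/* ... normal path inserting the real page elided ... */
		return VM_FAULT_SIGBUS;
	}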

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: kvm PCI assignment & VFIO ramblings

2011-08-25 Thread David Gibson
On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote:
> On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
> > On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
> 
> > > I don't see a reason to make this meta-grouping static. It would harm
> > > flexibility on x86. I think it makes things easier on power but there
> > > are options on that platform to get the dynamic solution too.
> > 
> > I think several people are misreading what Ben means by "static".  I
> > would prefer to say 'persistent', in that the meta-groups lifetime is
> > not tied to an fd, but they can be freely created, altered and removed
> > during runtime.
> 
> Even if it can be altered at runtime, from a usability perspective it is
> certainly the best to handle these groups directly in qemu. Or are there
> strong reasons to do it somewhere else?

Funny, Ben and I think usability demands it be the other way around.

If the meta-groups are transient - that is lifetime tied to an fd -
then any program that wants to use meta-groups *must* know the
interfaces for creating one, whatever they are.

But if they're persistent, the admin can use other tools to create the
meta-group then just hand it to a program to use, since the interfaces
for _using_ a meta-group are identical to those for an atomic group.

This doesn't preclude a program from being meta-group aware, and
creating its own if it wants to, of course.  My guess is that qemu
would not want to build its own meta-groups, but libvirt probably
would.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH 11/11] KVM: MMU: improve write flooding detected

2011-08-25 Thread Xiao Guangrong
On 08/25/2011 09:47 PM, Marcelo Tosatti wrote:

> I guess it is OK to be more trigger happy with zapping by ignoring
> the accessed bit, clearing the flood counter on page fault.
> 

Yeah, I like this approach; is this patch good for you?


Subject: [PATCH 11/11] KVM: MMU: improve write flooding detected

Detecting write-flooding does not work well: when we handle a page
write, if the last speculative spte has not been accessed, we treat the
page as write-flooded. However, we install speculative sptes on many
paths, such as pte prefetch and page sync, so the last speculative spte
may not point to the written page, and the written page may still be
accessed via other sptes. Depending on the Accessed bit of the last
speculative spte is therefore not enough.

Instead of detecting whether the page was accessed, we can detect
whether the spte is accessed after it is written: if the spte is
written frequently but never accessed, we conclude it is not a page
table, or that it has not been used for a long time.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |6 +---
 arch/x86/kvm/mmu.c  |   57 +--
 arch/x86/kvm/paging_tmpl.h  |   12 +++-
 3 files changed, 26 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 927ba73..9d17238 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -239,6 +239,8 @@ struct kvm_mmu_page {
int clear_spte_count;
 #endif
 
+   int write_flooding_count;
+
struct rcu_head rcu;
 };
 
@@ -353,10 +355,6 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
 
-   gfn_t last_pt_write_gfn;
-   int   last_pt_write_count;
-   u64  *last_pte_updated;
-
struct fpu guest_fpu;
u64 xcr0;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index adaa160..fd5b389 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1652,6 +1652,18 @@ static void init_shadow_page_table(struct kvm_mmu_page *sp)
sp->spt[i] = 0ull;
 }
 
+static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
+{
+   sp->write_flooding_count = 0;
+}
+
+static void clear_sp_write_flooding_count(u64 *spte)
+{
+   struct kvm_mmu_page *sp =  page_header(__pa(spte));
+
+   __clear_sp_write_flooding_count(sp);
+}
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 gfn_t gfn,
 gva_t gaddr,
@@ -1695,6 +1707,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
} else if (sp->unsync)
kvm_mmu_mark_parents_unsync(sp);
 
+   __clear_sp_write_flooding_count(sp);
trace_kvm_mmu_get_page(sp, false);
return sp;
}
@@ -1847,15 +1860,6 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
mmu_page_remove_parent_pte(sp, parent_pte);
 }
 
-static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
-{
-   int i;
-   struct kvm_vcpu *vcpu;
-
-   kvm_for_each_vcpu(i, vcpu, kvm)
-   vcpu->arch.last_pte_updated = NULL;
-}
-
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
u64 *parent_pte;
@@ -1915,7 +1919,6 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
}
 
sp->role.invalid = 1;
-   kvm_mmu_reset_last_pte_updated(kvm);
return ret;
 }
 
@@ -2360,8 +2363,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}
kvm_release_pfn_clean(pfn);
-   if (speculative)
-   vcpu->arch.last_pte_updated = sptep;
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -3522,13 +3523,6 @@ static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, bool zap_page,
kvm_mmu_flush_tlb(vcpu);
 }
 
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
-   u64 *spte = vcpu->arch.last_pte_updated;
-
-   return !!(spte && (*spte & shadow_accessed_mask));
-}
-
 static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
const u8 *new, int *bytes)
 {
@@ -3569,22 +3563,9 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
  * If we're seeing too many writes to a page, it may no longer be a page table,
  * or we may be forking, in which case it is better to unmap the page.
  */
-static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
+static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
 {
-   bool flooded = false;
-
-   if (gfn == vcpu->arch.last_pt_write_gfn
-   && !last_updated_pte_accessed(vcpu)) {
-   ++vcpu->arch.last_pt_write_count;
-   if (vcpu->arch.last_pt

Re: [PATCH 0/3] Emulator fuzz tester

2011-08-25 Thread Lucas Meneghel Rodrigues

On 08/22/2011 10:41 AM, Avi Kivity wrote:

As it is exposed directly to guest code, the x86 emulator is an interesting
target for exploiters: a vulnerability may lead to compromise of the entire
host.

In an attempt to catch vulnerabilities before they make it into production
code, this patchset adds a fuzz tester for the emulator.  Instructions
are synthesized and fed into the emulator; a vulnerability will usually
result in an access violation.

I tried to make the emulator test build and run in userspace; this proved too
difficult, so the test is built as part of the kernel.  It can still be run
in userspace, via KVM:

   qemu -enable-kvm -smp 4 -serial stdio -kernel bzImage \
   -append 'console=ttyS0 test_emulator.iterations=10'

   ...
   starting emulator test
   emulator fuzz test results
 instructions: 10
 decoded:94330032
 emulated:   92529152
 nofault: 117
 failures:  0
   emulator test: PASS
   ...

One billion random instructions failed to find a vulnerability, so either
the emulator is really good, or the test is really bad, or we need a lot more
runtime.

Lucas, how would we go about integrating this into kvm-autotest?


I have applied the 3 patches on top of your latest tree, compiled the
kernel, but I'm having trouble running the test the way you described.


One thing I've noticed here: I can only compile the test as a kernel
module, not into the kernel image (menuconfig only gives me N/m/?), so I
believe there's no way to test it the way you have described... In any
case I did try what you suggested, and the kernel panics due to the lack
of a filesystem/init. After some reading, I learned to create a bogus fs
with a bogus init in it, but still, the test does not run (I guess
because the test module is not compiled into the bzImage).


I assume there are some details you forgot to mention to get this 
done... Would you mind posting a more detailed procedure?


To avoid misunderstandings, here is the outline of what I've tried:

1) Updated my kvm.git repo, so it reflects latest upstream
2) Applied all 3 patches of this series
3) make bzImage, make modules
4) Tried to boot with a very recent qemu-kvm compiled from HEAD (exec 
name is like this because it's a symlink)


./qemu -smp 4 -serial stdio -kernel ~/Code/kvm/arch/x86_64/boot/bzImage 
-append 'console=ttyS0 test_emulator.iterations=10'


No use (kernel panics due to a lack of rootfs and init).

5) Then I tried:

./qemu -smp 4 -serial stdio -kernel ~/Code/kvm/arch/x86_64/boot/bzImage 
-initrd rootfs -append 'root=/dev/ram console=ttyS0 
test_emulator.iterations=10'


No use either; the kernel just sits at the bogus init and does nothing else.

Cheers,

Lucas



Re: [Autotest] [PATCH] KVM test: Add cpu_hotplug subtest

2011-08-25 Thread Lucas Meneghel Rodrigues
On Wed, Aug 24, 2011 at 1:25 AM, pradeep  wrote:
> On Wed, 24 Aug 2011 01:05:13 -0300
> Lucas Meneghel Rodrigues  wrote:
>
>> Tests the ability of adding virtual cpus on the fly to qemu using
>> the monitor command cpu_set, then after everything is OK, run the
>> cpu_hotplug testsuite on the guest through autotest.
>>
>> Updates: As of the latest qemu-kvm (08-24-2011) HEAD, trying to
>> online more CPUs than the ones already available leads to qemu
>> hanging:
>>
>> File /home/lmr/Code/autotest-git/client/virt/kvm_monitor.py, line
>> 279, in cmd raise MonitorProtocolError(msg)
>> MonitorProtocolError: Could not find (qemu) prompt after command
>> cpu_set 2 online. Output so far: ""
>>
>> Signed-off-by: Lucas Meneghel Rodrigues 
>> ---
>>  client/tests/kvm/tests/cpu_hotplug.py  |   99
>> 
>> client/tests/kvm/tests_base.cfg.sample |    7 ++ 2 files changed, 106
>> insertions(+), 0 deletions(-) create mode 100644
>> client/tests/kvm/tests/cpu_hotplug.py
>>
>> diff --git a/client/tests/kvm/tests/cpu_hotplug.py
>> b/client/tests/kvm/tests/cpu_hotplug.py new file mode 100644
>> index 000..fa75c9b
>> --- /dev/null
>> +++ b/client/tests/kvm/tests/cpu_hotplug.py
>> @@ -0,0 +1,99 @@
>> +import os, logging, re
>> +from autotest_lib.client.common_lib import error
>> +from autotest_lib.client.virt import virt_test_utils
>> +
>> +
>> +@error.context_aware
>> +def run_cpu_hotplug(test, params, env):
>> +    """
>> +    Runs CPU hotplug test:
>> +
>> +    1) Pick up a living guest
>> +    2) Send the monitor command cpu_set [cpu id] for each cpu we
>> wish to have
>> +    3) Verify if guest has the additional CPUs showing up under
>> +        /sys/devices/system/cpu
>> +    4) Try to bring them online by writing 1 to the 'online' file
>> inside that dir
>> +    5) Run the CPU Hotplug test suite shipped with autotest inside
>
> It looks good to me.  How about adding
>        1) off-lining of vcpu.
>        2) Frequent offline-online of vcpus.  some thing like below.
>
> #!/bin/sh
>
> SYS_CPU_DIR=/sys/devices/system/cpu
>
> VICTIM_IRQ=15
> IRQ_MASK=f0
>
> iteration=0
> while true; do
>  echo $iteration
>  echo $IRQ_MASK > /proc/irq/$VICTIM_IRQ/smp_affinity
>  for cpudir in $SYS_CPU_DIR/cpu[1-9]; do
>    echo 0 > $cpudir/online
>  done
>  for cpudir in $SYS_CPU_DIR/cpu[1-9]; do
>    echo 1 > $cpudir/online
>  done
>  iteration=`expr $iteration + 1`
> done
>

Implemented your suggestion, see:

http://patchwork.test.kernel.org/patch/3613/


[PATCH] KVM test: Add cpu_hotplug subtest v2

2011-08-25 Thread Lucas Meneghel Rodrigues
Tests the ability of adding virtual cpus on the fly to qemu using
the monitor command cpu_set, then after everything is OK, run the
cpu_hotplug testsuite on the guest through autotest.

Updates: As of the latest qemu-kvm (08-24-2011) HEAD, trying to
online more CPUs than the ones already available leads to qemu
hanging:

File /home/lmr/Code/autotest-git/client/virt/kvm_monitor.py, line 279, in cmd
raise MonitorProtocolError(msg)
MonitorProtocolError: Could not find (qemu) prompt after command cpu_set 2 online. Output so far: ""

Changes from v2:
 * Added stress onlining/offlining CPUs
 * Changed variant name

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/tests/cpu_hotplug.py  |  111 
 client/tests/kvm/tests_base.cfg.sample |8 ++
 2 files changed, 119 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/cpu_hotplug.py

diff --git a/client/tests/kvm/tests/cpu_hotplug.py b/client/tests/kvm/tests/cpu_hotplug.py
new file mode 100644
index 000..e4d79f8
--- /dev/null
+++ b/client/tests/kvm/tests/cpu_hotplug.py
@@ -0,0 +1,111 @@
+import os, logging, re
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.virt import virt_test_utils
+
+
+@error.context_aware
+def run_cpu_hotplug(test, params, env):
+"""
+Runs CPU hotplug test:
+
+1) Pick up a living guest
+2) Send the monitor command cpu_set [cpu id] for each cpu we wish to have
+3) Verify if guest has the additional CPUs showing up under
+/sys/devices/system/cpu
+4) Try to bring them online by writing 1 to the 'online' file inside that dir
+5) Run the CPU Hotplug test suite shipped with autotest inside guest
+
+@param test: KVM test object.
+@param params: Dictionary with test parameters.
+@param env: Dictionary with the test environment.
+"""
+vm = env.get_vm(params["main_vm"])
+vm.verify_alive()
+timeout = int(params.get("login_timeout", 360))
+session = vm.wait_for_login(timeout=timeout)
+
+n_cpus_add = int(params.get("n_cpus_add", 1))
+current_cpus = int(params.get("smp", 1))
+onoff_iterations = int(params.get("onoff_iterations", 20))
+total_cpus = current_cpus + n_cpus_add
+
+error.context("getting guest dmesg before addition")
+dmesg_before = session.cmd("dmesg -c")
+
+error.context("Adding %d CPUs to guest" % n_cpus_add)
+for i in range(total_cpus):
+vm.monitor.cmd("cpu_set %s online" % i)
+
+output = vm.monitor.cmd("info cpus")
+logging.debug("Output of info cpus:\n%s", output)
+
+cpu_regexp = re.compile("CPU #(\d+)")
+total_cpus_monitor = len(cpu_regexp.findall(output))
+if total_cpus_monitor != total_cpus:
+raise error.TestFail("Monitor reports %s CPUs, when VM should have %s" %
+ (total_cpus_monitor, total_cpus))
+
+dmesg_after = session.cmd("dmesg -c")
+logging.debug("Guest dmesg output after CPU add:\n%s" % dmesg_after)
+
+# Verify whether the new cpus are showing up on /sys
+error.context("verifying if new CPUs are showing on guest's /sys dir")
+n_cmd = 'find /sys/devices/system/cpu/cpu[0-99] -maxdepth 0 -type d | wc -l'
+output = session.cmd(n_cmd)
+logging.debug("List of cpus on /sys:\n%s" % output)
+try:
+cpus_after_addition = int(output)
+except ValueError:
+logging.error("Output of '%s': %s", n_cmd, output)
+raise error.TestFail("Unable to get CPU count after CPU addition")
+
+if cpus_after_addition != total_cpus:
+raise error.TestFail("%s CPUs are showing up under "
+ "/sys/devices/system/cpu, was expecting %s" %
+ (cpus_after_addition, total_cpus))
+
+error.context("locating online files for guest's new CPUs")
+r_cmd = 'find /sys/devices/system/cpu/cpu[1-99]/online -maxdepth 0 -type f'
+online_files = session.cmd(r_cmd)
+logging.debug("CPU online files detected: %s", online_files)
+online_files = sorted(online_files.split())
+
+if not online_files:
+raise error.TestFail("Could not find CPUs that can be "
+ "enabled/disabled on guest")
+
+for online_file in online_files:
+cpu_regexp = re.compile("cpu(\d+)", re.IGNORECASE)
+cpu_id = cpu_regexp.findall(online_file)[0]
+error.context("changing online status for CPU %s" % cpu_id)
+check_online_status = session.cmd("cat %s" % online_file)
+try:
+check_online_status = int(check_online_status)
+except ValueError:
+raise error.TestFail("Unable to get online status from CPU %s" %
+ cpu_id)
+assert(check_online_status in [0, 1])
+if check_online_status == 0:
+error.context("Bringing CPU %s online" % cpu_id)
+session.cmd("echo 1 > %s" % online_file)
+
+# Now that all CPUs were onlined, let's execute the
+

Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread David Evensky


Thanks. My initial version did use the E820 map (hence my wanting an
'address family'), but it was suggested that PCI would be a better way
to go. When I get the rest of the project going, I will certainly test
against that. I am going to have to do a LOT of ioremaps, so that might
be the bottleneck. That said, I don't think it will end up being an
issue.

\dae

On Thu, Aug 25, 2011 at 03:08:52PM -0700, Eric Northup wrote:
> Just FYI, one issue that I found with exposing host memory regions as
> a PCI BAR (including via a very old version of the ivshmem driver...
> haven't tried a newer one) is that x86's pci_mmap_page_range doesn't
> want to set up a write-back cacheable mapping of a BAR.
> 
> It may not matter for your requirements, but the uncached access
> reduced guest<->host bandwidth via the shared memory driver by a lot.
> 
> 
> If you need the physical address to be fixed, you might be better off
> by reserving a memory region in the e820 map rather than a PCI BAR,
> since BARs can move around.
> 
> 
> On Thu, Aug 25, 2011 at 8:08 AM, David Evensky
>  wrote:
> >
> > Adding in the rest of what ivshmem does shouldn't affect our use, *I
> > think*.  I hadn't intended this to do everything that ivshmem does,
> > but I can see how that would be useful. It would be cool if it could
> > grow into that.
> >
> > Our requirements for the driver in kvm tool are that another program
> > on the host can create a shared segment (anonymous, non-file backed)
> > with a specified handle, size, and contents. That this segment is
> > available to the guest at boot time at a specified address and that no
> > driver will change the contents of the memory except under direct user
> > action. Also, when the guest goes away the shared memory segment
> > shouldn't be affected (e.g. contents changed). Finally, we cannot
> > change the lightweight nature of kvm tool.
> >
> > This is the feature of ivshmem that I need to check today. I did some
> > testing a month ago, but it wasn't detailed enough to check this out.
> >
> > \dae
> >
> >
> >
> >
> > On Thu, Aug 25, 2011 at 02:25:48PM +0300, Sasha Levin wrote:
> > > On Thu, 2011-08-25 at 11:59 +0100, Stefan Hajnoczi wrote:
> > > > On Thu, Aug 25, 2011 at 11:37 AM, Pekka Enberg  
> > > > wrote:
> > > > > Hi Stefan,
> > > > >
> > > > > On Thu, Aug 25, 2011 at 1:31 PM, Stefan Hajnoczi  
> > > > > wrote:
> > > > >>> It's obviously not competing. One thing you might want to consider 
> > > > >>> is
> > > > >>> making the guest interface compatible with ivshmem. Is there any 
> > > > >>> reason
> > > > >>> we shouldn't do that? I don't consider that a requirement, just 
> > > > >>> nice to
> > > > >>> have.
> > > > >>
> > > > >> The point of implementing the same interface as ivshmem is that users
> > > > >> don't need to rejig guests or applications in order to switch between
> > > > >> hypervisors.  A different interface also prevents same-to-same
> > > > >> benchmarks.
> > > > >>
> > > > >> There is little benefit to creating another virtual device interface
> > > > >> when a perfectly good one already exists.  The question should be: 
> > > > >> how
> > > > >> is this shmem device different and better than ivshmem?  If there is
> > > > >> no justification then implement the ivshmem interface.
> > > > >
> > > > > So which interface are we actually taking about? Userspace/kernel in 
> > > > > the
> > > > > guest or hypervisor/guest kernel?
> > > >
> > > > The hardware interface.  Same PCI BAR layout and semantics.
> > > >
> > > > > Either way, while it would be nice to share the interface but it's 
> > > > > not a
> > > > > *requirement* for tools/kvm unless ivshmem is specified in the virtio
> > > > > spec or the driver is in mainline Linux. We don't intend to require 
> > > > > people
> > > > > to implement non-standard and non-Linux QEMU interfaces. OTOH,
> > > > > ivshmem would make the PCI ID problem go away.
> > > >
> > > > Introducing yet another non-standard and non-Linux interface doesn't
> > > > help though.  If there is no significant improvement over ivshmem then
> > > > it makes sense to let ivshmem gain critical mass and more users
> > > > instead of fragmenting the space.
> > >
> > > I support doing it ivshmem-compatible, though it doesn't have to be a
> > > requirement right now (that is, use this patch as a base and build it
> > > towards ivshmem - which shouldn't be an issue since this patch provides
> > > the PCI+SHM parts which are required by ivshmem anyway).
> > >
> > > ivshmem is a good, documented, stable interface with a lot of
> > > research and testing behind it. Looking at the spec it's obvious that
> > > Cam had KVM in mind when designing it, and that's exactly what we want
> > > to have in the KVM tool.
> > >
> > > David, did you have any plans to extend it to become ivshmem-compatible?
> > > If not, would turning it into such break any code that depends on it
> > > horribly?
> > >
> > > --
> > >
> > > Sasha.
> > >

Re: [PATCH 3/3] KVM: x86 emulator: fuzz tester

2011-08-25 Thread Lucas Meneghel Rodrigues

On 08/22/2011 10:41 AM, Avi Kivity wrote:

The x86 emulator is directly exposed to guest code; therefore it is part
of the directly exposed attack surface.  To reduce the risk of
vulnerabilities, this patch adds a fuzz test that runs random instructions
through the emulator.  A vulnerability will usually result in an oops.

One way to run the test is via KVM itself:

   qemu -enable-kvm -smp 4 -serial stdio -kernel bzImage \
   -append 'console=ttyS0 test_emulator.iterations=10'

this requires that the test module be built into the kernel.

Signed-off-by: Avi Kivity
---
  arch/x86/Kbuild  |1 +
  arch/x86/kvm/Kconfig |   11 +
  arch/x86/kvm/Makefile|1 +
  arch/x86/kvm/test-emulator.c |  533 ++
  4 files changed, 546 insertions(+), 0 deletions(-)
  create mode 100644 arch/x86/kvm/test-emulator.c

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index 0e9dec6..0d80e6f 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -1,5 +1,6 @@

  obj-$(CONFIG_KVM) += kvm/
+obj-$(CONFIG_KVM_EMULATOR_TEST) += kvm/

  # Xen paravirtualization support
  obj-$(CONFIG_XEN) += xen/
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ff5790d..9ffc30a 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -76,6 +76,17 @@ config KVM_MMU_AUDIT
 This option adds a R/W kVM module parameter 'mmu_audit', which allows
 audit  KVM MMU at runtime.

+config KVM_EMULATOR_TEST
+tristate "KVM emulator self test"
+   depends on KVM
+   ---help---
+Build test code that checks the x86 emulator during boot or module
+ insertion.  If built as a module, it will be called test-emulator.ko.
+
+The emulator test will run for as many iterations as are specified by
+ the emulator_test.iterations parameter; all processors will be
+ utilized.  When the test is complete, results are reported in dmesg.
+
  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
  # the virtualization menu.
  source drivers/vhost/Kconfig
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index f15501f..fc4a9e2 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -19,3 +19,4 @@ kvm-amd-y += svm.o
  obj-$(CONFIG_KVM) += kvm.o
  obj-$(CONFIG_KVM_INTEL)   += kvm-intel.o
  obj-$(CONFIG_KVM_AMD) += kvm-amd.o
+obj-$(CONFIG_KVM_EMULATOR_TEST) += test-emulator.o
diff --git a/arch/x86/kvm/test-emulator.c b/arch/x86/kvm/test-emulator.c
new file mode 100644
index 000..1e3a22f
--- /dev/null
+++ b/arch/x86/kvm/test-emulator.c
@@ -0,0 +1,533 @@
+/*
+ * x86 instruction emulator test
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates.
+ *
+ * Authors:
+ *   Avi Kivity
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include


I still haven't gone through all the code, but it's a good idea to add a
MODULE_LICENSE("GPL") macro here, so the build system doesn't complain:


WARNING: modpost: missing MODULE_LICENSE() in arch/x86/kvm/test-emulator.o
see include/linux/module.h for more information


+static ulong iterations = 0;
+module_param(iterations, ulong, S_IRUGO);
+
+struct test_context {
+   struct work_struct work;
+   struct completion completion;
+   struct x86_emulate_ctxt ctxt;
+   struct test_context *next;
+   bool failed;
+   u8 insn[15];
+   bool insn_base_valid;
+   ulong insn_base;
+   struct test_seg {
+   u16 selector;
+   struct desc_struct desc;
+   u32 base3;
+   bool valid;
+   } segs[8];
+   ulong iterations;
+   ulong completed;
+   ulong decoded;
+   ulong emulated;
+   ulong nofault;
+   ulong failures;
+};
+
+static u64 random64(void)
+{
+   return random32() | ((u64)random32()<<  32);
+}
+
+static ulong randlong(void)
+{
+   if (sizeof(ulong) == sizeof(u32))
+   return random32();
+   else
+   return random64();
+}
+
+static struct test_context *to_test(struct x86_emulate_ctxt *ctxt)
+{
+   return container_of(ctxt, struct test_context, ctxt);
+}
+
+static void fail(struct x86_emulate_ctxt *ctxt, const char *msg, ...)
+   __attribute__((format(printf, 2, 3)));
+
+static void fail(struct x86_emulate_ctxt *ctxt, const char *msg, ...)
+{
+   va_list args;
+   char s[200];
+
+   va_start(args, msg);
+   vsnprintf(s, sizeof(s), msg, args);
+   va_end(args);
+   printk("emulator test failure: %s\n", s);
+   to_test(ctxt)->failed = true;
+}
+
+static int test_fill_exception(struct x86_exception *ex)
+{
+   if (random32() % 4 == 0) {
+   if (ex) {
+   ex->vector = random32();
+   ex->error_code_valid = random32();
+  

Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Eric Northup
Just FYI, one issue that I found with exposing host memory regions as
a PCI BAR (including via a very old version of the ivshmem driver...
haven't tried a newer one) is that x86's pci_mmap_page_range doesn't
want to set up a write-back cacheable mapping of a BAR.

It may not matter for your requirements, but the uncached access
reduced guest<->host bandwidth via the shared memory driver by a lot.


If you need the physical address to be fixed, you might be better off
by reserving a memory region in the e820 map rather than a PCI BAR,
since BARs can move around.
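
On the guest side, the cacheability difference comes down to which
mapping primitive the driver uses; a sketch, assuming an ordinary PCI
probe found the device:

	#include <linux/pci.h>
	#include <linux/io.h>

	static void __iomem *map_shmem_bar(struct pci_dev *pdev, int bar)
	{
		resource_size_t start = pci_resource_start(pdev, bar);
		resource_size_t len = pci_resource_len(pdev, bar);

		/* Plain ioremap() gives the uncached (slow) mapping;
		 * ioremap_cache() requests write-back, subject to the
		 * PAT/MTRR constraints described above. */
		return ioremap_cache(start, len);
	}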


On Thu, Aug 25, 2011 at 8:08 AM, David Evensky
 wrote:
>
> Adding in the rest of what ivshmem does shouldn't affect our use, *I
> think*.  I hadn't intended this to do everything that ivshmem does,
> but I can see how that would be useful. It would be cool if it could
> grow into that.
>
> Our requirements for the driver in kvm tool are that another program
> on the host can create a shared segment (anonymous, non-file backed)
> with a specified handle, size, and contents. That this segment is
> available to the guest at boot time at a specified address and that no
> driver will change the contents of the memory except under direct user
> action. Also, when the guest goes away the shared memory segment
> shouldn't be affected (e.g. contents changed). Finally, we cannot
> change the lightweight nature of kvm tool.
>
> This is the feature of ivshmem that I need to check today. I did some
> testing a month ago, but it wasn't detailed enough to check this out.
>
> \dae
>
>
>
>
> On Thu, Aug 25, 2011 at 02:25:48PM +0300, Sasha Levin wrote:
> > On Thu, 2011-08-25 at 11:59 +0100, Stefan Hajnoczi wrote:
> > > On Thu, Aug 25, 2011 at 11:37 AM, Pekka Enberg  wrote:
> > > > Hi Stefan,
> > > >
> > > > On Thu, Aug 25, 2011 at 1:31 PM, Stefan Hajnoczi  
> > > > wrote:
> > > >>> It's obviously not competing. One thing you might want to consider is
> > > >>> making the guest interface compatible with ivshmem. Is there any 
> > > >>> reason
> > > >>> we shouldn't do that? I don't consider that a requirement, just nice 
> > > >>> to
> > > >>> have.
> > > >>
> > > >> The point of implementing the same interface as ivshmem is that users
> > > >> don't need to rejig guests or applications in order to switch between
> > > >> hypervisors.  A different interface also prevents same-to-same
> > > >> benchmarks.
> > > >>
> > > >> There is little benefit to creating another virtual device interface
> > > >> when a perfectly good one already exists.  The question should be: how
> > > >> is this shmem device different and better than ivshmem?  If there is
> > > >> no justification then implement the ivshmem interface.
> > > >
> > > > So which interface are we actually taking about? Userspace/kernel in the
> > > > guest or hypervisor/guest kernel?
> > >
> > > The hardware interface.  Same PCI BAR layout and semantics.
> > >
> > > > Either way, while it would be nice to share the interface but it's not a
> > > > *requirement* for tools/kvm unless ivshmem is specified in the virtio
> > > > spec or the driver is in mainline Linux. We don't intend to require 
> > > > people
> > > > to implement non-standard and non-Linux QEMU interfaces. OTOH,
> > > > ivshmem would make the PCI ID problem go away.
> > >
> > > Introducing yet another non-standard and non-Linux interface doesn't
> > > help though.  If there is no significant improvement over ivshmem then
> > > it makes sense to let ivshmem gain critical mass and more users
> > > instead of fragmenting the space.
> >
> > I support doing it ivshmem-compatible, though it doesn't have to be a
> > requirement right now (that is, use this patch as a base and build it
> > towards ivshmem - which shouldn't be an issue since this patch provides
> > the PCI+SHM parts which are required by ivshmem anyway).
> >
> > ivshmem is a good, documented, stable interface with a lot of
> > research and testing behind it. Looking at the spec it's obvious that
> > Cam had KVM in mind when designing it, and that's exactly what we want
> > to have in the KVM tool.
> >
> > David, did you have any plans to extend it to become ivshmem-compatible?
> > If not, would turning it into such break any code that depends on it
> > horribly?
> >
> > --
> >
> > Sasha.
> >


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread David Evensky

I need to specify the physical address because I need to ioremap the
memory during boot.
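
For context, a sketch of that boot-time mapping as a trivial module;
the address and size are illustrative and have to match whatever
--shmem fixed the BAR at, which is exactly why it must be specifiable:

	#include <linux/module.h>
	#include <linux/io.h>

	#define SHMEM_PHYS	0xfd000000UL	/* wherever --shmem put it */
	#define SHMEM_SIZE	(16 << 20)

	static void __iomem *shmem;

	static int __init shmem_map_init(void)
	{
		shmem = ioremap(SHMEM_PHYS, SHMEM_SIZE);
		return shmem ? 0 : -ENOMEM;
	}
	module_init(shmem_map_init);

	static void __exit shmem_map_exit(void)
	{
		iounmap(shmem);
	}
	module_exit(shmem_map_exit);

	MODULE_LICENSE("GPL");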

The production issue, I think, is a memory limitation. We certainly do
use QEMU a lot, but for this the kvm tool is a better fit.

\dae

On Fri, Aug 26, 2011 at 12:11:03AM +0300, Avi Kivity wrote:
> On 08/26/2011 12:00 AM, David Evensky wrote:
> >I've tested ivshmem with the latest git pull (had minor trouble
> >building on debian sid, vnc and unused var, but trivial to work
> >around).
> >
> >QEMU's  -device ivshmem,size=16,shm=/kvm_shmem
> >
> >seems to function as my proposed
> >
> > --shmem pci:0xfd00:16M:handle=/kvm_shmem
> >
> >except that I can't specify the BAR. I am able to read what
> >I'm given, 0xfd00, from lspci -vvv; but for our application
> >we need to be able to specify the address on the command line.
> >
> >If folks are open, I would like to request this feature in the
> >ivshmem.
> 
> It's not really possible. Qemu does not lay out the BARs, the guest
> does (specifically the bios).  You might be able to re-arrange the
> layout after the guest boots.
> 
> Why do you need the BAR at a specific physical address?
> 
> >It would be cool to test our application with QEMU,
> >even if we can't use it in production.
> 
> Why can't you use qemu in production?
> 
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
> 


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread David Evensky
On Thu, Aug 25, 2011 at 04:35:29PM -0500, Anthony Liguori wrote:
> 
> >--- linux-kvm/tools/kvm/include/kvm/virtio-pci-dev.h 2011-08-09 
> >15:38:48.760120973 -0700
> >+++ linux-kvm_pci_shmem/tools/kvm/include/kvm/virtio-pci-dev.h   
> >2011-08-18 10:06:12.171539230 -0700
> >@@ -15,10 +15,13 @@
> >  #define PCI_DEVICE_ID_VIRTIO_BLN   0x1005
> >  #define PCI_DEVICE_ID_VIRTIO_P90x1009
> >  #define PCI_DEVICE_ID_VESA 0x2000
> >+#define PCI_DEVICE_ID_PCI_SHMEM 0x0001
> >
> >  #define PCI_VENDOR_ID_REDHAT_QUMRANET  0x1af4
> >+#define PCI_VENDOR_ID_PCI_SHMEM 0x0001
> >  #define PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET0x1af4
> 
> FYI, that's not a valid vendor and device ID.
> 
> Perhaps the RH folks would be willing to reserve a portion of the
> device ID space in their vendor ID for ya'll to play around with.

That would be cool! I've started asking around some folks at my
place to see if we have such a thing; but so far, I've heard nothing.



> 
> Regards,
> 
> Anthony Liguori


Re: Guest kernel device compatibility auto-detection

2011-08-25 Thread Anthony Liguori

On 08/25/2011 12:21 AM, Sasha Levin wrote:

Hi,

Currently when we run the guest we treat it as a black box, we're not
quite sure what it's going to start and whether it supports the same
features we expect it to support when running it from the host.

This forces us to start the guest with the safest defaults possible, for
example: '-drive file=my_image.qcow2' will be started with slow IDE
emulation even though the guest is capable of virtio.

I'm currently working on a method to try and detect whether the guest
kernel has specific configurations enabled and either warn the user if
we know the kernel is not going to properly work or use better defaults
if we know some advanced features are going to work.

How am I planning to do it? First, we'll try finding which kernel the
guest is going to boot (easy when user does '-kernel', less easy when
the user boots an image). For simplicity's sake I'll stick with the
'-kernel' option for now.


Is the problem you're trying to solve determine whether the guest kernel 
is going to work well under kvm tool or trying to choose the right 
hardware profile to expose to the guest?


If it's the former, I think the path you're heading down is the most 
likely to succeed (trying to guess based on what you can infer about the 
kernel).


If it's the later, there's some interesting possibilities we never fully 
explored in QEMU.


One would be exposing a well supported device (like IDE emulation) and 
having a magic mode that allowed you to basically promote the device 
from IDE emulation to virtio-blk.  Likewise, you could do something like 
that to promote from the e1000 to virtio-net.


It might require some special support in the guest kernel and would 
likely be impossible to do in Windows, but if you primarily care about 
Linux guests, it ought to be possible.


Regards,

Anthony Liguori


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Anthony Liguori

On 08/24/2011 05:25 PM, David Evensky wrote:



This patch adds a PCI device that provides PCI device memory to the
guest. This memory in the guest exists as a shared memory segment in
the host. This is similar to the memory sharing capability of Nahanni
(ivshmem) available in QEMU. In this case, the shared memory segment
is exposed as a PCI BAR only.

A new command line argument is added as:
 --shmem pci:0xc800:16MB:handle=/newmem:create

diff -uprN -X linux-kvm/Documentation/dontdiff 
linux-kvm/tools/kvm/include/kvm/pci-shmem.h 
linux-kvm_pci_shmem/tools/kvm/include/kvm/pci-shmem.h
--- linux-kvm/tools/kvm/include/kvm/pci-shmem.h 1969-12-31 16:00:00.0 
-0800
+++ linux-kvm_pci_shmem/tools/kvm/include/kvm/pci-shmem.h   2011-08-13 
15:43:01.067953711 -0700
@@ -0,0 +1,13 @@
+#ifndef KVM__PCI_SHMEM_H
+#define KVM__PCI_SHMEM_H
+
+#include
+#include
+
+struct kvm;
+struct shmem_info;
+
+int pci_shmem__init(struct kvm *self);
+int pci_shmem__register_mem(struct shmem_info *si);
+
+#endif
diff -uprN -X linux-kvm/Documentation/dontdiff 
linux-kvm/tools/kvm/include/kvm/virtio-pci-dev.h 
linux-kvm_pci_shmem/tools/kvm/include/kvm/virtio-pci-dev.h
--- linux-kvm/tools/kvm/include/kvm/virtio-pci-dev.h2011-08-09 
15:38:48.760120973 -0700
+++ linux-kvm_pci_shmem/tools/kvm/include/kvm/virtio-pci-dev.h  2011-08-18 
10:06:12.171539230 -0700
@@ -15,10 +15,13 @@
  #define PCI_DEVICE_ID_VIRTIO_BLN  0x1005
  #define PCI_DEVICE_ID_VIRTIO_P9   0x1009
  #define PCI_DEVICE_ID_VESA0x2000
+#define PCI_DEVICE_ID_PCI_SHMEM0x0001

  #define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4
+#define PCI_VENDOR_ID_PCI_SHMEM0x0001
  #define PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET   0x1af4


FYI, that's not a valid vendor and device ID.

Perhaps the RH folks would be willing to reserve a portion of the device 
ID space in their vendor ID for ya'll to play around with.


Regards,

Anthony Liguori


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Avi Kivity

On 08/26/2011 12:00 AM, David Evensky wrote:

I've tested ivshmem with the latest git pull (had minor trouble
building on debian sid, a vnc issue and an unused var, but trivial to
work around).

QEMU's  -device ivshmem,size=16,shm=/kvm_shmem

seems to function as my proposed

 --shmem pci:0xfd00:16M:handle=/kvm_shmem

except that I can't specify the BAR. I am able to read what
I'm given, 0xfd00, from lspci -vvv; but for our application
we need to be able to specify the address on the command line.

If folks are open, I would like to request this feature in the
ivshmem.


It's not really possible. Qemu does not lay out the BARs, the guest does 
(specifically the bios).  You might be able to re-arrange the layout 
after the guest boots.
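
A sketch of what re-arranging from inside the guest could look like,
from a guest kernel module; purely illustrative, since it rewrites the
register without updating the kernel's own resource tree:

	#include <linux/pci.h>

	static void move_bar2(struct pci_dev *pdev, u32 new_base)
	{
		u16 cmd;

		/* Disable memory decode while the BAR moves. */
		pci_read_config_word(pdev, PCI_COMMAND, &cmd);
		pci_write_config_word(pdev, PCI_COMMAND,
				      cmd & ~PCI_COMMAND_MEMORY);
		pci_write_config_dword(pdev, PCI_BASE_ADDRESS_2, new_base);
		pci_write_config_word(pdev, PCI_COMMAND, cmd);
	}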


Why do you need the BAR at a specific physical address?


It would be cool to test our application with QEMU,
even if we can't use it in production.


Why can't you use qemu in production?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread David Evensky

I've tested ivshmem with the latest git pull (had minor trouble
building on debian sid, a vnc issue and an unused var, but trivial to
work around).

QEMU's  -device ivshmem,size=16,shm=/kvm_shmem

seems to function as my proposed

--shmem pci:0xfd00:16M:handle=/kvm_shmem

except that I can't specify the BAR. I am able to read what
I'm given, 0xfd00, from lspci -vvv; but for our application
we need to be able to specify the address on the command line.

If folks are open, I would like to request this feature in the
ivshmem. It would be cool to test our application with QEMU,
even if we can't use it in production.

I didn't check the case where QEMU must create the shared
segment from scratch, etc., so I didn't test what difference my
proposed 'create' flag makes, but I did look at the ivshmem
source and it looks like it does the right thing.
(Makes me want to steal code to make mine better :-))


\dae

On Thu, Aug 25, 2011 at 08:08:06AM -0700, David Evensky wrote:
> 
> Adding in the rest of what ivshmem does shouldn't affect our use, *I
> think*.  I hadn't intended this to do everything that ivshmem does,
> but I can see how that would be useful. It would be cool if it could
> grow into that.
> 
> Our requirements for the driver in kvm tool are that another program
> on the host can create a shared segment (anonymous, non-file backed)
> with a specified handle, size, and contents. That this segment is
> available to the guest at boot time at a specified address and that no
> driver will change the contents of the memory except under direct user
> action. Also, when the guest goes away the shared memory segment
> shouldn't be affected (e.g. contents changed). Finally, we cannot
> change the lightweight nature of kvm tool.
> 
> This is the feature of ivshmem that I need to check today. I did some
> testing a month ago, but it wasn't detailed enough to check this out.
> 
> \dae
> 
> 
> 
> 
> On Thu, Aug 25, 2011 at 02:25:48PM +0300, Sasha Levin wrote:
> > On Thu, 2011-08-25 at 11:59 +0100, Stefan Hajnoczi wrote:
> > > On Thu, Aug 25, 2011 at 11:37 AM, Pekka Enberg  wrote:
> > > > Hi Stefan,
> > > >
> > > > On Thu, Aug 25, 2011 at 1:31 PM, Stefan Hajnoczi  
> > > > wrote:
> > > >>> It's obviously not competing. One thing you might want to consider is
> > > >>> making the guest interface compatible with ivshmem. Is there any 
> > > >>> reason
> > > >>> we shouldn't do that? I don't consider that a requirement, just nice 
> > > >>> to
> > > >>> have.
> > > >>
> > > >> The point of implementing the same interface as ivshmem is that users
> > > >> don't need to rejig guests or applications in order to switch between
> > > >> hypervisors.  A different interface also prevents same-to-same
> > > >> benchmarks.
> > > >>
> > > >> There is little benefit to creating another virtual device interface
> > > >> when a perfectly good one already exists.  The question should be: how
> > > >> is this shmem device different and better than ivshmem?  If there is
> > > >> no justification then implement the ivshmem interface.
> > > >
> > > > So which interface are we actually taking about? Userspace/kernel in the
> > > > guest or hypervisor/guest kernel?
> > > 
> > > The hardware interface.  Same PCI BAR layout and semantics.
> > > 
> > > > Either way, while it would be nice to share the interface but it's not a
> > > > *requirement* for tools/kvm unless ivshmem is specified in the virtio
> > > > spec or the driver is in mainline Linux. We don't intend to require 
> > > > people
> > > > to implement non-standard and non-Linux QEMU interfaces. OTOH,
> > > > ivshmem would make the PCI ID problem go away.
> > > 
> > > Introducing yet another non-standard and non-Linux interface doesn't
> > > help though.  If there is no significant improvement over ivshmem then
> > > it makes sense to let ivshmem gain critical mass and more users
> > > instead of fragmenting the space.
> > 
> > I support doing it ivshmem-compatible, though it doesn't have to be a
> > requirement right now (that is, use this patch as a base and build it
> > towards ivshmem - which shouldn't be an issue since this patch provides
> > the PCI+SHM parts which are required by ivshmem anyway).
> > 
> > ivshmem is a good, documented, stable interface backed by a lot of
> > research and testing behind it. Looking at the spec it's obvious that
> > Cam had KVM in mind when designing it and thats exactly what we want to
> > have in the KVM tool.
> > 
> > David, did you have any plans to extend it to become ivshmem-compatible?
> > If not, would turning it into such break any code that depends on it
> > horribly?
> > 
> > -- 
> > 
> > Sasha.
> > 

Re: [PATCH 0/3] Emulator fuzz tester

2011-08-25 Thread Lucas Meneghel Rodrigues

On 08/22/2011 10:41 AM, Avi Kivity wrote:

As it is exposed directly to guest code, the x86 emulator is an interesting
target for exploiters: a vulnerability may lead to compromise of the entire
host.

In an attempt to catch vulnerabilities before they make it into production
code, this patchset adds a fuzz tester for the emulator.  Instructions
are synthesized and fed into the emulator; a vulnerability will usually
result in an access violation.

I tried to make the emulator test build and run in userspace; this proved too
difficult, so the test is built as part of the kernel.  It can still be run
in userspace, via KVM:

   qemu -enable-kvm -smp 4 -serial stdio -kernel bzImage \
   -append 'console=ttyS0 test_emulator.iterations=10'

   ...
   starting emulator test
   emulator fuzz test results
 instructions: 10
 decoded:94330032
 emulated:   92529152
 nofault: 117
 failures:  0
   emulator test: PASS
   ...

One billion random instructions failed to find a vulnerability, so either
the emulator is really good, or the test is really bad, or we need a lot more
runtime.

Lucas, how would we go about integrating this into kvm-autotest?


I'm thinking about it. Some ideas that come to my mind:

1) Create a test that boots a bzImage with the emulator params. This way 
we could:

 a) Use the bzImage built for the host, on our daily upstream jobs, or
 b) Have a step that compiles a kernel tree (provided your patch is present 
there) and then pass the resulting bzImage to qemu-kvm.


I'm currently trying out b) manually to get a grasp of how this would work.

Cheers,

Lucas


Re: [PATCH 3/3] KVM: x86 emulator: fuzz tester

2011-08-25 Thread Avi Kivity

On 08/25/2011 07:27 PM, Marcelo Tosatti wrote:

On Mon, Aug 22, 2011 at 04:41:09PM +0300, Avi Kivity wrote:
>  The x86 emulator is directly exposed to guest code; therefore it is part
>  of the directly exposed attack surface.  To reduce the risk of
>  vulnerabilities, this patch adds a fuzz test that runs random instructions
>  through the emulator.  A vulnerability will usually result in an oops.
>
>  + for (i = 0; i < NR_VCPU_REGS; ++i)
>  + ctxt->regs[i] = randlong();
>  + r = x86_decode_insn(ctxt, NULL, 0);

It could rerun N times instructions that have been decoded successfully.
This would increase the chance of testing the code path for that (class
of) instruction.


Good idea.  I'll keep N small (20?) so that we fuzz the decoder as well.
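
Roughly along these lines, reusing the patch's helpers (an untested sketch;
n is a new loop counter and the reset of the decode state between reruns is
elided):

	for (n = 0; n < 20 && r == EMULATION_OK; ++n) {
		/* fresh register state for each rerun of the same bytes */
		for (i = 0; i < NR_VCPU_REGS; ++i)
			ctxt->regs[i] = randlong();
		ctxt->eflags = randlong();
		r = x86_decode_insn(ctxt, NULL, 0);
		if (r == EMULATION_OK)
			x86_emulate_insn(ctxt);
	}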


Also fuzzing from an actual guest is useful to test the real backend
functions. What problem did you encounter? The new testsuite scheme
seems a good fit for that (with the exception of being locked to 32-bit
mode).


Mostly that I forgot it exists.  Other issues are that it's harder to 
force random values through it - though I could allocate a couple GB and 
fill it with random values.  We also lose the ability to test inputs to 
callbacks (not that I do much of that here).


I'll try it out.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: kvm PCI assignment & VFIO ramblings

2011-08-25 Thread Joerg Roedel
On Thu, Aug 25, 2011 at 11:20:30AM -0600, Alex Williamson wrote:
> On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote:

> > We need to solve this differently. ARM is starting to use the iommu-api
> > too and this definitly does not work there. One possible solution might
> > be to make the iommu-ops per-bus.
> 
> That sounds good.  Is anyone working on it?  It seems like it doesn't
> hurt to use this in the interim, we may just be watching the wrong bus
> and never add any sysfs group info.

I'll cook something up for RFC over the weekend.

> > Also the return type should not be long but something that fits into
> > 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
> > choice.
> 
> The convenience of using seg|bus|dev|fn was too much to resist, too bad
> it requires a full 32bits.  Maybe I'll change it to:
> int iommu_device_group(struct device *dev, unsigned int *group)

If we really expect segment numbers that need the full 16 bits then this
would be the way to go. Otherwise I would prefer returning the group-id
directly and partition the group-id space for the error values (s32 with
negative numbers being errors).
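
Spelled out, the two candidate shapes would be:

	/* (a) group via out-parameter, errno-style return */
	int iommu_device_group(struct device *dev, unsigned int *group);

	/* (b) group-id returned directly, negative values reserved for errors */
	s32 iommu_device_group(struct device *dev);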

> > > @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
> > >   printk(KERN_INFO
> > >   "Intel-IOMMU: disable supported super page\n");
> > >   intel_iommu_superpage = 0;
> > > + } else if (!strncmp(str, "no_mf_groups", 12)) {
> > > + printk(KERN_INFO
> > > + "Intel-IOMMU: disable separate groups for multifunction devices\n");
> > > + intel_iommu_no_mf_groups = 1;
> > 
> > This should really be a global iommu option and not be VT-d specific.
> 
> You think?  It's meaningless on benh's power systems.

But it is not meaningless on AMD-Vi systems :) There should be one
option for both.
On the other hand this requires an iommu= parameter on ia64, but that's
probably not that bad.

> > This looks like code duplication in the VT-d driver. It doesn't need to
> > be generalized now, but we should keep in mind to do a more general
> > solution later.
> > Maybe it is beneficial if the IOMMU drivers only setup the number in
> > dev->arch.iommu.groupid and the iommu-api fetches it from there then.
> > But as I said, this is some more work and does not need to be done for
> > this patch(-set).
> 
> The iommu-api reaches into dev->arch.iommu.groupid?  I figured we should
> at least start out with a lightweight, optional interface without the
> overhead of predefining groupids setup by bus notification callbacks in
> each iommu driver.  Thanks,

As I said, this is just an idea for an later optimization. It is fine
for now as it is in this patch.

Joerg



Re: [Qemu-devel] [PATCH] KVM: Add wrapper script around QEMU to test kernels

2011-08-25 Thread Blue Swirl
On Wed, Aug 24, 2011 at 9:38 PM, Alexander Graf  wrote:
> On LinuxCon I had a nice chat with Linus on what he thinks kvm-tool
> would be doing and what he expects from it. Basically he wants a
> small and simple tool he and other developers can run to try out and
> see if the kernel they just built actually works.
>
> Fortunately, QEMU can do that today already! The only piece that was
> missing was the "simple" piece of the equation, so here is a script
> that wraps around QEMU and executes a kernel you just built.
>
> If you do have KVM around and are not cross-compiling, it will use
> KVM. But if you don't, you can still fall back to emulation mode and
> at least check if your kernel still does what you expect. I only
> implemented support for s390x and ppc there, but it's easily extensible
> to more platforms, as QEMU can emulate (and virtualize) pretty much
> any platform out there.
>
> If you don't have qemu installed, please do so before using this script. Your
> distro should provide a package for it (might even call it "kvm"). If not,
> just compile it from source - it's not hard!
>
> To quickly get going, just execute the following as user:
>
>    $ ./Documentation/run-qemu.sh -r / -a init=/bin/bash
>
> This will drop you into a shell on your rootfs.
>
> Happy hacking!
>
> Signed-off-by: Alexander Graf 
>
> ---
>
> v1 -> v2:
>
>  - fix naming of QEMU
>  - use grep -q for has_config
>  - support multiple -a args
>  - spawn gdb on execution
>  - pass through qemu options
>  - dont use qemu-system-x86_64 on i386
>  - add funny sentence to startup text
>  - more helpful error messages
> ---
>  scripts/run-qemu.sh |  334 +++
>  1 files changed, 334 insertions(+), 0 deletions(-)
>  create mode 100755 scripts/run-qemu.sh
>
> diff --git a/scripts/run-qemu.sh b/scripts/run-qemu.sh
> new file mode 100755
> index 000..5d4e185
> --- /dev/null
> +++ b/scripts/run-qemu.sh
> @@ -0,0 +1,334 @@
> +#!/bin/bash
> +#
> +# QEMU Launcher
> +#
> +# This script enables simple use of the KVM and QEMU tool stack for
> +# easy kernel testing. It allows you to pass either a host directory to
> +# the guest or a disk image. Example usage:
> +#
> +# Run the host root fs inside a VM:
> +#
> +# $ ./scripts/run-qemu.sh -r /
> +#
> +# Run the same with SDL:
> +#
> +# $ ./scripts/run-qemu.sh -r / --sdl
> +#
> +# Or with a PPC build:
> +#
> +# $ ARCH=ppc ./scripts/run-qemu.sh -r /
> +#
> +# PPC with a mac99 model by passing options to QEMU:
> +#
> +# $ ARCH=ppc ./scripts/run-qemu.sh -r / -- -M mac99
> +#
> +
> +USE_SDL=
> +USE_VNC=
> +USE_GDB=1
> +KERNEL_BIN=arch/x86/boot/bzImage
> +MON_STDIO=
> +KERNEL_APPEND2=
> +SERIAL=ttyS0
> +SERIAL_KCONFIG=SERIAL_8250
> +BASENAME=$(basename "$0")
> +
> +function usage() {
> +       echo "
> +$BASENAME allows you to execute a virtual machine with the Linux kernel
> +that you just built. To only execute a simple VM, you can just run it
> +on your root fs with \"-r / -a init=/bin/bash\"
> +
> +       -a, --append parameters
> +               Append the given parameters to the kernel command line.
> +
> +       -d, --disk image
> +               Add the image file as disk into the VM.
> +
> +       -D, --no-gdb
> +               Don't run an xterm with gdb attached to the guest.
> +
> +       -r, --root directory
> +               Use the specified directory as root directory inside the guest.
> +
> +       -s, --sdl
> +               Enable SDL graphical output.
> +
> +       -S, --smp cpus
> +               Set number of virtual CPUs.
> +
> +       -v, --vnc
> +               Enable VNC graphical output.
> +
> +Examples:
> +
> +       Run the host root fs inside a VM:
> +       $ ./scripts/run-qemu.sh -r /
> +
> +       Run the same with SDL:
> +       $ ./scripts/run-qemu.sh -r / --sdl
> +
> +       Or with a PPC build:
> +       $ ARCH=ppc ./scripts/run-qemu.sh -r /
> +
> +       PPC with a mac99 model by passing options to QEMU:
> +       $ ARCH=ppc ./scripts/run-qemu.sh -r / -- -M mac99
> +"
> +}
> +
> +function require_config() {
> +       if [ "$(grep CONFIG_$1=y .config)" ]; then
> +               return
> +       fi
> +
> +       echo "You need to enable CONFIG_$1 for run-qemu to work properly"
> +       exit 1
> +}
> +
> +function has_config() {
> +       grep -q "CONFIG_$1=y" .config
> +}
> +
> +function drive_if() {
> +       if has_config VIRTIO_BLK; then
> +               echo virtio
> +       elif has_config ATA_PIIX; then
> +               echo ide
> +       else
> +               echo "\
> +Your kernel must have either VIRTIO_BLK or ATA_PIIX
> +enabled for block device assignment" >&2
> +               exit 1
> +       fi
> +}
> +
> +GETOPT=`getopt -o a:d:Dhr:sS:v --long append:,disk:,no-gdb,help,root:,sdl,smp:,vnc \
> +       -n "$(basename \"$0\")" -- "$@"`
> +
> +if [ $? != 0 ]; then
> +       echo "Terminating..." >&2
> +       exit 1
> +fi
> +
> +eval set -- "$GETOPT"
> +
> +while true; do
> +       case "$1" in
> +   

Re: kvm PCI assignment & VFIO ramblings

2011-08-25 Thread Alex Williamson
On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote:
> Hi Alex,
> 
> On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
> > Is this roughly what you're thinking of for the iommu_group component?
> > Adding a dev_to_group iommu ops callback let's us consolidate the sysfs
> > support in the iommu base.  Would AMD-Vi do something similar (or
> > exactly the same) for group #s?  Thanks,
> 
> The concept looks good, I have some comments, though. On AMD-Vi the
> implementation would look a bit different because there is a
> data-structure were the information can be gathered from, so no need for
> PCI bus scanning there.
> 
> > diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
> > index 6e6b6a1..6b54c1a 100644
> > --- a/drivers/base/iommu.c
> > +++ b/drivers/base/iommu.c
> > @@ -17,20 +17,56 @@
> >   */
> >  
> >  #include <linux/bug.h>
> > +#include <linux/device.h>
> >  #include <linux/types.h>
> >  #include <linux/module.h>
> >  #include <linux/slab.h>
> >  #include <linux/errno.h>
> >  #include <linux/iommu.h>
> > +#include <linux/pci.h>
> >  
> >  static struct iommu_ops *iommu_ops;
> >  
> > +static ssize_t show_iommu_group(struct device *dev,
> > +   struct device_attribute *attr, char *buf)
> > +{
> > +   return sprintf(buf, "%lx", iommu_dev_to_group(dev));
> 
> Probably add a 0x prefix so userspace knows the format?

I think I'll probably change it to %u.  Seems common to have decimal in
sysfs and doesn't get confusing if we cat it with a string.  As a bonus,
it abstracts that vt-d is just stuffing a PCI device address in there,
which nobody should ever rely on.

> > +}
> > +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
> > +
> > +static int add_iommu_group(struct device *dev, void *unused)
> > +{
> > +   if (iommu_dev_to_group(dev) >= 0)
> > +   return device_create_file(dev, &dev_attr_iommu_group);
> > +
> > +   return 0;
> > +}
> > +
> > +static int device_notifier(struct notifier_block *nb,
> > +  unsigned long action, void *data)
> > +{
> > +   struct device *dev = data;
> > +
> > +   if (action == BUS_NOTIFY_ADD_DEVICE)
> > +   return add_iommu_group(dev, NULL);
> > +
> > +   return 0;
> > +}
> > +
> > +static struct notifier_block device_nb = {
> > +   .notifier_call = device_notifier,
> > +};
> > +
> >  void register_iommu(struct iommu_ops *ops)
> >  {
> > if (iommu_ops)
> > BUG();
> >  
> > iommu_ops = ops;
> > +
> > +   /* FIXME - non-PCI, really want for_each_bus() */
> > +   bus_register_notifier(&pci_bus_type, &device_nb);
> > +   bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
> >  }
> 
> We need to solve this differently. ARM is starting to use the iommu-api
> too and this definitly does not work there. One possible solution might
> be to make the iommu-ops per-bus.

That sounds good.  Is anyone working on it?  It seems like it doesn't
hurt to use this in the interim, we may just be watching the wrong bus
and never add any sysfs group info.

> >  bool iommu_found(void)
> > @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
> >  
> > +long iommu_dev_to_group(struct device *dev)
> > +{
> > +   if (iommu_ops->dev_to_group)
> > +   return iommu_ops->dev_to_group(dev);
> > +   return -ENODEV;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_dev_to_group);
> 
> Please rename this to iommu_device_group(). The dev_to_group name
> suggests a conversion but it is actually just a property of the device.

Ok.

> Also the return type should not be long but something that fits into
> 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
> choice.

The convenience of using seg|bus|dev|fn was too much to resist, too bad
it requires a full 32bits.  Maybe I'll change it to:
int iommu_device_group(struct device *dev, unsigned int *group)

> > +
> >  int iommu_map(struct iommu_domain *domain, unsigned long iova,
> >   phys_addr_t paddr, int gfp_order, int prot)
> >  {
> > diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> > index f02c34d..477259c 100644
> > --- a/drivers/pci/intel-iommu.c
> > +++ b/drivers/pci/intel-iommu.c
> > @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
> >  static int dmar_forcedac;
> >  static int intel_iommu_strict;
> >  static int intel_iommu_superpage = 1;
> > +static int intel_iommu_no_mf_groups;
> >  
> >  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
> >  static DEFINE_SPINLOCK(device_domain_lock);
> > @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
> > printk(KERN_INFO
> > "Intel-IOMMU: disable supported super page\n");
> > intel_iommu_superpage = 0;
> > +   } else if (!strncmp(str, "no_mf_groups", 12)) {
> > +   printk(KERN_INFO
> > +   "Intel-IOMMU: disable separate groups for multifunction devices\n");
> > +   intel_iommu_no_mf_groups = 1;
> 
> This should really be a global iommu option and not be VT-d specific.

Re: kvm PCI assignment & VFIO ramblings

2011-08-25 Thread Roedel, Joerg
On Thu, Aug 25, 2011 at 11:38:09AM -0400, Don Dutile wrote:

> On 08/25/2011 06:54 AM, Roedel, Joerg wrote:
> > We need to solve this differently. ARM is starting to use the iommu-api
> > too and this definitly does not work there. One possible solution might
> > be to make the iommu-ops per-bus.
> >
> When you think of a system where there isn't just one bus-type
> with iommu support, it makes more sense.
> Additionally, it also allows the long-term architecture to use different types
> of IOMMUs on each bus segment -- think per-PCIe-switch/bridge IOMMUs --
> esp. 'tuned' IOMMUs -- ones better geared for networks, ones better geared
> for direct-attach disk hba's.

Not sure how likely it is to have different types of IOMMUs within a
given bus-type. But if they become reality we can multiplex in the
iommu-api without much hassle :)
For now, something like bus_set_iommu() or bus_register_iommu() would
provide a nice way to do bus-specific setups for a given iommu
implementation.
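
As a rough sketch, that could look like this (iommu_bus_init() is a
hypothetical helper that would register the notifier and scan the existing
devices on that bus):

	int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops)
	{
		if (bus->iommu_ops != NULL)
			return -EBUSY;	/* one iommu driver per bus */

		bus->iommu_ops = ops;
		return iommu_bus_init(bus, ops);
	}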

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



RE: [Qemu-devel] Guest kernel device compatability auto-detection

2011-08-25 Thread Decker, Schorschi
From a security perspective, this is not a great idea.  Security isolation in 
virtualization is gaining ground, so anything that breaches the 
hypervisor/guest veil is considered completely illegal by your typical 
enterprise/company security team.  A number of firms I have talked with are 
all saying that their respective security teams are raising all kinds of 
hell/red flags, demanding disablement of features that breach that veil.

I would ask that two things be done in the design if it goes forward: 1) have an 
explicit way to disable this feature, where the hypervisor cannot interact with 
the guest OS directly in any way if disablement is selected; 2) implement the 
feature as an agent in the guest OS where the hypervisor can only query the 
guest OS agent, using a standard TCP/IP methodology.  Any under-the-hood, 
under-the-covers methodology, I can tell you for a fact, security teams will 
quash, or demand the feature be disabled.  This also goes for VM-to-VM 
communication that does not use a formal TCP/IP-stack-based method.  Security 
teams want true OS isolation, point blank.


Schorschi Decker

VP; Sr. Consultant Engineer
ECT&O Emerging Technologies / Virtualization Platform Engineering Team
Bank of America

Office 213-345-4714



-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 
Richard W.M. Jones
Sent: Thursday, 25 August, 2011 03:01
To: qemu-de...@nongnu.org
Cc: Avi Kivity; kvm
Subject: Re: [Qemu-devel] Guest kernel device compatability auto-detection

On Thu, Aug 25, 2011 at 08:48:25AM +0100, Richard W.M. Jones wrote:
> On Thu, Aug 25, 2011 at 10:40:34AM +0300, Sasha Levin wrote:
> > From what I gathered libguestfs only provides access to the guests'
> > image.
> 
> Correct.
> 
> > Which part is doing the IKCONFIG or System.map probing? Or is it 
> > done in a different way?
> 
> You'll have to see what Matt's doing in the virt-v2v code for the 
> details, but in general we have full access to:
> 
>  - grub.conf (to determine which kernel will boot)
>  - the kernel image
>  - the corresponding System.map and config
>  - the modules directory
>  - the Xorg config
>  - boot.ini or BCD (to determine which NT kernel will boot)
>  - the Windows Registry
>  - the list of packages installed (to see if VMware-tools or some other
>guest agent is installed)
> 
> So working out what drivers are available is just a tedious matter of 
> iterating across each of these places in the filesystem.

We had some interesting discussion on IRC about this.

Detecting if a guest "supports virtio" is a tricky problem, and it goes beyond 
what the guest kernel can do.  For Linux guests you also need to check what 
userspace can do.  This means unpacking the initrd and checking for virtio 
drivers [in the general case this is intractable, but you can do it for 
specific distros].

You also need to check that udev has the correct rules and that LVM is 
configured to see VGs on /dev/vd* devices.

Console and Xorg configuration may also need to be checked (for virtio-console 
and Cirrus/QXL support resp.)

virt-v2v does quite a lot of work to *enable* virtio drivers
including:

 - possibly installing a new kernel and updating grub

 - rebuilding the initrd to include virtio drivers

 - adjusting many different config files

 - removing other guest tools and Xen drivers

 - reconfiguring SELinux

 - adding viostor driver to Windows and adjusting the Windows Registry
   Critical Device Database

Of course virt-v2v confines itself to specific known guests, and we test it 
like crazy.

Here is the code:

http://git.fedorahosted.org/git/?p=virt-v2v.git;a=blob;f=lib/Sys/VirtConvert/Converter/RedHat.pm;hb=HEAD
http://git.fedorahosted.org/git/?p=virt-v2v.git;a=blob;f=lib/Sys/VirtConvert/Converter/Windows.pm;hb=HEAD

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones 
virt-df lists disk usage of guests without needing to install any software 
inside the virtual machine.  Supports Linux and Windows.
http://et.redhat.com/~rjones/virt-df/


Re: [PATCH 3/3] KVM: x86 emulator: fuzz tester

2011-08-25 Thread Marcelo Tosatti
On Mon, Aug 22, 2011 at 04:41:09PM +0300, Avi Kivity wrote:
> The x86 emulator is directly exposed to guest code; therefore it is part
> of the directly exposed attack surface.  To reduce the risk of
> vulnerabilities, this patch adds a fuzz test that runs random instructions
> through the emulator.  A vulnerability will usually result in an oops.
> 
> One way to run the test is via KVM itself:
> 
>   qemu -enable-kvm -smp 4 -serial stdio -kernel bzImage \
>   -append 'console=ttyS0 test_emulator.iterations=10'
> 
> this requires that the test module be built into the kernel.
> 
> Signed-off-by: Avi Kivity 
> ---
>  arch/x86/Kbuild  |1 +
>  arch/x86/kvm/Kconfig |   11 +
>  arch/x86/kvm/Makefile|1 +
>  arch/x86/kvm/test-emulator.c |  533 ++
>  4 files changed, 546 insertions(+), 0 deletions(-)
>  create mode 100644 arch/x86/kvm/test-emulator.c
> 

> + .fetch = test_fetch,
> + .read_emulated = test_read,
> + .write_emulated = test_write,
> + .cmpxchg_emulated = test_cmpxchg,
> + .invlpg = test_invlpg,
> + .pio_in_emulated = test_pio_in,
> + .pio_out_emulated = test_pio_out,
> + .get_segment = test_get_segment,
> + .set_segment = test_set_segment,
> + .get_cached_segment_base = test_get_cached_segment_base,
> + .get_gdt = test_get_desc_table,
> + .get_idt = test_get_desc_table,
> + .set_gdt = test_set_desc_table,
> + .set_idt = test_set_desc_table,
> + .get_cr = test_get_cr,
> + .set_cr = test_set_cr,
> + .cpl = test_cpl,
> + .get_dr = test_get_dr,
> + .set_dr = test_set_dr,
> + .set_msr = test_set_msr,
> + .get_msr = test_get_msr,
> + .halt = test_halt,
> + .wbinvd = test_wbinvd,
> + .fix_hypercall = test_fix_hypercall,
> + .get_fpu = test_get_fpu,
> + .put_fpu = test_put_fpu,
> + .intercept = test_intercept,
> +};
> +
> +static int modes[] = {
> + X86EMUL_MODE_REAL,
> + X86EMUL_MODE_VM86,
> + X86EMUL_MODE_PROT16,
> + X86EMUL_MODE_PROT32,
> + X86EMUL_MODE_PROT64,
> +};
> +
> +static int test_emulator_one(struct test_context *test)
> +{
> + struct x86_emulate_ctxt *ctxt = &test->ctxt;
> + unsigned i;
> + int r;
> +
> + test->failed = false;
> + i = 0;
> + if (random32() & 1)
> + test->insn[i++] = 0x0f;
> + for (; i < 15; ++i)
> + test->insn[i] = random32();
> + test->insn_base_valid = false;
> + ctxt->ops = &test_ops;
> + ctxt->eflags = randlong();
> + ctxt->eip = randlong();
> + ctxt->mode = modes[random32() % ARRAY_SIZE(modes)];
> + ctxt->guest_mode = random32() % 16 == 0;
> + ctxt->perm_ok = random32() % 16 == 0;
> + ctxt->only_vendor_specific_insn = random32() % 64 == 0;
> + memset(&ctxt->twobyte, 0,
> +(void *)&ctxt->regs - (void *)&ctxt->twobyte);
> + for (i = 0; i < NR_VCPU_REGS; ++i)
> + ctxt->regs[i] = randlong();
> + r = x86_decode_insn(ctxt, NULL, 0);

It could rerun N times instructions that have been decoded successfully.
This would increase the chance of testing the code path for that (class
of) instruction.

Also fuzzing from an actual guest is useful to test the real backend
functions. What problem did you encounter? The new testsuite scheme
seems a good fit for that (with the exception of being locked to 32-bit
mode).


Re: kvm PCI assignment & VFIO ramblings

2011-08-25 Thread Don Dutile

On 08/25/2011 06:54 AM, Roedel, Joerg wrote:

Hi Alex,

On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:

Is this roughly what you're thinking of for the iommu_group component?
Adding a dev_to_group iommu ops callback let's us consolidate the sysfs
support in the iommu base.  Would AMD-Vi do something similar (or
exactly the same) for group #s?  Thanks,


The concept looks good, I have some comments, though. On AMD-Vi the
implementation would look a bit different because there is a
data-structure were the information can be gathered from, so no need for
PCI bus scanning there.


diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
index 6e6b6a1..6b54c1a 100644
--- a/drivers/base/iommu.c
+++ b/drivers/base/iommu.c
@@ -17,20 +17,56 @@
   */

  #include <linux/bug.h>
+#include <linux/device.h>
  #include <linux/types.h>
  #include <linux/module.h>
  #include <linux/slab.h>
  #include <linux/errno.h>
  #include <linux/iommu.h>
+#include <linux/pci.h>

  static struct iommu_ops *iommu_ops;

+static ssize_t show_iommu_group(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   return sprintf(buf, "%lx", iommu_dev_to_group(dev));


Probably add a 0x prefix so userspace knows the format?


+}
+static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
+
+static int add_iommu_group(struct device *dev, void *unused)
+{
+   if (iommu_dev_to_group(dev) >= 0)
+   return device_create_file(dev, &dev_attr_iommu_group);
+
+   return 0;
+}
+
+static int device_notifier(struct notifier_block *nb,
+  unsigned long action, void *data)
+{
+   struct device *dev = data;
+
+   if (action == BUS_NOTIFY_ADD_DEVICE)
+   return add_iommu_group(dev, NULL);
+
+   return 0;
+}
+
+static struct notifier_block device_nb = {
+   .notifier_call = device_notifier,
+};
+
  void register_iommu(struct iommu_ops *ops)
  {
if (iommu_ops)
BUG();

iommu_ops = ops;
+
+   /* FIXME - non-PCI, really want for_each_bus() */
+   bus_register_notifier(&pci_bus_type, &device_nb);
+   bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
  }


We need to solve this differently. ARM is starting to use the iommu-api
too and this definitly does not work there. One possible solution might
be to make the iommu-ops per-bus.


When you think of a system where there isn't just one bus-type
with iommu support, it makes more sense.
Additionally, it also allows the long-term architecture to use different types
of IOMMUs on each bus segment -- think per-PCIe-switch/bridge IOMMUs --
esp. 'tuned' IOMMUs -- ones better geared for networks, ones better geared
for direct-attach disk hba's.



  bool iommu_found(void)
@@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
  }
  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);

+long iommu_dev_to_group(struct device *dev)
+{
+   if (iommu_ops->dev_to_group)
+   return iommu_ops->dev_to_group(dev);
+   return -ENODEV;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_to_group);


Please rename this to iommu_device_group(). The dev_to_group name
suggests a conversion but it is actually just a property of the device.
Also the return type should not be long but something that fits into
32bit on all platforms. Since you use -ENODEV, probably s32 is a good
choice.


+
  int iommu_map(struct iommu_domain *domain, unsigned long iova,
  phys_addr_t paddr, int gfp_order, int prot)
  {
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index f02c34d..477259c 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
  static int dmar_forcedac;
  static int intel_iommu_strict;
  static int intel_iommu_superpage = 1;
+static int intel_iommu_no_mf_groups;

  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
  static DEFINE_SPINLOCK(device_domain_lock);
@@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
printk(KERN_INFO
"Intel-IOMMU: disable supported super page\n");
intel_iommu_superpage = 0;
+   } else if (!strncmp(str, "no_mf_groups", 12)) {
+   printk(KERN_INFO
+   "Intel-IOMMU: disable separate groups for 
multifunction devices\n");
+   intel_iommu_no_mf_groups = 1;


This should really be a global iommu option and not be VT-d specific.



str += strcspn(str, ",");
@@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
return 0;
  }

+/* Group numbers are arbitrary.  Devices with the same group number
+ * indicate the iommu cannot differentiate between them.  To avoid
+ * tracking used groups we just use the seg|bus|devfn of the lowest
+ * level we're able to differentiate devices */
+static long intel_iommu_dev_to_group(struct device *dev)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct pci_d
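
(The function body is cut off above; the seg|bus|devfn packing the comment
describes amounts to, roughly:

	group = (seg << 16) | (bus << 8) | devfn;

taken at the lowest level where the iommu can differentiate devices.)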

Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread David Evensky

Adding in the rest of what ivshmem does shouldn't affect our use, *I
think*.  I hadn't intended this to do everything that ivshmem does,
but I can see how that would be useful. It would be cool if it could
grow into that.

Our requirements for the driver in kvm tool are that another program
on the host can create a shared segment (anonymous, non-file backed)
with a specified handle, size, and contents. That this segment is
available to the guest at boot time at a specified address and that no
driver will change the contents of the memory except under direct user
action. Also, when the guest goes away the shared memory segment
shouldn't be affected (e.g. contents changed). Finally, we cannot
change the lightweight nature of kvm tool.

This is the feature of ivshmem that I need to check today. I did some
testing a month ago, but it wasn't detailed enough to check this out.

\dae




On Thu, Aug 25, 2011 at 02:25:48PM +0300, Sasha Levin wrote:
> On Thu, 2011-08-25 at 11:59 +0100, Stefan Hajnoczi wrote:
> > On Thu, Aug 25, 2011 at 11:37 AM, Pekka Enberg  wrote:
> > > Hi Stefan,
> > >
> > > On Thu, Aug 25, 2011 at 1:31 PM, Stefan Hajnoczi  
> > > wrote:
> > >>> It's obviously not competing. One thing you might want to consider is
> > >>> making the guest interface compatible with ivshmem. Is there any reason
> > >>> we shouldn't do that? I don't consider that a requirement, just nice to
> > >>> have.
> > >>
> > >> The point of implementing the same interface as ivshmem is that users
> > >> don't need to rejig guests or applications in order to switch between
> > >> hypervisors.  A different interface also prevents same-to-same
> > >> benchmarks.
> > >>
> > >> There is little benefit to creating another virtual device interface
> > >> when a perfectly good one already exists.  The question should be: how
> > >> is this shmem device different and better than ivshmem?  If there is
> > >> no justification then implement the ivshmem interface.
> > >
> > > So which interface are we actually taking about? Userspace/kernel in the
> > > guest or hypervisor/guest kernel?
> > 
> > The hardware interface.  Same PCI BAR layout and semantics.
> > 
> > > Either way, while it would be nice to share the interface but it's not a
> > > *requirement* for tools/kvm unless ivshmem is specified in the virtio
> > > spec or the driver is in mainline Linux. We don't intend to require people
> > > to implement non-standard and non-Linux QEMU interfaces. OTOH,
> > > ivshmem would make the PCI ID problem go away.
> > 
> > Introducing yet another non-standard and non-Linux interface doesn't
> > help though.  If there is no significant improvement over ivshmem then
> > it makes sense to let ivshmem gain critical mass and more users
> > instead of fragmenting the space.
> 
> I support doing it ivshmem-compatible, though it doesn't have to be a
> requirement right now (that is, use this patch as a base and build it
> towards ivshmem - which shouldn't be an issue since this patch provides
> the PCI+SHM parts which are required by ivshmem anyway).
> 
> ivshmem is a good, documented, stable interface backed by a lot of
> research and testing behind it. Looking at the spec it's obvious that
> Cam had KVM in mind when designing it and thats exactly what we want to
> have in the KVM tool.
> 
> David, did you have any plans to extend it to become ivshmem-compatible?
> If not, would turning it into such break any code that depends on it
> horribly?
> 
> -- 
> 
> Sasha.
> 


Re: Emulating LWZU Instruction for e500 powerpc

2011-08-25 Thread Alexander Graf

On 25.08.2011, at 04:30, Aashish Mittal wrote:

> 
> On Thu, Aug 25, 2011 at 4:04 AM, Alexander Graf  wrote:
> 
> On 19.08.2011, at 06:45, Aashish Mittal wrote:
> 
> > Hi
> > I'm trying to emulate the lwzu instruction in e500 powerpc kvm for my 
> > project.
> > I've removed the read and write privileges from the tlb entries of certain
> > guest pages. So when I'm trying to emulate the lwzu instruction I'm getting a
> > kernel panic while mounting the guest filesystem during boot.
> >
> > attempt to access beyond end of device
> > ram0: rw=0, want=75703268, limit=262144
> >
> > To make sure that the emulation is faulty, what I'm trying to do now is: at
> > the time of a DATA STORAGE exit on a page marked by an lwzu instruction, I
> > patch the next instruction with one that will raise an INTERRUPT PROGRAM
> > EXCEPTION and get trapped in kvm; I then revert the old read and write
> > privileges of this page and resume the guest so that this lwzu instruction
> > can run natively. I'm expecting the immediate next instruction to raise the
> > INTERRUPT PROGRAM EXCEPTION, but all I'm getting are DATA STORAGE exits at
> > other pages marked by me, and DTLB and ITLB misses on other addresses.
> >
> > I've made sure to flush the icache after i patch using the instruction
> > flush_icache_range .
> >
> > Error Log :
> > Emulating a lwzu instruction on pc 0xc00161ac && eaddr 0xc05742f0
> > Original Instruction is 0x90e60004 at pc: 0xc00161b0
> > Modified Instruction is 0x7ce000a6 at pc: 0xc00161b0
> > Exit : Interrupt DATA STORAGE at pc 0xc000f210 on eaddr:0xc000f228 
> > instruction:
> > 0x8085001c
> >
> > Why am I not getting an INTERRUPT PROGRAM EXCEPTION immediately on the next
> > instruction?
> 
> Hrm. Are you sure you're actually modifying the instruction? This looks like 
> you're running Linux, so you could try and just put a "b ." instruction right 
> after the instruction you're trying to patch up and examine memory from Qemu 
> :)
> 
> 
> Alex
> 
> I checked the memory by reading the guest page at the modified instruction 
> and it is the modified instruction. I think the reason I'm not getting the 
> EXIT is because the icache is not getting flushed properly. 
> I'm trying to flush the icache entries of the guest while I'm inside the 
> host kvm. I'm trying to use flush_icache_range(pc, pc + 4) where this pc 
> is the guest pc eaddr where I'm modifying the instruction. 
> 
> Do you have any idea how to flush the guest entries while I'm inside the host 
> OS, given the guest eaddr of the address where I'm trying to patch the 
> instruction?

You need to do the icache flush on the effective address you were writing to, 
not the guest effective address :)
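
In other words, roughly this, where host_va is whatever host-side mapping
you used to patch the instruction (hypothetical names):

	*(u32 *)host_va = patched_insn;
	flush_icache_range((unsigned long)host_va,
			   (unsigned long)host_va + 4);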


Alex



Re: [Qemu-devel] Questions regarding ivshmem spec

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 17:40 +0300, Avi Kivity wrote:
> On 08/25/2011 05:39 PM, Sasha Levin wrote:
> > On Thu, 2011-08-25 at 17:00 +0300, Avi Kivity wrote:
> > >  On 08/25/2011 04:29 PM, Sasha Levin wrote:
> > >  >  2. The spec describes DOORBELL as an array of DWORDs, when one guest
> > >  >  wants to poke a different guest it would write something into the 
> > > offset
> > >  >  of the other guest in the DOORBELL array.
> > >  >  Looking at the implementation in QEMU, DOORBELL is one DWORD, when
> > >  >  writing to it the upper WORD is the guest id and the lower WORD is the
> > >  >  value.
> > >  >  What am I missing here?
> > >  >
> > >
> > >  The spec in qemu.git is accurate.  The intent is to use an ioeventfd
> > >  bound into an irqfd so a write into the doorbell injects an interrupt
> > >  directly into the other guest, without going through qemu^Wkvm tool.
> > >
> >
> > But the doorbell is a single DWORD, so if a guest writes to it we'd
> > still need to figure out which guest/vector he wants to poke from
> > userspace, no?
> >
> > If it was an array of doorbells then yes, we could assign an ioeventfd
> > to each offset - but now I don't quite see how we can avoid passing
> > through the userspace.
> >
> 
> Use the datamatch facility.
> 
> We didn't want an array of registers to avoid scaling issues (PIO space 
> is quite small).
> 
> 

Ah, right.

Thanks!

-- 

Sasha.



Re: [Qemu-devel] Questions regarding ivshmem spec

2011-08-25 Thread Avi Kivity

On 08/25/2011 05:39 PM, Sasha Levin wrote:

On Thu, 2011-08-25 at 17:00 +0300, Avi Kivity wrote:
>  On 08/25/2011 04:29 PM, Sasha Levin wrote:
>  >  2. The spec describes DOORBELL as an array of DWORDs, when one guest
>  >  wants to poke a different guest it would write something into the offset
>  >  of the other guest in the DOORBELL array.
>  >  Looking at the implementation in QEMU, DOORBELL is one DWORD, when
>  >  writing to it the upper WORD is the guest id and the lower WORD is the
>  >  value.
>  >  What am I missing here?
>  >
>
>  The spec in qemu.git is accurate.  The intent is to use an ioeventfd
>  bound into an irqfd so a write into the doorbell injects an interrupt
>  directly into the other guest, without going through qemu^Wkvm tool.
>

But the doorbell is a single DWORD, so if a guest writes to it we'd
still need to figure out which guest/vector he wants to poke from
userspace, no?

If it was an array of doorbells then yes, we could assign an ioeventfd
to each offset - but now I don't quite see how we can avoid passing
through the userspace.



Use the datamatch facility.

We didn't want an array of registers to avoid scaling issues (PIO space 
is quite small).
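
On the userspace side it looks roughly like this (vm_fd, doorbell_addr,
peer_id, vector and peer_fd are all illustrative assumptions; needs
<linux/kvm.h> and <sys/ioctl.h>):

	struct kvm_ioeventfd args = {
		.datamatch = ((__u64)peer_id << 16) | vector,	/* guest id | value */
		.addr      = doorbell_addr,	/* the single DOORBELL register */
		.len       = 4,
		.fd        = peer_fd,		/* eventfd wired up as the peer's irqfd */
		.flags     = KVM_IOEVENTFD_FLAG_DATAMATCH,
	};

	if (ioctl(vm_fd, KVM_IOEVENTFD, &args) < 0)
		perror("KVM_IOEVENTFD");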



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



[PATCH 03/14] KVM: PPC: Check privilege level on SPRs

2011-08-25 Thread Alexander Graf
We have 3 privilege levels: problem state, supervisor state and hypervisor
state. Each of them can access different SPRs, so we need to check on every
SPR if it's accessible in the respective mode.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/book3s_emulate.c |   25 +
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index 4668465..bf0ddcd 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -63,6 +63,25 @@
  * function pointers, so let's just disable the define. */
 #undef mfsrin
 
+enum priv_level {
+   PRIV_PROBLEM = 0,
+   PRIV_SUPER = 1,
+   PRIV_HYPER = 2,
+};
+
+static bool spr_allowed(struct kvm_vcpu *vcpu, enum priv_level level)
+{
+   /* PAPR VMs only access supervisor SPRs */
+   if (vcpu->arch.papr_enabled && (level > PRIV_SUPER))
+   return false;
+
+   /* Limit user space to its own small SPR set */
+   if ((vcpu->arch.shared->msr & MSR_PR) && level > PRIV_PROBLEM)
+   return false;
+
+   return true;
+}
+
 int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
unsigned int inst, int *advance)
 {
@@ -296,6 +315,8 @@ int kvmppc_core_emulate_mtspr(struct kvm_vcpu *vcpu, int 
sprn, int rs)
 
switch (sprn) {
case SPRN_SDR1:
+   if (!spr_allowed(vcpu, PRIV_HYPER))
+   goto unprivileged;
to_book3s(vcpu)->sdr1 = spr_val;
break;
case SPRN_DSISR:
@@ -390,6 +411,7 @@ int kvmppc_core_emulate_mtspr(struct kvm_vcpu *vcpu, int 
sprn, int rs)
case SPRN_PMC4_GEKKO:
case SPRN_WPAR_GEKKO:
break;
+unprivileged:
default:
printk(KERN_INFO "KVM: invalid SPR write: %d\n", sprn);
 #ifndef DEBUG_SPR
@@ -421,6 +443,8 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int 
sprn, int rt)
break;
}
case SPRN_SDR1:
+   if (!spr_allowed(vcpu, PRIV_HYPER))
+   goto unprivileged;
kvmppc_set_gpr(vcpu, rt, to_book3s(vcpu)->sdr1);
break;
case SPRN_DSISR:
@@ -476,6 +500,7 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int 
sprn, int rt)
kvmppc_set_gpr(vcpu, rt, 0);
break;
default:
+unprivileged:
printk(KERN_INFO "KVM: invalid SPR read: %d\n", sprn);
 #ifndef DEBUG_SPR
emulated = EMULATE_FAIL;
-- 
1.6.0.2



[PATCH 07/14] KVM: PPC: Add PAPR hypercall code for PR mode

2011-08-25 Thread Alexander Graf
When running a PAPR guest, we need to handle a few hypercalls in kernel space,
most prominently the page table invalidation (to sync the shadows).

So this patch adds handling for a few PAPR hypercalls to PR mode KVM. I tried
to share the code with HV mode, but it ended up being a lot easier this way
around, as the two differ too much in those details.

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - whitespace fix
---
 arch/powerpc/include/asm/kvm_book3s.h |1 +
 arch/powerpc/kvm/Makefile |1 +
 arch/powerpc/kvm/book3s_pr_papr.c |  158 +
 3 files changed, 160 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_pr_papr.c

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 472437b..91d41fa 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -150,6 +150,7 @@ extern void kvmppc_load_up_altivec(void);
 extern void kvmppc_load_up_vsx(void);
 extern u32 kvmppc_alignment_dsisr(struct kvm_vcpu *vcpu, unsigned int inst);
 extern ulong kvmppc_alignment_dar(struct kvm_vcpu *vcpu, unsigned int inst);
+extern int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd);
 
 static inline struct kvmppc_vcpu_book3s *to_book3s(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 08428e2..4c66d51 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -43,6 +43,7 @@ kvm-book3s_64-objs-$(CONFIG_KVM_BOOK3S_64_PR) := \
fpu.o \
book3s_paired_singles.o \
book3s_pr.o \
+   book3s_pr_papr.o \
book3s_emulate.o \
book3s_interrupts.o \
book3s_mmu_hpte.o \
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c 
b/arch/powerpc/kvm/book3s_pr_papr.c
new file mode 100644
index 000..b958932
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -0,0 +1,158 @@
+/*
+ * Copyright (C) 2011. Freescale Inc. All rights reserved.
+ *
+ * Authors:
+ *Alexander Graf 
+ *Paul Mackerras 
+ *
+ * Description:
+ *
+ * Hypercall handling for running PAPR guests in PR KVM on Book 3S
+ * processors.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#include <asm/uaccess.h>
+#include <asm/kvm_ppc.h>
+#include <asm/kvm_book3s.h>
+
+static unsigned long get_pteg_addr(struct kvm_vcpu *vcpu, long pte_index)
+{
+   struct kvmppc_vcpu_book3s *vcpu_book3s = to_book3s(vcpu);
+   unsigned long pteg_addr;
+
+   pte_index <<= 4;
+   pte_index &= ((1 << ((vcpu_book3s->sdr1 & 0x1f) + 11)) - 1) << 7 | 0x70;
+   pteg_addr = vcpu_book3s->sdr1 & 0xfffcULL;
+   pteg_addr |= pte_index;
+
+   return pteg_addr;
+}
+
+static int kvmppc_h_pr_enter(struct kvm_vcpu *vcpu)
+{
+   long flags = kvmppc_get_gpr(vcpu, 4);
+   long pte_index = kvmppc_get_gpr(vcpu, 5);
+   unsigned long pteg[2 * 8];
+   unsigned long pteg_addr, i, *hpte;
+
+   pte_index &= ~7UL;
+   pteg_addr = get_pteg_addr(vcpu, pte_index);
+
+   copy_from_user(pteg, (void __user *)pteg_addr, sizeof(pteg));
+   hpte = pteg;
+
+   if (likely((flags & H_EXACT) == 0)) {
+   pte_index &= ~7UL;
+   for (i = 0; ; ++i) {
+   if (i == 8)
+   return H_PTEG_FULL;
+   if ((*hpte & HPTE_V_VALID) == 0)
+   break;
+   hpte += 2;
+   }
+   } else {
+   i = kvmppc_get_gpr(vcpu, 5) & 7UL;
+   hpte += i * 2;
+   }
+
+   hpte[0] = kvmppc_get_gpr(vcpu, 6);
+   hpte[1] = kvmppc_get_gpr(vcpu, 7);
+   copy_to_user((void __user *)pteg_addr, pteg, sizeof(pteg));
+   kvmppc_set_gpr(vcpu, 3, H_SUCCESS);
+   kvmppc_set_gpr(vcpu, 4, pte_index | i);
+
+   return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_remove(struct kvm_vcpu *vcpu)
+{
+   unsigned long flags = kvmppc_get_gpr(vcpu, 4);
+   unsigned long pte_index = kvmppc_get_gpr(vcpu, 5);
+   unsigned long avpn = kvmppc_get_gpr(vcpu, 6);
+   unsigned long v = 0, pteg, rb;
+   unsigned long pte[2];
+
+   pteg = get_pteg_addr(vcpu, pte_index);
+   copy_from_user(pte, (void __user *)pteg, sizeof(pte));
+
+   if ((pte[0] & HPTE_V_VALID) == 0 ||
+   ((flags & H_AVPN) && (pte[0] & ~0x7fUL) != avpn) ||
+   ((flags & H_ANDCOND) && (pte[0] & avpn) != 0)) {
+   kvmppc_set_gpr(vcpu, 3, H_NOT_FOUND);
+   return EMULATE_DONE;
+   }
+
+   copy_to_user((void __user *)pteg, &v, sizeof(v));
+
+   rb = compute_tlbie_rb(pte[0], pte[1], pte_index);
+   vcpu->arch.mmu.tlbie(vcpu, rb, rb & 1 ? true : false);
+
+   kvmppc_set_gpr(vcpu, 3, H_SUCCESS);
+   kvmppc_set_gpr(vcpu, 4, pte[0]);
+   kvmppc_set_gpr(vcpu, 5, pte[1]);
+
+   return EMULATE_DONE;
+

[PATCH 01/14] KVM: PPC: move compute_tlbie_rb to book3s common header

2011-08-25 Thread Alexander Graf
We need the compute_tlbie_rb in _pr and _hv implementations for papr
soon, so let's move it over to a common header file that both
implementations can leverage.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/kvm_book3s.h |   33 +
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   33 -
 2 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 98da010..37dd748 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -382,6 +382,39 @@ static inline bool kvmppc_critical_section(struct kvm_vcpu 
*vcpu)
 }
 #endif
 
+static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
+unsigned long pte_index)
+{
+   unsigned long rb, va_low;
+
+   rb = (v & ~0x7fUL) << 16;   /* AVA field */
+   va_low = pte_index >> 3;
+   if (v & HPTE_V_SECONDARY)
+   va_low = ~va_low;
+   /* xor vsid from AVA */
+   if (!(v & HPTE_V_1TB_SEG))
+   va_low ^= v >> 12;
+   else
+   va_low ^= v >> 24;
+   va_low &= 0x7ff;
+   if (v & HPTE_V_LARGE) {
+   rb |= 1;/* L field */
+   if (cpu_has_feature(CPU_FTR_ARCH_206) &&
+   (r & 0xff000)) {
+   /* non-16MB large page, must be 64k */
+   /* (masks depend on page size) */
+   rb |= 0x1000;   /* page encoding in LP field */
+   rb |= (va_low & 0x7f) << 16; /* 7b of VA in AVA/LP field */
+   rb |= (va_low & 0xfe);  /* AVAL field (P7 doesn't seem to care) */
+   }
+   } else {
+   /* 4kB page */
+   rb |= (va_low & 0x7ff) << 12;   /* remaining 11b of VA */
+   }
+   rb |= (v >> 54) & 0x300;/* B field */
+   return rb;
+}
+
 /* Magic register values loaded into r3 and r4 before the 'sc' assembly
  * instruction for the OSI hypercalls */
 #define OSI_SC_MAGIC_R30x113724FA
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index fcfe6b0..bacb0cf 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -110,39 +110,6 @@ long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long 
flags,
return H_SUCCESS;
 }
 
-static unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
- unsigned long pte_index)
-{
-   unsigned long rb, va_low;
-
-   rb = (v & ~0x7fUL) << 16;   /* AVA field */
-   va_low = pte_index >> 3;
-   if (v & HPTE_V_SECONDARY)
-   va_low = ~va_low;
-   /* xor vsid from AVA */
-   if (!(v & HPTE_V_1TB_SEG))
-   va_low ^= v >> 12;
-   else
-   va_low ^= v >> 24;
-   va_low &= 0x7ff;
-   if (v & HPTE_V_LARGE) {
-   rb |= 1;/* L field */
-   if (cpu_has_feature(CPU_FTR_ARCH_206) &&
-   (r & 0xff000)) {
-   /* non-16MB large page, must be 64k */
-   /* (masks depend on page size) */
-   rb |= 0x1000;   /* page encoding in LP field */
-   rb |= (va_low & 0x7f) << 16; /* 7b of VA in AVA/LP field */
-   rb |= (va_low & 0xfe);  /* AVAL field (P7 doesn't seem to care) */
-   }
-   } else {
-   /* 4kB page */
-   rb |= (va_low & 0x7ff) << 12;   /* remaining 11b of VA */
-   }
-   rb |= (v >> 54) & 0x300;/* B field */
-   return rb;
-}
-
 #define LOCK_TOKEN (*(u32 *)(&get_paca()->lock_token))
 
 static inline int try_lock_tlbie(unsigned int *lock)
-- 
1.6.0.2



[PATCH 05/14] KVM: PPC: Read out syscall instruction on trap

2011-08-25 Thread Alexander Graf
We have a few traps where we cache the instruction that caused the trap
for analysis later on. Since we now need to be able to distinguish
between SC 0 and SC 1 system calls and the only way to find out which
is which is by looking at the instruction, we also read out the instruction
causing the system call.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/book3s_segment.S |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_segment.S 
b/arch/powerpc/kvm/book3s_segment.S
index aed32e5..678b6be 100644
--- a/arch/powerpc/kvm/book3s_segment.S
+++ b/arch/powerpc/kvm/book3s_segment.S
@@ -213,11 +213,16 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
beq ld_last_inst
cmpwi   r12, BOOK3S_INTERRUPT_PROGRAM
beq ld_last_inst
+   cmpwi   r12, BOOK3S_INTERRUPT_SYSCALL
+   beq ld_last_prev_inst
cmpwi   r12, BOOK3S_INTERRUPT_ALIGNMENT
beq-ld_last_inst
 
b   no_ld_last_inst
 
+ld_last_prev_inst:
+   addir3, r3, -4
+
 ld_last_inst:
/* Save off the guest instruction we're at */
 
-- 
1.6.0.2



[PATCH 14/14] KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code

2011-08-25 Thread Alexander Graf
From: Paul Mackerras 

With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
core), whenever a CPU goes idle, we have to pull all the other
hardware threads in the core out of the guest, because the H_CEDE
hcall is handled in the kernel.  This is inefficient.

This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
in real mode.  When a guest vcpu does an H_CEDE hcall, we now only
exit to the kernel if all the other vcpus in the same core are also
idle.  Otherwise we mark this vcpu as napping, save state that could
be lost in nap mode (mainly GPRs and FPRs), and execute the nap
instruction.  When the thread wakes up, because of a decrementer or
external interrupt, we come back in at kvm_start_guest (from the
system reset interrupt vector), find the `napping' flag set in the
paca, and go to the resume path.

This has some other ramifications.  First, when starting a core, we
now start all the threads, both those that are immediately runnable and
those that are idle.  This is so that we don't have to pull all the
threads out of the guest when an idle thread gets a decrementer interrupt
and wants to start running.  In fact the idle threads will all start
with the H_CEDE hcall returning; being idle they will just do another
H_CEDE immediately and go to nap mode.

This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
These functions have been restructured to make them simpler and clearer.
We introduce a level of indirection in the wait queue that gets woken
when external and decrementer interrupts get generated for a vcpu, so
that we can have the 4 vcpus in a vcore using the same wait queue.
We need this because the 4 vcpus are being handled by one thread.

Secondly, when we need to exit from the guest to the kernel, we now
have to generate an IPI for any napping threads, because an HDEC
interrupt doesn't wake up a napping thread.

Thirdly, we now need to be able to handle virtual external interrupts
and decrementer interrupts becoming pending while a thread is napping,
and deliver those interrupts to the guest when the thread wakes.
This is done in kvmppc_cede_reentry, just before fast_guest_return.

Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
from kvm_arch_vcpu_runnable.
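
The nap decision itself boils down to something like the following
sketch (hypothetical helper, much simplified from the real-mode
assembly; napping_threads and num_threads are the vcore fields added
below):

/* Nap this hardware thread on H_CEDE unless every sibling in the
 * vcore is already napping; in that case exit to the kernel so the
 * whole core can sleep. */
static int cede_should_nap(struct kvmppc_vcore *vc, int thread)
{
	unsigned long all = (1UL << vc->num_threads) - 1;
	unsigned long napping = vc->napping_threads | (1UL << thread);

	return (napping & all) != all;	/* a sibling is still active */
}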

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/kvm_book3s_asm.h |1 +
 arch/powerpc/include/asm/kvm_host.h   |   19 ++-
 arch/powerpc/kernel/asm-offsets.c |6 +
 arch/powerpc/kvm/book3s_hv.c  |  335 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  297 ++---
 arch/powerpc/kvm/powerpc.c|   21 +-
 6 files changed, 483 insertions(+), 196 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
index af73469..1f2f5b6 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -76,6 +76,7 @@ struct kvmppc_host_state {
ulong scratch1;
u8 in_guest;
u8 restore_hid5;
+   u8 napping;
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
struct kvm_vcpu *kvm_vcpu;
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index dec3054..bf8af5d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -198,21 +198,29 @@ struct kvm_arch {
  */
 struct kvmppc_vcore {
int n_runnable;
-   int n_blocked;
+   int n_busy;
int num_threads;
int entry_exit_count;
int n_woken;
int nap_count;
+   int napping_threads;
u16 pcpu;
-   u8 vcore_running;
+   u8 vcore_state;
u8 in_guest;
struct list_head runnable_threads;
spinlock_t lock;
+   wait_queue_head_t wq;
 };
 
 #define VCORE_ENTRY_COUNT(vc)  ((vc)->entry_exit_count & 0xff)
 #define VCORE_EXIT_COUNT(vc)   ((vc)->entry_exit_count >> 8)
 
+/* Values for vcore_state */
+#define VCORE_INACTIVE 0
+#define VCORE_RUNNING  1
+#define VCORE_EXITING  2
+#define VCORE_SLEEPING 3
+
 struct kvmppc_pte {
ulong eaddr;
u64 vpage;
@@ -403,11 +411,13 @@ struct kvm_vcpu_arch {
struct dtl *dtl;
struct dtl *dtl_end;
 
+   wait_queue_head_t *wqp;
struct kvmppc_vcore *vcore;
int ret;
int trap;
int state;
int ptid;
+   bool timer_running;
wait_queue_head_t cpu_run;
 
struct kvm_vcpu_arch_shared *shared;
@@ -423,8 +433,9 @@ struct kvm_vcpu_arch {
 #endif
 };
 
-#define KVMPPC_VCPU_BUSY_IN_HOST   0
-#define KVMPPC_VCPU_BLOCKED1
+/* Values for vcpu->arch.state */
+#define KVMPPC_VCPU_STOPPED0
+#define KVMPPC_VCPU_BUSY_IN_HOST   1
 #define KVMPPC_VCPU_RUNNABLE   2
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arc

[PATCH 10/14] KVM: PPC: Enable the PAPR CAP for Book3S

2011-08-25 Thread Alexander Graf
Now that Book3S PV mode can also run PAPR guests, we can add a PAPR cap and
enable it for all Book3S targets. Enabling that CAP switches KVM into PAPR
mode.
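
From user space, enabling the cap would look roughly like this sketch
(vcpu_fd is assumed to be an open vcpu file descriptor, error handling
omitted):

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_PPC_PAPR,
	};

	if (ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap) < 0)
		perror("KVM_ENABLE_CAP");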

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/powerpc.c |5 +
 include/linux/kvm.h|1 +
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 17a5c83..13bc798 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -189,6 +189,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 #else
case KVM_CAP_PPC_SEGSTATE:
case KVM_CAP_PPC_HIOR:
+   case KVM_CAP_PPC_PAPR:
 #endif
case KVM_CAP_PPC_UNSET_IRQ:
case KVM_CAP_PPC_IRQ_LEVEL:
@@ -572,6 +573,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
r = 0;
vcpu->arch.osi_enabled = true;
break;
+   case KVM_CAP_PPC_PAPR:
+   r = 0;
+   vcpu->arch.papr_enabled = true;
+   break;
default:
r = -EINVAL;
break;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 4d33f78..2d7161c 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -555,6 +555,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_RMA    65
 #define KVM_CAP_MAX_VCPUS 66   /* returns max vcpus per vm */
 #define KVM_CAP_PPC_HIOR 67
+#define KVM_CAP_PPC_PAPR 68
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.6.0.2



[PATCH 12/14] KVM: PPC: Assemble book3s{,_hv}_rmhandlers.S separately

2011-08-25 Thread Alexander Graf
From: Paul Mackerras 

This makes arch/powerpc/kvm/book3s_rmhandlers.S and
arch/powerpc/kvm/book3s_hv_rmhandlers.S be assembled as
separate compilation units rather than having them #included in
arch/powerpc/kernel/exceptions-64s.S.  We no longer have any
conditional branches between the exception prologs in
exceptions-64s.S and the KVM handlers, so there is no need to
keep their contents close together in the vmlinux image.

In their current location, they are using up part of the limited
space between the first-level interrupt handlers and the firmware
NMI data area at offset 0x7000, and with some kernel configurations
this area will overflow (e.g. allyesconfig), leading to an
"attempt to .org backwards" error when compiling exceptions-64s.S.

Moving them out requires that we add some #includes that the
book3s_{,hv_}rmhandlers.S code was previously getting implicitly
via exceptions-64s.S.

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/kernel/exceptions-64s.S|   10 --
 arch/powerpc/kvm/Makefile   |3 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |3 +++
 arch/powerpc/kvm/book3s_rmhandlers.S|3 +++
 4 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 41b02c7..29ddd8b 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -427,16 +427,6 @@ slb_miss_user_pseries:
b   .   /* prevent spec. execution */
 #endif /* __DISABLED__ */
 
-/* KVM's trampoline code needs to be close to the interrupt handlers */
-
-#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
-#ifdef CONFIG_KVM_BOOK3S_PR
-#include "../kvm/book3s_rmhandlers.S"
-#else
-#include "../kvm/book3s_hv_rmhandlers.S"
-#endif
-#endif
-
.align  7
.globl  __end_interrupts
 __end_interrupts:
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 4c66d51..3688aee 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -50,12 +50,15 @@ kvm-book3s_64-objs-$(CONFIG_KVM_BOOK3S_64_PR) := \
book3s_64_mmu_host.o \
book3s_64_mmu.o \
book3s_32_mmu.o
+kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_PR) := \
+   book3s_rmhandlers.o
 
 kvm-book3s_64-objs-$(CONFIG_KVM_BOOK3S_64_HV) := \
book3s_hv.o \
book3s_hv_interrupts.o \
book3s_64_mmu_hv.o
 kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HV) := \
+   book3s_hv_rmhandlers.o \
book3s_hv_rm_mmu.o \
book3s_64_vio_hv.o \
book3s_hv_builtin.o
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 6dd3358..543ee50 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -20,7 +20,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
+#include 
 #include 
 #include 
 
diff --git a/arch/powerpc/kvm/book3s_rmhandlers.S b/arch/powerpc/kvm/book3s_rmhandlers.S
index c1f877c..5ee66ed 100644
--- a/arch/powerpc/kvm/book3s_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_rmhandlers.S
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -39,6 +40,7 @@
 #define MSR_NOIRQ  MSR_KERNEL & ~(MSR_IR | MSR_DR)
 #define FUNC(name) GLUE(.,name)
 
+   .globl  kvmppc_skip_interrupt
 kvmppc_skip_interrupt:
/*
 * Here all GPRs are unchanged from when the interrupt happened
@@ -51,6 +53,7 @@ kvmppc_skip_interrupt:
rfid
b   .
 
+   .globl  kvmppc_skip_Hinterrupt
 kvmppc_skip_Hinterrupt:
/*
 * Here all GPRs are unchanged from when the interrupt happened
-- 
1.6.0.2



[PATCH 11/14] KVM: PPC: Add sanity checking to vcpu_run

2011-08-25 Thread Alexander Graf
There are multiple features in PowerPC KVM that can now be enabled
depending on the user's wishes. Some of the combinations don't make
sense or don't work though.

So this patch adds a way to check if the executing environment would
actually be able to run the guest properly. It also adds sanity checks
that PVR is set (which should always be true given the current code
flow), that PAPR is only used with book3s_64 where it works, and that
HV KVM is only used in PAPR mode.
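
In pseudo-C, the checks amount to roughly this (a sketch of the
intent, not the exact body of kvmppc_sanity_check()):

static int sanity_check_sketch(struct kvm_vcpu *vcpu, int is_hv)
{
	int r = 0;

	if (!vcpu->arch.cpu_type)		/* PVR must have been set */
		r = -EINVAL;
	if (vcpu->arch.papr_enabled &&
	    vcpu->arch.cpu_type != KVM_CPU_3S_64)
		r = -EINVAL;			/* PAPR needs book3s_64 */
	if (is_hv && !vcpu->arch.papr_enabled)
		r = -EINVAL;			/* HV KVM implies PAPR */

	vcpu->arch.sane = (r == 0);
	return r;
}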

Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/kvm.h  |5 +
 arch/powerpc/include/asm/kvm_host.h |2 ++
 arch/powerpc/include/asm/kvm_ppc.h  |1 +
 arch/powerpc/kvm/44x.c  |2 ++
 arch/powerpc/kvm/book3s_hv.c|8 
 arch/powerpc/kvm/book3s_pr.c|   10 ++
 arch/powerpc/kvm/booke.c|   10 +-
 arch/powerpc/kvm/e500.c |2 ++
 arch/powerpc/kvm/powerpc.c  |   28 
 9 files changed, 67 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index a6a253e..08fe69e 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -284,6 +284,11 @@ struct kvm_guest_debug_arch {
 #define KVM_INTERRUPT_UNSET    -2U
 #define KVM_INTERRUPT_SET_LEVEL    -3U
 
+#define KVM_CPU_440    1
+#define KVM_CPU_E500V2 2
+#define KVM_CPU_3S_32  3
+#define KVM_CPU_3S_64  4
+
 /* for KVM_CAP_SPAPR_TCE */
 struct kvm_create_spapr_tce {
__u64 liobn;
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e681302..2b8284f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -390,6 +390,8 @@ struct kvm_vcpu_arch {
u8 osi_needed;
u8 osi_enabled;
u8 papr_enabled;
+   u8 sane;
+   u8 cpu_type;
u8 hcall_needed;
 
u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index d121f49..46efd1a 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -66,6 +66,7 @@ extern int kvmppc_emulate_instruction(struct kvm_run *run,
 extern int kvmppc_emulate_mmio(struct kvm_run *run, struct kvm_vcpu *vcpu);
 extern void kvmppc_emulate_dec(struct kvm_vcpu *vcpu);
 extern u32 kvmppc_get_dec(struct kvm_vcpu *vcpu, u64 tb);
+extern int kvmppc_sanity_check(struct kvm_vcpu *vcpu);
 
 /* Core-specific hooks */
 
diff --git a/arch/powerpc/kvm/44x.c b/arch/powerpc/kvm/44x.c
index da3a122..ca1f88b 100644
--- a/arch/powerpc/kvm/44x.c
+++ b/arch/powerpc/kvm/44x.c
@@ -78,6 +78,8 @@ int kvmppc_core_vcpu_setup(struct kvm_vcpu *vcpu)
for (i = 0; i < ARRAY_SIZE(vcpu_44x->shadow_refs); i++)
vcpu_44x->shadow_refs[i].gtlb_index = -1;
 
+   vcpu->arch.cpu_type = KVM_CPU_440;
+
return 0;
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cc0d7f1..bf66ec7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -510,6 +510,9 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
spin_unlock(&vcore->lock);
vcpu->arch.vcore = vcore;
 
+   vcpu->arch.cpu_type = KVM_CPU_3S_64;
+   kvmppc_sanity_check(vcpu);
+
return vcpu;
 
 free_vcpu:
@@ -800,6 +803,11 @@ int kvmppc_vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu)
 {
int r;
 
+   if (!vcpu->arch.sane) {
+   run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+   return -EINVAL;
+   }
+
do {
r = kvmppc_run_vcpu(run, vcpu);
 
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 48558f6..6e3488b 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -153,6 +153,7 @@ void kvmppc_set_pvr(struct kvm_vcpu *vcpu, u32 pvr)
if (!to_book3s(vcpu)->hior_sregs)
to_book3s(vcpu)->hior = 0xfff00000;
to_book3s(vcpu)->msr_mask = 0xffffffffffffffffULL;
+   vcpu->arch.cpu_type = KVM_CPU_3S_64;
} else
 #endif
{
@@ -160,8 +161,11 @@ void kvmppc_set_pvr(struct kvm_vcpu *vcpu, u32 pvr)
if (!to_book3s(vcpu)->hior_sregs)
to_book3s(vcpu)->hior = 0;
to_book3s(vcpu)->msr_mask = 0xffffffffULL;
+   vcpu->arch.cpu_type = KVM_CPU_3S_32;
}
 
+   kvmppc_sanity_check(vcpu);
+
/* If we are in hypervisor level on 970, we can tell the CPU to
 * treat DCBZ as 32 bytes store */
vcpu->arch.hflags &= ~BOOK3S_HFLAG_DCBZ32;
@@ -938,6 +942,12 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 #endif
ulong ext_msr;
 
+   /* Check if we can run the vcpu at all */
+   if (!vcpu->arch.sane) {
+   kvm_run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+   return -EINVAL;
+   }

[PATCH 06/14] KVM: PPC: Add support for explicit HIOR setting

2011-08-25 Thread Alexander Graf
Until now, we always set HIOR based on the PVR, but this is just wrong.
Instead, we should be setting HIOR explicitly, so user space can decide
what the initial HIOR value is - just like on real hardware.

We keep the old PVR-based way around for backwards compatibility, but
once user space uses the SREGS-based method, we drop the PVR logic.
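
The user-space side this enables would look something like the
following sketch (vcpu_fd is assumed to be an open vcpu file
descriptor, error handling omitted):

	struct kvm_sregs sregs;

	ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
	sregs.u.s.flags |= KVM_SREGS_S_HIOR;
	sregs.u.s.hior = 0;	/* vector to 0 instead of 0xfff00000 */
	ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);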

Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/kvm.h|8 
 arch/powerpc/include/asm/kvm_book3s.h |2 ++
 arch/powerpc/kvm/book3s_pr.c  |   14 --
 arch/powerpc/kvm/powerpc.c|1 +
 include/linux/kvm.h   |1 +
 5 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index a4f6c85..a6a253e 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -149,6 +149,12 @@ struct kvm_regs {
 #define KVM_SREGS_E_UPDATE_DBSR    (1 << 3)
 
 /*
+ * Book3S special bits to indicate contents in the struct by maintaining
+ * backwards compatibility with older structs. If adding a new field,
+ * please make sure to add a flag for that new field */
+#define KVM_SREGS_S_HIOR   (1 << 0)
+
+/*
  * In KVM_SET_SREGS, reserved/pad fields must be left untouched from a
  * previous KVM_GET_REGS.
  *
@@ -173,6 +179,8 @@ struct kvm_sregs {
__u64 ibat[8]; 
__u64 dbat[8]; 
} ppc32;
+   __u64 flags; /* KVM_SREGS_S_ */
+   __u64 hior;
} s;
struct {
union {
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 37dd748..472437b 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -90,6 +90,8 @@ struct kvmppc_vcpu_book3s {
 #endif
int context_id[SID_CONTEXTS];
 
+   bool hior_sregs;/* HIOR is set by SREGS, not PVR */
+
struct hlist_head hpte_hash_pte[HPTEG_HASH_NUM_PTE];
struct hlist_head hpte_hash_pte_long[HPTEG_HASH_NUM_PTE_LONG];
struct hlist_head hpte_hash_vpte[HPTEG_HASH_NUM_VPTE];
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -150,13 +150,15 @@ void kvmppc_set_pvr(struct kvm_vcpu *vcpu, u32 pvr)
 #ifdef CONFIG_PPC_BOOK3S_64
if ((pvr >= 0x330000) && (pvr < 0x70330000)) {
kvmppc_mmu_book3s_64_init(vcpu);
-   to_book3s(vcpu)->hior = 0xfff00000;
+   if (!to_book3s(vcpu)->hior_sregs)
+   to_book3s(vcpu)->hior = 0xfff00000;
to_book3s(vcpu)->msr_mask = 0xffffffffffffffffULL;
} else
 #endif
{
kvmppc_mmu_book3s_32_init(vcpu);
-   to_book3s(vcpu)->hior = 0;
+   if (!to_book3s(vcpu)->hior_sregs)
+   to_book3s(vcpu)->hior = 0;
to_book3s(vcpu)->msr_mask = 0xffffffffULL;
}
 
@@ -770,6 +772,9 @@ int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
}
}
 
+   if (sregs->u.s.flags & KVM_SREGS_S_HIOR)
+   sregs->u.s.hior = to_book3s(vcpu)->hior;
+
return 0;
 }
 
@@ -806,6 +811,11 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
/* Flush the MMU after messing with the segments */
kvmppc_mmu_pte_flush(vcpu, 0, 0);
 
+   if (sregs->u.s.flags & KVM_SREGS_S_HIOR) {
+   to_book3s(vcpu)->hior_sregs = true;
+   to_book3s(vcpu)->hior = sregs->u.s.hior;
+   }
+
return 0;
 }
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index a107c9b..17a5c83 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -188,6 +188,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_PPC_BOOKE_SREGS:
 #else
case KVM_CAP_PPC_SEGSTATE:
+   case KVM_CAP_PPC_HIOR:
 #endif
case KVM_CAP_PPC_UNSET_IRQ:
case KVM_CAP_PPC_IRQ_LEVEL:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 55f5afb..4d33f78 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -554,6 +554,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_SMT 64
 #define KVM_CAP_PPC_RMA    65
 #define KVM_CAP_MAX_VCPUS 66   /* returns max vcpus per vm */
+#define KVM_CAP_PPC_HIOR 67
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.6.0.2



[PATCH 08/14] KVM: PPC: Stub emulate CFAR and PURR SPRs

2011-08-25 Thread Alexander Graf
Recent Linux versions use the CFAR and PURR SPRs, but don't really care about
their contents (yet). So for now, we can simply return 0 when the guest wants
to read them.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/book3s_emulate.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c
index bf0ddcd..0c9dc62 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -473,6 +473,10 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, int rt)
case SPRN_HID5:
kvmppc_set_gpr(vcpu, rt, to_book3s(vcpu)->hid[5]);
break;
+   case SPRN_CFAR:
+   case SPRN_PURR:
+   kvmppc_set_gpr(vcpu, rt, 0);
+   break;
case SPRN_GQR0:
case SPRN_GQR1:
case SPRN_GQR2:
-- 
1.6.0.2



[PATCH 09/14] KVM: PPC: Support SC1 hypercalls for PAPR in PR mode

2011-08-25 Thread Alexander Graf
PAPR defines hypercalls as SC1 instructions. Using these, the guest modifies
page tables and does other privileged operations that it wouldn't be allowed
to do in supervisor mode.

This patch adds support for PR KVM to trap these instructions and route them
through the same PAPR hypercall interface that we already use for HV style
KVM.
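
For context, the guest side of such a call loads the hcall number into
r3, the arguments into r4 and up, and executes 'sc 1'. A sketch
(simplified from the PAPR calling convention, clobber list abbreviated):

static inline long papr_hcall1_sketch(unsigned long nr, unsigned long arg)
{
	register unsigned long r3 asm("r3") = nr;
	register unsigned long r4 asm("r4") = arg;

	asm volatile("sc 1"
		     : "+r" (r3), "+r" (r4)
		     : : "r0", "r5", "r6", "r7", "r8", "r9", "r10",
		       "r11", "r12", "ctr", "xer", "cr0", "memory");

	return r3;	/* H_SUCCESS or a hypervisor error code */
}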

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/book3s_pr.c |   22 +-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 78dcf65..48558f6 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -648,7 +648,27 @@ program_interrupt:
break;
}
case BOOK3S_INTERRUPT_SYSCALL:
-   if (vcpu->arch.osi_enabled &&
+   if (vcpu->arch.papr_enabled &&
+   (kvmppc_get_last_inst(vcpu) == 0x44000022) &&
+   !(vcpu->arch.shared->msr & MSR_PR)) {
+   /* SC 1 papr hypercalls */
+   ulong cmd = kvmppc_get_gpr(vcpu, 3);
+   int i;
+
+   if (kvmppc_h_pr(vcpu, cmd) == EMULATE_DONE) {
+   r = RESUME_GUEST;
+   break;
+   }
+
+   run->papr_hcall.nr = cmd;
+   for (i = 0; i < 9; ++i) {
+   ulong gpr = kvmppc_get_gpr(vcpu, 4 + i);
+   run->papr_hcall.args[i] = gpr;
+   }
+   run->exit_reason = KVM_EXIT_PAPR_HCALL;
+   vcpu->arch.hcall_needed = 1;
+   r = RESUME_HOST;
+   } else if (vcpu->arch.osi_enabled &&
(((u32)kvmppc_get_gpr(vcpu, 3)) == OSI_SC_MAGIC_R3) &&
(((u32)kvmppc_get_gpr(vcpu, 4)) == OSI_SC_MAGIC_R4)) {
/* MOL hypercalls */
-- 
1.6.0.2



[PATCH 13/14] KVM: PPC: book3s_pr: Simplify transitions between virtual and real mode

2011-08-25 Thread Alexander Graf
From: Paul Mackerras 

This simplifies the way that the book3s_pr makes the transition to
real mode when entering the guest.  We now call kvmppc_entry_trampoline
(renamed from kvmppc_rmcall) in the base kernel using a normal function
call instead of doing an indirect call through a pointer in the vcpu.
If kvm is a module, the module loader takes care of generating a
trampoline as it does for other calls to functions outside the module.

kvmppc_entry_trampoline then disables interrupts and jumps to
kvmppc_handler_trampoline_enter in real mode using an rfi[d].
That then uses the link register as the address to return to
(potentially in module space) when the guest exits.

This also simplifies the way that we call the Linux interrupt handler
when we exit the guest due to an external, decrementer or performance
monitor interrupt.  Instead of turning on the MMU, then deciding that
we need to call the Linux handler and turning the MMU back off again,
we now go straight to the handler at the point where we would turn the
MMU on.  The handler will then return to the virtual-mode code
(potentially in the module).

Along the way, this moves the setting and clearing of the HID5 DCBZ32
bit into real-mode interrupts-off code, and also makes sure that
we clear the MSR[RI] bit before loading values into SRR0/1.

The net result is that we no longer need any code addresses to be
stored in vcpu->arch.

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/kvm_book3s.h |4 +-
 arch/powerpc/include/asm/kvm_book3s_asm.h |1 +
 arch/powerpc/include/asm/kvm_host.h   |8 --
 arch/powerpc/kernel/asm-offsets.c |7 +--
 arch/powerpc/kvm/book3s_32_sr.S   |2 +-
 arch/powerpc/kvm/book3s_64_slb.S  |2 +-
 arch/powerpc/kvm/book3s_exports.c |4 +-
 arch/powerpc/kvm/book3s_interrupts.S  |  129 +---
 arch/powerpc/kvm/book3s_pr.c  |   12 ---
 arch/powerpc/kvm/book3s_rmhandlers.S  |   51 
 arch/powerpc/kvm/book3s_segment.S |  112 -
 11 files changed, 120 insertions(+), 212 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 91d41fa..a384ffd 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -141,9 +141,7 @@ extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
 extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu);
 extern pfn_t kvmppc_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 
-extern void kvmppc_handler_lowmem_trampoline(void);
-extern void kvmppc_handler_trampoline_enter(void);
-extern void kvmppc_rmcall(ulong srr0, ulong srr1);
+extern void kvmppc_entry_trampoline(void);
 extern void kvmppc_hv_entry_trampoline(void);
 extern void kvmppc_load_up_fpu(void);
 extern void kvmppc_load_up_altivec(void);
diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
index ef7b368..af73469 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -75,6 +75,7 @@ struct kvmppc_host_state {
ulong scratch0;
ulong scratch1;
u8 in_guest;
+   u8 restore_hid5;
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
struct kvm_vcpu *kvm_vcpu;
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 2b8284f..dec3054 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -258,14 +258,6 @@ struct kvm_vcpu_arch {
ulong host_stack;
u32 host_pid;
 #ifdef CONFIG_PPC_BOOK3S
-   ulong host_msr;
-   ulong host_r2;
-   void *host_retip;
-   ulong trampoline_lowmem;
-   ulong trampoline_enter;
-   ulong highmem_handler;
-   ulong rmcall;
-   ulong host_paca_phys;
struct kvmppc_slb slb[64];
int slb_max;/* 1 + index of last valid entry in slb[] */
int slb_nr; /* total number of entries in SLB */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 5f078bc..e069c76 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -449,8 +449,6 @@ int main(void)
 #ifdef CONFIG_PPC_BOOK3S
DEFINE(VCPU_KVM, offsetof(struct kvm_vcpu, kvm));
DEFINE(VCPU_VCPUID, offsetof(struct kvm_vcpu, vcpu_id));
-   DEFINE(VCPU_HOST_RETIP, offsetof(struct kvm_vcpu, arch.host_retip));
-   DEFINE(VCPU_HOST_MSR, offsetof(struct kvm_vcpu, arch.host_msr));
DEFINE(VCPU_PURR, offsetof(struct kvm_vcpu, arch.purr));
DEFINE(VCPU_SPURR, offsetof(struct kvm_vcpu, arch.spurr));
DEFINE(VCPU_DSCR, offsetof(struct kvm_vcpu, arch.dscr));
@@ -458,10 +456,6 @@ int main(void)
DEFINE(VCPU_UAMOR, offsetof(struct kvm_vcpu, arch.uamor));
DEFINE(VCPU_CTRL, offsetof(struct kvm_vcpu, arch.ctr

[PATCH 04/14] KVM: PPC: Interpret SDR1 as HVA in PAPR mode

2011-08-25 Thread Alexander Graf
When running a PAPR guest, the guest is not allowed to set SDR1 - instead
the HTAB information is held in internal hypervisor structures. But all of
our current code relies on SDR1 and walking the HTAB like on real hardware.

So in order to not be too intrusive, we simply set SDR1 to the HTAB we hold
in host memory. That way we can keep the HTAB in user space, but use it from
kernel space to map the guest.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/book3s_64_mmu.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu.c b/arch/powerpc/kvm/book3s_64_mmu.c
index c6d3e19..b871721 100644
--- a/arch/powerpc/kvm/book3s_64_mmu.c
+++ b/arch/powerpc/kvm/book3s_64_mmu.c
@@ -128,7 +128,13 @@ static hva_t kvmppc_mmu_book3s_64_get_pteg(
dprintk("MMU: page=0x%x sdr1=0x%llx pteg=0x%llx vsid=0x%llx\n",
page, vcpu_book3s->sdr1, pteg, slbe->vsid);
 
-   r = gfn_to_hva(vcpu_book3s->vcpu.kvm, pteg >> PAGE_SHIFT);
+   /* When running a PAPR guest, SDR1 contains a HVA address instead
+   of a GPA */
+   if (vcpu_book3s->vcpu.arch.papr_enabled)
+   r = pteg;
+   else
+   r = gfn_to_hva(vcpu_book3s->vcpu.kvm, pteg >> PAGE_SHIFT);
+
if (kvm_is_error_hva(r))
return r;
return r | (pteg & ~PAGE_MASK);
-- 
1.6.0.2



[PULL 00/14] ppc patch queue 2011-08-25

2011-08-25 Thread Alexander Graf
Hi Avi,

This is my current patch queue for ppc. Please pull.

Alex


The following changes since commit ef7c782ea4a99fafb3d60dc8b8c057e0ef14f9f7:
  Nadav Har'El (1):
KVM: SVM: Fix TSC MSR read in nested SVM

are available in the git repository at:

  git://github.com/agraf/linux-2.6.git kvm-ppc-next

Alexander Graf (11):
  KVM: PPC: move compute_tlbie_rb to book3s common header
  KVM: PPC: Add papr_enabled flag
  KVM: PPC: Check privilege level on SPRs
  KVM: PPC: Interpret SDR1 as HVA in PAPR mode
  KVM: PPC: Read out syscall instruction on trap
  KVM: PPC: Add support for explicit HIOR setting
  KVM: PPC: Add PAPR hypercall code for PR mode
  KVM: PPC: Stub emulate CFAR and PURR SPRs
  KVM: PPC: Support SC1 hypercalls for PAPR in PR mode
  KVM: PPC: Enable the PAPR CAP for Book3S
  KVM: PPC: Add sanity checking to vcpu_run

Paul Mackerras (3):
  KVM: PPC: Assemble book3s{,_hv}_rmhandlers.S separately
  KVM: PPC: book3s_pr: Simplify transitions between virtual and real mode
  KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code

 arch/powerpc/include/asm/kvm.h|   13 +
 arch/powerpc/include/asm/kvm_book3s.h |   40 +++-
 arch/powerpc/include/asm/kvm_book3s_asm.h |2 +
 arch/powerpc/include/asm/kvm_host.h   |   30 ++-
 arch/powerpc/include/asm/kvm_ppc.h|1 +
 arch/powerpc/kernel/asm-offsets.c |   13 +-
 arch/powerpc/kernel/exceptions-64s.S  |   10 -
 arch/powerpc/kvm/44x.c|2 +
 arch/powerpc/kvm/Makefile |4 +
 arch/powerpc/kvm/book3s_32_sr.S   |2 +-
 arch/powerpc/kvm/book3s_64_mmu.c  |8 +-
 arch/powerpc/kvm/book3s_64_slb.S  |2 +-
 arch/powerpc/kvm/book3s_emulate.c |   29 +++
 arch/powerpc/kvm/book3s_exports.c |4 +-
 arch/powerpc/kvm/book3s_hv.c  |  343 
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   33 ---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  300 ++---
 arch/powerpc/kvm/book3s_interrupts.S  |  129 +---
 arch/powerpc/kvm/book3s_pr.c  |   58 --
 arch/powerpc/kvm/book3s_pr_papr.c |  158 +
 arch/powerpc/kvm/book3s_rmhandlers.S  |   54 ++---
 arch/powerpc/kvm/book3s_segment.S |  117 --
 arch/powerpc/kvm/booke.c  |   10 +-
 arch/powerpc/kvm/e500.c   |2 +
 arch/powerpc/kvm/powerpc.c|   55 -
 include/linux/kvm.h   |2 +
 26 files changed, 965 insertions(+), 456 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_pr_papr.c


[PATCH 02/14] KVM: PPC: Add papr_enabled flag

2011-08-25 Thread Alexander Graf
When running a PAPR guest, some things change. The privilege level drops
from hypervisor to supervisor, SDR1 gets treated differently and we interpret
hypercalls. For bisectability's sake, add the flag now, but only enable it when
all the support code is there.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/kvm_host.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index cc22b28..e681302 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -389,6 +389,7 @@ struct kvm_vcpu_arch {
u8 dcr_is_write;
u8 osi_needed;
u8 osi_enabled;
+   u8 papr_enabled;
u8 hcall_needed;
 
u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
-- 
1.6.0.2



Re: [Qemu-devel] Questions regarding ivshmem spec

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 17:00 +0300, Avi Kivity wrote:
> On 08/25/2011 04:29 PM, Sasha Levin wrote:
> > 2. The spec describes DOORBELL as an array of DWORDs, when one guest
> > wants to poke a different guest it would write something into the offset
> > of the other guest in the DOORBELL array.
> > Looking at the implementation in QEMU, DOORBELL is one DWORD, when
> > writing to it the upper WORD is the guest id and the lower WORD is the
> > value.
> > What am I missing here?
> >
> 
> The spec in qemu.git is accurate.  The intent is to use an ioeventfd 
> bound into an irqfd so a write into the doorbell injects an interrupt 
> directly into the other guest, without going through qemu^Wkvm tool.
> 

But the doorbell is a single DWORD, so if a guest writes to it we'd
still need to figure out which guest/vector he wants to poke from
userspace, no?

If it was an array of doorbells then yes, we could assign an ioeventfd
to each offset - but now I don't quite see how we can avoid passing
through the userspace.
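
For reference, the single-DWORD encoding as QEMU implements it is just
(sketch):

	/* upper 16 bits: destination peer id; lower 16 bits: vector */
	static uint32_t doorbell_value(uint16_t peer_id, uint16_t vector)
	{
		return ((uint32_t)peer_id << 16) | vector;
	}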

-- 

Sasha.



Re: [PATCH 11/11] KVM: MMU: improve write flooding detected

2011-08-25 Thread Avi Kivity

On 08/25/2011 04:21 PM, Marcelo Tosatti wrote:

On Thu, Aug 25, 2011 at 07:42:10AM +0300, Avi Kivity wrote:
>  On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:
>  >>
>  >>   It could increase the flood count independently of the accessed bit of
>  >>   the spte being updated, zapping after 3 attempts as it is now.
>  >>
>  >>   But additionally reset the flood count if the gpte appears to be valid
>  >>   (points to an existant gfn if the present bit is set, or if its zeroed).
>  >
>  >Well not zero, as thats a common pattern for non ptes.
>  >
>
>  On 32-bit with 4GB RAM, practically anything is a valid gpte.

The following could be required to consider a valid gpte, for write
flood detection purposes:

- Must be present.
- PageCacheDisable must be unset.
- PageWriteThrough must be unset.



Unless the guest is using PAT.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 11/11] KVM: MMU: improve write flooding detected

2011-08-25 Thread Avi Kivity

On 08/25/2011 05:06 PM, Avi Kivity wrote:

On 08/25/2011 04:21 PM, Marcelo Tosatti wrote:

On Thu, Aug 25, 2011 at 07:42:10AM +0300, Avi Kivity wrote:
>  On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:
> >>
> >>   It could increase the flood count independently of the accessed bit of
> >>   the spte being updated, zapping after 3 attempts as it is now.
> >>
> >>   But additionally reset the flood count if the gpte appears to be valid
> >>   (points to an existant gfn if the present bit is set, or if its zeroed).

> >
> >Well not zero, as thats a common pattern for non ptes.
> >
>
>  On 32-bit with 4GB RAM, practically anything is a valid gpte.

The following could be required to consider a valid gpte, for write
flood detection purposes:

- Must be present.
- PageCacheDisable must be unset.
- PageWriteThrough must be unset.



Unless the guest is using PAT.



And not swapping.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] Questions regarding ivshmem spec

2011-08-25 Thread Avi Kivity

On 08/25/2011 04:29 PM, Sasha Levin wrote:

2. The spec describes DOORBELL as an array of DWORDs, when one guest
wants to poke a different guest it would write something into the offset
of the other guest in the DOORBELL array.
Looking at the implementation in QEMU, DOORBELL is one DWORD, when
writing to it the upper WORD is the guest id and the lower WORD is the
value.
What am I missing here?



The spec in qemu.git is accurate.  The intent is to use an ioeventfd 
bound into an irqfd so a write into the doorbell injects an interrupt 
directly into the other guest, without going through qemu^Wkvm tool.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 11/11] KVM: MMU: improve write flooding detected

2011-08-25 Thread Marcelo Tosatti
On Thu, Aug 25, 2011 at 07:42:10AM +0300, Avi Kivity wrote:
> On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:
> >>
> >>  It could increase the flood count independently of the accessed bit of
> >>  the spte being updated, zapping after 3 attempts as it is now.
> >>
> >>  But additionally reset the flood count if the gpte appears to be valid
> >>  (points to an existant gfn if the present bit is set, or if its zeroed).
> >
> >Well not zero, as thats a common pattern for non ptes.
> >
> 
> On 32-bit with 4GB RAM, practically anything is a valid gpte.

The following could be required to consider a valid gpte, for write
flood detection purposes:

- Must be present.
- PageCacheDisable must be unset.
- PageWriteThrough must be unset.




Re: [PATCH 11/11] KVM: MMU: improve write flooding detected

2011-08-25 Thread Marcelo Tosatti
On Thu, Aug 25, 2011 at 03:57:22PM +0800, Xiao Guangrong wrote:
> On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> > On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
> >> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
> >>
>  And, i think there are not problems since: if the spte without accssed 
>  bit is
>  written frequently, it means the guest page table is accessed 
>  infrequently or
>  during the writing, the guest page table is not accessed, in this time, 
>  zapping
>  this shadow page is not bad.
> >>>
> >>> Think of the following scenario:
> >>>
> >>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> >>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> >>> is not increased.
> >>> 3) repeat
> >>>
> >>
> >> I think the result is just we hoped, we do not want to zap the shadow page
> >> because the spte is currently used by the guest, it also will be used in 
> >> the
> >> next repetition. So do not increase 'write_flooding_count' is a good 
> >> choice.
> > 
> > Its not used. Step 2) is write to write protected shadow page at
> > gfnA.
> > 
> >> Let's consider what will happen if we increase 'write_flooding_count':
> >> 1: after three repetitions, zap the shadow page
> >> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
> >> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
> >>shadow page can be zapped again, repeat 1 to 3.
> > 
> > The shadow page will not be zapped because the spte created from
> > gfnA+indexA has the accessed bit set:
> > 
> >if (spte && !(*spte & shadow_accessed_mask))
> >sp->write_flooding_count++;
> >else
> >sp->write_flooding_count = 0;
> > 
> 
> Marcelo, i am still confused with your example, in step 3), what is repeated?
> it repeats step 2) or it repeats step 1) and 2)?
> 
> Only step 2) is repeated i guess, right? if it is yes, it works well:
> when the guest writes gpte, the spte of corresponding shadow page is zapped
> (level > 1) or it is speculatively fetched(level == 1), the accessed bit is
> cleared in both case.

Right.

> the later write can detect that the accessed bit is not set, and 
> write_flooding_count
> is increased. finally, the shadow page is zapped, the gpte is written 
> directly.
> 
> >> The result is the shadow page for gfnA is alloced and zapped again and 
> >> again,
> >> yes?
> > 
> > The point is you cannot rely on the accessed bit of sptes that have been
> > instantiated with the accessed bit set to decide whether or not to zap.
> > Because the accessed bit will only be cleared on host memory pressure.
> > 
> 
> But the accessed bit is also cleared after spte is written.

Right. But only one of the 512 sptes. Worst case, a shadow page that has
one spte with the accessed bit set at every third spte entry would not
be zapped for a linear write of the entire guest pagetable. The current
heuristic does not suffer from this issue.

I guess it is OK to be more trigger happy with zapping by ignoring
the accessed bit, clearing the flood counter on page fault.
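
In other words, something along these lines (sketch of the proposed
heuristic, not current code):

/* Count every emulated write to a write-protected shadow page,
 * ignoring accessed bits; reset whenever the guest faults through
 * the page, i.e. the table is demonstrably still in use. */
static bool write_flooded(struct kvm_mmu_page *sp)
{
	return ++sp->write_flooding_count >= 3;
}

static void write_flooding_reset(struct kvm_mmu_page *sp)
{
	sp->write_flooding_count = 0;	/* called from the fault path */
}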



Questions regarding ivshmem spec

2011-08-25 Thread Sasha Levin
Hello,

I am looking to implement an ivshmem device for KVM tools; the purpose
is to provide the same functionality as QEMU and interoperability with it.

Going through the spec (I found here:
https://gitorious.org/nahanni/guest-code/blobs/master/device_spec.txt )
and the code in QEMU I have gathered several questions, I'll be happy
for some help with it.

1. File handles and guest IDs are passed between the server and the
peers using sockets, is the protocol itself documented anywhere? I would
like to be able to work alongside QEMU servers/peers.

2. The spec describes DOORBELL as an array of DWORDs, when one guest
wants to poke a different guest it would write something into the offset
of the other guest in the DOORBELL array.
Looking at the implementation in QEMU, DOORBELL is one DWORD, when
writing to it the upper WORD is the guest id and the lower WORD is the
value.
What am I missing here?

3. There are 3 ways for guests to communicate between each other, and
I'm assuming all guests using the same SHM block must use the same
method. Is it safe to assume we'll always use ioeventfds as in the
implementation now?

Thanks!

-- 

Sasha.



Re: kvm PCI assignment & VFIO ramblings

2011-08-25 Thread Roedel, Joerg
On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote:
> On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
> > On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
> > > On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
> > 
> > > > Handling it through fds is a good idea. This makes sure that everything
> > > > belongs to one process. I am not really sure yet if we go the way to
> > > > just bind plain groups together or if we create meta-groups. The
> > > > meta-groups thing seems somewhat cleaner, though.
> > > 
> > > I'm leaning towards binding because we need to make it dynamic, but I
> > > don't really have a good picture of the lifecycle of a meta-group.
> > 
> > In my view the life-cycle of the meta-group is a subrange of the
> > qemu-instance's life-cycle.
> 
> I guess I mean the lifecycle of a super-group that's actually exposed as
> a new group in sysfs.  Who creates it?  How?  How are groups dynamically
> added and removed from the super-group?  The group merging makes sense
> to me because it's largely just an optimization that qemu will try to
> merge groups.  If it works, great.  If not, it manages them separately.
> When all the devices from a group are unplugged, unmerge the group if
> necessary.

Right. The super-group thing is an optimization.

> We need to try the polite method of attempting to hot unplug the device
> from qemu first, which the current vfio code already implements.  We can
> then escalate if it doesn't respond.  The current code calls abort in
> qemu if the guest doesn't respond, but I agree we should also be
> enforcing this at the kernel interface.  I think the problem with the
> hard-unplug is that we don't have a good revoke mechanism for the mmio
> mmaps.

For mmio we could stop the guest and replace the mmio region with a
region that is filled with 0xff, no?
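
Something like the following user-space sketch, presumably (real code
would also have to coordinate with the vcpu threads):

static int revoke_mmio(void *addr, size_t len)
{
	/* MAP_FIXED atomically replaces the PTEs of the old range */
	void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	memset(p, 0xff, len);		/* reads now return all-ones */
	return mprotect(p, len, PROT_READ);
}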

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Pekka Enberg
On 08/25/2011 02:38 PM, Pekka Enberg wrote:
>> If you or other KVM folks want to have a say what goes into tools/kvm,
>> I'm happy to send you a pull request against kvm.git.

On Thu, Aug 25, 2011 at 2:51 PM, Avi Kivity  wrote:
> Thanks, but I have my hands full already.  I'll stop offering unwanted
> advice as well.

Your advice has never been unwanted nor do I imagine it to ever be.

 Pekka


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Avi Kivity

On 08/25/2011 02:38 PM, Pekka Enberg wrote:

If you or other KVM folks want to have a say what goes into tools/kvm,
I'm happy to send you a pull request against kvm.git.


Thanks, but I have my hands full already.  I'll stop offering unwanted 
advice as well.



Anyway, Sasha thinks ivshmem is the way to go and that's good enough for me.



Great.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 14:30 +0300, Avi Kivity wrote:
> On 08/25/2011 02:15 PM, Pekka Enberg wrote:
> > On Thu, Aug 25, 2011 at 1:59 PM, Stefan Hajnoczi  wrote:
> > >  Introducing yet another non-standard and non-Linux interface doesn't
> > >  help though.  If there is no significant improvement over ivshmem then
> > >  it makes sense to let ivshmem gain critical mass and more users
> > >  instead of fragmenting the space.
> >
> > Look, I'm not going to require QEMU compatibility from tools/kvm
> > contributors. If you guys really feel that strongly about the
> > interface, then either
> >
> >- Get Rusty's "virtio spec pixie pee" for ivshmem
> 
> It's not a virtio device (doesn't do dma).  It does have a spec in 
> qemu.git/docs/specs.

Please note that the spec you have in /docs/specs is different from what
Cam has in his git tree
(https://gitorious.org/nahanni/guest-code/blobs/master/device_spec.txt).

If we are going to add it to KVM tool maybe it's a good time to move it
out of QEMU tree and make it less QEMU specific?

> 
> >- Get the Linux driver merged to linux-next
> 
> ivshmem uses uio, so it doesn't need an in-kernel driver, IIRC.  Map 
> your BAR from sysfs and go.
> 
> >- Help out David and Sasha to change interface
> >
> > But don't ask me to block clean code from inclusion to tools/kvm
> > because it doesn't have a QEMU-capable interface.
> 
> A lot of thought has gone into the design and implementation of 
> ivshmem.  But don't let that stop you from merging clean code.

There's a big difference between requiring it to be ivshmem compatible
because ivshmem is good and requiring it to be ivshmem compatible because
that's what QEMU is doing.

Looking at the comments in this thread, I would have expected to see many
more comments on the technical merits of ivshmem over a simple shared
memory block, rather than the argument that KVM tools has to conform to
QEMU standards.

-- 

Sasha.



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Pekka Enberg
On Thu, Aug 25, 2011 at 2:30 PM, Avi Kivity  wrote:
> On 08/25/2011 02:15 PM, Pekka Enberg wrote:
>>
>> On Thu, Aug 25, 2011 at 1:59 PM, Stefan Hajnoczi
>>  wrote:
>> >  Introducing yet another non-standard and non-Linux interface doesn't
>> >  help though.  If there is no significant improvement over ivshmem then
>> >  it makes sense to let ivshmem gain critical mass and more users
>> >  instead of fragmenting the space.
>>
>> Look, I'm not going to require QEMU compatibility from tools/kvm
>> contributors. If you guys really feel that strongly about the
>> interface, then either
>>
>>   - Get Rusty's "virtio spec pixie pee" for ivshmem
>
> It's not a virtio device (doesn't do dma).  It does have a spec in
> qemu.git/docs/specs.
>
>>   - Get the Linux driver merged to linux-next
>
> ivshmem uses uio, so it doesn't need an in-kernel driver, IIRC.  Map your
> BAR from sysfs and go.

Right.

On Thu, Aug 25, 2011 at 2:30 PM, Avi Kivity  wrote:
>>   - Help out David and Sasha to change interface
>>
>> But don't ask me to block clean code from inclusion to tools/kvm
>> because it doesn't have a QEMU-capable interface.
>
> A lot of thought has gone into the design and implementation of ivshmem.
>  But don't let that stop you from merging clean code.

Thanks, I won't.

If you or other KVM folks want to have a say what goes into tools/kvm,
I'm happy to send you a pull request against kvm.git.

Anyway, Sasha thinks ivshmem is the way to go and that's good enough for me.

Pekka


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Avi Kivity

On 08/25/2011 02:15 PM, Pekka Enberg wrote:

On Thu, Aug 25, 2011 at 1:59 PM, Stefan Hajnoczi  wrote:
>  Introducing yet another non-standard and non-Linux interface doesn't
>  help though.  If there is no significant improvement over ivshmem then
>  it makes sense to let ivshmem gain critical mass and more users
>  instead of fragmenting the space.

Look, I'm not going to require QEMU compatibility from tools/kvm
contributors. If you guys really feel that strongly about the
interface, then either

   - Get Rusty's "virtio spec pixie pee" for ivshmem


It's not a virtio device (doesn't do dma).  It does have a spec in 
qemu.git/docs/specs.



   - Get the Linux driver merged to linux-next


ivshmem uses uio, so it doesn't need an in-kernel driver, IIRC.  Map 
your BAR from sysfs and go.
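
Concretely, that means mmap()ing the BAR's resource file; a sketch,
with a made-up PCI address (BAR2 of ivshmem is the shared memory):

	int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource2",
		      O_RDWR);
	void *shm = mmap(NULL, 16 << 20, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);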



   - Help out David and Sasha to change interface

But don't ask me to block clean code from inclusion to tools/kvm
because it doesn't have a QEMU-capable interface.


A lot of thought has gone into the design and implementation of 
ivshmem.  But don't let that stop you from merging clean code.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 11:59 +0100, Stefan Hajnoczi wrote:
> On Thu, Aug 25, 2011 at 11:37 AM, Pekka Enberg  wrote:
> > Hi Stefan,
> >
> > On Thu, Aug 25, 2011 at 1:31 PM, Stefan Hajnoczi  wrote:
> >>> It's obviously not competing. One thing you might want to consider is
> >>> making the guest interface compatible with ivshmem. Is there any reason
> >>> we shouldn't do that? I don't consider that a requirement, just nice to
> >>> have.
> >>
> >> The point of implementing the same interface as ivshmem is that users
> >> don't need to rejig guests or applications in order to switch between
> >> hypervisors.  A different interface also prevents same-to-same
> >> benchmarks.
> >>
> >> There is little benefit to creating another virtual device interface
> >> when a perfectly good one already exists.  The question should be: how
> >> is this shmem device different and better than ivshmem?  If there is
> >> no justification then implement the ivshmem interface.
> >
> > So which interface are we actually taking about? Userspace/kernel in the
> > guest or hypervisor/guest kernel?
> 
> The hardware interface.  Same PCI BAR layout and semantics.
> 
> > Either way, while it would be nice to share the interface but it's not a
> > *requirement* for tools/kvm unless ivshmem is specified in the virtio
> > spec or the driver is in mainline Linux. We don't intend to require people
> > to implement non-standard and non-Linux QEMU interfaces. OTOH,
> > ivshmem would make the PCI ID problem go away.
> 
> Introducing yet another non-standard and non-Linux interface doesn't
> help though.  If there is no significant improvement over ivshmem then
> it makes sense to let ivshmem gain critical mass and more users
> instead of fragmenting the space.

I support doing it ivshmem-compatible, though it doesn't have to be a
requirement right now (that is, use this patch as a base and build it
towards ivshmem - which shouldn't be an issue since this patch provides
the PCI+SHM parts which are required by ivshmem anyway).

ivshmem is a good, documented, stable interface backed by a lot of
research and testing behind it. Looking at the spec it's obvious that
Cam had KVM in mind when designing it and thats exactly what we want to
have in the KVM tool.

David, did you have any plans to extend it to become ivshmem-compatible?
If not, would turning it into such break any code that depends on it
horribly?

-- 

Sasha.



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Pekka Enberg
On Thu, Aug 25, 2011 at 1:59 PM, Stefan Hajnoczi  wrote:
> Introducing yet another non-standard and non-Linux interface doesn't
> help though.  If there is no significant improvement over ivshmem then
> it makes sense to let ivshmem gain critical mass and more users
> instead of fragmenting the space.

Look, I'm not going to require QEMU compatibility from tools/kvm
contributors. If you guys really feel that strongly about the
interface, then either

  - Get Rusty's "virtio spec pixie pee" for ivshmem

  - Get the Linux driver merged to linux-next

  - Help out David and Sasha to change interface

But don't ask me to block clean code from inclusion to tools/kvm
because it doesn't have a QEMU-capable interface.

Pekka


Re: [PATCH 1/3] Avoid the use of deprecated gnutls gnutls_*_set_priority functions.

2011-08-25 Thread Daniel P. Berrange
On Thu, Aug 25, 2011 at 11:54:41AM +0100, Stefan Hajnoczi wrote:
> On Mon, Jul 4, 2011 at 11:00 PM, Raghavendra D Prabhu
>  wrote:
> > The gnutls_*_set_priority family of functions has been marked deprecated
> > in 2.12.x. These functions have been superceded by
> > gnutls_priority_set_direct().
> >
> > Signed-off-by: Raghavendra D Prabhu 
> > ---
> >  ui/vnc-tls.c |   20 +---
> >  1 files changed, 1 insertions(+), 19 deletions(-)
> >
> > diff --git a/ui/vnc-tls.c b/ui/vnc-tls.c
> > index dec626c..33a5d8c 100644
> > --- a/ui/vnc-tls.c
> > +++ b/ui/vnc-tls.c
> > @@ -286,10 +286,6 @@ int vnc_tls_validate_certificate(struct VncState *vs)
> >
> >  int vnc_tls_client_setup(struct VncState *vs,
> >                          int needX509Creds) {
> > -    static const int cert_type_priority[] = { GNUTLS_CRT_X509, 0 };
> > -    static const int protocol_priority[]= { GNUTLS_TLS1_1, GNUTLS_TLS1_0, 
> > GNUTLS_SSL3, 0 };
> > -    static const int kx_anon[] = {GNUTLS_KX_ANON_DH, 0};
> > -    static const int kx_x509[] = {GNUTLS_KX_DHE_DSS, GNUTLS_KX_RSA, 
> > GNUTLS_KX_DHE_RSA, GNUTLS_KX_SRP, 0};
> >
> >     VNC_DEBUG("Do TLS setup\n");
> >     if (vnc_tls_initialize() < 0) {
> > @@ -310,21 +306,7 @@ int vnc_tls_client_setup(struct VncState *vs,
> >             return -1;
> >         }
> >
> > -        if (gnutls_kx_set_priority(vs->tls.session, needX509Creds ? kx_x509 : kx_anon) < 0) {
> > -            gnutls_deinit(vs->tls.session);
> > -            vs->tls.session = NULL;
> > -            vnc_client_error(vs);
> > -            return -1;
> > -        }
> > -
> > -        if (gnutls_certificate_type_set_priority(vs->tls.session, cert_type_priority) < 0) {
> > -            gnutls_deinit(vs->tls.session);
> > -            vs->tls.session = NULL;
> > -            vnc_client_error(vs);
> > -            return -1;
> > -        }
> > -
> > -        if (gnutls_protocol_set_priority(vs->tls.session, protocol_priority) < 0) {
> > +        if (gnutls_priority_set_direct(vs->tls.session, needX509Creds ? "NORMAL" : "NORMAL:+ANON-DH", NULL) < 0) {
> >             gnutls_deinit(vs->tls.session);
> >             vs->tls.session = NULL;
> >             vnc_client_error(vs);
> > --
> > 1.7.6
> 
> Daniel,
> This patch looks good to me but I don't know much about gnutls or
> crypto in general.  Would you be willing to review this?

ACK, this approach is different from what I did in libvirt, but it matches
the recommendations in the GNUTLS manual for setting priority, so I believe
it is good.
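For reference, the whole deprecated kx/cert-type/protocol dance boils down
to a single call in the new API. A minimal sketch (error handling trimmed;
the priority strings are the ones from the patch above):

#include <gnutls/gnutls.h>

static int setup_tls_priority(gnutls_session_t session, int need_x509)
{
	/* One priority string replaces the three deprecated
	 * gnutls_*_set_priority() arrays. */
	const char *prio = need_x509 ? "NORMAL" : "NORMAL:+ANON-DH";

	if (gnutls_priority_set_direct(session, prio, NULL) < 0)
		return -1;	/* caller deinits the session */

	return 0;
}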

Signed-off-by: Daniel P. Berrange 

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: kvm PCI assignment & VFIO ramblings

2011-08-25 Thread Roedel, Joerg
On Wed, Aug 24, 2011 at 10:56:13AM -0400, Alex Williamson wrote:
> On Wed, 2011-08-24 at 10:43 +0200, Joerg Roedel wrote:
> > A side-note: Might it be better to expose assigned devices in a guest on
> > a seperate bus? This will make it easier to emulate an IOMMU for the
> > guest inside qemu.
> 
> I think we want that option, sure.  A lot of guests aren't going to
> support hotplugging buses though, so I think our default, map the entire
> guest model should still be using bus 0.  The ACPI gets a lot more
> complicated for that model too; dynamic SSDTs?  Thanks,

Ok, if only AMD-Vi should be emulated then it is not strictly
necessary. For this IOMMU we can specify that devices on the same bus
belong to different IOMMUs. So we can implement an IOMMU that handles
internal qemu-devices and one that handles pass-through devices.
Not sure if this is possible with VT-d too. VT-d emulation would
also require emulating a PCIe bridge for the devices, no?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Stefan Hajnoczi
On Thu, Aug 25, 2011 at 11:37 AM, Pekka Enberg  wrote:
> Hi Stefan,
>
> On Thu, Aug 25, 2011 at 1:31 PM, Stefan Hajnoczi  wrote:
>>> It's obviously not competing. One thing you might want to consider is
>>> making the guest interface compatible with ivshmem. Is there any reason
>>> we shouldn't do that? I don't consider that a requirement, just nice to
>>> have.
>>
>> The point of implementing the same interface as ivshmem is that users
>> don't need to rejig guests or applications in order to switch between
>> hypervisors.  A different interface also prevents same-to-same
>> benchmarks.
>>
>> There is little benefit to creating another virtual device interface
>> when a perfectly good one already exists.  The question should be: how
>> is this shmem device different and better than ivshmem?  If there is
>> no justification then implement the ivshmem interface.
>
> So which interface are we actually talking about? Userspace/kernel in the
> guest or hypervisor/guest kernel?

The hardware interface.  Same PCI BAR layout and semantics.
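As far as I recall the ivshmem layout (the exact offsets here are from
memory, so treat them as an assumption): BAR0 is the register BAR, BAR1
holds the MSI-X table, and BAR2 is the shared memory itself. Roughly:

/* Sketch of the ivshmem BAR0 register block, offsets in bytes. */
enum ivshmem_registers {
	INTR_MASK   = 0x00,	/* legacy interrupt mask */
	INTR_STATUS = 0x04,	/* legacy interrupt status */
	IV_POSITION = 0x08,	/* read: this guest's peer id */
	DOORBELL    = 0x0c,	/* write (peer_id << 16) | vector */
};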

> Either way, while it would be nice to share the interface, it's not a
> *requirement* for tools/kvm unless ivshmem is specified in the virtio
> spec or the driver is in mainline Linux. We don't intend to require people
> to implement non-standard and non-Linux QEMU interfaces. OTOH,
> ivshmem would make the PCI ID problem go away.

Introducing yet another non-standard and non-Linux interface doesn't
help though.  If there is no significant improvement over ivshmem then
it makes sense to let ivshmem gain critical mass and more users
instead of fragmenting the space.

Stefan


Re: [PATCH 1/3] Avoid the use of deprecated gnutls gnutls_*_set_priority functions.

2011-08-25 Thread Stefan Hajnoczi
On Mon, Jul 4, 2011 at 11:00 PM, Raghavendra D Prabhu
 wrote:
> The gnutls_*_set_priority family of functions has been marked deprecated
> in 2.12.x. These functions have been superseded by
> gnutls_priority_set_direct().
>
> Signed-off-by: Raghavendra D Prabhu 
> ---
>  ui/vnc-tls.c |   20 +---
>  1 files changed, 1 insertions(+), 19 deletions(-)
>
> diff --git a/ui/vnc-tls.c b/ui/vnc-tls.c
> index dec626c..33a5d8c 100644
> --- a/ui/vnc-tls.c
> +++ b/ui/vnc-tls.c
> @@ -286,10 +286,6 @@ int vnc_tls_validate_certificate(struct VncState *vs)
>
>  int vnc_tls_client_setup(struct VncState *vs,
>                          int needX509Creds) {
> -    static const int cert_type_priority[] = { GNUTLS_CRT_X509, 0 };
> -    static const int protocol_priority[]= { GNUTLS_TLS1_1, GNUTLS_TLS1_0, GNUTLS_SSL3, 0 };
> -    static const int kx_anon[] = {GNUTLS_KX_ANON_DH, 0};
> -    static const int kx_x509[] = {GNUTLS_KX_DHE_DSS, GNUTLS_KX_RSA, GNUTLS_KX_DHE_RSA, GNUTLS_KX_SRP, 0};
>
>     VNC_DEBUG("Do TLS setup\n");
>     if (vnc_tls_initialize() < 0) {
> @@ -310,21 +306,7 @@ int vnc_tls_client_setup(struct VncState *vs,
>             return -1;
>         }
>
> -        if (gnutls_kx_set_priority(vs->tls.session, needX509Creds ? kx_x509 : kx_anon) < 0) {
> -            gnutls_deinit(vs->tls.session);
> -            vs->tls.session = NULL;
> -            vnc_client_error(vs);
> -            return -1;
> -        }
> -
> -        if (gnutls_certificate_type_set_priority(vs->tls.session, cert_type_priority) < 0) {
> -            gnutls_deinit(vs->tls.session);
> -            vs->tls.session = NULL;
> -            vnc_client_error(vs);
> -            return -1;
> -        }
> -
> -        if (gnutls_protocol_set_priority(vs->tls.session, protocol_priority) < 0) {
> +        if (gnutls_priority_set_direct(vs->tls.session, needX509Creds ? "NORMAL" : "NORMAL:+ANON-DH", NULL) < 0) {
>             gnutls_deinit(vs->tls.session);
>             vs->tls.session = NULL;
>             vnc_client_error(vs);
> --
> 1.7.6

Daniel,
This patch looks good to me but I don't know much about gnutls or
crypto in general.  Would you be willing to review this?

Thanks,
Stefan


Re: kvm PCI assignment & VFIO ramblings

2011-08-25 Thread Roedel, Joerg
Hi Alex,

On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
> Is this roughly what you're thinking of for the iommu_group component?
> Adding a dev_to_group iommu ops callback let's us consolidate the sysfs
> support in the iommu base.  Would AMD-Vi do something similar (or
> exactly the same) for group #s?  Thanks,

The concept looks good, I have some comments, though. On AMD-Vi the
implementation would look a bit different because there is a
data-structure where the information can be gathered from, so no need for
PCI bus scanning there.

> diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
> index 6e6b6a1..6b54c1a 100644
> --- a/drivers/base/iommu.c
> +++ b/drivers/base/iommu.c
> @@ -17,20 +17,56 @@
>   */
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static struct iommu_ops *iommu_ops;
>  
> +static ssize_t show_iommu_group(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%lx", iommu_dev_to_group(dev));

Probably add a 0x prefix so userspace knows the format?
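I.e., something like this (illustration only; sysfs convention also wants a
trailing newline):

	return sprintf(buf, "0x%lx\n", iommu_dev_to_group(dev));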

> +}
> +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
> +
> +static int add_iommu_group(struct device *dev, void *unused)
> +{
> + if (iommu_dev_to_group(dev) >= 0)
> + return device_create_file(dev, &dev_attr_iommu_group);
> +
> + return 0;
> +}
> +
> +static int device_notifier(struct notifier_block *nb,
> +unsigned long action, void *data)
> +{
> + struct device *dev = data;
> +
> + if (action == BUS_NOTIFY_ADD_DEVICE)
> + return add_iommu_group(dev, NULL);
> +
> + return 0;
> +}
> +
> +static struct notifier_block device_nb = {
> + .notifier_call = device_notifier,
> +};
> +
>  void register_iommu(struct iommu_ops *ops)
>  {
>   if (iommu_ops)
>   BUG();
>  
>   iommu_ops = ops;
> +
> + /* FIXME - non-PCI, really want for_each_bus() */
> + bus_register_notifier(&pci_bus_type, &device_nb);
> + bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
>  }

We need to solve this differently. ARM is starting to use the iommu-api
too and this definitly does not work there. One possible solution might
be to make the iommu-ops per-bus.
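One possible shape, as a hypothetical sketch only -- the bus_set_iommu()
name and an iommu_ops pointer on struct bus_type are made up here for
illustration, reusing device_nb/add_iommu_group from the patch above:

int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops)
{
	if (bus->iommu_ops)
		return -EBUSY;
	bus->iommu_ops = ops;

	/* per-bus notifier instead of hardcoding pci_bus_type */
	bus_register_notifier(bus, &device_nb);
	return bus_for_each_dev(bus, NULL, NULL, add_iommu_group);
}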

>  bool iommu_found(void)
> @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
>  }
>  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
>  
> +long iommu_dev_to_group(struct device *dev)
> +{
> + if (iommu_ops->dev_to_group)
> + return iommu_ops->dev_to_group(dev);
> + return -ENODEV;
> +}
> +EXPORT_SYMBOL_GPL(iommu_dev_to_group);

Please rename this to iommu_device_group(). The dev_to_group name
suggests a conversion but it is actually just a property of the device.
Also the return type should not be long but something that fits into
32bit on all platforms. Since you use -ENODEV, probably s32 is a good
choice.
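I.e. something like this (illustrative only, with the rename applied):

s32 iommu_device_group(struct device *dev)
{
	if (iommu_ops->dev_to_group)
		return iommu_ops->dev_to_group(dev);
	return -ENODEV;	/* fits in s32 on all platforms */
}
EXPORT_SYMBOL_GPL(iommu_device_group);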

> +
>  int iommu_map(struct iommu_domain *domain, unsigned long iova,
> phys_addr_t paddr, int gfp_order, int prot)
>  {
> diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> index f02c34d..477259c 100644
> --- a/drivers/pci/intel-iommu.c
> +++ b/drivers/pci/intel-iommu.c
> @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
>  static int dmar_forcedac;
>  static int intel_iommu_strict;
>  static int intel_iommu_superpage = 1;
> +static int intel_iommu_no_mf_groups;
>  
>  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
>  static DEFINE_SPINLOCK(device_domain_lock);
> @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
>   printk(KERN_INFO
>   "Intel-IOMMU: disable supported super page\n");
>   intel_iommu_superpage = 0;
> + } else if (!strncmp(str, "no_mf_groups", 12)) {
> + printk(KERN_INFO
> + "Intel-IOMMU: disable separate groups for 
> multifunction devices\n");
> + intel_iommu_no_mf_groups = 1;

This should really be a global iommu option and not be VT-d specific.

>  
>   str += strcspn(str, ",");
> @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
>   return 0;
>  }
>  
> +/* Group numbers are arbitrary.  Device with the same group number
> + * indicate the iommu cannot differentiate between them.  To avoid
> + * tracking used groups we just use the seg|bus|devfn of the lowest
> + * level we're able to differentiate devices */
> +static long intel_iommu_dev_to_group(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct pci_dev *bridge;
> + union {
> + struct {
> + u8 devfn;
> + u8 bus;
> + u16 segment;
> + } pci;
> + u32 group;
> + } id;
> +
> + if (iommu_no_mapping(dev))

Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Pekka Enberg
Hi Stefan,

On Thu, Aug 25, 2011 at 1:31 PM, Stefan Hajnoczi  wrote:
>> It's obviously not competing. One thing you might want to consider is
>> making the guest interface compatible with ivshmem. Is there any reason
>> we shouldn't do that? I don't consider that a requirement, just nice to
>> have.
>
> The point of implementing the same interface as ivshmem is that users
> don't need to rejig guests or applications in order to switch between
> hypervisors.  A different interface also prevents same-to-same
> benchmarks.
>
> There is little benefit to creating another virtual device interface
> when a perfectly good one already exists.  The question should be: how
> is this shmem device different and better than ivshmem?  If there is
> no justification then implement the ivshmem interface.

So which interface are we actually talking about? Userspace/kernel in the
guest or hypervisor/guest kernel?

Either way, while it would be nice to share the interface, it's not a
*requirement* for tools/kvm unless ivshmem is specified in the virtio
spec or the driver is in mainline Linux. We don't intend to require people
to implement non-standard and non-Linux QEMU interfaces. OTOH,
ivshmem would make the PCI ID problem go away.

David, Sasha, thoughts?

Pekka


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Stefan Hajnoczi
On Thu, Aug 25, 2011 at 6:06 AM, Pekka Enberg  wrote:
> On Wed, 2011-08-24 at 21:49 -0700, David Evensky wrote:
>> On Wed, Aug 24, 2011 at 10:27:18PM -0500, Alexander Graf wrote:
>> >
>> > On 24.08.2011, at 17:25, David Evensky wrote:
>> >
>> > >
>> > >
>> > > This patch adds a PCI device that provides PCI device memory to the
>> > > guest. This memory in the guest exists as a shared memory segment in
>> > > the host. This is similar memory sharing capability of Nahanni
>> > > (ivshmem) available in QEMU. In this case, the shared memory segment
>> > > is exposed as a PCI BAR only.
>> > >
>> > > A new command line argument is added as:
>> > >    --shmem pci:0xc800:16MB:handle=/newmem:create
>> > >
>> > > which will set the PCI BAR at 0xc800, the shared memory segment
>> > > and the region pointed to by the BAR will be 16MB. On the host side
>> > > the shm_open handle will be '/newmem', and the kvm tool will create
>> > > the shared segment, set its size, and initialize it. If the size,
>> > > handle, or create flag are absent, they will default to 16MB,
>> > > handle=/kvm_shmem, and create will be false. The address family,
>> > > 'pci:' is also optional as it is the only address family currently
>> > > supported. Only a single --shmem is supported at this time.
>> >
>> > Did you have a look at ivshmem? It does that today, but also gives
>> > you an IRQ line so the guests can poke each other. For something as
>> > simple as this, I don't see why we'd need two competing
>> > implementations.
>>
>> Isn't ivshmem in QEMU? If so, then I don't think there is any
>> competition. How do you feel that these are competing?
>
> It's obviously not competing. One thing you might want to consider is
> making the guest interface compatible with ivshmem. Is there any reason
> we shouldn't do that? I don't consider that a requirement, just nice to
> have.

The point of implementing the same interface as ivshmem is that users
don't need to rejig guests or applications in order to switch between
hypervisors.  A different interface also prevents same-to-same
benchmarks.

There is little benefit to creating another virtual device interface
when a perfectly good one already exists.  The question should be: how
is this shmem device different and better than ivshmem?  If there is
no justification then implement the ivshmem interface.

Stefan


Re: [Qemu-devel] Guest kernel device compatability auto-detection

2011-08-25 Thread Richard W.M. Jones
On Thu, Aug 25, 2011 at 08:48:25AM +0100, Richard W.M. Jones wrote:
> On Thu, Aug 25, 2011 at 10:40:34AM +0300, Sasha Levin wrote:
> > From what I gathered libguestfs only provides access to the guests'
> > image.
> 
> Correct.
> 
> > Which part is doing the IKCONFIG or System.map probing? Or is it done in
> > a different way?
> 
> You'll have to see what Matt's doing in the virt-v2v code for the
> details, but in general we have full access to:
> 
>  - grub.conf (to determine which kernel will boot)
>  - the kernel image
>  - the corresponding System.map and config
>  - the modules directory
>  - the Xorg config
>  - boot.ini or BCD (to determine which NT kernel will boot)
>  - the Windows Registry
>  - the list of packages installed (to see if VMware-tools or some other
>guest agent is installed)
> 
> So working out what drivers are available is just a tedious matter of
> iterating across each of these places in the filesystem.

We had some interesting discussion on IRC about this.

Detecting if a guest "supports virtio" is a tricky problem, and it
goes beyond what the guest kernel can do.  For Linux guests you also
need to check what userspace can do.  This means unpacking the initrd
and checking for virtio drivers [in the general case this is
intractable, but you can do it for specific distros].

You also need to check that udev has the correct rules and that LVM is
configured to see VGs on /dev/vd* devices.

Console and Xorg configuration may also need to be checked (for
virtio-console and Cirrus/QXL support resp.)

virt-v2v does quite a lot of work to *enable* virtio drivers
including:

 - possibly installing a new kernel and updating grub

 - rebuilding the initrd to include virtio drivers

 - adjusting many different config files

 - removing other guest tools and Xen drivers

 - reconfiguring SELinux

 - adding viostor driver to Windows and adjusting the Windows Registry
   Critical Device Database

Of course virt-v2v confines itself to specific known guests, and we
test it like crazy.

Here is the code:

http://git.fedorahosted.org/git/?p=virt-v2v.git;a=blob;f=lib/Sys/VirtConvert/Converter/RedHat.pm;hb=HEAD
http://git.fedorahosted.org/git/?p=virt-v2v.git;a=blob;f=lib/Sys/VirtConvert/Converter/Windows.pm;hb=HEAD

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://et.redhat.com/~rjones/virt-df/


Re: [PATCH 11/11] KVM: MMU: improve write flooding detected

2011-08-25 Thread Xiao Guangrong
On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
>> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
>>
>>>> And, i think there are not problems since: if the spte without accessed
>>>> bit is written frequently, it means the guest page table is accessed
>>>> infrequently or during the writing, the guest page table is not accessed;
>>>> in this time, zapping this shadow page is not bad.
>>>
>>> Think of the following scenario:
>>>
>>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
>>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
>>> is not increased.
>>> 3) repeat
>>>
>>
>> I think the result is just what we hoped for: we do not want to zap the
>> shadow page because the spte is currently used by the guest, and it will
>> also be used in the next repetition. So not increasing
>> 'write_flooding_count' is a good choice.
> 
> Its not used. Step 2) is write to write protected shadow page at
> gfnA.
> 
>> Let's consider what will happen if we increase 'write_flooding_count':
>> 1: after three repetitions, zap the shadow page
>> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
>> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
>>shadow page can be zapped again, repeat 1 to 3.
> 
> The shadow page will not be zapped because the spte created from
> gfnA+indexA has the accessed bit set:
> 
>if (spte && !(*spte & shadow_accessed_mask))
>sp->write_flooding_count++;
>else
>sp->write_flooding_count = 0;
> 

Marcelo, I am still confused by your example: in step 3), what is repeated?
Does it repeat step 2), or steps 1) and 2)?

Only step 2) is repeated, I guess, right? If so, it works well:
when the guest writes the gpte, the spte of the corresponding shadow page is
zapped (level > 1) or speculatively fetched (level == 1); the accessed bit is
cleared in both cases.

A later write can then detect that the accessed bit is not set, and
write_flooding_count is increased. Finally, the shadow page is zapped and the
gpte is written directly.
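In other words, a toy model of that flow (just my paraphrase of the patch
logic, not the kernel code itself):

#define SHADOW_ACCESSED_MASK	(1ULL << 5)	/* assumption: the bit varies */

struct shadow_page { int write_flooding_count; };

/* returns nonzero when the shadow page should be zapped */
static int track_gpte_write(struct shadow_page *sp, unsigned long long *spte)
{
	if (spte && !(*spte & SHADOW_ACCESSED_MASK))
		sp->write_flooding_count++;	/* written but never accessed */
	else
		sp->write_flooding_count = 0;	/* guest used it: keep the page */

	if (spte)
		*spte &= ~SHADOW_ACCESSED_MASK;	/* a write clears the accessed bit */

	return sp->write_flooding_count >= 3;
}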

>> The result is the shadow page for gfnA is alloced and zapped again and again,
>> yes?
> 
> The point is you cannot rely on the accessed bit of sptes that have been
> instantiated with the accessed bit set to decide whether or not to zap.
> Because the accessed bit will only be cleared on host memory pressure.
> 

But the accessed bit is also cleared after spte is written.


Re: [Qemu-devel] Guest kernel device compatability auto-detection

2011-08-25 Thread Richard W.M. Jones
On Thu, Aug 25, 2011 at 10:40:34AM +0300, Sasha Levin wrote:
> From what I gathered libguestfs only provides access to the guests'
> image.

Correct.

> Which part is doing the IKCONFIG or System.map probing? Or is it done in
> a different way?

You'll have to see what Matt's doing in the virt-v2v code for the
details, but in general we have full access to:

 - grub.conf (to determine which kernel will boot)
 - the kernel image
 - the corresponding System.map and config
 - the modules directory
 - the Xorg config
 - boot.ini or BCD (to determine which NT kernel will boot)
 - the Windows Registry
 - the list of packages installed (to see if VMware-tools or some other
   guest agent is installed)

So working out what drivers are available is just a tedious matter of
iterating across each of these places in the filesystem.
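For the IKCONFIG case specifically, the probe can be as dumb as scanning for
the IKCFG_ST marker that CONFIG_IKCONFIG embeds in the kernel (the config
follows as a gzip stream, terminated by IKCFG_ED). A toy sketch of mine, not
virt-v2v code -- note it only works on an uncompressed vmlinux; a bzImage
has to be decompressed first, as scripts/extract-ikconfig does:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	static char buf[1 << 20];
	size_t n, off = 0;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <vmlinux>\n", argv[0]);
		return 1;
	}
	f = fopen(argv[1], "rb");
	if (!f) {
		perror("fopen");
		return 1;
	}
	/* naive scan; a real tool would also handle a match spanning chunks */
	while ((n = fread(buf, 1, sizeof(buf), f)) >= 8) {
		size_t i;
		for (i = 0; i + 8 <= n; i++) {
			if (!memcmp(buf + i, "IKCFG_ST", 8)) {
				printf("in-kernel config at offset %zu\n",
				       off + i);
				fclose(f);
				return 0;
			}
		}
		off += n;
	}
	fclose(f);
	fprintf(stderr, "no IKCFG_ST marker found\n");
	return 2;
}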

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://et.redhat.com/~rjones/virt-top


Re: [Qemu-devel] Guest kernel device compatability auto-detection

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 08:32 +0100, Richard W.M. Jones wrote:
> On Thu, Aug 25, 2011 at 08:33:04AM +0300, Avi Kivity wrote:
> > On 08/25/2011 08:21 AM, Sasha Levin wrote:
> > >Hi,
> > >
> > >Currently when we run the guest we treat it as a black box, we're not
> > >quite sure what it's going to start and whether it supports the same
> > >features we expect it to support when running it from the host.
> > >
> > >This forces us to start the guest with the safest defaults possible, for
> > >example: '-drive file=my_image.qcow2' will be started with slow IDE
> > >emulation even though the guest is capable of virtio.
> > >
> > >I'm currently working on a method to try and detect whether the guest
> > >kernel has specific configurations enabled and either warn the user if
> > >we know the kernel is not going to properly work or use better defaults
> > >if we know some advanced features are going to work.
> > >
> > >How am I planning to do it? First, we'll try finding which kernel the
> > >guest is going to boot (easy when user does '-kernel', less easy when
> > >the user boots an image). For simplicity sake I'll stick with the
> > >'-kernel' option for now.
> > >
> > >Once we have the kernel we can do two things:
> > >  1. See if the kernel was built with CONFIG_IKCONFIG.
> > >
> > >  2. Try finding the System.map which belongs to the kernel, it's
> > >provided with all distro kernels so we can expect it to be around. If we
> > >did find it we repeat the same process as in #1.
> > >
> > >If we found one of the above, we start matching config sets ("we need
> > >a,b,c,d for virtio, let's see if it's all there"). Once we find a good
> > >config set, we use it for defaults. If we didn't find a good config set
> > >we warn the user and don't even bother starting the guest.
> > >
> > >If we couldn't find either, we can just default to whatever we have as
> > >defaults now.
> > >
> > >
> > >To sum it up, I was wondering if this approach has been considered
> > >before and whether it sounds interesting enough to try.
> > >
> > 
> > This is a similar problem to p2v or v2v - taking a guest that used
> > to run on physical or virtual hardware, and modifying it to run on
> > (different) virtual hardware.  The first step is what you're looking
> > for - detecting what the guest currently supports.
> > 
> > You can look at http://libguestfs.org/virt-v2v/ for an example.  I'm
> > also copying Richard Jones, who maintains libguestfs, which does the
> > actual poking around in the guest.
> 
> Yes, as Avi says, we do all of the above already.  Including
> for Windows guests.

From what I gathered, libguestfs only provides access to the guests'
image.

Which part is doing the IKCONFIG or System.map probing? Or is it done in
a different way?

-- 

Sasha.



Re: [PATCH 11/11] KVM: MMU: improve write flooding detected

2011-08-25 Thread Xiao Guangrong
On 08/25/2011 10:04 AM, Marcelo Tosatti wrote:

>>> Yes, in this case, the sp is not zapped, but it is hard to know that the
>>> gfn is not used as a gpte just from the writes; for example, the guest can
>>> change the mapping address or the status bit, and so on... The sp can be
>>> zapped if the guest writes it again (on the same address), i think that is
>>> acceptable; anyway, it is just a speculative way to zap the unused
>>> shadow page... your opinion?
>>
>> It could increase the flood count independently of the accessed bit of
>> the spte being updated, zapping after 3 attempts as it is now.
>>
>> But additionally reset the flood count if the gpte appears to be valid
>> (points to an existent gfn if the present bit is set, or if it's zeroed).
> 
> Well not zero, as thats a common pattern for non ptes.
> 

Hi Marcelo,

Maybe it is not good, I think, for a few reasons:
- checking that the gfn pointed to by the gpte is valid has high overhead:
  it needs to call gfn_to_hva to walk the memslots, and kvm_mmu_pte_write
  is called very frequently with shadow mmu.

- a MMIO gfn is not an existent gfn, but it is validly pointed to by a gpte

- we can check the reserved bits in the gpte to see whether it is a valid
  gpte, but for some paging modes, all bits are valid (for example, non-PAE
  mode)

- it can not work if the gfn has multiple shadow pages, for example:
  if the gfn was used as a PDE and later is used as a PTE, then we have two
  shadow pages: sp1.level = 2, sp2.level = 1, and sp1 can not be zapped even
  though it is not used anymore.

- sometimes we need to zap the shadow page even though the gpte is written
  validly: if the gpte is written frequently but infrequently accessed, we
  had better zap the shadow page so it stays writable (written directly
  without #PF) and map it again when it is accessed; one example is from
  Avi: the guest OS may update many gptes at one time after one page fault.


Re: [Qemu-devel] Guest kernel device compatability auto-detection

2011-08-25 Thread Richard W.M. Jones
On Thu, Aug 25, 2011 at 08:33:04AM +0300, Avi Kivity wrote:
> On 08/25/2011 08:21 AM, Sasha Levin wrote:
> >Hi,
> >
> >Currently when we run the guest we treat it as a black box, we're not
> >quite sure what it's going to start and whether it supports the same
> >features we expect it to support when running it from the host.
> >
> >This forces us to start the guest with the safest defaults possible, for
> >example: '-drive file=my_image.qcow2' will be started with slow IDE
> >emulation even though the guest is capable of virtio.
> >
> >I'm currently working on a method to try and detect whether the guest
> >kernel has specific configurations enabled and either warn the user if
> >we know the kernel is not going to properly work or use better defaults
> >if we know some advanced features are going to work.
> >
> >How am I planning to do it? First, we'll try finding which kernel the
> >guest is going to boot (easy when user does '-kernel', less easy when
> >the user boots an image). For simplicity sake I'll stick with the
> >'-kernel' option for now.
> >
> >Once we have the kernel we can do two things:
> >  1. See if the kernel was built with CONFIG_IKCONFIG.
> >
> >  2. Try finding the System.map which belongs to the kernel, it's
> >provided with all distro kernels so we can expect it to be around. If we
> >did find it we repeat the same process as in #1.
> >
> >If we found one of the above, we start matching config sets ("we need
> >a,b,c,d for virtio, let's see if it's all there"). Once we find a good
> >config set, we use it for defaults. If we didn't find a good config set
> >we warn the user and don't even bother starting the guest.
> >
> >If we couldn't find either, we can just default to whatever we have as
> >defaults now.
> >
> >
> >To sum it up, I was wondering if this approach has been considered
> >before and whether it sounds interesting enough to try.
> >
> 
> This is a similar problem to p2v or v2v - taking a guest that used
> to run on physical or virtual hardware, and modifying it to run on
> (different) virtual hardware.  The first step is what you're looking
> for - detecting what the guest currently supports.
> 
> You can look at http://libguestfs.org/virt-v2v/ for an example.  I'm
> also copying Richard Jones, who maintains libguestfs, which does the
> actual poking around in the guest.

Yes, as Avi says, we do all of the above already.  Including
for Windows guests.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Pekka Enberg

On 8/25/11 10:20 AM, Asias He wrote:
> On Thu, Aug 25, 2011 at 3:02 PM, Pekka Enberg  wrote:
>> On 8/25/11 9:30 AM, Asias He wrote:
>>> On Thu, Aug 25, 2011 at 1:54 PM, Pekka Enberg  wrote:
>>>> On 8/25/11 8:34 AM, Asias He wrote:
>>>>> Hi, David
>>>>>
>>>>> On Thu, Aug 25, 2011 at 6:25 AM, David Evensky
>>>>>  wrote:
>>>>>> This patch adds a PCI device that provides PCI device memory to the
>>>>>> guest. This memory in the guest exists as a shared memory segment in
>>>>>> the host. This is similar memory sharing capability of Nahanni
>>>>>> (ivshmem) available in QEMU. In this case, the shared memory segment
>>>>>> is exposed as a PCI BAR only.
>>>>>>
>>>>>> A new command line argument is added as:
>>>>>> --shmem pci:0xc800:16MB:handle=/newmem:create
>>>>>>
>>>>>> which will set the PCI BAR at 0xc800, the shared memory segment
>>>>>> and the region pointed to by the BAR will be 16MB. On the host side
>>>>>> the shm_open handle will be '/newmem', and the kvm tool will create
>>>>>> the shared segment, set its size, and initialize it. If the size,
>>>>>> handle, or create flag are absent, they will default to 16MB,
>>>>>> handle=/kvm_shmem, and create will be false.
>>>>>
>>>>> I think it's better to use a default BAR address if user does not
>>>>> specify one as well. This way,
>>>>>
>>>>> ./kvm --shmem
>>>>>
>>>>> will work with default values with zero configuration.
>>>>
>>>> Does that sort of thing make sense here? It's a special purpose device
>>>> and the guest is expected to ioremap() the memory so it needs to
>>>> know the BAR.
>>>
>>> I mean a default bar address for --shmem device.  Yes, guest needs to know
>>> this address, but even if we specify the address at command line the guest
>>> still does not know this address, no? So having a default bar address does
>>> no harm.
>>
>> How does the user discover what the default BAR is? Which default BAR
>> should we use? I don't think default BAR adds much value here.
>
> 1. Print it on startup, like we do for --name.
>    # kvm run -k ./bzImage -m 448 -c 4 --name guest-26676 --shmem bar=0xc800
>
> or
>
> 2. kvm stat --shmem
>
> David has chosen a default BAR already.
> #define SHMEM_DEFAULT_ADDR (0xc800)

OK. Makes sense.

Pekka


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Asias He
On Thu, Aug 25, 2011 at 3:02 PM, Pekka Enberg  wrote:
> On 8/25/11 9:30 AM, Asias He wrote:
>>
>> On Thu, Aug 25, 2011 at 1:54 PM, Pekka Enberg  wrote:
>>>
>>> On 8/25/11 8:34 AM, Asias He wrote:
>>>
>>> Hi, David
>>>
>>> On Thu, Aug 25, 2011 at 6:25 AM, David Evensky
>>>  wrote:

>>>> This patch adds a PCI device that provides PCI device memory to the
>>>> guest. This memory in the guest exists as a shared memory segment in
>>>> the host. This is similar memory sharing capability of Nahanni
>>>> (ivshmem) available in QEMU. In this case, the shared memory segment
>>>> is exposed as a PCI BAR only.
>>>>
>>>> A new command line argument is added as:
>>>>    --shmem pci:0xc800:16MB:handle=/newmem:create
>>>>
>>>> which will set the PCI BAR at 0xc800, the shared memory segment
>>>> and the region pointed to by the BAR will be 16MB. On the host side
>>>> the shm_open handle will be '/newmem', and the kvm tool will create
>>>> the shared segment, set its size, and initialize it. If the size,
>>>> handle, or create flag are absent, they will default to 16MB,
>>>> handle=/kvm_shmem, and create will be false.
>>>
>>> I think it's better to use a default BAR address if user does not specify
>>> one as well.
>>> This way,
>>>
>>> ./kvm --shmem
>>>
>>> will work with default values with zero configuration.
>>>
>>> Does that sort of thing make sense here? It's a special purpose device
>>> and the guest is expected to ioremap() the memory so it needs to
>>> know the BAR.
>>
>> I mean a default bar address for --shmem device.  Yes, guest needs to know
>> this address, but even if we specify the address at command line the guest
>> still
>> does not know this address, no? So having a default bar address does no
>> harm.
>
> How does the user discover what the default BAR is? Which default BAR
> should we use? I don't think default BAR adds much value here.

1. Print it on startup, like we do for --name.
  # kvm run -k ./bzImage -m 448 -c 4 --name guest-26676 --shmem bar=0xc800

or

2. kvm stat --shmem

David has chosen a default BAR already.
#define SHMEM_DEFAULT_ADDR (0xc800)


>>>> The address family,
>>>> 'pci:' is also optional as it is the only address family currently
>>>> supported. Only a single --shmem is supported at this time.
>>>
>>> So, let's drop the 'pci:' prefix.
>>>
>>> That means the user interface will change if someone adds new address
>>> families. So we should keep the prefix, no?
>>
>> We can have a more flexible option format which does not depend on the
>> order of
>> args, e.g.:
>>
>> --shmem bar=0xc800,size=16MB,handle=/newmem,ops=create, type=pci
>>
>> if user does not specify sub-args, just use the default one.
>
> Sure, makes sense.
>
>                                    Pekka
>



-- 
Asias He


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Pekka Enberg

On 8/25/11 9:30 AM, Asias He wrote:
> On Thu, Aug 25, 2011 at 1:54 PM, Pekka Enberg  wrote:
>> On 8/25/11 8:34 AM, Asias He wrote:
>>> Hi, David
>>>
>>> On Thu, Aug 25, 2011 at 6:25 AM, David Evensky  wrote:
>>>> This patch adds a PCI device that provides PCI device memory to the
>>>> guest. This memory in the guest exists as a shared memory segment in
>>>> the host. This is similar memory sharing capability of Nahanni
>>>> (ivshmem) available in QEMU. In this case, the shared memory segment
>>>> is exposed as a PCI BAR only.
>>>>
>>>> A new command line argument is added as:
>>>> --shmem pci:0xc800:16MB:handle=/newmem:create
>>>>
>>>> which will set the PCI BAR at 0xc800, the shared memory segment
>>>> and the region pointed to by the BAR will be 16MB. On the host side
>>>> the shm_open handle will be '/newmem', and the kvm tool will create
>>>> the shared segment, set its size, and initialize it. If the size,
>>>> handle, or create flag are absent, they will default to 16MB,
>>>> handle=/kvm_shmem, and create will be false.
>>>
>>> I think it's better to use a default BAR address if user does not specify
>>> one as well. This way,
>>>
>>> ./kvm --shmem
>>>
>>> will work with default values with zero configuration.
>>
>> Does that sort of thing make sense here? It's a special purpose device
>> and the guest is expected to ioremap() the memory so it needs to
>> know the BAR.
>
> I mean a default bar address for --shmem device.  Yes, guest needs to know
> this address, but even if we specify the address at command line the guest
> still does not know this address, no? So having a default bar address does
> no harm.

How does the user discover what the default BAR is? Which default BAR
should we use? I don't think default BAR adds much value here.

>>>> The address family,
>>>> 'pci:' is also optional as it is the only address family currently
>>>> supported. Only a single --shmem is supported at this time.
>>>
>>> So, let's drop the 'pci:' prefix.
>>
>> That means the user interface will change if someone adds new address
>> families. So we should keep the prefix, no?
>
> We can have a more flexible option format which does not depend on the
> order of args, e.g.:
>
> --shmem bar=0xc800,size=16MB,handle=/newmem,ops=create, type=pci
>
> if user does not specify sub-args, just use the default one.

Sure, makes sense.

Pekka
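For what it's worth, a parser for that comma-separated form is short; a
sketch of mine (not the actual patch code), using the defaults from David's
description:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct shmem_opts {
	unsigned long long bar, size;
	char handle[64];
	int create;
};

static void parse_shmem_arg(char *arg, struct shmem_opts *o)
{
	char *tok;

	o->bar = 0xc800;		/* SHMEM_DEFAULT_ADDR, per above */
	o->size = 16ULL << 20;		/* 16MB */
	snprintf(o->handle, sizeof(o->handle), "/kvm_shmem");
	o->create = 0;

	for (tok = strtok(arg, ","); tok; tok = strtok(NULL, ",")) {
		if (!strncmp(tok, "bar=", 4))
			o->bar = strtoull(tok + 4, NULL, 0);
		else if (!strncmp(tok, "size=", 5))
			o->size = strtoull(tok + 5, NULL, 0); /* suffixes omitted */
		else if (!strncmp(tok, "handle=", 7))
			snprintf(o->handle, sizeof(o->handle), "%s", tok + 7);
		else if (!strcmp(tok, "ops=create"))
			o->create = 1;
		/* "type=pci" accepted implicitly: pci is the only family */
	}
}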