Re: [RFC 01/12] kvm tools: Split kvm_cmd_run into init, work and uninit

2011-12-20 Thread Sasha Levin
On Mon, 2011-12-19 at 23:26 +0200, Pekka Enberg wrote:
> On Mon, 2011-12-19 at 15:58 +0200, Sasha Levin wrote:
> > +int kvm_cmd_run(int argc, const char **argv, const char *prefix)
> > +{
> > +   int r, ret;
> > +
> > +   r = kvm_cmd_run_init(argc, argv);
> > +   ret = kvm_cmd_run_work();
> > +   r = kvm_cmd_run_uninit(ret);
> > +
> > +   return ret;
> >  }
> 
> What's going on here? Why do you bother saving 'r' if you don't use it
> for anything?

It was part of my plans to get kvm_cmd_run_{init, uninit} as a simple
for(;;) through an init/uninit function pointer array; right now it's
simply meaningless there.
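
For illustration only, a loop over an init/uninit function pointer array could look roughly like the sketch below. None of this is code from the series: struct run_stage, run_init_all(), run_uninit_all() and the demo hooks are hypothetical names, and the sketch assumes that every hook returns a negative value on failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-stage hooks; none of these names exist in kvm tools. */
struct run_stage {
	int (*init)(void);
	int (*uninit)(void);
};

/*
 * Call every init hook in order. On the first failure, unwind the stages
 * that already came up (in reverse order) and return the error code.
 */
static int run_init_all(const struct run_stage *stages, size_t n)
{
	size_t i;

	assert(stages != NULL || n == 0);

	for (i = 0; i < n; i++) {
		int r = stages[i].init();

		if (r < 0) {
			while (i-- > 0)
				stages[i].uninit();
			return r;
		}
	}
	return 0;
}

/* Tear everything down in reverse order; keep the first error encountered. */
static int run_uninit_all(const struct run_stage *stages, size_t n)
{
	int ret = 0;

	assert(stages != NULL || n == 0);

	while (n-- > 0) {
		int r = stages[n].uninit();

		if (r < 0 && ret == 0)
			ret = r;
	}
	return ret;
}

/* Demo hooks: stages_up counts how many stages are currently initialised. */
static int stages_up;

static int ok_init(void)   { stages_up++; return 0; }
static int ok_uninit(void) { stages_up--; return 0; }
static int bad_init(void)  { return -1; }
```

With something like this, kvm_cmd_run() would actually consume the return values: bail out if run_init_all() fails, otherwise run the work step and finish with run_uninit_all().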

-- 

Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AMD iommu] pci Failed to assign device "hostdev0" : Device or resource busy

2011-12-20 Thread Andreas Hartmann
Hello kvm-list!

I sent the following text to Don with some additional detailed log
files. If somebody else needs them too, I can provide them by private mail.


Kind regards,
Andreas Hartmann



Andreas Hartmann wrote:
> Hello Don,
> 
> thank you for your reply!
> 
> I just want to describe in short the two problems I encounter (my
> description was a bit chaotic :-) I hope it's better now :-)). I
> attached details in two files (dmesg.bz2 and lspci.bz2). It's the
> untouched raw output; no unbind or anything else had been done.
> 
> General hardware setup for a quick overview:
> 
> -[:00]-+-00.0  ATI Technologies Inc RD890 PCI to PCI bridge (external gfx0 port B)
>        +-00.2  ATI Technologies Inc Device 5a23
>        +-02.0-[01]--+-00.0  ATI Technologies Inc Device 6759
>        |            \-00.1  ATI Technologies Inc Device aa90
>        +-04.0-[02]00.0  Device 1b6f:7023
>        +-05.0-[03]00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>        +-09.0-[04]00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>        +-0a.0-[05]00.0  Device 1b6f:7023
>        +-11.0  ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode]
>        +-12.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
>        +-12.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller
>        +-13.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
>        +-13.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller
>        +-14.0  ATI Technologies Inc SBx00 SMBus Controller
>        +-14.1  ATI Technologies Inc SB700/SB800 IDE Controller
>        +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
>        +-14.3  ATI Technologies Inc SB700/SB800 LPC host controller
>        +-14.4-[06]07.0  RaLink RT2800 802.11n PCI
>        +-14.5  ATI Technologies Inc SB700/SB800 USB OHCI2 Controller
>        +-15.0-[07]--
>        +-16.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
>        +-16.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller
>        +-18.0  Advanced Micro Devices [AMD] Device 1600
>        +-18.1  Advanced Micro Devices [AMD] Device 1601
>        +-18.2  Advanced Micro Devices [AMD] Device 1602
>        +-18.3  Advanced Micro Devices [AMD] Device 1603
>        +-18.4  Advanced Micro Devices [AMD] Device 1604
>        \-18.5  Advanced Micro Devices [AMD] Device 1605
> 
> The relevant devices are (for problem 1):
> 
> 06:07.0 Network controller: RaLink RT2800 802.11n PCI
> Subsystem: Linksys Device 0067
> Flags: bus master, slow devsel, latency 32, IRQ 21
> Memory at fd8e (32-bit, non-prefetchable) [size=64K]
> Capabilities: [40] Power Management version 3
> Kernel driver in use: rt2800pci
> 
> with the PCI-PCI bridge above:
> 
> 00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40) 
> (prog-if 01 [Subtractive decode])
> Flags: bus master, VGA palette snoop, 66MHz, medium devsel, latency 64
> Bus: primary=00, secondary=06, subordinate=06, sec-latency=64
> I/O behind bridge: 9000-9fff
> Memory behind bridge: fd80-fd8f
> Prefetchable memory behind bridge: fd70-fd7f
> 
> 
> 
> and (for problem 2):
> 
> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B 
> PCI Express Gigabit Ethernet controller (rev 01)
> Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express 
> Gigabit Ethernet controller
> Flags: bus master, fast devsel, latency 0, IRQ 10
> I/O ports at de00 [size=256]
> Memory at fdbff000 (64-bit, non-prefetchable) [size=4K]
> [virtual] Expansion ROM at fda0 [disabled] [size=128K]
> Capabilities: [40] Power Management version 2
> Capabilities: [48] Vital Product Data
> Capabilities: [50] MSI: Enable- Count=1/2 Maskable- 64bit+
> Capabilities: [60] Express Endpoint, MSI 00
> Capabilities: [84] Vendor Specific Information: Len=4c 
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [12c] Virtual Channel
> Capabilities: [148] Device Serial Number f1-11-00-00-68-4c-e0-00
> Capabilities: [154] Power Budgeting 
> 
> 04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B 
> PCI Express Gigabit Ethernet controller (rev 06)
> Subsystem: Giga-byte Technology GA-EP45-DS5 Motherboard
> Flags: bus master, fast devsel, latency 0, IRQ 10
> I/O ports at ce00 [size=256]
> Memory at fd6ff000 (64-bit, prefetchable) [size=4K]
> Memory at fd6f8000 (64-bit, prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> Capabilities: [70] Express E

Re: [Qemu-devel] [PATCH v5 00/16] uq/master: Introduce basic irqchip support

2011-12-20 Thread Jan Kiszka
On 2011-12-20 04:10, Anthony Liguori wrote:
> On 12/19/2011 08:46 PM, Anthony Liguori wrote:
>> On 12/19/2011 07:19 PM, Jan Kiszka wrote:
>>> On 2011-12-20 02:08, Anthony Liguori wrote:
>> Here's how we solve this problem:
>>
>> 1) In the short term, advertise both devices as having the same VMstate
>> name. Since we don't register until the device is instantiated, this
>> will Just Work and is easy.
>>
>> 2) In the not so short term, we'll have Mike Roth's Visitor series land
>> in the tree (Juan promised me it will be in his next pull request).
>>
>> 3) Once we have the Visitor infrastructure in place, we can introduce a
>> self-describing migration format (that will also use QOM path names).
>> With a self-describing format, we can read all of the data from the
>> wire into memory without consulting devices.
>>
>> 4) We now have the ability to arbitrarily manipulate this tree in
>> memory. It's just a matter of writing a small tree transformer that
>> converts the KVM-APIC state to the APIC device state (by just renaming
>> a level of the tree). Heck, we could even map fields if we needed to
>> (although we should probably avoid divergence if at all possible).
> 
> The way this would work is that something would register a migration "filter"
> when a userspace APIC was instantiated.  Maybe that's the device itself
> or maybe it's some centralized logic.  At any rate, since we have a
> self-describing format (and maybe it's just JSON), we can build a QObject.
> 
> The filters would get called with the QObject before it was decoded and
> dispatched to devices.  It would look something like:
> 
> static QDict *kvm_apic_to_userspace_apic(QDict *state, void *opaque)
> {
>     /* strcmp() returns 0 on a match */
>     if (strcmp(qdict_get_str(state, "__type__"), "kvm-apic") == 0) {
>         QDict *userspace_apic = qdict_new();
>         const char *key;
> 
>         /* copy every field, taking a reference on each value */
>         qdict_foreach_key(&key, state) {
>             QObject *value = qdict_get(state, key);
> 
>             qobject_incref(value);
>             qdict_put_obj(userspace_apic, key, value);
>         }
>         /* rename the type so the userspace APIC accepts the state */
>         qdict_put_str(userspace_apic, "__type__", "apic");
>         return userspace_apic;
>     } else {
>         qobject_incref(state);
>         return state;
>     }
> }
> 
> The same sort of filter function could also handle migration
> compatibility between virtio-blk-pci and a pair of virtio-blk/virtio-pci
> devices.  It would simply match on the __type__ of "virtio-blk-pci", and
> then split apart the state into an appropriate "virtio-pci" dictionary
> and a "virtio-blk" dictionary.
> 
> This is just pseudo-code, mind you.  We'll need to think carefully about
> how we recurse and apply these filters.  But it will be an extremely
> powerful mechanism that will let us solve most of these compatibility
> problems in an elegant way.

Another approach, which also solves an issue the above does not, goes like
this:

Use some device alias as name for saving, and also accept this for
addressing the device in a running VM. The latter would allow for
/path/to/the/ioapic to always point you to the currently used IOAPIC
version, no matter if it is actually kvm-ioapic or [qemu-]ioapic. This
feature was requested by Avi back then. It doesn't map to existing
features directly, though.

In any case, I'm not going to touch a line of code until there is
consensus about the way to go.

Jan





Re: [PATCH 1/2] kvm: Remove ability to assign a device without iommu support

2011-12-20 Thread Sasha Levin
On Mon, 2011-12-19 at 20:19 -0700, Alex Williamson wrote:
> This option has no users and it exposes a security hole that we
> can allow devices to be assigned without iommu protection.  Make
> KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.
> 
> Signed-off-by: Alex Williamson 
> ---
> 
>  virt/kvm/assigned-dev.c |   18 +-
>  1 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 3ad0925..a251a28 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -487,6 +487,9 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>   struct kvm_assigned_dev_kernel *match;
>   struct pci_dev *dev;
>  
> + if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
> + return -EINVAL;

Could we just drop KVM_DEV_ASSIGN_ENABLE_IOMMU and do it by default?
Calling KVM_ASSIGN_PCI_DEVICE without that flag set is pretty
meaningless.

-- 

Sasha.



Re: [RFC 01/12] kvm tools: Split kvm_cmd_run into init, work and uninit

2011-12-20 Thread Asias He
On 12/20/2011 06:09 PM, Sasha Levin wrote:
> On Mon, 2011-12-19 at 23:26 +0200, Pekka Enberg wrote:
>> On Mon, 2011-12-19 at 15:58 +0200, Sasha Levin wrote:
>>> +int kvm_cmd_run(int argc, const char **argv, const char *prefix)
>>> +{
>>> +   int r, ret;
>>> +
>>> +   r = kvm_cmd_run_init(argc, argv);
>>> +   ret = kvm_cmd_run_work();
>>> +   r = kvm_cmd_run_uninit(ret);
>>> +
>>> +   return ret;
>>>  }
>>
>> What's going on here? Why do you bother saving 'r' if you don't use it
>> for anything?
> 
> It was part of my plans to get kvm_cmd_run_{init, uninit} as a simple

Can we have a shorter name for 'uninit', e.g. 'fini'? Then we would have
{init, fini}.

> for(;;) through an init/uninit function pointer array; right now it's
> simply meaningless there.
> 


-- 
Asias He


Re: [PATCH 1/2] kvm: Remove ability to assign a device without iommu support

2011-12-20 Thread Jan Kiszka
On 2011-12-20 09:49, Sasha Levin wrote:
> On Mon, 2011-12-19 at 20:19 -0700, Alex Williamson wrote:
>> This option has no users and it exposes a security hole that we
>> can allow devices to be assigned without iommu protection.  Make
>> KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.
>>
>> Signed-off-by: Alex Williamson 
>> ---
>>
>>  virt/kvm/assigned-dev.c |   18 +-
>>  1 files changed, 9 insertions(+), 9 deletions(-)
>>
>> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
>> index 3ad0925..a251a28 100644
>> --- a/virt/kvm/assigned-dev.c
>> +++ b/virt/kvm/assigned-dev.c
>> @@ -487,6 +487,9 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>>  struct kvm_assigned_dev_kernel *match;
>>  struct pci_dev *dev;
>>  
>> +if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
>> +return -EINVAL;
> 
> Could we just drop KVM_DEV_ASSIGN_ENABLE_IOMMU and do it by default?
> Calling KVM_ASSIGN_PCI_DEVICE without that flag set is pretty
> meaningless.

There is that thing called "backward compatibility". :)

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH 1/2] kvm: Remove ability to assign a device without iommu support

2011-12-20 Thread Sasha Levin
On Tue, 2011-12-20 at 10:03 +0100, Jan Kiszka wrote:
> On 2011-12-20 09:49, Sasha Levin wrote:
> > On Mon, 2011-12-19 at 20:19 -0700, Alex Williamson wrote:
> >> This option has no users and it exposes a security hole that we
> >> can allow devices to be assigned without iommu protection.  Make
> >> KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.
> >>
> >> Signed-off-by: Alex Williamson 
> >> ---
> >>
> >>  virt/kvm/assigned-dev.c |   18 +-
> >>  1 files changed, 9 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> >> index 3ad0925..a251a28 100644
> >> --- a/virt/kvm/assigned-dev.c
> >> +++ b/virt/kvm/assigned-dev.c
> >> @@ -487,6 +487,9 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> >>struct kvm_assigned_dev_kernel *match;
> >>struct pci_dev *dev;
> >>  
> >> +  if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
> >> +  return -EINVAL;
> > 
> > Could we just drop KVM_DEV_ASSIGN_ENABLE_IOMMU and do it by default?
> > Calling KVM_ASSIGN_PCI_DEVICE without that flag set is pretty
> > meaningless.
> 
> There is that thing called "backward compatibility". :)

Well, Alex suggested skipping deprecation period because there are
currently no users of KVM_ASSIGN_PCI_DEVICE without
KVM_DEV_ASSIGN_ENABLE_IOMMU, so it should be fine to just make it the
default behavior, no?

We can leave KVM_DEV_ASSIGN_ENABLE_IOMMU itself so userspace won't
break, but there's no reason to enforce it being set in the kernel code.
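
To spell out the difference between the two policies, here is a small userspace sketch. It is not kernel code: struct assigned_dev_req is a hypothetical stand-in for the ioctl argument with only the flags field modelled, and assign_strict()/assign_implied() are made-up names. The flag value is the one from the kvm.h hunk quoted elsewhere on the list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)

/* Hypothetical stand-in for the ioctl argument; only flags is modelled. */
struct assigned_dev_req {
	uint32_t flags;
};

/* Policy of the posted patch: refuse any request that omits the flag. */
static int assign_strict(const struct assigned_dev_req *req)
{
	assert(req != NULL);

	if (!(req->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
		return -EINVAL;
	return 0;
}

/*
 * Policy suggested above: keep the flag defined so userspace still builds,
 * but treat it as always set instead of checking it.
 */
static int assign_implied(struct assigned_dev_req *req)
{
	assert(req != NULL);

	req->flags |= KVM_DEV_ASSIGN_ENABLE_IOMMU;
	return 0;
}
```

The visible difference for an old caller that never sets the flag: the strict variant fails with -EINVAL, while the implied variant silently gives it iommu-backed assignment.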

-- 

Sasha.



Re: [PATCH 1/2] kvm: Remove ability to assign a device without iommu support

2011-12-20 Thread Jan Kiszka
On 2011-12-20 10:08, Sasha Levin wrote:
> On Tue, 2011-12-20 at 10:03 +0100, Jan Kiszka wrote:
>> On 2011-12-20 09:49, Sasha Levin wrote:
>>> On Mon, 2011-12-19 at 20:19 -0700, Alex Williamson wrote:
>>>> This option has no users and it exposes a security hole that we
>>>> can allow devices to be assigned without iommu protection.  Make
>>>> KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.
>>>>
>>>> Signed-off-by: Alex Williamson 
>>>> ---
>>>>
>>>>  virt/kvm/assigned-dev.c |   18 +-
>>>>  1 files changed, 9 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
>>>> index 3ad0925..a251a28 100644
>>>> --- a/virt/kvm/assigned-dev.c
>>>> +++ b/virt/kvm/assigned-dev.c
>>>> @@ -487,6 +487,9 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>>>>  	struct kvm_assigned_dev_kernel *match;
>>>>  	struct pci_dev *dev;
>>>>  
>>>> +	if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
>>>> +		return -EINVAL;
>>>
>>> Could we just drop KVM_DEV_ASSIGN_ENABLE_IOMMU and do it by default?
>>> Calling KVM_ASSIGN_PCI_DEVICE without that flag set is pretty
>>> meaningless.
>>
>> There is that thing called "backward compatibility". :)
> 
> Well, Alex suggested skipping deprecation period because there are
> currently no users of KVM_ASSIGN_PCI_DEVICE without
> KVM_DEV_ASSIGN_ENABLE_IOMMU, so it should be fine to just make it the
> default behavior, no?

This iommu-less mode used to "work" for older qemu-kvm versions, and I
think it should still do so. Though it makes no sense, I fully agree.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH 1/2] kvm: Remove ability to assign a device without iommu support

2011-12-20 Thread Avi Kivity
On 12/20/2011 11:12 AM, Jan Kiszka wrote:
> > 
> > Well, Alex suggested skipping deprecation period because there are
> > currently no users of KVM_ASSIGN_PCI_DEVICE without
> > KVM_DEV_ASSIGN_ENABLE_IOMMU, so it should be fine to just make it the
> > default behavior, no?
>
> This iommu-less mode used to "work" for older qemu-kvm versions, and I
> think it should still do so. Though it makes no sense, I fully agree.
>

It only worked for special kernels that allowed 1:1 gpa/hpa mappings, IIRC.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 1/2] kvm: Remove ability to assign a device without iommu support

2011-12-20 Thread Avi Kivity
On 12/20/2011 05:19 AM, Alex Williamson wrote:
> This option has no users and it exposes a security hole that we
> can allow devices to be assigned without iommu protection.  Make
> KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.
>
> Signed-off-by: Alex Williamson 
> ---
>
>  virt/kvm/assigned-dev.c |   18 +-
>
Documentation/virtual/kvm/api.txt +++

-- 
error compiling committee.c: too many arguments to function



[PATCH] KVM: Move gfn_to_memslot() to kvm_host.h

2011-12-20 Thread Paul Mackerras
This moves gfn_to_memslot(), and the functions it calls, that is,
search_memslots() and __gfn_to_memslot(), from kvm_main.c to kvm_host.h
so that gfn_to_memslot() can be called from non-modular code even
when KVM is a module.  On powerpc, the Book3S HV style of KVM has
code that is called from real mode which needs to call gfn_to_memslot()
and thus needs this.  (Module code is allocated in the vmalloc region,
which can't be accessed in real mode.)

With this, we can remove builtin_gfn_to_memslot() from book3s_hv_rm_mmu.c
and thus eliminate a little bit of duplication.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |   23 ++-
 include/linux/kvm_host.h|   25 -
 virt/kvm/kvm_main.c |   25 -
 3 files changed, 26 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index d3e36fc..063b00c 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -21,25 +21,6 @@
 #include 
 #include 
 
-/*
- * Since this file is built in even if KVM is a module, we need
- * a local copy of this function for the case where kvm_main.c is
- * modular.
- */
-static struct kvm_memory_slot *builtin_gfn_to_memslot(struct kvm *kvm,
-   gfn_t gfn)
-{
-   struct kvm_memslots *slots;
-   struct kvm_memory_slot *memslot;
-
-   slots = kvm_memslots(kvm);
-   kvm_for_each_memslot(memslot, slots)
-   if (gfn >= memslot->base_gfn &&
- gfn < memslot->base_gfn + memslot->npages)
-   return memslot;
-   return NULL;
-}
-
 /* Translate address of a vmalloc'd thing to a linear map address */
 static void *real_vmalloc_addr(void *x)
 {
@@ -97,7 +78,7 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
rev = real_vmalloc_addr(&kvm->arch.revmap[pte_index]);
ptel = rev->guest_rpte;
gfn = hpte_rpn(ptel, hpte_page_size(hpte_v, ptel));
-   memslot = builtin_gfn_to_memslot(kvm, gfn);
+   memslot = gfn_to_memslot(kvm, gfn);
if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
return;
 
@@ -171,7 +152,7 @@ long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
/* Find the memslot (if any) for this address */
gpa = (ptel & HPTE_R_RPN) & ~(psize - 1);
gfn = gpa >> PAGE_SHIFT;
-   memslot = builtin_gfn_to_memslot(kvm, gfn);
+   memslot = gfn_to_memslot(kvm, gfn);
pa = 0;
is_io = ~0ul;
rmap = NULL;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ec79a45..109828f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -429,7 +429,6 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
  gpa_t gpa);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
-struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
@@ -649,6 +648,30 @@ static inline void kvm_guest_exit(void)
current->flags &= ~PF_VCPU;
 }
 
+static inline struct kvm_memory_slot *
+search_memslots(struct kvm_memslots *slots, gfn_t gfn)
+{
+   struct kvm_memory_slot *memslot;
+
+   kvm_for_each_memslot(memslot, slots)
+   if (gfn >= memslot->base_gfn &&
+ gfn < memslot->base_gfn + memslot->npages)
+   return memslot;
+
+   return NULL;
+}
+
+static inline struct kvm_memory_slot *
+__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
+{
+   return search_memslots(slots, gfn);
+}
+
+static inline struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
+{
+   return __gfn_to_memslot(kvm_memslots(kvm), gfn);
+}
+
 static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
 {
return gfn_to_memslot(kvm, gfn)->id;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c144132..ef11529 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -640,19 +640,6 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 }
 #endif /* !CONFIG_S390 */
 
-static struct kvm_memory_slot *
-search_memslots(struct kvm_memslots *slots, gfn_t gfn)
-{
-   struct kvm_memory_slot *memslot;
-
-   kvm_for_each_memslot(memslot, slots)
-   if (gfn >= memslot->base_gfn &&
- gfn < memslot->base_gfn + memslot->npages)
-   return memslot;
-
-   return NULL;
-}
-
 static int cmp_memslot(const void *slot1, const void *slot2)
 {
struct kvm_memory_slot *s1, *s2;
@@ -1031,18 +1018,6 @@ int kvm_is_error_hva(unsigned long ad

Re: [PATCH 0/2] kvm: Lock down device assignment

2011-12-20 Thread Avi Kivity
On 12/20/2011 05:19 AM, Alex Williamson wrote:
> Two patches to try to better secure the device assignment ioctl.
> This first patch makes KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory
> option when assigning a device.  I don't believe we have any
> users of this option, so I think we can skip any deprecation
> period, especially since its existence is rather dangerous.
>
> The second patch introduces some file permission checking that Avi
> suggested.  If a user has been granted read/write permission to
> the PCI sysfs BAR resource files, this is a good indication that
> they have access to the device.  We can't call sys_faccessat
> directly (not exported), but the important bits are self contained
> enough to include directly.  This still works with sudo and libvirt
> usage, the latter already grants qemu permission to these files.
> Thanks,
>
>

Looks good, but please update the API documentation.

-- 
error compiling committee.c: too many arguments to function



[PATCH]: kvm-s390: add KVM_S390_GET/SET_SREGS2 call for additional hw regs

2011-12-20 Thread Christian Borntraeger
Avi, Marcelo,

let me know if you would prefer to reuse one of the other register load/save
ioctls that are still unused for s390 (e.g. XCRS).


From: Christian Borntraeger 

For guest relocation and virsh dump qemu needs an interface to
get/set additional registers from kvm. We also need the prefix
register for all guest memory accesses to the prefix pages.

The prefix register could also be set via the KVM_S390_SIGP_SET_PREFIX
interrupt ioctl, but I also added the synchronous operation to have

o symmetry: we want to have the same struct for get/set routine
o the interrupt is only delivered before entering the SIE, we also
  want to cover the sequence set prefix/store status at prefix

Signed-off-by: Christian Borntraeger 
---
 arch/s390/include/asm/kvm.h |9 +
 arch/s390/kvm/kvm-s390.c|   24 
 include/linux/kvm.h |4 
 3 files changed, 37 insertions(+)

Index: b/arch/s390/include/asm/kvm.h
===
--- a/arch/s390/include/asm/kvm.h
+++ b/arch/s390/include/asm/kvm.h
@@ -28,6 +28,15 @@ struct kvm_sregs {
__u64 crs[16];
 };
 
+/* for KVM_S390_GET_SREGS2 and KVM_S390_SET_SREGS2 */
+struct kvm_s390_sregs2 {
+   __u64 ckc;
+   __u64 cputm;
+   __u64 gbea;
+   __u32 todpr;
+   __u32 prefix;
+};
+
 /* for KVM_GET_FPU and KVM_SET_FPU */
 struct kvm_fpu {
__u32 fpc;
Index: b/arch/s390/kvm/kvm-s390.c
===
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -129,6 +129,7 @@ int kvm_dev_ioctl_check_extension(long e
case KVM_CAP_S390_PSW:
case KVM_CAP_S390_GMAP:
case KVM_CAP_SYNC_MMU:
+   case KVM_CAP_S390_SREGS2:
r = 1;
break;
default:
@@ -673,6 +674,29 @@ long kvm_arch_vcpu_ioctl(struct file *fi
case KVM_S390_INITIAL_RESET:
r = kvm_arch_vcpu_ioctl_initial_reset(vcpu);
break;
+   case KVM_S390_GET_SREGS2: {
+   struct kvm_s390_sregs2 sregs2;
+
+   sregs2.prefix = vcpu->arch.sie_block->prefix;
+   sregs2.gbea = vcpu->arch.sie_block->gbea;
+   sregs2.cputm = vcpu->arch.sie_block->cputm;
+   sregs2.ckc = vcpu->arch.sie_block->ckc;
+   sregs2.todpr = vcpu->arch.sie_block->todpr;
+   r = copy_to_user(argp, &sregs2, sizeof(sregs2));
+   break;
+   }
+   case KVM_S390_SET_SREGS2: {
+   struct kvm_s390_sregs2 sregs2;
+
+   vcpu->arch.sie_block->prefix = sregs2.prefix;
+   vcpu->arch.sie_block->gbea = sregs2.gbea;
+   vcpu->arch.sie_block->cputm = sregs2.cputm;
+   vcpu->arch.sie_block->ckc = sregs2.ckc;
+   vcpu->arch.sie_block->todpr = sregs2.todpr;
+   r = copy_from_user(&sregs2, argp, sizeof(sregs2));
+   vcpu->arch.sie_block->ihcpu = 0x;
+   break;
+   }
default:
r = -EINVAL;
}
Index: b/include/linux/kvm.h
===
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -557,6 +557,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_MAX_VCPUS 66   /* returns max vcpus per vm */
 #define KVM_CAP_PPC_PAPR 68
 #define KVM_CAP_S390_GMAP 71
+#define KVM_CAP_S390_SREGS2 72
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -762,6 +763,9 @@ struct kvm_clock_data {
 #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO,  0xa8, struct kvm_create_spapr_tce)
 /* Available with KVM_CAP_RMA */
 #define KVM_ALLOCATE_RMA _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
+/* Available with KVM_CAP_S390_SREGS2 */
+#define KVM_S390_GET_SREGS2 _IOR(KVMIO,  0xaa, struct kvm_s390_sregs2)
+#define KVM_S390_SET_SREGS2 _IOW(KVMIO,  0xab, struct kvm_s390_sregs2)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
 



Re: [PATCH]: kvm-s390: add KVM_S390_GET/SET_SREGS2 call for additional hw regs

2011-12-20 Thread Avi Kivity
On 12/20/2011 11:38 AM, Christian Borntraeger wrote:
> Avi, Marcelo,
>
> let me know if you would prefer to reuse one of the other register load/save
> ioctls that are still unused for s390 (e.g. XCRS).

No, the proposed names are fine.

>
> From: Christian Borntraeger 
>
> For guest relocation and virsh dump qemu needs an interface to
> get/set additional registers from kvm. We also need the prefix
> register for all guest memory accesses to the prefix pages.
>
> The prefix register could also be set via the KVM_S390_SIGP_SET_PREFIX
> interrupt ioctl, but I also added the synchronous operation to have
>
> o symmetry: we want to have the same struct for get/set routine
> o the interrupt is only delivered before entering the SIE, we also
>   want to cover the sequence set prefix/store status at prefix
>
> Signed-off-by: Christian Borntraeger 
> ---
>  arch/s390/include/asm/kvm.h |9 +
>  arch/s390/kvm/kvm-s390.c|   24 
>  include/linux/kvm.h |4 
>  3 files changed, 37 insertions(+)

The lack of documentation is not.


> @@ -673,6 +674,29 @@ long kvm_arch_vcpu_ioctl(struct file *fi
>   case KVM_S390_INITIAL_RESET:
>   r = kvm_arch_vcpu_ioctl_initial_reset(vcpu);
>   break;
> + case KVM_S390_GET_SREGS2: {
> + struct kvm_s390_sregs2 sregs2;
> +
> + sregs2.prefix = vcpu->arch.sie_block->prefix;
> + sregs2.gbea = vcpu->arch.sie_block->gbea;
> + sregs2.cputm = vcpu->arch.sie_block->cputm;
> + sregs2.ckc = vcpu->arch.sie_block->ckc;
> + sregs2.todpr = vcpu->arch.sie_block->todpr;
> + r = copy_to_user(argp, &sregs2, sizeof(sregs2));

Need to return -EFAULT, not the number of remaining bytes to copy.

> + break;
> + }
> + case KVM_S390_SET_SREGS2: {
> + struct kvm_s390_sregs2 sregs2;
> +
> + vcpu->arch.sie_block->prefix = sregs2.prefix;
> + vcpu->arch.sie_block->gbea = sregs2.gbea;
> + vcpu->arch.sie_block->cputm = sregs2.cputm;
> + vcpu->arch.sie_block->ckc = sregs2.ckc;
> + vcpu->arch.sie_block->todpr = sregs2.todpr;

Copying uninitialized data.

> + r = copy_from_user(&sregs2, argp, sizeof(sregs2));

Then initializing it.

> + vcpu->arch.sie_block->ihcpu = 0x;

What's this?



-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Avi Kivity
On 12/20/2011 02:38 AM, Anthony Liguori wrote:
>> That was v1 of my patches. Avi didn't like it, I tried it like this, and
>> in the end I had to agree. So, no, I don't think we want such a model.
>
>
> Yes, we do :-)
>
> The in-kernel APIC is a different implementation of the APIC device. 
> It's not an "accelerator" for the userspace APIC.

A different implementation but not a different device.  Device == spec.

>
> All that you're doing here is reinventing qdev.  You're defining your
> own type system (APICBackend), creating a new regression system for
> it, and then defining your own factory function for creating it
> (through a qdev property).
>
> I'm struggling to understand the reason to avoid using the
> infrastructure we already have to do all of this.

Not every table of function pointers has to be done through qdev (not
that I feel strongly about this - only that there is just one APIC device).
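
As a rough illustration of what a plain table of function pointers behind a single APIC device could look like, consider the sketch below. It is not how QEMU structures its APIC code; every name in it (struct apic_backend_ops, struct apic_device, the user_* and kernel_* helpers) is hypothetical, and the "kernel" backend only counts calls where a real one would talk to KVM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backend interface: one device, several implementations. */
struct apic_backend_ops {
	void (*set_tpr)(void *opaque, uint8_t val);
	uint8_t (*get_tpr)(void *opaque);
};

/* The single guest-visible device; the backend is picked once at creation. */
struct apic_device {
	const struct apic_backend_ops *ops;
	void *opaque;
};

/* Userspace emulation: the register lives in process memory. */
struct user_apic {
	uint8_t tpr;
};

static void user_set_tpr(void *opaque, uint8_t val)
{
	((struct user_apic *)opaque)->tpr = val;
}

static uint8_t user_get_tpr(void *opaque)
{
	return ((struct user_apic *)opaque)->tpr;
}

static const struct apic_backend_ops user_apic_ops = {
	.set_tpr = user_set_tpr,
	.get_tpr = user_get_tpr,
};

/* In-kernel stand-in: each access is counted where a real backend would call KVM. */
struct kernel_apic {
	uint8_t shadow_tpr;
	unsigned int calls;
};

static void kernel_set_tpr(void *opaque, uint8_t val)
{
	struct kernel_apic *k = opaque;

	k->calls++;
	k->shadow_tpr = val;
}

static uint8_t kernel_get_tpr(void *opaque)
{
	struct kernel_apic *k = opaque;

	k->calls++;
	return k->shadow_tpr;
}

static const struct apic_backend_ops kernel_apic_ops = {
	.set_tpr = kernel_set_tpr,
	.get_tpr = kernel_get_tpr,
};

static void apic_init(struct apic_device *dev,
		      const struct apic_backend_ops *ops, void *opaque)
{
	assert(dev != NULL && ops != NULL && opaque != NULL);
	dev->ops = ops;
	dev->opaque = opaque;
}

static void apic_set_tpr(struct apic_device *dev, uint8_t val)
{
	dev->ops->set_tpr(dev->opaque, val);
}

static uint8_t apic_get_tpr(struct apic_device *dev)
{
	return dev->ops->get_tpr(dev->opaque);
}
```

Callers only ever see struct apic_device; which implementation sits behind it is decided once in apic_init().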

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH]: kvm-s390: add KVM_S390_GET/SET_SREGS2 call for additional hw regs

2011-12-20 Thread Christian Borntraeger
>>  arch/s390/include/asm/kvm.h |9 +
>>  arch/s390/kvm/kvm-s390.c|   24 
>>  include/linux/kvm.h |4 
>>  3 files changed, 37 insertions(+)
> 
> The lack of documentation is not.

Ok, will do.
> 
> 
>> @@ -673,6 +674,29 @@ long kvm_arch_vcpu_ioctl(struct file *fi
>>  case KVM_S390_INITIAL_RESET:
>>  r = kvm_arch_vcpu_ioctl_initial_reset(vcpu);
>>  break;
>> +case KVM_S390_GET_SREGS2: {
>> +struct kvm_s390_sregs2 sregs2;
>> +
>> +sregs2.prefix = vcpu->arch.sie_block->prefix;
>> +sregs2.gbea = vcpu->arch.sie_block->gbea;
>> +sregs2.cputm = vcpu->arch.sie_block->cputm;
>> +sregs2.ckc = vcpu->arch.sie_block->ckc;
>> +sregs2.todpr = vcpu->arch.sie_block->todpr;
>> +r = copy_to_user(argp, &sregs2, sizeof(sregs2));
> 
> Need to return -EFAULT, not the number of remaining bytes to copy.

Will fix.

>> +case KVM_S390_SET_SREGS2: {
>> +struct kvm_s390_sregs2 sregs2;
>> +
>> +vcpu->arch.sie_block->prefix = sregs2.prefix;
>> +vcpu->arch.sie_block->gbea = sregs2.gbea;
>> +vcpu->arch.sie_block->cputm = sregs2.cputm;
>> +vcpu->arch.sie_block->ckc = sregs2.ckc;
>> +vcpu->arch.sie_block->todpr = sregs2.todpr;
> 
> Copying uninitialized data.
> 
>> +r = copy_from_user(&sregs2, argp, sizeof(sregs2));
> 
> Then initializing it.

Hmm, a brown paper bag bug. Since live migration does not yet work
I only tested the get case (via dump). Sorry about that.

> 
>> +vcpu->arch.sie_block->ihcpu = 0x;
> 
> What's this?

tlb flush. Necessary after setting the prefix register.

Thanks
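
For reference, the ordering problem discussed above can be reproduced in plain
user-space C. This is only a sketch: fake_copy_from_user(), struct sregs2 and
struct sie_block are hypothetical stand-ins, not the kernel API. The fake copy
helper mimics the copy_from_user() convention of returning the number of bytes
that could not be copied, which is why the caller maps any non-zero result to
-EFAULT instead of returning it directly.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the register block exchanged with user space. */
struct sregs2 {
	uint64_t ckc;
	uint64_t cputm;
	uint64_t gbea;
	uint32_t todpr;
	uint32_t prefix;
};

/* Hypothetical stand-in for the fields of the SIE block touched here. */
struct sie_block {
	uint64_t ckc;
	uint64_t cputm;
	uint64_t gbea;
	uint32_t todpr;
	uint32_t prefix;
};

/*
 * User-space imitation of copy_from_user(): returns the number of bytes
 * that could NOT be copied (0 on success). A NULL source plays the role
 * of a faulting user pointer.
 */
static size_t fake_copy_from_user(void *dst, const void *src, size_t len)
{
	if (src == NULL)
		return len;
	memcpy(dst, src, len);
	return 0;
}

/* Copy in first, bail out with -EFAULT on failure, and only then apply. */
static int set_sregs2(struct sie_block *sie, const void *argp)
{
	struct sregs2 regs;

	if (fake_copy_from_user(&regs, argp, sizeof(regs)))
		return -EFAULT;

	sie->prefix = regs.prefix;
	sie->gbea = regs.gbea;
	sie->cputm = regs.cputm;
	sie->ckc = regs.ckc;
	sie->todpr = regs.todpr;
	return 0;
}
```

With this ordering a failed copy leaves the vcpu state untouched, and the
caller sees a proper error code rather than a byte count.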



Controlling (WinXP-Guest)- Textmode via chardev pipe broken?

2011-12-20 Thread Oliver Rath
Hi List,

I'm trying to control a winxp installation process (esp. the text-mode
part) via pipe through kvm. I'm using the git version of kvm from
yesterday (2011-12-19), built with these parameters:

~/qemu-kvm$ ./configure --enable-sdl
~/qemu-kvm$ make
~/qemu-kvm$ sudo make install

What works:

1. Generating ~/.winpipe/path.{in,out} via mkdir .winpipe; mkfifo
.winpipe/path.in; mkfifo .winpipe/path.out
2. Creating an empty image via qemu-img create win.img 20G
3. Starting kvm via: qemu-system-x86_64 -m 1G -hda win.img -cdrom
winxpsp3.iso -chardev pipe,id=mywinpipe,path=.winpipe/path

What doesn't:

1. Piping some text via echo "R" > .winpipe/path.in (i.e. for starting
the repair console in winxp text mode) doesn't have any effect
2. Reading from path.out via cat ~/.winpipe/path.out has no result, only
waiting (nothing happens, except that when the vm ends, the command
"cat .winpipe/path.out" ends, too)
3. There is no .pipe/ directory inside winxp, neither on C: nor on D:
(looked in via starting the repair console with "R")


What's wrong?

Tfh!

Oliver



Re: [Qemu-devel] [PATCH v5 00/16] uq/master: Introduce basic irqchip support

2011-12-20 Thread Avi Kivity
On 12/20/2011 02:42 AM, Anthony Liguori wrote:
>> Look down http://thread.gmane.org/gmane.comp.emulators.kvm.devel/82598
>> for the discussion of that model.
>
>
> I have.  I don't understand the rationale for jumping through hoops here.
>
> There seems to be an assertion that migrating from in-kernel APIC to
> userspace APIC is an important use case.  I don't really see how
> that's true.
>

That's only because no one is using qemu.git for virtualization.  If
they were, then you'd prevent existing users from using it, except
through guest shutdown and relaunch of qemu (and perhaps reconfiguration).

We've discussed removing the ioapic from the kernel.  If we do that,
then we need to support migration from in-kernel ioapic to userspace ioapic.

> But nonetheless, the direction migration is heading is not just to
> migrate the QOM path names to identify devices, but to provide a way
> to introspect the device model, transfer the current device model
> description to the other end, and create the device model on the
> destination.
>
> This is the only way to reliably support things like hot-plug during
> live migration which is something we punt to management tools (which
> really can't implement it properly).
>
> So we'll already be migrating the apic backend property which means
> that you are not going to have migration to and from in-kernel APIC
> and userspace APIC without some sort of in-between translation layer
> (which could just as easily change the device names).

To what?

The backend property should be private and not migrated.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH v5 00/16] uq/master: Introduce basic irqchip support

2011-12-20 Thread Avi Kivity
On 12/20/2011 04:46 AM, Anthony Liguori wrote:
>
> I would hope that you would agree that when designing the device
> model, we should aim to do what makes sense independent of migration. 
> If we cannot achieve a certain feature with migration given the
> logical modeling of devices, it probably suggests that we need to
> improve our migration infrastructure.
>
> I assume that given the above, we all agree that separate devices is
> what makes the most sense ignoring migration.

I don't agree with this.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH v5 00/16] uq/master: Introduce basic irqchip support

2011-12-20 Thread Avi Kivity
On 12/20/2011 12:03 PM, Avi Kivity wrote:
> On 12/20/2011 04:46 AM, Anthony Liguori wrote:
> >
> > I would hope that you would agree that when designing the device
> > model, we should aim to do what makes sense independent of migration. 
> > If we cannot achieve a certain feature with migration given the
> > logical modeling of devices, it probably suggests that we need to
> > improve our migration infrastructure.
> >
> > I assume that given the above, we all agree that separate devices is
> > what makes the most sense ignoring migration.
>
> I don't agree with this.

The problem with having two devices, is that now you have to identify
the common code, put them somewhere, and use them as necessary.

"apic" and "kvm-apic" both is-a (are-a?) "apic".  This suggests either a
base class (containing the common code) and derived classes, or (like
Jan's implementation), just one class, that defers part of the
implementation to an interface implemented by two other classes.

Two unrelated classes which happen to implement exactly the same
interface (vmstate fields) except one (visible name) and share some code
are a strange solution to this problem.
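
To make the single-class variant concrete (one class that defers part of the
implementation to an interface), here is a rough C sketch. All names in it
(struct apic, struct apic_backend_ops, the TPR accessors, fake_kernel_tpr) are
made up for illustration and are not QEMU's actual API; the "kernel" backend
is only simulated with a static variable.

```c
#include <stdint.h>

struct apic;

/* Hypothetical interface: only the parts that differ per implementation. */
struct apic_backend_ops {
	void (*set_tpr)(struct apic *s, uint8_t val);
	uint8_t (*get_tpr)(const struct apic *s);
};

/* One device class with one state layout (what migration would see). */
struct apic {
	uint8_t tpr;
	const struct apic_backend_ops *ops;
};

/* "Userspace" backend: the state lives directly in struct apic. */
static void user_set_tpr(struct apic *s, uint8_t val) { s->tpr = val; }
static uint8_t user_get_tpr(const struct apic *s) { return s->tpr; }

/* "In-kernel" backend stand-in: pretend the value round-trips via the kernel. */
static uint8_t fake_kernel_tpr;
static void kernel_set_tpr(struct apic *s, uint8_t val) { (void)s; fake_kernel_tpr = val; }
static uint8_t kernel_get_tpr(const struct apic *s) { (void)s; return fake_kernel_tpr; }

static const struct apic_backend_ops user_ops = { user_set_tpr, user_get_tpr };
static const struct apic_backend_ops kernel_ops = { kernel_set_tpr, kernel_get_tpr };

/* Common code, written once and independent of the backend. */
static void apic_init(struct apic *s, const struct apic_backend_ops *ops)
{
	s->tpr = 0;
	s->ops = ops;
}

static void apic_set_tpr(struct apic *s, uint8_t val) { s->ops->set_tpr(s, val); }
static uint8_t apic_get_tpr(const struct apic *s) { return s->ops->get_tpr(s); }
```

Callers only ever see struct apic and the apic_*() helpers; which backend sits
behind the ops pointer is a private choice made at init time.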

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH]: kvm-s390: add KVM_S390_GET/SET_SREGS2 call for additional hw regs

2011-12-20 Thread Avi Kivity
On 12/20/2011 11:59 AM, Christian Borntraeger wrote:
> > 
> >> +  vcpu->arch.sie_block->ihcpu = 0x;
> > 
> > What's this?
>
> tlb flush. Necessary after setting the prefix register.

Perhaps worth wrapping into an inline with a descriptive name later on.

-- 
error compiling committee.c: too many arguments to function



RE: [PATCH] PCI: Enable ATS at the device state restore

2011-12-20 Thread Hao, Xudong
Hi, Jesse
Do you have any comments for this fix patch?

Thanks,
-Xudong

> -Original Message-
> From: Hao, Xudong
> Sent: Saturday, December 17, 2011 9:25 PM
> To: 'jbar...@virtuousgeek.org'; 'linux-...@vger.kernel.org'
> Cc: linux-ker...@vger.kernel.org; kvm@vger.kernel.org; Zhang, Xiantao
> Subject: [PATCH] PCI: Enable ATS at the device state restore
> 
> When the system goes to S3 or S4 sleep and then resumes, some registers of a
> PCI device are not restored correctly, such as the ATS capability. The same
> problem happens when the PCI function is reset.
> 
> This patch enables ATS at device state restore if the PCI device has the ATS
> capability.
> 
> Signed-off-by: Xudong Hao 
> Signed-off-by: Xiantao Zhang 
> ---
>  drivers/pci/ats.c |   17 +
>  drivers/pci/pci.c |1 +
>  drivers/pci/pci.h |8 
>  3 files changed, 26 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> index 7ec56fb..a6c2b35 100644
> --- a/drivers/pci/ats.c
> +++ b/drivers/pci/ats.c
> @@ -127,6 +127,23 @@ void pci_disable_ats(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(pci_disable_ats);
> 
> +void pci_restore_ats_state(struct pci_dev *dev)
> +{
> + u16 ctrl;
> +
> + if (!pci_ats_enabled(dev))
> + return;
> + if (!pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS))
> + BUG();
> +
> + ctrl = PCI_ATS_CTRL_ENABLE;
> + if (!dev->is_virtfn)
> + ctrl |= PCI_ATS_CTRL_STU(dev->ats->stu - PCI_ATS_MIN_STU);
> +
> + pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
> +}
> +EXPORT_SYMBOL_GPL(pci_restore_ats_state);
> +
>  /**
>   * pci_ats_queue_depth - query the ATS Invalidate Queue Depth
>   * @dev: the PCI device
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 6f45a73..6dafc1d 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -956,6 +956,7 @@ void pci_restore_state(struct pci_dev *dev)
> 
>   /* PCI Express register must be restored first */
>   pci_restore_pcie_state(dev);
> + pci_restore_ats_state(dev);
> 
>   /*
>* The Base Address register should be programmed before the command
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index b74084e..a4f3140 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -249,6 +249,14 @@ struct pci_sriov {
>   u8 __iomem *mstate; /* VF Migration State Array */
>  };
> 
> +#ifdef CONFIG_PCI_ATS
> +extern void pci_restore_ats_state(struct pci_dev *dev);
> +#else
> +static inline void pci_restore_ats_state(struct pci_dev *dev)
> +{
> +}
> +#endif /* CONFIG_PCI_ATS */
> +
>  #ifdef CONFIG_PCI_IOV
>  extern int pci_iov_init(struct pci_dev *dev);
>  extern void pci_iov_release(struct pci_dev *dev);
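
As a side note, the control word that pci_restore_ats_state() writes in the
quoted patch can be worked through in user space. The sketch below assumes the
macro values as I recall them from include/linux/pci_regs.h (enable bit
0x8000, a 5-bit STU field, minimum STU of 12); please treat those numbers as
assumptions and check them against the tree. ats_ctrl_word() is a made-up
helper name.

```c
#include <stdint.h>

/* Assumed values, mirroring include/linux/pci_regs.h as far as I recall. */
#define PCI_ATS_CTRL_ENABLE  0x8000u
#define PCI_ATS_CTRL_STU(x)  ((x) & 0x1fu)
#define PCI_ATS_MIN_STU      12

/*
 * Hypothetical helper: build the ATS control word the same way the patch
 * does. Virtual functions only get the enable bit; the STU field is
 * programmed on the physical function.
 */
static uint16_t ats_ctrl_word(int stu, int is_virtfn)
{
	uint16_t ctrl = PCI_ATS_CTRL_ENABLE;

	if (!is_virtfn)
		ctrl |= PCI_ATS_CTRL_STU(stu - PCI_ATS_MIN_STU);
	return ctrl;
}
```

Under these assumptions a physical function with stu == 12 gets 0x8000, and
stu == 13 gets 0x8001.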


[PATCH] kvm tools: Don't panic guest when exiting from custom rootfs

2011-12-20 Thread Sasha Levin
We currently panic guest when exiting from custom rootfs since at that point
we terminate init, and the guest kernel doesn't quite like that.

Instead, we do a graceful shutdown when init is done (either when 'lkvm
sandbox' command or '/bin/sh' is finished).

Signed-off-by: Sasha Levin 
---
 tools/kvm/guest/init_stage2.c |   24 ++--
 1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/tools/kvm/guest/init_stage2.c b/tools/kvm/guest/init_stage2.c
index 6489fee..7b96436 100644
--- a/tools/kvm/guest/init_stage2.c
+++ b/tools/kvm/guest/init_stage2.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static int run_process(char *filename)
 {
@@ -26,6 +27,9 @@ static int run_process_sandbox(char *filename)
 
 int main(int argc, char *argv[])
 {
+   pid_t child;
+   int status;
+
/* get session leader */
setsid();
 
@@ -34,12 +38,20 @@ int main(int argc, char *argv[])
 
puts("Starting '/bin/sh'...");
 
-   if (access("/virt/sandbox.sh", R_OK) == 0)
-   run_process_sandbox("/bin/sh");
-   else
-   run_process("/bin/sh");
-
-   printf("Init failed: %s\n", strerror(errno));
+   child = fork();
+   if (child < 0) {
+   printf("Fatal: fork() failed with %d\n", child);
+   return 0;
+   } else if (child == 0) {
+   if (access("/virt/sandbox.sh", R_OK) == 0)
+   run_process_sandbox("/bin/sh");
+   else
+   run_process("/bin/sh");
+   } else {
+   wait(&status);
+   }
+
+   reboot(LINUX_REBOOT_CMD_RESTART);
 
return 0;
 }
-- 
1.7.8
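
The fork-and-wait pattern used by this patch can be tried out on its own in
user space. The sketch below is not the init_stage2.c code: run_in_child() and
exit_with_seven() are hypothetical helpers, and it uses waitpid() plus _exit()
so that the child can never continue into the parent's code path. As far as I
can tell, in the patch a child whose exec fails would fall out of the if/else
and reach reboot() as well, which the _exit() here avoids.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run fn() in a child process and return its exit status, or -1 if
 * fork()/waitpid() failed or the child did not exit normally.
 */
static int run_in_child(int (*fn)(void))
{
	int status;
	pid_t child = fork();

	if (child < 0)
		return -1;
	if (child == 0)
		_exit(fn());	/* child never returns into the parent's path */

	if (waitpid(child, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Trivial payload standing in for the shell or sandbox command. */
static int exit_with_seven(void)
{
	return 7;
}
```

The parent only proceeds (to a shutdown, in the patch's case) once the child
has been reaped, and it learns the child's exit status along the way.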



RE: [PATCH] intel-iommu: Add device info into list before doing context mapping

2011-12-20 Thread Hao, Xudong
Hi, David

Do you have any comments for this patch?

Thanks,
-Xudong


> -Original Message-
> From: Hao, Xudong
> Sent: Saturday, December 17, 2011 9:07 PM
> To: 'io...@lists.linux-foundation.org'; 'dw...@infradead.org'
> Cc: 'linux-ker...@vger.kernel.org'; 'kvm@vger.kernel.org'; Zhang, Xiantao
> Subject: [PATCH] intel-iommu: Add device info into list before doing context
> mapping
> 
> This patch adds the device info to the list before doing the context mapping,
> because the device info is used by iommu_enable_dev_iotlb(). Without this
> patch, pci_enable_ats() is never called from that function, so ATS is not
> enabled when a PCI device with ATS capability is assigned to a guest.
> 
> Signed-off-by: Xudong Hao 
> Signed-off-by: Xiantao Zhang 
> ---
>  drivers/iommu/intel-iommu.c |   14 --
>  1 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index c0c7820..f0b5d38 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -2264,12 +2264,6 @@ static int domain_add_dev_info(struct dmar_domain *domain,
>   if (!info)
>   return -ENOMEM;
> 
> - ret = domain_context_mapping(domain, pdev, translation);
> - if (ret) {
> - free_devinfo_mem(info);
> - return ret;
> - }
> -
>   info->segment = pci_domain_nr(pdev->bus);
>   info->bus = pdev->bus->number;
>   info->devfn = pdev->devfn;
> @@ -2282,6 +2276,14 @@ static int domain_add_dev_info(struct dmar_domain *domain,
>   pdev->dev.archdata.iommu = info;
>   spin_unlock_irqrestore(&device_domain_lock, flags);
> 
> + ret = domain_context_mapping(domain, pdev, translation);
> + if (ret) {
> + list_del(&info->link);
> + list_del(&info->global);
> + free_devinfo_mem(info);
> + return ret;
> + }
> +
>   return 0;
>  }


[PATCH 1/2]: kvm-s390: use guest flush inline function

2011-12-20 Thread Christian Borntraeger
From: Christian Borntraeger 

Move all open-coded guest flush operations into an inline function.

Signed-off-by: Christian Borntraeger 
---
 arch/s390/kvm/interrupt.c |2 +-
 arch/s390/kvm/kvm-s390.c  |2 +-
 arch/s390/kvm/kvm-s390.h  |7 ++-
 3 files changed, 8 insertions(+), 3 deletions(-)

Index: b/arch/s390/kvm/interrupt.c
===
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -237,7 +237,7 @@ static void __do_deliver_interrupt(struc
   inti->prefix.address);
vcpu->stat.deliver_prefix_signal++;
vcpu->arch.sie_block->prefix = inti->prefix.address;
-   vcpu->arch.sie_block->ihcpu = 0x;
+   flush_guest_cpu(vcpu);
break;
 
case KVM_S390_RESTART:
Index: b/arch/s390/kvm/kvm-s390.c
===
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -291,7 +291,6 @@ static void kvm_s390_vcpu_initial_reset(
vcpu->arch.sie_block->gpsw.mask = 0UL;
vcpu->arch.sie_block->gpsw.addr = 0UL;
vcpu->arch.sie_block->prefix= 0UL;
-   vcpu->arch.sie_block->ihcpu = 0x;
vcpu->arch.sie_block->cputm = 0UL;
vcpu->arch.sie_block->ckc   = 0UL;
vcpu->arch.sie_block->todpr = 0;
@@ -301,6 +300,7 @@ static void kvm_s390_vcpu_initial_reset(
vcpu->arch.guest_fpregs.fpc = 0;
asm volatile("lfpc %0" : : "Q" (vcpu->arch.guest_fpregs.fpc));
vcpu->arch.sie_block->gbea = 1;
+   flush_guest_cpu(vcpu);
 }
 
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
Index: b/arch/s390/kvm/kvm-s390.h
===
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -1,7 +1,7 @@
 /*
  * kvm_s390.h -  definition for kvm on s390
  *
- * Copyright IBM Corp. 2008,2009
+ * Copyright IBM Corp. 2008,2011
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License (version 2 only)
@@ -47,6 +47,11 @@ static inline int __cpu_is_stopped(struc
return atomic_read(&vcpu->arch.sie_block->cpuflags) & CPUSTAT_STOP_INT;
 }
 
+static inline void flush_guest_cpu(struct kvm_vcpu *vcpu)
+{
+   vcpu->arch.sie_block->ihcpu = 0x;
+}
+
 int kvm_s390_handle_wait(struct kvm_vcpu *vcpu);
 enum hrtimer_restart kvm_s390_idle_wakeup(struct hrtimer *timer);
 void kvm_s390_tasklet(unsigned long parm);



[PATCH 2/2]: kvm-s390: add KVM_S390_GET/SET_SREGS2 call for additional hw regs

2011-12-20 Thread Christian Borntraeger
From: Christian Borntraeger 

For guest relocation and virsh dump qemu needs an interface to
get/set additional registers from kvm. We also need the prefix
register for all guest memory accesses to the prefix pages.

The prefix register could also be set via the KVM_S390_SIGP_SET_PREFIX
interrupt ioctl, but I also added the synchronous operation to have

o symmetry: we want to have the same struct for get/set routine
o the interrupt is only delivered before entering the SIE, we also
  want to cover the sequence set prefix/store status at prefix

Signed-off-by: Christian Borntraeger 
---
 Documentation/virtual/kvm/api.txt |   31 +++
 arch/s390/include/asm/kvm.h   |9 +
 arch/s390/kvm/kvm-s390.c  |   30 ++
 include/linux/kvm.h   |4 
 4 files changed, 74 insertions(+)

Index: b/Documentation/virtual/kvm/api.txt
===
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1450,6 +1450,37 @@ is supported; 2 if the processor require
 an RMA, or 1 if the processor can use an RMA but doesn't require it,
 because it supports the Virtual RMA (VRMA) facility.
 
+4.64 KVM_S390_GET_SREGS2
+
+Capability: KVM_CAP_S390_SREGS2
+Architectures: s390x
+Type: vcpu ioctl
+Parameters: struct kvm_sregs2 (out)
+Returns: 0 on success, -1 on error
+
+Reads special registers from the vcpu which are not covered by sregs.
+
+/* s390x */
+struct kvm_sregs2 {
+   __u64 ckc;  /* clock comparator */
+   __u64 cputm;/* cpu timer */
+   __u64 gbea; /* guest breaking event address */
+   __u32 todpr;/* tod programmable field */
+   __u32 prefix;   /* prefix register */
+};
+
+4.65 KVM_S390_SET_SREGS2
+
+Capability: KVM_CAP_S390_SREGS2
+Architectures: s390x
+Type: vcpu ioctl
+Parameters: struct kvm_sregs2 (in)
+Returns: 0 on success, -1 on error
+
+Writes special registers into the vcpu.  See KVM_S390_GET_SREGS2 for the
+data structures.
+
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
Index: b/arch/s390/include/asm/kvm.h
===
--- a/arch/s390/include/asm/kvm.h
+++ b/arch/s390/include/asm/kvm.h
@@ -28,6 +28,15 @@ struct kvm_sregs {
__u64 crs[16];
 };
 
+/* for KVM_S390_GET_SREGS2 and KVM_S390_SET_SREGS2 */
+struct kvm_s390_sregs2 {
+   __u64 ckc;  /* clock comparator */
+   __u64 cputm;/* cpu timer */
+   __u64 gbea; /* guest breaking event address */
+   __u32 todpr;/* tod programmable field */
+   __u32 prefix;   /* prefix register */
+};
+
 /* for KVM_GET_FPU and KVM_SET_FPU */
 struct kvm_fpu {
__u32 fpc;
Index: b/arch/s390/kvm/kvm-s390.c
===
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -129,6 +129,7 @@ int kvm_dev_ioctl_check_extension(long e
case KVM_CAP_S390_PSW:
case KVM_CAP_S390_GMAP:
case KVM_CAP_SYNC_MMU:
+   case KVM_CAP_S390_SREGS2:
r = 1;
break;
default:
@@ -673,6 +674,35 @@ long kvm_arch_vcpu_ioctl(struct file *fi
case KVM_S390_INITIAL_RESET:
r = kvm_arch_vcpu_ioctl_initial_reset(vcpu);
break;
+   case KVM_S390_GET_SREGS2: {
+   struct kvm_s390_sregs2 sregs2;
+
+   sregs2.prefix = vcpu->arch.sie_block->prefix;
+   sregs2.gbea = vcpu->arch.sie_block->gbea;
+   sregs2.cputm = vcpu->arch.sie_block->cputm;
+   sregs2.ckc = vcpu->arch.sie_block->ckc;
+   sregs2.todpr = vcpu->arch.sie_block->todpr;
+   r = -EFAULT;
+   if (copy_to_user(argp, &sregs2, sizeof(sregs2)))
+   break;
+   r = 0;
+   break;
+   }
+   case KVM_S390_SET_SREGS2: {
+   struct kvm_s390_sregs2 sregs2;
+
+   r = -EFAULT;
+   if (copy_from_user(&sregs2, argp, sizeof(sregs2)))
+   break;
+   vcpu->arch.sie_block->prefix = sregs2.prefix;
+   vcpu->arch.sie_block->gbea = sregs2.gbea;
+   vcpu->arch.sie_block->cputm = sregs2.cputm;
+   vcpu->arch.sie_block->ckc = sregs2.ckc;
+   vcpu->arch.sie_block->todpr = sregs2.todpr;
+   flush_guest_cpu(vcpu);
+   r = 0;
+   break;
+   }
default:
r = -EINVAL;
}
Index: b/include/linux/kvm.h
===
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -557,6 +557,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_MAX_VCPUS 66   /* returns max vcpus per vm */
 #define KVM_CAP_PPC_PAPR 68
 #define KVM_CAP_S390_GMAP 71
+#define KVM_CAP_S390_SREGS2 72
 
 #ifdef KVM_CAP_IRQ_ROUT
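
One property of the proposed structure that matters for an ioctl ABI is that
it has no implicit padding: three 64-bit fields followed by two 32-bit fields
add up to 32 bytes. The user-space sketch below mirrors the struct with
fixed-width types and checks that at compile time; sregs2_size() is a
hypothetical helper added only so the size can be inspected at run time. Note
that the api.txt hunk calls the structure kvm_sregs2 while the header names it
kvm_s390_sregs2; the sketch follows the header.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the struct from the patch, using fixed-width types. */
struct kvm_s390_sregs2 {
	uint64_t ckc;		/* clock comparator */
	uint64_t cputm;		/* cpu timer */
	uint64_t gbea;		/* guest breaking event address */
	uint32_t todpr;		/* tod programmable field */
	uint32_t prefix;	/* prefix register */
};

/* Compile-time layout checks: 3 * 8 + 2 * 4 bytes, no holes. */
_Static_assert(sizeof(struct kvm_s390_sregs2) == 32,
	       "no implicit padding expected");
_Static_assert(offsetof(struct kvm_s390_sregs2, todpr) == 24,
	       "todpr follows the three 64-bit fields");
_Static_assert(offsetof(struct kvm_s390_sregs2, prefix) == 28,
	       "prefix occupies the last four bytes");

/* Hypothetical helper so the size can also be checked at run time. */
static size_t sregs2_size(void)
{
	return sizeof(struct kvm_s390_sregs2);
}
```

Ordering the 64-bit members first is what keeps the layout identical for
31-bit and 64-bit user space.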

Re: Controlling (WinXP-Guest)- Textmode via chardev pipe broken?

2011-12-20 Thread Sasha Levin
On Tue, 2011-12-20 at 11:01 +0100, Oliver Rath wrote:
> Hi List,
> 
> I'm trying to control a winxp installation process (esp. the text-mode
> part) via pipe through kvm. I'm using the git version of kvm from
> yesterday (2011-12-19), built with these parameters:
> 
> ~/qemu-kvm$ ./configure --enable-sdl
> ~/qemu-kvm$ make
> ~/qemu-kvm$ sudo make install
> 
> What works:
> 
> 1. Generating ~/.winpipe/path.{in,out} via mkdir .winpipe; mkfifo
> .winpipe/path.in; mkfifo .winpipe/path.out
> 2. Creating an empty image via qemu-img create win.img 20G
> 3. Starting kvm via: qemu-system-x86_64 -m 1G -hda win.img -cdrom
> winxpsp3.iso -chardev pipe,id=mywinpipe,path=.winpipe/path
> 
> What doesn't:
> 
> 1. Piping some text via echo "R" > .winpipe/path.in (i.e. for starting
> the repair console in winxp text mode) doesn't have any effect
> 2. Reading from path.out via cat ~/.winpipe/path.out has no result, only
> waiting (nothing happens, except that when the vm ends, the command
> "cat .winpipe/path.out" ends, too)
> 3. There is no .pipe/ directory inside winxp, neither on C: nor on D:
> (looked in via starting the repair console with "R")
> 
> 
> What's wrong?

Are you sure that windows is using the serial device at all?

You're not emulating a keyboard there, you're emulating input from a
serial device.

-- 

Sasha.



Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-20 Thread Jens Axboe
On 2011-12-19 19:27, Vivek Goyal wrote:
> I think reverting the previous series is not going to be simple either. It
> had 13 patches.
> 
> https://lkml.org/lkml/2011/5/19/560
> 
> By making stats per cpu, I was able to reduce contention on request
> queue lock. Now we shall have to bring the lock back. 

That's not going to happen, having that lock in there was a disaster for
small IO on fast devices. It's essentially the limiting factor on
benchmark runs on the RHEL kernels that have it included and enabled...


-- 
Jens Axboe



Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Anthony Liguori

On 12/20/2011 03:56 AM, Avi Kivity wrote:
> On 12/20/2011 02:38 AM, Anthony Liguori wrote:
>>> That was v1 of my patches. Avi didn't like it, I tried it like this, and
>>> in the end I had to agree. So, no, I don't think we want such a model.
>>
>> Yes, we do :-)
>>
>> The in-kernel APIC is a different implementation of the APIC device.
>> It's not an "accelerator" for the userspace APIC.
>
> A different implementation but not a different device.  Device == spec.

If it was hardware, it'd be a fully compatible clone.  The way we would model
this is via inheritance.

Regards,

Anthony Liguori

>> All that you're doing here is reinventing qdev.  You're defining your
>> own type system (APICBackend), creating a new regression system for
>> it, and then defining your own factory function for creating it
>> (through a qdev property).
>>
>> I'm struggling to understand the reason to avoid using the
>> infrastructure we already have to do all of this.
>
> Not every table of function pointers has to be done through qdev (not
> that I feel strongly about this - only that there is just one APIC device).





Re: [Qemu-devel] [PATCH v5 00/16] uq/master: Introduce basic irqchip support

2011-12-20 Thread Anthony Liguori

On 12/20/2011 04:08 AM, Avi Kivity wrote:
> On 12/20/2011 12:03 PM, Avi Kivity wrote:
>> On 12/20/2011 04:46 AM, Anthony Liguori wrote:
>>>
>>> I would hope that you would agree that when designing the device
>>> model, we should aim to do what makes sense independent of migration.
>>> If we cannot achieve a certain feature with migration given the
>>> logical modeling of devices, it probably suggests that we need to
>>> improve our migration infrastructure.
>>>
>>> I assume that given the above, we all agree that separate devices is
>>> what makes the most sense ignoring migration.
>>
>> I don't agree with this.
>
> The problem with having two devices, is that now you have to identify
> the common code, put them somewhere, and use them as necessary.
>
> "apic" and "kvm-apic" both is-a (are-a?) "apic".  This suggests either a
> base class (containing the common code) and derived classes, or (like
> Jan's implementation), just one class, that defers part of the
> implementation to an interface implemented by two other classes.

Yes, a base-class is what I'm suggesting since this is what qdev is capable of
today.

The other approach to this is to have an APICFrontend has-a APICBackend and then
UserspaceAPIC is-a APICBackend and KernelAPIC is-a APICBackend.

You still now have three visible devices in the device model.  This is
essentially what Jan's patches do today.

I think a simple base-class + subclass inheritance scheme makes the most sense
here.

Regards,

Anthony Liguori

> Two unrelated classes which happen to implement exactly the same
> interface (vmstate fields) except one (visible name) and share some code
> are a strange solution to this problem.





Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Paolo Bonzini

On 12/20/2011 02:41 PM, Anthony Liguori wrote:
> On 12/20/2011 03:56 AM, Avi Kivity wrote:
>> On 12/20/2011 02:38 AM, Anthony Liguori wrote:
>>>> That was v1 of my patches. Avi didn't like it, I tried it like this,
>>>> and in the end I had to agree. So, no, I don't think we want such a
>>>> model.
>>>
>>> Yes, we do :-)
>>>
>>> The in-kernel APIC is a different implementation of the APIC device.
>>> It's not an "accelerator" for the userspace APIC.
>>
>> A different implementation but not a different device. Device == spec.
>
> If it was hardware, it'd be a fully compatible clone. The way we would
> model this is via inheritance.

I see your fully compatible clone, and I raise my bridge with a
different implementation underneath.  It's the same old debate on is-a
vs has-a.


In QOM parlance Jan implemented this:

abstract class Object
abstract class Device
class APIC: { backend: link }
abstract class APICBackend
class QEMU_APICBackend
class KVM_APICBackend

and you're proposing this:

abstract class Object
abstract class Device
abstract class APIC
class QEMU_APIC
class KVM_APIC

Both can be right, both can be wrong.
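
For comparison, the second hierarchy (an abstract APIC with two concrete
subclasses) is commonly expressed in C by embedding the base struct in each
derived struct. The following is only a sketch of that idiom; struct
apic_common, struct qemu_apic, struct kvm_apic and apic_container_of() are
hypothetical names, not qdev or QOM code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical abstract base: common state plus one virtual method. */
struct apic_common {
	uint32_t id;
	void (*reset)(struct apic_common *c);
};

/* Recover the derived object from a pointer to its embedded base. */
#define apic_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Subclass emulated entirely in userspace. */
struct qemu_apic {
	struct apic_common common;
	uint32_t timer_count;
};

/* Subclass whose real state would live in the kernel. */
struct kvm_apic {
	struct apic_common common;
	int state_dirty;	/* would trigger a sync with the kernel */
};

static void qemu_apic_reset(struct apic_common *c)
{
	struct qemu_apic *s = apic_container_of(c, struct qemu_apic, common);

	s->timer_count = 0;
}

static void kvm_apic_reset(struct apic_common *c)
{
	struct kvm_apic *s = apic_container_of(c, struct kvm_apic, common);

	s->state_dirty = 1;
}

static void qemu_apic_init(struct qemu_apic *s, uint32_t id)
{
	s->common.id = id;
	s->common.reset = qemu_apic_reset;
	s->timer_count = 0;
}

static void kvm_apic_init(struct kvm_apic *s, uint32_t id)
{
	s->common.id = id;
	s->common.reset = kvm_apic_reset;
	s->state_dirty = 0;
}

/* Common code only ever sees the base type. */
static void apic_reset(struct apic_common *c)
{
	c->reset(c);
}
```

Here the two implementations are distinct types that share the base, whereas
in the backend-link variant there is a single APIC type and the difference is
hidden behind a pointer.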

Paolo


Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Anthony Liguori

On 12/20/2011 07:51 AM, Paolo Bonzini wrote:
> On 12/20/2011 02:41 PM, Anthony Liguori wrote:
>> On 12/20/2011 03:56 AM, Avi Kivity wrote:
>>> On 12/20/2011 02:38 AM, Anthony Liguori wrote:
>>>>> That was v1 of my patches. Avi didn't like it, I tried it like this,
>>>>> and in the end I had to agree. So, no, I don't think we want such a
>>>>> model.
>>>>
>>>> Yes, we do :-)
>>>>
>>>> The in-kernel APIC is a different implementation of the APIC device.
>>>> It's not an "accelerator" for the userspace APIC.
>>>
>>> A different implementation but not a different device. Device == spec.
>>
>> If it was hardware, it'd be a fully compatible clone. The way we would
>> model this is via inheritance.
>
> I see your fully compatible clone, and I raise my bridge with a different
> implementation underneath. It's the same old debate on is-a vs has-a.
>
> In QOM parlance Jan implemented this:
>
> abstract class Object
> abstract class Device
> class APIC: { backend: link }
> abstract class APICBackend
> class QEMU_APICBackend
> class KVM_APICBackend

I don't fundamentally object to modeling it like this provided that it's modeled
(and visible) through qdev and not done through a one-off infrastructure.

But yes, you are exactly correct in your observation (and that both can be
right).

Regards,

Anthony Liguori

> and you're proposing this:
>
> abstract class Object
> abstract class Device
> abstract class APIC
> class QEMU_APIC
> class KVM_APIC
>
> Both can be right, both can be wrong.
>
> Paolo






Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Paolo Bonzini

On 12/20/2011 02:54 PM, Anthony Liguori wrote:
>> In QOM parlance Jan implemented this:
>>
>> abstract class Object
>> abstract class Device
>> class APIC: { backend: link }
>> abstract class APICBackend
>> class QEMU_APICBackend
>> class KVM_APICBackend
>
> I don't fundamentally object to modeling it like this provided that it's
> modeled (and visible) through qdev and not done through a one-off
> infrastructure.

There is no superclass of DeviceState, hence doing it through qdev would
mean introducing a new bus type and so on.  This would be a superb
example of a useless bus that can disappear with QOM, but I don't see
why we should take the pain to add it in the first place. :)

We sure can revisit this when the subclassing and interface
infrastructures of QOM are merged.


Paolo


Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Anthony Liguori

On 12/20/2011 07:57 AM, Paolo Bonzini wrote:
> On 12/20/2011 02:54 PM, Anthony Liguori wrote:
>>> In QOM parlance Jan implemented this:
>>>
>>> abstract class Object
>>> abstract class Device
>>> class APIC: { backend: link }
>>> abstract class APICBackend
>>> class QEMU_APICBackend
>>> class KVM_APICBackend
>>
>> I don't fundamentally object to modeling it like this provided that it's
>> modeled (and visible) through qdev and not done through a one-off
>> infrastructure.
>
> There is no superclass of DeviceState, hence doing it through qdev would mean
> introducing a new bus type and so on. This would be a superb example of a
> useless bus that can disappear with QOM, but I don't see why we should take the
> pain to add it in the first place. :)

Right, so let's model it for now as inheritance, which qdev can cope with.

> We sure can revisit this when the subclassing and interface infrastructures of
> QOM are merged.

I'll have patches out this week (just trying to write some more test cases).
The latest series is below if you're interested.  I fear that it won't be until
mid to late January before this can be merged though as I want to give folks
like Markus a chance to review it.

https://github.com/aliguori/qemu/tree/qom-upstream.3

Regards,

Anthony Liguori

> Paolo





Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Avi Kivity
On 12/20/2011 03:51 PM, Paolo Bonzini wrote:
> On 12/20/2011 02:41 PM, Anthony Liguori wrote:
>> On 12/20/2011 03:56 AM, Avi Kivity wrote:
>>> On 12/20/2011 02:38 AM, Anthony Liguori wrote:
>>>>> That was v1 of my patches. Avi didn't like it, I tried it like this,
>>>>> and in the end I had to agree. So, no, I don't think we want such a
>>>>> model.
>>>>
>>>> Yes, we do :-)
>>>>
>>>> The in-kernel APIC is a different implementation of the APIC device.
>>>> It's not an "accelerator" for the userspace APIC.
>>>
>>> A different implementation but not a different device. Device == spec.
>>
>> If it was hardware, it'd be a fully compatible clone. The way we would
>> model this is via inheritance.
>
> I see your fully compatible clone, and I raise my bridge with a
> different implementation underneath.  It's the same old debate on is-a
> vs has-a.
>
> In QOM parlance Jan implemented this:

QOM is the new C++

>
> abstract class Object
> abstract class Device
> class APIC: { backend: link }
> abstract class APICBackend
> class QEMU_APICBackend
> class KVM_APICBackend
>
> and you're proposing this:
>
> abstract class Object
> abstract class Device
> abstract class APIC
> class QEMU_APIC
> class KVM_APIC
>
> Both can be right, both can be wrong.

I don't mind either.  What I don't want:

  abstract class Object
    abstract class Device
      class APIC
        class KVMAPIC
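
For readers following the is-a vs. has-a debate, here is a minimal sketch in plain C of the two acceptable layouts. It is not QOM or qdev code; every type and function name below is hypothetical, and the read_id callbacks are stand-ins for real emulation or a KVM ioctl.

```c
#include <assert.h>

/* has-a: a single APIC device type holding a link to a backend. */
struct apic_backend {
    const char *name;
    unsigned (*read_id)(void);
};

static unsigned qemu_backend_read_id(void) { return 1; } /* stand-in for emulation   */
static unsigned kvm_backend_read_id(void)  { return 2; } /* stand-in for a KVM ioctl */

static const struct apic_backend qemu_apic_backend = { "qemu", qemu_backend_read_id };
static const struct apic_backend kvm_apic_backend  = { "kvm",  kvm_backend_read_id  };

struct apic_has_a {
    const struct apic_backend *backend;   /* "backend: link" */
};

static unsigned apic_has_a_read_id(const struct apic_has_a *apic)
{
    return apic->backend->read_id();
}

/* is-a: an abstract APIC base embedded first in each concrete type. */
struct apic_base {
    unsigned (*read_id)(const struct apic_base *self);
};

struct qemu_apic { struct apic_base parent; unsigned id; };
struct kvm_apic  { struct apic_base parent; unsigned id; };

static unsigned qemu_apic_read_id(const struct apic_base *self)
{
    /* The cast is valid because parent is the first member. */
    return ((const struct qemu_apic *)self)->id;
}

static unsigned kvm_apic_read_id(const struct apic_base *self)
{
    return ((const struct kvm_apic *)self)->id;
}
```

In the has-a form the guest-visible device type stays the same and only the link changes; in the is-a form each implementation is its own type sharing an abstract parent. Neither form makes KVM_APIC a subclass of a concrete userspace APIC.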

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2]: kvm-s390: add KVM_S390_GET/SET_SREGS2 call for additional hw regs

2011-12-20 Thread Alexander Graf
[resending because I mistakenly sent an html mime mail before, which got 
rejected by kvm@vger]

On 20.12.2011, at 12:41, Christian Borntraeger wrote:

> From: Christian Borntraeger 
> 
> For guest relocation and virsh dump qemu needs an interface to
> get/set additional registers from kvm. We also need the prefix
> register for all guest memory accesses to the prefix pages.
> 
> The prefix register could also be set via the KVM_S390_SIGP_SET_PREFIX
> interrupt ioctl, but I also added the synchronous operation to have
> 
> o symmetry: we want to have the same struct for get/set routine
> o the interrupt is only delivered before entering the SIE, we also
>  want to cover the sequence set prefix/store status at prefix
> 
> Signed-off-by: Christian Borntraeger 
> ---
> Documentation/virtual/kvm/api.txt |   31 +++
> arch/s390/include/asm/kvm.h   |9 +
> arch/s390/kvm/kvm-s390.c  |   30 ++
> include/linux/kvm.h   |4 
> 4 files changed, 74 insertions(+)
> 
> Index: b/Documentation/virtual/kvm/api.txt
> ===
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1450,6 +1450,37 @@ is supported; 2 if the processor require
> an RMA, or 1 if the processor can use an RMA but doesn't require it,
> because it supports the Virtual RMA (VRMA) facility.
> 
> +4.64 KVM_S390_GET_SREGS2
> +
> +Capability: KVM_CAP_S390_SREGS2
> +Architectures: s390x
> +Type: vcpu ioctl
> +Parameters: struct kvm_sregs2 (out)
> +Returns: 0 on success, -1 on error
> +
> +Reads special registers from the vcpu which are not covered by sregs.
> +
> +/* s390x */
> +struct kvm_sregs2 {
> + __u64 ckc;  /* clock comparator */
> + __u64 cputm;/* cpu timer */
> + __u64 gbea; /* guest breaking event address */
> + __u32 todpr;/* tod programmable field */
> + __u32 prefix;   /* prefix register */
> +};

Would it make sense to instead use the GET_ONE_REG and SET_ONE_REG interfaces?

  http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/80854

> +
> +4.65 KVM_S390_SET_SREGS2

I would very much appreciate if you could use 4.66 here. That makes merging the 
sections for GET_ONE_REG and SET_ONE_REG easier :).


Alex



Re: [PATCH 1/2] kvm: Remove ability to assign a device without iommu support

2011-12-20 Thread Alex Williamson
On Tue, 2011-12-20 at 11:08 +0200, Sasha Levin wrote:
> On Tue, 2011-12-20 at 10:03 +0100, Jan Kiszka wrote:
> > On 2011-12-20 09:49, Sasha Levin wrote:
> > > On Mon, 2011-12-19 at 20:19 -0700, Alex Williamson wrote:
> > >> This option has no users and it exposes a security hole that we
> > >> can allow devices to be assigned without iommu protection.  Make
> > >> KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.
> > >>
> > >> Signed-off-by: Alex Williamson 
> > >> ---
> > >>
> > >>  virt/kvm/assigned-dev.c |   18 +-
> > >>  1 files changed, 9 insertions(+), 9 deletions(-)
> > >>
> > >> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > >> index 3ad0925..a251a28 100644
> > >> --- a/virt/kvm/assigned-dev.c
> > >> +++ b/virt/kvm/assigned-dev.c
> > >> @@ -487,6 +487,9 @@ static int kvm_vm_ioctl_assign_device(struct kvm 
> > >> *kvm,
> > >>  struct kvm_assigned_dev_kernel *match;
> > >>  struct pci_dev *dev;
> > >>  
> > >> +if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
> > >> +return -EINVAL;
> > > 
> > > > Could we just drop KVM_DEV_ASSIGN_ENABLE_IOMMU and do it by default?
> > > > Calling KVM_ASSIGN_PCI_DEVICE without that flag set is pretty
> > > > meaningless.
> > 
> > There is that thing called "backward compatibility". :)
> 
> Well, Alex suggested skipping deprecation period because there are
> currently no users of KVM_ASSIGN_PCI_DEVICE without
> KVM_DEV_ASSIGN_ENABLE_IOMMU, so it should be fine to just make it the
> default behavior, no?
> 
> We can leave KVM_DEV_ASSIGN_ENABLE_IOMMU itself so userspace won't
> break, but there's no reason to enforce it being set in the kernel code.

As Jan said, the option does have historical meaning.  Ignoring the flag
adds ambiguity to the API, so I think it's best to make it clear which
path is no longer supported.  Either way, we don't get to take the flag
back, so it might as well serve as a sanity check.  Thanks,

Alex 
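
To make the "sanity check" concrete, here is a minimal userspace sketch of the same test the patch adds at the top of kvm_vm_ioctl_assign_device(). The flag value is copied from the api.txt hunk in this series; the helper name check_assign_flags is hypothetical and only mirrors the kernel-side check, it is not part of any KVM header.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Value taken from the api.txt hunk in this series. */
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1u << 0)

/* Hypothetical helper: validate the flags before issuing the ioctl. */
static int check_assign_flags(uint32_t flags)
{
    if (!(flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
        return -EINVAL;   /* same result the patched kernel would return */
    return 0;
}
```

A caller that never set the flag gets a clear -EINVAL instead of silently changed behavior, which is the ambiguity the mandatory check avoids.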



[PATCH v2 0/2] kvm: Lock down device assignment

2011-12-20 Thread Alex Williamson
v2: Update API documentation for each patch

Two patches to try to better secure the device assignment ioctl.
The first patch makes KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory
option when assigning a device.  I don't believe we have any
users of this option, so I think we can skip any deprecation
period, especially since its existence is rather dangerous.

The second patch introduces some file permission checking that Avi
suggested.  If a user has been granted read/write permission to
the PCI sysfs BAR resource files, this is a good indication that
they have access to the device.  We can't call sys_faccessat
directly (not exported), but the important bits are self contained
enough to include directly.  This still works with sudo and libvirt
usage, the latter already grants qemu permission to these files.
Thanks,

Alex

---

Alex Williamson (2):
  kvm: Device assignment permission checks
  kvm: Remove ability to assign a device without iommu support


 Documentation/virtual/kvm/api.txt |7 
 virt/kvm/assigned-dev.c   |   73 -
 2 files changed, 70 insertions(+), 10 deletions(-)


[PATCH v2 1/2] kvm: Remove ability to assign a device without iommu support

2011-12-20 Thread Alex Williamson
This option has no users and it exposes a security hole: it allows
devices to be assigned without iommu protection.  Make
KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.

Signed-off-by: Alex Williamson 
---

 Documentation/virtual/kvm/api.txt |3 +++
 virt/kvm/assigned-dev.c   |   18 +-
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 7945b0b..ee2c96b 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1151,6 +1151,9 @@ following flags are specified:
 /* Depends on KVM_CAP_IOMMU */
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 << 0)
 
+The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
+isolation of the device.  Usages not specifying this flag are deprecated.
+
 4.49 KVM_DEASSIGN_PCI_DEVICE
 
 Capability: KVM_CAP_DEVICE_DEASSIGNMENT
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 3ad0925..a251a28 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -487,6 +487,9 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
struct kvm_assigned_dev_kernel *match;
struct pci_dev *dev;
 
+   if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
+   return -EINVAL;
+
mutex_lock(&kvm->lock);
idx = srcu_read_lock(&kvm->srcu);
 
@@ -544,16 +547,14 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 
list_add(&match->list, &kvm->arch.assigned_dev_head);
 
-   if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU) {
-   if (!kvm->arch.iommu_domain) {
-   r = kvm_iommu_map_guest(kvm);
-   if (r)
-   goto out_list_del;
-   }
-   r = kvm_assign_device(kvm, match);
+   if (!kvm->arch.iommu_domain) {
+   r = kvm_iommu_map_guest(kvm);
if (r)
goto out_list_del;
}
+   r = kvm_assign_device(kvm, match);
+   if (r)
+   goto out_list_del;
 
 out:
srcu_read_unlock(&kvm->srcu, idx);
@@ -593,8 +594,7 @@ static int kvm_vm_ioctl_deassign_device(struct kvm *kvm,
goto out;
}
 
-   if (match->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
-   kvm_deassign_device(kvm, match);
+   kvm_deassign_device(kvm, match);
 
kvm_free_assigned_device(kvm, match);
 



[PATCH v2 2/2] kvm: Device assignment permission checks

2011-12-20 Thread Alex Williamson
Only allow KVM device assignment to attach to devices which:

 - Are not bridges
 - Have BAR resources (assume others are special devices)
 - The user has permissions to use

Assigning a bridge is a configuration error, it's not supported, and
typically doesn't result in the behavior the user is expecting anyway.
Devices without BAR resources are typically chipset components that
also don't have host drivers.  We don't want users to hold such devices
captive or cause system problems by fencing them off into an iommu
domain.  We determine "permission to use" by testing whether the user
has access to the PCI sysfs resource files.  By default a normal user
will not have access to these files, so it provides a good indication
that an administration agent has granted the user access to the device.

Signed-off-by: Alex Williamson 
---

 Documentation/virtual/kvm/api.txt |4 +++
 virt/kvm/assigned-dev.c   |   55 -
 2 files changed, 58 insertions(+), 1 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index ee2c96b..4df9af4 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1154,6 +1154,10 @@ following flags are specified:
 The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
 isolation of the device.  Usages not specifying this flag are deprecated.
 
+Only PCI header type 0 devices with PCI BAR resources are supported by
+device assignment.  The user requesting this ioctl must have read/write
+access to the PCI sysfs resource files associated with the device.
+
 4.49 KVM_DEASSIGN_PCI_DEVICE
 
 Capability: KVM_CAP_DEVICE_DEASSIGNMENT
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index a251a28..faec641 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "irq.h"
 
 static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head 
*head,
@@ -483,9 +484,11 @@ out:
 static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
  struct kvm_assigned_pci_dev *assigned_dev)
 {
-   int r = 0, idx;
+   int r = 0, idx, i;
struct kvm_assigned_dev_kernel *match;
struct pci_dev *dev;
+   u8 header_type;
+   bool bar_found = false;
 
if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
return -EINVAL;
@@ -516,6 +519,56 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
r = -EINVAL;
goto out_free;
}
+
+   /* Don't allow bridges to be assigned */
+   pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type);
+   if ((header_type & PCI_HEADER_TYPE) != PCI_HEADER_TYPE_NORMAL) {
+   r = -EPERM;
+   goto out_put;
+   }
+
+   /* We want to test whether the caller has been granted permissions to
+* use this device.  To be able to configure and control the device,
+* the user needs access to PCI configuration space and BAR resources.
+* These are accessed through PCI sysfs.  PCI config space is often
+* passed to the process calling this ioctl via file descriptor, so we
+* can't rely on access to that file.  We can check for permissions
+* on each of the BAR resource files, which is a pretty clear
+* indicator that the user has been granted access to the device. */
+   for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCE_END; i++) {
+   char buf[64];
+   struct path path;
+   struct inode *inode;
+
+   if (!pci_resource_len(dev, i))
+   continue;
+
+   /* Per sysfs-rules, sysfs is always at /sys */
+   snprintf(buf, sizeof(buf), "/sys/bus/pci/devices/%04x:%02x:"
+"%02x.%d/resource%d", pci_domain_nr(dev->bus),
+dev->bus->number, PCI_SLOT(dev->devfn),
+PCI_FUNC(dev->devfn), i);
+
+   r = kern_path(buf, LOOKUP_FOLLOW, &path);
+   if (r)
+   goto out_put;
+
+   inode = path.dentry->d_inode;
+
+   r = inode_permission(inode, MAY_READ | MAY_WRITE | MAY_ACCESS);
+   path_put(&path);
+   if (r)
+   goto out_put;
+
+   bar_found = true;
+   }
+
+   /* If no resources, probably something special */
+   if (!bar_found) {
+   r = -EPERM;
+   goto out_put;
+   }
+
if (pci_enable_device(dev)) {
printk(KERN_INFO "%s: Could not enable PCI device\n", __func__);
r = -EBUSY;
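
For anyone who wants to check from userspace whether a device would pass the new test, here is a rough sketch. It builds the BAR resource path with the same format string as the patch and probes it with access(). The helper names are hypothetical, and access() checks the real uid/gid, so the result may differ from the kernel's inode_permission() check in setuid or capability-based setups.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build the sysfs path of BAR `bar`, using the format from the patch. */
static int pci_resource_path(char *buf, size_t len, unsigned domain,
                             unsigned bus, unsigned slot, unsigned func,
                             int bar)
{
    return snprintf(buf, len,
                    "/sys/bus/pci/devices/%04x:%02x:%02x.%u/resource%d",
                    domain, bus, slot, func, bar);
}

/* Returns 1 if the calling user may read and write the file, 0 otherwise. */
static int user_can_rw(const char *path)
{
    return access(path, R_OK | W_OK) == 0;
}
```

Looping over BARs 0 to 5 and skipping the resource files that do not exist approximates the pci_resource_len() test in the patch.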



Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-20 Thread Vivek Goyal
On Mon, Dec 19, 2011 at 02:56:35PM -0800, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Mon, Dec 19, 2011 at 01:27:17PM -0500, Vivek Goyal wrote:
> > Ok, that's good to know. If per cpu allocator can support this use case,
> > it will be good for 3.3 onwards. This seems to be right way to go to fix
> > the problem.
> 
> Ummm... if we're gonna make percpu usable w/ GFP_NOIO, the right
> interim solution would be making a simplistic mempool so that later
> when percpu can do it, it can be swapped easily.  I really can't see
> much benefit of adding refcnting on top of everything just for this.

Ok. So are you suggesting that I should write a simple mempool-like
implementation of my own for group and per-cpu data allocation? That is,
keep a certain number of elements in the cache and trigger a worker
thread to allocate more elements once the number of cached elements
drops below a threshold? And if the pool has run out of pre-allocated
elements, allocation will fail and IO will be accounted to the root group?

I am looking at the mempool implementation (mempool_create()) and it looks
like it is not suitable for my use case. mempool_alloc() will call into
the alloc function provided by me and pass the flags. I can't implement an
alloc function that honors those flags, as the per-cpu allocator does not
take any flags.

So IIUC, the existing mempool implementation is not directly usable for my
requirement and I need to write some code of my own for the caching
layer, which always allocates objects from the reserve and refills the
pool asynchronously with the help of a worker thread.

Thanks
Vivek
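
A minimal userspace sketch of the caching layer described above, to make the idea concrete. This is not blkio code: the names, the pool sizes and the use of calloc() are all assumptions, and locking is omitted. The IO path only takes from the reserve; the refill function stands in for the worker thread, which is allowed to block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_TARGET 4   /* elements to keep in reserve (arbitrary) */
#define POOL_LOW    2   /* refill is requested at or below this level (arbitrary) */

struct reserve_pool {
    void  *slot[POOL_TARGET];
    size_t count;          /* number of valid entries in slot[] */
    size_t elem_size;
    int    refill_wanted;  /* set on the IO path, consumed by the worker */
};

/* Worker context: may block, so it is allowed to call the real allocator. */
static void pool_refill(struct reserve_pool *p)
{
    while (p->count < POOL_TARGET) {
        void *e = calloc(1, p->elem_size);
        if (!e)
            break;
        p->slot[p->count++] = e;
    }
    p->refill_wanted = 0;
}

/* IO path: never allocates; hands out a reserved element or NULL. */
static void *pool_get(struct reserve_pool *p)
{
    void *e = NULL;

    if (p->count > 0)
        e = p->slot[--p->count];
    if (p->count <= POOL_LOW)
        p->refill_wanted = 1;   /* a real version would queue work here */
    return e;                   /* NULL: caller falls back to the root group */
}
```

When pool_get() returns NULL the caller accounts the IO to the root group, as suggested above, and the next refill restores the reserve.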


Re: building vgabios-stdvga.bin

2011-12-20 Thread Rick Koshi

>> With this questionable version of dev86, I went back to the kvm/vgabios
>> directory and tried to build there again.  Much to my surprise, it
>> actually ran, and built 4 variants on VGABIOS-lgpl-latest.bin...
>> Cool, says I, now what do I do with these?  I did the only thing
>> I could think of, and installed VGA-lgpl-latest.bin as
>
> Almost correct.  You need VGABIOS-lgpl-latest.stdvga.bin

That sounds great, but unfortunately, the build didn't create that one.

It created the following:

VGABIOS-lgpl-latest.bin
VGABIOS-lgpl-latest.cirrus.bin
VGABIOS-lgpl-latest.cirrus.debug.bin
VGABIOS-lgpl-latest.debug.bin

> Thats the one qemu is using.  Also available from
> http://git.qemu.org/?p=vgabios.git;a=summary

I pulled this version down.  Indeed, it does build several
additional versions, including stdvga.  Hurray!

I don't know if this is the right place to make this suggestion,
but it would be nice if this version could make it into the
qemu-kvm distribution.

Now, if I may add a few more questions...

I've been experimenting with adding some special-purpose resolutions,
and I've had some rather mixed success.  I've found that I get
varying results depending on the width and color depth I use.

For example, with 32-bit color depth, if I use a width of 2560,
it works perfectly, but if I use 2556, the screen only resizes
vertically to reflect the new resolution, and the screen fills
with garbage.  I thought that perhaps it needs an integer multiple
of some power of two, so I started backing it off by progressively
larger powers of two.  At a width of 2552, the guest crashes
completely.  At 2544, 2528, 2496, and 2432, it works, but I get
some very odd artifacts, both in the mouse cursor and in places
where the mouse cursor has been recently.  If I bring it all the way
down to 2304 (nearest multiple of 256), everything seems to work
properly again.

At 24-bit color depth, I get better results.  2556 is still garbage,
but 2552 and below now work just fine.  (Actually, 2552 crashed once,
but I think maybe I accidentally clicked on 32-bit color instead of 24.
I hope so, anyway)

Do you know what the real constraints are?  Do you know if they're
imposed by the vgabios driver, the VNC module, or the guest
operating system (Windows 7 in this case)?

And perhaps most importantly, is this a bug that can be fixed?
I can live with needing a multiple of 8, but 256 seems a little
extreme (and unlikely, given that 800x600 works).  It would be
nice if I could use 32-bit color with my chosen resolution.

I'd be happy to go looking for the bug myself, but if you happen
to know off the top of your head where to look, or whether it's
an impossible dream, that would save me a lot of time.  This isn't
exactly my normal area of expertise.

-- Rick


Re: [PATCH] KVM: PPC: Add KVM_CAP_NR_VCPUS and KVM_CAP_MAX_VCPUS

2011-12-20 Thread Alexander Graf

On 08.12.2011, at 03:55, Matt Evans wrote:

> PPC KVM lacks these two capabilities, and as such a userland system must 
> assume
> a max of 4 VCPUs (following api.txt).  With these, a userland can determine
> a more realistic limit.
> 
> Signed-off-by: Matt Evans 
> ---
> 
> Alex: For when you're back in civilisation -- the kvmtool/PPC stuff will be
> limited to 4 VCPUs until the kernel returns something for these caps.
> 
> Cheers, Matt
> 
> arch/powerpc/kvm/powerpc.c |   15 +++
> 1 files changed, 15 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 7c7220c..3f7219d 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -245,6 +245,21 @@ int kvm_dev_ioctl_check_extension(long ext)
>   r = 2;
>   break;
> #endif
> + case KVM_CAP_NR_VCPUS:
> + /*
> +  * Recommending a number of CPUs is somewhat arbitrary; we 
> return the number of present
> +  * CPUs for -HV (since a host will have secondary threads 
> "offline"), and for other KVM
> +  * implementations just count online CPUs.
> +  */
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> + r = num_present_cpus();
> +#else
> + r = num_online_cpus();
> +#endif

That essentially prevents overcommitting within the scope of a single VM. Is
that what we want? You could easily run a 32-way guest on a 4-way host, even
with _HV.

Maybe some really big number makes more sense here.

Alex

PS: Please always CC kvm@vger when talking about stuff that we want feedback 
from non-PPC folks on.



Re: [PATCH 07/28] kvm tools: Move 'kvm__recommended_cpus' to arch-specific code

2011-12-20 Thread Alexander Graf

On 07.12.2011, at 08:19, Matt Evans wrote:

> On 07/12/11 17:34, Sasha Levin wrote:
>> On Wed, 2011-12-07 at 17:17 +1100, Matt Evans wrote:
>>> On 06/12/11 19:20, Sasha Levin wrote:
>>>> Why is it getting moved out of generic code?
>>>> 
>>>> This is used to determine the maximum amount of vcpus supported by the
>>>> host for a single guest, and as far as I know KVM_CAP_NR_VCPUS and
>>>> KVM_CAP_MAX_VCPUS are not arch specific.
>>> 
>>> I checked api.txt and you're right, it isn't arch-specific.  I assumed
>>> it was, because PPC KVM doesn't support it ;-) I've dropped this patch
>>> and in its place implemented the api.txt suggestion of "if
>>> KVM_CAP_NR_VCPUS fails, use 4" instead of die(); you'll see that when
>>> I repost.
>>> 
>>> This will have the effect of PPC being limited to 4 CPUs until the kernel
>>> supports that CAP.  (I'll see about this part too.)
>> 
>> I went to look at which limitation PPC places on amount of vcpus in
>> guest, and saw this in kvmppc_core_vcpu_create() in the book3s code:
>> 
>>  vcpu = kvmppc_core_vcpu_create(kvm, id);
>>  vcpu->arch.wqp = &vcpu->wq;
>>  if (!IS_ERR(vcpu))
>>  kvmppc_create_vcpu_debugfs(vcpu, id);
>> 
>> This is wrong, right? The VCPU is dereferenced before actually checking
>> that it's not an error.
> 
> Yeah, that's b0rk.  Alex, a patch below. :)
> 
> 
> Cheers,
> 
> 
> Matt
> 
> ---
> Subject: [PATCH] KVM: PPC: Fix vcpu_create dereference before validity check.
> 
> 
> Signed-off-by: Matt Evans 

Thanks, applied to kvm-ppc-next with an actual patch description added.

Alex

> ---
> arch/powerpc/kvm/powerpc.c |5 +++--
> 1 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 084d1c5..7c7220c 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -285,9 +285,10 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, 
> unsigned int id)
> {
>   struct kvm_vcpu *vcpu;
>   vcpu = kvmppc_core_vcpu_create(kvm, id);
> - vcpu->arch.wqp = &vcpu->wq;
> - if (!IS_ERR(vcpu))
> + if (!IS_ERR(vcpu)) {
> + vcpu->arch.wqp = &vcpu->wq;
>   kvmppc_create_vcpu_debugfs(vcpu, id);
> + }
>   return vcpu;
> }
> 
> -- 
> 1.7.0.4
> 
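
For reference, a small userspace sketch of the ERR_PTR/IS_ERR convention and of the ordering the fix restores. The three helpers are simplified re-implementations written for this example, not the kernel's err.h, and struct vcpu and core_vcpu_create() are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Error codes are encoded as pointers into the top 4095 bytes of the address space. */
static void *ERR_PTR(long error)      { return (void *)(intptr_t)error; }
static long  PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static int   IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct vcpu { int id; int *wqp; int wq; };

/* Hypothetical stand-in for kvmppc_core_vcpu_create(). */
static struct vcpu *core_vcpu_create(int id)
{
    struct vcpu *v;

    if (id < 0)
        return ERR_PTR(-EINVAL);
    v = calloc(1, sizeof(*v));
    if (!v)
        return ERR_PTR(-ENOMEM);
    v->id = id;
    return v;
}

static struct vcpu *arch_vcpu_create(int id)
{
    struct vcpu *v = core_vcpu_create(id);

    if (!IS_ERR(v))
        v->wqp = &v->wq;   /* only touch v once it is known to be valid */
    return v;
}
```

With the original ordering, the assignment to wqp would have written through an encoded error value such as -EINVAL, which is not a valid address.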



Re: [PATCH 07/28] kvm tools: Move 'kvm__recommended_cpus' to arch-specific code

2011-12-20 Thread Alexander Graf

On 07.12.2011, at 09:29, Sasha Levin wrote:

> On Wed, 2011-12-07 at 18:28 +1100, Matt Evans wrote:
>> On 07/12/11 18:24, Alexander Graf wrote:
>>> 
>>> On 07.12.2011, at 08:19, Matt Evans  wrote:
>>> 
>>>> On 07/12/11 17:34, Sasha Levin wrote:
>>>>> On Wed, 2011-12-07 at 17:17 +1100, Matt Evans wrote:
>>>>>> On 06/12/11 19:20, Sasha Levin wrote:
>>>>>>> Why is it getting moved out of generic code?
>>>>>>> 
>>>>>>> This is used to determine the maximum amount of vcpus supported by the
>>>>>>> host for a single guest, and as far as I know KVM_CAP_NR_VCPUS and
>>>>>>> KVM_CAP_MAX_VCPUS are not arch specific.
>>>>>> 
>>>>>> I checked api.txt and you're right, it isn't arch-specific.  I assumed
>>>>>> it was, because PPC KVM doesn't support it ;-) I've dropped this patch
>>>>>> and in its place implemented the api.txt suggestion of "if
>>>>>> KVM_CAP_NR_VCPUS fails, use 4" instead of die(); you'll see that when
>>>>>> I repost.
>>>>>> 
>>>>>> This will have the effect of PPC being limited to 4 CPUs until the kernel
>>>>>> supports that CAP.  (I'll see about this part too.)
>>>>> 
>>>>> I went to look at which limitation PPC places on amount of vcpus in
>>>>> guest, and saw this in kvmppc_core_vcpu_create() in the book3s code:
>>>>> 
>>>>>   vcpu = kvmppc_core_vcpu_create(kvm, id);
>>>>>   vcpu->arch.wqp = &vcpu->wq;
>>>>>   if (!IS_ERR(vcpu))
>>>>>           kvmppc_create_vcpu_debugfs(vcpu, id);
>>>>> 
>>>>> This is wrong, right? The VCPU is dereferenced before actually checking
>>>>> that it's not an error.
>>>> 
>>>> Yeah, that's b0rk.  Alex, a patch below. :)
>>> 
>>> Thanks :). Will apply asap but don't have a real keyboard today :).
>> 
>> Ha!  Voice control on your phone, what could go wrong?
>> 
>>> I suppose this is stable material?
>> 
>> Good idea, (and if we're formal,
>> Signed-off-by: Matt Evans 
>> ).  I suppose no one's seen a vcpu fail to be created, yet.
> 
> I also got another one, but it's **completely untested** (not even
> compiled). Alex, Matt, any chance one of you can loan a temporary ppc
> shell for the upcoming tests of KVM tool/ppc KVM?

The problem with giving you a shell on a PPC box is really that the hardware 
Matt's work is focusing on is not available to the public yet. I could maybe 
try and get you access on a G5 box, but I'm not sure how useful that is to you 
really, as you'll only be able to run PR KVM, not HV KVM.

> 
> ---
> 
> From: Sasha Levin 
> Date: Wed, 7 Dec 2011 10:24:56 +0200
> Subject: [PATCH] KVM: PPC: Use the vcpu kmem_cache when allocating new VCPUs
> 
> Currently the code kzalloc()s new VCPUs instead of using the kmem_cache
> which is created when KVM is initialized.
> 
> Modify it to allocate VCPUs from that kmem_cache.
> 
> Signed-off-by: Sasha Levin 
> ---
> arch/powerpc/kvm/book3s_hv.c |6 +++---
> 1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 0cb137a..e309099 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -411,7 +411,7 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, 
> unsigned int id)
>   goto out;
> 
>   err = -ENOMEM;
> - vcpu = kzalloc(sizeof(struct kvm_vcpu), GFP_KERNEL);
> + vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);

Paul, is there any rationale on why not to use the kmem cache? Are we bound by 
real mode magic again?


Alex

>   if (!vcpu)
>   goto out;
> 
> @@ -463,7 +463,7 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, 
> unsigned int id)
>   return vcpu;
> 
> free_vcpu:
> - kfree(vcpu);
> + kmem_cache_free(kvm_vcpu_cache, vcpu);
> out:
>   return ERR_PTR(err);
> }
> @@ -471,7 +471,7 @@ out:
> void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
> {
>   kvm_vcpu_uninit(vcpu);
> - kfree(vcpu);
> + kmem_cache_free(kvm_vcpu_cache, vcpu);
> }
> 
> static void kvmppc_set_timer(struct kvm_vcpu *vcpu)
> 
> -- 
> 
> Sasha.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH] KVM: PPC: Add KVM_CAP_NR_VCPUS and KVM_CAP_MAX_VCPUS

2011-12-20 Thread Sasha Levin
On Tue, 2011-12-20 at 16:17 +0100, Alexander Graf wrote:
> On 08.12.2011, at 03:55, Matt Evans wrote:
> 
> > PPC KVM lacks these two capabilities, and as such a userland system must 
> > assume
> > a max of 4 VCPUs (following api.txt).  With these, a userland can determine
> > a more realistic limit.
> > 
> > Signed-off-by: Matt Evans 
> > ---
> > 
> > Alex: For when you're back in civilisation -- the kvmtool/PPC stuff will be
> > limited to 4 VCPUs until the kernel returns something for these caps.
> > 
> > Cheers, Matt
> > 
> > arch/powerpc/kvm/powerpc.c |   15 +++
> > 1 files changed, 15 insertions(+), 0 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 7c7220c..3f7219d 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -245,6 +245,21 @@ int kvm_dev_ioctl_check_extension(long ext)
> > r = 2;
> > break;
> > #endif
> > +   case KVM_CAP_NR_VCPUS:
> > +   /*
> > +* Recommending a number of CPUs is somewhat arbitrary; we 
> > return the number of present
> > +* CPUs for -HV (since a host will have secondary threads 
> > "offline"), and for other KVM
> > +* implementations just count online CPUs.
> > +*/
> > +#ifdef CONFIG_KVM_BOOK3S_64_HV
> > +   r = num_present_cpus();
> > +#else
> > +   r = num_online_cpus();
> > +#endif
> 
> That will essentially restrict us to not allow overcommitting when in the 
> scope of a single VM. Is that what we want? You could easily run a 32-way 
> guest on a 4-way host, even with _HV.
> 
> Maybe some really big number makes more sense here.

These two caps are defined pretty well on x86:

KVM_CAP_MAX_VCPUS - Absolute possible number of vcpus we can squeeze in
a single guest due to some technical limitation. On x86 it's limited to
254 due to not supporting IRQ remapping.

KVM_CAP_NR_VCPUS - This is actually an arbitrary number which reflects
the point at which adding more vcpus over this number will actually
cause a performance hit.

-- 

Sasha.
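
As a rough illustration of how a userland tool might combine the two caps, here is a sketch. The nr_cap and max_cap arguments stand for the values returned by the KVM_CHECK_EXTENSION ioctl for KVM_CAP_NR_VCPUS and KVM_CAP_MAX_VCPUS. The fallback of 4 is the api.txt suggestion mentioned in this thread; falling back from MAX to NR and the three-way return policy are assumptions made for this example, and the function names are hypothetical.

```c
#include <assert.h>

/* 0 or a negative value means the capability is not reported. */
static int recommended_vcpus(int nr_cap)
{
    return nr_cap > 0 ? nr_cap : 4;            /* api.txt fallback */
}

/* Assumption: without KVM_CAP_MAX_VCPUS, treat the recommendation as the limit. */
static int max_vcpus(int nr_cap, int max_cap)
{
    return max_cap > 0 ? max_cap : recommended_vcpus(nr_cap);
}

/* Hypothetical policy: 0 = ok, 1 = ok but beyond the recommendation, -1 = refuse. */
static int check_vcpu_count(int wanted, int nr_cap, int max_cap)
{
    if (wanted < 1 || wanted > max_vcpus(nr_cap, max_cap))
        return -1;
    return wanted > recommended_vcpus(nr_cap) ? 1 : 0;
}
```

Under this reading a 32-way guest on a 4-way host is allowed but flagged, since it exceeds the soft recommendation and stays below the hard limit.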



Re: building vgabios-stdvga.bin

2011-12-20 Thread Gerd Hoffmann
  Hi,

> I don't know if this is the right place to make this suggestion,
> but it would be nice if this version could make it into the
> qemu-kvm distribution.

> Do you know what the real constraints are?  Do you know if they're
> imposed by the vgabios driver, the VNC module, or the guest
> operating system (Windows 7 in this case)?

vnc works with 16x16 internally, so a multiple of 16 should be fine.

That reminds me of a bug; I've run into this before.  Digging ...

Patch attached.  Seems to be still not applied, has been on the
qemu-devel list several times already :-(

HTH,
  Gerd
commit a22fe41d90d484a68317e0ac46785611afab545f
Author: Gerd Hoffmann 
Date:   Mon Jun 14 12:28:23 2010 +0200

Fix vnc memory corruption with width = 1400

vnc assumes that the screen width is a multiple of 16 in several places.
If this is not the case vnc will overrun buffers, corrupt memory, make
qemu crash.

This is the minimum fix for this bug. It makes sure we don't overrun the
scanline, thereby fixing the segfault.  The rendering is *not* correct
though, there is a black border at the right side of the screen, 8
pixels wide because 1400 % 16 == 8.

Signed-off-by: Gerd Hoffmann 

diff --git a/ui/vnc.c b/ui/vnc.c
index e85ee66..afbf82c 100644
--- a/ui/vnc.c
+++ b/ui/vnc.c
@@ -2445,7 +2445,7 @@ static int vnc_refresh_server_surface(VncDisplay *vd)
 guest_ptr  = guest_row;
 server_ptr = server_row;
 
-for (x = 0; x < vd->guest.ds->width;
+for (x = 0; x + 15 < vd->guest.ds->width;
 x += 16, guest_ptr += cmp_bytes, server_ptr += cmp_bytes) {
 if (!test_and_clear_bit((x / 16), vd->guest.dirty[y]))
 continue;
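
The arithmetic behind the commit message can be checked with a small sketch. It reuses the loop bound from the patch; the function names are made up for this example and nothing here is taken from ui/vnc.c beyond that bound.

```c
#include <assert.h>

#define VNC_TILE 16   /* vnc compares the framebuffer in 16-pixel-wide chunks */

/* Number of full chunks visited by "for (x = 0; x + 15 < width; x += 16)". */
static int full_chunks(int width)
{
    int n = 0;

    for (int x = 0; x + (VNC_TILE - 1) < width; x += VNC_TILE)
        n++;
    return n;
}

/* Pixels at the right edge that the patched loop never compares. */
static int uncovered_pixels(int width)
{
    return width - full_chunks(width) * VNC_TILE;
}
```

For a width of 1400 this gives 87 full chunks and 8 uncovered pixels, matching the black border described above. Of the widths tried earlier in the thread, 2560 leaves nothing uncovered, while 2556 and 2552 leave 12 and 8 pixels.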


Re: [PATCH] KVM: PPC: Add KVM_CAP_NR_VCPUS and KVM_CAP_MAX_VCPUS

2011-12-20 Thread Alexander Graf

On 20.12.2011, at 16:32, Sasha Levin wrote:

> On Tue, 2011-12-20 at 16:17 +0100, Alexander Graf wrote:
>> On 08.12.2011, at 03:55, Matt Evans wrote:
>> 
>>> PPC KVM lacks these two capabilities, and as such a userland system must 
>>> assume
>>> a max of 4 VCPUs (following api.txt).  With these, a userland can determine
>>> a more realistic limit.
>>> 
>>> Signed-off-by: Matt Evans 
>>> ---
>>> 
>>> Alex: For when you're back in civilisation -- the kvmtool/PPC stuff will be
>>> limited to 4 VCPUs until the kernel returns something for these caps.
>>> 
>>> Cheers, Matt
>>> 
>>> arch/powerpc/kvm/powerpc.c |   15 +++
>>> 1 files changed, 15 insertions(+), 0 deletions(-)
>>> 
>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>> index 7c7220c..3f7219d 100644
>>> --- a/arch/powerpc/kvm/powerpc.c
>>> +++ b/arch/powerpc/kvm/powerpc.c
>>> @@ -245,6 +245,21 @@ int kvm_dev_ioctl_check_extension(long ext)
>>> r = 2;
>>> break;
>>> #endif
>>> +   case KVM_CAP_NR_VCPUS:
>>> +   /*
>>> +* Recommending a number of CPUs is somewhat arbitrary; we 
>>> return the number of present
>>> +* CPUs for -HV (since a host will have secondary threads 
>>> "offline"), and for other KVM
>>> +* implementations just count online CPUs.
>>> +*/
>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>> +   r = num_present_cpus();
>>> +#else
>>> +   r = num_online_cpus();
>>> +#endif
>> 
>> That will essentially restrict us to not allow overcommitting when in the 
>> scope of a single VM. Is that what we want? You could easily run a 32-way 
>> guest on a 4-way host, even with _HV.
>> 
>> Maybe some really big number makes more sense here.
> 
> These two caps are defined pretty well on x86:
> 
> KVM_CAP_MAX_VCPUS - Absolute possible number of vcpus we can squeeze in
> a single guest due to some technical limitation. On x86 it's limited to
> 254 due to not supporting IRQ remapping.
> 
> KVM_CAP_NR_VCPUS - This is actually an arbitrary number which reflects
> the point at which adding more vcpus over this number will actually
> cause a performance hit.

Ah cool - then it's all fine and shiny. Thanks for the reminder! :)

Applied the patch to kvm-ppc-next (with 80 character line limit fixed).

Alex
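
For reference, a minimal sketch of how a userland tool might combine the two caps with the api.txt fallback mentioned in the patch description (assume 4 VCPUs when KVM_CAP_NR_VCPUS is not reported, and reuse the recommended value when KVM_CAP_MAX_VCPUS is not reported). The helper names `vcpu_limits` and `kvm_check_extension` are invented for this example; the ioctl and capability numbers are the values from linux/kvm.h as far as I know them, so please double-check them against your headers.

```python
import fcntl
import os

KVM_CHECK_EXTENSION = 0xAE03  # _IO(KVMIO, 0x03) with KVMIO == 0xAE
KVM_CAP_NR_VCPUS = 9          # recommended number of vcpus
KVM_CAP_MAX_VCPUS = 66        # hard upper limit
FALLBACK_VCPUS = 4            # api.txt default when the cap is missing


def vcpu_limits(check_extension):
    """Return (recommended, maximum) given a KVM_CHECK_EXTENSION callable."""
    nr = check_extension(KVM_CAP_NR_VCPUS)
    recommended = nr if nr > 0 else FALLBACK_VCPUS
    mx = check_extension(KVM_CAP_MAX_VCPUS)
    maximum = mx if mx > 0 else recommended
    return recommended, maximum


def kvm_check_extension(cap, path="/dev/kvm"):
    """Ask the running kernel about one capability (needs access to /dev/kvm)."""
    fd = os.open(path, os.O_RDWR)
    try:
        return fcntl.ioctl(fd, KVM_CHECK_EXTENSION, cap)
    finally:
        os.close(fd)
```

On a KVM host, `vcpu_limits(kvm_check_extension)` would give the pair for the running kernel; a kernel without the patch answers 0 for both caps, and the sketch then falls back to (4, 4).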



Re: [PATCH] intel-iommu: Add device info into list before doing context mapping

2011-12-20 Thread Chris Wright
* Hao, Xudong (xudong@intel.com) wrote:
> @@ -2282,6 +2276,14 @@ static int domain_add_dev_info(struct dmar_domain 
> *domain,
>   pdev->dev.archdata.iommu = info;
>   spin_unlock_irqrestore(&device_domain_lock, flags);
>  
> + ret = domain_context_mapping(domain, pdev, translation);
> + if (ret) {
> + list_del(&info->link);
> + list_del(&info->global);

At the very least, this is not correct locking.

> + free_devinfo_mem(info);
> + return ret;
> + }
> +


Re: [PATCH 2/2]: kvm-s390: add KVM_S390_GET/SET_SREGS2 call for additional hw regs

2011-12-20 Thread Christian Borntraeger
On 20/12/11 15:20, Alexander Graf wrote:
>> +Reads special registers from the vcpu which are not covered by sregs.
>> +
>> +/* s390x */
>> +struct kvm_sregs2 {
>> +__u64 ckc;  /* clock comparator */
>> +__u64 cputm;/* cpu timer */
>> +__u64 gbea; /* guest breaking event address */
>> +__u32 todpr;/* tod programmable field */
>> +__u32 prefix;   /* prefix register */
>> +};
> 
> Would it make sense to instead use the GET_ONE_REG and SET_ONE_REG interfaces?
> 
>   http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/80854

Still not sure if I like that interface or not, but it should work.
I have some questions, though

- how should userspace check if the kernel supports this specific register?
- What would a GET_MANY_REGS / SET_MANY_REGS look like?
- Is the interface limited to 56 registers? (see the ID)
- scalability and performance. I don't know about other platforms, but the
  exit overhead on s390 is in the same order of magnitude as a system call
  overhead, so multiple ioctls on the exit path will make the exit overhead
  noticeably more expensive (probably can be solved by a MANY variant). This
  might be a micro optimization though.
  (actually the only register that bothers me regarding performance right now 
  is prefix. qemu will need the content if it has to write to the prefix page. 
  Would be good to have an interface to get that without doing another system
  call)

Christian



Re: [PATCH 2/2]: kvm-s390: add KVM_S390_GET/SET_SREGS2 call for additional hw regs

2011-12-20 Thread Alexander Graf

On 20.12.2011, at 17:16, Christian Borntraeger wrote:

> On 20/12/11 15:20, Alexander Graf wrote:
>>> +Reads special registers from the vcpu which are not covered by sregs.
>>> +
>>> +/* s390x */
>>> +struct kvm_sregs2 {
>>> +   __u64 ckc;  /* clock comparator */
>>> +   __u64 cputm;/* cpu timer */
>>> +   __u64 gbea; /* guest breaking event address */
>>> +   __u32 todpr;/* tod programmable field */
>>> +   __u32 prefix;   /* prefix register */
>>> +};
>> 
>> Would it make sense to instead use the GET_ONE_REG and SET_ONE_REG 
>> interfaces?
>> 
>>  http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/80854
> 
> Still not sure if I like that interface or not, but it should work.
> I have some questions, though
> 
> - how should userspace check if the kernel supports this specific register?

It would try to get/set it and get the respective return value.

> - How would a GET_MANY_REGS / SET_MANY_REGS look like?

struct many_regs {
    __u64 nr_regs;  /* array length for the following three arrays */
    __u64 ids_ptr;  /* ptr to user space memory with an array where we
                       store the ids we request */
    __u64 ret_ptr;  /* ptr to user space memory with an array where we
                       store return values */
    __u64 reg_ptr;  /* ptr to user space memory with an array where the
                       registers are */
};

for (i = 0; i < nr_regs; i++) {
    u32 __user *ret;
    u64 __user *ids;
    struct one_reg __user *reg;
    struct one_reg tmp_reg;
    int size;

    size = do_normal_one_reg(get_user(&ids[i]), &tmp_reg);
    if (size < 0)
        return size;
    copy_reg(&reg[i], &tmp_reg, size);
}

return 0;

Something like this maybe. We could also combine the 3 pointers into a single 
struct and just have the user pass an array of that struct to kernel space.
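
To make the batching idea a bit more concrete, here is a small user-space model in Python. It is only a sketch of one possible reading of the proposal: instead of bailing out on the first error it fills a per-register status array (the role `ret_ptr` would play), which also covers the "try it and look at the return value" probing mentioned above. The register ids, `FAKE_REGS`, `get_one_reg` and `get_many_regs` are all made up for illustration and are not part of any existing interface.

```python
import errno

# Hypothetical register file: id -> value. The ids are invented for this sketch.
FAKE_REGS = {0x1: 0x2000, 0x2: 0xDEAD}


def get_one_reg(reg_id):
    """Stand-in for a single-register getter: returns (status, value)."""
    if reg_id not in FAKE_REGS:
        return -errno.EINVAL, None
    return 0, FAKE_REGS[reg_id]


def get_many_regs(ids):
    """Fetch several registers in one call, keeping one status per register."""
    rets, regs = [], []
    for reg_id in ids:
        ret, value = get_one_reg(reg_id)
        rets.append(ret)
        regs.append(value)
    return rets, regs


# One call, three registers; the middle id is unknown to this fake kernel.
rets, regs = get_many_regs([0x1, 0x99, 0x2])
print(rets, regs)
```

The caller gets one status per requested id, so an unsupported register shows up as a negative errno in its slot while the others are still returned.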

> - Is the interface limited to 56 registers? (see the ID)

Uh. It's limited to 0x0fff registers per architecture :).

> - scalability and performance. I dont know about other platforms, but the 
>  exit overhead on s390 is in the same order of magnitute as a system call
>  overhead, so multiple ioctls on the exit path will make the exit overhead 
>  noticably more expensive (probably can be solved by a MANY variant). This
>  might be a micro optimization though.
>  (actually the only register that bothers me regarding performance right now 
>  is prefix. qemu will need the content if it has to write to the prefix page. 
>  Would be good to have an interface to get that without doing another system
>  call)

Do you expect the prefix register to be synced often? If so, then you should
maybe put it into kvm_struct and always have it shared between kernel and user
space, always updating it on every user space exit and entry (you can optimize
by checking if it changed).

I don't think user space should worry about prefix too often though. Unless you
expect anyone to DMA into the CPU prefix area :).


Alex



[PATCH] kvm tool: Introduce own BUG_ON handler

2011-12-20 Thread Cyrill Gorcunov
Raise SIGABRT if a critical run-time
problem is found.

Proposed-by: Ingo Molnar 
Signed-off-by: Cyrill Gorcunov 
---

Ingo, you meant something like below?

 tools/kvm/include/kvm/util.h |   17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

Index: linux-2.6.git/tools/kvm/include/kvm/util.h
===
--- linux-2.6.git.orig/tools/kvm/include/kvm/util.h
+++ linux-2.6.git/tools/kvm/include/kvm/util.h
@@ -9,7 +9,6 @@
  * Some bits are stolen from perf tool :)
  */
 
-#include 
 #include 
 #include 
 #include 
@@ -17,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -51,9 +51,20 @@ extern void set_die_routine(void (*routi
__func__, __LINE__, ##__VA_ARGS__); \
} while (0)
 
-#
+
 #define BUILD_BUG_ON(condition)	((void)sizeof(char[1 - 2*!!(condition)]))
-#define BUG_ON(condition)  assert(!(condition))
+
+#ifndef BUG_ON_HANDLER
+# define BUG_ON_HANDLER(condition) \
+   do {\
+   if ((condition)) {  \
+   pr_err("BUG at %s:%d", __FILE__, __LINE__); \
+   raise(SIGABRT); \
+   }   \
+   } while (0)
+#endif
+
+#define BUG_ON(condition)  BUG_ON_HANDLER((condition))
 
 #define DIE_IF(cnd)\
 do {   \


Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Jan Kiszka
On 2011-12-20 15:07, Anthony Liguori wrote:
> On 12/20/2011 07:57 AM, Paolo Bonzini wrote:
>> On 12/20/2011 02:54 PM, Anthony Liguori wrote:
>>>> In QOM parlance Jan implemented this:
>>>>
>>>> abstract class Object
>>>> abstract class Device
>>>> class APIC: { backend: link }
>>>> abstract class APICBackend
>>>> class QEMU_APICBackend
>>>> class KVM_APICBackend
>>>
>>> I don't fundamentally object to modeling it like this provided that it's
>>> modeled (and visible) through qdev and not done through a one-off
>>> infrastructure.
>>
>> There is no superclass of DeviceState, hence doing it through qdev
>> would mean
>> introducing a new bus type and so on. This would be a superb example of a
>> useless bus that can disappear with QOM, but I don't see why we should
>> take the
>> pain to add it in the first place. :)
> 
> Right, so let's modeled it for now as inheritance which qdev can cope with.

Do we have a clear plan now how to sort out the addressing issues in
this model? I mean when registering two devices under different names
that are supposed to be addressable under the same alias once
instantiated. I didn't follow recent qtree naming changes in details
unfortunately, if they already enable this.

This does not need to be implemented before merge. I just like to have a
common view on how to address it once it matters (for device inspection).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH] virt-test: support static ip address in framework

2011-12-20 Thread Lucas Meneghel Rodrigues

On 12/19/2011 11:11 AM, Amos Kong wrote:

Sometimes we need to test with guest(s) which have static IP
address(es).
e.g. No real/emulated DHCP server in the test environment.
e.g. Test with an old image whose net config we don't want to change.
e.g. Test when DHCP has problems.


Ok Amos, looks reasonable. Would you please send a v2 with ip_nic 
commented out and a companion wiki documentation? It'd be the start of a 
KVM autotest networking documentation. In case you are not aware, the 
autotest wiki is now a git repo, you can clone it, edit the pages on 
your editor, commit and push the changes.


If you have any problems, please contact me.

Cheers,

Lucas


This is an example of using static ip address:
1. edit ifcfg-eth0 of guest to assign static IP
(192.168.100.110). You can also do this by install
post-script/serial.
2. add and setup bridge in host
# brctl addbr vbr
# ifconfig vbr 192.168.100.1
3. add script for setup tap device
/etc/qemu-ifup-vbr
| #!/bin/sh
| switch=vbr
| /sbin/ifconfig $1 0.0.0.0 up
| /usr/sbin/brctl addif ${switch} $1
| /usr/sbin/brctl setfd ${switch} 0
| /usr/sbin/brctl stp ${switch} off
4. assign parameters in config file and execute test as usual
test.cfg:
| ip_nic1 = 192.168.100.110
| mac_nic1 = 11:22:33:44:55:67
| bridge = vbr

Signed-off-by: Amos Kong
---
  client/tests/kvm/base.cfg.sample |3 +++
  1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/client/tests/kvm/base.cfg.sample b/client/tests/kvm/base.cfg.sample
index 411decf..c86ec1f 100644
--- a/client/tests/kvm/base.cfg.sample
+++ b/client/tests/kvm/base.cfg.sample
@@ -17,6 +17,9 @@ nics = nic1
  # Connect NIC devices to host bridge device
  bridge = virbr0

+# Tell framework of nic1's static ip address
+ip_nic1 = 192.168.100.110
+
  # List of block device object names (whitespace seperated)
  images = image1
  # List of optical device object names
diff --git a/client/virt/kvm_vm.py b/client/virt/kvm_vm.py
index fa258c3..1fb177f 100644
--- a/client/virt/kvm_vm.py
+++ b/client/virt/kvm_vm.py
@@ -821,7 +821,12 @@ class VM(virt_vm.BaseVM):
  if mac:
  virt_utils.set_mac_address(self.instance, vlan, mac)
  else:
-virt_utils.generate_mac_address(self.instance, vlan)
+mac = virt_utils.generate_mac_address(self.instance, vlan)
+
+if nic_params.get("ip"):
+self.address_cache[mac] = nic_params.get("ip")
+logging.debug("(address cache) Adding static cache entry: "
+  "%s --->  %s" % (mac, nic_params.get("ip")))

  # Assign a PCI assignable device
  self.pci_assignable = None
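
The effect of the kvm_vm.py hunk can be tried in isolation. The following is a stand-alone Python sketch, not the autotest code itself: `register_nic` and the `generate_mac` callback are hypothetical stand-ins for the virt_utils helpers, and the addresses are the ones from the example config above. It shows why the generated MAC has to be kept in `mac`: the static IP is cached under that key.

```python
import logging


def register_nic(address_cache, nic_params, generate_mac):
    """Pick the NIC's MAC and, if a static IP is configured, cache MAC -> IP."""
    mac = nic_params.get("mac")
    if not mac:
        # Keep the generated address; it is the key for the cache entry.
        mac = generate_mac()

    ip = nic_params.get("ip")
    if ip:
        address_cache[mac] = ip
        logging.debug("(address cache) Adding static cache entry: %s ---> %s",
                      mac, ip)
    return mac


cache = {}
# NIC with a static address, as in the test.cfg example.
register_nic(cache,
             {"mac": "11:22:33:44:55:67", "ip": "192.168.100.110"},
             lambda: "52:54:00:00:00:01")
# NIC without a static address: nothing is added to the cache.
register_nic(cache, {}, lambda: "52:54:00:00:00:02")
print(cache)
```

Only the NIC that carries an `ip` parameter ends up in the cache, so guests that use DHCP would be unaffected.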



Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Anthony Liguori

On 12/20/2011 11:02 AM, Jan Kiszka wrote:
> On 2011-12-20 15:07, Anthony Liguori wrote:
>> On 12/20/2011 07:57 AM, Paolo Bonzini wrote:
>>> On 12/20/2011 02:54 PM, Anthony Liguori wrote:
>>>>> In QOM parlance Jan implemented this:
>>>>>
>>>>> abstract class Object
>>>>> abstract class Device
>>>>> class APIC: { backend: link }
>>>>> abstract class APICBackend
>>>>> class QEMU_APICBackend
>>>>> class KVM_APICBackend
>>>>
>>>> I don't fundamentally object to modeling it like this provided that it's
>>>> modeled (and visible) through qdev and not done through a one-off
>>>> infrastructure.
>>>
>>> There is no superclass of DeviceState, hence doing it through qdev
>>> would mean
>>> introducing a new bus type and so on. This would be a superb example of a
>>> useless bus that can disappear with QOM, but I don't see why we should
>>> take the
>>> pain to add it in the first place. :)
>>
>> Right, so let's modeled it for now as inheritance which qdev can cope with.
>
> Do we have a clear plan now how to sort out the addressing issues in
> this model? I mean when registering two devices under different names
> that are supposed to be addressable under the same alias once
> instantiated. I didn't follow recent qtree naming changes in details
> unfortunately, if they already enable this.


I think everyone is in agreement.  We'll start with an APICBase type that's 
modeled in qdev as a base class.


There will be an APICBaseInfo that will replace APICBackend.

There will be two classes that implement APICBaseInfo, KvmAPIC and APIC.  They 
will be separate devices.


APICBase will register the vmsd and will use the name "apic" to register it. 
You can just set the qdev.vmsd field in the apic_qdev_register() function to 
ensure that both use the same implementation.
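
A rough model of that layout, in Python rather than qdev C, just to illustrate the idea: a common base that owns the migration description under one name, and two separate devices deriving from it. The class bodies, the `qdev_name` strings and the `save`/`load` helpers are assumptions made for this sketch; only the names APICBase, APIC, KvmAPIC and the "apic" vmsd name come from the text above.

```python
class APICBase:
    """Common state plus the migration description shared by both devices."""

    vmsd_name = "apic"  # section name both implementations register

    def __init__(self):
        self.regs = {}

    def save(self):
        """Serialize the state under the shared section name."""
        return {"section": self.vmsd_name, "regs": dict(self.regs)}

    def load(self, blob):
        """Restore state, accepting only the shared section name."""
        if blob["section"] != self.vmsd_name:
            raise ValueError("unexpected section %r" % blob["section"])
        self.regs = dict(blob["regs"])


class APIC(APICBase):
    qdev_name = "apic"      # user space emulated device (name assumed)


class KvmAPIC(APICBase):
    qdev_name = "kvm-apic"  # in-kernel variant (name assumed)


# State saved by one implementation loads into the other, because both
# inherit the same description and section name from the base class.
src = APIC()
src.regs["id"] = 3
dst = KvmAPIC()
dst.load(src.save())
print(dst.regs)
```

The two devices stay distinct for instantiation purposes while the saved-state format is defined once in the base.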




> This does not need to be implemented before merge. I just like to have a
> common view on how to address it once it matters (for device inspection).


You can do this all today without any pending patches.  As I mentioned earlier, 
I don't mind doing this after the fact if you'd just like to get the current 
series merged.


If your series lands before the QOM series I just posted, then I will need to do 
it as part of the QOM series anyway.


Regards,

Anthony Liguori


> Jan





Re: [PATCH][Autotest] Autotest: Add subtest inteface to client utils.

2011-12-20 Thread Lucas Meneghel Rodrigues

On 12/09/2011 10:50 AM, Jiří Župka wrote:

This class and some decorators provide an easy way to run a function as a
subtest.
Subtest results are collected and can be reviewed at the end of the test.
The Subtest class and decorators should be placed in autotest_lib.client.utils.

 The results format can be changed.

 Example:
 @staticmethod
 def result_to_string(result):
 """
 @param result: Result of test.
 """
 print result
     return ("[%(result)s]%(name)s: %(output)s") % (result)

   1)
 Subtest.result_to_string = result_to_string
 Subtest.get_text_result()

   2)
 Subtest.get_text_result(result_to_string)
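
A note on the formatter: Python's mapping-style `%` formatting needs a conversion type after each `%(key)`, so a custom formatter has to spell out `%(result)s` and so on. A stand-alone sketch of such a formatter follows; the sample dictionary mirrors the keys the Subtest class stores ('result', 'name', 'args', 'kargs', 'output'), while the values are invented for the example.

```python
def result_to_string(result):
    """
    Format one subtest result dictionary as a single line.

    @param result: Result of test (dict with 'result', 'name', 'output', ...).
    """
    return "[%(result)s]%(name)s: %(output)s" % result


sample = {
    "result": "PASS",
    "name": "test_a",
    "args": (),
    "kargs": {},
    "output": "ok",
}
line = result_to_string(sample)
print(line)
```

Extra keys in the dictionary are simply ignored by the format string, so the same formatter should keep working if more fields are added to the result later.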

Pull-request: https://github.com/autotest/autotest/pull/111


^ I made a few remarks to the pull request, and now I wait on an updated 
version of the patchset. Thanks Jiri!



Signed-off-by: Jiří Župka
---
  client/common_lib/base_utils.py  |  214 ++
  client/common_lib/base_utils_unittest.py |  117 
  2 files changed, 331 insertions(+), 0 deletions(-)

diff --git a/client/common_lib/base_utils.py b/client/common_lib/base_utils.py
index 005e3b0..fc6578d 100644
--- a/client/common_lib/base_utils.py
+++ b/client/common_lib/base_utils.py
@@ -119,6 +119,220 @@ class BgJob(object):
  signal.signal(signal.SIGPIPE, signal.SIG_DFL)


+def subtest_fatal(function):
+"""
+Decorator which mark test critical.
+If subtest failed whole test ends.
+"""
+def wrapped(self, *args, **kwds):
+self._fatal = True
+self.decored()
+result = function(self, *args, **kwds)
+return result
+wrapped.func_name = function.func_name
+return wrapped
+
+
+def subtest_nocleanup(function):
+"""
+Decorator disable cleanup function.
+"""
+def wrapped(self, *args, **kwds):
+self._cleanup = False
+self.decored()
+result = function(self, *args, **kwds)
+return result
+wrapped.func_name = function.func_name
+return wrapped
+
+
+class Subtest(object):
+"""
+Collect result of subtest of main test.
+"""
+result = []
+passed = 0
+failed = 0
+def __new__(cls, *args, **kargs):
+self = super(Subtest, cls).__new__(cls)
+
+self._fatal = False
+self._cleanup = True
+self._num_decored = 0
+
+ret = None
+if args is None:
+args = []
+
+res = {
+   'result' : None,
+   'name'   : self.__class__.__name__,
+   'args'   : args,
+   'kargs'  : kargs,
+   'output' : None,
+  }
+try:
+logging.info("Starting test %s" % self.__class__.__name__)
+ret = self.test(*args, **kargs)
+res['result'] = 'PASS'
+res['output'] = ret
+try:
+logging.info(Subtest.result_to_string(res))
+except:
+self._num_decored = 0
+raise
+Subtest.result.append(res)
+Subtest.passed += 1
+except NotImplementedError:
+raise
+except Exception:
+exc_type, exc_value, exc_traceback = sys.exc_info()
+for _ in range(self._num_decored):
+exc_traceback = exc_traceback.tb_next
+logging.error("In function (" + self.__class__.__name__ + "):")
+logging.error("Call from:\n" +
+  traceback.format_stack()[-2][:-1])
+logging.error("Exception from:\n" +
+  "".join(traceback.format_exception(
+  exc_type, exc_value,
+  exc_traceback.tb_next)))
+# Clean up environment after subTest crash
+res['result'] = 'FAIL'
+logging.info(self.result_to_string(res))
+Subtest.result.append(res)
+Subtest.failed += 1
+if self._fatal:
+raise
+finally:
+if self._cleanup:
+self.clean()
+
+return ret
+
+
+def test(self):
+"""
+Check if test is defined.
+
+For makes test fatal add before implementation of test method
+decorator @subtest_fatal
+"""
+raise NotImplementedError("Method test is not implemented.")
+
+
+def clean(self):
+"""
+Check if cleanup is defined.
+
+For makes test fatal add before implementation of test method
+decorator @subtest_nocleanup
+"""
+raise NotImplementedError("Method cleanup is not implemented.")
+
+
+def decored(self):
+self._num_decored += 1
+
+
+@classmethod
+def has_failed(cls):
+"""
+@return: If any of subtest not pass return True.
+"""
+if cls.failed>  0:
+

Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-20 Thread Tejun Heo
Hello,

On Tue, Dec 20, 2011 at 09:50:24AM -0500, Vivek Goyal wrote:
> So IIUC, existing mempool implementation is not directly usable for my
> requirement and I need to write some code of my own for the caching
> layer which always allocates objects from reserve and fills in the
> pool asynchronously with the help of a worker thread.

I've been looking at it and don't think allowing percpu allocator to
be called from no io path is a good idea.  The on-demand area filling
is tied into vmalloc area management which in turn is tied to arch
page table code and I really want to avoid pre-allocating full chunk -
it can be huge.  I'm trying to extend mempool to cover percpu areas,
so please wait a bit.

Thanks.

-- 
tejun


Re: Current kernel fails to compile with KVM on PowerPC

2011-12-20 Thread Jörg Sommer
Hello Alexander,

Jörg Sommer wrote on Mon 07. Nov, 20:48 (+0100):
>   CHK include/linux/version.h
>   HOSTCC  scripts/mod/modpost.o
>   CHK include/generated/utsrelease.h
>   UPD include/generated/utsrelease.h
>   HOSTLD  scripts/mod/modpost
>   GEN include/generated/bounds.h
>   CC  arch/powerpc/kernel/asm-offsets.s
> In file included from arch/powerpc/kernel/asm-offsets.c:59:0:
> /home/joerg/git/linux/arch/powerpc/include/asm/kvm_book3s.h: In function 
> ‘compute_tlbie_rb’:
> /home/joerg/git/linux/arch/powerpc/include/asm/kvm_book3s.h:393:10: error: 
> ‘HPTE_V_SECONDARY’ undeclared (first use in this function)
> /home/joerg/git/linux/arch/powerpc/include/asm/kvm_book3s.h:393:10: note: 
> each undeclared identifier is reported only once for each function it appears 
> in
> /home/joerg/git/linux/arch/powerpc/include/asm/kvm_book3s.h:396:12: error: 
> ‘HPTE_V_1TB_SEG’ undeclared (first use in this function)
> /home/joerg/git/linux/arch/powerpc/include/asm/kvm_book3s.h:401:10: error: 
> ‘HPTE_V_LARGE’ undeclared (first use in this function)
> /home/joerg/git/linux/arch/powerpc/include/asm/kvm_book3s.h:415:2: warning: 
> right shift count >= width of type [enabled by default]
> make[3]: *** [arch/powerpc/kernel/asm-offsets.s] Fehler 1
> make[2]: *** [prepare0] Fehler 2
> make[1]: *** [deb-pkg] Fehler 2
> make: *** [deb-pkg] Fehler 2

I'm sorry to have to report a new bug in one of your patches. It was
masked by the error above. I picked the commit that fixes that error from
your git repository, and now the kernel build fails with this error:

  BOOTCC  arch/powerpc/boot/fdt_strerror.o
  BOOTAR  arch/powerpc/boot/wrapper.a
  WRAP    arch/powerpc/boot/zImage.pmac
  WRAP    arch/powerpc/boot/zImage.coff
ERROR: "kvmppc_h_pr" [arch/powerpc/kvm/kvm.ko] undefined!
make[3]: *** [__modpost] Error 1
make[2]: *** [modules] Error 2
make[2]: *** Waiting for unfinished jobs....
  WRAP    arch/powerpc/boot/zImage.miboot
make[1]: *** [deb-pkg] Error 2
make: *** [deb-pkg] Error 2

The bug was introduced by this commit

HEAD is now at aacf9aa KVM: PPC: Stub emulate CFAR and PURR SPRs
a668f2bd3f14ce7f92e119f4b5d9b50cdc59e855 is the first bad commit
commit a668f2bd3f14ce7f92e119f4b5d9b50cdc59e855
Author: Alexander Graf 
Date:   Mon Aug 8 17:26:24 2011 +0200

KVM: PPC: Support SC1 hypercalls for PAPR in PR mode

PAPR defines hypercalls as SC1 instructions. Using these, the guest modifies
page tables and does other privileged operations that it wouldn't be allowed
to do in supervisor mode.

This patch adds support for PR KVM to trap these instructions and route them
through the same PAPR hypercall interface that we already use for HV style
KVM.

Signed-off-by: Alexander Graf 

Bye, Jörg.
-- 
“Perl—the only language that looks the same
 before and after RSA encryption.”   (Keith Bostic)




Re: [PATCH v2 2/2] kvm: Device assignment permission checks

2011-12-20 Thread Sasha Levin
On Tue, 2011-12-20 at 07:30 -0700, Alex Williamson wrote:
> Only allow KVM device assignment to attach to devices which:
> 
>  - Are not bridges
>  - Have BAR resources (assume others are special devices)
>  - The user has permissions to use
> 
> Assigning a bridge is a configuration error, it's not supported, and
> typically doesn't result in the behavior the user is expecting anyway.
> Devices without BAR resources are typically chipset components that
> also don't have host drivers.  We don't want users to hold such devices
> captive or cause system problems by fencing them off into an iommu
> domain.  We determine "permission to use" by testing whether the user
> has access to the PCI sysfs resource files.  By default a normal user
> will not have access to these files, so it provides a good indication
> that an administration agent has granted the user access to the device.
> 
> Signed-off-by: Alex Williamson 
> ---
> 
>  Documentation/virtual/kvm/api.txt |4 +++
>  virt/kvm/assigned-dev.c   |   55 
> -
>  2 files changed, 58 insertions(+), 1 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index ee2c96b..4df9af4 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1154,6 +1154,10 @@ following flags are specified:
>  The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
>  isolation of the device.  Usages not specifying this flag are deprecated.
>  
> +Only PCI header type 0 devices with PCI BAR resources are supported by
> +device assignment.  The user requesting this ioctl must have read/write
> +access to the PCI sysfs resource files associated with the device.
> +
>  4.49 KVM_DEASSIGN_PCI_DEVICE
>  
>  Capability: KVM_CAP_DEVICE_DEASSIGNMENT
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index a251a28..faec641 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "irq.h"
>  
>  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct 
> list_head *head,
> @@ -483,9 +484,11 @@ out:
>  static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> struct kvm_assigned_pci_dev *assigned_dev)
>  {
> - int r = 0, idx;
> + int r = 0, idx, i;
>   struct kvm_assigned_dev_kernel *match;
>   struct pci_dev *dev;
> + u8 header_type;
> + bool bar_found = false;
>  
>   if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
>   return -EINVAL;
> @@ -516,6 +519,56 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>   r = -EINVAL;
>   goto out_free;
>   }
> +
> + /* Don't allow bridges to be assigned */
> + pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type);
> + if ((header_type & PCI_HEADER_TYPE) != PCI_HEADER_TYPE_NORMAL) {
> + r = -EPERM;
> + goto out_put;
> + }
> +
> + /* We want to test whether the caller has been granted permissions to
> +  * use this device.  To be able to configure and control the device,
> +  * the user needs access to PCI configuration space and BAR resources.
> +  * These are accessed through PCI sysfs.  PCI config space is often
> +  * passed to the process calling this ioctl via file descriptor, so we
> +  * can't rely on access to that file.  We can check for permissions
> +  * on each of the BAR resource files, which is a pretty clear
> +  * indicator that the user has been granted access to the device. */
> + for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCE_END; i++) {
> + char buf[64];
> + struct path path;
> + struct inode *inode;
> +
> + if (!pci_resource_len(dev, i))
> + continue;
> +
> + /* Per sysfs-rules, sysfs is always at /sys */
> + snprintf(buf, sizeof(buf), "/sys/bus/pci/devices/%04x:%02x:"
> +  "%02x.%d/resource%d", pci_domain_nr(dev->bus),
> +  dev->bus->number, PCI_SLOT(dev->devfn),
> +  PCI_FUNC(dev->devfn), i);

This should probably be done by grabbing devname out of
'dev' (kobject_get_path(&dev->dev.kobj, GFP_KERNEL) ) instead of
formatting it ourselves. This is also mentioned to be always correct in
sysfs-rules while this method isn't.
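
For anyone who wants to see what the check amounts to from the outside, here is a rough user-space analogue in Python. It is not the kernel code: it takes the device's sysfs directory as given (rather than formatting it from bus/slot/function), reads the `resource` table to find BARs with a non-zero range, and tests read/write access on the matching `resourceN` files. The function name `user_may_assign` and its `access` parameter are made up for this sketch.

```python
import os


def user_may_assign(dev_dir, access=os.access):
    """Approximate the proposed check for one PCI device directory in sysfs.

    Returns True only if the device has at least one BAR and the caller may
    read and write every BAR's resource file.
    """
    with open(os.path.join(dev_dir, "resource")) as table:
        rows = [line.split() for line in table if line.strip()]

    bar_found = False
    # The first six rows are the standard BARs; each row is "start end flags".
    for i, row in enumerate(rows[:6]):
        start, end = int(row[0], 16), int(row[1], 16)
        if start == 0 and end == 0:
            continue  # unused BAR, nothing to check
        bar_path = os.path.join(dev_dir, "resource%d" % i)
        if not access(bar_path, os.R_OK | os.W_OK):
            return False
        bar_found = True

    # No BARs at all: probably something special, refuse it.
    return bar_found
```

Calling it on a device directory such as /sys/bus/pci/devices/0000:03:00.0 (an example address) as the unprivileged user would show whether that user could pass the proposed test.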

> +
> + r = kern_path(buf, LOOKUP_FOLLOW, &path);
> + if (r)
> + goto out_put;
> +
> + inode = path.dentry->d_inode;
> +
> + r = inode_permission(inode, MAY_READ | MAY_WRITE | MAY_ACCESS);
> + path_put(&path);
> + if (r)
> + goto out_put;
> +
> + bar_found = true;
> + }
> +
> + /* If no resources, probably something special */
> + if (!bar_found) {
> + r = -

Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Jan Kiszka
On 2011-12-20 20:14, Anthony Liguori wrote:
> On 12/20/2011 11:02 AM, Jan Kiszka wrote:
>> On 2011-12-20 15:07, Anthony Liguori wrote:
>>> On 12/20/2011 07:57 AM, Paolo Bonzini wrote:
>>>> On 12/20/2011 02:54 PM, Anthony Liguori wrote:
>>>>>> In QOM parlance Jan implemented this:
>>>>>>
>>>>>> abstract class Object
>>>>>> abstract class Device
>>>>>> class APIC: { backend: link }
>>>>>> abstract class APICBackend
>>>>>> class QEMU_APICBackend
>>>>>> class KVM_APICBackend
>>>>>
>>>>> I don't fundamentally object to modeling it like this provided that
>>>>> it's
>>>>> modeled (and visible) through qdev and not done through a one-off
>>>>> infrastructure.
>>>>
>>>> There is no superclass of DeviceState, hence doing it through qdev
>>>> would mean
>>>> introducing a new bus type and so on. This would be a superb example
>>>> of a
>>>> useless bus that can disappear with QOM, but I don't see why we should
>>>> take the
>>>> pain to add it in the first place. :)
>>>
>>> Right, so let's modeled it for now as inheritance which qdev can cope
>>> with.
>>
>> Do we have a clear plan now how to sort out the addressing issues in
>> this model? I mean when registering two devices under different names
>> that are supposed to be addressable under the same alias once
>> instantiated. I didn't follow recent qtree naming changes in details
>> unfortunately, if they already enable this.
> 
> I think everyone is in agreement.  We'll start with an APICBase type
> that's modeled in qdev as a base class.
> 
> There will be an APICBaseInfo that will replace APICBackend.
> 
> There will be two classes that implement APICBaseInfo, KvmAPIC and
> APIC.  They will be separate devices.
> 
> APICBase will register the vmsd and will use the name "apic" to register
> it. You can just set the qdev.vmsd field in the apic_qdev_register()
> function to ensure that both use the same implementation.

I'm not talking about migration here, I'm talking about qtree
addressability. That is orthogonal, at least right now.

> 
>>
>> This does not need to be implemented before merge. I just like to have a
>> common view on how to address it once it matters (for device inspection).
> 
> You can do this all today without any pending patches.

Nope, don't see how.

There is currently no use case for it (e.g. no device_show -
device_add/del makes no sense for the devices in question), but it
should be addressable in QOM in the future.

Jan





Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Anthony Liguori

On 12/20/2011 03:23 PM, Jan Kiszka wrote:
> On 2011-12-20 20:14, Anthony Liguori wrote:
>> On 12/20/2011 11:02 AM, Jan Kiszka wrote:
>>> On 2011-12-20 15:07, Anthony Liguori wrote:
>>>> On 12/20/2011 07:57 AM, Paolo Bonzini wrote:
>>>>> On 12/20/2011 02:54 PM, Anthony Liguori wrote:
>>>>>>> In QOM parlance Jan implemented this:
>>>>>>>
>>>>>>> abstract class Object
>>>>>>> abstract class Device
>>>>>>> class APIC: { backend: link   }
>>>>>>> abstract class APICBackend
>>>>>>> class QEMU_APICBackend
>>>>>>> class KVM_APICBackend
>>>>>>
>>>>>> I don't fundamentally object to modeling it like this provided that
>>>>>> it's
>>>>>> modeled (and visible) through qdev and not done through a one-off
>>>>>> infrastructure.
>>>>>
>>>>> There is no superclass of DeviceState, hence doing it through qdev
>>>>> would mean
>>>>> introducing a new bus type and so on. This would be a superb example
>>>>> of a
>>>>> useless bus that can disappear with QOM, but I don't see why we should
>>>>> take the
>>>>> pain to add it in the first place. :)
>>>>
>>>> Right, so let's model it for now as inheritance which qdev can cope
>>>> with.
>>>
>>> Do we have a clear plan now how to sort out the addressing issues in
>>> this model? I mean when registering two devices under different names
>>> that are supposed to be addressable under the same alias once
>>> instantiated. I didn't follow recent qtree naming changes in details
>>> unfortunately, if they already enable this.
>>
>> I think everyone is in agreement.  We'll start with an APICBase type
>> that's modeled in qdev as a base class.
>>
>> There will be an APICBaseInfo that will replace APICBackend.
>>
>> There will be two classes that implement APICBaseInfo, KvmAPIC and
>> APIC.  They will be separate devices.
>>
>> APICBase will register the vmsd and will use the name "apic" to register
>> it. You can just set the qdev.vmsd field in the apic_qdev_register()
>> function to ensure that both use the same implementation.
>
> I'm not talking about migration here, I'm talking about qtree
> addressability. That is orthogonal, at least right now.

qtree is not an ABI.  The output of info qtree can (and will) change
over time.

>>> This does not need to be implemented before merge. I just like to have a
>>> common view on how to address it once it matters (for device inspection).
>>
>> You can do this all today without any pending patches.
>
> Nope, don't see how.

What is this issue?

> There is currently no use case for it (e.g. no device_show -
> device_add/del makes no sense for the devices in question), but it
> should be addressable in QOM in the future.

I guess I'm a bit confused...

Regards,

Anthony Liguori

> Jan





Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Jan Kiszka
On 2011-12-20 22:38, Anthony Liguori wrote:
> On 12/20/2011 03:23 PM, Jan Kiszka wrote:
>> On 2011-12-20 20:14, Anthony Liguori wrote:
>>> On 12/20/2011 11:02 AM, Jan Kiszka wrote:
>>>> On 2011-12-20 15:07, Anthony Liguori wrote:
>>>>> On 12/20/2011 07:57 AM, Paolo Bonzini wrote:
>>>>>> On 12/20/2011 02:54 PM, Anthony Liguori wrote:
>>>>>>>> In QOM parlance Jan implemented this:
>>>>>>>>
>>>>>>>> abstract class Object
>>>>>>>> abstract class Device
>>>>>>>> class APIC: { backend: link   }
>>>>>>>> abstract class APICBackend
>>>>>>>> class QEMU_APICBackend
>>>>>>>> class KVM_APICBackend
>>>>>>>
>>>>>>> I don't fundamentally object to modeling it like this provided that
>>>>>>> it's
>>>>>>> modeled (and visible) through qdev and not done through a one-off
>>>>>>> infrastructure.
>>>>>>
>>>>>> There is no superclass of DeviceState, hence doing it through qdev
>>>>>> would mean
>>>>>> introducing a new bus type and so on. This would be a superb example
>>>>>> of a
>>>>>> useless bus that can disappear with QOM, but I don't see why we
>>>>>> should
>>>>>> take the
>>>>>> pain to add it in the first place. :)
>>>>>
>>>>> Right, so let's model it for now as inheritance which qdev can cope
>>>>> with.
>>>>
>>>> Do we have a clear plan now how to sort out the addressing issues in
>>>> this model? I mean when registering two devices under different names
>>>> that are supposed to be addressable under the same alias once
>>>> instantiated. I didn't follow recent qtree naming changes in details
>>>> unfortunately, if they already enable this.
>>>
>>> I think everyone is in agreement.  We'll start with an APICBase type
>>> that's modeled in qdev as a base class.
>>>
>>> There will be an APICBaseInfo that will replace APICBackend.
>>>
>>> There will be two classes that implement APICBaseInfo, KvmAPIC and
>>> APIC.  They will be separate devices.
>>>
>>> APICBase will register the vmsd and will use the name "apic" to register
>>> it. You can just set the qdev.vmsd field in the apic_qdev_register()
>>> function to ensure that both use the same implementation.
>>
>> I'm not talking about migration here, I'm talking about qtree
>> addressability. That is orthogonal, at least right now.
> 
> qtree is not an ABI.  The output of info qtree can (and will) change
> over time.

That's not the point. The point is that at least some branch of the
qtree should be identically named for both the KVM and the user space
incarnations of a particular device (given a certain qemu version).

The request was that /qtree/path/to/apic should not change if you enable
KVM in-kernel acceleration in the very same qemu release. There can also
be some /qtree/path/to/kvm-apic then, but as alias (or as primary name
and the other becomes an alias). I think this makes sense if the user is
still able to clearly differentiate between both versions when listing
devices.

Jan





Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Anthony Liguori

On 12/20/2011 03:45 PM, Jan Kiszka wrote:
> On 2011-12-20 22:38, Anthony Liguori wrote:
>>> I'm not talking about migration here, I'm talking about qtree
>>> addressability. That is orthogonal, at least right now.
>>
>> qtree is not an ABI.  The output of info qtree can (and will) change
>> over time.
>
> That's not the point. The point is that at least some branch of the
> qtree should be identically named for both the KVM and the user space
> incarnations of a particular device (given a certain qemu version).

There is no such thing as "qtree paths".  Today, devices have ids or are 
anonymous.  The apic is currently an anonymous device and there's no way to 
address it until we complete the PC composition tree.  I have patches for this, 
but that won't land until after series 4.


Starting right now, we have a standard path mechanism.  This path will either 
follow the composition tree or potentially an arbitrary path through the link graph.


The components of the path are the *property* names of the parent device.  In 
the case of the local APIC, you would have something like:


/cpus/cpu0/apic
/cpus/cpu1/apic

Which would be links on the composition tree.  The name wouldn't change even if 
the type of this object changed.  You'll probably have a flag or something in 
the cpu object that lets you determine whether the child is created as a 
kvm-apic or just a normal apic.  But that would only affect the 'type' flag.



> The request was that /qtree/path/to/apic should not change if you enable
> KVM in-kernel acceleration in the very same qemu release.


The type names of the devices are orthogonal to the path names.


> There can also
> be some /qtree/path/to/kvm-apic then, but as alias (or as primary name
> and the other becomes an alias). I think this makes sense if the user is
> still able to clearly differentiate between both versions when listing
> devices.


Yes, they just need to read the 'type' property.  The distinguishing property 
would be:


/cpus/cpu0/apic.type = 'apic'

vs.

/cpus/cpu0/apic.type = 'kvm-apic'

But otherwise, it would look the same.

Again, if you implement qdev based inheritance as I described in my previous 
note, this will all Just Work.  We have everything we need in the tree to model 
this.
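
A minimal sketch of the path-versus-type distinction described above. It is not QEMU code: `struct toy_obj`, `toy_add_child()` and `toy_resolve()` are hypothetical names invented for illustration. The only idea taken from the message is that a path is built from property names of the parent, so it stays the same when the child's type changes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHILDREN 4

/* Hypothetical stand-in for a device object; not QEMU's real structure. */
struct toy_obj {
    const char *type;                          /* e.g. "apic" or "kvm-apic" */
    const char *child_name[TOY_MAX_CHILDREN];  /* property names on this object */
    struct toy_obj *child[TOY_MAX_CHILDREN];
    size_t nchildren;
};

/* Attach a child under a property name of the parent. */
static int toy_add_child(struct toy_obj *parent, const char *name,
                         struct toy_obj *child)
{
    if (parent->nchildren == TOY_MAX_CHILDREN)
        return -1;
    parent->child_name[parent->nchildren] = name;
    parent->child[parent->nchildren] = child;
    parent->nchildren++;
    return 0;
}

/*
 * Resolve an absolute path such as "/cpus/cpu0/apic" by following
 * property names from the root. Returns NULL if a component is missing.
 */
static struct toy_obj *toy_resolve(struct toy_obj *root, const char *path)
{
    struct toy_obj *cur = root;

    while (cur && *path) {
        struct toy_obj *next = NULL;
        size_t len, i;

        while (*path == '/')
            path++;
        len = strcspn(path, "/");
        if (len == 0)
            break;
        for (i = 0; i < cur->nchildren; i++) {
            if (strlen(cur->child_name[i]) == len &&
                strncmp(cur->child_name[i], path, len) == 0) {
                next = cur->child[i];
                break;
            }
        }
        cur = next;
        path += len;
    }
    return cur;
}
```

In this toy model, swapping the object stored under the "apic" property of cpu0 changes what `toy_resolve(root, "/cpus/cpu0/apic")->type` reports, while the path itself is unchanged.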


Regards,

Anthony Liguori



> Jan





Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Jan Kiszka
On 2011-12-20 22:55, Anthony Liguori wrote:
> On 12/20/2011 03:45 PM, Jan Kiszka wrote:
>> On 2011-12-20 22:38, Anthony Liguori wrote:
>>>> I'm not talking about migration here, I'm talking about qtree
>>>> addressability. That is orthogonal, at least right now.
>>>
>>> qtree is not an ABI.  The output of info qtree can (and will) change
>>> over time.
>>
>> That's not the point. The point is that at least some branch of the
>> qtree should be identically named for both the KVM and the user space
>> incarnations of a particular device (given a certain qemu version).
> 
> There is no such thing as "qtree paths".  Today, devices have ids or are
> anonymous.  The apic is currently an anonymous device and there's no way
> to address it until we complete the PC composition tree.  I have patches
> for this, but that won't land until after series 4.
> 
> Starting right now, we have a standard path mechanism.  This path will
> either follow the composition tree or potentially an arbitrary path
> through the link graph.
> 
> The components of the path are the *property* names of the parent
> device.  In the case of the local APIC, you would have something like:
> 
> /cpus/cpu0/apic
> /cpus/cpu1/apic
> 
> Which would be links on the composition tree.  The name wouldn't change
> even if the type of this object changed. 

Perfect! That was what I forgot about and what makes it possible to
return to the original two-device model.

> You'll probably have a flag or
> something in the cpu object that lets you determine whether the child is
> created as a kvm-apic or just a normal apic. 

I rather hope you will be able to ask the device for its type instead
of replicating that information.

Jan





Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Anthony Liguori

On 12/20/2011 04:20 PM, Jan Kiszka wrote:
> On 2011-12-20 22:55, Anthony Liguori wrote:
>> The components of the path are the *property* names of the parent
>> device.  In the case of the local APIC, you would have something like:
>>
>> /cpus/cpu0/apic
>> /cpus/cpu1/apic
>>
>> Which would be links on the composition tree.  The name wouldn't change
>> even if the type of this object changed.
>
> Perfect! That was what I forgot about and what makes it possible to
> return to the original two-device model.
>
>> You'll probably have a flag or
>> something in the cpu object that lets you determine whether the child is
>> created as a kvm-apic or just a normal apic.
>
> I rather hope you will be able to ask the device for its type instead
> of replicating that information.


Yes, but that's not what I was getting at.

I think you are currently planning on enabling/disabling the in-kernel apic 
through a machine option?


Where I'd like to get to is that the CPUs are modeled as devices and whether the 
APIC is in-kernel or not is a property of the CPU (just like any other CPU flag).


For something like the i8254, since that's a child of the PIIX3, it would be a 
property of the PIIX3 which it would use to create the appropriate i8254 type.


You could also have the CPU and/or i8254 have a link<> which would allow a user 
to explicitly instantiate the appropriate device but I think that makes it 
harder to use than it should be.


By making it a property of the composition parent, you let the parent make the 
best choice to start with and then a user has the ability to override it if it 
sees fit to.


Regards,

Anthony Liguori


> Jan





Re: [Qemu-devel] [PATCH v5 06/16] apic: Introduce backend/frontend infrastructure for KVM reuse

2011-12-20 Thread Jan Kiszka
On 2011-12-21 00:41, Anthony Liguori wrote:
> On 12/20/2011 04:20 PM, Jan Kiszka wrote:
>> On 2011-12-20 22:55, Anthony Liguori wrote:
>>> The components of the path are the *property* names of the parent
>>> device.  In the case of the local APIC, you would have something like:
>>>
>>> /cpus/cpu0/apic
>>> /cpus/cpu1/apic
>>>
>>> Which would be links on the composition tree.  The name wouldn't change
>>> even if the type of this object changed.
>>
>> Perfect! That was what I forgot about and what makes it possible to
>> return to the original two-device model.
>>
>>> You'll probably have a flag or
>>> something in the cpu object that lets you determine whether the child is
>>> created as a kvm-apic or just a normal apic.
>>
>> I rather hope you will be able to ask the device for its type instead
>> of replicating that information.
> 
> Yes, but that's not what I was getting at.
> 
> I think you are currently planning on enabling/disabling the in-kernel
> apic through a machine option?

Yes, because it is a VM-wide flag, nothing you can control per irqchip,
per chipset or whatever. It must be consistent for the whole VM, meaning
all CPUs, the chipset, the IOAPIC (which may or may not (PIIX3) be part
of it) etc. It also affects KVM internals that are not directly bound to
device models.
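
The two positions in this exchange can be put side by side in a small sketch. Everything here is hypothetical (`struct toy_cpu`, the `kernel_apic` field, and the helper names are made up for illustration and are not QEMU or KVM interfaces): the parent creates its APIC child according to a property, while a single VM-wide setting seeds that property so all CPUs agree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical CPU object: the property decides which child type is created. */
struct toy_cpu {
    bool kernel_apic;       /* assumed per-CPU property */
    const char *apic_type;  /* type of the APIC child created for this CPU */
};

/* The composition parent picks the child type from its own property. */
static void toy_cpu_create_apic(struct toy_cpu *cpu)
{
    cpu->apic_type = cpu->kernel_apic ? "kvm-apic" : "apic";
}

/* A VM-wide option seeds the property on every CPU, keeping them consistent. */
static void toy_vm_init(struct toy_cpu *cpus, size_t n, bool vm_kernel_irqchip)
{
    size_t i;

    for (i = 0; i < n; i++) {
        cpus[i].kernel_apic = vm_kernel_irqchip;
        toy_cpu_create_apic(&cpus[i]);
    }
}

/* Check the constraint that the setting is identical across the whole VM. */
static bool toy_vm_irqchip_consistent(const struct toy_cpu *cpus, size_t n)
{
    size_t i;

    for (i = 1; i < n; i++)
        if (cpus[i].kernel_apic != cpus[0].kernel_apic)
            return false;
    return true;
}
```

The consistency check only models the requirement stated above that the flag be the same for the whole VM; how a real implementation would enforce it is not specified in this thread.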

Jan





Re: [PATCH V2 4/6] kvm tools: Add PPC64 XICS interrupt controller support

2011-12-20 Thread David Gibson
On Tue, Dec 20, 2011 at 12:16:40PM +1100, Matt Evans wrote:
> Hi David,
> 
> On 14/12/11 13:35, David Gibson wrote:
> > On Tue, Dec 13, 2011 at 06:10:48PM +1100, Matt Evans wrote:
> >> This patch adds XICS emulation code (heavily borrowed from QEMU), and wires
> >> this into kvm_cpu__irq() to fire a CPU IRQ via KVM.  A device tree entry is
> >> also added.  IPIs work, xics_alloc_irqnum() is added to allocate an external
> >> IRQ (which will later be used by the PHB PCI code) and finally, kvm__irq_line()
> >> can be called to raise an IRQ on XICS.
> > 
> > Hrm, looks like you took a somewhat old version of xics.c from qemu.
> > It dangerously uses the same variable names for global irq numbers and
> > numbers local to one ics unit.  It used to have at least one bug
> > caused by confusing the two, which I'm not sure if you've also copied.
> 
> Just had a look at the diffs between this and hw/xics.c from the master branch
> in your qemu-impreza.git (which I based the kvmtool stuff on) and I can't see
> anything standing out.
> 
> Is there a particular commit/patch/variable name you have in mind that I can
> search for?

Sorry, my mistake, I was looking in the wrong place.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson


[PATCH v2] KVM: PPC: booke: Add booke206 TLB trace

2011-12-20 Thread Scott Wood
From: Liu Yu 

The existing kvm_stlb_write/kvm_gtlb_write were a poor match for
the e500/book3e MMU -- mas1 was passed as "tid", mas2 was limited
to "unsigned int" which will be a problem on 64-bit, mas3/7 got
split up rather than treated as a single 64-bit word, etc.

Signed-off-by: Liu Yu 
[scottw...@freescale.com: made mas2 64-bit, and added mas8 init]
Signed-off-by: Scott Wood 
---
v2: expanded commit message
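
A small userspace sketch of the two width issues the commit message names, assuming only what the diff itself shows: MAS7 is the upper 32 bits of the combined `mas7_3` word and MAS3 the lower 32 bits. The `toy_` helper names are made up for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Combine MAS7 (upper half) and MAS3 (lower half) into one 64-bit word. */
static inline uint64_t toy_make_mas7_3(uint32_t mas7, uint32_t mas3)
{
    return ((uint64_t)mas7 << 32) | mas3;
}

/* Lower 32 bits: the MAS3 part. */
static inline uint32_t toy_mas3(uint64_t mas7_3)
{
    return (uint32_t)mas7_3;
}

/* Upper 32 bits: the MAS7 part. */
static inline uint32_t toy_mas7(uint64_t mas7_3)
{
    return (uint32_t)(mas7_3 >> 32);
}

/* What a 32-bit trace field does to a 64-bit MAS2 value: high bits are lost. */
static inline uint32_t toy_narrow_mas2(uint64_t mas2)
{
    return (uint32_t)mas2;
}
```

Tracing the combined word as a single 64-bit field avoids having to reassemble the two halves when reading the trace output, and a 64-bit mas2 field avoids the truncation shown by `toy_narrow_mas2()`.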

 arch/powerpc/kvm/e500_tlb.c |   10 ---
 arch/powerpc/kvm/trace.h|   57 +++
 2 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 1746e67..6e53e41 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -294,6 +294,9 @@ static inline void __write_host_tlbe(struct kvm_book3e_206_tlb_entry *stlbe,
mtspr(SPRN_MAS7, (u32)(stlbe->mas7_3 >> 32));
asm volatile("isync; tlbwe" : : : "memory");
local_irq_restore(flags);
+
+   trace_kvm_booke206_stlb_write(mas0, stlbe->mas8, stlbe->mas1,
+ stlbe->mas2, stlbe->mas7_3);
 }
 
 /*
@@ -332,8 +335,6 @@ static inline void write_host_tlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
  MAS0_TLBSEL(1) |
  MAS0_ESEL(to_htlb1_esel(sesel)));
}
-   trace_kvm_stlb_write(index_of(tlbsel, esel), stlbe->mas1, stlbe->mas2,
-(u32)stlbe->mas7_3, (u32)(stlbe->mas7_3 >> 32));
 }
 
 void kvmppc_map_magic(struct kvm_vcpu *vcpu)
@@ -355,6 +356,7 @@ void kvmppc_map_magic(struct kvm_vcpu *vcpu)
magic.mas2 = vcpu->arch.magic_page_ea | MAS2_M;
magic.mas7_3 = ((u64)pfn << PAGE_SHIFT) |
   MAS3_SW | MAS3_SR | MAS3_UW | MAS3_UR;
+   magic.mas8 = 0;
 
__write_host_tlbe(&magic, MAS0_TLBSEL(1) | MAS0_ESEL(tlbcam_index));
preempt_enable();
@@ -954,8 +956,8 @@ int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
gtlbe->mas2 = vcpu->arch.shared->mas2;
gtlbe->mas7_3 = vcpu->arch.shared->mas7_3;
 
-   trace_kvm_gtlb_write(vcpu->arch.shared->mas0, gtlbe->mas1, gtlbe->mas2,
-(u32)gtlbe->mas7_3, (u32)(gtlbe->mas7_3 >> 32));
+   trace_kvm_booke206_gtlb_write(vcpu->arch.shared->mas0, gtlbe->mas1,
+ gtlbe->mas2, gtlbe->mas7_3);
 
/* Invalidate shadow mappings for the about-to-be-clobbered TLBE. */
if (tlbe_is_host_safe(vcpu, gtlbe)) {
diff --git a/arch/powerpc/kvm/trace.h b/arch/powerpc/kvm/trace.h
index 609d8bf..877186b 100644
--- a/arch/powerpc/kvm/trace.h
+++ b/arch/powerpc/kvm/trace.h
@@ -340,6 +340,63 @@ TRACE_EVENT(kvm_book3s_slbmte,
 
 #endif /* CONFIG_PPC_BOOK3S */
 
+
+/*
+ * Book3E trace points   *
+ */
+
+#ifdef CONFIG_BOOKE
+
+TRACE_EVENT(kvm_booke206_stlb_write,
+   TP_PROTO(__u32 mas0, __u32 mas8, __u32 mas1, __u64 mas2, __u64 mas7_3),
+   TP_ARGS(mas0, mas8, mas1, mas2, mas7_3),
+
+   TP_STRUCT__entry(
+   __field(__u32,  mas0)
+   __field(__u32,  mas8)
+   __field(__u32,  mas1)
+   __field(__u64,  mas2)
+   __field(__u64,  mas7_3  )
+   ),
+
+   TP_fast_assign(
+   __entry->mas0   = mas0;
+   __entry->mas8   = mas8;
+   __entry->mas1   = mas1;
+   __entry->mas2   = mas2;
+   __entry->mas7_3 = mas7_3;
+   ),
+
+   TP_printk("mas0=%x mas8=%x mas1=%x mas2=%llx mas7_3=%llx",
+   __entry->mas0, __entry->mas8, __entry->mas1,
+   __entry->mas2, __entry->mas7_3)
+);
+
+TRACE_EVENT(kvm_booke206_gtlb_write,
+   TP_PROTO(__u32 mas0, __u32 mas1, __u64 mas2, __u64 mas7_3),
+   TP_ARGS(mas0, mas1, mas2, mas7_3),
+
+   TP_STRUCT__entry(
+   __field(__u32,  mas0)
+   __field(__u32,  mas1)
+   __field(__u64,  mas2)
+   __field(__u64,  mas7_3  )
+   ),
+
+   TP_fast_assign(
+   __entry->mas0   = mas0;
+   __entry->mas1   = mas1;
+   __entry->mas2   = mas2;
+   __entry->mas7_3 = mas7_3;
+   ),
+
+   TP_printk("mas0=%x mas1=%x mas2=%llx mas7_3=%llx",
+   __entry->mas0, __entry->mas1,
+   __entry->mas2, __entry->mas7_3)
+);
+
+#endif
+
 #endif /* _TRACE_KVM_H */
 
 /* This part must be outside protection */
-- 
1.7.7.rc3.4.g8d714


[PATCH] KVM: PPC: e500: include linux/export.h

2011-12-20 Thread Scott Wood
This is required for THIS_MODULE.  We recently stopped acquiring
it via some other header.

Signed-off-by: Scott Wood 
---
 arch/powerpc/kvm/e500.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 0910104..709d82f 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <linux/export.h>
 
 #include 
 #include 
-- 
1.7.7.rc3.4.g8d714



[PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO

2011-12-20 Thread Minchan Kim
This patch series is a follow-up to Christoph Hellwig's work
[RFC: ->make_request support for virtio-blk].
http://thread.gmane.org/gmane.linux.kernel/1199763

Quote from hch
"This patchset allows the virtio-blk driver to support much higher IOP
rates which can be driven out of modern PCI-e flash devices.  At this
point it really is just a RFC due to various issues."

I fixed a race bug and added batch I/O to improve sequential I/O, plus
FLUSH/FUA emulation.

I tested this series on a Fusion-io device with aio-stress.
The results are as follows.

Benchmark : aio-stress (64 threads, test file size 512M, 8K per I/O, O_DIRECT write)
Environment: 8 socket - 8 core, 2533.372 MHz, Fusion IO 320G storage
Test repeated 20 times
Guest I/O scheduler : CFQ
Host I/O scheduler : NOOP

        Request             BIO (patch 1-4)     BIO-batch (patch 1-6)
        (MB/s)   stddev     (MB/s)   stddev     (MB/s)   stddev
w       737.820   4.063     613.735  31.605     730.288  24.854
rw      208.754  20.450     314.630  37.352     317.831  41.719
r       770.974   2.340     347.483  51.370     750.324   8.280
rr      250.391  16.910     350.053  29.986     325.976  24.846

This series improves random I/O performance compared to the request-based
I/O path. It is still an RFC, so comments and review are welcome.

Christoph Hellwig (3):
  block: add bio_map_sg
  virtio: support unlocked queue kick
  virtio-blk: remove the unused list of pending requests

Minchan Kim (3):
  virtio-blk: implement ->make_request
  virtio-blk: Support batch I/O for enhancing sequential IO
  virtio-blk: Emulate Flush/FUA

 block/blk-merge.c|   63 
 drivers/block/virtio_blk.c   |  690 ++
 drivers/virtio/virtio_ring.c |   33 ++-
 include/linux/blkdev.h   |2 +
 include/linux/virtio.h   |   21 ++
 5 files changed, 737 insertions(+), 72 deletions(-)

-- 
1.7.6.4



[PATCH 1/6] block: add bio_map_sg

2011-12-20 Thread Minchan Kim
From: Christoph Hellwig 

Add a helper to map a bio to a scatterlist, modelled after blk_rq_map_sg.
This helper is useful for any driver that wants to create a scatterlist
from its ->make_request method.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Minchan Kim 
---
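
A rough userspace sketch of the merging idea behind this helper, under simplifying assumptions: segments are plain (address, length) pairs, and two neighbours are merged only when they are physically contiguous and the merged length stays within a maximum segment size. The names `struct toy_seg` and `toy_map_segments()` are hypothetical, and the queue boundary check and scatterlist end marking done by the real function are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one bio_vec or scatterlist entry. */
struct toy_seg {
    uint64_t addr;  /* start address of the segment */
    size_t len;     /* length in bytes */
};

/*
 * Copy n input segments to out[], merging each one into the previous
 * output entry when it starts exactly where that entry ends and the
 * combined length does not exceed max_seg. Returns the number of
 * output entries used; out[] must have room for n entries.
 */
static size_t toy_map_segments(const struct toy_seg *in, size_t n,
                               struct toy_seg *out, size_t max_seg)
{
    size_t nsegs = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (nsegs > 0) {
            struct toy_seg *last = &out[nsegs - 1];

            if (last->addr + last->len == in[i].addr &&
                last->len + in[i].len <= max_seg) {
                last->len += in[i].len;  /* extend the current entry */
                continue;
            }
        }
        out[nsegs++] = in[i];            /* start a new entry */
    }
    return nsegs;
}
```

As in the patch, the caller is responsible for sizing the output array; the worst case is one output entry per input segment.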
 block/blk-merge.c  |   63 
 include/linux/blkdev.h |2 +
 2 files changed, 65 insertions(+), 0 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index cfcc37c..a8ac944 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -199,6 +199,69 @@ new_segment:
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
 
+/*
+ * map a bio to a scatterlist, return number of sg entries setup. Caller
+ * must make sure sg can hold bio->bi_phys_segments entries
+ */
+int bio_map_sg(struct request_queue *q, struct bio *bio,
+ struct scatterlist *sglist)
+{
+   struct bio_vec *bvec, *bvprv;
+   struct scatterlist *sg;
+   int nsegs, cluster;
+   unsigned long i;
+
+   nsegs = 0;
+   cluster = blk_queue_cluster(q);
+
+   bvprv = NULL;
+   sg = NULL;
+   bio_for_each_segment(bvec, bio, i) {
+   int nbytes = bvec->bv_len;
+
+   if (bvprv && cluster) {
+   if (sg->length + nbytes > queue_max_segment_size(q))
+   goto new_segment;
+
+   if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
+   goto new_segment;
+   if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
+   goto new_segment;
+
+   sg->length += nbytes;
+   } else {
+new_segment:
+   if (!sg)
+   sg = sglist;
+   else {
+   /*
+* If the driver previously mapped a shorter
+* list, we could see a termination bit
+* prematurely unless it fully inits the sg
+* table on each mapping. We KNOW that there
+* must be more entries here or the driver
+* would be buggy, so force clear the
+* termination bit to avoid doing a full
+* sg_init_table() in drivers for each command.
+*/
+   sg->page_link &= ~0x02;
+   sg = sg_next(sg);
+   }
+
+   sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
+   nsegs++;
+   }
+   bvprv = bvec;
+   } /* segments in bio */
+
+   if (sg)
+   sg_mark_end(sg);
+
+   BUG_ON(bio->bi_phys_segments && nsegs > bio->bi_phys_segments);
+   return nsegs;
+}
+EXPORT_SYMBOL(bio_map_sg);
+
 static inline int ll_new_hw_segment(struct request_queue *q,
struct request *req,
struct bio *bio)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 94acd81..7ad8e89 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -853,6 +853,8 @@ extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
+extern int bio_map_sg(struct request_queue *q, struct bio *bio,
+   struct scatterlist *sglist);
 extern void blk_dump_rq_flags(struct request *, char *);
 extern long nr_blockdev_pages(void);
 
-- 
1.7.6.4



[PATCH 3/6] virtio-blk: remove the unused list of pending requests

2011-12-20 Thread Minchan Kim
From: Christoph Hellwig 

Signed-off-by: Christoph Hellwig 
Signed-off-by: Minchan Kim 
---
 drivers/block/virtio_blk.c |   10 --
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4d0b70a..26d4443 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -28,9 +28,6 @@ struct virtio_blk
/* The disk structure for the kernel. */
struct gendisk *disk;
 
-   /* Request tracking. */
-   struct list_head reqs;
-
mempool_t *pool;
 
/* Process context for config space updates */
@@ -48,7 +45,6 @@ struct virtio_blk
 
 struct virtblk_req
 {
-   struct list_head list;
struct request *req;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
@@ -92,7 +88,6 @@ static void blk_done(struct virtqueue *vq)
}
 
__blk_end_request_all(vbr->req, error);
-   list_del(&vbr->list);
mempool_free(vbr, vblk->pool);
}
/* In case queue is stopped waiting for more buffers. */
@@ -177,7 +172,6 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
return false;
}
 
-   list_add_tail(&vbr->list, &vblk->reqs);
return true;
 }
 
@@ -383,7 +377,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
goto out_free_index;
}
 
-   INIT_LIST_HEAD(&vblk->reqs);
spin_lock_init(&vblk->lock);
vblk->vdev = vdev;
vblk->sg_elems = sg_elems;
@@ -544,9 +537,6 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
 
flush_work(&vblk->config_work);
 
-   /* Nothing should be pending. */
-   BUG_ON(!list_empty(&vblk->reqs));
-
/* Stop all the virtqueues. */
vdev->config->reset(vdev);
 
-- 
1.7.6.4



[PATCH 4/6] virtio-blk: implement ->make_request

2011-12-20 Thread Minchan Kim
Change the I/O path from request-based to bio-based for virtio-blk.

This is required for high-IOPS devices, which get slowed down to 1/5th of
the native speed by all the locking, memory allocation and other overhead
in the request-based I/O path.

The request-based I/O path is still supported, but only for SCSI ioctls,
not for file system I/O.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Minchan Kim 
---
 drivers/block/virtio_blk.c |  303 
 1 files changed, 247 insertions(+), 56 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 26d4443..4e476d6 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -12,6 +12,7 @@
 #include 
 
 #define PART_BITS 4
+static int use_make_request = 1;
 
 static int major;
 static DEFINE_IDA(vd_index_ida);
@@ -24,6 +25,7 @@ struct virtio_blk
 
struct virtio_device *vdev;
struct virtqueue *vq;
+   wait_queue_head_t queue_wait;
 
/* The disk structure for the kernel. */
struct gendisk *disk;
@@ -38,61 +40,124 @@ struct virtio_blk
 
/* Ida index - used to track minor number allocations. */
int index;
-
-   /* Scatterlist: can be too big for stack. */
-   struct scatterlist sg[/*sg_elems*/];
 };
 
 struct virtblk_req
 {
-   struct request *req;
+   void *private;
+   struct virtblk_req *next;
+
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
+   u8 kind;
+#define VIRTIO_BLK_REQUEST 0x00
+#define VIRTIO_BLK_BIO 0x01
u8 status;
+
+   struct scatterlist sg[];
 };
 
+static struct virtblk_req *alloc_virtblk_req(struct virtio_blk *vblk,
+   gfp_t gfp_mask)
+{
+   struct virtblk_req *vbr;
+
+   vbr = mempool_alloc(vblk->pool, gfp_mask);
+   if (vbr)
+   sg_init_table(vbr->sg, vblk->sg_elems);
+
+   return vbr;
+}
+
+static inline int virtblk_result(struct virtblk_req *vbr)
+{
+   switch (vbr->status) {
+   case VIRTIO_BLK_S_OK:
+   return 0;
+   case VIRTIO_BLK_S_UNSUPP:
+   return -ENOTTY;
+   default:
+   return -EIO;
+   }
+}
+
+static void virtblk_request_done(struct virtio_blk *vblk,
+   struct virtblk_req *vbr)
+{
+   struct request *req = vbr->private;
+   int error = virtblk_result(vbr);
+
+   if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
+   req->resid_len = vbr->in_hdr.residual;
+   req->sense_len = vbr->in_hdr.sense_len;
+   req->errors = vbr->in_hdr.errors;
+   }
+   else if (req->cmd_type == REQ_TYPE_SPECIAL) {
+   printk("REQ_TYPE_SPECIAL done\n");
+   req->errors = (error != 0);
+   }
+
+   __blk_end_request_all(req, error);
+   mempool_free(vbr, vblk->pool);
+}
+
+static void virtblk_bio_done(struct virtio_blk *vblk,
+   struct virtblk_req *vbr)
+{
+   bio_endio(vbr->private, virtblk_result(vbr));
+   mempool_free(vbr, vblk->pool);
+}
+
 static void blk_done(struct virtqueue *vq)
 {
struct virtio_blk *vblk = vq->vdev->priv;
-   struct virtblk_req *vbr;
+   struct virtblk_req *vbr, *head = NULL, *tail = NULL;
unsigned int len;
unsigned long flags;
 
spin_lock_irqsave(&vblk->lock, flags);
while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
-   int error;
-
-   switch (vbr->status) {
-   case VIRTIO_BLK_S_OK:
-   error = 0;
+   switch (vbr->kind) {
+   case VIRTIO_BLK_REQUEST:
+   virtblk_request_done(vblk, vbr);
+   /*
+* In case queue is stopped waiting
+* for more buffers.
+*/
+   blk_start_queue(vblk->disk->queue);
break;
-   case VIRTIO_BLK_S_UNSUPP:
-   error = -ENOTTY;
+   case VIRTIO_BLK_BIO:
+   if (head) {
+   tail->next = vbr;
+   tail = vbr;
+   } else {
+   tail = head = vbr;
+   }
break;
default:
-   error = -EIO;
-   break;
+   BUG();
}
 
-   switch (vbr->req->cmd_type) {
-   case REQ_TYPE_BLOCK_PC:
-   vbr->req->resid_len = vbr->in_hdr.residual;
-   vbr->req->sense_len = vbr->in_hdr.sense_len;
-   vbr->req->errors = vbr->in_hdr.errors;
-   break;
-   case REQ_TYPE_SPECIAL:
-   vbr->req->errors = (error != 0);
+   }
+
+   spin_unlock_irqrestore(&vblk->lock, flags);
+   wake_up

[PATCH 2/6] virtio: support unlocked queue kick

2011-12-20 Thread Minchan Kim
From: Christoph Hellwig 

Split virtqueue_kick to be able to do the actual notification outside the
lock protecting the virtqueue.  This patch was originally done by
Stefan Hajnoczi, but I can't find the original one anymore and had to
recreate it from memory.  Pointers to the original or corrections for
the commit message are welcome.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Minchan Kim 
---
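
A toy model of the calling pattern this split enables: decide under the lock whether a notification is needed, then perform the notification after the lock is dropped. All names here (`struct toy_vq`, `toy_lock()`, `toy_submit()` and so on) are hypothetical, and the "lock" is just a flag checked with assertions so the example stays single-threaded; a driver would use its real spinlock with virtqueue_kick_prepare() and virtqueue_notify().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy queue: 'locked' models the driver lock protecting the queue. */
struct toy_vq {
    bool locked;
    unsigned int added;      /* buffers added since the last kick */
    bool suppress_notify;    /* stands in for the other side's "no notify" hint */
    unsigned int notified;   /* how many notifications were sent */
};

static void toy_lock(struct toy_vq *vq)
{
    assert(!vq->locked);
    vq->locked = true;
}

static void toy_unlock(struct toy_vq *vq)
{
    assert(vq->locked);
    vq->locked = false;
}

/* First half: must run under the lock; only decides whether to notify. */
static bool toy_kick_prepare(struct toy_vq *vq)
{
    bool needs_kick;

    assert(vq->locked);
    needs_kick = vq->added > 0 && !vq->suppress_notify;
    vq->added = 0;
    return needs_kick;
}

/* Second half: the potentially slow notification, done without the lock. */
static void toy_notify(struct toy_vq *vq)
{
    assert(!vq->locked);
    vq->notified++;
}

/* Submit one buffer using the split kick. */
static void toy_submit(struct toy_vq *vq)
{
    bool needs_kick;

    toy_lock(vq);
    vq->added++;
    needs_kick = toy_kick_prepare(vq);
    toy_unlock(vq);

    if (needs_kick)
        toy_notify(vq);
}
```

The point of the pattern is that the notification, which is the expensive step, no longer extends the time the queue lock is held.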
 drivers/virtio/virtio_ring.c |   33 ++---
 include/linux/virtio.h   |   21 +
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index c7a2c20..c5f0458 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -238,10 +238,12 @@ add_head:
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
 
-void virtqueue_kick(struct virtqueue *_vq)
+bool virtqueue_kick_prepare(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
u16 new, old;
+   bool needs_kick;
+
START_USE(vq);
/* Descriptors and available array need to be set before we expose the
 * new available array entries. */
@@ -254,13 +256,30 @@ void virtqueue_kick(struct virtqueue *_vq)
/* Need to update avail index before checking if we should notify */
virtio_mb();
 
-   if (vq->event ?
-   vring_need_event(vring_avail_event(&vq->vring), new, old) :
-   !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
-   /* Prod other side to tell it about changes. */
-   vq->notify(&vq->vq);
-
+   if (vq->event) {
+   needs_kick = vring_need_event(vring_avail_event(&vq->vring),
+ new, old);
+   } else {
+   needs_kick = (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY));
+   }
END_USE(vq);
+   return needs_kick;
+}
+EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
+
+void virtqueue_notify(struct virtqueue *_vq)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+
+   /* Prod other side to tell it about changes. */
+   vq->notify(_vq);
+}
+EXPORT_SYMBOL_GPL(virtqueue_notify);
+
+void virtqueue_kick(struct virtqueue *vq)
+{
+   if (virtqueue_kick_prepare(vq))
+   virtqueue_notify(vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_kick);
 
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 4c069d8..722a35d 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -90,6 +90,27 @@ static inline int virtqueue_add_buf(struct virtqueue *vq,
 
 void virtqueue_kick(struct virtqueue *vq);
 
+/**
+ * virtqueue_kick_prepare - first half of split virtqueue_kick call.
+ * @vq: the struct virtqueue
+ *
+ * Instead of virtqueue_kick(), you can do:
+ * if (virtqueue_kick_prepare(vq))
+ * virtqueue_notify(vq);
+ *
+ * This is sometimes useful because the virtqueue_kick_prepare() needs
+ * to be serialized, but the actual virtqueue_notify() call does not.
+ */
+bool virtqueue_kick_prepare(struct virtqueue *vq);
+
+/**
+ * virtqueue_notify - second half of split virtqueue_kick call.
+ * @vq: the struct virtqueue
+ *
+ * This does not need to be serialized.
+ */
+void virtqueue_notify(struct virtqueue *vq);
+
 void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
 
 void virtqueue_disable_cb(struct virtqueue *vq);
-- 
1.7.6.4
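
For readers following the series, here is a minimal userspace sketch of the
calling pattern the split is meant to allow: decide under the lock whether a
notification is needed, then issue it after the lock is dropped. Every
toy_* name below is hypothetical and only models the idea; none of it is the
kernel's virtqueue implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in; not the kernel's struct virtqueue. */
struct toy_queue {
	bool locked;             /* models the driver lock being held */
	int unexposed;           /* buffers added since the last prepare */
	bool host_wants_notify;  /* models !(flags & VRING_USED_F_NO_NOTIFY) */
	int notifications;       /* times the other side was prodded */
	int notified_locked;     /* notifications issued while locked */
};

static void toy_lock(struct toy_queue *q)
{
	assert(!q->locked);
	q->locked = true;
}

static void toy_unlock(struct toy_queue *q)
{
	assert(q->locked);
	q->locked = false;
}

/* First half: reads and updates shared queue state, so the lock must be held. */
static bool toy_kick_prepare(struct toy_queue *q)
{
	bool needs_kick;

	assert(q->locked);
	needs_kick = q->unexposed > 0 && q->host_wants_notify;
	q->unexposed = 0;
	return needs_kick;
}

/*
 * Second half: in a real driver this is the potentially slow host
 * notification. Here it only bumps counters so the caller can be checked.
 */
static void toy_notify(struct toy_queue *q)
{
	q->notifications++;
	if (q->locked)
		q->notified_locked++;
}

/* Add buffers and kick, keeping the notification outside the lock. */
static bool toy_submit(struct toy_queue *q, int nbufs)
{
	bool notify;

	toy_lock(q);
	q->unexposed += nbufs;
	notify = toy_kick_prepare(q);
	toy_unlock(q);

	if (notify)
		toy_notify(q);
	return notify;
}
```

The point of the pattern is that the lock hold time no longer includes the
notification itself, which matters when the notification traps to the host.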



[PATCH 5/6] virtio-blk: Support batch I/O for enhancing sequential IO

2011-12-20 Thread Minchan Kim
The BIO-based path has a disadvantage: it handles sequential streams
poorly because it cannot merge BIOs, whereas the request-based path can.

This patch adds a per-cpu BIO queue for batch I/O.
If a request is contiguous with the previous one, it is merged with the
previous one on the batch queue.
If a non-contiguous I/O is issued or 1ms passes, the batch queue is drained.

Signed-off-by: Minchan Kim 
---
 drivers/block/virtio_blk.c |  366 +++-
 1 files changed, 331 insertions(+), 35 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4e476d6..e32c69e 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -19,6 +19,28 @@ static DEFINE_IDA(vd_index_ida);
 
 struct workqueue_struct *virtblk_wq;
 
+#define BIO_QUEUE_MAX  32
+
+struct per_cpu_bio
+{
+   struct bio *bios[BIO_QUEUE_MAX];
+   int idx;/* current index */
+   struct virtio_blk *vblk;
+   struct request_queue *q;
+   struct delayed_work dwork;
+   unsigned int segments;  /* the number of accumulated segments */
+   bool seq_mode;  /* sequential mode */
+   sector_t next_offset;   /*
+* next expected sector offset
+* for becoming sequential mode
+*/
+};
+
+struct bio_queue
+{
+   struct per_cpu_bio __percpu *pcbio;
+};
+
 struct virtio_blk
 {
spinlock_t lock;
@@ -38,6 +60,9 @@ struct virtio_blk
/* What host tells us, plus 2 for header & tailer. */
unsigned int sg_elems;
 
+   /* bio queue for batch IO */
+   struct bio_queue bq;
+
/* Ida index - used to track minor number allocations. */
int index;
 };
@@ -57,6 +82,8 @@ struct virtblk_req
struct scatterlist sg[];
 };
 
+static void wait_virtq_flush(struct virtio_blk *vblk);
+
 static struct virtblk_req *alloc_virtblk_req(struct virtio_blk *vblk,
gfp_t gfp_mask)
 {
@@ -93,7 +120,6 @@ static void virtblk_request_done(struct virtio_blk *vblk,
req->errors = vbr->in_hdr.errors;
}
else if (req->cmd_type == REQ_TYPE_SPECIAL) {
-   printk("REQ_TYPE_SPECIAL done\n");
req->errors = (error != 0);
}
 
@@ -104,7 +130,15 @@ static void virtblk_request_done(struct virtio_blk *vblk,
 static void virtblk_bio_done(struct virtio_blk *vblk,
struct virtblk_req *vbr)
 {
-   bio_endio(vbr->private, virtblk_result(vbr));
+   struct bio *bio;
+   bio = vbr->private;
+
+   while(bio) {
+   struct bio *free_bio = bio;
+   bio = bio->bi_next;
+   bio_endio(free_bio, virtblk_result(vbr));
+   }
+
mempool_free(vbr, vblk->pool);
 }
 
@@ -298,52 +332,220 @@ static bool virtblk_plugged(struct virtio_blk *vblk)
return true;
 }
 
-static void virtblk_add_buf_wait(struct virtio_blk *vblk,
-   struct virtblk_req *vbr, unsigned long out, unsigned long in)
+bool seq_bio(struct bio *bio, struct per_cpu_bio __percpu *pcbio)
 {
-   DEFINE_WAIT(wait);
-   bool retry, notify;
+   struct bio *last_bio;
+   int index = pcbio->idx - 1;
 
-   for (;;) {
-   prepare_to_wait(&vblk->queue_wait, &wait,
-   TASK_UNINTERRUPTIBLE);
+   BUG_ON(index < 0 || index > BIO_QUEUE_MAX);
+   last_bio = pcbio->bios[index];
+
+   if (last_bio->bi_rw != bio->bi_rw)
+   return false;
+
+   if ((last_bio->bi_sector + (last_bio->bi_size >> 9)) ==
+   bio->bi_sector)
+   return true;
+
+   return false;
+}
+
+int add_pcbio_to_vq(struct per_cpu_bio __percpu *pcbio,
+   struct virtio_blk *vblk, struct request_queue *q,
+   int *notify)
+{
+   int i;
+   unsigned long num = 0, out = 0, in = 0;
+   bool retry;
+   struct virtblk_req *vbr;
+   struct bio *bio;
+
+   vbr = alloc_virtblk_req(vblk, GFP_ATOMIC);
+   if (!vbr)
+   return 1;
+
+   vbr->private = NULL;
+   vbr->next = NULL;
+   vbr->kind = VIRTIO_BLK_BIO;
+
+   bio = pcbio->bios[0];
+   BUG_ON(!bio);
+
+   vbr->out_hdr.type = 0;
+   vbr->out_hdr.sector = bio->bi_sector;
+   vbr->out_hdr.ioprio = bio_prio(bio);
+
+   sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
 
-   spin_lock_irq(&vblk->lock);
-   if (virtqueue_add_buf(vblk->vq, vbr->sg,
-   out, in, vbr) < 0) {
-   retry = true;
+   for ( i = 0; i < pcbio->idx; i++) {
+   struct bio *prev;
+   bio = pcbio->bios[i];
+
+   BUG_ON(!bio);
+   num += bio_map_sg(q, bio, vbr->sg + out + num);
+   BUG_ON(num > (vblk->sg_elems - 2));
+
+   prev = vbr->private;
+

[PATCH 6/6] virtio-blk: Emulate Flush/FUA

2011-12-20 Thread Minchan Kim
This patch emulates flush/FUA on virtio-blk and passes xfstests on ext4.
It still needs more review.

Signed-off-by: Minchan Kim 
---
 drivers/block/virtio_blk.c |   89 ++-
 1 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index e32c69e..6721b9d 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -12,7 +12,6 @@
 #include 
 
 #define PART_BITS 4
-static int use_make_request = 1;
 
 static int major;
 static DEFINE_IDA(vd_index_ida);
@@ -77,6 +76,7 @@ struct virtblk_req
u8 kind;
 #define VIRTIO_BLK_REQUEST 0x00
 #define VIRTIO_BLK_BIO 0x01
+#define VIRTIO_BLK_BIO_FLUSH   0x02
u8 status;
 
struct scatterlist sg[];
@@ -160,6 +160,9 @@ static void blk_done(struct virtqueue *vq)
 */
blk_start_queue(vblk->disk->queue);
break;
+   case VIRTIO_BLK_BIO_FLUSH:
+   complete(vbr->private);
+   break;
case VIRTIO_BLK_BIO:
if (head) {
tail->next = vbr;
@@ -526,6 +529,59 @@ static void virtblk_add_buf_wait(struct virtio_blk *vblk,
 finish_wait(&vblk->queue_wait, &wait);
 }
 
+static int virtblk_flush(struct virtio_blk *vblk,
+   struct virtblk_req *vbr, struct bio *bio)
+{
+   int error;
+   bool retry, notify;
+   DECLARE_COMPLETION_ONSTACK(done);
+
+   vbr->private = &done;
+   vbr->next = NULL;
+   vbr->kind = VIRTIO_BLK_BIO_FLUSH;
+
+   vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
+   vbr->out_hdr.sector = 0;
+   if (bio)
+   vbr->out_hdr.ioprio = bio_prio(bio);
+   else
+   vbr->out_hdr.ioprio = 0;
+
+   sg_set_buf(&vbr->sg[0], &vbr->out_hdr, sizeof(vbr->out_hdr));
+   sg_set_buf(&vbr->sg[1], &vbr->status, sizeof(vbr->status));
+
+   spin_lock_irq(&vblk->lock);
+   if (virtqueue_add_buf(vblk->vq, vbr->sg, 1, 1, vbr) < 0) {
+   retry = true;
+   } else {
+   retry = false;
+   }
+
+   notify = virtqueue_kick_prepare(vblk->vq);
+   spin_unlock_irq(&vblk->lock);
+
+   if (notify && !virtblk_plugged(vblk))
+   virtqueue_notify(vblk->vq);
+
+   if (retry)
+   virtblk_add_buf_wait(vblk, vbr, 1, 1);
+
+   wait_for_completion(&done);
+   error = virtblk_result(vbr);
+   return error;
+}
+
+void bq_flush(struct bio_queue *bq)
+{
+   int cpu;
+   for_each_possible_cpu(cpu) {
+   struct per_cpu_bio __percpu *pcbio = per_cpu_ptr(bq->pcbio, cpu);
+   queue_work_on(cpu,
+   virtblk_wq, &pcbio->dwork.work);
+   flush_work_sync(&pcbio->dwork.work);
+   }
+}
+
 bool full_segment(struct per_cpu_bio __percpu *pcbio, struct bio *bio,
unsigned int max)
 {
@@ -616,9 +672,36 @@ static void virtblk_make_request(struct request_queue *q, struct bio *bio)
 {
struct virtio_blk *vblk = q->queuedata;
struct per_cpu_bio __percpu *pcbio;
+   bool pre_flush, post_flush;
 
BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
-   BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
+
+   pre_flush = bio->bi_rw & REQ_FLUSH;
+   post_flush = bio->bi_rw & REQ_FUA;
+
+   if (pre_flush) {
+   struct virtblk_req *dummy_vbr;
+   bq_flush(&vblk->bq);
+
+   dummy_vbr = alloc_virtblk_req(vblk, GFP_NOIO);
+   virtblk_flush(vblk, dummy_vbr, NULL);
+   mempool_free(dummy_vbr, vblk->pool);
+
+   if (bio->bi_sector && post_flush) {
+   int error;
+   struct virtblk_req *vbr;
+   vbr = alloc_virtblk_req(vblk, GFP_NOIO);
+   error = virtblk_flush(vblk, vbr, bio);
+   mempool_free(vbr, vblk->pool);
+
+   dummy_vbr = alloc_virtblk_req(vblk, GFP_NOIO);
+   virtblk_flush(vblk, dummy_vbr, NULL);
+   mempool_free(dummy_vbr, vblk->pool);
+
+   bio_endio(bio, error);
+   return;
+   }
+   }
 retry:
preempt_disable();
pcbio = this_cpu_ptr(vblk->bq.pcbio);
@@ -918,7 +1001,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
vblk->index = index;
 
/* configure queue flush support */
-   if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH) && !use_make_request)
+   if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
blk_queue_flush(q, REQ_FLUSH);
 
/* If disk is read-only in the host, the guest should obey */
-- 
1.7.6.4
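
For reference, a sketch of the ordering that flush/FUA emulation
conventionally has to preserve: REQ_FLUSH asks for the volatile cache to be
flushed before the bio's data is written, and REQ_FUA asks for that data to
be on stable media before completion, which a driver without native FUA can
approximate with a flush after the write. This is not a transcription of the
control flow in the patch above; the toy_* names and the "drain batch" step
are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_REQ_FLUSH 0x1u  /* hypothetical flag values */
#define TOY_REQ_FUA   0x2u
#define TOY_MAX_STEPS 4

enum toy_step {
	TOY_DRAIN_BATCH,  /* push out any bios still sitting in batch queues */
	TOY_CACHE_FLUSH,  /* send a cache-flush command and wait for it */
	TOY_WRITE_DATA,   /* submit the bio's own payload */
};

/*
 * Fill steps[] with the operations to perform, in order, and return how
 * many were written.
 */
static size_t toy_plan_bio(unsigned int flags, bool has_data,
			   enum toy_step steps[TOY_MAX_STEPS])
{
	size_t n = 0;

	if (flags & TOY_REQ_FLUSH) {
		/* Earlier writes must reach the device before the flush. */
		steps[n++] = TOY_DRAIN_BATCH;
		steps[n++] = TOY_CACHE_FLUSH;
	}

	if (has_data) {
		steps[n++] = TOY_WRITE_DATA;
		/* Emulated FUA: make this write durable before completing. */
		if (flags & TOY_REQ_FUA)
			steps[n++] = TOY_CACHE_FLUSH;
	}

	assert(n <= TOY_MAX_STEPS);
	return n;
}
```

An empty flush bio yields only the drain and flush steps, while a
FLUSH|FUA write yields all four in the order drain, flush, write, flush.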


[RFC PATCH 00/16] KVM: PPC: e500mc support

2011-12-20 Thread Scott Wood
This is a preliminary patchset for e500mc KVM support, using
hardware virtualization support.  There's still some ugliness
that I need to tame, and it needs a bunch of testing -- but I wanted
to get something out for people to comment on and/or test.

CCing linuxppc-dev as well since some of the patches wander
outside of KVM-land.

Scott Wood (16):
  powerpc/booke: Set CPU_FTR_DEBUG_LVL_EXC on 32-bit
  powerpc/e500: split CPU_FTRS_ALWAYS/CPU_FTRS_POSSIBLE
  KVM: PPC: Use pt_regs in vcpu->arch
  KVM: PPC: factor out lpid allocator from book3s_64_mmu_hv
  KVM: PPC: booke: add booke-level vcpu load/put
  KVM: PPC: booke: Move vm core init/destroy out of booke.c
  KVM: PPC: e500: rename e500_tlb.h to e500.h
  KVM: PPC: e500: merge <asm/kvm_e500.h> into arch/powerpc/kvm/e500.h
  KVM: PPC: e500: clean up arch/powerpc/kvm/e500.h
  KVM: PPC: e500: refactor core-specific TLB code
  KVM: PPC: e500: Track TLB1 entries with a bitmap
  KVM: PPC: e500: emulate tlbilx
  powerpc/booke: Provide exception macros with interrupt name
  KVM: PPC: booke: category E.HV (GS-mode) support
  KVM: PPC: booke: standard PPC floating point support
  KVM: PPC: e500mc support

 arch/powerpc/include/asm/cputable.h |   21 +-
 arch/powerpc/include/asm/dbell.h|1 +
 arch/powerpc/include/asm/kvm.h  |1 +
 arch/powerpc/include/asm/kvm_asm.h  |8 +
 arch/powerpc/include/asm/kvm_book3s.h   |   31 +-
 arch/powerpc/include/asm/kvm_booke.h|   27 +-
 arch/powerpc/include/asm/kvm_booke_hv_asm.h |   49 +++
 arch/powerpc/include/asm/kvm_e500.h |   96 -
 arch/powerpc/include/asm/kvm_host.h |   30 +-
 arch/powerpc/include/asm/kvm_ppc.h  |8 +
 arch/powerpc/include/asm/mmu-book3e.h   |6 +
 arch/powerpc/include/asm/processor.h|3 +
 arch/powerpc/include/asm/reg.h  |2 +
 arch/powerpc/include/asm/reg_booke.h|   34 ++
 arch/powerpc/include/asm/system.h   |1 +
 arch/powerpc/kernel/asm-offsets.c   |   32 +-
 arch/powerpc/kernel/cpu_setup_fsl_booke.S   |1 +
 arch/powerpc/kernel/head_44x.S  |   23 +-
 arch/powerpc/kernel/head_booke.h|   69 ++-
 arch/powerpc/kernel/head_fsl_booke.S|   98 -
 arch/powerpc/kvm/44x.c  |   12 +
 arch/powerpc/kvm/Kconfig|   20 +-
 arch/powerpc/kvm/Makefile   |   11 +
 arch/powerpc/kvm/book3s_32_mmu.c|2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |   26 +-
 arch/powerpc/kvm/book3s_hv.c|9 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |   12 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |4 +-
 arch/powerpc/kvm/booke.c|  485 +--
 arch/powerpc/kvm/booke.h|   57 +++-
 arch/powerpc/kvm/booke_emulate.c|   25 +-
 arch/powerpc/kvm/bookehv_interrupts.S   |  587 ++
 arch/powerpc/kvm/e500.c |  372 ++---
 arch/powerpc/kvm/e500.h |  302 ++
 arch/powerpc/kvm/e500_emulate.c |   42 ++-
 arch/powerpc/kvm/e500_tlb.c |  590 +++
 arch/powerpc/kvm/e500_tlb.h |  174 
 arch/powerpc/kvm/e500mc.c   |  342 
 arch/powerpc/kvm/powerpc.c  |   45 ++-
 arch/powerpc/kvm/timing.h   |6 +
 40 files changed, 2727 insertions(+), 937 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_booke_hv_asm.h
 delete mode 100644 arch/powerpc/include/asm/kvm_e500.h
 create mode 100644 arch/powerpc/kvm/bookehv_interrupts.S
 create mode 100644 arch/powerpc/kvm/e500.h
 delete mode 100644 arch/powerpc/kvm/e500_tlb.h
 create mode 100644 arch/powerpc/kvm/e500mc.c

-- 
1.7.7.rc3.4.g8d714



[RFC PATCH 01/16] powerpc/booke: Set CPU_FTR_DEBUG_LVL_EXC on 32-bit

2011-12-20 Thread Scott Wood
Currently 32-bit only cares about this for choice of exception
vector, which is done in core-specific code.  However, KVM will
want to distinguish as well.

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/cputable.h |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index e30442c..033ad30 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -375,7 +375,8 @@ extern const char *powerpc_base_platform;
 #define CPU_FTRS_47X   (CPU_FTRS_440x6)
 #define CPU_FTRS_E200  (CPU_FTR_USE_TB | CPU_FTR_SPE_COMP | \
CPU_FTR_NODSISRALIGN | CPU_FTR_COHERENT_ICACHE | \
-   CPU_FTR_UNIFIED_ID_CACHE | CPU_FTR_NOEXECUTE)
+   CPU_FTR_UNIFIED_ID_CACHE | CPU_FTR_NOEXECUTE | \
+   CPU_FTR_DEBUG_LVL_EXC)
 #define CPU_FTRS_E500  (CPU_FTR_MAYBE_CAN_DOZE | CPU_FTR_USE_TB | \
CPU_FTR_SPE_COMP | CPU_FTR_MAYBE_CAN_NAP | CPU_FTR_NODSISRALIGN | \
CPU_FTR_NOEXECUTE)
@@ -384,7 +385,7 @@ extern const char *powerpc_base_platform;
CPU_FTR_NODSISRALIGN | CPU_FTR_NOEXECUTE)
 #define CPU_FTRS_E500MC(CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN | \
CPU_FTR_L2CSR | CPU_FTR_LWSYNC | CPU_FTR_NOEXECUTE | \
-   CPU_FTR_DBELL)
+   CPU_FTR_DBELL | CPU_FTR_DEBUG_LVL_EXC)
 #define CPU_FTRS_E5500 (CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN | \
CPU_FTR_L2CSR | CPU_FTR_LWSYNC | CPU_FTR_NOEXECUTE | \
CPU_FTR_DBELL | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
-- 
1.7.7.rc3.4.g8d714




[RFC PATCH 05/16] KVM: PPC: booke: add booke-level vcpu load/put

2011-12-20 Thread Scott Wood
This gives us a place to put load/put actions that correspond to
code that is booke-specific but not specific to a particular core.

Signed-off-by: Scott Wood 
---
 arch/powerpc/kvm/44x.c   |3 +++
 arch/powerpc/kvm/booke.c |8 
 arch/powerpc/kvm/booke.h |3 +++
 arch/powerpc/kvm/e500.c  |3 +++
 4 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/44x.c b/arch/powerpc/kvm/44x.c
index 7b612a7..879a1a7 100644
--- a/arch/powerpc/kvm/44x.c
+++ b/arch/powerpc/kvm/44x.c
@@ -29,15 +29,18 @@
 #include 
 
 #include "44x_tlb.h"
+#include "booke.h"
 
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+   kvmppc_booke_vcpu_load(vcpu, cpu);
kvmppc_44x_tlb_load(vcpu);
 }
 
 void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu)
 {
kvmppc_44x_tlb_put(vcpu);
+   kvmppc_booke_vcpu_put(vcpu);
 }
 
 int kvmppc_core_check_processor_compat(void)
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index a41287d..933e611 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -972,6 +972,14 @@ void kvmppc_decrementer_func(unsigned long data)
kvmppc_set_tsr_bits(vcpu, TSR_DIS);
 }
 
+void kvmppc_booke_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
+void kvmppc_booke_vcpu_put(struct kvm_vcpu *vcpu)
+{
+}
+
 int __init kvmppc_booke_init(void)
 {
unsigned long ivor[16];
diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
index 2fe2027..05d1d99 100644
--- a/arch/powerpc/kvm/booke.h
+++ b/arch/powerpc/kvm/booke.h
@@ -71,4 +71,7 @@ void kvmppc_save_guest_spe(struct kvm_vcpu *vcpu);
 /* high-level function, manages flags, host state */
 void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu);
 
+void kvmppc_booke_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void kvmppc_booke_vcpu_put(struct kvm_vcpu *vcpu);
+
 #endif /* __KVM_BOOKE_H__ */
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 709d82f..923f375 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -36,6 +36,7 @@ void kvmppc_core_load_guest_debugstate(struct kvm_vcpu *vcpu)
 
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+   kvmppc_booke_vcpu_load(vcpu, cpu);
kvmppc_e500_tlb_load(vcpu, cpu);
 }
 
@@ -47,6 +48,8 @@ void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu)
if (vcpu->arch.shadow_msr & MSR_SPE)
kvmppc_vcpu_disable_spe(vcpu);
 #endif
+
+   kvmppc_booke_vcpu_put(vcpu);
 }
 
 int kvmppc_core_check_processor_compat(void)
-- 
1.7.7.rc3.4.g8d714
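
A small sketch of the layering this patch sets up, as read from the 44x and
e500 hunks above: on load the booke-level hook runs before the core-specific
work, and on put the order is reversed. The toy_* functions are hypothetical
and only record the call order.

```c
#include <assert.h>
#include <string.h>

/* Call trace, appended to by every hook below. */
static char toy_trace[64];

static void toy_log(const char *s)
{
	assert(strlen(toy_trace) + strlen(s) < sizeof(toy_trace));
	strcat(toy_trace, s);
}

/* Generic booke-level hooks (empty in the patch, logging here). */
static void toy_booke_vcpu_load(void) { toy_log("booke_load;"); }
static void toy_booke_vcpu_put(void)  { toy_log("booke_put;"); }

/* Core-specific work, standing in for the TLB load/put calls. */
static void toy_core_tlb_load(void)   { toy_log("tlb_load;"); }
static void toy_core_tlb_put(void)    { toy_log("tlb_put;"); }

/* Load: generic layer first, then the core-specific layer. */
static void toy_core_vcpu_load(void)
{
	toy_booke_vcpu_load();
	toy_core_tlb_load();
}

/* Put: tear down in the reverse order of load. */
static void toy_core_vcpu_put(void)
{
	toy_core_tlb_put();
	toy_booke_vcpu_put();
}
```

Calling load followed by put leaves the trace nested like matching
brackets, which is the usual reason for unwinding in reverse order.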




[RFC PATCH 04/16] KVM: PPC: factor out lpid allocator from book3s_64_mmu_hv

2011-12-20 Thread Scott Wood
We'll use it on e500mc as well.

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/kvm_book3s.h |3 ++
 arch/powerpc/include/asm/kvm_booke.h  |3 ++
 arch/powerpc/include/asm/kvm_ppc.h|5 
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |   26 +---
 arch/powerpc/kvm/powerpc.c|   34 +
 5 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 60e069e..58c8bec 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -448,4 +448,7 @@ static inline bool kvmppc_critical_section(struct kvm_vcpu *vcpu)
 
 #define INS_DCBZ   0x7c0007ec
 
+/* LPIDs we support with this build -- runtime limit may be lower */
+#define KVMPPC_NR_LPIDS(LPID_RSVD + 1)
+
 #endif /* __ASM_KVM_BOOK3S_H__ */
diff --git a/arch/powerpc/include/asm/kvm_booke.h b/arch/powerpc/include/asm/kvm_booke.h
index e20c162..138118e 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -23,6 +23,9 @@
 #include 
 #include 
 
+/* LPIDs we support with this build -- runtime limit may be lower */
+#define KVMPPC_NR_LPIDS64
+
 static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, ulong val)
 {
vcpu->arch.regs.gpr[num] = val;
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a61b5b5..5524f88 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -202,4 +202,9 @@ int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
 struct kvm_dirty_tlb *cfg);
 
+long kvmppc_alloc_lpid(void);
+void kvmppc_claim_lpid(long lpid);
+void kvmppc_free_lpid(long lpid);
+void kvmppc_init_lpid(unsigned long nr_lpids);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 66d6452..45b6f0e 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -36,13 +36,11 @@
 
 /* POWER7 has 10-bit LPIDs, PPC970 has 6-bit LPIDs */
 #define MAX_LPID_970   63
-#define NR_LPIDS   (LPID_RSVD + 1)
-unsigned long lpid_inuse[BITS_TO_LONGS(NR_LPIDS)];
 
 long kvmppc_alloc_hpt(struct kvm *kvm)
 {
unsigned long hpt;
-   unsigned long lpid;
+   long lpid;
struct revmap_entry *rev;
 
/* Allocate guest's hashed page table */
@@ -62,14 +60,9 @@ long kvmppc_alloc_hpt(struct kvm *kvm)
}
kvm->arch.revmap = rev;
 
-   /* Allocate the guest's logical partition ID */
-   do {
-   lpid = find_first_zero_bit(lpid_inuse, NR_LPIDS);
-   if (lpid >= NR_LPIDS) {
-   pr_err("kvm_alloc_hpt: No LPIDs free\n");
-   goto out_freeboth;
-   }
-   } while (test_and_set_bit(lpid, lpid_inuse));
+   lpid = kvmppc_alloc_lpid();
+   if (lpid < 0)
+   goto out_freeboth;
 
kvm->arch.sdr1 = __pa(hpt) | (HPT_ORDER - 18);
kvm->arch.lpid = lpid;
@@ -86,7 +79,7 @@ long kvmppc_alloc_hpt(struct kvm *kvm)
 
 void kvmppc_free_hpt(struct kvm *kvm)
 {
-   clear_bit(kvm->arch.lpid, lpid_inuse);
+   kvmppc_free_lpid(kvm->arch.lpid);
vfree(kvm->arch.revmap);
free_pages(kvm->arch.hpt_virt, HPT_ORDER - PAGE_SHIFT);
 }
@@ -158,8 +151,7 @@ int kvmppc_mmu_hv_init(void)
if (!cpu_has_feature(CPU_FTR_HVMODE))
return -EINVAL;
 
-   memset(lpid_inuse, 0, sizeof(lpid_inuse));
-
+   /* POWER7 has 10-bit LPIDs, PPC970 and e500mc have 6-bit LPIDs */
if (cpu_has_feature(CPU_FTR_ARCH_206)) {
host_lpid = mfspr(SPRN_LPID);   /* POWER7 */
rsvd_lpid = LPID_RSVD;
@@ -168,9 +160,11 @@ int kvmppc_mmu_hv_init(void)
rsvd_lpid = MAX_LPID_970;
}
 
-   set_bit(host_lpid, lpid_inuse);
+   kvmppc_init_lpid(rsvd_lpid + 1);
+
+   kvmppc_claim_lpid(host_lpid);
/* rsvd_lpid is reserved for use in partition switching */
-   set_bit(rsvd_lpid, lpid_inuse);
+   kvmppc_claim_lpid(rsvd_lpid);
 
return 0;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 64c738dc..42701e5 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -800,6 +800,40 @@ out:
return r;
 }
 
+static unsigned long lpid_inuse[BITS_TO_LONGS(KVMPPC_NR_LPIDS)];
+static unsigned long nr_lpids;
+
+long kvmppc_alloc_lpid(void)
+{
+   long lpid;
+
+   do {
+   lpid = find_first_zero_bit(lpid_inuse, KVMPPC_NR_LPIDS);
+   if (lpid >= nr_lpids) {
+   pr_err("%s: No LPIDs free\n", __func__);
+   return -ENOMEM;
+   }
+   } while (test_and_set_bit(lpid, lpid_inus
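
Since the hunk above is cut off, here is a self-contained userspace sketch of
a bitmap ID allocator with the same four entry points (init, claim, alloc,
free). It is an illustration only: the toy_* names are hypothetical, it is
single-threaded where the kernel code uses atomic bit operations, and it
returns -1 where the patch returns -ENOMEM.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define TOY_NR_LPIDS  64  /* build-time maximum, hypothetical value */
#define TOY_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORDS     ((TOY_NR_LPIDS + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

static unsigned long toy_lpid_inuse[TOY_WORDS];
static unsigned long toy_nr_lpids;  /* runtime limit, <= TOY_NR_LPIDS */

static int toy_test_bit(unsigned long n)
{
	return (toy_lpid_inuse[n / TOY_WORD_BITS] >> (n % TOY_WORD_BITS)) & 1ul;
}

static void toy_set_bit(unsigned long n)
{
	toy_lpid_inuse[n / TOY_WORD_BITS] |= 1ul << (n % TOY_WORD_BITS);
}

static void toy_clear_bit(unsigned long n)
{
	toy_lpid_inuse[n / TOY_WORD_BITS] &= ~(1ul << (n % TOY_WORD_BITS));
}

/* Reset the bitmap and set how many IDs are usable on this machine. */
static void toy_init_lpid(unsigned long nr_lpids)
{
	assert(nr_lpids <= TOY_NR_LPIDS);
	memset(toy_lpid_inuse, 0, sizeof(toy_lpid_inuse));
	toy_nr_lpids = nr_lpids;
}

/* Mark a specific ID as taken, e.g. the host's own or a reserved one. */
static void toy_claim_lpid(long lpid)
{
	assert(lpid >= 0 && (unsigned long)lpid < toy_nr_lpids);
	toy_set_bit((unsigned long)lpid);
}

/* Return an ID to the pool. */
static void toy_free_lpid(long lpid)
{
	assert(lpid >= 0 && (unsigned long)lpid < toy_nr_lpids);
	toy_clear_bit((unsigned long)lpid);
}

/* Hand out the lowest free ID below the runtime limit, or -1 if none. */
static long toy_alloc_lpid(void)
{
	unsigned long lpid;

	for (lpid = 0; lpid < toy_nr_lpids; lpid++) {
		if (!toy_test_bit(lpid)) {
			toy_set_bit(lpid);
			return (long)lpid;
		}
	}
	return -1;
}
```

Usage mirrors kvmppc_mmu_hv_init() in the diff: initialise with the runtime
limit, claim the host and reserved IDs, then allocate guest IDs on demand.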

[RFC PATCH 06/16] KVM: PPC: booke: Move vm core init/destroy out of booke.c

2011-12-20 Thread Scott Wood
e500mc will want to do lpid allocation/deallocation here.

Signed-off-by: Scott Wood 
---
 arch/powerpc/kvm/44x.c   |9 +
 arch/powerpc/kvm/booke.c |9 -
 arch/powerpc/kvm/e500.c  |9 +
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/44x.c b/arch/powerpc/kvm/44x.c
index 879a1a7..50e7dbc 100644
--- a/arch/powerpc/kvm/44x.c
+++ b/arch/powerpc/kvm/44x.c
@@ -163,6 +163,15 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
kmem_cache_free(kvm_vcpu_cache, vcpu_44x);
 }
 
+int kvmppc_core_init_vm(struct kvm *kvm)
+{
+   return 0;
+}
+
+void kvmppc_core_destroy_vm(struct kvm *kvm)
+{
+}
+
 static int __init kvmppc_44x_init(void)
 {
int r;
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 933e611..f66e741 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -936,15 +936,6 @@ void kvmppc_core_commit_memory_region(struct kvm *kvm,
 {
 }
 
-int kvmppc_core_init_vm(struct kvm *kvm)
-{
-   return 0;
-}
-
-void kvmppc_core_destroy_vm(struct kvm *kvm)
-{
-}
-
 void kvmppc_set_tcr(struct kvm_vcpu *vcpu, u32 new_tcr)
 {
vcpu->arch.tcr = new_tcr;
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 923f375..80b9c84 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -226,6 +226,15 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
kmem_cache_free(kvm_vcpu_cache, vcpu_e500);
 }
 
+int kvmppc_core_init_vm(struct kvm *kvm)
+{
+   return 0;
+}
+
+void kvmppc_core_destroy_vm(struct kvm *kvm)
+{
+}
+
 static int __init kvmppc_e500_init(void)
 {
int r, i;
-- 
1.7.7.rc3.4.g8d714




[RFC PATCH 02/16] powerpc/e500: split CPU_FTRS_ALWAYS/CPU_FTRS_POSSIBLE

2011-12-20 Thread Scott Wood
Split e500 (v1/v2) and e500mc/e5500 to allow optimization of feature
checks that differ between the two.

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/cputable.h |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index 033ad30..a80be60 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -482,8 +482,10 @@ enum {
CPU_FTRS_E200 |
 #endif
 #ifdef CONFIG_E500
-   CPU_FTRS_E500 | CPU_FTRS_E500_2 | CPU_FTRS_E500MC |
-   CPU_FTRS_E5500 |
+   CPU_FTRS_E500 | CPU_FTRS_E500_2 |
+#endif
+#ifdef CONFIG_PPC_E500MC
+   CPU_FTRS_E500MC | CPU_FTRS_E5500 |
 #endif
0,
 };
@@ -527,8 +529,10 @@ enum {
CPU_FTRS_E200 &
 #endif
 #ifdef CONFIG_E500
-   CPU_FTRS_E500 & CPU_FTRS_E500_2 & CPU_FTRS_E500MC &
-   CPU_FTRS_E5500 &
+   CPU_FTRS_E500 & CPU_FTRS_E500_2 &
+#endif
+#ifdef CONFIG_PPC_E500MC
+   CPU_FTRS_E500MC & CPU_FTRS_E5500 &
 #endif
CPU_FTRS_POSSIBLE,
 };
-- 
1.7.7.rc3.4.g8d714
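
A rough illustration of why the split can help, using made-up feature bits.
The "possible" mask is the OR of every CPU family built in and the "always"
mask is their AND, so a feature test can be settled at compile time whenever
the bit is in "always" or absent from "possible". The TOY_* names and values
are hypothetical, and toy_has_feature() only sketches the general shape of
such a check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical feature bits. */
#define TOY_FTR_USE_TB  0x1
#define TOY_FTR_SPE     0x2
#define TOY_FTR_DBELL   0x4

/* Hypothetical per-family feature sets. */
#define TOY_FTRS_E500   (TOY_FTR_USE_TB | TOY_FTR_SPE)
#define TOY_FTRS_E500MC (TOY_FTR_USE_TB | TOY_FTR_DBELL)

/* Pretend build configuration: only the e500mc family is enabled. */
#define TOY_CONFIG_E500MC 1

enum {
	TOY_FTRS_POSSIBLE =
#ifdef TOY_CONFIG_E500
		TOY_FTRS_E500 |
#endif
#ifdef TOY_CONFIG_E500MC
		TOY_FTRS_E500MC |
#endif
		0,
};

enum {
	TOY_FTRS_ALWAYS =
#ifdef TOY_CONFIG_E500
		TOY_FTRS_E500 &
#endif
#ifdef TOY_CONFIG_E500MC
		TOY_FTRS_E500MC &
#endif
		TOY_FTRS_POSSIBLE,
};

static_assert((TOY_FTRS_ALWAYS & ~TOY_FTRS_POSSIBLE) == 0,
	      "ALWAYS must be a subset of POSSIBLE");

/*
 * With a constant 'feature', the first two tests fold away at compile
 * time; only bits that differ between built-in families reach the
 * runtime mask.
 */
static bool toy_has_feature(unsigned long runtime_features,
			    unsigned long feature)
{
	if (TOY_FTRS_ALWAYS & feature)
		return true;
	if (!(TOY_FTRS_POSSIBLE & feature))
		return false;
	return (runtime_features & feature) != 0;
}
```

In this configuration the doorbell bit is always present and the SPE bit is
never present, so both checks are decided without looking at the runtime
mask. If both families shared one config symbol, as before the patch, both
bits would fall into the runtime-checked group.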




[RFC PATCH 08/16] KVM: PPC: e500: merge <asm/kvm_e500.h> into arch/powerpc/kvm/e500.h

2011-12-20 Thread Scott Wood
Keeping two separate headers for e500-specific things was a
pain, and wasn't even organized along any logical boundary.

There was TLB stuff in <asm/kvm_e500.h> despite the existence of
arch/powerpc/kvm/e500_tlb.h, and nothing in <asm/kvm_e500.h> needed
to be referenced from outside arch/powerpc/kvm.

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/kvm_e500.h |   96 ---
 arch/powerpc/kvm/e500.h |   82 --
 2 files changed, 78 insertions(+), 100 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/kvm_e500.h

diff --git a/arch/powerpc/include/asm/kvm_e500.h b/arch/powerpc/include/asm/kvm_e500.h
deleted file mode 100644
index 8cd50a5..000
--- a/arch/powerpc/include/asm/kvm_e500.h
+++ /dev/null
@@ -1,96 +0,0 @@
-/*
- * Copyright (C) 2008-2011 Freescale Semiconductor, Inc. All rights reserved.
- *
- * Author: Yu Liu, 
- *
- * Description:
- * This file is derived from arch/powerpc/include/asm/kvm_44x.h,
- * by Hollis Blanchard .
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License, version 2, as
- * published by the Free Software Foundation.
- */
-
-#ifndef __ASM_KVM_E500_H__
-#define __ASM_KVM_E500_H__
-
-#include 
-
-#define BOOKE_INTERRUPT_SIZE 36
-
-#define E500_PID_NUM   3
-#define E500_TLB_NUM   2
-
-#define E500_TLB_VALID 1
-#define E500_TLB_DIRTY 2
-
-struct tlbe_ref {
-   pfn_t pfn;
-   unsigned int flags; /* E500_TLB_* */
-};
-
-struct tlbe_priv {
-   struct tlbe_ref ref; /* TLB0 only -- TLB1 uses tlb_refs */
-};
-
-struct vcpu_id_table;
-
-struct kvmppc_e500_tlb_params {
-   int entries, ways, sets;
-};
-
-struct kvmppc_vcpu_e500 {
-   /* Unmodified copy of the guest's TLB -- shared with host userspace. */
-   struct kvm_book3e_206_tlb_entry *gtlb_arch;
-
-   /* Starting entry number in gtlb_arch[] */
-   int gtlb_offset[E500_TLB_NUM];
-
-   /* KVM internal information associated with each guest TLB entry */
-   struct tlbe_priv *gtlb_priv[E500_TLB_NUM];
-
-   struct kvmppc_e500_tlb_params gtlb_params[E500_TLB_NUM];
-
-   unsigned int gtlb_nv[E500_TLB_NUM];
-
-   /*
-* information associated with each host TLB entry --
-* TLB1 only for now.  If/when guest TLB1 entries can be
-* mapped with host TLB0, this will be used for that too.
-*
-* We don't want to use this for guest TLB0 because then we'd
-* have the overhead of doing the translation again even if
-* the entry is still in the guest TLB (e.g. we swapped out
-* and back, and our host TLB entries got evicted).
-*/
-   struct tlbe_ref *tlb_refs[E500_TLB_NUM];
-   unsigned int host_tlb1_nv;
-
-   u32 host_pid[E500_PID_NUM];
-   u32 pid[E500_PID_NUM];
-   u32 svr;
-
-   /* vcpu id table */
-   struct vcpu_id_table *idt;
-
-   u32 l1csr0;
-   u32 l1csr1;
-   u32 hid0;
-   u32 hid1;
-   u32 tlb0cfg;
-   u32 tlb1cfg;
-   u64 mcar;
-
-   struct page **shared_tlb_pages;
-   int num_shared_tlb_pages;
-
-   struct kvm_vcpu vcpu;
-};
-
-static inline struct kvmppc_vcpu_e500 *to_e500(struct kvm_vcpu *vcpu)
-{
-   return container_of(vcpu, struct kvmppc_vcpu_e500, vcpu);
-}
-
-#endif /* __ASM_KVM_E500_H__ */
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 02ecde2..51d13bd 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -1,11 +1,12 @@
 /*
  * Copyright (C) 2008-2011 Freescale Semiconductor, Inc. All rights reserved.
  *
- * Author: Yu Liu, yu@freescale.com
+ * Author: Yu Liu 
  *
  * Description:
- * This file is based on arch/powerpc/kvm/44x_tlb.h,
- * by Hollis Blanchard .
+ * This file is based on arch/powerpc/kvm/44x_tlb.h and
+ * arch/powerpc/include/asm/kvm_44x.h by Hollis Blanchard ,
+ * Copyright IBM Corp. 2007-2008
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License, version 2, as
@@ -18,7 +19,80 @@
 #include 
 #include 
 #include 
-#include 
+
+#define E500_PID_NUM   3
+#define E500_TLB_NUM   2
+
+#define E500_TLB_VALID 1
+#define E500_TLB_DIRTY 2
+
+struct tlbe_ref {
+   pfn_t pfn;
+   unsigned int flags; /* E500_TLB_* */
+};
+
+struct tlbe_priv {
+   struct tlbe_ref ref; /* TLB0 only -- TLB1 uses tlb_refs */
+};
+
+struct vcpu_id_table;
+
+struct kvmppc_e500_tlb_params {
+   int entries, ways, sets;
+};
+
+struct kvmppc_vcpu_e500 {
+   /* Unmodified copy of the guest's TLB -- shared with host userspace. */
+   struct kvm_book3e_206_tlb_entry *gtlb_arch;
+
+   /* Starting entry number in gtlb_arch[] */
+   int gtlb_offset[E500_TLB_NUM];
+
+   /* KVM internal information associated with each guest TLB entry */
+   struct tlbe_priv *gtlb_priv[E500_TLB_NUM];
+
+   struct kvmppc_e500_tlb_params gtlb_params[E500_TLB_NUM];
+
+   unsigned int gtlb_nv[E5
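
The to_e500() helper being moved by this patch relies on the container_of
idiom: given a pointer to an embedded member, subtract the member's offset to
recover the enclosing structure. A minimal userspace sketch follows, with
hypothetical toy_* types and a local macro rather than the kernel's
container_of.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel macro, without its type checking. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic part, known to core code. */
struct toy_vcpu {
	int id;
};

/* Core-specific wrapper that embeds the generic part. */
struct toy_vcpu_e500 {
	unsigned int pid[3];
	struct toy_vcpu vcpu;  /* deliberately not the first member */
};

/* Recover the wrapper from a pointer to its embedded generic part. */
static struct toy_vcpu_e500 *toy_to_e500(struct toy_vcpu *vcpu)
{
	return toy_container_of(vcpu, struct toy_vcpu_e500, vcpu);
}
```

The conversion works wherever the member sits inside the structure; when it
is the first member, the offset is zero and the two pointers are equal.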

[RFC PATCH 07/16] KVM: PPC: e500: rename e500_tlb.h to e500.h

2011-12-20 Thread Scott Wood
This is in preparation for merging in the contents of
arch/powerpc/include/asm/kvm_e500.h.

Signed-off-by: Scott Wood 
---
 arch/powerpc/kvm/e500.c |2 +-
 arch/powerpc/kvm/{e500_tlb.h => e500.h} |6 +++---
 arch/powerpc/kvm/e500_emulate.c |2 +-
 arch/powerpc/kvm/e500_tlb.c |2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)
 rename arch/powerpc/kvm/{e500_tlb.h => e500.h} (98%)

diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 80b9c84..faa32df 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -24,7 +24,7 @@
 #include 
 
 #include "booke.h"
-#include "e500_tlb.h"
+#include "e500.h"
 
 void kvmppc_core_load_host_debugstate(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/e500_tlb.h b/arch/powerpc/kvm/e500.h
similarity index 98%
rename from arch/powerpc/kvm/e500_tlb.h
rename to arch/powerpc/kvm/e500.h
index 5c6d2d7..02ecde2 100644
--- a/arch/powerpc/kvm/e500_tlb.h
+++ b/arch/powerpc/kvm/e500.h
@@ -12,8 +12,8 @@
  * published by the Free Software Foundation.
  */
 
-#ifndef __KVM_E500_TLB_H__
-#define __KVM_E500_TLB_H__
+#ifndef KVM_E500_H
+#define KVM_E500_H
 
 #include 
 #include 
@@ -171,4 +171,4 @@ static inline int tlbe_is_host_safe(const struct kvm_vcpu *vcpu,
return 1;
 }
 
-#endif /* __KVM_E500_TLB_H__ */
+#endif /* KVM_E500_H */
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index 6d0b2bd..2a1a228 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -17,7 +17,7 @@
 #include 
 
 #include "booke.h"
-#include "e500_tlb.h"
+#include "e500.h"
 
 #define XOP_TLBIVAX 786
 #define XOP_TLBSX   914
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 6e9bc42..3ec3ad6 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -29,7 +29,7 @@
 #include 
 
 #include "../mm/mmu_decl.h"
-#include "e500_tlb.h"
+#include "e500.h"
 #include "trace.h"
 #include "timing.h"
 
-- 
1.7.7.rc3.4.g8d714



[RFC PATCH 09/16] KVM: PPC: e500: clean up arch/powerpc/kvm/e500.h

2011-12-20 Thread Scott Wood
Move vcpu to the beginning of vcpu_e500 to give it appropriate
prominence, especially if more fields end up getting added to the
end of vcpu_e500 (and vcpu ends up in the middle).

Remove gratuitous "extern" and add parameter names to prototypes.

Signed-off-by: Scott Wood 
---
 arch/powerpc/kvm/e500.h |   32 ++--
 1 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 51d13bd..6b53a88 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -42,6 +42,8 @@ struct kvmppc_e500_tlb_params {
 };
 
 struct kvmppc_vcpu_e500 {
+   struct kvm_vcpu vcpu;
+
/* Unmodified copy of the guest's TLB -- shared with host userspace. */
struct kvm_book3e_206_tlb_entry *gtlb_arch;
 
@@ -72,9 +74,6 @@ struct kvmppc_vcpu_e500 {
u32 pid[E500_PID_NUM];
u32 svr;
 
-   /* vcpu id table */
-   struct vcpu_id_table *idt;
-
u32 l1csr0;
u32 l1csr1;
u32 hid0;
@@ -85,8 +84,6 @@ struct kvmppc_vcpu_e500 {
 
struct page **shared_tlb_pages;
int num_shared_tlb_pages;
-
-   struct kvm_vcpu vcpu;
 };
 
 static inline struct kvmppc_vcpu_e500 *to_e500(struct kvm_vcpu *vcpu)
@@ -113,19 +110,18 @@ static inline struct kvmppc_vcpu_e500 *to_e500(struct kvm_vcpu *vcpu)
  (MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3 \
   | E500_TLB_USER_PERM_MASK | E500_TLB_SUPER_PERM_MASK)
 
-extern void kvmppc_dump_tlbs(struct kvm_vcpu *);
-extern int kvmppc_e500_emul_mt_mmucsr0(struct kvmppc_vcpu_e500 *, ulong);
-extern int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *);
-extern int kvmppc_e500_emul_tlbre(struct kvm_vcpu *);
-extern int kvmppc_e500_emul_tlbivax(struct kvm_vcpu *, int, int);
-extern int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *, int);
-extern int kvmppc_e500_tlb_search(struct kvm_vcpu *, gva_t, unsigned int, int);
-extern void kvmppc_e500_tlb_put(struct kvm_vcpu *);
-extern void kvmppc_e500_tlb_load(struct kvm_vcpu *, int);
-extern int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *);
-extern void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *);
-extern void kvmppc_e500_tlb_setup(struct kvmppc_vcpu_e500 *);
-extern void kvmppc_e500_recalc_shadow_pid(struct kvmppc_vcpu_e500 *);
+int kvmppc_e500_emul_mt_mmucsr0(struct kvmppc_vcpu_e500 *vcpu_e500,
+   ulong value);
+int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu);
+int kvmppc_e500_emul_tlbre(struct kvm_vcpu *vcpu);
+int kvmppc_e500_emul_tlbivax(struct kvm_vcpu *vcpu, int ra, int rb);
+int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb);
+int kvmppc_e500_tlb_search(struct kvm_vcpu *, gva_t, unsigned int, int);
+int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500);
+void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *vcpu_e500);
+
+void kvmppc_get_sregs_e500_tlb(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
+int kvmppc_set_sregs_e500_tlb(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
 
 /* TLB helper functions */
 static inline unsigned int
-- 
1.7.7.rc3.4.g8d714
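
A minimal sketch of why moving vcpu within the structure is safe for to_e500() callers. The body of to_e500() is not shown in this patch, so the sketch assumes it follows the kernel's usual container_of() pattern. All type and macro names ending in _sketch are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real struct kvm_vcpu. */
struct kvm_vcpu_sketch {
	int vcpu_id;
};

struct vcpu_e500_sketch {
	struct kvm_vcpu_sketch vcpu;	/* embedded first, as in the patch */
	unsigned int svr;
};

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the enclosing e500 structure from the embedded vcpu pointer. */
static struct vcpu_e500_sketch *to_e500_sketch(struct kvm_vcpu_sketch *vcpu)
{
	assert(vcpu != NULL);
	return container_of_sketch(vcpu, struct vcpu_e500_sketch, vcpu);
}
```

Because the conversion subtracts offsetof() rather than assuming a fixed position, it gives the same result wherever the member sits. Placing it first only makes the offset zero.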


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 11/16] KVM: PPC: e500: Track TLB1 entries with a bitmap

2011-12-20 Thread Scott Wood
Rather than invalidate everything when a TLB1 entry needs to be
taken down, keep track of which host TLB1 entries are used for
a given guest TLB1 entry, and invalidate just those entries.

Based on code from Ashish Kalra 
and Liu Yu .

Signed-off-by: Scott Wood 
---
 arch/powerpc/kvm/e500.h |5 +++
 arch/powerpc/kvm/e500_tlb.c |   72 ---
 2 files changed, 72 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 34cef08..f4dee55 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -2,6 +2,7 @@
  * Copyright (C) 2008-2011 Freescale Semiconductor, Inc. All rights reserved.
  *
  * Author: Yu Liu 
+ * Ashish Kalra 
  *
  * Description:
  * This file is based on arch/powerpc/kvm/44x_tlb.h and
@@ -25,6 +26,7 @@
 
 #define E500_TLB_VALID 1
 #define E500_TLB_DIRTY 2
+#define E500_TLB_BITMAP 4
 
 struct tlbe_ref {
pfn_t pfn;
@@ -82,6 +84,9 @@ struct kvmppc_vcpu_e500 {
struct page **shared_tlb_pages;
int num_shared_tlb_pages;
 
+   u64 *g2h_tlb1_map;
+   unsigned int *h2g_tlb1_rmap;
+
 #ifdef CONFIG_KVM_E500
u32 pid[E500_PID_NUM];
 
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index b306270..031fd5b 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -2,6 +2,7 @@
  * Copyright (C) 2008-2011 Freescale Semiconductor, Inc. All rights reserved.
  *
  * Author: Yu Liu, yu@freescale.com
+ * Ashish Kalra, ashish.ka...@freescale.com
  *
  * Description:
  * This file is based on arch/powerpc/kvm/44x_tlb.c,
@@ -175,8 +176,28 @@ static void inval_gtlbe_on_host(struct kvmppc_vcpu_e500 *vcpu_e500,
struct kvm_book3e_206_tlb_entry *gtlbe =
get_entry(vcpu_e500, tlbsel, esel);
 
-   if (tlbsel == 1) {
-   kvmppc_e500_tlbil_all(vcpu_e500);
+   if (tlbsel == 1 &&
+   vcpu_e500->gtlb_priv[1][esel].ref.flags & E500_TLB_BITMAP) {
+   u64 tmp = vcpu_e500->g2h_tlb1_map[esel];
+   int hw_tlb_indx;
+   unsigned long flags;
+
+   local_irq_save(flags);
+   while (tmp) {
+   hw_tlb_indx = __ilog2_u64(tmp & -tmp);
+   mtspr(SPRN_MAS0,
+ MAS0_TLBSEL(1) |
+ MAS0_ESEL(to_htlb1_esel(hw_tlb_indx)));
+   mtspr(SPRN_MAS1, 0);
+   asm volatile("tlbwe");
+   vcpu_e500->h2g_tlb1_rmap[hw_tlb_indx] = 0;
+   tmp &= tmp - 1;
+   }
+   mb();
+   vcpu_e500->g2h_tlb1_map[esel] = 0;
+   vcpu_e500->gtlb_priv[1][esel].ref.flags &= ~E500_TLB_BITMAP;
+   local_irq_restore(flags);
+
return;
}
 
@@ -282,6 +303,16 @@ static inline void kvmppc_e500_ref_release(struct tlbe_ref *ref)
}
 }
 
+static void clear_tlb1_bitmap(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+   if (vcpu_e500->g2h_tlb1_map)
+   memset(vcpu_e500->g2h_tlb1_map, 0,
+  sizeof(u64) * vcpu_e500->gtlb_params[1].entries);
+   if (vcpu_e500->h2g_tlb1_rmap)
+   memset(vcpu_e500->h2g_tlb1_rmap, 0,
+  sizeof(unsigned int) * host_tlb_params[1].entries);
+}
+
 static void clear_tlb_privs(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
int tlbsel = 0;
@@ -511,7 +542,7 @@ static void kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 /* XXX for both one-one and one-to-many , for now use TLB1 */
 static int kvmppc_e500_tlb1_map(struct kvmppc_vcpu_e500 *vcpu_e500,
u64 gvaddr, gfn_t gfn, struct kvm_book3e_206_tlb_entry *gtlbe,
-   struct kvm_book3e_206_tlb_entry *stlbe)
+   struct kvm_book3e_206_tlb_entry *stlbe, int esel)
 {
struct tlbe_ref *ref;
unsigned int victim;
@@ -524,6 +555,14 @@ static int kvmppc_e500_tlb1_map(struct kvmppc_vcpu_e500 *vcpu_e500,
ref = &vcpu_e500->tlb_refs[1][victim];
kvmppc_e500_shadow_map(vcpu_e500, gvaddr, gfn, gtlbe, 1, stlbe, ref);
 
+   vcpu_e500->g2h_tlb1_map[esel] |= (u64)1 << victim;
+   vcpu_e500->gtlb_priv[1][esel].ref.flags |= E500_TLB_BITMAP;
+   if (vcpu_e500->h2g_tlb1_rmap[victim]) {
+   unsigned int idx = vcpu_e500->h2g_tlb1_rmap[victim];
+   vcpu_e500->g2h_tlb1_map[idx] &= ~(1ULL << victim);
+   }
+   vcpu_e500->h2g_tlb1_rmap[victim] = esel;
+
return victim;
 }
 
@@ -728,7 +767,7 @@ int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
 * are mapped on the fly. */
stlbsel = 1;
sesel = kvmppc_e500_tlb1_map(vcpu_e500, eaddr,
-   raddr >> PAGE_SHIFT, gtlbe, &stlbe);
+   raddr >> PAGE_SHIFT, gtlbe, &stlbe, esel);
break;
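
The bitmap walk in inval_gtlbe_on_host() above relies on two bit tricks: tmp & -tmp isolates the lowest set bit, and tmp &= tmp - 1 clears it. A minimal userspace sketch of that loop follows. The function names are hypothetical, and a plain shift loop stands in for the kernel's __ilog2_u64().

```c
#include <assert.h>
#include <stdint.h>

/* Index of the only set bit in v; v must be a power of two. */
static int single_bit_index(uint64_t v)
{
	int idx = 0;

	assert(v != 0 && (v & (v - 1)) == 0);
	while (!(v & 1)) {
		v >>= 1;
		idx++;
	}
	return idx;
}

/*
 * Visit every set bit of map, lowest first, storing the bit indices in
 * out[]. Returns the number of indices stored (at most max).
 */
static int walk_bitmap(uint64_t map, int *out, int max)
{
	int n = 0;

	while (map && n < max) {
		/* Isolate the lowest set bit and convert it to an index. */
		out[n++] = single_bit_index(map & -map);
		/* Clear the lowest set bit so the next pass finds the next one. */
		map &= map - 1;
	}
	return n;
}
```

In the patch each index found this way names one host TLB1 entry to invalidate, so the cost scales with the number of host entries actually backing the guest entry rather than with the size of TLB1.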
 

[RFC PATCH 13/16] powerpc/booke: Provide exception macros with interrupt name

2011-12-20 Thread Scott Wood
DO_KVM will need to identify the particular exception type.

Linux already passes a set of arbitrary trap numbers, but they are
undocumented and only loosely correspond to the server/classic
exception vectors.

FIXME: Replace the existing trap numbering rather than add to it.

Signed-off-by: Scott Wood 
---
 arch/powerpc/kernel/head_44x.S   |   23 +--
 arch/powerpc/kernel/head_booke.h |   41 ++
 arch/powerpc/kernel/head_fsl_booke.S |   52 +-
 3 files changed, 68 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/kernel/head_44x.S b/arch/powerpc/kernel/head_44x.S
index b725dab..51a49f6 100644
--- a/arch/powerpc/kernel/head_44x.S
+++ b/arch/powerpc/kernel/head_44x.S
@@ -160,10 +160,11 @@ _ENTRY(_start);
 
 interrupt_base:
/* Critical Input Interrupt */
-   CRITICAL_EXCEPTION(0x0100, CriticalInput, unknown_exception)
+   CRITICAL_EXCEPTION(0x0100, CRITICAL, CriticalInput, unknown_exception)
 
/* Machine Check Interrupt */
-   CRITICAL_EXCEPTION(0x0200, MachineCheck, machine_check_exception)
+   CRITICAL_EXCEPTION(0x0200, MACHINE_CHECK, MachineCheck, \
+  machine_check_exception)
MCHECK_EXCEPTION(0x0210, MachineCheckA, machine_check_exception)
 
/* Data Storage Interrupt */
@@ -173,7 +174,8 @@ interrupt_base:
INSTRUCTION_STORAGE_EXCEPTION
 
/* External Input Interrupt */
-   EXCEPTION(0x0500, ExternalInput, do_IRQ, EXC_XFER_LITE)
+   EXCEPTION(0x0500, BOOKE_INTERRUPT_EXTERNAL, ExternalInput, \
+ do_IRQ, EXC_XFER_LITE)
 
/* Alignment Interrupt */
ALIGNMENT_EXCEPTION
@@ -185,29 +187,32 @@ interrupt_base:
 #ifdef CONFIG_PPC_FPU
FP_UNAVAILABLE_EXCEPTION
 #else
-   EXCEPTION(0x2010, FloatingPointUnavailable, unknown_exception, EXC_XFER_EE)
+   EXCEPTION(0x2010, BOOKE_INTERRUPT_FP_UNAVAIL, \
+ FloatingPointUnavailable, unknown_exception, EXC_XFER_EE)
 #endif
/* System Call Interrupt */
START_EXCEPTION(SystemCall)
-   NORMAL_EXCEPTION_PROLOG
+   NORMAL_EXCEPTION_PROLOG(BOOKE_INTERRUPT_SYSCALL)
EXC_XFER_EE_LITE(0x0c00, DoSyscall)
 
/* Auxiliary Processor Unavailable Interrupt */
-   EXCEPTION(0x2020, AuxillaryProcessorUnavailable, unknown_exception, EXC_XFER_EE)
+   EXCEPTION(0x2020, BOOKE_INTERRUPT_AP_UNAVAIL, \
+ AuxillaryProcessorUnavailable, unknown_exception, EXC_XFER_EE)
 
/* Decrementer Interrupt */
DECREMENTER_EXCEPTION
 
/* Fixed Internal Timer Interrupt */
/* TODO: Add FIT support */
-   EXCEPTION(0x1010, FixedIntervalTimer, unknown_exception, EXC_XFER_EE)
+   EXCEPTION(0x1010, BOOKE_INTERRUPT_FIT, FixedIntervalTimer, \
+ unknown_exception, EXC_XFER_EE)
 
/* Watchdog Timer Interrupt */
/* TODO: Add watchdog support */
 #ifdef CONFIG_BOOKE_WDT
-   CRITICAL_EXCEPTION(0x1020, WatchdogTimer, WatchdogException)
+   CRITICAL_EXCEPTION(0x1020, WATCHDOG, WatchdogTimer, WatchdogException)
 #else
-   CRITICAL_EXCEPTION(0x1020, WatchdogTimer, unknown_exception)
+   CRITICAL_EXCEPTION(0x1020, WATCHDOG, WatchdogTimer, unknown_exception)
 #endif
 
/* Data TLB Error Interrupt */
diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index fc921bf..06ab353 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -2,6 +2,8 @@
 #define __HEAD_BOOKE_H__
 
 #include /* for STACK_FRAME_REGS_MARKER */
+#include 
+
 /*
  * Macros used for common Book-e exception handling
  */
@@ -28,7 +30,7 @@
  */
 #define THREAD_NORMSAVE(offset)(THREAD_NORMSAVES + (offset * 4))
 
-#define NORMAL_EXCEPTION_PROLOG \
+#define NORMAL_EXCEPTION_PROLOG(intno) \
mtspr   SPRN_SPRG_WSCRATCH0, r10;   /* save one register */  \
mfspr   r10, SPRN_SPRG_THREAD;   \
stw r11, THREAD_NORMSAVE(0)(r10);\
@@ -113,7 +115,7 @@
  * registers as the normal prolog above. Instead we use a portion of the
  * critical/machine check exception stack at low physical addresses.
  */
-#define EXC_LEVEL_EXCEPTION_PROLOG(exc_level, exc_level_srr0, exc_level_srr1) \
+#define EXC_LEVEL_EXCEPTION_PROLOG(exc_level, intno, exc_level_srr0, exc_level_srr1) \
mtspr   SPRN_SPRG_WSCRATCH_##exc_level,r8;   \
BOOKE_LOAD_EXC_LEVEL_STACK(exc_level);/* r8 points to the exc_level 
stack*/ \
stw r9,GPR9(r8);/* save various registers  */\
@@ -162,12 +164,13 @@
SAVE_4GPRS(3, r11);  \
SAVE_2GPRS(7, r11)
 
-#define CRITICAL_EXCEPTION_PROLOG \
-   EXC_LEVEL_EXCEPTI

[RFC PATCH 03/16] KVM: PPC: Use pt_regs in vcpu->arch

2011-12-20 Thread Scott Wood
This makes it easy to pass to host exception handling functions, in
situations where we can't just let the interrupt happen again
naturally.

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/kvm_book3s.h   |   28 
 arch/powerpc/include/asm/kvm_booke.h|   24 +++---
 arch/powerpc/include/asm/kvm_host.h |   11 ++---
 arch/powerpc/kernel/asm-offsets.c   |   17 +--
 arch/powerpc/kvm/book3s_32_mmu.c|2 +-
 arch/powerpc/kvm/book3s_hv.c|9 ---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |   12 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |4 +-
 arch/powerpc/kvm/booke.c|   34 +-
 arch/powerpc/kvm/booke_emulate.c|2 +-
 arch/powerpc/kvm/e500_tlb.c |2 +-
 11 files changed, 70 insertions(+), 75 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index ea9539c..60e069e 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -206,7 +206,7 @@ static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, ulong val)
svcpu_put(svcpu);
to_book3s(vcpu)->shadow_vcpu->gpr[num] = val;
} else
-   vcpu->arch.gpr[num] = val;
+   vcpu->arch.regs.gpr[num] = val;
 }
 
 static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, int num)
@@ -217,7 +217,7 @@ static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, int num)
svcpu_put(svcpu);
return r;
} else
-   return vcpu->arch.gpr[num];
+   return vcpu->arch.regs.gpr[num];
 }
 
 static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
@@ -360,62 +360,62 @@ static inline void kvmppc_update_int_pending(struct kvm_vcpu *vcpu,
 
 static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, ulong val)
 {
-   vcpu->arch.gpr[num] = val;
+   vcpu->arch.regs.gpr[num] = val;
 }
 
 static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, int num)
 {
-   return vcpu->arch.gpr[num];
+   return vcpu->arch.regs.gpr[num];
 }
 
 static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
 {
-   vcpu->arch.cr = val;
+   vcpu->arch.regs.ccr = val;
 }
 
 static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.cr;
+   return vcpu->arch.regs.ccr;
 }
 
 static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, u32 val)
 {
-   vcpu->arch.xer = val;
+   vcpu->arch.regs.xer = val;
 }
 
 static inline u32 kvmppc_get_xer(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.xer;
+   return vcpu->arch.regs.xer;
 }
 
 static inline void kvmppc_set_ctr(struct kvm_vcpu *vcpu, ulong val)
 {
-   vcpu->arch.ctr = val;
+   vcpu->arch.regs.ctr = val;
 }
 
 static inline ulong kvmppc_get_ctr(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.ctr;
+   return vcpu->arch.regs.ctr;
 }
 
 static inline void kvmppc_set_lr(struct kvm_vcpu *vcpu, ulong val)
 {
-   vcpu->arch.lr = val;
+   vcpu->arch.regs.link = val;
 }
 
 static inline ulong kvmppc_get_lr(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.lr;
+   return vcpu->arch.regs.link;
 }
 
 static inline void kvmppc_set_pc(struct kvm_vcpu *vcpu, ulong val)
 {
-   vcpu->arch.pc = val;
+   vcpu->arch.regs.nip = val;
 }
 
 static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.pc;
+   return vcpu->arch.regs.nip;
 }
 
 static inline u32 kvmppc_get_last_inst(struct kvm_vcpu *vcpu)
diff --git a/arch/powerpc/include/asm/kvm_booke.h b/arch/powerpc/include/asm/kvm_booke.h
index a90e091..e20c162 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -25,32 +25,32 @@
 
 static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, ulong val)
 {
-   vcpu->arch.gpr[num] = val;
+   vcpu->arch.regs.gpr[num] = val;
 }
 
 static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, int num)
 {
-   return vcpu->arch.gpr[num];
+   return vcpu->arch.regs.gpr[num];
 }
 
 static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
 {
-   vcpu->arch.cr = val;
+   vcpu->arch.regs.ccr = val;
 }
 
 static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.cr;
+   return vcpu->arch.regs.ccr;
 }
 
 static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, u32 val)
 {
-   vcpu->arch.xer = val;
+   vcpu->arch.regs.xer = val;
 }
 
 static inline u32 kvmppc_get_xer(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.xer;
+   return vcpu->arch.regs.xer;
 }
 
 static inline u32 kvmppc_get_last_inst(struct kvm_vcpu *vcpu)
@@ -60,32 +60,32 @@ static inline u32 kvmppc_get_last_inst(struct kvm_vcpu *vcpu)
 
 static inline void kvmppc_set_ctr(struct kvm_vcpu *vcpu, ulong val)
 {
-   vcpu->arch.ctr = val;
+   vcpu->arch.regs.ctr = val;
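
The accessor changes above keep callers unchanged while the storage moves into an embedded pt_regs, which can then be handed directly to host code that expects a register frame. A minimal sketch of that idea follows. The structure is a trimmed-down, hypothetical stand-in for the powerpc pt_regs, and host_handler_sketch() is an invented placeholder for a host exception handler.

```c
#include <assert.h>

/* Hypothetical, trimmed-down stand-in for the powerpc struct pt_regs. */
struct pt_regs_sketch {
	unsigned long gpr[32];
	unsigned long nip;	/* program counter */
	unsigned long ctr;
	unsigned long link;	/* link register */
	unsigned long xer;
	unsigned long ccr;	/* condition register */
};

struct vcpu_arch_sketch {
	struct pt_regs_sketch regs;
};

/* Accessors hide the field names, so callers survive the rename. */
static void set_pc_sketch(struct vcpu_arch_sketch *arch, unsigned long val)
{
	arch->regs.nip = val;
}

static unsigned long get_pc_sketch(const struct vcpu_arch_sketch *arch)
{
	return arch->regs.nip;
}

/* Hypothetical host-side handler that only understands pt_regs. */
static unsigned long host_handler_sketch(const struct pt_regs_sketch *regs)
{
	assert(regs != 0);
	return regs->nip;
}
```

With the registers grouped this way, passing &arch.regs to the handler needs no copying or translation step.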

[RFC PATCH 10/16] KVM: PPC: e500: refactor core-specific TLB code

2011-12-20 Thread Scott Wood
The PID handling is e500v1/v2-specific, and is moved to e500.c.

The MMU sregs code and kvmppc_core_vcpu_translate will be shared with
e500mc, and are moved from e500.c to e500_tlb.c.

Partially based on patches from Liu Yu .

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/kvm_host.h |2 +
 arch/powerpc/kvm/e500.c |  358 +++
 arch/powerpc/kvm/e500.h |   55 -
 arch/powerpc/kvm/e500_emulate.c |7 +-
 arch/powerpc/kvm/e500_tlb.c |  461 +--
 5 files changed, 473 insertions(+), 410 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 443f007..ad4d671 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -415,6 +415,8 @@ struct kvm_vcpu_arch {
ulong fault_esr;
ulong queued_dear;
ulong queued_esr;
+   u32 tlbcfg[4];
+   u32 mmucfg;
 #endif
gpa_t paddr_accessed;
 
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index faa32df..77e3134 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -20,12 +20,283 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
+#include "../mm/mmu_decl.h"
 #include "booke.h"
 #include "e500.h"
 
+struct id {
+   unsigned long val;
+   struct id **pentry;
+};
+
+#define NUM_TIDS 256
+
+/*
+ * This table provide mappings from:
+ * (guestAS,guestTID,guestPR) --> ID of physical cpu
+ * guestAS [0..1]
+ * guestTID[0..255]
+ * guestPR [0..1]
+ * ID  [1..255]
+ * Each vcpu keeps one vcpu_id_table.
+ */
+struct vcpu_id_table {
+   struct id id[2][NUM_TIDS][2];
+};
+
+/*
+ * This table provide reversed mappings of vcpu_id_table:
+ * ID --> address of vcpu_id_table item.
+ * Each physical core has one pcpu_id_table.
+ */
+struct pcpu_id_table {
+   struct id *entry[NUM_TIDS];
+};
+
+static DEFINE_PER_CPU(struct pcpu_id_table, pcpu_sids);
+
+/* This variable keeps last used shadow ID on local core.
+ * The valid range of shadow ID is [1..255] */
+static DEFINE_PER_CPU(unsigned long, pcpu_last_used_sid);
+
+/*
+ * Allocate a free shadow id and setup a valid sid mapping in given entry.
+ * A mapping is only valid when vcpu_id_table and pcpu_id_table are match.
+ *
+ * The caller must have preemption disabled, and keep it that way until
+ * it has finished with the returned shadow id (either written into the
+ * TLB or arch.shadow_pid, or discarded).
+ */
+static inline int local_sid_setup_one(struct id *entry)
+{
+   unsigned long sid;
+   int ret = -1;
+
+   sid = ++(__get_cpu_var(pcpu_last_used_sid));
+   if (sid < NUM_TIDS) {
+   __get_cpu_var(pcpu_sids).entry[sid] = entry;
+   entry->val = sid;
+   entry->pentry = &__get_cpu_var(pcpu_sids).entry[sid];
+   ret = sid;
+   }
+
+   /*
+* If sid == NUM_TIDS, we've run out of sids.  We return -1, and
+* the caller will invalidate everything and start over.
+*
+* sid > NUM_TIDS indicates a race, which we disable preemption to
+* avoid.
+*/
+   WARN_ON(sid > NUM_TIDS);
+
+   return ret;
+}
+
+/*
+ * Check if given entry contain a valid shadow id mapping.
+ * An ID mapping is considered valid only if
+ * both vcpu and pcpu know this mapping.
+ *
+ * The caller must have preemption disabled, and keep it that way until
+ * it has finished with the returned shadow id (either written into the
+ * TLB or arch.shadow_pid, or discarded).
+ */
+static inline int local_sid_lookup(struct id *entry)
+{
+   if (entry && entry->val != 0 &&
+   __get_cpu_var(pcpu_sids).entry[entry->val] == entry &&
+   entry->pentry == &__get_cpu_var(pcpu_sids).entry[entry->val])
+   return entry->val;
+   return -1;
+}
+
+/* Invalidate all id mappings on local core -- call with preempt disabled */
+static inline void local_sid_destroy_all(void)
+{
+   __get_cpu_var(pcpu_last_used_sid) = 0;
+   memset(&__get_cpu_var(pcpu_sids), 0, sizeof(__get_cpu_var(pcpu_sids)));
+}
+
+static void *kvmppc_e500_id_table_alloc(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+   vcpu_e500->idt = kzalloc(sizeof(struct vcpu_id_table), GFP_KERNEL);
+   return vcpu_e500->idt;
+}
+
+static void kvmppc_e500_id_table_free(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+   kfree(vcpu_e500->idt);
+   vcpu_e500->idt = NULL;
+}
+
+/* Map guest pid to shadow.
+ * We use PID to keep shadow of current guest non-zero PID,
+ * and use PID1 to keep shadow of guest zero PID.
+ * So that guest tlbe with TID=0 can be accessed at any time */
+static void kvmppc_e500_recalc_shadow_pid(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+   preempt_disable();
+   vcpu_e500->vcpu.arch.shadow_pid = kvmppc_e500_get_sid(vcpu_e500,
+   get_cur_as(&vcpu_e500->vcpu),
+   get_cur_pid(&vcpu_e500->vcpu),
+  
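
The shadow ID scheme above treats a mapping as valid only when the vcpu-side entry and the per-core table point at each other, which is what lets local_sid_destroy_all() invalidate everything by clearing just the per-core side. A minimal single-core userspace sketch follows. It replaces the per-CPU variables with plain statics (an assumption made for illustration) and omits the WARN_ON() and preemption handling.

```c
#include <assert.h>
#include <string.h>

#define NUM_TIDS 256

struct id {
	unsigned long val;
	struct id **pentry;
};

/* Stand-ins for the per-CPU pcpu_sids and pcpu_last_used_sid. */
static struct id *pcpu_entry[NUM_TIDS];
static unsigned long last_used_sid;

/* Allocate the next shadow ID, or return -1 when the IDs are exhausted. */
static int sid_setup_one(struct id *entry)
{
	unsigned long sid = ++last_used_sid;

	if (sid >= NUM_TIDS)
		return -1;	/* caller is expected to destroy all and retry */

	pcpu_entry[sid] = entry;
	entry->val = sid;
	entry->pentry = &pcpu_entry[sid];
	return (int)sid;
}

/* A mapping is valid only if both sides agree on it. */
static int sid_lookup(struct id *entry)
{
	if (entry && entry->val != 0 &&
	    pcpu_entry[entry->val] == entry &&
	    entry->pentry == &pcpu_entry[entry->val])
		return (int)entry->val;
	return -1;
}

/* Invalidate every mapping by clearing only the per-core table. */
static void sid_destroy_all(void)
{
	last_used_sid = 0;
	memset(pcpu_entry, 0, sizeof(pcpu_entry));
}
```

Stale values left in a vcpu-side entry are harmless after a reset: the lookup fails because the per-core slot no longer points back at that entry, even if the same ID has since been handed to someone else.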

[RFC PATCH 15/16] KVM: PPC: booke: standard PPC floating point support

2011-12-20 Thread Scott Wood
e500mc has a normal PPC FPU, rather than SPE which is found
on e500v1/v2.

Based on code from Liu Yu .

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/system.h |1 +
 arch/powerpc/kvm/booke.c  |   44 +
 arch/powerpc/kvm/booke.h  |   30 +
 3 files changed, 75 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/system.h b/arch/powerpc/include/asm/system.h
index e30a13d..0561356 100644
--- a/arch/powerpc/include/asm/system.h
+++ b/arch/powerpc/include/asm/system.h
@@ -140,6 +140,7 @@ extern void via_cuda_init(void);
 extern void read_rtc_time(void);
 extern void pmac_find_display(void);
 extern void giveup_fpu(struct task_struct *);
+extern void load_up_fpu(void);
 extern void disable_kernel_fp(void);
 extern void enable_kernel_fp(void);
 extern void flush_fp_to_thread(struct task_struct *);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index cf63b93..4bf43f9 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -460,6 +460,11 @@ void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 {
int ret;
+#ifdef CONFIG_PPC_FPU
+   unsigned int fpscr;
+   int fpexc_mode;
+   u64 fpr[32];
+#endif
 
if (!vcpu->arch.sane) {
kvm_run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
@@ -482,7 +487,46 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
}
 
kvm_guest_enter();
+
+#ifdef CONFIG_PPC_FPU
+   /* Save userspace FPU state in stack */
+   enable_kernel_fp();
+   memcpy(fpr, current->thread.fpr, sizeof(current->thread.fpr));
+   fpscr = current->thread.fpscr.val;
+   fpexc_mode = current->thread.fpexc_mode;
+
+   /* Restore guest FPU state to thread */
+   memcpy(current->thread.fpr, vcpu->arch.fpr, sizeof(vcpu->arch.fpr));
+   current->thread.fpscr.val = vcpu->arch.fpscr;
+
+   /*
+* Since we can't trap on MSR_FP in GS-mode, we consider the guest
+* as always using the FPU.  Kernel usage of FP (via
+* enable_kernel_fp()) in this thread must not occur while
+* vcpu->fpu_active is set.
+*/
+   vcpu->fpu_active = 1;
+
+   kvmppc_load_guest_fp(vcpu);
+#endif
+
ret = __kvmppc_vcpu_run(kvm_run, vcpu);
+
+#ifdef CONFIG_PPC_FPU
+   kvmppc_save_guest_fp(vcpu);
+
+   vcpu->fpu_active = 0;
+
+   /* Save guest FPU state from thread */
+   memcpy(vcpu->arch.fpr, current->thread.fpr, sizeof(vcpu->arch.fpr));
+   vcpu->arch.fpscr = current->thread.fpscr.val;
+
+   /* Restore userspace FPU state from stack */
+   memcpy(current->thread.fpr, fpr, sizeof(current->thread.fpr));
+   current->thread.fpscr.val = fpscr;
+   current->thread.fpexc_mode = fpexc_mode;
+#endif
+
kvm_guest_exit();
 
 out:
diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
index d53bcf2..3bf5eda 100644
--- a/arch/powerpc/kvm/booke.h
+++ b/arch/powerpc/kvm/booke.h
@@ -96,4 +96,34 @@ enum int_class {
 
 void kvmppc_set_pending_interrupt(struct kvm_vcpu *vcpu, enum int_class type);
 
+/*
+ * Load up guest vcpu FP state if it's needed.
+ * It also set the MSR_FP in thread so that host know
+ * we're holding FPU, and then host can help to save
+ * guest vcpu FP state if other threads require to use FPU.
+ * This simulates an FP unavailable fault.
+ *
+ * It requires to be called with preemption disabled.
+ */
+static inline void kvmppc_load_guest_fp(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_PPC_FPU
+   if (vcpu->fpu_active && !(current->thread.regs->msr & MSR_FP)) {
+   load_up_fpu();
+   current->thread.regs->msr |= MSR_FP;
+   }
+#endif
+}
+
+/*
+ * Save guest vcpu FP state into thread.
+ * It requires to be called with preemption disabled.
+ */
+static inline void kvmppc_save_guest_fp(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_PPC_FPU
+   if (vcpu->fpu_active && (current->thread.regs->msr & MSR_FP))
+   giveup_fpu(current);
+#endif
+}
 #endif /* __KVM_BOOKE_H__ */
-- 
1.7.7.rc3.4.g8d714
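
The ordering in kvmppc_vcpu_run() above is: save userspace FP state on the stack, load guest state into the thread, run the guest, copy the guest state back, then restore userspace state. A minimal userspace sketch of that swap follows. The structure and function names are hypothetical, and the actual register loading (load_up_fpu, giveup_fpu) is reduced to plain structure copies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the FP state tracked by the patch. */
struct fp_state_sketch {
	uint64_t fpr[32];
	unsigned int fpscr;
};

typedef int (*guest_fn)(struct fp_state_sketch *live);

/* Run run_guest() with the guest's FP state live in the thread slot. */
static int run_with_guest_fp(struct fp_state_sketch *thread,
			     struct fp_state_sketch *guest,
			     guest_fn run_guest)
{
	struct fp_state_sketch saved;
	int ret;

	assert(thread && guest && run_guest);

	saved = *thread;	/* save userspace FP state on the stack */
	*thread = *guest;	/* make the guest state live */
	ret = run_guest(thread);
	*guest = *thread;	/* capture what the guest left behind */
	*thread = saved;	/* restore userspace FP state */
	return ret;
}

/* Hypothetical guest that bumps FPR0 while it runs. */
static int bump_fpr0(struct fp_state_sketch *live)
{
	live->fpr[0] += 1;
	return 0;
}
```

The symmetry matters: the guest's changes are captured before the host state is restored, so neither side sees the other's register contents.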



[RFC PATCH 12/16] KVM: PPC: e500: emulate tlbilx

2011-12-20 Thread Scott Wood
tlbilx is the new, preferred invalidation instruction.  It is not
found on e500 prior to e500mc, but there should be no harm in
supporting it on all e500.

Based on code from Ashish Kalra .

Signed-off-by: Scott Wood 
---
 arch/powerpc/kvm/e500.h |1 +
 arch/powerpc/kvm/e500_emulate.c |9 ++
 arch/powerpc/kvm/e500_tlb.c |   52 +++
 3 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index f4dee55..ce3f163 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -124,6 +124,7 @@ int kvmppc_e500_emul_mt_mmucsr0(struct kvmppc_vcpu_e500 *vcpu_e500,
 int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu);
 int kvmppc_e500_emul_tlbre(struct kvm_vcpu *vcpu);
 int kvmppc_e500_emul_tlbivax(struct kvm_vcpu *vcpu, int ra, int rb);
+int kvmppc_e500_emul_tlbilx(struct kvm_vcpu *vcpu, int rt, int ra, int rb);
 int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb);
 int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500);
 void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *vcpu_e500);
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index c80794d..af02c18 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -22,6 +22,7 @@
 #define XOP_TLBSX   914
 #define XOP_TLBRE   946
 #define XOP_TLBWE   978
+#define XOP_TLBILX  18
 
 int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
unsigned int inst, int *advance)
@@ -29,6 +30,7 @@ int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
int emulated = EMULATE_DONE;
int ra;
int rb;
+   int rt;
 
switch (get_op(inst)) {
case 31:
@@ -47,6 +49,13 @@ int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
emulated = kvmppc_e500_emul_tlbsx(vcpu,rb);
break;
 
+   case XOP_TLBILX:
+   ra = get_ra(inst);
+   rb = get_rb(inst);
+   rt = get_rt(inst);
+   emulated = kvmppc_e500_emul_tlbilx(vcpu, rt, ra, rb);
+   break;
+
case XOP_TLBIVAX:
ra = get_ra(inst);
rb = get_rb(inst);
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 031fd5b..121cd68 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -631,6 +631,58 @@ int kvmppc_e500_emul_tlbivax(struct kvm_vcpu *vcpu, int ra, int rb)
return EMULATE_DONE;
 }
 
+static void tlbilx_all(struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel,
+  int pid, int rt)
+{
+   struct kvm_book3e_206_tlb_entry *tlbe;
+   int tid, esel;
+
+   /* invalidate all entries */
+   for (esel = 0; esel < vcpu_e500->gtlb_params[tlbsel].entries; esel++) {
+   tlbe = get_entry(vcpu_e500, tlbsel, esel);
+   tid = get_tlb_tid(tlbe);
+   if (rt == 0 || tid == pid) {
+   inval_gtlbe_on_host(vcpu_e500, tlbsel, esel);
+   kvmppc_e500_gtlbe_invalidate(vcpu_e500, tlbsel, esel);
+   }
+   }
+}
+
+static void tlbilx_one(struct kvmppc_vcpu_e500 *vcpu_e500, int pid,
+  int ra, int rb)
+{
+   int tlbsel, esel;
+   gva_t ea;
+
+   ea = kvmppc_get_gpr(&vcpu_e500->vcpu, rb);
+   if (ra)
+   ea += kvmppc_get_gpr(&vcpu_e500->vcpu, ra);
+
+   for (tlbsel = 0; tlbsel < 2; tlbsel++) {
+   esel = kvmppc_e500_tlb_index(vcpu_e500, ea, tlbsel, pid, -1);
+   if (esel >= 0) {
+   inval_gtlbe_on_host(vcpu_e500, tlbsel, esel);
+   kvmppc_e500_gtlbe_invalidate(vcpu_e500, tlbsel, esel);
+   break;
+   }
+   }
+}
+
+int kvmppc_e500_emul_tlbilx(struct kvm_vcpu *vcpu, int rt, int ra, int rb)
+{
+   struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+   int pid = get_cur_spid(vcpu);
+
+   if (rt == 0 || rt == 1) {
+   tlbilx_all(vcpu_e500, 0, pid, rt);
+   tlbilx_all(vcpu_e500, 1, pid, rt);
+   } else if (rt == 3) {
+   tlbilx_one(vcpu_e500, pid, ra, rb);
+   }
+
+   return EMULATE_DONE;
+}
+
 int kvmppc_e500_emul_tlbre(struct kvm_vcpu *vcpu)
 {
struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-- 
1.7.7.rc3.4.g8d714
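
The selection rule in tlbilx_all() above is: with rt == 0 every entry is invalidated, otherwise only entries whose TID equals the current PID. A minimal sketch of that filter over a flat array follows. The entry layout and function name are hypothetical, and the host-side invalidation is reduced to clearing a valid flag.

```c
#include <assert.h>

/* Hypothetical guest TLB entry: just a valid bit and a TID. */
struct gtlbe_sketch {
	int valid;
	int tid;
};

/*
 * Invalidate entries the way tlbilx_all() selects them: every entry when
 * rt == 0, otherwise only entries whose TID equals pid. Returns how many
 * valid entries were taken down.
 */
static int tlbilx_all_sketch(struct gtlbe_sketch *tlb, int entries,
			     int pid, int rt)
{
	int esel, hit = 0;

	assert(tlb != 0 && entries >= 0);

	for (esel = 0; esel < entries; esel++) {
		if (rt == 0 || tlb[esel].tid == pid) {
			hit += tlb[esel].valid;
			tlb[esel].valid = 0;
		}
	}
	return hit;
}
```

The rt == 3 case in the patch is different in kind: it looks up a single entry by effective address instead of scanning, so it is not covered by this sketch.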



[RFC PATCH 16/16] KVM: PPC: e500mc support

2011-12-20 Thread Scott Wood
Add processor support for e500mc, using hardware virtualization support
(GS-mode).

Current issues include:
 - No support for external proxy (coreint) interrupt mode in the guest.

Includes work by Ashish Kalra ,
Varun Sethi , and
Liu Yu .

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/cputable.h   |6 +-
 arch/powerpc/include/asm/kvm.h|1 +
 arch/powerpc/kernel/cpu_setup_fsl_booke.S |1 +
 arch/powerpc/kernel/head_fsl_booke.S  |   46 
 arch/powerpc/kvm/Kconfig  |   17 ++-
 arch/powerpc/kvm/Makefile |   11 +
 arch/powerpc/kvm/e500.h   |   13 +-
 arch/powerpc/kvm/e500_emulate.c   |   24 ++-
 arch/powerpc/kvm/e500_tlb.c   |   21 ++-
 arch/powerpc/kvm/e500mc.c |  342 +
 arch/powerpc/kvm/powerpc.c|6 +-
 11 files changed, 476 insertions(+), 12 deletions(-)
 create mode 100644 arch/powerpc/kvm/e500mc.c

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index a80be60..eddb322 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -168,6 +168,7 @@ extern const char *powerpc_base_platform;
 #define CPU_FTR_LWSYNC ASM_CONST(0x0800)
 #define CPU_FTR_NOEXECUTE  ASM_CONST(0x1000)
 #define CPU_FTR_INDEXED_DCRASM_CONST(0x2000)
+#define CPU_FTR_EMB_HV ASM_CONST(0x4000)
 
 /*
  * Add the 64-bit processor unique features in the top half of the word;
@@ -385,11 +386,11 @@ extern const char *powerpc_base_platform;
CPU_FTR_NODSISRALIGN | CPU_FTR_NOEXECUTE)
 #define CPU_FTRS_E500MC(CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN | \
CPU_FTR_L2CSR | CPU_FTR_LWSYNC | CPU_FTR_NOEXECUTE | \
-   CPU_FTR_DBELL | CPU_FTR_DEBUG_LVL_EXC)
+   CPU_FTR_DBELL | CPU_FTR_DEBUG_LVL_EXC | CPU_FTR_EMB_HV)
 #define CPU_FTRS_E5500 (CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN | \
CPU_FTR_L2CSR | CPU_FTR_LWSYNC | CPU_FTR_NOEXECUTE | \
CPU_FTR_DBELL | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
-   CPU_FTR_DEBUG_LVL_EXC)
+   CPU_FTR_DEBUG_LVL_EXC | CPU_FTR_EMB_HV)
 #define CPU_FTRS_GENERIC_32 (CPU_FTR_COMMON | CPU_FTR_NODSISRALIGN)
 
 /* 64-bit CPUs */
@@ -534,6 +535,7 @@ enum {
 #ifdef CONFIG_PPC_E500MC
CPU_FTRS_E500MC & CPU_FTRS_E5500 &
 #endif
+   ~CPU_FTR_EMB_HV &   /* can be removed at runtime */
CPU_FTRS_POSSIBLE,
 };
 #endif /* __powerpc64__ */
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index fb3fddc..98abdd0 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -280,6 +280,7 @@ struct kvm_guest_debug_arch {
 #define KVM_CPU_E500V2 2
 #define KVM_CPU_3S_32  3
 #define KVM_CPU_3S_64  4
+#define KVM_CPU_E500MC 5
 
 /* for KVM_CAP_SPAPR_TCE */
 struct kvm_create_spapr_tce {
diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
index 8053db0..69fdd23 100644
--- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
+++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
@@ -73,6 +73,7 @@ _GLOBAL(__setup_cpu_e500v2)
mtlr    r4
blr
 _GLOBAL(__setup_cpu_e500mc)
+   mr  r5, r4
mflr    r4
bl  __e500_icache_setup
bl  __e500_dcache_setup
diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index 5701e87..b269b86 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -380,10 +380,16 @@ interrupt_base:
mtspr   SPRN_SPRG_WSCRATCH0, r10 /* Save some working registers */
mfspr   r10, SPRN_SPRG_THREAD
stw r11, THREAD_NORMSAVE(0)(r10)
+#ifdef CONFIG_KVM_BOOKE_HV
+BEGIN_FTR_SECTION
+   mfspr   r11, SPRN_SRR1
+END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
+#endif
stw r12, THREAD_NORMSAVE(1)(r10)
stw r13, THREAD_NORMSAVE(2)(r10)
mfcr    r13
stw r13, THREAD_NORMSAVE(3)(r10)
+   DO_KVM  BOOKE_INTERRUPT_DTLB_MISS SPRN_SRR1
mfspr   r10, SPRN_DEAR  /* Get faulting address */
 
/* If we are faulting a kernel address, we have to use the
@@ -468,10 +474,16 @@ interrupt_base:
mtspr   SPRN_SPRG_WSCRATCH0, r10 /* Save some working registers */
mfspr   r10, SPRN_SPRG_THREAD
stw r11, THREAD_NORMSAVE(0)(r10)
+#ifdef CONFIG_KVM_BOOKE_HV
+BEGIN_FTR_SECTION
+   mfspr   r11, SPRN_SRR1
+END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
+#endif
stw r12, THREAD_NORMSAVE(1)(r10)
stw r13, THREAD_NORMSAVE(2)(r10)
mfcr    r13
stw r13, THREAD_NORMSAVE(3)(r10)
+   DO_KVM  BOOKE_INTERRUPT_ITLB_MISS SPRN_SRR1
mfspr   r10, SPRN_SRR0  /* Get faulting address */
 
/* If we are faulting a kernel address, we hav

[RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2011-12-20 Thread Scott Wood
Chips such as e500mc that implement category E.HV in Power ISA 2.06
provide hardware virtualization features, including a new MSR mode for
guest state.  The guest OS can perform many operations without trapping
into the hypervisor, including transitions to and from guest userspace.

Since we can use SRR1[GS] to reliably tell whether an exception came from
guest state, instead of messing around with IVPR, we use DO_KVM similarly
to book3s.
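
As a rough C illustration of the test DO_KVM performs in assembly (a sketch only, not code from this series): the macro moves the top four bits of the saved SRR1 image into CR0 with mtocrf and branches on CR bit 3, which in a 32-bit MSR image corresponds to the mask 1 << 28.  The names MSR_GS_MASK and exception_from_guest below are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: MSR[GS] is bit 3 (IBM bit numbering) of the 32-bit MSR
 * image, i.e. the bit DO_KVM tests with "bf 3" after "mtocrf 0x80". */
#define MSR_GS_MASK (UINT32_C(1) << (31 - 3))

/* Hypothetical helper: true if the saved SRR1 variant says the exception
 * was taken from guest state and therefore has to be handed to KVM. */
static bool exception_from_guest(uint32_t srr1)
{
	return (srr1 & MSR_GS_MASK) != 0;
}
```

The assembly version has the extra constraint of not clobbering a general-purpose register, which is why it goes through the condition register instead of an and/compare sequence.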

Current issues include:
 - Machine checks from guest state are not routed to the host handler.
 - The guest can cause a host oops by executing an emulated instruction
   in a page that lacks read permission.  Existing e500/4xx support has
   the same problem.

Includes work by Ashish Kalra, Varun Sethi, and Liu Yu.

Signed-off-by: Scott Wood 
---
 arch/powerpc/include/asm/dbell.h|1 +
 arch/powerpc/include/asm/kvm_asm.h  |8 +
 arch/powerpc/include/asm/kvm_booke_hv_asm.h |   49 +++
 arch/powerpc/include/asm/kvm_host.h |   19 +-
 arch/powerpc/include/asm/kvm_ppc.h  |3 +
 arch/powerpc/include/asm/mmu-book3e.h   |6 +
 arch/powerpc/include/asm/processor.h|3 +
 arch/powerpc/include/asm/reg.h  |2 +
 arch/powerpc/include/asm/reg_booke.h|   34 ++
 arch/powerpc/kernel/asm-offsets.c   |   15 +-
 arch/powerpc/kernel/head_booke.h|   28 ++-
 arch/powerpc/kvm/Kconfig|3 +
 arch/powerpc/kvm/booke.c|  398 ++-
 arch/powerpc/kvm/booke.h|   24 +-
 arch/powerpc/kvm/booke_emulate.c|   23 +-
 arch/powerpc/kvm/bookehv_interrupts.S   |  587 +++
 arch/powerpc/kvm/powerpc.c  |5 +
 arch/powerpc/kvm/timing.h   |6 +
 18 files changed, 1107 insertions(+), 107 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_booke_hv_asm.h
 create mode 100644 arch/powerpc/kvm/bookehv_interrupts.S

diff --git a/arch/powerpc/include/asm/dbell.h b/arch/powerpc/include/asm/dbell.h
index efa74ac..d7365b0 100644
--- a/arch/powerpc/include/asm/dbell.h
+++ b/arch/powerpc/include/asm/dbell.h
@@ -19,6 +19,7 @@
 
 #define PPC_DBELL_MSG_BRDCAST  (0x0400)
 #define PPC_DBELL_TYPE(x)  (((x) & 0xf) << (63-36))
+#define PPC_DBELL_LPID(x)  ((x) << (63 - 49))
 enum ppc_dbell {
PPC_DBELL = 0,  /* doorbell */
PPC_DBELL_CRIT = 1, /* critical doorbell */
diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h
index 7b1f0e0..0978152 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -48,6 +48,14 @@
 #define BOOKE_INTERRUPT_SPE_FP_DATA 33
 #define BOOKE_INTERRUPT_SPE_FP_ROUND 34
 #define BOOKE_INTERRUPT_PERFORMANCE_MONITOR 35
+#define BOOKE_INTERRUPT_DOORBELL 36
+#define BOOKE_INTERRUPT_DOORBELL_CRITICAL 37
+
+/* booke_hv */
+#define BOOKE_INTERRUPT_GUEST_DBELL 38
+#define BOOKE_INTERRUPT_GUEST_DBELL_CRIT 39
+#define BOOKE_INTERRUPT_HV_SYSCALL 40
+#define BOOKE_INTERRUPT_HV_PRIV 41
 
 /* book3s */
 
diff --git a/arch/powerpc/include/asm/kvm_booke_hv_asm.h b/arch/powerpc/include/asm/kvm_booke_hv_asm.h
new file mode 100644
index 000..30a600f
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_booke_hv_asm.h
@@ -0,0 +1,49 @@
+/*
+ * Copyright 2010-2011 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef ASM_KVM_BOOKE_HV_ASM_H
+#define ASM_KVM_BOOKE_HV_ASM_H
+
+#ifdef __ASSEMBLY__
+
+/*
+ * All exceptions from guest state must go through KVM
+ * (except for those which are delivered directly to the guest) --
+ * there are no exceptions for which we fall through directly to
+ * the normal host handler.
+ *
+ * Expected inputs (normal exceptions):
+ *   SCRATCH0 = saved r10
+ *   r10 = thread struct
+ *   r11 = appropriate SRR1 variant (currently used as scratch)
+ *   r13 = saved CR
+ *   *(r10 + THREAD_NORMSAVE(0)) = saved r11
+ *   *(r10 + THREAD_NORMSAVE(2)) = saved r13
+ *
+ * Expected inputs (crit/mcheck/debug exceptions):
+ *   appropriate SCRATCH = saved r8
+ *   r8 = exception level stack frame
+ *   r9 = *(r8 + _CCR) = saved CR
+ *   r11 = appropriate SRR1 variant (currently used as scratch)
+ *   *(r8 + GPR9) = saved r9
+ *   *(r8 + GPR10) = saved r10 (r10 not yet clobbered)
+ *   *(r8 + GPR11) = saved r11
+ */
+.macro DO_KVM intno srr1
+#ifdef CONFIG_KVM_BOOKE_HV
+BEGIN_FTR_SECTION
+   mtocrf  0x80, r11   /* check MSR[GS] without clobbering reg */
+   bf  3, kvmppc_resume_\intno\()_\srr1
+   b   kvmppc_handler_\intno\()_\srr1
+kvmppc_resume_\intno\()_\srr1:
+END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
+#endif
+.endm
+
+#endif /*__ASSEMBLY__ */
+#endif /* ASM_KVM_BOOKE_HV_ASM_H */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerp

RE: [PATCH] intel-iommu: Add device info into list before doing context mapping

2011-12-20 Thread Hao, Xudong
Yes, Chris, thanks for your comments.
How about this one?

---
 drivers/iommu/intel-iommu.c |   16 ++--
 1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index a004c39..0fc5efd 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2264,12 +2264,6 @@ static int domain_add_dev_info(struct dmar_domain *domain,
if (!info)
return -ENOMEM;

-   ret = domain_context_mapping(domain, pdev, translation);
-   if (ret) {
-   free_devinfo_mem(info);
-   return ret;
-   }
-
info->segment = pci_domain_nr(pdev->bus);
info->bus = pdev->bus->number;
info->devfn = pdev->devfn;
@@ -2282,6 +2276,16 @@ static int domain_add_dev_info(struct dmar_domain *domain,
pdev->dev.archdata.iommu = info;
spin_unlock_irqrestore(&device_domain_lock, flags);

+   ret = domain_context_mapping(domain, pdev, translation);
+   if (ret) {
+   spin_lock_irqsave(&device_domain_lock, flags);
+   list_del(&info->link);
+   list_del(&info->global);
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+   free_devinfo_mem(info);
+   return ret;
+   }
+
return 0;
 }
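
The ordering in this version (publish the entry under the lock, attempt the mapping, and on failure unlink under the same lock before freeing) can be sketched in user-space C as below.  This is an illustration only, not the driver code: the C11 atomic_flag spinlock stands in for device_domain_lock, map_context() is a hypothetical stand-in for domain_context_mapping(), and the hand-rolled list replaces the kernel's list_head helpers.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

struct dev_info {
	struct dev_info *prev, *next;
	int id;
};

/* Circular list with a sentinel head, guarded by a tiny spinlock. */
static struct dev_info head = { &head, &head, 0 };
static atomic_flag list_lock = ATOMIC_FLAG_INIT;

static void lock_list(void)
{
	while (atomic_flag_test_and_set_explicit(&list_lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void unlock_list(void)
{
	atomic_flag_clear_explicit(&list_lock, memory_order_release);
}

/* Hypothetical mapping step: fails for negative ids. */
static int map_context(const struct dev_info *info)
{
	return info->id < 0 ? -EINVAL : 0;
}

static int add_dev_info(int id)
{
	struct dev_info *info = malloc(sizeof(*info));
	int ret;

	if (!info)
		return -ENOMEM;
	info->id = id;

	/* Publish the entry before attempting the mapping. */
	lock_list();
	info->next = head.next;
	info->prev = &head;
	head.next->prev = info;
	head.next = info;
	unlock_list();

	ret = map_context(info);
	if (ret) {
		/* Roll back under the same lock that guarded the insert. */
		lock_list();
		info->prev->next = info->next;
		info->next->prev = info->prev;
		unlock_list();
		free(info);
	}
	return ret;
}

/* Number of entries currently on the list (excluding the sentinel). */
static int list_len(void)
{
	int n = 0;

	lock_list();
	for (struct dev_info *p = head.next; p != &head; p = p->next)
		n++;
	unlock_list();
	return n;
}
```

Unlinking without the lock, as in the earlier version quoted below, would let another walker observe a half-updated list, which is the locking problem Chris pointed out.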

> -Original Message-
> From: Chris Wright [mailto:chr...@sous-sol.org]
> Sent: Wednesday, December 21, 2011 12:08 AM
> To: Hao, Xudong
> Cc: io...@lists.linux-foundation.org; dw...@infradead.org; Zhang, Xiantao;
> linux-ker...@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH] intel-iommu: Add device info into list before doing context mapping
> 
> * Hao, Xudong (xudong@intel.com) wrote:
> > @@ -2282,6 +2276,14 @@ static int domain_add_dev_info(struct dmar_domain *domain,
> > pdev->dev.archdata.iommu = info;
> > spin_unlock_irqrestore(&device_domain_lock, flags);
> >
> > +   ret = domain_context_mapping(domain, pdev, translation);
> > +   if (ret) {
> > +   list_del(&info->link);
> > +   list_del(&info->global);
> 
> At the very least, this is not correct locking.
> 
> > +   free_devinfo_mem(info);
> > +   return ret;
> > +   }
> > +


Re: [PATCH v2 2/2] kvm: Device assignment permission checks

2011-12-20 Thread Alex Williamson
On Wed, 2011-12-21 at 00:59 +0200, Sasha Levin wrote:
> On Tue, 2011-12-20 at 07:30 -0700, Alex Williamson wrote:
> > Only allow KVM device assignment to attach to devices which:
> > 
> >  - Are not bridges
> >  - Have BAR resources (assume others are special devices)
> >  - The user has permissions to use
> > 
> > Assigning a bridge is a configuration error, it's not supported, and
> > typically doesn't result in the behavior the user is expecting anyway.
> > Devices without BAR resources are typically chipset components that
> > also don't have host drivers.  We don't want users to hold such devices
> > captive or cause system problems by fencing them off into an iommu
> > domain.  We determine "permission to use" by testing whether the user
> > has access to the PCI sysfs resource files.  By default a normal user
> > will not have access to these files, so it provides a good indication
> > that an administration agent has granted the user access to the device.
> > 
> > Signed-off-by: Alex Williamson 
> > ---
> > 
> >  Documentation/virtual/kvm/api.txt |4 +++
> >  virt/kvm/assigned-dev.c   |   55 
> > -
> >  2 files changed, 58 insertions(+), 1 deletions(-)
> > 
> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > index ee2c96b..4df9af4 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -1154,6 +1154,10 @@ following flags are specified:
> >  The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
> >  isolation of the device.  Usages not specifying this flag are deprecated.
> >  
> > +Only PCI header type 0 devices with PCI BAR resources are supported by
> > +device assignment.  The user requesting this ioctl must have read/write
> > +access to the PCI sysfs resource files associated with the device.
> > +
> >  4.49 KVM_DEASSIGN_PCI_DEVICE
> >  
> >  Capability: KVM_CAP_DEVICE_DEASSIGNMENT
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index a251a28..faec641 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > @@ -17,6 +17,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include "irq.h"
> >  
> >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head *head,
> > @@ -483,9 +484,11 @@ out:
> >  static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> >   struct kvm_assigned_pci_dev *assigned_dev)
> >  {
> > -   int r = 0, idx;
> > +   int r = 0, idx, i;
> > struct kvm_assigned_dev_kernel *match;
> > struct pci_dev *dev;
> > +   u8 header_type;
> > +   bool bar_found = false;
> >  
> > if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
> > return -EINVAL;
> > @@ -516,6 +519,56 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> > r = -EINVAL;
> > goto out_free;
> > }
> > +
> > +   /* Don't allow bridges to be assigned */
> > +   pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type);
> > +   if ((header_type & PCI_HEADER_TYPE) != PCI_HEADER_TYPE_NORMAL) {
> > +   r = -EPERM;
> > +   goto out_put;
> > +   }
> > +
> > +   /* We want to test whether the caller has been granted permissions to
> > +* use this device.  To be able to configure and control the device,
> > +* the user needs access to PCI configuration space and BAR resources.
> > +* These are accessed through PCI sysfs.  PCI config space is often
> > +* passed to the process calling this ioctl via file descriptor, so we
> > +* can't rely on access to that file.  We can check for permissions
> > +* on each of the BAR resource files, which is a pretty clear
> > +* indicator that the user has been granted access to the device. */
> > +   for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCE_END; i++) {
> > +   char buf[64];
> > +   struct path path;
> > +   struct inode *inode;
> > +
> > +   if (!pci_resource_len(dev, i))
> > +   continue;
> > +
> > +   /* Per sysfs-rules, sysfs is always at /sys */
> > +   snprintf(buf, sizeof(buf), "/sys/bus/pci/devices/%04x:%02x:"
> > +"%02x.%d/resource%d", pci_domain_nr(dev->bus),
> > +dev->bus->number, PCI_SLOT(dev->devfn),
> > +PCI_FUNC(dev->devfn), i);
> 
> This should probably be done by grabbing devname out of
> 'dev' (kobject_get_path(&dev->dev.kobj, GFP_KERNEL) ) instead of
> formatting it ourselves. This is also mentioned to be always correct in
> sysfs-rules while this method isn't.

Ok, we end up with a lot more dynamic allocations this way, but better
to use well defined methods.

> > +
> > +   r = kern_path(buf, LOOKUP_FOLLOW, &path);
> > +   if (r)
> > +   goto out_put;
> > +
> > +   inode = path.dentry->d_inode;
> > +
> > +   r = inode_permissi

[PATCH v3 0/2] kvm: Lock down device assignment

2011-12-20 Thread Alex Williamson
v2: Update API documentation for each patch
v3: Incorporate Sasha's comments: kobject path, separate func, and CONFIG_SYSFS

Two patches to try to better secure the device assignment ioctl.
The first patch makes KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory
option when assigning a device.  I don't believe we have any
users of this option, so I think we can skip any deprecation
period, especially since its existence is rather dangerous.

The second patch introduces some file permission checking that Avi
suggested.  If a user has been granted read/write permission to
the PCI sysfs BAR resource files, this is a good indication that
they have access to the device.  We can't call sys_faccessat
directly (not exported), but the important bits are self contained
enough to include directly.  This still works with sudo and libvirt
usage, the latter already grants qemu permission to these files.
Thanks,

Alex

---

Alex Williamson (2):
  kvm: Device assignment permission checks
  kvm: Remove ability to assign a device without iommu support


 Documentation/virtual/kvm/api.txt |7 +++
 virt/kvm/assigned-dev.c   |   90 +
 2 files changed, 88 insertions(+), 9 deletions(-)


[PATCH v3 1/2] kvm: Remove ability to assign a device without iommu support

2011-12-20 Thread Alex Williamson
Assigning a device without iommu protection has no users, and allowing
it exposes a security hole.  Make KVM_DEV_ASSIGN_ENABLE_IOMMU a
mandatory option.
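
From the caller's point of view the new rule is simply "no flag, no assignment".  A minimal stand-alone sketch of that check follows; the flag value is the one quoted from api.txt in this patch, while struct assign_request and check_assign_flags() are hypothetical names used only for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Flag value as documented in api.txt. */
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)

/* Hypothetical, cut-down request structure for illustration only. */
struct assign_request {
	uint32_t flags;
};

/* Reject any request that does not ask for iommu isolation. */
static int check_assign_flags(const struct assign_request *req)
{
	if (!(req->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
		return -EINVAL;
	return 0;
}
```

Doing the check before taking any locks, as the patch does, keeps the failure path free of cleanup.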

Signed-off-by: Alex Williamson 
---

 Documentation/virtual/kvm/api.txt |3 +++
 virt/kvm/assigned-dev.c   |   18 +-
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 7945b0b..ee2c96b 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1151,6 +1151,9 @@ following flags are specified:
 /* Depends on KVM_CAP_IOMMU */
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
 
+The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
+isolation of the device.  Usages not specifying this flag are deprecated.
+
 4.49 KVM_DEASSIGN_PCI_DEVICE
 
 Capability: KVM_CAP_DEVICE_DEASSIGNMENT
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 3ad0925..a251a28 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -487,6 +487,9 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
struct kvm_assigned_dev_kernel *match;
struct pci_dev *dev;
 
+   if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
+   return -EINVAL;
+
mutex_lock(&kvm->lock);
idx = srcu_read_lock(&kvm->srcu);
 
@@ -544,16 +547,14 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 
list_add(&match->list, &kvm->arch.assigned_dev_head);
 
-   if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU) {
-   if (!kvm->arch.iommu_domain) {
-   r = kvm_iommu_map_guest(kvm);
-   if (r)
-   goto out_list_del;
-   }
-   r = kvm_assign_device(kvm, match);
+   if (!kvm->arch.iommu_domain) {
+   r = kvm_iommu_map_guest(kvm);
if (r)
goto out_list_del;
}
+   r = kvm_assign_device(kvm, match);
+   if (r)
+   goto out_list_del;
 
 out:
srcu_read_unlock(&kvm->srcu, idx);
@@ -593,8 +594,7 @@ static int kvm_vm_ioctl_deassign_device(struct kvm *kvm,
goto out;
}
 
-   if (match->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
-   kvm_deassign_device(kvm, match);
+   kvm_deassign_device(kvm, match);
 
kvm_free_assigned_device(kvm, match);
 



[PATCH v3 2/2] kvm: Device assignment permission checks

2011-12-20 Thread Alex Williamson
Only allow KVM device assignment to attach to devices which:

 - Are not bridges
 - Have BAR resources (assume others are special devices)
 - The user has permissions to use

Assigning a bridge is a configuration error, it's not supported, and
typically doesn't result in the behavior the user is expecting anyway.
Devices without BAR resources are typically chipset components that
also don't have host drivers.  We don't want users to hold such devices
captive or cause system problems by fencing them off into an iommu
domain.  We determine "permission to use" by testing whether the user
has access to the PCI sysfs resource files.  By default a normal user
will not have access to these files, so it provides a good indication
that an administration agent has granted the user access to the device.
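
For anyone who wants to see ahead of time whether a given user would pass the check, a user-space approximation is sketched below.  It is not the kernel code: it assumes the conventional /sys/bus/pci/devices/DDDD:BB:SS.F/resourceN layout rather than resolving the kobject path, and access(2) tests the real uid/gid, which may differ from the credentials the ioctl path evaluates.  The helper names are made up for this example.

```c
#include <stdio.h>
#include <unistd.h>

/* Build the sysfs path of BAR resource file `bar` for a PCI function.
 * Returns the snprintf() result (length that was or would be written). */
static int resource_path(char *buf, size_t len, unsigned int domain,
			 unsigned int bus, unsigned int slot,
			 unsigned int func, int bar)
{
	return snprintf(buf, len,
			"/sys/bus/pci/devices/%04x:%02x:%02x.%u/resource%d",
			domain, bus, slot, func, bar);
}

/* Returns 1 if the calling user may both read and write the file. */
static int user_can_use_bar(const char *path)
{
	return access(path, R_OK | W_OK) == 0;
}
```

A management tool could loop over resource0 through resource5, skip the files that do not exist, and require read/write access on every one that does, mirroring the per-BAR loop in the patch.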

Signed-off-by: Alex Williamson 
---

 Documentation/virtual/kvm/api.txt |4 ++
 virt/kvm/assigned-dev.c   |   72 +
 2 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ee2c96b..4df9af4 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1154,6 +1154,10 @@ following flags are specified:
 The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
 isolation of the device.  Usages not specifying this flag are deprecated.
 
+Only PCI header type 0 devices with PCI BAR resources are supported by
+device assignment.  The user requesting this ioctl must have read/write
+access to the PCI sysfs resource files associated with the device.
+
 4.49 KVM_DEASSIGN_PCI_DEVICE
 
 Capability: KVM_CAP_DEVICE_DEASSIGNMENT
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index a251a28..da9690e 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "irq.h"
 
 static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head *head,
@@ -480,12 +481,71 @@ out:
return r;
 }
 
+/* We want to test whether the caller has been granted permissions to
+ * use this device.  To be able to configure and control the device,
+ * the user needs access to PCI configuration space and BAR resources.
+ * These are accessed through PCI sysfs.  PCI config space is often
+ * passed to the process calling this ioctl via file descriptor, so we
+ * can't rely on access to that file.  We can check for permissions
+ * on each of the BAR resource files, which is a pretty clear
+ * indicator that the user has been granted access to the device. */
+static int probe_sysfs_permissions(struct pci_dev *dev)
+{
+#ifdef CONFIG_SYSFS
+   int i;
+   bool bar_found = false;
+
+   for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCE_END; i++) {
+   char *kpath, *syspath;
+   struct path path;
+   struct inode *inode;
+   int r;
+
+   if (!pci_resource_len(dev, i))
+   continue;
+
+   kpath = kobject_get_path(&dev->dev.kobj, GFP_KERNEL);
+   if (!kpath)
+   return -ENOMEM;
+
+   /* Per sysfs-rules, sysfs is always at /sys */
+   syspath = kasprintf(GFP_KERNEL, "/sys%s/resource%d", kpath, i);
+   kfree(kpath);
+   if (!syspath)
+   return -ENOMEM;
+
+   r = kern_path(syspath, LOOKUP_FOLLOW, &path);
+   kfree(syspath);
+   if (r)
+   return r;
+
+   inode = path.dentry->d_inode;
+
+   r = inode_permission(inode, MAY_READ | MAY_WRITE | MAY_ACCESS);
+   path_put(&path);
+   if (r)
+   return r;
+
+   bar_found = true;
+   }
+
+   /* If no resources, probably something special */
+   if (!bar_found)
+   return -EPERM;
+
+   return 0;
+#else
+   return -EINVAL; /* No way to control the device without sysfs */
+#endif
+}
+
 static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
  struct kvm_assigned_pci_dev *assigned_dev)
 {
int r = 0, idx;
struct kvm_assigned_dev_kernel *match;
struct pci_dev *dev;
+   u8 header_type;
 
if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
return -EINVAL;
@@ -516,6 +576,18 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
r = -EINVAL;
goto out_free;
}
+
+   /* Don't allow bridges to be assigned */
+   pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type);
+   if ((header_type & PCI_HEADER_TYPE) != PCI_HEADER_TYPE_NORMAL) {
+   r = -EPERM;
+   goto out_put;
+   }
+
+   r = probe_sysfs_permissions(dev);
+   if (r)
+   goto out_put;
+
if (pci_enable_device(dev)) {
printk(K

Re: [PATCH 0/6][RFC] virtio-blk: Change I/O path from request to BIO

2011-12-20 Thread Rusty Russell
On Wed, 21 Dec 2011 10:00:48 +0900, Minchan Kim  wrote:
> This patch is a follow-up to Christoph Hellwig's work
> [RFC: ->make_request support for virtio-blk].
> http://thread.gmane.org/gmane.linux.kernel/1199763
> 
> Quote from hch
> "This patchset allows the virtio-blk driver to support much higher IOP
> rates which can be driven out of modern PCI-e flash devices.  At this
> point it really is just a RFC due to various issues."
> 
> I fixed a race bug and added batch I/O (to improve sequential I/O) and
> FLUSH/FUA emulation.
> 
> I tested this patch on a Fusion I/O device with aio-stress.
> The results are as follows.
> 
> Benchmark : aio-stress (64 thread, test file size 512M, 8K io per IO, O_DIRECT write)
> Environment: 8 socket - 8 core, 2533.372Hz, Fusion IO 320G storage
> Test repeated by 20 times
> Guest I/O scheduler : CFQ
> Host I/O scheduler : NOOP
> 
>            Request          BIO (patch 1-4)   BIO-batch (patch 1-6)
>         (MB/s)   stddev    (MB/s)   stddev    (MB/s)   stddev
> w      737.820    4.063   613.735   31.605   730.288   24.854
> rw     208.754   20.450   314.630   37.352   317.831   41.719
> r      770.974    2.340   347.483   51.370   750.324    8.280
> rr     250.391   16.910   350.053   29.986   325.976   24.846

So, you dropped w and r down 2%, but rw and rr up 40%.
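
(For anyone who wants to recompute the deltas, here is a small sketch that takes the Request and BIO-batch columns from the quoted table and computes the relative change per row.  The struct and helper names are made up for this example; the numbers are copied from the table.)

```c
/* Percentage change going from `base` to `value`. */
static double pct_change(double base, double value)
{
	return (value - base) / base * 100.0;
}

/* Throughput in MB/s copied from the quoted results table. */
struct row {
	const char *name;
	double request;		/* request-based path */
	double bio_batch;	/* BIO path with batching, patches 1-6 */
};

static const struct row rows[] = {
	{ "w",  737.820, 730.288 },
	{ "rw", 208.754, 317.831 },
	{ "r",  770.974, 750.324 },
	{ "rr", 250.391, 325.976 },
};
```

Calling pct_change(rows[i].request, rows[i].bio_batch) for each row gives the per-row figure; the stddev columns are ignored here, so small differences should be read with that in mind.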

If I knew what the various rows were, I'd have something intelligent to
say, I'm sure :)

I can find the source to aio-stress, but no obvious clues.

Help!
Rusty.


[PATCH] virtio: harsher barriers for virtio-mmio.

2011-12-20 Thread Rusty Russell
We were cheating with our barriers; using the smp ones rather than the
real device ones.  That was fine, until virtio-mmio came along, which
could be talking to a real device (a non-SMP CPU).

Unfortunately, just putting back the real barriers (reverting
d57ed95d) causes a performance regression on virtio-pci.  In
particular, Amos reports netbench's TCP_RR over virtio_net CPU
utilization increased up to 35% while throughput went down by up to
14%.

By comparison, this branch is in the noise.
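
The idea is a per-virtqueue flag chosen by the transport at setup time, with the barrier helpers picking the cheap or the heavy variant at run time.  A user-space analogy in C11 is sketched below; it is not the kernel code.  The fences merely stand in for smp_wmb() and wmb() (C11 fences say nothing about MMIO ordering), and struct ring, enum barrier_kind and ring_wmb() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct ring {
	/* true: the other side is just another CPU (e.g. a hypervisor);
	 * false: the other side could be an independent device. */
	bool weak_barriers;
};

enum barrier_kind { BARRIER_SMP, BARRIER_MANDATORY };

/* Write barrier selected per ring; returns which kind was issued so
 * the choice can be observed in a test. */
static enum barrier_kind ring_wmb(const struct ring *r)
{
	if (r->weak_barriers) {
		/* stand-in for smp_wmb() */
		atomic_thread_fence(memory_order_release);
		return BARRIER_SMP;
	}
	/* stand-in for wmb() */
	atomic_thread_fence(memory_order_seq_cst);
	return BARRIER_MANDATORY;
}
```

The cost of this design is one predictable branch per barrier, which fits the claim that the change is lost in the noise on virtio-pci.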

Reference: https://lkml.org/lkml/2011/12/11/22

Signed-off-by: Rusty Russell 
---
 drivers/lguest/lguest_device.c |   10 ++
 drivers/s390/kvm/kvm_virtio.c  |2 +-
 drivers/virtio/virtio_mmio.c   |7 ---
 drivers/virtio/virtio_pci.c|4 ++--
 drivers/virtio/virtio_ring.c   |   34 +-
 include/linux/virtio_ring.h|1 +
 tools/virtio/linux/virtio.h|1 +
 tools/virtio/virtio_test.c |3 ++-
 8 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -291,11 +291,13 @@ static struct virtqueue *lg_find_vq(stru
}
 
/*
-* OK, tell virtio_ring.c to set up a virtqueue now we know its size
-* and we've got a pointer to its pages.
+* OK, tell virtio_ring.c to set up a virtqueue now we know its size
+* and we've got a pointer to its pages.  Note that we set weak_barriers
+* to 'true': the host is just a(nother) SMP CPU, so we only need inter-cpu
+* barriers.
 */
-   vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN,
-vdev, lvq->pages, lg_notify, callback, name);
+   vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
+true, lvq->pages, lg_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto unmap;
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(str
goto out;
 
vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
-vdev, (void *) config->address,
+vdev, true, (void *) config->address,
 kvm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -309,9 +309,10 @@ static struct virtqueue *vm_setup_vq(str
writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 
-   /* Create the vring */
-   vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN,
-vdev, info->queue, vm_notify, callback, name);
+   /* Create the vring: no weak barriers, the other side could
+* be an independent "device". */
+   vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+false, info->queue, vm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -414,8 +414,8 @@ static struct virtqueue *setup_vq(struct
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
-   vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN,
-vdev, info->queue, vp_notify, callback, name);
+   vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
+true, info->queue, vp_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto out_activate_queue;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -28,17 +28,20 @@
 #ifdef CONFIG_SMP
 /* Where possible, use SMP barriers which are more lightweight than mandatory
  * barriers, because mandatory barriers control MMIO effects on accesses
- * through relaxed memory I/O windows (which virtio does not use). */
-#define virtio_mb() smp_mb()
-#define virtio_rmb() smp_rmb()
-#define virtio_wmb() smp_wmb()
+ * through relaxed memory I/O windows (which virtio-pci does not use). */
+#define virtio_mb(vq) \
+   do { if ((vq)->weak_barriers) smp_mb(); else mb(); } while(0)
+#define virtio_rmb(vq) \
+   do { if ((vq)->weak_barriers) smp_rmb(); else rmb(); } while(0)
+#define virtio_wmb(vq) \
+   do { if ((vq)->weak_barriers) smp_wmb(); else wmb(); } w
