[PATCH] pci: clean all funcs when hot-removing multifunc device

2011-09-13 Thread Amos Kong

CC: kvm@vger.kernel.org

- Original Message -

'slot->funcs' is initialized in acpiphp_glue.c:register_slot() before a
device is hotplugged, and only one entry (func 0) is added to it;
no new entry is added to the list when devices are hotplugged into the slot.
When we release the whole device, there is only one entry in the list,
so funcs 1~7 cannot be released.
I tried to add entries for all hotplugged devices in enable_device(), but
it doesn't work, because 'slot->funcs' is used in many places where we only
need to process func 0. So this patch instead cleans up all funcs in
disable_device().

drivers/pci/hotplug/acpiphp_glue.c:
static int disable_device(struct acpiphp_slot *slot) {
list_for_each_entry(func, &slot->funcs, sibling) {
pdev = pci_get_slot(slot->bridge->pci_bus,
   PCI_DEVFN(slot->device, func->function));
...cleanup code...  // this code is executed only once (for func 0)
pci_remove_bus_device(pdev);
---
pci_bus_add_device() is called for each func device in 
acpiphp_glue.c:enable_device().
pci_remove_bus_device(pdev) is only called for func 0 in 
acpiphp_glue.c:disable_device().

Boot up a KVM guest and hotplug a multifunction device (8 funcs); we can see it in the
guest.
@ ls /dev/vd*
   vda vdb vdc vde vdf vdg vdh
@ lspci
   00:06.0 SCSI storage controller: Red Hat, Inc Virtio block device
   ...
   00:06.7 SCSI storage controller: Red Hat, Inc Virtio block device

But funcs 1~7 still exist in the guest after hot-removing the multifunction
device through the qemu monitor.
@ lspci (00:06.0 disappeared)
   00:06.1 SCSI storage controller: Red Hat, Inc Virtio block device (rev ff)
   ...
   00:06.7 SCSI storage controller: Red Hat, Inc Virtio block device (rev ff)
 
@ ls /dev/vd*
   vdb vdc vde vdf vdg vdh
@ mkfs /dev/vdb
   INFO: task mkfs.ext2:1784 blocked for more than 120 seconds. (task hung)

Multifunction hotplug with a WinXP guest works fine.

Signed-off-by: Amos Kong 
---
 drivers/pci/hotplug/acpiphp_glue.c |   27 ++-
 1 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/hotplug/acpiphp_glue.c 
b/drivers/pci/hotplug/acpiphp_glue.c
index a70fa89..3b86d1a 100644
--- a/drivers/pci/hotplug/acpiphp_glue.c
+++ b/drivers/pci/hotplug/acpiphp_glue.c
@@ -880,6 +880,8 @@ static int disable_device(struct acpiphp_slot *slot)
 {
 struct acpiphp_func *func;
 struct pci_dev *pdev;
+struct pci_bus *bus = slot->bridge->pci_bus;
+int i, num = 1;
 
 /* is this slot already disabled? */
 if (!(slot->flags & SLOT_ENABLED))
@@ -893,16 +895,23 @@ static int disable_device(struct acpiphp_slot *slot)
 func->bridge = NULL;
 }
 
-pdev = pci_get_slot(slot->bridge->pci_bus,
-PCI_DEVFN(slot->device, func->function));
-if (pdev) {
-pci_stop_bus_device(pdev);
-if (pdev->subordinate) {
-disable_bridges(pdev->subordinate);
-pci_disable_device(pdev);
+pdev = pci_scan_single_device(bus,
+PCI_DEVFN(slot->device, 0));
+if (!pdev)
+goto err_exit;
+if (pdev->multifunction == 1)
+num = 8;
+for (i=0; i<num; i++) {
+pdev = pci_get_slot(bus, PCI_DEVFN(slot->device, i));
+if (pdev) {
+pci_stop_bus_device(pdev);
+if (pdev->subordinate) {
+disable_bridges(pdev->subordinate);
+pci_disable_device(pdev);
+}
+pci_remove_bus_device(pdev);
+pci_dev_put(pdev);
 }
-pci_remove_bus_device(pdev);
-pci_dev_put(pdev);
 }
 }
 
-- 
1.7.6.1



RE: [PATCH 2/2] Qemu co-operation with kvm tsc deadline timer

2011-09-13 Thread Liu, Jinsong
Jan Kiszka wrote:
> On 2011-09-13 16:38, Liu, Jinsong wrote:
>> From c1b502d6548fcc41592cd90acc82109ee949df75 Mon Sep 17 00:00:00 2001
>> From: Liu, Jinsong 
>> Date: Tue, 13 Sep 2011 22:05:30 +0800
>> Subject: [PATCH] Qemu co-operation with kvm tsc deadline timer
>> 
>> KVM add emulation of lapic tsc deadline timer for guest.
>> This patch is co-operation work at qemu side.
>> 
>> Signed-off-by: Liu, Jinsong 
>> ---
>>  target-i386/cpu.h |2 ++
>>  target-i386/kvm.c |   14 ++
>>  2 files changed, 16 insertions(+), 0 deletions(-)
>> 
>> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
>> index 935d08a..62ff73c 100644
>> --- a/target-i386/cpu.h
>> +++ b/target-i386/cpu.h
>> @@ -283,6 +283,7 @@
>>  #define MSR_IA32_APICBASE_BSP   (1<<8)
>>  #define MSR_IA32_APICBASE_ENABLE(1<<11)
>>  #define MSR_IA32_APICBASE_BASE  (0xf<<12)
>> +#define MSR_IA32_TSCDEADLINE0x6e0
>> 
>>  #define MSR_MTRRcap 0xfe
>>  #define MSR_MTRRcap_VCNT8
>> @@ -687,6 +688,7 @@ typedef struct CPUX86State {
>>  uint64_t async_pf_en_msr;
>> 
>>  uint64_t tsc;
>> +uint64_t tsc_deadline;
> 
> This field has to be saved/restored for snapshots/migrations.
> 
> Frankly, I've no clue right now if substates are in vogue again (they
> had problems in their binary format) or if you can simply add a
> versioned top-level field and bump the CPUState version number.
> 

Yes, it should be saved/restored. After migration, tsc_deadline will be written back to
MSR_IA32_TSCDEADLINE so that the tsc deadline timer interrupt is re-armed.
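
For what it's worth, a minimal sketch of how the new field could be carried across
migration as a vmstate subsection in target-i386/machine.c (names here are placeholders;
only the tsc_deadline field comes from the patch):

static bool tsc_deadline_needed(void *opaque)
{
    CPUState *env = opaque;

    /* only send the subsection when the timer is actually armed */
    return env->tsc_deadline != 0;
}

static const VMStateDescription vmstate_tsc_deadline = {
    .name = "cpu/tsc_deadline",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT64(tsc_deadline, CPUState),
        VMSTATE_END_OF_LIST()
    }
};
/* hooked into vmstate_cpu's .subsections[] with .needed = tsc_deadline_needed */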

>> 
>>  uint64_t mcg_status;
>> 
>> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
>> index aa843f0..206fcad 100644
>> --- a/target-i386/kvm.c
>> +++ b/target-i386/kvm.c
>> @@ -59,6 +59,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
>> 
>>  static bool has_msr_star;
>>  static bool has_msr_hsave_pa;
>> +static bool has_msr_tsc_deadline;
>>  static bool has_msr_async_pf_en;
>>  static int lm_capable_kernel;
>> 
>> @@ -571,6 +572,10 @@ static int kvm_get_supported_msrs(KVMState *s)
>>  has_msr_hsave_pa = true;
>>  continue;
>>  }
>> +if (kvm_msr_list->indices[i] == MSR_IA32_TSCDEADLINE) {
>> +has_msr_tsc_deadline = true;
>> +continue;
>> +}
>>  }
>>  }
>> 
>> @@ -899,6 +904,9 @@ static int kvm_put_msrs(CPUState *env, int level)
>>  if (has_msr_hsave_pa) {
>>  kvm_msr_entry_set(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
>>  }
>> +if (has_msr_tsc_deadline) {
>> +kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
>> +}
>>  #ifdef TARGET_X86_64
>>  if (lm_capable_kernel) {
>>  kvm_msr_entry_set(&msrs[n++], MSR_CSTAR, env->cstar);
>> @@ -1145,6 +1153,9 @@ static int kvm_get_msrs(CPUState *env)
>>  if (has_msr_hsave_pa) {
>>  msrs[n++].index = MSR_VM_HSAVE_PA;
>>  }
>> +if (has_msr_tsc_deadline) {
>> +msrs[n++].index = MSR_IA32_TSCDEADLINE;
>> +}
>> 
>>  if (!env->tsc_valid) {
>>  msrs[n++].index = MSR_IA32_TSC;
>> @@ -1213,6 +1224,9 @@ static int kvm_get_msrs(CPUState *env)
>>  case MSR_IA32_TSC:
>>  env->tsc = msrs[i].data;
>>  break;
>> +case MSR_IA32_TSCDEADLINE:
>> +env->tsc_deadline = msrs[i].data;
>> +break;
>>  case MSR_VM_HSAVE_PA:
>>  env->vm_hsave = msrs[i].data;
>>  break;
> 
> Just to double check: This feature is exposed to the guest when A) the
> host CPU supports it and B) QEMU passed down guest CPU specifications
> (cpuid data) that allow it as well?
> 
> Jan

Yes.
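
(For reference: the guest-visible bit Jan is asking about is CPUID leaf 1, ECX bit 24,
the same bit the KVM patch calls X86_FEATURE_TSC_DEADLINE_TIMER. A purely illustrative
user-space check, not part of these patches:

#include <stdbool.h>
#include <cpuid.h>   /* GCC helper; illustrative only */

static bool has_tsc_deadline_timer(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return ecx & (1u << 24);    /* CPUID.01H:ECX[24] = TSC-deadline timer */
}
)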

Thanks,
Jinsong



Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-13 Thread Andi Kleen
On Tue, Sep 13, 2011 at 04:53:18PM -0400, Don Zickus wrote:
> On Tue, Sep 13, 2011 at 09:58:38PM +0200, Andi Kleen wrote:
> > > Or are you saying an NMI in an idle system will have the same %rip thus
> > > falsely detecting a back-to-back NMI?
> > 
> > Yup.
> 
> Hmm.  That sucks.  Is there another register that can be used in
> conjunction to seperate it, like sp or something?  Or we can we assume

Not that I know of.

> than an idle cpu isn't doing much for local NMI IPIs and that the only
> NMIs that would interrupt it would be external ones?

There can be still activity on the "idle" system, e.g. SMM or some 
Hypervisor in the background. If you profile events those might trigger 
samples, but they will be all accounted to the MWAIT.

> > Another problem is very long running instructions, like WBINVD and some 
> > others.
> > If there's a high frequency NMI it may well hit multiple times in a single
> >  instance.
> 
> I thought NMIs happen on instruction boundaries, maybe not.

Yes, but there may be multiple queued up when the instruction has finished
executing. So you get multiple at the boundary.

> Honestly, our current implementation would probably be tripped up with
> those examples too, so I don't think I am making things worse (assuming
> the only high frequency NMI is coming from perf).

I wish perf/oprofile would just stop using NMIs.  The interrupt off regions
are getting smaller and smaller and they can be profiled in a limited way
using PEBS anyways. Or maybe make it a knob with default to off.

-Andi



Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-13 Thread Don Zickus
On Tue, Sep 13, 2011 at 09:58:38PM +0200, Andi Kleen wrote:
> > Or are you saying an NMI in an idle system will have the same %rip thus
> > falsely detecting a back-to-back NMI?
> 
> Yup.

Hmm.  That sucks.  Is there another register that can be used in
conjunction to seperate it, like sp or something?  Or we can we assume
than an idle cpu isn't doing much for local NMI IPIs and that the only
NMIs that would interrupt it would be external ones?

> 
> Another problem is very long running instructions, like WBINVD and some 
> others.
> If there's a high frequency NMI it may well hit multiple times in a single
>  instance.

I thought NMIs happen on instruction boundaries, maybe not.

Honestly, our current implementation would probably be tripped up with
those examples too, so I don't think I am making things worse (assuming
the only high frequency NMI is coming from perf).

Cheers,
Don


Re: [PATCH 0/4] Avoid soft lockup message when KVM is stopped by host

2011-09-13 Thread Eric B Munson
On Fri, 09 Sep 2011, Marcelo Tosatti wrote:

> On Thu, Sep 01, 2011 at 02:27:49PM -0600, emun...@mgebm.net wrote:
> > On Thu, 01 Sep 2011 14:24:12 -0500, Anthony Liguori wrote:
> > >On 08/30/2011 07:26 AM, Marcelo Tosatti wrote:
> > >>On Mon, Aug 29, 2011 at 05:27:11PM -0600, Eric B Munson wrote:
> > >>>Currently, when qemu stops a guest kernel that guest will
> > >>>issue a soft lockup
> > >>>message when it resumes.  This set provides the ability for
> > >>>qemu to comminucate
> > >>>to the guest that it has been stopped.  When the guest hits
> > >>>the watchdog on
> > >>>resume it will check if it was suspended before issuing the
> > >>>warning.
> > >>>
> > >>>Eric B Munson (4):
> > >>>   Add flag to indicate that a vm was stopped by the host
> > >>>   Add functions to check if the host has stopped the vm
> > >>>   Add generic stubs for kvm stop check functions
> > >>>   Add check for suspended vm in softlockup detector
> > >>>
> > >>>  arch/x86/include/asm/pvclock-abi.h |1 +
> > >>>  arch/x86/include/asm/pvclock.h |2 ++
> > >>>  arch/x86/kernel/kvmclock.c |   14 ++
> > >>>  include/asm-generic/pvclock.h  |   14 ++
> > >>>  kernel/watchdog.c  |   12 
> > >>>  5 files changed, 43 insertions(+), 0 deletions(-)
> > >>>  create mode 100644 include/asm-generic/pvclock.h
> > >>>
> > >>>--
> > >>>1.7.4.1
> > >>
> > >>How is the host supposed to set this flag?
> > >>
> > >>As mentioned previously, if you save save/restore the offset
> > >>added to
> > >>kvmclock on stop/cont (and the TSC MSR, forgot to mention that), no
> > >>paravirt infrastructure is required. Which means the issue is
> > >>also fixed
> > >>for older guests.
> > >

Marcelo,

I think that stopping the TSC is the wrong approach because it will break time
between the two systems, so anything that expects the monotonic clock to move
consistently will be wrong.  IMO, messing with the TSC at run time to avoid a
watchdog message is the wrong solution; better to teach the watchdog to ignore
this special case.

Thanks,
Eric


signature.asc
Description: Digital signature


Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-13 Thread Andi Kleen
> Or are you saying an NMI in an idle system will have the same %rip thus
> falsely detecting a back-to-back NMI?

Yup.

Another problem is very long running instructions, like WBINVD and some others.
If there's a high frequency NMI it may well hit multiple times in a single
 instance.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-13 Thread Don Zickus
On Tue, Sep 13, 2011 at 09:03:20PM +0200, Andi Kleen wrote:
> > So I got around to implementing this and it seems to work great.  The back
> > to back NMIs are detected properly using the %rip and that info is passed to
> > the NMI notifier.  That info is used to determine if only the first
> > handler to report 'handled' is executed or _all_ the handlers are
> > executed.
> > 
> > I think all the 'unknown' NMIs I generated with various perf runs have
> > disappeared.  I'll post a new version of my nmi notifier rewrite soon.
> 
> This will fail when the system is idle.

Heh.  I don't think I explained what I was doing properly.

I had a bunch of perf runs going simultaneously on my Core2quad.  Probably
generated around 40,000 NMIs in a few minutes.  I think around a 1000 or
so detected back-to-back NMIs.

With the current NMI detection algorithm in perf to determine back-to-back
NMIs, I can usually use the above scenario to generate an 'unknown' NMI.
Actually numerous ones before the box freezes. It is a false positive.

With my current code and Avi's idea, all those disappeared as they are now
'swallowed' by the algorithm.  That seemed like a positive.

However, being cautious, I decided to instrument lkdtm to inject NMIs to
force unknown NMI conditions
(apic->send_IPI_mask(cpumask_of(smp_processor_id()), NMI_VECTOR))

I tried to inject the NMIs while my trace_printk buffer was flooding my
screen with 'back-to-back NMIs detected'.  I did this 4 or 5 times and
every single one of them were detected as an 'unknown' NMI.  So this was
good too, in the sense it was not swallowing 'real' unknown NMIs.

Does that make sense?

Or are you saying an NMI in an idle system will have the same %rip thus
falsely detecting a back-to-back NMI?

Cheers,
Don

> 
> -Andi
> 


[PATCH v2 1/6] iommu/core: split mapping to page sizes as supported by the hardware

2011-09-13 Thread Ohad Ben-Cohen
When mapping a memory region, split it to page sizes as supported
by the iommu hardware. Always prefer bigger pages, when possible,
in order to reduce the TLB pressure.

The logic to do that is now added to the IOMMU core, so neither the iommu
drivers themselves nor users of the IOMMU API have to duplicate it.

This allows a more lenient granularity of mappings; traditionally the
IOMMU API took 'order' (of a page) as a mapping size, and directly let
the low level iommu drivers handle the mapping, but now that the IOMMU
core can split arbitrary memory regions into pages, we can remove this
limitation, so users don't have to split those regions by themselves.
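
As a rough sketch of that splitting logic (illustrative only; not the code in this
patch, whose iommu_map() hunk is truncated below), the core is picking the largest
supported page size that fits the current alignment and the remaining length:

/* Sketch only; function and variable names are illustrative. */
static size_t pick_pgsize(unsigned long pgsize_bitmap,
			  unsigned long iova, phys_addr_t paddr, size_t size)
{
	unsigned long addr_merge = iova | paddr;
	int bit;

	for (bit = __fls(size); bit >= 0; bit--) {
		size_t pgsize = 1UL << bit;

		if (!(pgsize_bitmap & pgsize))
			continue;		/* not supported by the hardware */
		if (addr_merge & (pgsize - 1))
			continue;		/* iova/paddr not aligned to it */
		return pgsize;			/* biggest page size that fits */
	}

	return 0;				/* caller treats this as an error */
}

iommu_map() then maps pgsize bytes, advances iova/paddr, subtracts pgsize from size,
and loops until nothing is left.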

Currently the supported page sizes are advertised once and they then
remain static. That works well for OMAP (and seemingly MSM too) but
it would probably not fly with intel's hardware, where the page size
capabilities seem to have the potential to be different between
several DMA remapping devices. This limitation can be dealt with
later, if desired. For now, the existing IOMMU API behavior is retained
(see: "iommu/intel: announce supported page sizes").

As requested, register_iommu() isn't changed yet, so we can convert
the IOMMU drivers in subsequent patches, and after all the drivers
are converted, register_iommu will be changed (and the temporary
register_iommu_pgsize() will be removed).

Mainline users of the IOMMU API (kvm and omap-iovmm) are adopted
to send the mapping size in bytes instead of in page order.

Signed-off-by: Ohad Ben-Cohen 
Cc: David Brown 
Cc: David Woodhouse 
Cc: Joerg Roedel 
Cc: Stepan Moskovchenko 
Cc: Hiroshi DOYU 
Cc: Laurent Pinchart 
Cc: kvm@vger.kernel.org
---
v1->v2: keep old code around until all drivers are converted

 drivers/iommu/iommu.c  |  158 +---
 drivers/iommu/omap-iovmm.c |   12 +---
 include/linux/iommu.h  |6 +-
 virt/kvm/iommu.c   |4 +-
 4 files changed, 157 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c68ff29..c848f14 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -16,6 +16,8 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
  */
 
+#define pr_fmt(fmt)"%s: " fmt, __func__
+
 #include 
 #include 
 #include 
@@ -23,15 +25,73 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct iommu_ops *iommu_ops;
 
+/* bitmap of supported page sizes */
+static unsigned long *iommu_pgsize_bitmap;
+
+/* number of bits used to represent the supported pages */
+static unsigned int iommu_nr_page_bits;
+
+/* size of the smallest supported page (in bytes) */
+static unsigned int iommu_min_pagesz;
+
+/* bit number of the smallest supported page */
+static unsigned int iommu_min_page_idx;
+
+/**
+ * register_iommu() - register an IOMMU hardware
+ * @ops: iommu handlers
+ * @pgsize_bitmap: bitmap of page sizes supported by the hardware
+ * @nr_page_bits: size of @pgsize_bitmap (in bits)
+ *
+ * Note: this is a temporary function, which will be removed once
+ * all IOMMU drivers are converted. The only reason it exists is to
+ * allow splitting the pgsizes changes to several patches in order to ease
+ * the review.
+ */
+void register_iommu_pgsize(struct iommu_ops *ops, unsigned long *pgsize_bitmap,
+   unsigned int nr_page_bits)
+{
+   if (iommu_ops || iommu_pgsize_bitmap || !nr_page_bits)
+   BUG();
+
+   iommu_ops = ops;
+   iommu_pgsize_bitmap = pgsize_bitmap;
+   iommu_nr_page_bits = nr_page_bits;
+
+   /* find the minimum page size and its index only once */
+   iommu_min_page_idx = find_first_bit(pgsize_bitmap, nr_page_bits);
+   iommu_min_pagesz = 1 << iommu_min_page_idx;
+}
+
+/*
+ * default pagesize bitmap, will be removed once all IOMMU drivers
+ * are converted
+ */
+static unsigned long default_iommu_pgsizes = ~0xFFFUL;
+
 void register_iommu(struct iommu_ops *ops)
 {
if (iommu_ops)
BUG();
 
iommu_ops = ops;
+
+   /*
+* set default pgsize values, which retain the existing
+* IOMMU API behavior: drivers will be called to map
+* regions that are sized/aligned to order of 4KB pages
+*/
+   iommu_pgsize_bitmap = &default_iommu_pgsizes;
+   iommu_nr_page_bits = BITS_PER_LONG;
+
+   /* find the minimum page size and its index only once */
+   iommu_min_page_idx = find_first_bit(iommu_pgsize_bitmap,
+   iommu_nr_page_bits);
+   iommu_min_pagesz = 1 << iommu_min_page_idx;
 }
 
 bool iommu_found(void)
@@ -109,26 +169,104 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
 EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
 
 int iommu_map(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, int gfp_order, int prot)
+ phys_addr_t paddr, size_t size, int prot)
 {
-   size_t size;
+   int ret = 0;
+
+

Re: [PATCH v3 01/11] KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write

2011-09-13 Thread Xiao Guangrong
On 09/13/2011 06:50 PM, Avi Kivity wrote:
> On 09/13/2011 01:24 PM, Xiao Guangrong wrote:
>> On 09/13/2011 05:51 PM, Avi Kivity wrote:
>> >  On 08/30/2011 05:34 AM, Xiao Guangrong wrote:
>> >>  kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
>> >>  function when spte is prefetched, unfortunately, we can not know how many
>> >>  spte need to be prefetched on this path, that means we can use out of the
>> >>  free  pte_list_desc object in the cache, and BUG_ON() is triggered, also 
>> >> some
>> >>  path does not fill the cache, such as INS instruction emulated that does 
>> >> not
>> >>  trigger page fault
>> >>
>> >>  @@ -3716,10 +3735,6 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, 
>> >> gva_t cr2, u32 error_code,
>> >>goto out;
>> >>}
>> >>
>> >>  -r = mmu_topup_memory_caches(vcpu);
>> >>  -if (r)
>> >>  -goto out;
>> >>  -
>> >>er = x86_emulate_instruction(vcpu, cr2, 0, insn, insn_len);
>> >>
>> >
>> >  Suppose we are out of memory, can't this get us in an endless loop?
>> >
>> >  return -ENOMEM breaks as out (and kills the guest, likely).
>> >
>>
>> If memory is not enough, we just clear sptes on pte_write path(not prefetch 
>> spte),
>> the later page fault path can return -1 to let guest crash. Hmm?
>>
> 
> Yes.
> 
> btw, is rmap_can_add() sufficent?  We allocate more than just rmaps in 
> mmu_topup_memory_caches().  I guess it is, but this is getting tricky.
> 

rmap_can_add() is used to avoid prefetching more sptes than the number of rmaps
in the cache, because we do not know the exact number of sptes to be fetched. :-)
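
For readers following along, a minimal sketch of that guard (the cache field name
is the one used by mmu.c at this point; treat the exact form as illustrative):

/* stop prefetching once the pte_list_desc cache has no free objects left,
 * since we cannot know in advance how many sptes will be added */
static bool rmap_can_add(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu_memory_cache *cache;

	cache = &vcpu->arch.mmu_pte_list_desc_cache;
	return cache->nobjs > 0;
}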




Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-13 Thread Don Zickus
On Tue, Sep 13, 2011 at 09:03:20PM +0200, Andi Kleen wrote:
> > So I got around to implementing this and it seems to work great.  The back
> > to back NMIs are detected properly using the %rip and that info is passed to
> > the NMI notifier.  That info is used to determine if only the first
> > handler to report 'handled' is executed or _all_ the handlers are
> > executed.
> > 
> > I think all the 'unknown' NMIs I generated with various perf runs have
> > disappeared.  I'll post a new version of my nmi notifier rewrite soon.
> 
> This will fail when the system is idle.

Oh one other thing I forgot to mention is that an NMI handler has to
return a value greater than 1, meaning that it handle multiple NMI events
during this NMI to even enable the 'NMI swallowing' algorithm.

Cheers,
Don


Re: [PATCH v3 11/11] KVM: MMU: improve write flooding detected

2011-09-13 Thread Xiao Guangrong
On 09/13/2011 07:07 PM, Avi Kivity wrote:
> On 08/30/2011 05:38 AM, Xiao Guangrong wrote:
>> Detecting write-flooding does not work well: when we handle a page write, if
>> the last speculative spte is not accessed, we treat the page as
>> write-flooding. However, we can create speculative sptes on many paths, such as pte
>> prefetch and page sync, which means the last speculative spte may not point
>> to the written page, and the written page can be accessed via other sptes, so
>> depending on the Accessed bit of the last speculative spte is not enough.
>>
>> Instead of detecting whether the page is accessed, we can detect whether the spte
>> is accessed after it is written; if the spte is not accessed but is written
>> frequently,
>> we treat it as not being a page table, or as not used for a long time.
>>
>>
> 
> The spte may not be accessed, but other sptes in the same page can be 
> accessed.  An example is the fixmap area for kmap_atomic(), there will be a 
> lot of pte writes but other sptes will be accessed without going through 
> soft-mmu at all.

I think this kind of shadow page is mostly a last-level page table (level=1); maybe
we can skip the write-flooding detection for last-level shadow pages, because a
last-level shadow page can become unsync and so does not keep the page table write-protected.
 
> I think you have to read the parent_ptes->spte.accessed bits to be sure.
> 

I guess the overhead of this approach is a little high:
- it needs to walk the parent ptes for every shadow page
- we need to clear the parent_ptes->spte.accessed bit when the page is written,
  and a tlb flush is needed.
no?



Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-13 Thread Andi Kleen
> So I got around to implementing this and it seems to work great.  The back
> to back NMIs are detected properly using the %rip and that info is passed to
> the NMI notifier.  That info is used to determine if only the first
> handler to report 'handled' is executed or _all_ the handlers are
> executed.
> 
> I think all the 'unknown' NMIs I generated with various perf runs have
> disappeared.  I'll post a new version of my nmi notifier rewrite soon.

This will fail when the system is idle.

-Andi



Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-13 Thread Don Zickus
On Wed, Sep 07, 2011 at 08:09:37PM +0300, Avi Kivity wrote:
> >But then the downside
> >here is we accidentally handle an NMI that was latched.  This would cause
> >a 'Dazed on confused' message as that NMI was already handled by the
> >previous NMI.
> >
> >We are working on an algorithm to detect this condition and flag it
> >(nothing complicated).  But it may never be perfect.
> >
> >On the other hand, what else are we going to do with an edge-triggered
> >shared interrupt line?
> >
> 
> How about, during NMI, save %rip to a per-cpu variable.  Handle just
> one cause.  If, on the next NMI, we hit the same %rip, assume
> back-to-back NMI has occured and now handle all causes.

So I got around to implementing this and it seems to work great.  The back
to back NMIs are detected properly using the %rip and that info is passed to
the NMI notifier.  That info is used to determine if only the first
handler to report 'handled' is executed or _all_ the handlers are
executed.
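
A minimal sketch of that heuristic (names are illustrative, not the actual notifier
rewrite):

/* Sketch only: remember the interrupted %rip per cpu; if the next NMI lands
 * on the same %rip, assume back-to-back NMIs and run every handler instead
 * of stopping at the first one that reports 'handled'. */
static DEFINE_PER_CPU(unsigned long, last_nmi_rip);

static bool nmi_is_back_to_back(struct pt_regs *regs)
{
	bool b2b = (regs->ip == __this_cpu_read(last_nmi_rip));

	__this_cpu_write(last_nmi_rip, regs->ip);
	return b2b;
}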

I think all the 'unknown' NMIs I generated with various perf runs have
disappeared.  I'll post a new version of my nmi notifier rewrite soon.

Thanks for the great suggestion Avi!

Cheers,
Don


Re: [PATCH v3 06/11] KVM: MMU: cleanup FNAME(invlpg)

2011-09-13 Thread Xiao Guangrong
On 09/13/2011 07:00 PM, Avi Kivity wrote:
> On 08/30/2011 05:36 AM, Xiao Guangrong wrote:
>> Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the
>> same code between FNAME(invlpg) and FNAME(sync_page)
>>
>>
>> +static gpa_t FNAME(get_first_pte_gpa)(struct kvm_mmu_page *sp)
> 
> spte
> 

Yes, Will fix.

>> +{
>> +int offset = 0;
>> +
>> +if (PTTYPE == 32)
>> +offset = sp->role.quadrant<<  PT64_LEVEL_BITS;
>> +
>> +return gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
>> +}
> 
> Only works for L1 tables, yes?  Need a comment so it isn't reused for L2.
> 

OK.

>> @@ -663,7 +673,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t 
>> gva)
>>   gpa_t pte_gpa = -1;
>>   int level;
>>   u64 *sptep;
>> -int need_flush = 0;
>>
>>   vcpu_clear_mmio_info(vcpu, gva);
>>
>> @@ -675,36 +684,20 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t 
>> gva)
>>
>>   sp = page_header(__pa(sptep));
>>   if (is_last_spte(*sptep, level)) {
>> -int offset, shift;
>> -
>>   if (!sp->unsync)
>>   break;
>>
>> -shift = PAGE_SHIFT -
>> -  (PT_LEVEL_BITS - PT64_LEVEL_BITS) * level;
>> -offset = sp->role.quadrant<<  shift;
>> -
>> -pte_gpa = (sp->gfn<<  PAGE_SHIFT) + offset;
>> +pte_gpa = FNAME(get_first_pte_gpa)(sp);
> 
> 
> Here is can be used for L2 - I think we can use 2MB host pages to back 4MB 
> guest mappings.
> 

Only unsync shadow pages are fetched here, and their level is always 1.
 




Re: [PATCH v3 05/11] KVM: MMU: do not mark accessed bit on pte write path

2011-09-13 Thread Xiao Guangrong
On 09/13/2011 06:53 PM, Avi Kivity wrote:
> On 08/30/2011 05:35 AM, Xiao Guangrong wrote:
>> In current code, the accessed bit is always set when page fault occurred,
>> do not need to set it on pte write path
> 
> What about speculative sptes that are then only accessed via emulation?
> 

The gfn is read and written only via emulation? I think this case is very
very rare?


Re: [PATCH v3 03/11] KVM: x86: retry non-page-table writing instruction

2011-09-13 Thread Xiao Guangrong
On 09/13/2011 06:47 PM, Avi Kivity wrote:
> On 08/30/2011 05:35 AM, Xiao Guangrong wrote:
>> If the emulation is caused by #PF and it is non-page_table writing 
>> instruction,
>> it means the VM-EXIT is caused by shadow page protected, we can zap the 
>> shadow
>> page and retry this instruction directly
>>
>> The idea is from Avi
>>
>>
>>   int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int 
>> insn_len);
>> +bool page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
> 
> Please use the usual x86_ prefix used in the emulator interface.
> 

OK, will fix.

>> @@ -3720,10 +3721,18 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
>>   kvm_mmu_commit_zap_page(vcpu->kvm,&invalid_list);
>>   }
>>
>> +static bool is_mmio_page_fault(struct kvm_vcpu *vcpu, gva_t addr)
>> +{
>> +if (vcpu->arch.mmu.direct_map || mmu_is_nested(vcpu))
>> +return vcpu_match_mmio_gpa(vcpu, addr);
>> +
>> +return vcpu_match_mmio_gva(vcpu, addr);
>> +}
>> +
>>   int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
>>  void *insn, int insn_len)
>>   {
>> -int r;
>> +int r, emulation_type = EMULTYPE_RETRY;
>>   enum emulation_result er;
>>
>>   r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
>> @@ -3735,7 +3744,10 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t 
>> cr2, u32 error_code,
>>   goto out;
>>   }
>>
>> -er = x86_emulate_instruction(vcpu, cr2, 0, insn, insn_len);
>> +if (is_mmio_page_fault(vcpu, cr2))
>> +emulation_type = 0;
>> +
>> +er = x86_emulate_instruction(vcpu, cr2, emulation_type, insn, insn_len);
>>
>>   switch (er) {
>>   case EMULATE_DONE:
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 6b37f18..1afe59e 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4814,6 +4814,50 @@ static bool reexecute_instruction(struct kvm_vcpu 
>> *vcpu, gva_t gva)
>>   return false;
>>   }
>>
>> +static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
>> +  unsigned long cr2,  int emulation_type)
>> +{
>> +if (!vcpu->arch.mmu.direct_map&&  !mmu_is_nested(vcpu))
>> +gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
> 
> If mmu_is_nested() cr2 is an ngpa, we have to translate it to a gpa, no?
> 

Yeah, will fix it.

And this bug also exists in the current code: it always uses the L2 gpa to emulate
the write operation.
I guess the reason it is not triggered is that the gpa of L2's shadow pages can
not be touched by L2, which means no page table is write-protected by L2.

> btw, I don't see mmu.direct_map initialized for nested npt?
> 

nested_svm_vmrun() -> nested_svm_init_mmu_context():
static int nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
{
int r;

r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);

vcpu->arch.mmu.set_cr3   = nested_svm_set_tdp_cr3;
vcpu->arch.mmu.get_cr3   = nested_svm_get_tdp_cr3;
vcpu->arch.mmu.get_pdptr = nested_svm_get_tdp_pdptr;
vcpu->arch.mmu.inject_page_fault = nested_svm_inject_npf_exit;
vcpu->arch.mmu.shadow_root_level = get_npt_level();
vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;

return r;
}

It is initialized in kvm_init_shadow_mmu :-)


Re: [PATCH] kvm-tool: remove addr_type - unused but set variable

2011-09-13 Thread Pekka Enberg
On Wed, Sep 7, 2011 at 9:48 PM, Hagen Paul Pfeifer  wrote:
> Signed-off-by: Hagen Paul Pfeifer 
> Cc: Sasha Levin 

Applied, thanks!


Re: [PATCH 02/10] Driver core: Add iommu_ops to bus_type

2011-09-13 Thread Greg KH
On Tue, Sep 13, 2011 at 05:38:11PM +0200, Roedel, Joerg wrote:
> On Tue, Sep 13, 2011 at 10:58:55AM -0400, Greg KH wrote:
> > On Tue, Sep 13, 2011 at 04:54:02PM +0200, Roedel, Joerg wrote:
> > > --- a/include/linux/device.h
> > > +++ b/include/linux/device.h
> > > @@ -22,6 +22,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > 
> > Ick, please don't add new #includes to device.h, it makes the whole
> > build slower.  Just pre-declare the structure and all should be fine.
> 
> Hmm, since linux/iommu.h provides 'struct iommu_ops', and this patch
> adds a 'struct iommu_ops' to 'struct bus_type', wouldn't a simple
> forward declaration make the bus_type incomplete in most other places?

No, just like it doesn't make iommu.h incomplete as you used a struct
bus_type there.
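
A tiny stand-alone illustration of that point (hypothetical struct name, not the
real device.h):

/* A pointer member only needs the type name, not the full definition, so
 * device.h can drop the #include and struct bus_type stays complete. */
struct iommu_ops;			/* forward declaration is enough */

struct bus_type_example {		/* hypothetical stand-in for struct bus_type */
	const char *name;
	struct iommu_ops *iommu_ops;	/* storing a pointer: fine */
};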

> To lower the impact I can move the 'struct iommu_ops' to a seperate
> header file instead.

No, that should not be necessary.

greg k-h


Re: [PATCH 02/10] Driver core: Add iommu_ops to bus_type

2011-09-13 Thread Roedel, Joerg
On Tue, Sep 13, 2011 at 10:58:55AM -0400, Greg KH wrote:
> On Tue, Sep 13, 2011 at 04:54:02PM +0200, Roedel, Joerg wrote:
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -22,6 +22,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> 
> Ick, please don't add new #includes to device.h, it makes the whole
> build slower.  Just pre-declare the structure and all should be fine.

Hmm, since linux/iommu.h provides 'struct iommu_ops', and this patch
adds a 'struct iommu_ops' to 'struct bus_type', wouldn't a simple
forward declaration make the bus_type incomplete in most other places?
To lower the impact I can move the 'struct iommu_ops' to a seperate
header file instead.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [PATCH 02/10] Driver core: Add iommu_ops to bus_type

2011-09-13 Thread Roedel, Joerg
On Tue, Sep 13, 2011 at 10:58:55AM -0400, Greg KH wrote:
> On Tue, Sep 13, 2011 at 04:54:02PM +0200, Roedel, Joerg wrote:
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -22,6 +22,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> 
> Ick, please don't add new #includes to device.h, it makes the whole
> build slower.  Just pre-declare the structure and all should be fine.

Okay, will do. Thanks.

> Don't you need to also redo the other patches in this series based on
> the other comments you received?

Unless I have missed something I am not aware of other comments that
need action by now.
Sethi pointed out that the KVM device assignment code needs rework to be
usable outside of PCI too. But that work is already underway with Alex
Williamson's VFIO framework and not scope of this patch-set.

Regards,

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [PATCH 2/2] Qemu co-operation with kvm tsc deadline timer

2011-09-13 Thread Jan Kiszka
On 2011-09-13 16:38, Liu, Jinsong wrote:
> From c1b502d6548fcc41592cd90acc82109ee949df75 Mon Sep 17 00:00:00 2001
> From: Liu, Jinsong 
> Date: Tue, 13 Sep 2011 22:05:30 +0800
> Subject: [PATCH] Qemu co-operation with kvm tsc deadline timer
> 
> KVM add emulation of lapic tsc deadline timer for guest.
> This patch is co-operation work at qemu side.
> 
> Signed-off-by: Liu, Jinsong 
> ---
>  target-i386/cpu.h |2 ++
>  target-i386/kvm.c |   14 ++
>  2 files changed, 16 insertions(+), 0 deletions(-)
> 
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 935d08a..62ff73c 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -283,6 +283,7 @@
>  #define MSR_IA32_APICBASE_BSP   (1<<8)
>  #define MSR_IA32_APICBASE_ENABLE(1<<11)
>  #define MSR_IA32_APICBASE_BASE  (0xf<<12)
> +#define MSR_IA32_TSCDEADLINE0x6e0
>  
>  #define MSR_MTRRcap  0xfe
>  #define MSR_MTRRcap_VCNT 8
> @@ -687,6 +688,7 @@ typedef struct CPUX86State {
>  uint64_t async_pf_en_msr;
>  
>  uint64_t tsc;
> +uint64_t tsc_deadline;

This field has to be saved/restored for snapshots/migrations.

Frankly, I've no clue right now if substates are in vogue again (they
had problems in their binary format) or if you can simply add a
versioned top-level field and bump the CPUState version number.

>  
>  uint64_t mcg_status;
>  
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index aa843f0..206fcad 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -59,6 +59,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
>  
>  static bool has_msr_star;
>  static bool has_msr_hsave_pa;
> +static bool has_msr_tsc_deadline;
>  static bool has_msr_async_pf_en;
>  static int lm_capable_kernel;
>  
> @@ -571,6 +572,10 @@ static int kvm_get_supported_msrs(KVMState *s)
>  has_msr_hsave_pa = true;
>  continue;
>  }
> +if (kvm_msr_list->indices[i] == MSR_IA32_TSCDEADLINE) {
> +has_msr_tsc_deadline = true;
> +continue;
> +}
>  }
>  }
>  
> @@ -899,6 +904,9 @@ static int kvm_put_msrs(CPUState *env, int level)
>  if (has_msr_hsave_pa) {
>  kvm_msr_entry_set(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
>  }
> +if (has_msr_tsc_deadline) {
> +kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSCDEADLINE, 
> env->tsc_deadline);
> +}
>  #ifdef TARGET_X86_64
>  if (lm_capable_kernel) {
>  kvm_msr_entry_set(&msrs[n++], MSR_CSTAR, env->cstar);
> @@ -1145,6 +1153,9 @@ static int kvm_get_msrs(CPUState *env)
>  if (has_msr_hsave_pa) {
>  msrs[n++].index = MSR_VM_HSAVE_PA;
>  }
> +if (has_msr_tsc_deadline) {
> +msrs[n++].index = MSR_IA32_TSCDEADLINE;
> +}
>  
>  if (!env->tsc_valid) {
>  msrs[n++].index = MSR_IA32_TSC;
> @@ -1213,6 +1224,9 @@ static int kvm_get_msrs(CPUState *env)
>  case MSR_IA32_TSC:
>  env->tsc = msrs[i].data;
>  break;
> +case MSR_IA32_TSCDEADLINE:
> +env->tsc_deadline = msrs[i].data;
> +break;
>  case MSR_VM_HSAVE_PA:
>  env->vm_hsave = msrs[i].data;
>  break;

Just to double check: This feature is exposed to the guest when A) the
host CPU supports it and B) QEMU passed down guest CPU specifications
(cpuid data) that allow it as well?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH 02/10] Driver core: Add iommu_ops to bus_type

2011-09-13 Thread Greg KH
On Tue, Sep 13, 2011 at 04:54:02PM +0200, Roedel, Joerg wrote:
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

Ick, please don't add new #includes to device.h, it makes the whole
build slower.  Just pre-declare the structure and all should be fine.

Don't you need to also redo the other patches in this series based on
the other comments you received?

greg k-h


Re: [PATCH 02/10] Driver core: Add iommu_ops to bus_type

2011-09-13 Thread Roedel, Joerg
Hi Greg,

On Wed, Sep 07, 2011 at 03:44:45PM -0400, Greg KH wrote:
> > 
> > The IOMMUs are usually devices on the bus itself, so they are
> > initialized after the bus is set up and the devices on it are
> > populated.  So the function can not be called on bus initialization
> > because the IOMMU is not ready at this point.
> 
> Ok, that makes more sense, please state as much in the documentation.

I added the kernel-doc documentation you requested. Here is the updated
patch. Please let me know what you think.

>From 23757825035f6d9309866dc142133aada0e2c1de Mon Sep 17 00:00:00 2001
From: Joerg Roedel 
Date: Fri, 26 Aug 2011 16:48:26 +0200
Subject: [PATCH 02/10] Driver core: Add iommu_ops to bus_type

This is the starting point to make the iommu_ops used for
the iommu-api a per-bus-type structure. It is required to
easily implement bus-specific setup in the iommu-layer.
The first user will be the iommu-group attribute in sysfs.

Signed-off-by: Joerg Roedel 
---
 drivers/base/bus.c |   29 +
 drivers/iommu/iommu.c  |4 
 include/linux/device.h |9 +
 include/linux/iommu.h  |2 ++
 4 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 000e7b2..b3014fe 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -1028,6 +1028,35 @@ void bus_sort_breadthfirst(struct bus_type *bus,
 }
 EXPORT_SYMBOL_GPL(bus_sort_breadthfirst);
 
+#ifdef CONFIG_IOMMU_API
+/**
+ * bus_set_iommu - set iommu-callbacks for the bus
+ * @bus: bus.
+ * @ops: the callbacks provided by the iommu-driver
+ *
+ * This function is called by an iommu driver to set the iommu methods
+ * used for a particular bus. Drivers for devices on that bus can use
+ * the iommu-api after these ops are registered.
+ * This special function is needed because IOMMUs are usually devices on
+ * the bus itself, so the iommu drivers are not initialized when the bus
+ * is set up. With this function the iommu-driver can set the iommu-ops
+ * afterwards.
+ */
+int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops)
+{
+   if (bus->iommu_ops != NULL)
+   return -EBUSY;
+
+   bus->iommu_ops = ops;
+
+   /* Do IOMMU specific setup for this bus-type */
+   iommu_bus_init(bus, ops);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(bus_set_iommu);
+#endif
+
 int __init buses_init(void)
 {
bus_kset = kset_create_and_add("bus", &bus_uevent_ops, NULL);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 30b0644..3b24a5b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -34,6 +34,10 @@ void register_iommu(struct iommu_ops *ops)
iommu_ops = ops;
 }
 
+void iommu_bus_init(struct bus_type *bus, struct iommu_ops *ops)
+{
+}
+
 bool iommu_found(void)
 {
return iommu_ops != NULL;
diff --git a/include/linux/device.h b/include/linux/device.h
index c20dfbf..8240b2a 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -67,6 +68,9 @@ extern void bus_remove_file(struct bus_type *, struct 
bus_attribute *);
  * @resume:Called to bring a device on this bus out of sleep mode.
  * @pm:Power management operations of this bus, callback the 
specific
  * device driver's pm-ops.
+ * @iommu_ops   IOMMU specific operations for this bus, used to attach IOMMU
+ *  driver implementations to a bus and allow the driver to do
+ *  bus-specific setup
  * @p: The private data of the driver core, only the driver core can
  * touch this.
  *
@@ -96,6 +100,8 @@ struct bus_type {
 
const struct dev_pm_ops *pm;
 
+   struct iommu_ops *iommu_ops;
+
struct subsys_private *p;
 };
 
@@ -148,6 +154,9 @@ extern int bus_unregister_notifier(struct bus_type *bus,
 #define BUS_NOTIFY_UNBOUND_DRIVER  0x0006 /* driver is unbound
  from the device */
 
+/* IOMMU related bus functions */
+int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
+
 extern struct kset *bus_get_kset(struct bus_type *bus);
 extern struct klist *bus_get_device_klist(struct bus_type *bus);
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 6470cd8..4739e36 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -25,6 +25,7 @@
 #define IOMMU_WRITE(2)
 #define IOMMU_CACHE(4) /* DMA cache coherency */
 
+struct bus_type;
 struct device;
 
 struct iommu_domain {
@@ -52,6 +53,7 @@ struct iommu_ops {
 };
 
 extern void register_iommu(struct iommu_ops *ops);
+extern void iommu_bus_init(struct bus_type *bus, struct iommu_ops *ops);
 extern bool iommu_found(void);
 extern struct iommu_domain *iommu_domain_alloc(void);
 extern void iommu_domain_free(struct iommu_domain *domain);
-- 
1.7.4.1

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85

[PATCH 2/2] Qemu co-operation with kvm tsc deadline timer

2011-09-13 Thread Liu, Jinsong
>From c1b502d6548fcc41592cd90acc82109ee949df75 Mon Sep 17 00:00:00 2001
From: Liu, Jinsong 
Date: Tue, 13 Sep 2011 22:05:30 +0800
Subject: [PATCH] Qemu co-operation with kvm tsc deadline timer

KVM add emulation of lapic tsc deadline timer for guest.
This patch is co-operation work at qemu side.

Signed-off-by: Liu, Jinsong 
---
 target-i386/cpu.h |2 ++
 target-i386/kvm.c |   14 ++
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 935d08a..62ff73c 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -283,6 +283,7 @@
 #define MSR_IA32_APICBASE_BSP   (1<<8)
 #define MSR_IA32_APICBASE_ENABLE(1<<11)
 #define MSR_IA32_APICBASE_BASE  (0xf<<12)
+#define MSR_IA32_TSCDEADLINE0x6e0
 
 #define MSR_MTRRcap0xfe
 #define MSR_MTRRcap_VCNT   8
@@ -687,6 +688,7 @@ typedef struct CPUX86State {
 uint64_t async_pf_en_msr;
 
 uint64_t tsc;
+uint64_t tsc_deadline;
 
 uint64_t mcg_status;
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index aa843f0..206fcad 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -59,6 +59,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
 
 static bool has_msr_star;
 static bool has_msr_hsave_pa;
+static bool has_msr_tsc_deadline;
 static bool has_msr_async_pf_en;
 static int lm_capable_kernel;
 
@@ -571,6 +572,10 @@ static int kvm_get_supported_msrs(KVMState *s)
 has_msr_hsave_pa = true;
 continue;
 }
+if (kvm_msr_list->indices[i] == MSR_IA32_TSCDEADLINE) {
+has_msr_tsc_deadline = true;
+continue;
+}
 }
 }
 
@@ -899,6 +904,9 @@ static int kvm_put_msrs(CPUState *env, int level)
 if (has_msr_hsave_pa) {
 kvm_msr_entry_set(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
 }
+if (has_msr_tsc_deadline) {
+kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
+}
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 kvm_msr_entry_set(&msrs[n++], MSR_CSTAR, env->cstar);
@@ -1145,6 +1153,9 @@ static int kvm_get_msrs(CPUState *env)
 if (has_msr_hsave_pa) {
 msrs[n++].index = MSR_VM_HSAVE_PA;
 }
+if (has_msr_tsc_deadline) {
+msrs[n++].index = MSR_IA32_TSCDEADLINE;
+}
 
 if (!env->tsc_valid) {
 msrs[n++].index = MSR_IA32_TSC;
@@ -1213,6 +1224,9 @@ static int kvm_get_msrs(CPUState *env)
 case MSR_IA32_TSC:
 env->tsc = msrs[i].data;
 break;
+case MSR_IA32_TSCDEADLINE:
+env->tsc_deadline = msrs[i].data;
+break;
 case MSR_VM_HSAVE_PA:
 env->vm_hsave = msrs[i].data;
 break;
-- 
1.6.5.6


qemu-lapic-tsc-deadline-timer.patch
Description: qemu-lapic-tsc-deadline-timer.patch


[PATCH 1/2] KVM: emulate lapic tsc deadline timer for guest

2011-09-13 Thread Liu, Jinsong
>From 7b12021e1d1b79797b49e41cc0a7be05a6180d9a Mon Sep 17 00:00:00 2001
From: Liu, Jinsong 
Date: Tue, 13 Sep 2011 21:52:54 +0800
Subject: [PATCH] KVM: emulate lapic tsc deadline timer for guest

This patch emulates the lapic tsc deadline timer for the guest:
Enumerate tsc deadline timer capability by CPUID;
Enable tsc deadline timer mode by lapic MMIO;
Start tsc deadline timer by WRMSR;
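
For readers unfamiliar with the mode, a guest arms the timer roughly like this
(illustrative sketch, not part of this patch; the constants are the ones added
below, apic_write()/wrmsrl() are the usual kernel helpers):

/* select TSC-deadline mode in the LVT timer register (lapic MMIO), then start
 * the timer by writing an absolute TSC value to the new MSR; writing 0 disarms
 * it again */
static void arm_tsc_deadline(u64 expire_tsc, u8 vector)
{
	apic_write(APIC_LVTT, APIC_LVT_TIMER_TSCDEADLINE | vector);
	wrmsrl(MSR_IA32_TSCDEADLINE, expire_tsc);
}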

Signed-off-by: Liu, Jinsong 
---
 arch/x86/include/asm/apicdef.h|2 +
 arch/x86/include/asm/cpufeature.h |3 +
 arch/x86/include/asm/kvm_host.h   |2 +
 arch/x86/include/asm/msr-index.h  |2 +
 arch/x86/kvm/kvm_timer.h  |2 +
 arch/x86/kvm/lapic.c  |  122 ++---
 arch/x86/kvm/lapic.h  |3 +
 arch/x86/kvm/x86.c|   20 ++-
 8 files changed, 132 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/apicdef.h b/arch/x86/include/asm/apicdef.h
index 34595d5..3925d80 100644
--- a/arch/x86/include/asm/apicdef.h
+++ b/arch/x86/include/asm/apicdef.h
@@ -100,7 +100,9 @@
 #defineAPIC_TIMER_BASE_CLKIN   0x0
 #defineAPIC_TIMER_BASE_TMBASE  0x1
 #defineAPIC_TIMER_BASE_DIV 0x2
+#defineAPIC_LVT_TIMER_ONESHOT  (0 << 17)
 #defineAPIC_LVT_TIMER_PERIODIC (1 << 17)
+#defineAPIC_LVT_TIMER_TSCDEADLINE  (2 << 17)
 #defineAPIC_LVT_MASKED (1 << 16)
 #defineAPIC_LVT_LEVEL_TRIGGER  (1 << 15)
 #defineAPIC_LVT_REMOTE_IRR (1 << 14)
diff --git a/arch/x86/include/asm/cpufeature.h 
b/arch/x86/include/asm/cpufeature.h
index 4258aac..8a26e48 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -120,6 +120,7 @@
 #define X86_FEATURE_X2APIC (4*32+21) /* x2APIC */
 #define X86_FEATURE_MOVBE  (4*32+22) /* MOVBE instruction */
 #define X86_FEATURE_POPCNT  (4*32+23) /* POPCNT instruction */
+#define X86_FEATURE_TSC_DEADLINE_TIMER(4*32+24) /* Tsc deadline timer */
 #define X86_FEATURE_AES(4*32+25) /* AES instructions */
 #define X86_FEATURE_XSAVE  (4*32+26) /* XSAVE/XRSTOR/XSETBV/XGETBV */
 #define X86_FEATURE_OSXSAVE(4*32+27) /* "" XSAVE enabled in the OS */
@@ -284,6 +285,8 @@ extern const char * const x86_power_flags[32];
 #define cpu_has_xmm4_1 boot_cpu_has(X86_FEATURE_XMM4_1)
 #define cpu_has_xmm4_2 boot_cpu_has(X86_FEATURE_XMM4_2)
 #define cpu_has_x2apic boot_cpu_has(X86_FEATURE_X2APIC)
+#define cpu_has_tsc_deadline_timer \
+   boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)
 #define cpu_has_xsave  boot_cpu_has(X86_FEATURE_XSAVE)
 #define cpu_has_hypervisor boot_cpu_has(X86_FEATURE_HYPERVISOR)
 #define cpu_has_pclmulqdq  boot_cpu_has(X86_FEATURE_PCLMULQDQ)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 307e3cf..2ce6529 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -671,6 +671,8 @@ u8 kvm_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t 
gfn);
 
 extern bool tdp_enabled;
 
+extern u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
+
 /* control of guest tsc rate supported? */
 extern bool kvm_has_tsc_control;
 /* minimum supported tsc_khz for guests */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d52609a..a6962d9 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -229,6 +229,8 @@
 #define MSR_IA32_APICBASE_ENABLE   (1<<11)
 #define MSR_IA32_APICBASE_BASE (0xf<<12)
 
+#define MSR_IA32_TSCDEADLINE   0x06e0
+
 #define MSR_IA32_UCODE_WRITE   0x0079
 #define MSR_IA32_UCODE_REV 0x008b
 
diff --git a/arch/x86/kvm/kvm_timer.h b/arch/x86/kvm/kvm_timer.h
index 64bc6ea..497dbaa 100644
--- a/arch/x86/kvm/kvm_timer.h
+++ b/arch/x86/kvm/kvm_timer.h
@@ -2,6 +2,8 @@
 struct kvm_timer {
struct hrtimer timer;
s64 period; /* unit: ns */
+   u32 timer_mode_mask;
+   u64 tscdeadline;
atomic_t pending;   /* accumulated triggered timers 
*/
bool reinject;
struct kvm_timer_ops *t_ops;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 2b2255b..925d4b9 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -135,9 +135,23 @@ static inline int apic_lvt_vector(struct kvm_lapic *apic, 
int lvt_type)
return apic_get_reg(apic, lvt_type) & APIC_VECTOR_MASK;
 }
 
+static inline int apic_lvtt_oneshot(struct kvm_lapic *apic)
+{
+   return ((apic_get_reg(apic, APIC_LVTT) & 
+   apic->lapic_timer.timer_mode_mask) == APIC_LVT_TIMER_ONESHOT);
+}
+
 static inline int apic_lvtt_period(struct kvm_lapic *apic)
 {
-   return apic_get_reg(apic, APIC_LVTT) & APIC_LVT_TIMER_PERIODIC;
+   

Re: About hotplug multifunction

2011-09-13 Thread Amos Kong
- Original Message -
> Hi all,

I've tested with a WinXP guest; multifunction hotplug works.

> After reading the pci driver code, I found a problem.
> 
> There is a list for each slot, (slot->funcs)
> it will be inited in acpiphp_glue.c:register_slot() before hotpluging
> device,
> and only one entry(func 0) will be added to it,
> no new entry will be added to the list when hotpluging devices to the
> slot.
> 
> When we release the device, there are only _one_ entry in the
> list(slot->funcs).

This list (slot->funcs) is designed to store the func entries,
but it only stores func 0.

I changed the # to # in seabios: src/acpi-dsdt.dsl
Multifunction hotplug of Windows doesn't work, and a Linux guest will only remove the last func;
funcs 0~6 still exist in the guest.
It seems to be a bug in the Linux pci driver (pci_remove_bus_device() is not called for
funcs 1~7).

--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -130,7 +130,7 @@ DefinitionBlock (
 
 #define hotplug_slot(name, nr) \
 Device (S##name) {\
-   Name (_ADR, nr##)  \
+   Name (_ADR, nr##)  \
Method (_EJ0,1) {  \
 Store(ShiftLeft(1, nr), B0EJ) \
 Return (0x0)  \
@@ -462,7 +462,7 @@ DefinitionBlock (
 
 #define gen_pci_device(name, nr)\
 Device(SL##name) {  \
-Name (_ADR, nr##)   \
+Name (_ADR, nr##)   \
 Method (_RMV) { \

==
I tried to add new entries in acpiphp_glue.c:enable_device() for each func, but
it doesn't work.



> acpiphp_glue.c:disable_device()
> list_for_each_entry(func, &slot->funcs, sibling) {
> pdev = pci_get_slot(slot->bridge->pci_bus,
> PCI_DEVFN(slot->device, func->function));
> ...release code... // those code can only be executed one time (func
> 0)
> pci_remove_bus_device(pdev);
> }
> 
> bus.c:pci_bus_add_device() is called for each func device in
> acpiphp_glue.c:enable_device().
> bus.c:pci_remove_bus_device(pdev) is only called for func 0 in
> acpiphp_glue.c:disable_device().
> 
> 
> Resolution: (I've tested it, success)
> enumerate all the funcs when disable device.
> 
> list_for_each_entry(func, &slot->funcs, sibling) {
> for (i=0; i<8; i++) {
> pdev = pci_get_slot(slot->bridge->pci_bus,
> PCI_DEVFN(slot->device, i));
> ...release code...
> pci_remove_bus_device(pdev);
> 
> }
> }




[PATCH] KVM: PPC: Fix build failure with HV KVM and CBE

2011-09-13 Thread Alexander Graf
When running with HV KVM and CBE config options enabled, I get
build failures like the following:

  arch/powerpc/kernel/head_64.o: In function `cbe_system_error_hv':
  (.text+0x1228): undefined reference to `do_kvm_0x1202'
  arch/powerpc/kernel/head_64.o: In function `cbe_maintenance_hv':
  (.text+0x1628): undefined reference to `do_kvm_0x1602'
  arch/powerpc/kernel/head_64.o: In function `cbe_thermal_hv':
  (.text+0x1828): undefined reference to `do_kvm_0x1802'

This is because we jump to a KVM handler when HV is enabled, but we
only generate the handler with PR KVM mode.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kernel/exceptions-64s.S |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 29ddd8b..396d080 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -267,7 +267,7 @@ vsx_unavailable_pSeries_1:
 
 #ifdef CONFIG_CBE_RAS
STD_EXCEPTION_HV(0x1200, 0x1202, cbe_system_error)
-   KVM_HANDLER_PR_SKIP(PACA_EXGEN, EXC_HV, 0x1202)
+   KVM_HANDLER_SKIP(PACA_EXGEN, EXC_HV, 0x1202)
 #endif /* CONFIG_CBE_RAS */
 
STD_EXCEPTION_PSERIES(0x1300, 0x1300, instruction_breakpoint)
@@ -275,7 +275,7 @@ vsx_unavailable_pSeries_1:
 
 #ifdef CONFIG_CBE_RAS
STD_EXCEPTION_HV(0x1600, 0x1602, cbe_maintenance)
-   KVM_HANDLER_PR_SKIP(PACA_EXGEN, EXC_HV, 0x1602)
+   KVM_HANDLER_SKIP(PACA_EXGEN, EXC_HV, 0x1602)
 #endif /* CONFIG_CBE_RAS */
 
STD_EXCEPTION_PSERIES(0x1700, 0x1700, altivec_assist)
@@ -283,7 +283,7 @@ vsx_unavailable_pSeries_1:
 
 #ifdef CONFIG_CBE_RAS
STD_EXCEPTION_HV(0x1800, 0x1802, cbe_thermal)
-   KVM_HANDLER_PR_SKIP(PACA_EXGEN, EXC_HV, 0x1802)
+   KVM_HANDLER_SKIP(PACA_EXGEN, EXC_HV, 0x1802)
 #endif /* CONFIG_CBE_RAS */
 
. = 0x3000
-- 
1.6.0.2
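
The failure mode here -- a call site emitted under one config option while its
target is only generated under another -- can be reproduced in miniature. In
the sketch below, CONFIG_HV, CONFIG_PR and the function names are illustrative,
not the kernel's:

/* Illustrative only: linking fails with "undefined reference to
 * `hv_handler'" when CONFIG_HV is set but CONFIG_PR is not, which is
 * the same shape as the do_kvm_0x1202 errors above. */
#ifdef CONFIG_PR
void hv_handler(void) { /* handler generated only for PR mode */ }
#endif

void exception_entry(void)
{
#ifdef CONFIG_HV
    hv_handler();   /* reference emitted whenever HV is enabled */
#endif
}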



[PATCH] qemu-kvm: Fix build without VNC support

2011-09-13 Thread Boris Dolgov
Hello!

Qemu-kvm 0.15.0 doesn't build with vnc support disabled. The following
patch fixes the problem:

Signed-off-by: Boris Dolgov 

--- monitor.c~  2011-08-09 12:40:29.0 +
+++ monitor.c   2011-09-13 13:02:40.0 +
@@ -1221,10 +1221,12 @@ static int add_graphics_client(Monitor *
 }
qerror_report(QERR_ADD_CLIENT_FAILED);
return -1;
+#ifdef CONFIG_VNC
 } else if (strcmp(protocol, "vnc") == 0) {
int fd = monitor_get_fd(mon, fdname);
vnc_display_add_client(NULL, fd, skipauth);
return 0;
+#endif
 } else if ((s = qemu_chr_find(protocol)) != NULL) {
int fd = monitor_get_fd(mon, fdname);
if (qemu_chr_add_client(s, fd) < 0) {

-- 
Boris Dolgov.
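
The fix compiles out the whole "else if" arm, condition included, so the
remaining branches still form a valid chain when VNC support is disabled. A
minimal sketch of the pattern, with made-up protocol handling standing in for
the real monitor code:

#include <string.h>

/* Illustrative only -- not the real monitor.c.  With CONFIG_VNC undefined,
 * the surviving branches still chain up correctly. */
static int add_client(const char *protocol, int fd)
{
    (void)fd;
    if (strcmp(protocol, "spice") == 0) {
        return 0;                       /* always built in this sketch */
#ifdef CONFIG_VNC
    } else if (strcmp(protocol, "vnc") == 0) {
        return 0;                       /* only when VNC support is compiled in */
#endif
    } else if (strcmp(protocol, "chardev") == 0) {
        return 0;
    }
    return -1;                          /* unknown protocol */
}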


Re: [PATCH 3/3] iommu/core: split mapping to page sizes as supported by the hardware

2011-09-13 Thread Ohad Ben-Cohen
On Tue, Sep 13, 2011 at 1:44 PM, Roedel, Joerg  wrote:
> Not necessarily. You could implement this side-by-side with the old code
> until all drivers are converted and remove the old code then. This keeps
> bisectability.

Ok.

>> > Intel IOMMU does not support arbitrary page-sizes, afaik.
>>
>> It does; besides the usual 4K it has "super page sizes" support of
>> 2MB, 1GB, 512GB and 1TB.
>
> But the value ~0xfffUL indicates support for 4k, 8k, 16k .. 2^63, no?

Yes, I have done this intentionally, in order to retain the existing
behavior for IOMMU drivers which are already capable of handling
arbitrary page sizes (intel-iommu handles this in software, see
hardware_largepage_caps() and the code that uses it).

Long term, it might make more sense to remove
hardware_largepage_caps() (and the logic around it) and instead just
declare the real page sizes the hardware supports when calling
register_iommu(), but I guess it's up to Intel guys. For now it's just
safer to declare ~0xfffUL which really means: keep calling me with
sizes and alignments that are an order of 4KB, just like you always
did.
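
As a concrete illustration of what the bitmap means to the core code: bit k set
advertises support for a 2^k-byte page size, and at each step the core picks
the largest advertised size that fits the remaining length and the alignment of
both addresses. A minimal sketch, not the actual iommu.c implementation, and
assuming a 64-bit unsigned long:

/* Pick the largest advertised page size that fits the remaining length
 * and the alignment of both addresses.  Illustrative only. */
unsigned long pick_pgsize(unsigned long pgsize_bitmap,
                          unsigned long iova, unsigned long paddr,
                          unsigned long size)
{
    int k;

    for (k = 63; k >= 12; k--) {
        unsigned long sz = 1UL << k;

        if (!(pgsize_bitmap & sz))
            continue;               /* size not advertised */
        if (sz > size)
            continue;               /* larger than what is left */
        if ((iova | paddr) & (sz - 1))
            continue;               /* not naturally aligned */
        return sz;
    }
    return 0;
}

/* With ~0xfffUL a 2MB-aligned 3MB region splits as one 2MB chunk followed
 * by one 1MB chunk, since every order of 4K is allowed; a driver advertising
 * only 4K and 2MB would instead get one 2MB chunk plus 256 4K pages. */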


Re: About hotplug multifunction

2011-09-13 Thread Amos Kong
Hi all,

After reading the pci driver code, I found a problem.

There is a list for each slot, (slot->funcs)
it will be inited in acpiphp_glue.c:register_slot() before hotpluging device,
and only one entry(func 0) will be added to it,
no new entry will be added to the list when hotpluging devices to the slot.

When we release the device, there are only _one_ entry in the list(slot->funcs).

acpiphp_glue.c:disable_device()
list_for_each_entry(func, &slot->funcs, sibling) {
pdev = pci_get_slot(slot->bridge->pci_bus,
PCI_DEVFN(slot->device, func->function));
...release code...  // this code is only executed once (for func 0)
pci_remove_bus_device(pdev);
}

bus.c:pci_bus_add_device() is called for each func device in 
acpiphp_glue.c:enable_device().
bus.c:pci_remove_bus_device(pdev) is only called for func 0 in 
acpiphp_glue.c:disable_device().


Resolution (I've tested it successfully):
enumerate all the funcs when disabling the device.

list_for_each_entry(func, &slot->funcs, sibling) {
for (i=0; i<8; i++) {
pdev = pci_get_slot(slot->bridge->pci_bus,
PCI_DEVFN(slot->device, i));
...release code...
pci_remove_bus_device(pdev);

}
}
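
As background for the 0~7 loop above: a PCI devfn packs a 5-bit slot number and
a 3-bit function number, so the eight functions of one slot are eight
consecutive devfns. A small self-contained sketch of the encoding; the MY_*
names are illustrative, the kernel's equivalents are PCI_DEVFN/PCI_SLOT/PCI_FUNC:

#include <stdio.h>

#define MY_DEVFN(slot, func)    ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define MY_SLOT(devfn)          (((devfn) >> 3) & 0x1f)
#define MY_FUNC(devfn)          ((devfn) & 0x07)

int main(void)
{
    int func;

    /* Functions 0..7 of slot 3 -> devfn 0x18..0x1f. */
    for (func = 0; func < 8; func++)
        printf("slot 3 func %d -> devfn %#x\n", func, MY_DEVFN(3, func));
    return 0;
}
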

Thanks,
Amos


Re: [PATCH v3 11/11] KVM: MMU: improve write flooding detected

2011-09-13 Thread Avi Kivity

On 08/30/2011 05:38 AM, Xiao Guangrong wrote:

Detecting write-flooding does not work well: when we handle a page write, if
the last speculative spte is not accessed, we treat the page as write-flooded.
However, we create speculative sptes on many paths, such as pte prefetch and
page sync, so the last speculative spte may not point to the written page, and
the written page may still be accessed via other sptes; relying on the Accessed
bit of the last speculative spte is therefore not enough.

Instead of detecting whether the page is accessed, we can detect whether the
spte is accessed after it is written; if the spte is not accessed but is
written frequently, we treat it as not being a page table, or as unused for a
long time.




The spte may not be accessed, but other sptes in the same page can be 
accessed.  An example is the fixmap area for kmap_atomic(): there will 
be a lot of pte writes but other sptes will be accessed without going 
through soft-mmu at all.


I think you have to read the parent_ptes->spte.accessed bits to be sure.

--
error compiling committee.c: too many arguments to function



Re: [PATCH v3 06/11] KVM: MMU: cleanup FNAME(invlpg)

2011-09-13 Thread Avi Kivity

On 08/30/2011 05:36 AM, Xiao Guangrong wrote:

Directly use mmu_page_zap_pte to zap the spte in FNAME(invlpg), and remove the
code duplicated between FNAME(invlpg) and FNAME(sync_page)


+static gpa_t FNAME(get_first_pte_gpa)(struct kvm_mmu_page *sp)


spte


+{
+   int offset = 0;
+
+   if (PTTYPE == 32)
+   offset = sp->role.quadrant<<  PT64_LEVEL_BITS;
+
+   return gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
+}


Only works for L1 tables, yes?  Need a comment so it isn't reused for L2.
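
A quick arithmetic check of the L1-only point, assuming the usual 32-bit guest
constants (PAGE_SHIFT=12, PT_LEVEL_BITS=10, PT64_LEVEL_BITS=9,
sizeof(pt_element_t)=4): the helper above matches the formula removed further
down in this hunk only when level == 1.

#include <assert.h>

enum { PAGE_SHIFT = 12, PT_LEVEL_BITS = 10, PT64_LEVEL_BITS = 9, PTE_SIZE = 4 };

/* Byte offset computed by the code removed below. */
static unsigned long old_offset(unsigned int quadrant, int level)
{
    int shift = PAGE_SHIFT - (PT_LEVEL_BITS - PT64_LEVEL_BITS) * level;

    return (unsigned long)quadrant << shift;
}

/* Byte offset computed by FNAME(get_first_pte_gpa) above. */
static unsigned long new_offset(unsigned int quadrant)
{
    return ((unsigned long)quadrant << PT64_LEVEL_BITS) * PTE_SIZE;
}

int main(void)
{
    assert(old_offset(1, 1) == new_offset(1));  /* level 1: both 0x800 */
    assert(old_offset(1, 2) != new_offset(1));  /* level 2: 0x400 vs 0x800 */
    return 0;
}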


@@ -663,7 +673,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
gpa_t pte_gpa = -1;
int level;
u64 *sptep;
-   int need_flush = 0;

vcpu_clear_mmio_info(vcpu, gva);

@@ -675,36 +684,20 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t 
gva)

sp = page_header(__pa(sptep));
if (is_last_spte(*sptep, level)) {
-   int offset, shift;
-
if (!sp->unsync)
break;

-   shift = PAGE_SHIFT -
- (PT_LEVEL_BITS - PT64_LEVEL_BITS) * level;
-   offset = sp->role.quadrant<<  shift;
-
-   pte_gpa = (sp->gfn<<  PAGE_SHIFT) + offset;
+   pte_gpa = FNAME(get_first_pte_gpa)(sp);



Here it can be used for L2 - I think we can use 2MB host pages to back 
4MB guest mappings.




--
error compiling committee.c: too many arguments to function



Re: [PATCH v3 05/11] KVM: MMU: do not mark accessed bit on pte write path

2011-09-13 Thread Avi Kivity

On 08/30/2011 05:35 AM, Xiao Guangrong wrote:

In the current code, the accessed bit is always set when a page fault occurs,
so there is no need to set it on the pte write path


What about speculative sptes that are then only accessed via emulation?

--
error compiling committee.c: too many arguments to function



Re: [PATCH v3 01/11] KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write

2011-09-13 Thread Avi Kivity

On 09/13/2011 01:24 PM, Xiao Guangrong wrote:

On 09/13/2011 05:51 PM, Avi Kivity wrote:
>  On 08/30/2011 05:34 AM, Xiao Guangrong wrote:
>>  kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
>>  function when spte is prefetched, unfortunately, we can not know how many
>>  spte need to be prefetched on this path, that means we can use out of the
>>  free  pte_list_desc object in the cache, and BUG_ON() is triggered, also 
some
>>  path does not fill the cache, such as INS instruction emulated that does not
>>  trigger page fault
>>
>>  @@ -3716,10 +3735,6 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t 
cr2, u32 error_code,
>>goto out;
>>}
>>
>>  -r = mmu_topup_memory_caches(vcpu);
>>  -if (r)
>>  -goto out;
>>  -
>>er = x86_emulate_instruction(vcpu, cr2, 0, insn, insn_len);
>>
>
>  Suppose we are out of memory, can't this get us in an endless loop?
>
>  return -ENOMEM breaks as out (and kills the guest, likely).
>

If memory is not enough, we just clear sptes on pte_write path(not prefetch 
spte),
the later page fault path can return -1 to let guest crash. Hmm?



Yes.

btw, is rmap_can_add() sufficient?  We allocate more than just rmaps in 
mmu_topup_memory_caches().  I guess it is, but this is getting tricky.
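
For readers outside the MMU code, the pattern being discussed is a small
per-vcpu object cache that is topped up (with allocations that may fail or
sleep) before entering a path that is not allowed to fail. A minimal sketch of
the idea, not the kernel's mmu_memory_cache:

#include <stdlib.h>

#define CACHE_MIN 8

struct obj_cache {
    int nobjs;
    void *objs[CACHE_MIN];
};

/* May fail: called before taking locks (kvm: mmu_topup_memory_caches()). */
static int cache_topup(struct obj_cache *c)
{
    while (c->nobjs < CACHE_MIN) {
        void *p = malloc(64);

        if (!p)
            return -1;      /* -ENOMEM in the kernel */
        c->objs[c->nobjs++] = p;
    }
    return 0;
}

/* Must not fail: called with locks held, e.g. when adding an rmap entry.
 * The BUG_ON() mentioned in the commit message corresponds to this cache
 * running dry because more objects were consumed than were topped up. */
static void *cache_alloc(struct obj_cache *c)
{
    return c->objs[--c->nobjs];
}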



--
error compiling committee.c: too many arguments to function



Re: [PATCH v3 03/11] KVM: x86: retry non-page-table writing instruction

2011-09-13 Thread Avi Kivity

On 08/30/2011 05:35 AM, Xiao Guangrong wrote:

If the emulation is caused by a #PF and the instruction is not a page-table
write, the VM-EXIT was caused by shadow-page write protection, so we can zap
the shadow page and retry the instruction directly

The idea is from Avi


  int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len);
+bool page_table_writing_insn(struct x86_emulate_ctxt *ctxt);


Please use the usual x86_ prefix used in the emulator interface.


@@ -3720,10 +3721,18 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
kvm_mmu_commit_zap_page(vcpu->kvm,&invalid_list);
  }

+static bool is_mmio_page_fault(struct kvm_vcpu *vcpu, gva_t addr)
+{
+   if (vcpu->arch.mmu.direct_map || mmu_is_nested(vcpu))
+   return vcpu_match_mmio_gpa(vcpu, addr);
+
+   return vcpu_match_mmio_gva(vcpu, addr);
+}
+
  int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
   void *insn, int insn_len)
  {
-   int r;
+   int r, emulation_type = EMULTYPE_RETRY;
enum emulation_result er;

r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
@@ -3735,7 +3744,10 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, 
u32 error_code,
goto out;
}

-   er = x86_emulate_instruction(vcpu, cr2, 0, insn, insn_len);
+   if (is_mmio_page_fault(vcpu, cr2))
+   emulation_type = 0;
+
+   er = x86_emulate_instruction(vcpu, cr2, emulation_type, insn, insn_len);

switch (er) {
case EMULATE_DONE:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6b37f18..1afe59e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4814,6 +4814,50 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, 
gva_t gva)
return false;
  }

+static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
+ unsigned long cr2,  int emulation_type)
+{
+   if (!vcpu->arch.mmu.direct_map&&  !mmu_is_nested(vcpu))
+   gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);


If mmu_is_nested(), cr2 is an ngpa; we have to translate it to a gpa, no?

btw, I don't see mmu.direct_map initialized for nested npt?


+
+   kvm_mmu_unprotect_page(vcpu->kvm, gpa>>  PAGE_SHIFT);
+
+   return true;
+}
+
  int x86_emulate_instruction(struct kvm_vcpu *vcpu,
unsigned long cr2,
int emulation_type,
@@ -4855,6 +4899,9 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
return EMULATE_DONE;
}

+   if (retry_instruction(ctxt, cr2, emulation_type))
+   return EMULATE_DONE;
+
/* this is needed for vmware backdoor interface to work since it
   changes registers values  during IO operation */
if (vcpu->arch.emulate_regs_need_sync_from_vcpu) {



--
error compiling committee.c: too many arguments to function



Re: [PATCH 3/3] iommu/core: split mapping to page sizes as supported by the hardware

2011-09-13 Thread Roedel, Joerg
On Tue, Sep 13, 2011 at 06:34:23AM -0400, Ohad Ben-Cohen wrote:
> Hi Joerg,
> 
> On Tue, Sep 13, 2011 at 1:10 PM, Roedel, Joerg  wrote:
> > Please split this patch into the core-change and patches for the
> > individual iommu-drivers and post this as a seperate patch-set.
> 
> But we'll be breaking bisectability this way, no?

Not necessarily. You could implement this side-by-side with the old code
until all drivers are converted and remove the old code then. This keeps
bisectability.

> > Intel IOMMU does not support arbitrary page-sizes, afaik.
> 
> It does; besides the usual 4K it has "super page sizes" support of
> 2MB, 1GB, 512GB and 1TB.

But the value ~0xfffUL indicates support for 4k, 8k, 16k .. 2^63, no?

> 
> >> +       pr_debug("map: iova 0x%lx pa 0x%lx size 0x%lx\n", iova,
> >> +                                       (unsigned long)paddr, size);
> >
> > Please keep the debug-code in a seperate patch in your dev-tree. No need
> > for it to be merged upstream.
> 
> It's actually useful sometimes to have those around - it's off by
> default, and can be enabled only when needed (CONFIG_DYNAMIC_DEBUG).
> 
> But I don't mind removing them.

Ah right, it is just debug, so I am fine keeping it.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [PATCH 3/3] iommu/core: split mapping to page sizes as supported by the hardware

2011-09-13 Thread Ohad Ben-Cohen
Hi Joerg,

On Tue, Sep 13, 2011 at 1:10 PM, Roedel, Joerg  wrote:
> Please split this patch into the core-change and patches for the
> individual iommu-drivers and post this as a seperate patch-set.

But we'll be breaking bisectability this way, no?

> Intel IOMMU does not support arbitrary page-sizes, afaik.

It does; besides the usual 4K it has "super page sizes" support of
2MB, 1GB, 512GB and 1TB.

>> +       pr_debug("map: iova 0x%lx pa 0x%lx size 0x%lx\n", iova,
>> +                                       (unsigned long)paddr, size);
>
> Please keep the debug-code in a seperate patch in your dev-tree. No need
> for it to be merged upstream.

It's actually useful sometimes to have those around - it's off by
default, and can be enabled only when needed (CONFIG_DYNAMIC_DEBUG).

But I don't mind removing them.

Thanks,
Ohad.


Re: [PATCH v3 01/11] KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write

2011-09-13 Thread Xiao Guangrong
On 09/13/2011 05:51 PM, Avi Kivity wrote:
> On 08/30/2011 05:34 AM, Xiao Guangrong wrote:
>> kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
>> function when spte is prefetched, unfortunately, we can not know how many
>> spte need to be prefetched on this path, that means we can use out of the
>> free  pte_list_desc object in the cache, and BUG_ON() is triggered, also some
>> path does not fill the cache, such as INS instruction emulated that does not
>> trigger page fault
>>
>> @@ -3716,10 +3735,6 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t 
>> cr2, u32 error_code,
>>   goto out;
>>   }
>>
>> -r = mmu_topup_memory_caches(vcpu);
>> -if (r)
>> -goto out;
>> -
>>   er = x86_emulate_instruction(vcpu, cr2, 0, insn, insn_len);
>>
> 
> Suppose we are out of memory, can't this get us in an endless loop?
> 
> return -ENOMEM breaks as out (and kills the guest, likely).
> 

If memory is not enough, we just clear sptes on pte_write path(not prefetch 
spte),
the later page fault path can return -1 to let guest crash. Hmm?



Re: [PATCH 3/3] iommu/core: split mapping to page sizes as supported by the hardware

2011-09-13 Thread Roedel, Joerg
On Wed, Sep 07, 2011 at 02:53:24PM -0400, Ohad Ben-Cohen wrote:

>  drivers/iommu/amd_iommu.c   |   20 ++-
>  drivers/iommu/intel-iommu.c |   20 ++-
>  drivers/iommu/iommu.c   |  129 
> +++
>  drivers/iommu/msm_iommu.c   |8 ++-
>  drivers/iommu/omap-iommu.c  |6 ++-
>  drivers/iommu/omap-iovmm.c  |   12 +---
>  include/linux/iommu.h   |7 +-
>  virt/kvm/iommu.c|4 +-
>  8 files changed, 176 insertions(+), 30 deletions(-)

Please split this patch into the core-change and patches for the
individual iommu-drivers and post this as a separate patch-set.

> 
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index a14f8dc..5cdfa91 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -2488,12 +2488,30 @@ static unsigned device_dma_ops_init(void)
>  }
> 
>  /*
> + * This bitmap is used to advertise the page sizes our hardware support
> + * to the IOMMU core, which will then use this information to split
> + * physically contiguous memory regions it is mapping into page sizes
> + * that we support.
> + *
> + * Traditionally the IOMMU core just handed us the mappings directly,
> + * after making sure the size is an order of a 4KB page and that the
> + * mapping has natural alignment.
> + *
> + * To retain this behavior, we currently advertise that we support
> + * all page sizes that are an order of 4KB.
> + *
> + * If at some point we'd like to utilize the IOMMU core's new behavior,
> + * we could change this to advertise the real page sizes we support.
> + */
> +static unsigned long amd_iommu_pgsizes = ~0xFFFUL;
> +
> +/*
>   * The function which clues the AMD IOMMU driver into dma_ops.
>   */
> 
>  void __init amd_iommu_init_api(void)
>  {
> -   register_iommu(&amd_iommu_ops);
> +   register_iommu(&amd_iommu_ops, &amd_iommu_pgsizes, BITS_PER_LONG);
>  }
> 
>  int __init amd_iommu_init_dma_ops(void)
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index c621c98..a8c91a6 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3426,6 +3426,24 @@ static struct notifier_block device_nb = {
> .notifier_call = device_notifier,
>  };
> 
> +/*
> + * This bitmap is used to advertise the page sizes our hardware support
> + * to the IOMMU core, which will then use this information to split
> + * physically contiguous memory regions it is mapping into page sizes
> + * that we support.
> + *
> + * Traditionally the IOMMU core just handed us the mappings directly,
> + * after making sure the size is an order of a 4KB page and that the
> + * mapping has natural alignment.
> + *
> + * To retain this behavior, we currently advertise that we support
> + * all page sizes that are an order of 4KB.
> + *
> + * If at some point we'd like to utilize the IOMMU core's new behavior,
> + * we could change this to advertise the real page sizes we support.
> + */
> +static unsigned long intel_iommu_pgsizes = ~0xFFFUL;

Intel IOMMU does not support arbitrary page-sizes, afaik.

> +
>  int __init intel_iommu_init(void)
>  {
> int ret = 0;
> @@ -3486,7 +3504,7 @@ int __init intel_iommu_init(void)
> 
> init_iommu_pm_ops();
> 
> -   register_iommu(&intel_iommu_ops);
> +   register_iommu(&intel_iommu_ops, &intel_iommu_pgsizes, BITS_PER_LONG);
> 
> bus_register_notifier(&pci_bus_type, &device_nb);
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index c68ff29..e07ea03 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -16,6 +16,8 @@
>   * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
>   */
> 
> +#define pr_fmt(fmt)"%s: " fmt, __func__
> +
>  #include 
>  #include 
>  #include 
> @@ -23,15 +25,41 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  static struct iommu_ops *iommu_ops;
> 
> -void register_iommu(struct iommu_ops *ops)
> +/* bitmap of supported page sizes */
> +static unsigned long *iommu_pgsize_bitmap;
> +
> +/* number of bits used to represent the supported pages */
> +static unsigned int iommu_nr_page_bits;
> +
> +/* size of the smallest supported page (in bytes) */
> +static unsigned int iommu_min_pagesz;
> +
> +/* bit number of the smallest supported page */
> +static unsigned int iommu_min_page_idx;
> +
> +/**
> + * register_iommu() - register an IOMMU hardware
> + * @ops: iommu handlers
> + * @pgsize_bitmap: bitmap of page sizes supported by the hardware
> + * @nr_page_bits: size of @pgsize_bitmap (in bits)
> + */
> +void register_iommu(struct iommu_ops *ops, unsigned long *pgsize_bitmap,
> +   unsigned int nr_page_bits)
>  {
> -   if (iommu_ops)
> +   if (iommu_ops || iommu_pgsize_bitmap || !nr_page_bits)
> BUG();
> 
> iommu_ops = ops;
> +   iommu_pgsize_bitmap = pgsize_bitmap;
> +   iommu_nr_page_bits = nr_page_bits;
> +
> +   /* find the minimum 

Re: [Qemu-devel] About hotplug multifunction

2011-09-13 Thread Gleb Natapov
On Tue, Sep 13, 2011 at 01:05:00PM +0300, Michael S. Tsirkin wrote:
> On Tue, Sep 13, 2011 at 09:52:49AM +0300, Gleb Natapov wrote:
> > On Tue, Sep 13, 2011 at 02:57:20PM +0900, Isaku Yamahata wrote:
> > > On Sun, Sep 11, 2011 at 12:05:17PM +0300, Michael S. Tsirkin wrote:
> > > > On Sat, Sep 10, 2011 at 02:43:11AM +0900, Isaku Yamahata wrote:
> > > > > pci/pcie hot plug needs clean up for multifunction hotplug in long 
> > > > > term.
> > > > > Only single function device case works. Multifunction case is broken 
> > > > > somewhat.
> > > > > Especially the current acpi based hotplug should be replaced by
> > > > > the standardized hot plug controller in long term.
> > > > 
> > > > We'll need to keep supporting windows XP, which IIUC only
> > > > supports hotplug through ACPI. So it looks like we'll
> > > > need both.
> > > 
> > > Yes, we'll need both then.
> > > It would be possible to implement acpi-based hotplug with
> > > standardized hotplug controller. Not with qemu-specific controller.
> > > 
> > Where is this "standardized hotplug controller" documented?
> 
> Sorry both pci bridge and hotplug spec only reference shpc.
> The spec itself is PCI Standard Hot-Plug
> Controller and Subsystem Specification.
> 
> Revision 1.0 - get it from pcisig
> 
Thanks, Isaku already pointed it to me.

--
Gleb.


Re: [Qemu-devel] About hotplug multifunction

2011-09-13 Thread Michael S. Tsirkin
On Tue, Sep 13, 2011 at 09:52:49AM +0300, Gleb Natapov wrote:
> On Tue, Sep 13, 2011 at 02:57:20PM +0900, Isaku Yamahata wrote:
> > On Sun, Sep 11, 2011 at 12:05:17PM +0300, Michael S. Tsirkin wrote:
> > > On Sat, Sep 10, 2011 at 02:43:11AM +0900, Isaku Yamahata wrote:
> > > > pci/pcie hot plug needs clean up for multifunction hotplug in long term.
> > > > Only single function device case works. Multifunction case is broken 
> > > > somewhat.
> > > > Especially the current acpi based hotplug should be replaced by
> > > > the standardized hot plug controller in long term.
> > > 
> > > We'll need to keep supporting windows XP, which IIUC only
> > > supports hotplug through ACPI. So it looks like we'll
> > > need both.
> > 
> > Yes, we'll need both then.
> > It would be possible to implement acpi-based hotplug with
> > standardized hotplug controller. Not with qemu-specific controller.
> > 
> Where is this "standardized hotplug controller" documented?

Sorry both pci bridge and hotplug spec only reference shpc.
The spec itself is PCI Standard Hot-Plug
Controller and Subsystem Specification.

Revision 1.0 - get it from pcisig

> > It would require a bit amount of work to write ACPI code in DSDT that
> > handles standardized hotplug controller.
> > So I'm not sure it's worth while only for windows XP support.
> > -- 
> > yamahata
> 
> --
>   Gleb.


Re: [Qemu-devel] About hotplug multifunction

2011-09-13 Thread Michael S. Tsirkin
On Tue, Sep 13, 2011 at 09:52:49AM +0300, Gleb Natapov wrote:
> On Tue, Sep 13, 2011 at 02:57:20PM +0900, Isaku Yamahata wrote:
> > On Sun, Sep 11, 2011 at 12:05:17PM +0300, Michael S. Tsirkin wrote:
> > > On Sat, Sep 10, 2011 at 02:43:11AM +0900, Isaku Yamahata wrote:
> > > > pci/pcie hot plug needs clean up for multifunction hotplug in long term.
> > > > Only single function device case works. Multifunction case is broken 
> > > > somewhat.
> > > > Especially the current acpi based hotplug should be replaced by
> > > > the standardized hot plug controller in long term.
> > > 
> > > We'll need to keep supporting windows XP, which IIUC only
> > > supports hotplug through ACPI. So it looks like we'll
> > > need both.
> > 
> > Yes, we'll need both then.
> > It would be possible to implement acpi-based hotplug with
> > standardized hotplug controller. Not with qemu-specific controller.
> > 
> Where is this "standardized hotplug controller" documented?

In the pci bridge spec.

> > It would require a bit amount of work to write ACPI code in DSDT that
> > handles standardized hotplug controller.
> > So I'm not sure it's worth while only for windows XP support.
> > -- 
> > yamahata
> 
> --
>   Gleb.


Re: [PATCH v3 01/11] KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write

2011-09-13 Thread Avi Kivity

On 08/30/2011 05:34 AM, Xiao Guangrong wrote:

kvm_mmu_pte_write is unsafe since we need to alloc a pte_list_desc in the
function when an spte is prefetched; unfortunately, we cannot know how many
sptes need to be prefetched on this path, which means we can run out of free
pte_list_desc objects in the cache and trigger the BUG_ON().  Also, some paths
do not fill the cache, such as an emulated INS instruction that does not
trigger a page fault

@@ -3716,10 +3735,6 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, 
u32 error_code,
goto out;
}

-   r = mmu_topup_memory_caches(vcpu);
-   if (r)
-   goto out;
-
er = x86_emulate_instruction(vcpu, cr2, 0, insn, insn_len);



Suppose we are out of memory, can't this get us in an endless loop?

return -ENOMEM breaks us out (and kills the guest, likely).

--
error compiling committee.c: too many arguments to function



[PATCH kvm-unit-tests] emulator: test shld, shrd

2011-09-13 Thread Avi Kivity
Used to be broken in multiple ways.

Signed-off-by: Avi Kivity 
---
 x86/emulator.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/x86/emulator.c b/x86/emulator.c
index d4f0363..73079f8 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -642,6 +642,16 @@ static void test_rip_relative(unsigned *mem, char 
*insn_ram)
 report("movb $imm, 0(%rip)", *mem == 0x1);
 }
 
+static void test_shld_shrd(u32 *mem)
+{
+*mem = 0x12345678;
+asm("shld %2, %1, %0" : "+m"(*mem) : "r"(0xU), "c"((u8)3));
+report("shld (cl)", *mem == ((0x12345678 << 3) | 5));
+*mem = 0x12345678;
+asm("shrd %2, %1, %0" : "+m"(*mem) : "r"(0xU), "c"((u8)3));
+report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
+}
+
 int main()
 {
void *mem;
@@ -684,6 +694,7 @@ int main()
test_div(mem);
test_sse(mem);
test_rip_relative(mem, insn_ram);
+   test_shld_shrd(mem);
 
printf("\nSUMMARY: %d tests, %d failures\n", tests, fails);
return fails ? 1 : 0;
-- 
1.7.6.1
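
For reading the expected values: shld shifts the destination left and fills the
vacated low bits from the top bits of the source register, while shrd shifts
right and fills the vacated high bits from the bottom bits of the source, so
the "| 5" and "| (5u << 29)" checks imply a source whose top three and bottom
three bits are both 101b. A pure-C model of the 32-bit semantics, valid for
shift counts strictly between 0 and 31; 0xa000000d is just one value satisfying
that constraint:

#include <assert.h>
#include <stdint.h>

static uint32_t shld32(uint32_t dst, uint32_t src, unsigned int cnt)
{
    return (dst << cnt) | (src >> (32 - cnt));
}

static uint32_t shrd32(uint32_t dst, uint32_t src, unsigned int cnt)
{
    return (dst >> cnt) | (src << (32 - cnt));
}

int main(void)
{
    /* 0xa000000d has 101b in both its top and bottom three bits. */
    assert(shld32(0x12345678, 0xa000000d, 3) == ((0x12345678u << 3) | 5));
    assert(shrd32(0x12345678, 0xa000000d, 3) == ((0x12345678u >> 3) | (5u << 29)));
    return 0;
}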



[PATCH 09/14] KVM: x86 emulator: qualify OpReg inhibit_byte_regs hack

2011-09-13 Thread Avi Kivity
OpReg decoding has a hack that inhibits byte registers for movsx and movzx
instructions.  It should be replaced by something better, but meanwhile,
qualify that the hack is only active for the destination operand.

Note these instructions only use OpReg for the destination, but better to
be explicit about it.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a0d6ceb..17a8910 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3346,6 +3346,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
switch (d) {
case OpReg:
decode_register_operand(ctxt, op,
+op == &ctxt->dst &&
 ctxt->twobyte && (ctxt->b == 0xb6 || ctxt->b == 0xb7));
break;
case OpImmUByte:
-- 
1.7.6.1



[PATCH 04/14] KVM: x86 emulator: split dst decode to a generic decode_operand()

2011-09-13 Thread Avi Kivity
Instead of decoding each operand using its own code, use a generic
function.  Start with the destination operand.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |  146 ---
 1 files changed, 87 insertions(+), 59 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 58172fb..6a6aed9 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -29,6 +29,22 @@
 #include "tss.h"
 
 /*
+ * Operand types
+ */
+#define OpNone 0
+#define OpImplicit 1  /* No generic decode */
+#define OpReg  2  /* Register */
+#define OpMem  3  /* Memory */
+#define OpAcc  4  /* Accumulator: AL/AX/EAX/RAX */
+#define OpDI   5  /* ES:DI/EDI/RDI */
+#define OpMem64    6  /* Memory, 64-bit */
+#define OpImmUByte 7  /* Zero-extended 8-bit immediate */
+#define OpDX   8  /* DX register */
+
+#define OpBits 4  /* Width of operand field */
+#define OpMask ((1 << OpBits) - 1)
+
+/*
  * Opcode effective-address decode tables.
  * Note that we only emulate instructions that have at least one memory
  * operand (excluding implicit stack references). We assume that stack
@@ -40,15 +56,16 @@
 /* Operand sizes: 8-bit operands or specified/overridden size. */
 #define ByteOp  (1<<0) /* 8-bit operands. */
 /* Destination operand type. */
-#define ImplicitOps (1<<1) /* Implicit in opcode. No generic decode. */
-#define DstReg  (2<<1) /* Register operand. */
-#define DstMem  (3<<1) /* Memory operand. */
-#define DstAcc  (4<<1) /* Destination Accumulator */
-#define DstDI   (5<<1) /* Destination is in ES:(E)DI */
-#define DstMem64(6<<1) /* 64bit memory operand */
-#define DstImmUByte (7<<1) /* 8-bit unsigned immediate operand */
-#define DstDX   (8<<1) /* Destination is in DX register */
-#define DstMask (0xf<<1)
+#define DstShift    1
+#define ImplicitOps (OpImplicit << DstShift)
+#define DstReg  (OpReg << DstShift)
+#define DstMem  (OpMem << DstShift)
+#define DstAcc  (OpAcc << DstShift)
+#define DstDI   (OpDI << DstShift)
+#define DstMem64(OpMem64 << DstShift)
+#define DstImmUByte (OpImmUByte << DstShift)
+#define DstDX   (OpDX << DstShift)
+#define DstMask (OpMask << DstShift)
 /* Source operand type. */
 #define SrcNone (0<<5) /* No source operand. */
 #define SrcReg  (1<<5) /* Register operand. */
@@ -3316,6 +,66 @@ static int decode_imm(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
return rc;
 }
 
+static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op,
+ unsigned d)
+{
+   int rc = X86EMUL_CONTINUE;
+
+   switch (d) {
+   case OpReg:
+   decode_register_operand(ctxt, op,
+ctxt->twobyte && (ctxt->b == 0xb6 || ctxt->b == 0xb7));
+   break;
+   case OpImmUByte:
+   op->type = OP_IMM;
+   op->addr.mem.ea = ctxt->_eip;
+   op->bytes = 1;
+   op->val = insn_fetch(u8, ctxt);
+   break;
+   case OpMem:
+   case OpMem64:
+   *op = ctxt->memop;
+   ctxt->memopp = op;
+   if (d == OpMem64)
+   op->bytes = 8;
+   else
+   op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
+   if (ctxt->d & BitOp)
+   fetch_bit_operand(ctxt);
+   op->orig_val = op->val;
+   break;
+   case OpAcc:
+   op->type = OP_REG;
+   op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
+   op->addr.reg = &ctxt->regs[VCPU_REGS_RAX];
+   fetch_register_operand(op);
+   op->orig_val = op->val;
+   break;
+   case OpDI:
+   op->type = OP_MEM;
+   op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
+   op->addr.mem.ea =
+   register_address(ctxt, ctxt->regs[VCPU_REGS_RDI]);
+   op->addr.mem.seg = VCPU_SREG_ES;
+   op->val = 0;
+   break;
+   case OpDX:
+   op->type = OP_REG;
+   op->bytes = 2;
+   op->addr.reg = &ctxt->regs[VCPU_REGS_RDX];
+   fetch_register_operand(op);
+   break;
+   case OpImplicit:
+   /* Special instructions do their own operand decoding. */
+   default:
+   op->type = OP_NONE; /* Disable writeback. */
+   break;
+   }
+
+done:
+   return rc;
+}
+
 int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 {
int rc = X86EMUL_CONTINUE;
@@ -3602,56 +3679,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
goto done;
 
/* Decode and fetch the destination oper

[PATCH 08/14] KVM: x86 emulator: switch OpImmUByte decode to decode_imm()

2011-09-13 Thread Avi Kivity
Similar to SrcImmUByte.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |5 +
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 00e0904..a0d6ceb 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3349,10 +3349,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
 ctxt->twobyte && (ctxt->b == 0xb6 || ctxt->b == 0xb7));
break;
case OpImmUByte:
-   op->type = OP_IMM;
-   op->addr.mem.ea = ctxt->_eip;
-   op->bytes = 1;
-   op->val = insn_fetch(u8, ctxt);
+   rc = decode_imm(ctxt, op, 1, false);
break;
case OpMem:
case OpMem64:
-- 
1.7.6.1



[PATCH 05/14] KVM: x86 emulator: expand decode flags to 64 bits

2011-09-13 Thread Avi Kivity
Unifying the operands means not taking advantage of the fact that some
operand types can only go into certain operands (for example, DI can only
be used by the destination), so we need more bits to hold the operand type.

Signed-off-by: Avi Kivity 
---
 arch/x86/include/asm/kvm_emulate.h |2 +-
 arch/x86/kvm/emulate.c |   38 ++--
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index 56bac3e..a026507 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -262,7 +262,7 @@ struct x86_emulate_ctxt {
struct operand dst;
bool has_seg_override;
u8 seg_override;
-   unsigned int d;
+   u64 d;
int (*execute)(struct x86_emulate_ctxt *ctxt);
int (*check_perm)(struct x86_emulate_ctxt *ctxt);
/* modrm */
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 6a6aed9..8c65ff2 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -31,18 +31,18 @@
 /*
  * Operand types
  */
-#define OpNone 0
-#define OpImplicit 1  /* No generic decode */
-#define OpReg  2  /* Register */
-#define OpMem  3  /* Memory */
-#define OpAcc  4  /* Accumulator: AL/AX/EAX/RAX */
-#define OpDI   5  /* ES:DI/EDI/RDI */
-#define OpMem64    6  /* Memory, 64-bit */
-#define OpImmUByte 7  /* Zero-extended 8-bit immediate */
-#define OpDX   8  /* DX register */
+#define OpNone 0ull
+#define OpImplicit 1ull  /* No generic decode */
+#define OpReg  2ull  /* Register */
+#define OpMem  3ull  /* Memory */
+#define OpAcc  4ull  /* Accumulator: AL/AX/EAX/RAX */
+#define OpDI   5ull  /* ES:DI/EDI/RDI */
+#define OpMem64    6ull  /* Memory, 64-bit */
+#define OpImmUByte 7ull  /* Zero-extended 8-bit immediate */
+#define OpDX   8ull  /* DX register */
 
 #define OpBits 4  /* Width of operand field */
-#define OpMask ((1 << OpBits) - 1)
+#define OpMask ((1ull << OpBits) - 1)
 
 /*
  * Opcode effective-address decode tables.
@@ -108,12 +108,12 @@
 #define Priv(1<<27) /* instruction generates #GP if current CPL != 0 */
 #define No64   (1<<28)
 /* Source 2 operand type */
-#define Src2None(0<<29)
-#define Src2CL  (1<<29)
-#define Src2ImmByte (2<<29)
-#define Src2One (3<<29)
-#define Src2Imm (4<<29)
-#define Src2Mask(7<<29)
+#define Src2None(0u<<29)
+#define Src2CL  (1u<<29)
+#define Src2ImmByte (2u<<29)
+#define Src2One (3u<<29)
+#define Src2Imm (4u<<29)
+#define Src2Mask(7u<<29)
 
 #define X2(x...) x, x
 #define X3(x...) X2(x), x
@@ -125,8 +125,8 @@
 #define X16(x...) X8(x), X8(x)
 
 struct opcode {
-   u32 flags;
-   u8 intercept;
+   u64 flags : 56;
+   u64 intercept : 8;
union {
int (*execute)(struct x86_emulate_ctxt *ctxt);
struct opcode *group;
@@ -3530,7 +3530,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
return EMULATION_FAILED;
}
 
-   ctxt->d &= ~GroupMask;
+   ctxt->d &= ~(u64)GroupMask;
ctxt->d |= opcode.flags;
}
 
-- 
1.7.6.1
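
Taken together, patches 04, 05 and 10 converge on a simple scheme: each operand
slot is an OpBits-wide field packed into the flags word at a fixed shift and
pulled back out with OpMask. A standalone sketch using the DstShift=1,
SrcShift=6, OpBits=5 layout the series ends up with; the names below are
illustrative, not the kernel's:

#include <stdint.h>
#include <stdio.h>

enum { OpNone, OpImplicit, OpReg, OpMem, OpAcc, OpDI };

#define OP_BITS    5
#define OP_MASK    ((1ull << OP_BITS) - 1)
#define DST_SHIFT  1
#define SRC_SHIFT  (DST_SHIFT + OP_BITS)    /* = 6, as in the later patches */

#define DST(op)    ((uint64_t)(op) << DST_SHIFT)
#define SRC(op)    ((uint64_t)(op) << SRC_SHIFT)

int main(void)
{
    uint64_t d = DST(OpReg) | SRC(OpMem);   /* "DstReg | SrcMem" */

    printf("dst type = %llu, src type = %llu\n",
           (unsigned long long)((d >> DST_SHIFT) & OP_MASK),
           (unsigned long long)((d >> SRC_SHIFT) & OP_MASK));
    return 0;
}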



[PATCH 11/14] KVM: x86 emulator: simplify OpMem64 decode

2011-09-13 Thread Avi Kivity
Use the same technique as the other OpMem variants, and goto mem_common.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |9 -
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index e46809b..1c95935 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3361,11 +3361,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
rc = decode_imm(ctxt, op, 1, false);
break;
case OpMem:
-   case OpMem64:
-   if (d == OpMem64)
-   ctxt->memop.bytes = 8;
-   else
-   ctxt->memop.bytes = (ctxt->d & ByteOp) ? 1 : 
ctxt->op_bytes;
+   ctxt->memop.bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
mem_common:
*op = ctxt->memop;
ctxt->memopp = op;
@@ -3373,6 +3369,9 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
fetch_bit_operand(ctxt);
op->orig_val = op->val;
break;
+   case OpMem64:
+   ctxt->memop.bytes = 8;
+   goto mem_common;
case OpAcc:
op->type = OP_REG;
op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
-- 
1.7.6.1



[PATCH 07/14] KVM: x86 emulator: free up some flag bits near src, dst

2011-09-13 Thread Avi Kivity
Op fields are going to grow by a bit, we need two free bits.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 88d32fc..00e0904 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -88,10 +88,6 @@
 #define SrcImmU16   (0xe<<5)/* Immediate operand, unsigned, 16 bits */
 #define SrcDX   (0xf<<5)   /* Source is in DX register */
 #define SrcMask (0xf<<5)
-/* Generic ModRM decode. */
-#define ModRM   (1<<9)
-/* Destination is only written; never read. */
-#define Mov (1<<10)
 #define BitOp   (1<<11)
 #define MemAbs  (1<<12)  /* Memory operand is absolute displacement */
 #define String  (1<<13) /* String instruction (rep capable) */
@@ -102,6 +98,10 @@
 #define Prefix  (3<<15) /* Instruction varies with 66/f2/f3 prefix */
 #define RMExt   (4<<15) /* Opcode extension in ModRM r/m if mod == 3 */
 #define Sse (1<<18) /* SSE Vector instruction */
+/* Generic ModRM decode. */
+#define ModRM   (1<<19)
+/* Destination is only written; never read. */
+#define Mov (1<<20)
 /* Misc flags */
 #define Prot(1<<21) /* instruction generates #UD if not in prot-mode */
 #define VendorSpecific (1<<22) /* Vendor specific instruction */
-- 
1.7.6.1



[PATCH 13/14] KVM: x86 emulator: switch lds/les/lss/lfs/lgs to direct decode

2011-09-13 Thread Avi Kivity
Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |   22 +++---
 1 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index ab48611..bd3e488 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1828,8 +1828,9 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt)
return rc;
 }
 
-static int emulate_load_segment(struct x86_emulate_ctxt *ctxt, int seg)
+static int em_lseg(struct x86_emulate_ctxt *ctxt)
 {
+   int seg = ctxt->src2.val;
unsigned short sel;
int rc;
 
@@ -3193,8 +3194,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
D2bv(DstMem | SrcImmByte | ModRM),
I(ImplicitOps | Stack | SrcImmU16, em_ret_near_imm),
I(ImplicitOps | Stack, em_ret),
-   D(DstReg | SrcMemFAddr | ModRM | No64 | Src2ES),
-   D(DstReg | SrcMemFAddr | ModRM | No64 | Src2DS),
+   I(DstReg | SrcMemFAddr | ModRM | No64 | Src2ES, em_lseg),
+   I(DstReg | SrcMemFAddr | ModRM | No64 | Src2DS, em_lseg),
G(ByteOp, group11), G(0, group11),
/* 0xC8 - 0xCF */
N, N, N, I(ImplicitOps | Stack, em_ret_far),
@@ -3281,10 +3282,10 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
/* 0xB0 - 0xB7 */
D2bv(DstMem | SrcReg | ModRM | Lock),
-   D(DstReg | SrcMemFAddr | ModRM | Src2SS),
+   I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
D(DstMem | SrcReg | ModRM | BitOp | Lock),
-   D(DstReg | SrcMemFAddr | ModRM | Src2FS),
-   D(DstReg | SrcMemFAddr | ModRM | Src2GS),
+   I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg),
+   I(DstReg | SrcMemFAddr | ModRM | Src2GS, em_lseg),
D(ByteOp | DstReg | SrcMem | ModRM | Mov), D(DstReg | SrcMem16 | ModRM 
| Mov),
/* 0xB8 - 0xBF */
N, N,
@@ -3893,10 +3894,6 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
case 0xc0 ... 0xc1:
rc = em_grp2(ctxt);
break;
-   case 0xc4:  /* les */
-   case 0xc5:  /* lds */
-   rc = emulate_load_segment(ctxt, ctxt->src2.val);
-   break;
case 0xcc:  /* int3 */
rc = emulate_int(ctxt, 3);
break;
@@ -4146,11 +4143,6 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
ctxt->dst.addr.reg = (unsigned long 
*)&ctxt->regs[VCPU_REGS_RAX];
}
break;
-   case 0xb2:  /* lss */
-   case 0xb4:  /* lfs */
-   case 0xb5:  /* lgs */
-   rc = emulate_load_segment(ctxt, ctxt->src2.val);
-   break;
case 0xb3:
  btr:  /* btr */
emulate_2op_SrcV_nobyte(ctxt, "btr");
-- 
1.7.6.1



[PATCH 10/14] KVM: x86 emulator: switch src decode to decode_operand()

2011-09-13 Thread Avi Kivity
Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |  156 +++
 1 files changed, 63 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 17a8910..e46809b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -44,8 +44,15 @@
 #define OpImmByte 10ull  /* 8-bit sign extended immediate */
 #define OpOne 11ull  /* Implied 1 */
 #define OpImm 12ull  /* Sign extended immediate */
-
-#define OpBits 4  /* Width of operand field */
+#define OpMem16   13ull  /* Memory operand (16-bit). */
+#define OpMem32   14ull  /* Memory operand (32-bit). */
+#define OpImmU    15ull  /* Immediate operand, zero extended */
+#define OpSI      16ull  /* SI/ESI/RSI */
+#define OpImmFAddr 17ull  /* Immediate far address */
+#define OpMemFAddr 18ull  /* Far address in memory */
+#define OpImmU16  19ull  /* Immediate operand, 16 bits, zero extended */
+
+#define OpBits 5  /* Width of operand field */
 #define OpMask ((1ull << OpBits) - 1)
 
 /*
@@ -71,23 +78,24 @@
 #define DstDX   (OpDX << DstShift)
 #define DstMask (OpMask << DstShift)
 /* Source operand type. */
-#define SrcNone (0<<5) /* No source operand. */
-#define SrcReg  (1<<5) /* Register operand. */
-#define SrcMem  (2<<5) /* Memory operand. */
-#define SrcMem16(3<<5) /* Memory operand (16-bit). */
-#define SrcMem32(4<<5) /* Memory operand (32-bit). */
-#define SrcImm  (5<<5) /* Immediate operand. */
-#define SrcImmByte  (6<<5) /* 8-bit sign-extended immediate operand. */
-#define SrcOne  (7<<5) /* Implied '1' */
-#define SrcImmUByte (8<<5)  /* 8-bit unsigned immediate operand. */
-#define SrcImmU (9<<5)  /* Immediate operand, unsigned */
-#define SrcSI   (0xa<<5)   /* Source is in the DS:RSI */
-#define SrcImmFAddr (0xb<<5)   /* Source is immediate far address */
-#define SrcMemFAddr (0xc<<5)   /* Source is far address in memory */
-#define SrcAcc  (0xd<<5)   /* Source Accumulator */
-#define SrcImmU16   (0xe<<5)/* Immediate operand, unsigned, 16 bits */
-#define SrcDX   (0xf<<5)   /* Source is in DX register */
-#define SrcMask (0xf<<5)
+#define SrcShift    6
+#define SrcNone (OpNone << SrcShift)
+#define SrcReg  (OpReg << SrcShift)
+#define SrcMem  (OpMem << SrcShift)
+#define SrcMem16(OpMem16 << SrcShift)
+#define SrcMem32(OpMem32 << SrcShift)
+#define SrcImm  (OpImm << SrcShift)
+#define SrcImmByte  (OpImmByte << SrcShift)
+#define SrcOne  (OpOne << SrcShift)
+#define SrcImmUByte (OpImmUByte << SrcShift)
+#define SrcImmU (OpImmU << SrcShift)
+#define SrcSI   (OpSI << SrcShift)
+#define SrcImmFAddr (OpImmFAddr << SrcShift)
+#define SrcMemFAddr (OpMemFAddr << SrcShift)
+#define SrcAcc  (OpAcc << SrcShift)
+#define SrcImmU16   (OpImmU16 << SrcShift)
+#define SrcDX   (OpDX << SrcShift)
+#define SrcMask (OpMask << SrcShift)
 #define BitOp   (1<<11)
 #define MemAbs  (1<<12)  /* Memory operand is absolute displacement */
 #define String  (1<<13) /* String instruction (rep capable) */
@@ -3354,13 +3362,14 @@ static int decode_operand(struct x86_emulate_ctxt 
*ctxt, struct operand *op,
break;
case OpMem:
case OpMem64:
-   *op = ctxt->memop;
-   ctxt->memopp = op;
if (d == OpMem64)
-   op->bytes = 8;
+   ctxt->memop.bytes = 8;
else
-   op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
-   if (ctxt->d & BitOp)
+   ctxt->memop.bytes = (ctxt->d & ByteOp) ? 1 : 
ctxt->op_bytes;
+   mem_common:
+   *op = ctxt->memop;
+   ctxt->memopp = op;
+   if ((ctxt->d & BitOp) && op == &ctxt->dst)
fetch_bit_operand(ctxt);
op->orig_val = op->val;
break;
@@ -3399,6 +3408,35 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
case OpImm:
rc = decode_imm(ctxt, op, imm_size(ctxt), true);
break;
+   case OpMem16:
+   ctxt->memop.bytes = 2;
+   goto mem_common;
+   case OpMem32:
+   ctxt->memop.bytes = 4;
+   goto mem_common;
+   case OpImmU16:
+   rc = decode_imm(ctxt, op, 2, false);
+   break;
+   case OpImmU:
+   rc = decode_imm(ctxt, op, imm_size(ctxt), false);
+   break;
+   case OpSI:
+   op->type = OP_MEM;
+   op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
+   op->addr.mem.ea =
+   register_address(ctxt, ctxt->regs[VCPU_REGS_RSI]);
+   op->addr.mem.seg = seg_override(ctxt);
+  

[PATCH 12/14] KVM: x86 emulator: streamline decode of segment registers

2011-09-13 Thread Avi Kivity
The opcodes

  push %seg
  pop %seg
  l%seg, %mem, %reg  (e.g. lds/les/lss/lfs/lgs)

all have a segment register encoded in the instruction.  To allow reuse,
decode the segment number into src2 during the decode stage instead of the
execution stage.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |   99 +++
 1 files changed, 57 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 1c95935..ab48611 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -51,6 +51,12 @@
 #define OpImmFAddr 17ull  /* Immediate far address */
 #define OpMemFAddr 18ull  /* Far address in memory */
 #define OpImmU16  19ull  /* Immediate operand, 16 bits, zero extended */
+#define OpES  20ull  /* ES */
+#define OpCS  21ull  /* CS */
+#define OpSS  22ull  /* SS */
+#define OpDS  23ull  /* DS */
+#define OpFS  24ull  /* FS */
+#define OpGS  25ull  /* GS */
 
 #define OpBits 5  /* Width of operand field */
 #define OpMask ((1ull << OpBits) - 1)
@@ -126,6 +132,12 @@
 #define Src2ImmByte (OpImmByte << Src2Shift)
 #define Src2One (OpOne << Src2Shift)
 #define Src2Imm (OpImm << Src2Shift)
+#define Src2ES  (OpES << Src2Shift)
+#define Src2CS  (OpCS << Src2Shift)
+#define Src2SS  (OpSS << Src2Shift)
+#define Src2DS  (OpDS << Src2Shift)
+#define Src2FS  (OpFS << Src2Shift)
+#define Src2GS  (OpGS << Src2Shift)
 #define Src2Mask(OpMask << Src2Shift)
 
 #define X2(x...) x, x
@@ -3101,16 +3113,19 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 static struct opcode opcode_table[256] = {
/* 0x00 - 0x07 */
I6ALU(Lock, em_add),
-   D(ImplicitOps | Stack | No64), D(ImplicitOps | Stack | No64),
+   D(ImplicitOps | Stack | No64 | Src2ES),
+   D(ImplicitOps | Stack | No64 | Src2ES),
/* 0x08 - 0x0F */
I6ALU(Lock, em_or),
-   D(ImplicitOps | Stack | No64), N,
+   D(ImplicitOps | Stack | No64 | Src2CS), N,
/* 0x10 - 0x17 */
I6ALU(Lock, em_adc),
-   D(ImplicitOps | Stack | No64), D(ImplicitOps | Stack | No64),
+   D(ImplicitOps | Stack | No64 | Src2SS),
+   D(ImplicitOps | Stack | No64 | Src2SS),
/* 0x18 - 0x1F */
I6ALU(Lock, em_sbb),
-   D(ImplicitOps | Stack | No64), D(ImplicitOps | Stack | No64),
+   D(ImplicitOps | Stack | No64 | Src2DS),
+   D(ImplicitOps | Stack | No64 | Src2DS),
/* 0x20 - 0x27 */
I6ALU(Lock, em_and), N, N,
/* 0x28 - 0x2F */
@@ -3178,7 +3193,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
D2bv(DstMem | SrcImmByte | ModRM),
I(ImplicitOps | Stack | SrcImmU16, em_ret_near_imm),
I(ImplicitOps | Stack, em_ret),
-   D(DstReg | SrcMemFAddr | ModRM | No64), D(DstReg | SrcMemFAddr | ModRM 
| No64),
+   D(DstReg | SrcMemFAddr | ModRM | No64 | Src2ES),
+   D(DstReg | SrcMemFAddr | ModRM | No64 | Src2DS),
G(ByteOp, group11), G(0, group11),
/* 0xC8 - 0xCF */
N, N, N, I(ImplicitOps | Stack, em_ret_far),
@@ -3253,20 +3269,22 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
/* 0x90 - 0x9F */
X16(D(ByteOp | DstMem | SrcNone | ModRM| Mov)),
/* 0xA0 - 0xA7 */
-   D(ImplicitOps | Stack), D(ImplicitOps | Stack),
+   D(Stack | Src2FS), D(Stack | Src2FS),
DI(ImplicitOps, cpuid), D(DstMem | SrcReg | ModRM | BitOp),
D(DstMem | SrcReg | Src2ImmByte | ModRM),
D(DstMem | SrcReg | Src2CL | ModRM), N, N,
/* 0xA8 - 0xAF */
-   D(ImplicitOps | Stack), D(ImplicitOps | Stack),
+   D(Stack | Src2GS), D(Stack | Src2GS),
DI(ImplicitOps, rsm), D(DstMem | SrcReg | ModRM | BitOp | Lock),
D(DstMem | SrcReg | Src2ImmByte | ModRM),
D(DstMem | SrcReg | Src2CL | ModRM),
D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
/* 0xB0 - 0xB7 */
D2bv(DstMem | SrcReg | ModRM | Lock),
-   D(DstReg | SrcMemFAddr | ModRM), D(DstMem | SrcReg | ModRM | BitOp | 
Lock),
-   D(DstReg | SrcMemFAddr | ModRM), D(DstReg | SrcMemFAddr | ModRM),
+   D(DstReg | SrcMemFAddr | ModRM | Src2SS),
+   D(DstMem | SrcReg | ModRM | BitOp | Lock),
+   D(DstReg | SrcMemFAddr | ModRM | Src2FS),
+   D(DstReg | SrcMemFAddr | ModRM | Src2GS),
D(ByteOp | DstReg | SrcMem | ModRM | Mov), D(DstReg | SrcMem16 | ModRM 
| Mov),
/* 0xB8 - 0xBF */
N, N,
@@ -3436,6 +3454,24 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
case OpMemFAddr:
ctxt->memop.bytes = ctxt->op_bytes + 2;
goto mem_common;
+   case OpES:
+   op->val = VCPU_SREG_ES;
+   break;
+   case OpCS:
+   op->val = VCPU_SREG_CS;
+   break;
+   case OpSS:
+   op->val = VCPU_SRE

[PATCH 14/14] KVM: x86 emulator: convert push %sreg/pop %sreg to direct decode

2011-09-13 Thread Avi Kivity
Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |   44 +++-
 1 files changed, 15 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index bd3e488..f1e3be1 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1458,15 +1458,18 @@ static int em_popf(struct x86_emulate_ctxt *ctxt)
return emulate_popf(ctxt, &ctxt->dst.val, ctxt->op_bytes);
 }
 
-static int emulate_push_sreg(struct x86_emulate_ctxt *ctxt, int seg)
+static int em_push_sreg(struct x86_emulate_ctxt *ctxt)
 {
+   int seg = ctxt->src2.val;
+
ctxt->src.val = get_segment_selector(ctxt, seg);
 
return em_push(ctxt);
 }
 
-static int emulate_pop_sreg(struct x86_emulate_ctxt *ctxt, int seg)
+static int em_pop_sreg(struct x86_emulate_ctxt *ctxt)
 {
+   int seg = ctxt->src2.val;
unsigned long selector;
int rc;
 
@@ -3114,19 +3117,20 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 static struct opcode opcode_table[256] = {
/* 0x00 - 0x07 */
I6ALU(Lock, em_add),
-   D(ImplicitOps | Stack | No64 | Src2ES),
-   D(ImplicitOps | Stack | No64 | Src2ES),
+   I(ImplicitOps | Stack | No64 | Src2ES, em_push_sreg),
+   I(ImplicitOps | Stack | No64 | Src2ES, em_pop_sreg),
/* 0x08 - 0x0F */
I6ALU(Lock, em_or),
-   D(ImplicitOps | Stack | No64 | Src2CS), N,
+   I(ImplicitOps | Stack | No64 | Src2CS, em_push_sreg),
+   N,
/* 0x10 - 0x17 */
I6ALU(Lock, em_adc),
-   D(ImplicitOps | Stack | No64 | Src2SS),
-   D(ImplicitOps | Stack | No64 | Src2SS),
+   I(ImplicitOps | Stack | No64 | Src2SS, em_push_sreg),
+   I(ImplicitOps | Stack | No64 | Src2SS, em_pop_sreg),
/* 0x18 - 0x1F */
I6ALU(Lock, em_sbb),
-   D(ImplicitOps | Stack | No64 | Src2DS),
-   D(ImplicitOps | Stack | No64 | Src2DS),
+   I(ImplicitOps | Stack | No64 | Src2DS, em_push_sreg),
+   I(ImplicitOps | Stack | No64 | Src2DS, em_pop_sreg),
/* 0x20 - 0x27 */
I6ALU(Lock, em_and), N, N,
/* 0x28 - 0x2F */
@@ -3270,12 +3274,12 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
/* 0x90 - 0x9F */
X16(D(ByteOp | DstMem | SrcNone | ModRM| Mov)),
/* 0xA0 - 0xA7 */
-   D(Stack | Src2FS), D(Stack | Src2FS),
+   I(Stack | Src2FS, em_push_sreg), I(Stack | Src2FS, em_pop_sreg),
DI(ImplicitOps, cpuid), D(DstMem | SrcReg | ModRM | BitOp),
D(DstMem | SrcReg | Src2ImmByte | ModRM),
D(DstMem | SrcReg | Src2CL | ModRM), N, N,
/* 0xA8 - 0xAF */
-   D(Stack | Src2GS), D(Stack | Src2GS),
+   I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg),
DI(ImplicitOps, rsm), D(DstMem | SrcReg | ModRM | BitOp | Lock),
D(DstMem | SrcReg | Src2ImmByte | ModRM),
D(DstMem | SrcReg | Src2CL | ModRM),
@@ -3839,16 +3843,6 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
goto twobyte_insn;
 
switch (ctxt->b) {
-   case 0x06:  /* push es */
-   case 0x0e:  /* push cs */
-   case 0x16:  /* push ss */
-   case 0x1e:  /* push ds */
-   rc = emulate_push_sreg(ctxt, ctxt->src2.val);
-   break;
-   case 0x07:  /* pop es */
-   case 0x17:  /* pop ss */
-   case 0x1f:  /* pop ds */
-   rc = emulate_pop_sreg(ctxt, ctxt->src2.val);
case 0x40 ... 0x47: /* inc r16/r32 */
emulate_1op(ctxt, "inc");
break;
@@ -4097,14 +4091,6 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
case 0x90 ... 0x9f: /* setcc r/m8 */
ctxt->dst.val = test_cc(ctxt->b, ctxt->eflags);
break;
-   case 0xa0:/* push fs */
-   case 0xa8:/* push gs */
-   rc = emulate_push_sreg(ctxt, ctxt->src2.val);
-   break;
-   case 0xa1:   /* pop fs */
-   case 0xa9:   /* pop gs */
-   rc = emulate_pop_sreg(ctxt, ctxt->src2.val);
-   break;
case 0xa3:
  bt:   /* bt */
ctxt->dst.type = OP_NONE;
-- 
1.7.6.1



[PATCH 02/14] KVM: x86 emulator: convert group 3 instructions to direct decode

2011-09-13 Thread Avi Kivity
Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |   82 
 1 files changed, 48 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index af06539..ed819bd 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1663,37 +1663,49 @@ static int em_grp2(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-static int em_grp3(struct x86_emulate_ctxt *ctxt)
+static int em_not(struct x86_emulate_ctxt *ctxt)
+{
+   ctxt->dst.val = ~ctxt->dst.val;
+   return X86EMUL_CONTINUE;
+}
+
+static int em_neg(struct x86_emulate_ctxt *ctxt)
+{
+   emulate_1op(ctxt, "neg");
+   return X86EMUL_CONTINUE;
+}
+
+static int em_mul_ex(struct x86_emulate_ctxt *ctxt)
+{
+   u8 ex = 0;
+
+   emulate_1op_rax_rdx(ctxt, "mul", ex);
+   return X86EMUL_CONTINUE;
+}
+
+static int em_imul_ex(struct x86_emulate_ctxt *ctxt)
+{
+   u8 ex = 0;
+
+   emulate_1op_rax_rdx(ctxt, "imul", ex);
+   return X86EMUL_CONTINUE;
+}
+
+static int em_div_ex(struct x86_emulate_ctxt *ctxt)
 {
u8 de = 0;
 
-   switch (ctxt->modrm_reg) {
-   case 0 ... 1:   /* test */
-   emulate_2op_SrcV(ctxt, "test");
-   /* Disable writeback. */
-   ctxt->dst.type = OP_NONE;
-   break;
-   case 2: /* not */
-   ctxt->dst.val = ~ctxt->dst.val;
-   break;
-   case 3: /* neg */
-   emulate_1op(ctxt, "neg");
-   break;
-   case 4: /* mul */
-   emulate_1op_rax_rdx(ctxt, "mul", de);
-   break;
-   case 5: /* imul */
-   emulate_1op_rax_rdx(ctxt, "imul", de);
-   break;
-   case 6: /* div */
-   emulate_1op_rax_rdx(ctxt, "div", de);
-   break;
-   case 7: /* idiv */
-   emulate_1op_rax_rdx(ctxt, "idiv", de);
-   break;
-   default:
-   return X86EMUL_UNHANDLEABLE;
-   }
+   emulate_1op_rax_rdx(ctxt, "div", de);
+   if (de)
+   return emulate_de(ctxt);
+   return X86EMUL_CONTINUE;
+}
+
+static int em_idiv_ex(struct x86_emulate_ctxt *ctxt)
+{
+   u8 de = 0;
+
+   emulate_1op_rax_rdx(ctxt, "idiv", de);
if (de)
return emulate_de(ctxt);
return X86EMUL_CONTINUE;
@@ -2989,9 +3001,14 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 };
 
 static struct opcode group3[] = {
-   D(DstMem | SrcImm | ModRM), D(DstMem | SrcImm | ModRM),
-   D(DstMem | SrcNone | ModRM | Lock), D(DstMem | SrcNone | ModRM | Lock),
-   X4(D(SrcMem | ModRM)),
+   I(DstMem | SrcImm | ModRM, em_test),
+   I(DstMem | SrcImm | ModRM, em_test),
+   I(DstMem | SrcNone | ModRM | Lock, em_not),
+   I(DstMem | SrcNone | ModRM | Lock, em_neg),
+   I(SrcMem | ModRM, em_mul_ex),
+   I(SrcMem | ModRM, em_imul_ex),
+   I(SrcMem | ModRM, em_div_ex),
+   I(SrcMem | ModRM, em_idiv_ex),
 };
 
 static struct opcode group4[] = {
@@ -3917,9 +3934,6 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
/* complement carry flag from eflags reg */
ctxt->eflags ^= EFLG_CF;
break;
-   case 0xf6 ... 0xf7: /* Grp3 */
-   rc = em_grp3(ctxt);
-   break;
case 0xf8: /* clc */
ctxt->eflags &= ~EFLG_CF;
break;
-- 
1.7.6.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/14] KVM: x86 emulator: switch src2 to generic decode_operand()

2011-09-13 Thread Avi Kivity
Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |   51 ---
 1 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8c65ff2..88d32fc 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -40,6 +40,10 @@
 #define OpMem64 6ull  /* Memory, 64-bit */
 #define OpImmUByte 7ull  /* Zero-extended 8-bit immediate */
 #define OpDX   8ull  /* DX register */
+#define OpCL   9ull  /* CL register (for shifts) */
+#define OpImmByte 10ull  /* 8-bit sign extended immediate */
+#define OpOne 11ull  /* Implied 1 */
+#define OpImm 12ull  /* Sign extended immediate */
 
 #define OpBits 4  /* Width of operand field */
 #define OpMask ((1ull << OpBits) - 1)
@@ -108,12 +112,13 @@
 #define Priv (1<<27) /* instruction generates #GP if current CPL != 0 */
 #define No64   (1<<28)
 /* Source 2 operand type */
-#define Src2None (0u<<29)
-#define Src2CL  (1u<<29)
-#define Src2ImmByte (2u<<29)
-#define Src2One (3u<<29)
-#define Src2Imm (4u<<29)
-#define Src2Mask (7u<<29)
+#define Src2Shift   (29)
+#define Src2None (OpNone << Src2Shift)
+#define Src2CL  (OpCL << Src2Shift)
+#define Src2ImmByte (OpImmByte << Src2Shift)
+#define Src2One (OpOne << Src2Shift)
+#define Src2Imm (OpImm << Src2Shift)
+#define Src2Mask (OpMask << Src2Shift)
 
 #define X2(x...) x, x
 #define X3(x...) X2(x), x
@@ -3382,6 +3387,20 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, 
struct operand *op,
op->addr.reg = &ctxt->regs[VCPU_REGS_RDX];
fetch_register_operand(op);
break;
+   case OpCL:
+   op->bytes = 1;
+   op->val = ctxt->regs[VCPU_REGS_RCX] & 0xff;
+   break;
+   case OpImmByte:
+   rc = decode_imm(ctxt, op, 1, true);
+   break;
+   case OpOne:
+   op->bytes = 1;
+   op->val = 1;
+   break;
+   case OpImm:
+   rc = decode_imm(ctxt, op, imm_size(ctxt), true);
+   break;
case OpImplicit:
/* Special instructions do their own operand decoding. */
default:
@@ -3656,25 +3675,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
 * Decode and fetch the second source operand: register, memory
 * or immediate.
 */
-   switch (ctxt->d & Src2Mask) {
-   case Src2None:
-   break;
-   case Src2CL:
-   ctxt->src2.bytes = 1;
-   ctxt->src2.val = ctxt->regs[VCPU_REGS_RCX] & 0xff;
-   break;
-   case Src2ImmByte:
-   rc = decode_imm(ctxt, &ctxt->src2, 1, true);
-   break;
-   case Src2One:
-   ctxt->src2.bytes = 1;
-   ctxt->src2.val = 1;
-   break;
-   case Src2Imm:
-   rc = decode_imm(ctxt, &ctxt->src2, imm_size(ctxt), true);
-   break;
-   }
-
+   rc = decode_operand(ctxt, &ctxt->src2, (ctxt->d >> Src2Shift) & OpMask);
if (rc != X86EMUL_CONTINUE)
goto done;
 
-- 
1.7.6.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/14] KVM: x86 emulator: fix Src2CL decode

2011-09-13 Thread Avi Kivity
Src2CL decode (used for double-width shifts) erroneously decodes only bit 3
of %rcx, instead of bits 7:0.

Fix by decoding %cl in its entirety.
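
For illustration only, a minimal user-space sketch (made-up names, not kernel
code) of the effect described above -- masking the count with 0x8 keeps only
bit 3, while masking with 0xff recovers the whole of %cl:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* Pretend the guest loaded a shift count of 5 into %cl. */
        uint64_t fake_rcx = 0x0000000000000005ULL;

        uint8_t broken_count = fake_rcx & 0x8;  /* old decode: bit 3 only -> 0 */
        uint8_t fixed_count  = fake_rcx & 0xff; /* fixed decode: bits 7:0 -> 5 */

        printf("broken=%u fixed=%u\n",
               (unsigned)broken_count, (unsigned)fixed_count);
        return 0;
}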

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c37f67e..af06539 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3567,7 +3567,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
break;
case Src2CL:
ctxt->src2.bytes = 1;
-   ctxt->src2.val = ctxt->regs[VCPU_REGS_RCX] & 0x8;
+   ctxt->src2.val = ctxt->regs[VCPU_REGS_RCX] & 0xff;
break;
case Src2ImmByte:
rc = decode_imm(ctxt, &ctxt->src2, 1, true);
-- 
1.7.6.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/14] KVM: x86 emulator: move memop, memopp into emulation context

2011-09-13 Thread Avi Kivity
Simplifies further generalization of decode.
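
As a rough, self-contained sketch of the pattern (illustrative names only, not
the emulator's real types): once the operand state lives in the context object
instead of on the decode routine's stack, helpers called during decode can
reach it without threading extra pointer arguments through every call:

#include <stdio.h>

struct op_sketch { int is_mem; unsigned long ea; };

struct ctxt_sketch {
        /* formerly locals of the decode routine, now shared decoder state */
        struct op_sketch memop;
        struct op_sketch *memopp;
        unsigned long eip;
};

/* A helper called during decode can fix up the operand via the context. */
static void fixup_rip_relative(struct ctxt_sketch *c)
{
        if (c->memopp && c->memopp->is_mem)
                c->memopp->ea += c->eip;
}

int main(void)
{
        struct ctxt_sketch c = { .memop = { .is_mem = 1, .ea = 0x10 }, .eip = 0x1000 };

        c.memopp = &c.memop;
        fixup_rip_relative(&c);
        printf("ea = %#lx\n", c.memop.ea);      /* 0x1010 */
        return 0;
}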

Signed-off-by: Avi Kivity 
---
 arch/x86/include/asm/kvm_emulate.h |2 ++
 arch/x86/kvm/emulate.c |   34 +-
 2 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index 6040d11..56bac3e 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -275,6 +275,8 @@ struct x86_emulate_ctxt {
unsigned long _eip;
/* Fields above regs are cleared together. */
unsigned long regs[NR_VCPU_REGS];
+   struct operand memop;
+   struct operand *memopp;
struct fetch_cache fetch;
struct read_cache io_read;
struct read_cache mem_read;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index ed819bd..58172fb 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3323,8 +3323,9 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
int def_op_bytes, def_ad_bytes, goffset, simd_prefix;
bool op_prefix = false;
struct opcode opcode;
-   struct operand memop = { .type = OP_NONE }, *memopp = NULL;
 
+   ctxt->memop.type = OP_NONE;
+   ctxt->memopp = NULL;
ctxt->_eip = ctxt->eip;
ctxt->fetch.start = ctxt->_eip;
ctxt->fetch.end = ctxt->fetch.start + insn_len;
@@ -3482,21 +3483,21 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
 
/* ModRM and SIB bytes. */
if (ctxt->d & ModRM) {
-   rc = decode_modrm(ctxt, &memop);
+   rc = decode_modrm(ctxt, &ctxt->memop);
if (!ctxt->has_seg_override)
set_seg_override(ctxt, ctxt->modrm_seg);
} else if (ctxt->d & MemAbs)
-   rc = decode_abs(ctxt, &memop);
+   rc = decode_abs(ctxt, &ctxt->memop);
if (rc != X86EMUL_CONTINUE)
goto done;
 
if (!ctxt->has_seg_override)
set_seg_override(ctxt, VCPU_SREG_DS);
 
-   memop.addr.mem.seg = seg_override(ctxt);
+   ctxt->memop.addr.mem.seg = seg_override(ctxt);
 
-   if (memop.type == OP_MEM && ctxt->ad_bytes != 8)
-   memop.addr.mem.ea = (u32)memop.addr.mem.ea;
+   if (ctxt->memop.type == OP_MEM && ctxt->ad_bytes != 8)
+   ctxt->memop.addr.mem.ea = (u32)ctxt->memop.addr.mem.ea;
 
/*
 * Decode and fetch the source operand: register, memory
@@ -3509,17 +3510,16 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
decode_register_operand(ctxt, &ctxt->src, 0);
break;
case SrcMem16:
-   memop.bytes = 2;
+   ctxt->memop.bytes = 2;
goto srcmem_common;
case SrcMem32:
-   memop.bytes = 4;
+   ctxt->memop.bytes = 4;
goto srcmem_common;
case SrcMem:
-   memop.bytes = (ctxt->d & ByteOp) ? 1 :
-  ctxt->op_bytes;
+   ctxt->memop.bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
srcmem_common:
-   ctxt->src = memop;
-   memopp = &ctxt->src;
+   ctxt->src = ctxt->memop;
+   ctxt->memopp = &ctxt->src;
break;
case SrcImmU16:
rc = decode_imm(ctxt, &ctxt->src, 2, false);
@@ -3561,7 +3561,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
insn_fetch_arr(ctxt->src.valptr, ctxt->src.bytes, ctxt);
break;
case SrcMemFAddr:
-   memop.bytes = ctxt->op_bytes + 2;
+   ctxt->memop.bytes = ctxt->op_bytes + 2;
goto srcmem_common;
break;
case SrcDX:
@@ -3615,8 +3615,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
break;
case DstMem:
case DstMem64:
-   ctxt->dst = memop;
-   memopp = &ctxt->dst;
+   ctxt->dst = ctxt->memop;
+   ctxt->memopp = &ctxt->dst;
if ((ctxt->d & DstMask) == DstMem64)
ctxt->dst.bytes = 8;
else
@@ -3654,8 +3654,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
}
 
 done:
-   if (memopp && memopp->type == OP_MEM && ctxt->rip_relative)
-   memopp->addr.mem.ea += ctxt->_eip;
+   if (ctxt->memopp && ctxt->memopp->type == OP_MEM && ctxt->rip_relative)
+   ctxt->memopp->addr.mem.ea += ctxt->_eip;
 
return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
 }
-- 
1.7.6.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 00/14] Emulator decode generalization

2011-09-13 Thread Avi Kivity
- merge dst/src/src2 decode
- generalize %seg embedded in opcode decode

plus a fix.
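
A stand-alone sketch of the decode generalization the series is after
(constants and names are made up, not the emulator's real Op*/Src2* flags):
each operand slot is described by a small numeric kind packed into its own
bit-field of the decode flags, and a single decode_operand()-style function
interprets that kind, so src, src2 and dst can share one code path:

#include <stdio.h>
#include <stdint.h>

/* Illustrative operand kinds -- not the emulator's real Op* values. */
enum { OP_NONE, OP_REG_AX, OP_IMM_BYTE, OP_ONE };

#define OP_BITS    4
#define OP_MASK    ((1u << OP_BITS) - 1)
#define SRC_SHIFT  0
#define SRC2_SHIFT 4
#define DST_SHIFT  8

/* One decoder interprets the packed kind for any operand slot. */
static long decode_operand_sketch(unsigned kind, long ax, long imm)
{
        switch (kind) {
        case OP_REG_AX:   return ax;
        case OP_IMM_BYTE: return (int8_t)imm;
        case OP_ONE:      return 1;
        default:          return 0;     /* OP_NONE */
        }
}

int main(void)
{
        /* One packed descriptor replaces separate src/src2/dst switches. */
        unsigned flags = (OP_REG_AX << SRC_SHIFT) |
                         (OP_IMM_BYTE << SRC2_SHIFT) |
                         (OP_ONE << DST_SHIFT);

        long src  = decode_operand_sketch((flags >> SRC_SHIFT)  & OP_MASK, 42, 0x80);
        long src2 = decode_operand_sketch((flags >> SRC2_SHIFT) & OP_MASK, 42, 0x80);
        long dst  = decode_operand_sketch((flags >> DST_SHIFT)  & OP_MASK, 42, 0x80);

        printf("src=%ld src2=%ld dst=%ld\n", src, src2, dst);  /* 42 -128 1 */
        return 0;
}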

Avi Kivity (14):
  KVM: x86 emulator: fix Src2CL decode
  KVM: x86 emulator: convert group 3 instructions to direct decode
  KVM: x86 emulator: move memop, memopp into emulation context
  KVM: x86 emulator: split dst decode to a generic decode_operand()
  KVM: x86 emulator: expand decode flags to 64 bits
  KVM: x86 emulator: switch src2 to generic decode_operand()
  KVM: x86 emulator: free up some flag bits near src, dst
  KVM: x86 emulator: switch OpImmUByte decode to decode_imm()
  KVM: x86 emulator: qualify OpReg inhibit_byte_regs hack
  KVM: x86 emulator: switch src decode to decode_operand()
  KVM: x86 emulator: simplify OpMem64 decode
  KVM: x86 emulator: streamline decode of segment registers
  KVM: x86 emulator: switch lds/les/lss/lfs/lgs to direct decode
  KVM: x86 emulator: convert push %sreg/pop %sreg to direct decode

 arch/x86/include/asm/kvm_emulate.h |4 +-
 arch/x86/kvm/emulate.c |  563 ++--
 2 files changed, 286 insertions(+), 281 deletions(-)

-- 
1.7.6.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] KVM: Split up MSI-X assigned device IRQ handler

2011-09-13 Thread Tian, Kevin
> From: Jan Kiszka [mailto:jan.kis...@siemens.com]
> Sent: Tuesday, September 13, 2011 3:30 PM
> 
> On 2011-09-13 08:40, Tian, Kevin wrote:
> >> From: Jan Kiszka
> >> Sent: Tuesday, September 13, 2011 12:58 AM
> >>
> >> The threaded IRQ handler for MSI-X has almost nothing in common with the
> >> INTx/MSI handler. Move its code into a dedicated handler.
> >
> > if it's desired to further go down this cleanup path, there's also no need 
> > to
> > register ack notifier for MSI-X type device since all the logic there is 
> > simply
> > a nop:
> 
> You mean
> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/79075? :)
> 

yes, and sorry that I didn't catch that thread. :-)

Thanks
Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: Split up MSI-X assigned device IRQ handler

2011-09-13 Thread Jan Kiszka
On 2011-09-13 08:40, Tian, Kevin wrote:
>> From: Jan Kiszka
>> Sent: Tuesday, September 13, 2011 12:58 AM
>>
>> The threaded IRQ handler for MSI-X has almost nothing in common with the
>> INTx/MSI handler. Move its code into a dedicated handler.
> 
> if it's desired to further go down this cleanup path, there's also no need to
> register an ack notifier for MSI-X type devices, since all the logic there is
> simply a nop:

You mean http://thread.gmane.org/gmane.comp.emulators.kvm.devel/79075? :)

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] About hotplug multifunction

2011-09-13 Thread Isaku Yamahata
On Tue, Sep 13, 2011 at 09:52:49AM +0300, Gleb Natapov wrote:
> On Tue, Sep 13, 2011 at 02:57:20PM +0900, Isaku Yamahata wrote:
> > On Sun, Sep 11, 2011 at 12:05:17PM +0300, Michael S. Tsirkin wrote:
> > > On Sat, Sep 10, 2011 at 02:43:11AM +0900, Isaku Yamahata wrote:
> > > > pci/pcie hot plug needs cleanup for multifunction hotplug in the long term.
> > > > Only the single-function device case works; the multifunction case is
> > > > somewhat broken. In particular, the current ACPI-based hotplug should be
> > > > replaced by the standardized hot-plug controller in the long term.
> > > 
> > > We'll need to keep supporting Windows XP, which IIUC only
> > > supports hotplug through ACPI. So it looks like we'll
> > > need both.
> > 
> > Yes, we'll need both then.
> > It would be possible to implement ACPI-based hotplug with a
> > standardized hotplug controller, but not with a qemu-specific controller.
> > 
> Where is this "standardized hotplug controller" documented?

PCI Hot-Plug 1.1 by the PCI-SIG.

NOTE: I already implemented PCIe native hotplug in qemu, which is defined
in the PCIe spec.

> > It would require a fair amount of work to write ACPI code in the DSDT that
> > handles a standardized hotplug controller.
> > So I'm not sure it's worthwhile only for Windows XP support.
> > -- 
> > yamahata
> 
> --
>   Gleb.
> 

-- 
yamahata
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html