memory-hotplug : possible circular locking dependency detected

2012-09-13 Thread Yasuaki Ishimatsu
When I offline memory on linux-3.6-rc5, "possible circular
locking dependency detected" messages are shown.
Is this a known problem?

[  201.596363] Offlined Pages 32768
[  201.596373] remove from free list 14 1024 148000
[  201.596493] remove from free list 140400 1024 148000
[  201.596612] remove from free list 140800 1024 148000
[  201.596730] remove from free list 140c00 1024 148000
[  201.596849] remove from free list 141000 1024 148000
[  201.596968] remove from free list 141400 1024 148000
[  201.597049] remove from free list 141800 1024 148000
[  201.597049] remove from free list 141c00 1024 148000
[  201.597049] remove from free list 142000 1024 148000
[  201.597049] remove from free list 142400 1024 148000
[  201.597049] remove from free list 142800 1024 148000
[  201.597049] remove from free list 142c00 1024 148000
[  201.597049] remove from free list 143000 1024 148000
[  201.597049] remove from free list 143400 1024 148000
[  201.597049] remove from free list 143800 1024 148000
[  201.597049] remove from free list 143c00 1024 148000
[  201.597049] remove from free list 144000 1024 148000
[  201.597049] remove from free list 144400 1024 148000
[  201.597049] remove from free list 144800 1024 148000
[  201.597049] remove from free list 144c00 1024 148000
[  201.597049] remove from free list 145000 1024 148000
[  201.597049] remove from free list 145400 1024 148000
[  201.597049] remove from free list 145800 1024 148000
[  201.597049] remove from free list 145c00 1024 148000
[  201.597049] remove from free list 146000 1024 148000
[  201.597049] remove from free list 146400 1024 148000
[  201.597049] remove from free list 146800 1024 148000
[  201.597049] remove from free list 146c00 1024 148000
[  201.597049] remove from free list 147000 1024 148000
[  201.597049] remove from free list 147400 1024 148000
[  201.597049] remove from free list 147800 1024 148000
[  201.597049] remove from free list 147c00 1024 148000
[  201.602143] 
[  201.602150] ======================================================
[  201.602153] [ INFO: possible circular locking dependency detected ]
[  201.602157] 3.6.0-rc5 #1 Not tainted
[  201.602159] -------------------------------------------------------
[  201.602162] bash/2789 is trying to acquire lock:
[  201.602164]  ((memory_chain).rwsem){.+.+.+}, at: [8109fe16] 
__blocking_notifier_call_chain+0x66/0xd0
[  201.602180] 
[  201.602180] but task is already holding lock:
[  201.602182]  (ksm_thread_mutex/1){+.+.+.}, at: [811b41fa] 
ksm_memory_callback+0x3a/0xc0
[  201.602194] 
[  201.602194] which lock already depends on the new lock.
[  201.602194] 
[  201.602197] 
[  201.602197] the existing dependency chain (in reverse order) is:
[  201.602200] 
[  201.602200] -> #1 (ksm_thread_mutex/1){+.+.+.}:
[  201.602208][810dbee9] validate_chain+0x6d9/0x7e0
[  201.602214][810dc2e6] __lock_acquire+0x2f6/0x4f0
[  201.602219][810dc57d] lock_acquire+0x9d/0x190
[  201.602223][8166b4fc] __mutex_lock_common+0x5c/0x420
[  201.602229][8166ba2a] mutex_lock_nested+0x4a/0x60
[  201.602234][811b41fa] ksm_memory_callback+0x3a/0xc0
[  201.602239][81673447] notifier_call_chain+0x67/0x150
[  201.602244][8109fe2b] 
__blocking_notifier_call_chain+0x7b/0xd0
[  201.602250][8109fe96] 
blocking_notifier_call_chain+0x16/0x20
[  201.602255][8144c53b] memory_notify+0x1b/0x20
[  201.602261][81653c51] offline_pages+0x1b1/0x470
[  201.602267][811bfcae] remove_memory+0x1e/0x20
[  201.602273][8144c661] memory_block_action+0xa1/0x190
[  201.602278][8144c7c9] memory_block_change_state+0x79/0xe0
[  201.602282][8144c8f2] store_mem_state+0xc2/0xd0
[  201.602287][81436980] dev_attr_store+0x20/0x30
[  201.602293][812498d3] sysfs_write_file+0xa3/0x100
[  201.602299][811cba80] vfs_write+0xd0/0x1a0
[  201.602304][811cbc54] sys_write+0x54/0xa0
[  201.602309][81678529] system_call_fastpath+0x16/0x1b
[  201.602315] 
[  201.602315] -> #0 ((memory_chain).rwsem){.+.+.+}:
[  201.602322][810db7e7] check_prev_add+0x527/0x550
[  201.602326][810dbee9] validate_chain+0x6d9/0x7e0
[  201.602331][810dc2e6] __lock_acquire+0x2f6/0x4f0
[  201.602335][810dc57d] lock_acquire+0x9d/0x190
[  201.602340][8166c1a1] down_read+0x51/0xa0
[  201.602345][8109fe16] 
__blocking_notifier_call_chain+0x66/0xd0
[  201.602350][8109fe96] 
blocking_notifier_call_chain+0x16/0x20
[  201.602355][8144c53b] memory_notify+0x1b/0x20
[  201.602360][81653e67] offline_pages+0x3c7/0x470
[  201.602365][811bfcae] remove_memory+0x1e/0x20
[  201.602370]

Re: memtest 4.20+ does not work with -cpu host

2012-09-13 Thread Peter Lieven

On 10.09.2012 14:32, Avi Kivity wrote:

On 09/10/2012 03:29 PM, Peter Lieven wrote:

On 09/10/12 14:21, Gleb Natapov wrote:

On Mon, Sep 10, 2012 at 02:15:49PM +0200, Paolo Bonzini wrote:

Il 10/09/2012 13:52, Peter Lieven ha scritto:

dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd

It only works without the skip, but the MSR device returns all zeroes.

Hmm, the strange API of the MSR device doesn't work well with dd (dd
skips to 0x194 * 8 because bs is 8).  You can try this program:
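The program itself is not reproduced here; as an illustration only (not the program referenced above), here is a minimal Python sketch of the same idea, assuming the usual `/dev/cpu/N/msr` semantics where the file offset selects the MSR index. The helper name `read_msr` is made up for this sketch.

```python
import os
import struct

def read_msr(cpu, index):
    """Read one 8-byte MSR via /dev/cpu/<cpu>/msr (needs root + msr module).

    The msr device uses the *file offset* as the MSR index, so the read
    must happen at offset `index` itself -- not at index * bs, which is
    where `dd skip=$((0x194)) bs=8` ends up seeking.
    """
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
    try:
        data = os.pread(fd, 8, index)   # 8 bytes at offset == MSR index
    finally:
        os.close(fd)
    return struct.unpack("<Q", data)[0]

# dd's seek arithmetic: skip * bs bytes, i.e. MSR 0xca0 instead of 0x194.
dd_seek = 0x194 * 8
print(hex(dd_seek))  # 0xca0
```

This is why the dd invocation above reads the wrong register entirely: the offset it seeks to is interpreted as a different MSR index.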


There is rdmsr/wrmsr in msr-tools.

rdmsr reports that it cannot read those MSRs, regardless of whether I use
-cpu host or -cpu qemu64.

On the host.



did you get my output?

#rdmsr -0 0x194
00011100
#rdmsr -0 0xce
0c0004011103

cheers,
peter

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: memtest 4.20+ does not work with -cpu host

2012-09-13 Thread Paolo Bonzini
Il 13/09/2012 09:53, Peter Lieven ha scritto:

 rdmsr returns it cannot read those MSRs. regardless if I use -cpu host
 or -cpu qemu64.
 On the host.


 did you get my output?
 
 #rdmsr -0 0x194
 00011100
 #rdmsr -0 0xce
 0c0004011103

Yes, that can help implementing it in KVM.  But without a spec to
understand what the bits actually mean, it's just as risky...

Peter, do you have any idea where to get the spec of the memory
controller MSRs in Nehalem and newer processors?  Apparently, memtest is
using them (and in particular 0x194) to find the speed of the FSB, or
something like that.

Paolo


Re: memtest 4.20+ does not work with -cpu host

2012-09-13 Thread Gleb Natapov
On Thu, Sep 13, 2012 at 09:55:06AM +0200, Paolo Bonzini wrote:
 Il 13/09/2012 09:53, Peter Lieven ha scritto:
 
  rdmsr returns it cannot read those MSRs. regardless if I use -cpu host
  or -cpu qemu64.
  On the host.
 
 
  did you get my output?
  
  #rdmsr -0 0x194
  00011100
  #rdmsr -0 0xce
  0c0004011103
 
 Yes, that can help implementing it in KVM.  But without a spec to
 understand what the bits actually mean, it's just as risky...
 
 Peter, do you have any idea where to get the spec of the memory
 controller MSRs in Nehalem and newer processors?  Apparently, memtest is
 using them (and in particular 0x194) to find the speed of the FSB, or
 something like that.
 
Why would anyone want to run memtest in a VM? Maybe just add those
MSRs to the ignore list and that's it.

--
Gleb.


Re: Multi-dimensional Paging in Nested virtualization

2012-09-13 Thread Nadav Har'El
On Tue, Sep 11, 2012, siddhesh phadke wrote about Multi-dimensional Paging in 
Nested virtualization:
 I read the Turtles project paper, where they explain how
 multi-dimensional page tables are built on L0. L2 is launched with
 an empty EPT 0-2, and EPT 0-2 is built on-the-fly.
 I tried to find out how this is done in the kvm code, but I could not find
 where EPT 0-2 is built.

Nested EPT is not yet included in mainline KVM. The original nested EPT
code that we had written as part of the Turtles paper became obsolete when
much of KVM's MMU code was rewritten.

I have since rewritten the nested EPT code for the modern KVM. I sent
the second (latest) version of these patches to the KVM mailing list in
August, and you can find them in, for example,
http://comments.gmane.org/gmane.comp.emulators.kvm.devel/95395

These patches have not yet been accepted into KVM. They have bugs in various
setups (which I have not yet found the time to fix, unfortunately),
and some known issues found by Avi Kivity on this mailing list.

 Does L1 handle ept violation first and then L0 updates its EPT0-2?
 How this is done?

This is explained in the Turtles paper, but here's the short story:

L1 defines an EPT table for L2, which we call EPT12. L0 builds from this
an EPT02, with L1 addresses changed to L0 addresses. Now, when L2 runs and
we get an EPT violation, we exit to L0 (in nested vmx, any exit first goes
to L0). L0 checks whether the translation is missing in EPT12, and if it
is, it emulates an exit into L1 and injects the EPT violation into
L1. But if the translation wasn't missing in EPT12, then it's L0's
problem, and we just need to update EPT02.
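The decision described above can be modeled with toy dictionaries standing in for the EPT tables (illustrative pseudocode only, not KVM's actual data structures; the function and parameter names are invented for this sketch):

```python
def handle_l2_ept_violation(gpa2, ept12, ept02, ept01):
    """L0's choice on an EPT violation taken while L2 runs (toy model).

    ept12: L1's EPT for L2   (L2 guest-physical -> L1 guest-physical)
    ept02: L0's shadow table (L2 guest-physical -> host-physical)
    ept01: L0's EPT for L1   (L1 guest-physical -> host-physical)
    """
    gpa1 = ept12.get(gpa2)  # walk L1's table in software
    if gpa1 is None:
        # Missing in EPT12: this fault belongs to L1, so emulate an
        # exit into L1 and inject the EPT violation there.
        return "inject into L1"
    # Present in EPT12: only L0's shadow is stale -- fix EPT02, resume L2.
    ept02[gpa2] = ept01[gpa1]
    return "resume L2"

ept12 = {0x1000: 0x5000}          # L1 mapped one L2 page
ept01 = {0x5000: 0x9000}          # L0's mapping for that L1 page
ept02 = {}
print(handle_l2_ept_violation(0x1000, ept12, ept02, ept01))  # resume L2
print(handle_l2_ept_violation(0x2000, ept12, ept02, ept01))  # inject into L1
```

The point of the shadow EPT02 is that the hardware only ever walks EPT02 while L2 runs; EPT12 is consulted purely in software on faults.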

 Can anybody give me some pointers about where to look into the code?

Please look at the patches above. Each patch is also documented.

Nadav.

-- 
Nadav Har'El|  Thursday, Sep 13 2012, 26 Elul 5772
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |error compiling committee.c: too many
http://nadav.harel.org.il   |arguments to function


Re: memtest 4.20+ does not work with -cpu host

2012-09-13 Thread Paolo Bonzini
Il 13/09/2012 09:57, Gleb Natapov ha scritto:
   
   #rdmsr -0 0x194
   00011100
   #rdmsr -0 0xce
   0c0004011103
  
  Yes, that can help implementing it in KVM.  But without a spec to
  understand what the bits actually mean, it's just as risky...
  
  Peter, do you have any idea where to get the spec of the memory
  controller MSRs in Nehalem and newer processors?  Apparently, memtest is
  using them (and in particular 0x194) to find the speed of the FSB, or
  something like that.
  
 Why would anyone will want to run memtest in a vm? May be just add those
 MSRs to ignore list and that's it.

From the output it looks like it's basically a list of bits.  Returning
something sensible is better, same as for the speed scaling MSRs.

Paolo


Re: memtest 4.20+ does not work with -cpu host

2012-09-13 Thread Gleb Natapov
On Thu, Sep 13, 2012 at 10:00:26AM +0200, Paolo Bonzini wrote:
 Il 13/09/2012 09:57, Gleb Natapov ha scritto:

#rdmsr -0 0x194
00011100
#rdmsr -0 0xce
0c0004011103
   
   Yes, that can help implementing it in KVM.  But without a spec to
   understand what the bits actually mean, it's just as risky...
   
   Peter, do you have any idea where to get the spec of the memory
   controller MSRs in Nehalem and newer processors?  Apparently, memtest is
   using them (and in particular 0x194) to find the speed of the FSB, or
   something like that.
   
  Why would anyone will want to run memtest in a vm? May be just add those
  MSRs to ignore list and that's it.
 
 From the output it looks like it's basically a list of bits.  Returning
 something sensible is better, same as for the speed scaling MSRs.
 
Everything is a list of bits in computers :) At least 0xce is documented in the SDM.
It cannot be implemented in a migration-safe manner.

--
Gleb.


Re: [PATCH 0/5] Optimize page table walk

2012-09-13 Thread Avi Kivity
On 09/13/2012 01:20 AM, Marcelo Tosatti wrote:
 On Wed, Sep 12, 2012 at 05:29:49PM +0300, Avi Kivity wrote:
 (resend due to mail server malfunction)
 
 The page table walk has gotten crufty over the years and is threatening to 
 become
 even more crufty when SMAP is introduced.  Clean it up (and optimize it) 
 somewhat.
 
 What is SMAP?
 

Supervisor Mode Access Prevention, see
http://software.intel.com/sites/default/files/319433-014.pdf.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 0/3] Prepare kvm for lto

2012-09-13 Thread Avi Kivity
On 09/12/2012 10:17 PM, Andi Kleen wrote:
 On Wed, Sep 12, 2012 at 05:50:41PM +0300, Avi Kivity wrote:
 vmx.c has an lto-unfriendly bit, fix it up.
 
 While there, clean up our asm code.
 
 Avi Kivity (3):
   KVM: VMX: Make lto-friendly
   KVM: VMX: Make use of asm.h
   KVM: SVM: Make use of asm.h
 
 Works for me in my LTO build, thanks Avi.
 I cannot guarantee I always hit the unit splitting case, but it looks
 good so far.

Actually I think patch 1 is missing a .global vmx_return.


-- 
error compiling committee.c: too many arguments to function


Re: graphics card pci passthrough success report

2012-09-13 Thread Jan Kiszka
On 2012-09-13 07:55, Gerd Hoffmann wrote:
   Hi,
 
 - Apply the patches at the end of this mail to kvm and SeaBIOS to
   allow for more BAR space under 4G.  (The relevant BARs on the
   graphics cards _are_ 64 bit BARs, but kvm seemed to turn those
   into 32 bit BARs in the guest.)
 
 Which qemu/seabios versions have you used?
 
 qemu-1.2 (+ bundled seabios) should handle that just fine without
 patching.  There is no fixed I/O window any more, all memory space above
 lowmem is available for pci, i.e. if you give 2G to your guest
 everything above 0x8000.
 
 And if there isn't enough address space below 4G (if you assign a lot of
 memory to your guest so qemu keeps only the 0xe000 - 0x
 window free) seabios should try to map 64bit bars above 4G.
 
 - Apply the hacky patch at the end of this mail to SeaBIOS to
   always skip initialising the Radeon's option ROMs, or the VM
   would hang inside the Radeon option ROM if you boot the VM
   without the default cirrus video.
 
 A better way to handle that would probably be to add a PCI passthrough
 config option to not expose the ROM to the guest.

-device pci-assign,option-rom=,...

 
 Any clue *why* the rom doesn't run?

Maybe because we are not passing through the legacy VGA I/O ranges,
maybe because the card is accessing one of the famous side channels to
configure its mappings, and we do not virtualize them (as we usually do
not know them).

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


[PATCHv3] KVM: optimize apic interrupt delivery

2012-09-13 Thread Gleb Natapov
Most interrupts are delivered to only one vcpu. Use pre-built tables to
find the interrupt destination instead of looping through all vcpus. In case
of logical mode, loop only through the vcpus in the logical cluster the irq
is sent to.
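The mask/shift decode of the LDR used by the lookup tables can be sketched as follows (the per-mode parameters are the ones the patch sets up; the dict-based helpers are illustrative, not KVM code):

```python
# Per-mode decode parameters, as configured by recalculate_apic_map().
FLAT    = dict(ldr_bits=8,  cid_shift=8,  cid_mask=0x0,    lid_mask=0xff)
CLUSTER = dict(ldr_bits=8,  cid_shift=4,  cid_mask=0xf,    lid_mask=0xf)
X2APIC  = dict(ldr_bits=32, cid_shift=16, cid_mask=0xffff, lid_mask=0xffff)

def apic_cluster_id(m, ldr):
    ldr >>= 32 - m["ldr_bits"]          # LDR lives in the top ldr_bits bits
    return (ldr >> m["cid_shift"]) & m["cid_mask"]

def apic_logical_id(m, ldr):
    ldr >>= 32 - m["ldr_bits"]
    return ldr & m["lid_mask"]

# xAPIC cluster mode: LDR register value 0x21000000 = cluster 2, member 0.
ldr = 0x21000000
print(apic_cluster_id(CLUSTER, ldr), apic_logical_id(CLUSTER, ldr))  # 2 1
```

An interrupt in logical cluster mode then only needs to scan `logical_map[cid]`, a 16-entry row, instead of every vcpu in the VM.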

Signed-off-by: Gleb Natapov g...@redhat.com
---
 Changelog:

  - v2->v3
   * sparse annotation for rcu usage
   * move mutex above map
   * use mask/shift to calculate cluster/dst ids
   * use gotos
   * add comment about logic behind logical table creation

  - v1->v2
   * fix race Avi noticed
   * rcu_read_lock() out of the block as per Avi
   * fix rcu issues pointed out by MST. All but one. Still use
 call_rcu(). I do not think this is a serious issue. If it is, it should
 be solved by the RCU subsystem.
   * Fix phys_map overflow pointed to by MST
   * recalculate_apic_map() does not return error any more.
   * add optimization for low prio logical mode with one cpu as dst (it
 happens)


diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64adb61..9dcfd3e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 };
 
+struct kvm_apic_map {
+   struct rcu_head rcu;
+   u8 ldr_bits;
+   u32 cid_shift, cid_mask, lid_mask;
+   struct kvm_lapic *phys_map[256];
+   struct kvm_lapic *logical_map[16][16];
+};
+
 struct kvm_arch {
unsigned int n_used_mmu_pages;
unsigned int n_requested_mmu_pages;
@@ -528,6 +536,8 @@ struct kvm_arch {
struct kvm_ioapic *vioapic;
struct kvm_pit *vpit;
int vapics_in_nmi_mode;
+   struct mutex apic_map_lock;
+   struct kvm_apic_map *apic_map;
 
unsigned int tss_addr;
struct page *apic_access_page;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 07ad628..a03d4aa 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -139,11 +139,105 @@ static inline int apic_enabled(struct kvm_lapic *apic)
(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
 
+static inline int apic_x2apic_mode(struct kvm_lapic *apic)
+{
+   return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
+}
+
 static inline int kvm_apic_id(struct kvm_lapic *apic)
 {
 return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
 }
 
+static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
+{
+   ldr >>= 32 - map->ldr_bits;
+   return (ldr >> map->cid_shift) & map->cid_mask;
+}
+
+static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
+{
+   ldr >>= (32 - map->ldr_bits);
+   return ldr & map->lid_mask;
+}
+
+static inline void recalculate_apic_map(struct kvm *kvm)
+{
+   struct kvm_apic_map *new, *old = NULL;
+   struct kvm_vcpu *vcpu;
+   int i;
+
+   new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
+
+   mutex_lock(&kvm->arch.apic_map_lock);
+
+   if (!new)
+   goto out;
+
+   new->ldr_bits = 8;
+   /* flat mode is default */
+   new->cid_shift = 8;
+   new->cid_mask = 0;
+   new->lid_mask = 0xff;
+
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   struct kvm_lapic *apic = vcpu->arch.apic;
+   u16 cid, lid;
+   u32 ldr;
+
+   if (!kvm_apic_present(vcpu))
+   continue;
+
+   /*
+    * All APICs have to be configured in the same mode by an OS.
+    * We take advantage of this while building the logical id
+    * lookup table. After reset APICs are in xapic/flat mode, so
+    * if we find an apic with a different setting we assume this
+    * is the mode the OS wants all apics to be in and build the
+    * lookup table accordingly.
+    */
+   if (apic_x2apic_mode(apic)) {
+   new->ldr_bits = 32;
+   new->cid_shift = 16;
+   new->cid_mask = new->lid_mask = 0xffff;
+   } else if (kvm_apic_sw_enabled(apic) &&
+   !new->cid_mask /* flat mode */ &&
+   kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER) {
+   new->cid_shift = 4;
+   new->cid_mask = 0xf;
+   new->lid_mask = 0xf;
+   }
+
+   new->phys_map[kvm_apic_id(apic)] = apic;
+
+   ldr = kvm_apic_get_reg(apic, APIC_LDR);
+   cid = apic_cluster_id(new, ldr);
+   lid = apic_logical_id(new, ldr);
+
+   if (lid)
+   new->logical_map[cid][ffs(lid) - 1] = apic;
+   }
+out:
+   old = rcu_dereference_protected(kvm->arch.apic_map, 1);
+   rcu_assign_pointer(kvm->arch.apic_map, new);
+   mutex_unlock(&kvm->arch.apic_map_lock);
+
+   if (old)
+   kfree_rcu(old, rcu);
+}
+
+static inline void 

Re: [PATCHv3] KVM: optimize apic interrupt delivery

2012-09-13 Thread Avi Kivity
On 09/13/2012 12:00 PM, Gleb Natapov wrote:
 Most interrupt are delivered to only one vcpu. Use pre-build tables to
 find interrupt destination instead of looping through all vcpus. In case
 of logical mode loop only through vcpus in a logical cluster irq is sent
 to.

Looks good.


-- 
error compiling committee.c: too many arguments to function


[Bug 47451] New: need to re-load driver in guest to make a hot-plug VF work

2012-09-13 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=47451

   Summary: need to re-load driver in guest to make a hot-plug VF
work
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.5.0
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
AssignedTo: virtualization_...@kernel-bugs.osdl.org
ReportedBy: yongjie@intel.com
Regression: Yes


Environment:

Host OS (ia32/ia32e/IA64):ia32e
Guest OS (ia32/ia32e/IA64):ia32e
Guest OS Type (Linux/Windows):Linux (RHEL6u3)
kvm.git Commit:37e41afa97307a3e54b200a5c9179ada1632a844(master branch)
qemu-kvm Commit:28c3a9b197900c88f27b14f8862a7a15c00dc7f0(master branch)
Host Kernel Version:3.5.0-rc6  (Also exists in 3.6.0-rc3)
Hardware:Romley-EP (SandyBridge system)


Bug detailed description:
--
After hot plugging a VF to a Linux guest (e.g. RHEL6.3) in the qemu monitor, the VF
cannot work in the guest by default. I need to remove the VF driver (e.g. igbvf,
ixgbevf) and probe it again, then the VF can work in the guest.
NIC: Intel 82599 NIC, Intel 82576 NIC

There was no need to reload the VF driver in the hot-plug case when using an old kernel.
It's a regression in the kernel. (Commits are in the kvm.git and qemu-kvm.git trees.)
kvm  + qemu-kvm =result
37e41afa + 28c3a9b1 =bad
322728e5 + 28c3a9b1 =good

Note:
1. When assigning a VF in qemu-kvm command line (not hot-plug), VF can work
fine after boot-up.
2. It's easier to reproduce this in guest with 512/1024MB memory and 1/2 vCPUs.
3. Can't always reproduce with 2048MB and 2vCPUs. (Not very stable.)

Reproduce steps:

1.start up a host with kvm
2.qemu-system-x86_64 -m 512 -smp 2 -net none -hda /root/rhel6u3.img
3.switch to qemu monitor  (ctrl+Alt+2)
4.device_add pci-assign,host=02:10.0,id=mynic   (02:10.0 is VF's BDF number.)
5.switch to guest  (ctrl+Alt+1)
6.check network of the VF.  (it can't work)
7. remove VF driver in guest ('rmmod igbvf')
8. re-probe VF driver in guest ('modprobe igbvf')
9. check network of the VF. (It should work this time.)


Current result:

The VF cannot work in the guest by default. Need to re-load VF driver in guest.

Expected result:

VF works well in the guest by default after hot-plug.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


Biweekly KVM Test report, kernel 9a781977... qemu 4c3e02be...

2012-09-13 Thread Ren, Yongjie
Hi All,

This is KVM upstream test result against kvm.git next branch and qemu-kvm.git 
master branch.
 kvm.git next branch: 9a7819774e4236e8736a074b7e85276967911924 based on 
kernel 3.6.0-rc3
 qemu-kvm.git master branch: 4c3e02beed9878a5f760eeceb6cd42c475cf0127

We found 1 new bug, and no bugs were fixed, in the past two weeks.

New issue (1):
1. need to re-load driver in guest to make a hot-plug VF work
  https://bugzilla.kernel.org/show_bug.cgi?id=47451
  -- It's a regression in kernel side about 2 or 3 months ago.

Fixed issue (0):

Old issues (5):
--
1. Nested-virt: L1 (kvm on kvm) guest panics with parameter -cpu host in the qemu
command line.
  https://bugs.launchpad.net/qemu/+bug/994378
2. Can't install or boot up 32bit win8 guest.
  https://bugs.launchpad.net/qemu/+bug/1007269
3. vCPU hot-add makes the guest abort. 
  https://bugs.launchpad.net/qemu/+bug/1019179
4. Nested Virt: VMX can't be initialized in L1 Xen (Xen on KVM)
  https://bugzilla.kernel.org/show_bug.cgi?id=45931
5. Guest has no xsave feature with parameter -cpu qemu64,+xsave in qemu 
command line.
  https://bugs.launchpad.net/qemu/+bug/1042561

Test environment:
==
  Platform      Westmere-EP    Sandybridge-EP
  CPU Cores     24             32
  Memory size   24G            32G


Best Regards,
 Yongjie Ren  (Jay)


Re: graphics card pci passthrough success report

2012-09-13 Thread Lennert Buytenhek
On Thu, Sep 13, 2012 at 07:55:00AM +0200, Gerd Hoffmann wrote:

 Hi,

Hi,


  - Apply the patches at the end of this mail to kvm and SeaBIOS to
allow for more BAR space under 4G.  (The relevant BARs on the
graphics cards _are_ 64 bit BARs, but kvm seemed to turn those
into 32 bit BARs in the guest.)
 
 Which qemu/seabios versions have you used?
 
 qemu-1.2 (+ bundled seabios) should handle that just fine without
 patching.  There is no fixed I/O window any more, all memory space above
 lowmem is available for pci, i.e. if you give 2G to your guest
 everything above 0x8000.
 
  And if there isn't enough address space below 4G (if you assign a lot of
 memory to your guest so qemu keeps only the 0xe000 - 0x
 window free) seabios should try to map 64bit bars above 4G.

This was some time ago, on (L)ubuntu 12.04, which has qemu-kvm 1.0
and seabios 0.6.2.  We'll retry on a newer distro soon.


  - Apply the hacky patch at the end of this mail to SeaBIOS to
always skip initialising the Radeon's option ROMs, or the VM
would hang inside the Radeon option ROM if you boot the VM
without the default cirrus video.
 
  A better way to handle that would probably be to add a PCI passthrough
  config option to not expose the ROM to the guest.
 
 Any clue *why* the rom doesn't run?

No idea, we didn't look into that -- this was just a one afternoon
hacking session.


thanks,
Lennert


Re: [PATCH 6/5] KVM: MMU: Optimize is_last_gpte()

2012-09-13 Thread Avi Kivity
On 09/12/2012 09:03 PM, Avi Kivity wrote:
 On 09/12/2012 08:49 PM, Avi Kivity wrote:
 Instead of branchy code depending on level, gpte.ps, and mmu configuration,
 prepare everything in a bitmap during mode changes and look it up during
 runtime.
 
 6/5 is buggy, sorry, will update it tomorrow.
 
 

8<---------------8<---------------

From: Avi Kivity a...@redhat.com
Date: Wed, 12 Sep 2012 20:46:56 +0300
Subject: [PATCH v2 6/5] KVM: MMU: Optimize is_last_gpte()

Instead of branchy code depending on level, gpte.ps, and mmu configuration,
prepare everything in a bitmap during mode changes and look it up during
runtime.
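The bitmap scheme can be modeled in a few lines (a sketch under the patch's index layout: bits 0-1 of the index hold level-1, bit 2 mirrors gpte.ps; the level constants follow KVM's paging levels, and the Python helpers are illustrative, not KVM code):

```python
PT_DIRECTORY_LEVEL = 2        # 2MB/4MB entries live at this level
PT_PDPE_LEVEL      = 3        # 1GB entries
PT32E_ROOT_LEVEL   = 3
PS_SET_INDEX       = 1 << 2   # bit 2 of the lookup index mirrors gpte.ps

def build_last_pte_bitmap(root_level, pse):
    # PAE walks start one level down from the root.
    walk_top = root_level - 1 if root_level == PT32E_ROOT_LEVEL else root_level
    bitmap = 1 | (1 << PS_SET_INDEX)   # a level-1 pte always terminates
    for level in range(PT_DIRECTORY_LEVEL, walk_top + 1):
        # Large pages terminate the walk only where the mode allows them.
        if level <= PT_PDPE_LEVEL and (root_level >= PT32E_ROOT_LEVEL or pse):
            bitmap |= 1 << (PS_SET_INDEX | (level - 1))
    return bitmap

def is_last_gpte(bitmap, level, ps):
    index = (level - 1) | (PS_SET_INDEX if ps else 0)
    return bool(bitmap & (1 << index))

bm = build_last_pte_bitmap(4, pse=True)   # 64-bit paging
print(is_last_gpte(bm, 2, ps=True))       # True: a 2MB page terminates
print(is_last_gpte(bm, 2, ps=False))      # False: the walk continues
```

The runtime check thus collapses to one shift, one OR, and one bit test, with all the mode-dependent branching paid once at mode-change time.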

Signed-off-by: Avi Kivity a...@redhat.com
---

v2: rearrange bitmap (one less shift)
avoid stomping on local variable
fix index calculation
move check back to a function

 arch/x86/include/asm/kvm_host.h |  7 +++
 arch/x86/kvm/mmu.c  | 31 +++
 arch/x86/kvm/mmu.h  |  3 ++-
 arch/x86/kvm/paging_tmpl.h  | 22 +-
 4 files changed, 41 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3318bde..f9a48cf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -298,6 +298,13 @@ struct kvm_mmu {
u64 *lm_root;
u64 rsvd_bits_mask[2][4];
 
+   /*
+* Bitmap: bit set = last pte in walk
+* index[0]: pte.ps
+* index[1:2]: level
+*/
+   u8 last_pte_bitmap;
+
bool nx;
 
u64 pdptrs[4]; /* pae */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ce78408..32fe597 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3447,6 +3447,15 @@ static inline unsigned gpte_access(struct kvm_vcpu 
*vcpu, u64 gpte)
return access;
 }
 
+static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte)
+{
+   unsigned index;
+
+   index = level - 1;
+   index |= (gpte & PT_PAGE_SIZE_MASK) >> (PT_PAGE_SIZE_SHIFT - 2);
+   return mmu->last_pte_bitmap & (1 << index);
+}
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -3548,6 +3557,24 @@ static void update_permission_bitmask(struct kvm_vcpu 
*vcpu, struct kvm_mmu *mmu
}
 }
 
+static void update_last_pte_bitmap(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
+{
+   u8 map;
+   unsigned level, root_level = mmu->root_level;
+   const unsigned ps_set_index = 1 << 2;  /* bit 2 of index: ps */
+
+   if (root_level == PT32E_ROOT_LEVEL)
+   --root_level;
+   /* PT_PAGE_TABLE_LEVEL always terminates */
+   map = 1 | (1 << ps_set_index);
+   for (level = PT_DIRECTORY_LEVEL; level <= root_level; ++level) {
+   if (level <= PT_PDPE_LEVEL
+       && (mmu->root_level >= PT32E_ROOT_LEVEL || is_pse(vcpu)))
+   map |= 1 << (ps_set_index | (level - 1));
+   }
+   mmu->last_pte_bitmap = map;
+}
+
 static int paging64_init_context_common(struct kvm_vcpu *vcpu,
struct kvm_mmu *context,
int level)
@@ -3557,6 +3584,7 @@ static int paging64_init_context_common(struct kvm_vcpu 
*vcpu,
 
reset_rsvds_bits_mask(vcpu, context);
update_permission_bitmask(vcpu, context);
+   update_last_pte_bitmap(vcpu, context);
 
ASSERT(is_pae(vcpu));
 context->new_cr3 = paging_new_cr3;
@@ -3586,6 +3614,7 @@ static int paging32_init_context(struct kvm_vcpu *vcpu,
 
reset_rsvds_bits_mask(vcpu, context);
update_permission_bitmask(vcpu, context);
+   update_last_pte_bitmap(vcpu, context);
 
 context->new_cr3 = paging_new_cr3;
 context->page_fault = paging32_page_fault;
@@ -3647,6 +3676,7 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
}
 
update_permission_bitmask(vcpu, context);
+   update_last_pte_bitmap(vcpu, context);
 
return 0;
 }
@@ -3724,6 +3754,7 @@ static int init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
}
 
update_permission_bitmask(vcpu, g_context);
+   update_last_pte_bitmap(vcpu, g_context);
 
return 0;
 }
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 143ee70..b08dd34 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -20,7 +20,8 @@
 #define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
 #define PT_DIRTY_SHIFT 6
 #define PT_DIRTY_MASK (1ULL << PT_DIRTY_SHIFT)
-#define PT_PAGE_SIZE_MASK (1ULL << 7)
+#define PT_PAGE_SIZE_SHIFT 7
+#define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
 #define PT_PAT_MASK (1ULL << 7)
 #define PT_GLOBAL_MASK (1ULL << 8)
 #define PT64_NX_SHIFT 63
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index eb4a668..ec1e101 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -101,24 +101,6 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
return (ret != orig_pte);
 }
 
-static bool 

Re: Windows VM slow boot

2012-09-13 Thread Mel Gorman
On Wed, Sep 12, 2012 at 05:46:15PM +0100, Richard Davies wrote:
 Hi Mel - thanks for replying to my underhand bcc!
 
 Mel Gorman wrote:
  I see that this is an old-ish bug but I did not read the full history.
  Is it now booting faster than 3.5.0 was? I'm asking because I'm
  interested to see if commit c67fe375 helped your particular case.
 
 Yes, I think 3.6.0-rc5 is already better than 3.5.x but can still be
 improved, as discussed.
 

What are the boot times for each kernel?

 PATCH SNIPPED
 
 I have applied and tested again - perf results below.
 
 isolate_migratepages_range is indeed much reduced.
 
 There is now a lot of time in isolate_freepages_block and still quite a lot
 of lock contention, although in a different place.
 

This on top please.

---8---
From: Shaohua Li s...@fusionio.com
compaction: abort compaction loop if lock is contended or run too long

isolate_migratepages_range() might isolate no pages, for example, when
zone->lru_lock is contended and compaction is async. In this case, we should
abort compaction; otherwise, compact_zone will run a useless loop and make
zone->lru_lock even more contended.

V2:
only abort the compaction if lock is contended or run too long
Rearranged the code by Andrea Arcangeli.

[minc...@kernel.org: Putback pages isolated for migration if aborting]
[a...@linux-foundation.org: Fixup one contended usage site]
Signed-off-by: Andrea Arcangeli aarca...@redhat.com
Signed-off-by: Shaohua Li s...@fusionio.com
Signed-off-by: Mel Gorman mgor...@suse.de
---
 mm/compaction.c |   17 -
 mm/internal.h   |2 +-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 7fcd3a5..a8de20d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -70,8 +70,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, 
unsigned long *flags,
 
/* async aborts if taking too long or contended */
	if (!cc->sync) {
-		if (cc->contended)
-			*cc->contended = true;
+		cc->contended = true;
return false;
}
 
@@ -634,7 +633,7 @@ static isolate_migrate_t isolate_migratepages(struct zone 
*zone,
 
/* Perform the isolation */
low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
-	if (!low_pfn)
+	if (!low_pfn || cc->contended)
		return ISOLATE_ABORT;

	cc->migrate_pfn = low_pfn;
@@ -787,6 +786,8 @@ static int compact_zone(struct zone *zone, struct 
compact_control *cc)
switch (isolate_migratepages(zone, cc)) {
case ISOLATE_ABORT:
ret = COMPACT_PARTIAL;
+		putback_lru_pages(&cc->migratepages);
+		cc->nr_migratepages = 0;
goto out;
case ISOLATE_NONE:
continue;
@@ -831,6 +832,7 @@ static unsigned long compact_zone_order(struct zone *zone,
 int order, gfp_t gfp_mask,
 bool sync, bool *contended)
 {
+   unsigned long ret;
struct compact_control cc = {
.nr_freepages = 0,
.nr_migratepages = 0,
@@ -838,12 +840,17 @@ static unsigned long compact_zone_order(struct zone *zone,
.migratetype = allocflags_to_migratetype(gfp_mask),
.zone = zone,
.sync = sync,
-   .contended = contended,
};
INIT_LIST_HEAD(cc.freepages);
INIT_LIST_HEAD(cc.migratepages);
 
-	return compact_zone(zone, &cc);
+	ret = compact_zone(zone, &cc);
+
+	VM_BUG_ON(!list_empty(&cc.freepages));
+	VM_BUG_ON(!list_empty(&cc.migratepages));
+
+	*contended = cc.contended;
+	return ret;
 }
 
 int sysctl_extfrag_threshold = 500;
diff --git a/mm/internal.h b/mm/internal.h
index b8c91b3..4bd7c0e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -130,7 +130,7 @@ struct compact_control {
int order;  /* order a direct compactor needs */
int migratetype;/* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
-   bool *contended;/* True if a lock was contended */
+   bool contended; /* True if a lock was contended */
 };
 
 unsigned long
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm loops after kernel udpate

2012-09-13 Thread Avi Kivity
On 09/12/2012 09:11 PM, Jiri Slaby wrote:
 On 09/12/2012 10:18 AM, Avi Kivity wrote:
 On 09/12/2012 11:13 AM, Jiri Slaby wrote:

  Please provide the output of vmxcap
 (http://goo.gl/c5lUO),

   Unrestricted guest   no
 
 The big real mode fixes.
 
 

 and a snapshot of kvm_stat while the guest is hung.

 kvm statistics

  exits  6778198  615942
  host_state_reload 1988 187
  irq_exits 1523 138
  mmu_cache_miss   4   0
  fpu_reload   1   0
 
 Please run this as root so we get the tracepoint based output; and press
 'x' when it's running so we get more detailed output.
 
 kvm statistics
 
  kvm_exit  13798699  330708
  kvm_entry 13799110  330708
  kvm_page_fault13793650  330604
 kvm_exit(EXCEPTION_NMI)   6188458  330604
  kvm_exit(EXTERNAL_INTERRUPT)  2169 105
  kvm_exit(TPR_BELOW_THRESHOLD)   82   0
  kvm_exit(IO_INSTRUCTION) 6   0

Strange, it's unable to fault in the very first page.

Please provide a trace as per http://www.linux-kvm.org/page/Tracing (but
append -e kvmmmu to the command line).



-- 
error compiling committee.c: too many arguments to function


Re: [PATCHv3] KVM: optimize apic interrupt delivery

2012-09-13 Thread Jan Kiszka
On 2012-09-13 11:00, Gleb Natapov wrote:
 Most interrupts are delivered to only one vcpu. Use pre-built tables to
 find interrupt destination instead of looping through all vcpus. In case
 of logical mode, loop only through vcpus in the logical cluster the irq is
 sent to.
 
 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
  Changelog:
 
   - v2->v3
* sparse annotation for rcu usage
* move mutex above map
* use mask/shift to calculate cluster/dst ids
* use gotos
* add comment about logic behind logical table creation
 
   - v1->v2
* fix race Avi noticed
* rcu_read_lock() out of the block as per Avi
* fix rcu issues pointed to by MST. All but one. Still use
  call_rcu(). Do not think this is serious issue. If it is should be
  solved by RCU subsystem.
* Fix phys_map overflow pointed to by MST
* recalculate_apic_map() does not return error any more.
* add optimization for low prio logical mode with one cpu as dst (it
  happens)
 
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 64adb61..9dcfd3e 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
   struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
  };
  
 +struct kvm_apic_map {
 + struct rcu_head rcu;
 + u8 ldr_bits;
 + u32 cid_shift, cid_mask, lid_mask;
 + struct kvm_lapic *phys_map[256];
 + struct kvm_lapic *logical_map[16][16];
 +};
 +
  struct kvm_arch {
   unsigned int n_used_mmu_pages;
   unsigned int n_requested_mmu_pages;
 @@ -528,6 +536,8 @@ struct kvm_arch {
   struct kvm_ioapic *vioapic;
   struct kvm_pit *vpit;
   int vapics_in_nmi_mode;
 + struct mutex apic_map_lock;
 + struct kvm_apic_map *apic_map;
  
   unsigned int tss_addr;
   struct page *apic_access_page;
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 07ad628..a03d4aa 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -139,11 +139,105 @@ static inline int apic_enabled(struct kvm_lapic *apic)
   (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
  
 +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
 +{
 +	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
 +}
 +
  static inline int kvm_apic_id(struct kvm_lapic *apic)
  {
  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
  }
  
 +static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
 +{
 +	ldr >>= 32 - map->ldr_bits;
 +	return (ldr >> map->cid_shift) & map->cid_mask;
 +}
 +
 +static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
 +{
 +	ldr >>= (32 - map->ldr_bits);
 +	return ldr & map->lid_mask;
 +}
 +
 +static inline void recalculate_apic_map(struct kvm *kvm)

Inline? No recent compiler will respect it anyway, but it still looks
strange for this function.

 +{
 +	struct kvm_apic_map *new, *old = NULL;
 +	struct kvm_vcpu *vcpu;
 +	int i;
 +
 +	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
 +
 +	mutex_lock(&kvm->arch.apic_map_lock);
 +
 +	if (!new)
 +		goto out;
 +
 +	new->ldr_bits = 8;
 +	/* flat mode is deafult */
 +	new->cid_shift = 8;
 +	new->cid_mask = 0;
 +	new->lid_mask = 0xff;
 +
 +	kvm_for_each_vcpu(i, vcpu, kvm) {
 +		struct kvm_lapic *apic = vcpu->arch.apic;
 +		u16 cid, lid;
 +		u32 ldr;
 +
 +		if (!kvm_apic_present(vcpu))
 +			continue;
 +
 +		/*
 +		 * All APICs have to be configured in the same mode by an OS.
 +		 * We take advatage of this while building logical id loockup
 +		 * table. After reset APICs are in xapic/flat mode, so if we
 +		 * find apic with different setting we assume this is the mode
 +		 * os wants all apics to be in and build lookup table
 +		 * accordingly.
 +		 */
 +		if (apic_x2apic_mode(apic)) {
 +			new->ldr_bits = 32;
 +			new->cid_shift = 16;
 +			new->cid_mask = new->lid_mask = 0xffff;
 +		} else if (kvm_apic_sw_enabled(apic) &&
 +				!new->cid_mask /* flat mode */ &&
 +				kvm_apic_get_reg(apic, APIC_DFR) ==
 +				APIC_DFR_CLUSTER) {
 +			new->cid_shift = 4;
 +			new->cid_mask = 0xf;
 +			new->lid_mask = 0xf;
 +		}
 +
 +		new->phys_map[kvm_apic_id(apic)] = apic;
 +
 +		ldr = kvm_apic_get_reg(apic, APIC_LDR);
 +		cid = apic_cluster_id(new, ldr);
 +		lid = apic_logical_id(new, ldr);
 +
 +		if (lid)
 +			new->logical_map[cid][ffs(lid) - 1] = apic;
 +	}
 +out:
 +	old = rcu_dereference_protected(kvm->arch.apic_map, 1);
 +	rcu_assign_pointer(kvm->arch.apic_map, 

Re: [PATCHv3] KVM: optimize apic interrupt delivery

2012-09-13 Thread Gleb Natapov
On Thu, Sep 13, 2012 at 12:29:44PM +0200, Jan Kiszka wrote:
 On 2012-09-13 11:00, Gleb Natapov wrote:
  Most interrupts are delivered to only one vcpu. Use pre-built tables to
  find interrupt destination instead of looping through all vcpus. In case
  of logical mode, loop only through vcpus in the logical cluster the irq is
  sent to.
  
  Signed-off-by: Gleb Natapov g...@redhat.com
  ---
   Changelog:
  
    - v2->v3
 * sparse annotation for rcu usage
 * move mutex above map
 * use mask/shift to calculate cluster/dst ids
 * use gotos
 * add comment about logic behind logical table creation
  
    - v1->v2
 * fix race Avi noticed
 * rcu_read_lock() out of the block as per Avi
 * fix rcu issues pointed to by MST. All but one. Still use
   call_rcu(). Do not think this is serious issue. If it is should be
   solved by RCU subsystem.
 * Fix phys_map overflow pointed to by MST
 * recalculate_apic_map() does not return error any more.
 * add optimization for low prio logical mode with one cpu as dst (it
   happens)
  
  
  diff --git a/arch/x86/include/asm/kvm_host.h 
  b/arch/x86/include/asm/kvm_host.h
  index 64adb61..9dcfd3e 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
  struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
   };
   
  +struct kvm_apic_map {
  +   struct rcu_head rcu;
  +   u8 ldr_bits;
  +   u32 cid_shift, cid_mask, lid_mask;
  +   struct kvm_lapic *phys_map[256];
  +   struct kvm_lapic *logical_map[16][16];
  +};
  +
   struct kvm_arch {
  unsigned int n_used_mmu_pages;
  unsigned int n_requested_mmu_pages;
  @@ -528,6 +536,8 @@ struct kvm_arch {
  struct kvm_ioapic *vioapic;
  struct kvm_pit *vpit;
  int vapics_in_nmi_mode;
  +   struct mutex apic_map_lock;
  +   struct kvm_apic_map *apic_map;
   
  unsigned int tss_addr;
  struct page *apic_access_page;
  diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
  index 07ad628..a03d4aa 100644
  --- a/arch/x86/kvm/lapic.c
  +++ b/arch/x86/kvm/lapic.c
  @@ -139,11 +139,105 @@ static inline int apic_enabled(struct kvm_lapic 
  *apic)
  (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
   APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
   
  +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
  +{
  +   return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
  +}
  +
   static inline int kvm_apic_id(struct kvm_lapic *apic)
   {
  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
   }
   
  +static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
  +{
  +   ldr >>= 32 - map->ldr_bits;
  +   return (ldr >> map->cid_shift) & map->cid_mask;
  +}
  +
  +static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
  +{
  +   ldr >>= (32 - map->ldr_bits);
  +   return ldr & map->lid_mask;
  +}
  +
  +static inline void recalculate_apic_map(struct kvm *kvm)
 
 Inline? No recent compiler will respect it anyway, but it still looks
 strange for this function.
Agree. I marked it inline when it was much smaller. Avi/Marcelo should I
resend or you can edit before applying?

 
  +{
  +   struct kvm_apic_map *new, *old = NULL;
  +   struct kvm_vcpu *vcpu;
  +   int i;
  +
  +   new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
  +
  +   mutex_lock(&kvm->arch.apic_map_lock);
  +
  +   if (!new)
  +           goto out;
  +
  +   new->ldr_bits = 8;
  +   /* flat mode is deafult */
  +   new->cid_shift = 8;
  +   new->cid_mask = 0;
  +   new->lid_mask = 0xff;
  +
  +   kvm_for_each_vcpu(i, vcpu, kvm) {
  +           struct kvm_lapic *apic = vcpu->arch.apic;
  +           u16 cid, lid;
  +           u32 ldr;
  +
  +           if (!kvm_apic_present(vcpu))
  +                   continue;
  +
  +           /*
  +            * All APICs have to be configured in the same mode by an OS.
  +            * We take advatage of this while building logical id loockup
  +            * table. After reset APICs are in xapic/flat mode, so if we
  +            * find apic with different setting we assume this is the mode
  +            * os wants all apics to be in and build lookup table
  +            * accordingly.
  +            */
  +           if (apic_x2apic_mode(apic)) {
  +                   new->ldr_bits = 32;
  +                   new->cid_shift = 16;
  +                   new->cid_mask = new->lid_mask = 0xffff;
  +           } else if (kvm_apic_sw_enabled(apic) &&
  +                   !new->cid_mask /* flat mode */ &&
  +                   kvm_apic_get_reg(apic, APIC_DFR) ==
  +                   APIC_DFR_CLUSTER) {
  +                   new->cid_shift = 4;
  +                   new->cid_mask = 0xf;
  +                   new->lid_mask = 0xf;
  +           }
  +
  +           new->phys_map[kvm_apic_id(apic)] = apic;
  +
  +           ldr = kvm_apic_get_reg(apic, APIC_LDR);
  +           cid = apic_cluster_id(new, ldr);
  +           lid = apic_logical_id(new, ldr);
  +
  +           

Re: [PATCHv3] KVM: optimize apic interrupt delivery

2012-09-13 Thread Jan Kiszka
On 2012-09-13 12:33, Gleb Natapov wrote:

 So, this can be the foundation for direct MSI delivery as well, right?

 What do you mean by direct MSI delivery? kvm_irq_delivery_to_apic() is
 called by MSI. If you mean delivery from irq context, then yes, mst
 plans to do so.

Yes, that's what I was aiming at.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: [PATCH 1/5] KVM: MMU: Push clean gpte write protection out of gpte_access()

2012-09-13 Thread Xiao Guangrong
On 09/12/2012 10:29 PM, Avi Kivity wrote:
 gpte_access() computes the access permissions of a guest pte and also
 write-protects clean gptes.  This is wrong when we are servicing a
 write fault (since we'll be setting the dirty bit momentarily) but
 correct when instantiating a speculative spte, or when servicing a
 read fault (since we'll want to trap a following write in order to
 set the dirty bit).
 
 It doesn't seem to hurt in practice, but in order to make the code

In the current code, it seems that we will get two #PFs if the guest writes
memory through a clean pte: one to mark the dirty bit, then fault again to
set the W bit.

 readable, push the write protection out of gpte_access() and into
 a new protect_clean_gpte() which is called explicitly when needed.

Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com



Re: [PATCH 2/5] KVM: MMU: Optimize gpte_access() slightly

2012-09-13 Thread Xiao Guangrong
On 09/12/2012 10:29 PM, Avi Kivity wrote:
 If nx is disabled, then if gpte[63] is set we will hit a reserved
 bit set fault before checking permissions; so we can ignore the
 setting of efer.nxe.

Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com



Re: [PATCH 3/5] KVM: MMU: Move gpte_access() out of paging_tmpl.h

2012-09-13 Thread Xiao Guangrong
On 09/12/2012 10:29 PM, Avi Kivity wrote:

  static bool FNAME(is_last_gpte)(struct guest_walker *walker,
   struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
   pt_element_t gpte)
 @@ -217,7 +206,7 @@ retry_walk:
 
   last_gpte = FNAME(is_last_gpte)(walker, vcpu, mmu, pte);
   if (last_gpte) {
 -	pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
 +	pte_access = pt_access & gpte_access(vcpu, pte);

It can pass 32bit variable to gpte_access without cast, no warning?



Re: [PATCHv3] KVM: optimize apic interrupt delivery

2012-09-13 Thread Michael S. Tsirkin
On Thu, Sep 13, 2012 at 12:00:59PM +0300, Gleb Natapov wrote:
 Most interrupts are delivered to only one vcpu. Use pre-built tables to
 find interrupt destination instead of looping through all vcpus. In case
 of logical mode, loop only through vcpus in the logical cluster the irq is
 sent to.
 
 Signed-off-by: Gleb Natapov g...@redhat.com


Some comments below.
The code's pretty complex now, I think adding some comments will be
helpful. Below, I noted where this would be especially beneficial.

Thanks!

 ---
  Changelog:
 
   - v2->v3
* sparse annotation for rcu usage
* move mutex above map
* use mask/shift to calculate cluster/dst ids
* use gotos
* add comment about logic behind logical table creation
 
   - v1->v2
* fix race Avi noticed
* rcu_read_lock() out of the block as per Avi
* fix rcu issues pointed to by MST. All but one. Still use
  call_rcu(). Do not think this is serious issue. If it is should be
  solved by RCU subsystem.
* Fix phys_map overflow pointed to by MST
* recalculate_apic_map() does not return error any more.
* add optimization for low prio logical mode with one cpu as dst (it
  happens)
 
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 64adb61..9dcfd3e 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
   struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
  };
  
 +struct kvm_apic_map {
 + struct rcu_head rcu;
 + u8 ldr_bits;

ldr_bits are never used directly, always 32 - ldr_bits.
It might be a good idea to just store 32 - ldr_bits.
I am not sure.

 + u32 cid_shift, cid_mask, lid_mask;
 + struct kvm_lapic *phys_map[256];
 + struct kvm_lapic *logical_map[16][16];


Would be nice to add documentation for structure fields:
what does each field include? For example what are
the index values into logical_map? When is each mode used?
I am guessing this will address some questions below.

16 is used in several places in the code. We also have
0xf, which is really 16 - 1. Would be nice to have defines here.


 +};
 +
  struct kvm_arch {
   unsigned int n_used_mmu_pages;
   unsigned int n_requested_mmu_pages;
 @@ -528,6 +536,8 @@ struct kvm_arch {
   struct kvm_ioapic *vioapic;
   struct kvm_pit *vpit;
   int vapics_in_nmi_mode;
 + struct mutex apic_map_lock;
 + struct kvm_apic_map *apic_map;
  
   unsigned int tss_addr;
   struct page *apic_access_page;
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 07ad628..a03d4aa 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -139,11 +139,105 @@ static inline int apic_enabled(struct kvm_lapic *apic)
   (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
  
 +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
 +{
 +	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
 +}
 +
  static inline int kvm_apic_id(struct kvm_lapic *apic)
  {
  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
  }
  
 +static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)

Why is this u16? It seems the only legal values are 0-15
since this is used as index in lookup in logical_map.
Maybe add a comment explaning legal values are 0-15.
Or maybe BUG_ON to check result is 0 to 15.


 +{
 +	ldr >>= 32 - map->ldr_bits;
 +	return (ldr >> map->cid_shift) & map->cid_mask;
 +}
 +
 +static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
 +{
 +	ldr >>= (32 - map->ldr_bits);
 +	return ldr & map->lid_mask;
 +}
 +
 +static inline void recalculate_apic_map(struct kvm *kvm)
 +{
 + struct kvm_apic_map *new, *old = NULL;
 + struct kvm_vcpu *vcpu;
 + int i;
 +
 + new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
 +
 +	mutex_lock(&kvm->arch.apic_map_lock);
 +
 + if (!new)
 + goto out;
 +
 +	new->ldr_bits = 8;
 + /* flat mode is deafult */

Typo

 +	new->cid_shift = 8;
 +	new->cid_mask = 0;
 +	new->lid_mask = 0xff;
 +
 + kvm_for_each_vcpu(i, vcpu, kvm) {
 +		struct kvm_lapic *apic = vcpu->arch.apic;
 + u16 cid, lid;
 + u32 ldr;
 +
 + if (!kvm_apic_present(vcpu))
 + continue;
 +
 + /*
 +  * All APICs have to be configured in the same mode by an OS.
 +  * We take advatage of this while building logical id loockup
 +  * table. After reset APICs are in xapic/flat mode, so if we
 +  * find apic with different setting we assume this is the mode
 +  * os wants all apics to be in


s/os/OS (for consistency).

 and build lookup table accordingly.

A bit clearer:
; we  build the lookup table accordingly.

(otherwise it reads as if os builds the lookup table)

 +  */
 + if (apic_x2apic_mode(apic)) {
 + 

Re: [PATCH 3/5] KVM: MMU: Move gpte_access() out of paging_tmpl.h

2012-09-13 Thread Avi Kivity
On 09/13/2012 02:48 PM, Xiao Guangrong wrote:
 On 09/12/2012 10:29 PM, Avi Kivity wrote:
 
  static bool FNAME(is_last_gpte)(struct guest_walker *walker,
  struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
  pt_element_t gpte)
 @@ -217,7 +206,7 @@ retry_walk:
 
  last_gpte = FNAME(is_last_gpte)(walker, vcpu, mmu, pte);
  if (last_gpte) {
 -pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
 +pte_access = pt_access & gpte_access(vcpu, pte);
 
 It can pass 32bit variable to gpte_access without cast, no warning?

No warning.




Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-13 Thread Raghavendra K T
* Andrew Theurer haban...@linux.vnet.ibm.com [2012-09-11 13:27:41]:

 On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
  On 09/11/2012 01:42 AM, Andrew Theurer wrote:
   On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
   On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
   +static bool __yield_to_candidate(struct task_struct *curr, struct 
   task_struct *p)
   +{
    + if (!curr->sched_class->yield_to_task)
   + return false;
   +
    + if (curr->sched_class != p->sched_class)
   + return false;
  
  
   Peter,
  
   Should we also add a check if the runq has a skip buddy (as pointed out
   by Raghu) and return if the skip buddy is already set.
  
   Oh right, I missed that suggestion.. the performance improvement went
   from 81% to 139% using this, right?
  
   It might make more sense to keep that separate, outside of this
   function, since its not a strict prerequisite.
  
  
    + if (task_running(p_rq, p) || p->state)
   + return false;
   +
   + return true;
   +}
  
  
   @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
   bool preempt)
  rq = this_rq();
  
 again:
   + /* optimistic test to avoid taking locks */
   + if (!__yield_to_candidate(curr, p))
   + goto out_irq;
   +
  
   So add something like:
  
/* Optimistic, if we 'raced' with another yield_to(), don't bother */
	if (p_rq->cfs_rq->skip)
goto out_irq;
  
  
  p_rq = task_rq(p);
  double_rq_lock(rq, p_rq);
  
  
   But I do have a question on this optimization though,.. Why do we check
    p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
  
   That is, I'd like to see this thing explained a little better.
  
   Does it go something like: p_rq is the runqueue of the task we'd like to
    yield to, rq is our own, they might be the same. If we have a ->skip,
    there's nothing we can do about it, OTOH p_rq having a ->skip and
   failing the yield_to() simply means us picking the next VCPU thread,
   which might be running on an entirely different cpu (rq) and could
   succeed?
  
   Here's two new versions, both include a __yield_to_candidate(): v3
    uses the check for p_rq->curr in guest mode, and v4 uses the cfs_rq
   skip check.  Raghu, I am not sure if this is exactly what you want
   implemented in v4.
  
  
  Andrew, Yes that is what I had. I think there was a mis-understanding. 
  My intention was to if there is a directed_yield happened in runqueue 
  (say rqA), do not bother to directed yield to that. But unfortunately as 
  PeterZ pointed that would have resulted in setting next buddy of a 
  different run queue than rqA.
  So we can drop this skip idea. Pondering more over what to do? can we 
  use next buddy itself ... thinking..
 
 As I mentioned earlier today, I did not have your changes from kvm.git
 tree when I tested my changes.  Here are your changes and my changes
 compared:
 
 throughput in MB/sec
 
 kvm_vcpu_on_spin changes:  4636 +/- 15.74%
 yield_to changes:4515 +/- 12.73%
 
 I would be inclined to stick with your changes which are kept in kvm
 code.  I did try both combined, and did not get good results:
 
 both changes:4074 +/- 19.12%
 
 So, having both is probably not a good idea.  However, I feel like
 there's more work to be done.  With no over-commit (10 VMs), total
 throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
 overhead, but a reduction to ~4500 is still terrible.  By contrast,
 8-way VMs with 2x over-commit have a total throughput roughly 10% less
 than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
 host).  We still have what appears to be scalability problems, but now
 it's not so much in runqueue locks for yield_to(), but now
 get_pid_task():


Hi Andrew,
IMHO, reducing the double runqueue lock overhead is a good idea,
and maybe we will see the benefits when we increase the overcommit further.

The explanation for not seeing a good benefit on top of the PLE handler
optimization patch is that we filter the yield_to candidates,
hence resulting in less contention for the double runqueue lock,
and the extra code overhead during a genuine yield_to might have resulted in
some degradation in the case you tested.

However, did you use cfs.next also? I hope it helps when we combine.

Here is the result that is showing positive benefit.
I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s).
  
+------+----------+----------+----------+----------+----------+
|          kernbench time in sec, lower is better             |
+------+----------+----------+----------+----------+----------+
|      | base     | stddev   | patched  | stddev   | %improve |
+------+----------+----------+----------+----------+----------+
| 1x   | 44.3880  | 1.8699   | 40.8180  | 1.9173   | 8.04271  |
| 2x   | 96.7580  | 4.2787   | 93.4188  | 3.5150   | 3.45108  |
+------+----------+----------+----------+----------+----------+



Re: memtest 4.20+ does not work with -cpu host

2012-09-13 Thread Peter Lieven

On 13.09.2012 10:05, Gleb Natapov wrote:

On Thu, Sep 13, 2012 at 10:00:26AM +0200, Paolo Bonzini wrote:

Il 13/09/2012 09:57, Gleb Natapov ha scritto:

#rdmsr -0 0x194
00011100
#rdmsr -0 0xce
0c0004011103

Yes, that can help implementing it in KVM.  But without a spec to
understand what the bits actually mean, it's just as risky...

Peter, do you have any idea where to get the spec of the memory
controller MSRs in Nehalem and newer processors?  Apparently, memtest is
using them (and in particular 0x194) to find the speed of the FSB, or
something like that.


Why would anyone want to run memtest in a VM? Maybe just add those
MSRs to the ignore list and that's it.

From the output it looks like it's basically a list of bits.  Returning
something sensible is better, same as for the speed scaling MSRs.


Everything is a list of bits in computers :) At least 0xce is documented in the SDM.
It cannot be implemented in a migration-safe manner.

What do you suggest just say memtest does not work?
I am wondering why it is working with -cpu qemu64.

Peter



--
Gleb.




Re: [PATCH 4/5] KVM: MMU: Optimize pte permission checks

2012-09-13 Thread Xiao Guangrong
On 09/12/2012 10:29 PM, Avi Kivity wrote:
 walk_addr_generic() permission checks are a maze of branchy code, which is
 performed four times per lookup.  It depends on the type of access, efer.nxe,
 cr0.wp, cr4.smep, and in the near future, cr4.smap.
 
 Optimize this away by precalculating all variants and storing them in a
 bitmap.  The bitmap is recalculated when rarely-changing variables change
 (cr0, cr4) and is indexed by the often-changing variables (page fault error
 code, pte access permissions).

Really graceful!

 
 The result is short, branch-free code.
 
 Signed-off-by: Avi Kivity a...@redhat.com

 +static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu 
 *mmu)
 +{
 + unsigned bit, byte, pfec;
 + u8 map;
 + bool fault, x, w, u, wf, uf, ff, smep;
 +
 + smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
 +	for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
 +		pfec = byte << 1;
 +		map = 0;
 +		wf = pfec & PFERR_WRITE_MASK;
 +		uf = pfec & PFERR_USER_MASK;
 +		ff = pfec & PFERR_FETCH_MASK;
 +		for (bit = 0; bit < 8; ++bit) {
 +			x = bit & ACC_EXEC_MASK;
 +			w = bit & ACC_WRITE_MASK;
 +			u = bit & ACC_USER_MASK;
 +
 +			/* Not really needed: !nx will cause pte.nx to fault */
 +			x |= !mmu->nx;
 +			/* Allow supervisor writes if !cr0.wp */
 +			w |= !is_write_protection(vcpu) && !uf;
 +			/* Disallow supervisor fetches if cr4.smep */
 +			x &= !(smep && !uf);

In the case of smep, supervisor mode can fetch the memory if pte.u == 0,
so it should be x &= !(smep && !uf && u)?

 @@ -3672,20 +3672,18 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu 
 *vcpu, unsigned long gva,
   gpa_t *gpa, struct x86_exception *exception,
   bool write)
  {
 -	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
 +	u32 access = ((kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0)
 +		| (write ? PFERR_WRITE_MASK : 0);
 +	u8 bit = vcpu->arch.access;
 
 -	if (vcpu_match_mmio_gva(vcpu, gva) &&
 -	    check_write_user_access(vcpu, write, access,
 -				    vcpu->arch.access)) {
 +	if (vcpu_match_mmio_gva(vcpu, gva)
 +	    && ((vcpu->arch.walk_mmu->permissions[access >> 1] >> bit) & 1)) {

!((vcpu->arch.walk_mmu->permissions[access >> 1] >> bit) & 1) ?

It is better introducing a function to do the permission check?



Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-13 Thread Avi Kivity
On 09/11/2012 09:27 PM, Andrew Theurer wrote:
 
 So, having both is probably not a good idea.  However, I feel like
 there's more work to be done.  With no over-commit (10 VMs), total
 throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
 overhead, but a reduction to ~4500 is still terrible.  By contrast,
 8-way VMs with 2x over-commit have a total throughput roughly 10% less
 than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
 host).  We still have what appears to be scalability problems, but now
 it's not so much in runqueue locks for yield_to(), but now
 get_pid_task():
 
 perf on host:
 
 32.10% 320131 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task
 11.60% 115686 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock
 10.28% 102522 qemu-system-x86 [kernel.kallsyms] [k] yield_to
  9.17%  91507 qemu-system-x86 [kvm] [k] kvm_vcpu_on_spin
  7.74%  77257 qemu-system-x86 [kvm] [k] kvm_vcpu_yield_to
  3.56%  35476 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock
  3.00%  29951 qemu-system-x86 [kvm] [k] __vcpu_run
  2.93%  29268 qemu-system-x86 [kvm_intel]   [k] vmx_vcpu_run
  2.88%  28783 qemu-system-x86 [kvm] [k] vcpu_enter_guest
  2.59%  25827 qemu-system-x86 [kernel.kallsyms] [k] __schedule
  1.40%  13976 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock_irq
  1.28%  12823 qemu-system-x86 [kernel.kallsyms] [k] resched_task
  1.14%  11376 qemu-system-x86 [kvm_intel]   [k] vmcs_writel
  0.85%   8502 qemu-system-x86 [kernel.kallsyms] [k] pick_next_task_fair
  0.53%   5315 qemu-system-x86 [kernel.kallsyms] [k] native_write_msr_safe
  0.46%   4553 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc
 
 get_pid_task() uses some rcu functions, wondering how scalable this
 is  I tend to think of rcu as -not- having issues like this... is
 there a rcu stat/tracing tool which would help identify potential
 problems?

It's not, it's the atomics + cache line bouncing.  We're basically
guaranteed to bounce here.

Here we're finally paying for the ioctl() based interface.  A syscall
based interface would have a 1:1 correspondence between vcpus and tasks,
so these games would be unnecessary.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] KVM: MMU: Optimize pte permission checks

2012-09-13 Thread Avi Kivity
On 09/13/2012 03:09 PM, Xiao Guangrong wrote:
 
 The result is short, branch-free code.
 
 Signed-off-by: Avi Kivity a...@redhat.com
 
 +static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 +{
 +	unsigned bit, byte, pfec;
 +	u8 map;
 +	bool fault, x, w, u, wf, uf, ff, smep;
 +
 +	smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
 +	for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
 +		pfec = byte << 1;
 +		map = 0;
 +		wf = pfec & PFERR_WRITE_MASK;
 +		uf = pfec & PFERR_USER_MASK;
 +		ff = pfec & PFERR_FETCH_MASK;
 +		for (bit = 0; bit < 8; ++bit) {
 +			x = bit & ACC_EXEC_MASK;
 +			w = bit & ACC_WRITE_MASK;
 +			u = bit & ACC_USER_MASK;
 +
 +			/* Not really needed: !nx will cause pte.nx to fault */
 +			x |= !mmu->nx;
 +			/* Allow supervisor writes if !cr0.wp */
 +			w |= !is_write_protection(vcpu) && !uf;
 +			/* Disallow supervisor fetches if cr4.smep */
 +			x &= !(smep && !uf);
 
 In the case of smep, supervisor mode can fetch the memory if pte.u == 0,
 so it should be x &= !(smep && !uf && u)?

Good catch, will fix.

 
 @@ -3672,20 +3672,18 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
  			gpa_t *gpa, struct x86_exception *exception,
  			bool write)
   {
 -	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
 +	u32 access = ((kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0)
 +		| (write ? PFERR_WRITE_MASK : 0);
 +	u8 bit = vcpu->arch.access;
 
 -	if (vcpu_match_mmio_gva(vcpu, gva) &&
 -	    check_write_user_access(vcpu, write, access,
 -				    vcpu->arch.access)) {
 +	if (vcpu_match_mmio_gva(vcpu, gva) &&
 +	    ((vcpu->arch.walk_mmu->permissions[access >> 1] >> bit) & 1)) {
 
 !((vcpu->arch.walk_mmu->permissions[access >> 1] >> bit) & 1) ?
 
 Would it be better to introduce a function to do the permission check?
 
 

Probably, I'll rethink it.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 4/5] KVM: MMU: Optimize pte permission checks

2012-09-13 Thread Xiao Guangrong
On 09/12/2012 10:29 PM, Avi Kivity wrote:

  +	pte_access = pt_access & gpte_access(vcpu, pte);
  +	eperm |= (mmu->permissions[access >> 1] >> pte_access) & 1;
  
   	last_gpte = FNAME(is_last_gpte)(walker, vcpu, mmu, pte);
  -	if (last_gpte) {
  -		pte_access = pt_access & gpte_access(vcpu, pte);
  -		/* check if the kernel is fetching from user page */
  -		if (unlikely(pte_access & PT_USER_MASK) &&
  -		    kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
  -			if (fetch_fault && !user_fault)
  -				eperm = true;
  -	}

I see this in the SDM:

If CR4.SMEP = 1, instructions may be fetched from any linear
address with a valid translation for which the U/S flag (bit 2) is 0 in at
least one of the paging-structure entries controlling the translation.

This patch checks smep at every level, which breaks this rule.
(The current code checks smep only at the last level.)



Re: memtest 4.20+ does not work with -cpu host

2012-09-13 Thread Gleb Natapov
On Thu, Sep 13, 2012 at 02:05:23PM +0200, Peter Lieven wrote:
 On 13.09.2012 10:05, Gleb Natapov wrote:
 On Thu, Sep 13, 2012 at 10:00:26AM +0200, Paolo Bonzini wrote:
 Il 13/09/2012 09:57, Gleb Natapov ha scritto:
 #rdmsr -0 0x194
 00011100
 #rdmsr -0 0xce
 0c0004011103
 Yes, that can help implementing it in KVM.  But without a spec to
 understand what the bits actually mean, it's just as risky...
 
 Peter, do you have any idea where to get the spec of the memory
 controller MSRs in Nehalem and newer processors?  Apparently, memtest is
 using them (and in particular 0x194) to find the speed of the FSB, or
 something like that.
 
 Why would anyone want to run memtest in a VM? Maybe just add those
 MSRs to the ignore list and be done with it.
 From the output it looks like it's basically a list of bits.  Returning
 something sensible is better, same as for the speed scaling MSRs.
 
 Everything is a list of bits in computers :) At least 0xce is documented
 in the SDM.
 It cannot be implemented in a migration-safe manner.
 What do you suggest, just say that memtest does not work?
Why do you want to run it in a guest? 

 I am wondering why it is working with -cpu qemu64.
 
Because memtest has different code for different cpu models.

--
Gleb.


RE: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM macro

2012-09-13 Thread Caraman Mihai Claudiu-B02008
 -Original Message-
 From: Wood Scott-B07421
 Sent: Thursday, September 13, 2012 12:54 AM
 To: Alexander Graf
 Cc: Caraman Mihai Claudiu-B02008; kvm-...@vger.kernel.org; linuxppc-
 d...@lists.ozlabs.org; kvm@vger.kernel.org
 Subject: Re: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM
 macro
 
 On 09/12/2012 04:45 PM, Alexander Graf wrote:
 
 
  On 12.09.2012, at 23:38, Scott Wood scottw...@freescale.com wrote:
 
  On 09/12/2012 01:56 PM, Alexander Graf wrote:
 
 
  On 12.09.2012, at 15:18, Mihai Caraman mihai.cara...@freescale.com
 wrote:
 
  The current form of DO_KVM macro restricts its use to one call per
 input
  parameter set. This is caused by kvmppc_resume_\intno\()_\srr1
 symbol
  definition.
  Duplicate calls of DO_KVM are required by distinct implementations
 of
  exeption handlers which are delegated at runtime.
 
  Not sure I understand what you're trying to achieve here. Please
 elaborate ;)
 
  On 64-bit book3e we compile multiple versions of the TLB miss
 handlers,
  and choose from them at runtime.

The exception handler patching has been active in the __early_init_mmu()
function in powerpc/mm/tlb_nohash.c for quite a few years. For TLB miss
exceptions there are three handler versions: standard, HW tablewalk and bolted.

 I posted a patch to add another variant, for e6500-style hardware
 tablewalk, which shares the bolted prolog/epilog (besides prolog/epilog
 performance, e6500 is incompatible with the IBM tablewalk code for
 various reasons).  That caused us to have two DO_KVMs for the same
 exception type.

Sorry, I forgot to cc the kvm-ppc mailing list when I replied to that
discussion thread.

-Mike


Re: memtest 4.20+ does not work with -cpu host

2012-09-13 Thread Peter Lieven

On 13.09.2012 14:42, Gleb Natapov wrote:

On Thu, Sep 13, 2012 at 02:05:23PM +0200, Peter Lieven wrote:

On 13.09.2012 10:05, Gleb Natapov wrote:

On Thu, Sep 13, 2012 at 10:00:26AM +0200, Paolo Bonzini wrote:

Il 13/09/2012 09:57, Gleb Natapov ha scritto:

#rdmsr -0 0x194
00011100
#rdmsr -0 0xce
0c0004011103

Yes, that can help implementing it in KVM.  But without a spec to
understand what the bits actually mean, it's just as risky...

Peter, do you have any idea where to get the spec of the memory
controller MSRs in Nehalem and newer processors?  Apparently, memtest is
using them (and in particular 0x194) to find the speed of the FSB, or
something like that.


Why would anyone want to run memtest in a VM? Maybe just add those
MSRs to the ignore list and be done with it.

From the output it looks like it's basically a list of bits.  Returning
something sensible is better, same as for the speed scaling MSRs.


Everything is a list of bits in computers :) At least 0xce is documented in the SDM.
It cannot be implemented in a migration-safe manner.

What do you suggest, just say that memtest does not work?

Why do you want to run it in a guest?
Testing memory throughput of different host memory layouts/settings 
(hugepages, ksm etc.).

Stress testing new settings and qemu-kvm builds.
Testing new nodes with a VM which claims all available pages. It's a lot
easier than booting a node with a CD and attaching to the console.

This, of course, is all not mission critical and can also be done with
cpu model qemu64. I just came across memtest no longer working and was
wondering if there is a general regression.


BTW, from 
http://opensource.apple.com/source/xnu/xnu-1228.15.4/osfmk/i386/tsc.c?txt


#define MSR_FLEX_RATIO  0x194
#define MSR_PLATFORM_INFO   0x0ce
#define BASE_NHM_CLOCK_SOURCE   1ULL
#define CPUID_MODEL_NEHALEM 26

switch (cpuid_info()->cpuid_model) {
case CPUID_MODEL_NEHALEM: {
uint64_t cpu_mhz;
uint64_t msr_flex_ratio;
uint64_t msr_platform_info;

/* See if FLEX_RATIO is being used */
msr_flex_ratio = rdmsr64(MSR_FLEX_RATIO);
msr_platform_info = rdmsr64(MSR_PLATFORM_INFO);
flex_ratio_min = (uint32_t)bitfield(msr_platform_info, 47, 40);
flex_ratio_max = (uint32_t)bitfield(msr_platform_info, 15, 8);
/* No BIOS-programed flex ratio. Use hardware max as default */
tscGranularity = flex_ratio_max;
if (msr_flex_ratio & bit(16)) {
/* Flex Enabled: Use this MSR if less than max */
flex_ratio = (uint32_t)bitfield(msr_flex_ratio, 15, 8);
if (flex_ratio < flex_ratio_max)
tscGranularity = flex_ratio;
}

/* If EFI isn't configured correctly, use a constant
 * value. See 6036811.
 */
if (busFreq == 0)
busFreq = BASE_NHM_CLOCK_SOURCE;

cpu_mhz = tscGranularity * BASE_NHM_CLOCK_SOURCE;

kprintf("[NHM] Maximum Non-Turbo Ratio = [%d]\n",
(uint32_t)tscGranularity);
kprintf("[NHM] CPU: Frequency  = %6d.%04dMhz\n",
(uint32_t)(cpu_mhz / Mega), (uint32_t)(cpu_mhz % Mega));
break;
}



Peter


Re: [PATCH 4/5] KVM: MMU: Optimize pte permission checks

2012-09-13 Thread Avi Kivity
On 09/13/2012 03:41 PM, Xiao Guangrong wrote:
 On 09/12/2012 10:29 PM, Avi Kivity wrote:
 
 +	pte_access = pt_access & gpte_access(vcpu, pte);
 +	eperm |= (mmu->permissions[access >> 1] >> pte_access) & 1;
 
  	last_gpte = FNAME(is_last_gpte)(walker, vcpu, mmu, pte);
 -	if (last_gpte) {
 -		pte_access = pt_access & gpte_access(vcpu, pte);
 -		/* check if the kernel is fetching from user page */
 -		if (unlikely(pte_access & PT_USER_MASK) &&
 -		    kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
 -			if (fetch_fault && !user_fault)
 -				eperm = true;
 -	}
 
 I see this in the SDM:
 
 If CR4.SMEP = 1, instructions may be fetched from any linear
 address with a valid translation for which the U/S flag (bit 2) is 0 in at
 least one of the paging-structure entries controlling the translation.

Another good catch.

 
 This patch checks smep at every level, which breaks this rule.
 (The current code checks smep only at the last level.)
 

We can just move the permission check to the end of the loop.  We used
to terminate the loop on a permission error, but now we do the whole
thing anyway.

It does mean that we'll need to set accessed bits after the loop is
complete.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 6/5] KVM: MMU: Optimize is_last_gpte()

2012-09-13 Thread Xiao Guangrong
On 09/13/2012 05:47 PM, Avi Kivity wrote:
 On 09/12/2012 09:03 PM, Avi Kivity wrote:
 On 09/12/2012 08:49 PM, Avi Kivity wrote:
 Instead of branchy code depending on level, gpte.ps, and mmu configuration,
 prepare everything in a bitmap during mode changes and look it up during
 runtime.

 6/5 is buggy, sorry, will update it tomorrow.


 
 ---8<------8<---
 
 From: Avi Kivity a...@redhat.com
 Date: Wed, 12 Sep 2012 20:46:56 +0300
 Subject: [PATCH v2 6/5] KVM: MMU: Optimize is_last_gpte()
 
 Instead of branchy code depending on level, gpte.ps, and mmu configuration,
 prepare everything in a bitmap during mode changes and look it up during
 runtime.
 
 Signed-off-by: Avi Kivity a...@redhat.com
 ---
 
 v2: rearrange bitmap (one less shift)
 avoid stomping on local variable
 fix index calculation
 move check back to a function
 
  arch/x86/include/asm/kvm_host.h |  7 +++
  arch/x86/kvm/mmu.c  | 31 +++
  arch/x86/kvm/mmu.h  |  3 ++-
  arch/x86/kvm/paging_tmpl.h  | 22 +-
  4 files changed, 41 insertions(+), 22 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 3318bde..f9a48cf 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -298,6 +298,13 @@ struct kvm_mmu {
   u64 *lm_root;
   u64 rsvd_bits_mask[2][4];
 
 + /*
 +  * Bitmap: bit set = last pte in walk
 +  * index[0]: pte.ps
 +  * index[1:2]: level
 +  */

Opposite? index[2]: pte.pse?

Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com




Re: [Qemu-devel] [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic

2012-09-13 Thread Maciej W. Rozycki
On Wed, 12 Sep 2012, Matthew Ogilvie wrote:

 Also, how big of a concern is a very rare gained or lost IRQ0
 actually?  Under normal conditions, I would expect this to at most
 cause a one time clock drift in the guest OS of a fraction of
 a second.  If that only happens when rebooting or migrating the
 guest...

 It depends on how you define very rare.  Once per month or even 
per day is probably acceptable, although you'll see a disruption in 
the system clock.  This is still likely unwanted if the system is used as 
a clock reference and does not just want to keep its clock right for its own 
purposes.  Anything more frequent and NTP does care very much; an accurate 
system clock is important in many uses, starting from basic ones such as 
where timestamps of files exported over NFS are concerned.

 Speaking of real hw -- I don't know whether that really matters for 
emulated systems.  Thanks for looking into the 8254 PIT in details.

  Maciej


Re: memtest 4.20+ does not work with -cpu host

2012-09-13 Thread Gleb Natapov
On Thu, Sep 13, 2012 at 02:56:33PM +0200, Peter Lieven wrote:
 On 13.09.2012 14:42, Gleb Natapov wrote:
 On Thu, Sep 13, 2012 at 02:05:23PM +0200, Peter Lieven wrote:
 On 13.09.2012 10:05, Gleb Natapov wrote:
 On Thu, Sep 13, 2012 at 10:00:26AM +0200, Paolo Bonzini wrote:
 Il 13/09/2012 09:57, Gleb Natapov ha scritto:
 #rdmsr -0 0x194
 00011100
 #rdmsr -0 0xce
 0c0004011103
 Yes, that can help implementing it in KVM.  But without a spec to
 understand what the bits actually mean, it's just as risky...
 
 Peter, do you have any idea where to get the spec of the memory
 controller MSRs in Nehalem and newer processors?  Apparently, memtest 
 is
 using them (and in particular 0x194) to find the speed of the FSB, or
 something like that.
 
  Why would anyone want to run memtest in a VM? Maybe just add those
  MSRs to the ignore list and be done with it.
 From the output it looks like it's basically a list of bits.  Returning
 something sensible is better, same as for the speed scaling MSRs.
 
  Everything is a list of bits in computers :) At least 0xce is documented
  in the SDM.
  It cannot be implemented in a migration-safe manner.
  What do you suggest, just say that memtest does not work?
 Why do you want to run it in a guest?
 Testing memory thorughput of different host memory layouts/settings
 (hugepages, ksm etc.).
In my day memtest looked for memory errors. This does not make much
sense in a virtualized environment. What does it do today? Calculate
throughput? Does it prefault memory before doing so? Otherwise the
numbers will not be very meaningful when running inside a VM. But since
memtest works on physical memory I doubt it prefaults.

 Stress testing new settings and qemu-kvm builds.
Why would a guest accessing memory stress qemu-kvm?

 Testing new nodes with a VM which claims all available pages. Its a
 lot easier than booting
 a node with a CD and attaching to the Console.
Boot Windows, it accesses all memory :) or run with qemu64 like you say
below.

 
 This, of course, is all not missing critical and call also be done
 with cpu model qemu64. I just
 came across memtest no longer working and where wondering if there
 is a general regressing.
 
If it is a regression, it is likely in memtest.

 BTW, from 
 http://opensource.apple.com/source/xnu/xnu-1228.15.4/osfmk/i386/tsc.c?txt

You can send them patch to check that it runs in a VM and skip all that.
 
 #define MSR_FLEX_RATIO  0x194
 #define MSR_PLATFORM_INFO   0x0ce
 #define BASE_NHM_CLOCK_SOURCE   1ULL
 #define CPUID_MODEL_NEHALEM 26
 
   switch (cpuid_info()->cpuid_model) {
   case CPUID_MODEL_NEHALEM: {
   uint64_t cpu_mhz;
   uint64_t msr_flex_ratio;
   uint64_t msr_platform_info;
 
   /* See if FLEX_RATIO is being used */
   msr_flex_ratio = rdmsr64(MSR_FLEX_RATIO);
   msr_platform_info = rdmsr64(MSR_PLATFORM_INFO);
   flex_ratio_min = (uint32_t)bitfield(msr_platform_info, 47, 40);
   flex_ratio_max = (uint32_t)bitfield(msr_platform_info, 15, 8);
   /* No BIOS-programed flex ratio. Use hardware max as default */
   tscGranularity = flex_ratio_max;
   if (msr_flex_ratio & bit(16)) {
   /* Flex Enabled: Use this MSR if less than max */
   flex_ratio = (uint32_t)bitfield(msr_flex_ratio, 15, 8);
   if (flex_ratio < flex_ratio_max)
   tscGranularity = flex_ratio;
   }
 
   /* If EFI isn't configured correctly, use a constant
* value. See 6036811.
*/
   if (busFreq == 0)
   busFreq = BASE_NHM_CLOCK_SOURCE;
 
   cpu_mhz = tscGranularity * BASE_NHM_CLOCK_SOURCE;
 
   kprintf("[NHM] Maximum Non-Turbo Ratio = [%d]\n",
   (uint32_t)tscGranularity);
   kprintf("[NHM] CPU: Frequency  = %6d.%04dMhz\n",
   (uint32_t)(cpu_mhz / Mega), (uint32_t)(cpu_mhz % Mega));
   break;
 }
 
 
 
 Peter

--
Gleb.


Re: [PATCH v7 3/3] KVM: perf: kvm events analysis tool

2012-09-13 Thread Arnaldo Carvalho de Melo
Em Wed, Sep 12, 2012 at 10:56:44PM -0600, David Ahern escreveu:
   static const char * const kvm_usage[] = {
 +	"perf kvm [options] {top|record|report|diff|buildid-list|stat}",

 The usage for the report/record sub commands of stat is never shown. e.g.,
 $ perf kvm stat
 -- shows help for perf-stat

 $ perf kvm
 -- shows the above and perf-kvm's usage
 
 [I deleted this thread, so I'm having to reply to one of my responses.
 Hopefully no one is unduly harmed by this.]
 
 I've been using this command a bit lately -- especially on nested
 virtualization -- and I think the syntax is quirky - meaning wrong.
 In my case I always follow up a record with a report and end up
 using a shell script wrapper that combines the two and runs it
 repeatedly. e.g.,
 
 $PERF kvm stat record -o $FILE -p $pid -- sleep $time
 [ $? -eq 0 ] && $PERF --no-pager kvm -i $FILE stat report
 
 As my daughter likes to say - awkward.
 
 That suggests what is really needed is a 'live' mode - a continual
 updating of the output like perf top, not a record and analyze later
 mode. Which does come back to why I responded to this email -- the
 syntax is clunky and awkward.
 
 So, I spent a fair amount of time today implementing a live mode.
 And after a lot of swearing at the tracepoint processing code I

What kind of swearing? I'm working on 'perf test' entries for
tracepoints to make sure we don't regress on the perf/libtraceevent
junction, doing that as prep work for further simplifying tracepoint
tools like sched, kvm, kmem, etc.

 finally have it working. And the format extends easily (meaning 
 day and the next step) to a perf-based kvm_stat replacement. Example
 syntax is:
 
perf kvm stat [-p pid|-a|...]
 
 which defaults to an update delay of 1 second, and vmexit analysis.
 
 The guts of the processing logic come from the existing kvm-events
 code. The changes focus on combining the record and report paths
 into one. The display needs some help (Arnaldo?), but it seems to
 work well.
 
 I'd like to get opinions on what next? IMO, the record/report path
 should not get a foot hold from a backward compatibility perspective
 and having to maintain those options. I am willing to take the
 existing patches into git to maintain authorship and from there
 apply patches to make the live mode work - which includes a bit of
 refactoring of perf code (like the stats changes).
 
 Before I march down this path, any objections, opinions, etc?

Can I see the code?

- Arnaldo


Re: [Qemu-devel] [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic

2012-09-13 Thread Jan Kiszka
On 2012-09-13 15:41, Maciej W. Rozycki wrote:
 On Wed, 12 Sep 2012, Matthew Ogilvie wrote:
 
 Also, how big of a concern is a very rare gained or lost IRQ0
 actually?  Under normal conditions, I would expect this to at most
 cause a one time clock drift in the guest OS of a fraction of
 a second.  If that only happens when rebooting or migrating the
 guest...
 
  It depends on how you define very rare.  Once per month or probably 
 even per day is probably acceptable although you'll see a disruption in 
 the system clock.  This is still likely unwanted if the system is used as 
 a clock reference and not just wants to keep its clock right for own 
 purposes.  Anything more frequent and NTP does care very much; an accurate 
 system clock is important in many uses, starting from basic ones such as 
 where timestamps of files exported over NFS are concerned.
 
  Speaking of real hw -- I don't know whether that really matters for 
 emulated systems.  Thanks for looking into the 8254 PIT in details.

First correct, then fast. That rule applies at least to the conceptual
phase. Also, for rarely used PIT modes, I would refrain from optimizing
them away from the specified behaviour.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic

2012-09-13 Thread Jan Kiszka
On 2012-09-13 07:49, Matthew Ogilvie wrote:
 On Wed, Sep 12, 2012 at 10:57:57AM +0200, Jan Kiszka wrote:
 On 2012-09-12 10:51, Avi Kivity wrote:
 On 09/12/2012 11:48 AM, Jan Kiszka wrote:
 On 2012-09-12 10:01, Avi Kivity wrote:
 On 09/10/2012 04:29 AM, Matthew Ogilvie wrote:
 Intel's definition of edge triggered means: asserted with a
 low-to-high transition at the time an interrupt is registered
 and then kept high until the interrupt is served via one of the
 EOI mechanisms or goes away unhandled.

 So the only difference between edge triggered and level triggered
 is in the leading edge, with no difference in the trailing edge.

 This bug manifested itself when the guest was Microport UNIX
 System V/386 v2.1 (ca. 1987), because it would sometimes mask
 off IRQ14 in the slave IMR after it had already been asserted.
 The master would still try to deliver an interrupt even though
  IRQ2 had dropped again, resulting in a spurious interrupt
 (IRQ15) and a panicked UNIX kernel.
 diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
 index adba28f..5cbba99 100644
 --- a/arch/x86/kvm/i8254.c
 +++ b/arch/x86/kvm/i8254.c
 @@ -302,8 +302,12 @@ static void pit_do_work(struct kthread_work *work)
  }
  	spin_unlock(&ps->inject_lock);
  	if (inject) {
  -		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
  +		/* Clear previous interrupt, then create a rising
  +		 * edge to request another interrupt, and leave it at
  +		 * level=1 until time to inject another one.
  +		 */
  		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
  +		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
  
  /*

 I thought I understood this, now I'm not sure.  How can this be correct?
  Real hardware doesn't act like this.

 What if the PIT is disabled after this?  You're injecting a spurious
 interrupt then.

 Yes, the PIT has to raise the output as long as specified, i.e.
 according to the datasheet. That's important now due to the corrections
 to the PIC. We can then carefully check if there is room for
 simplifications / optimizations. I also cannot imagine that the above
 already fulfills these requirements.

 And if the PIT is disabled by the HPET, we need to clear the output
 explicitly as we inject the IRQ#0 under a different source ID than
 userspace HPET does (which will logically take over IRQ#0 control). The
 kernel would otherwise OR both sources to an incorrect result.


 I guess we need to double the hrtimer rate then in order to generate a
 square wave.  It's getting ridiculous how accurate our model needs to be.

 I would suggest to solve this for the userspace model first, ensure that
 it works properly in all modes, maybe optimize it, and then decide how
 to map all this on kernel space. As long as we have two models, we can
 also make use of them.
 
 Thoughts about the 8254 PIT:
 
 First, this summary of (real) 8254 PIT behavior seems fairly
 good, as far it goes:
 
 On Tue, Sep 04, 2012 at 07:27:38PM +0100, Maciej W. Rozycki wrote:
  * The 8254 PIT is normally configured in mode 2 or 3 in the PC/AT
architecture.  In the former its output is high (active) all the time
except from one (last) clock cycle.  In the latter a wave that has a
duty cycle close or equal to 0.5 (depending on whether the divider is
odd or even) is produced, so no short pulses either.  I don't remember
the other four modes -- have a look at the datasheet if interested, but
I reckon they're not really compatible with the wiring anyway, e.g. the
gate is hardwired enabled.
 
 I've also just skimmed parts of the 8254 section of The Indispensable PC
 Hardware Book, by Hans-Peter Messmer, Copyright 1994 Addison-Wesley,
 although I probably ought to read it more carefully.

http://download.intel.com/design/archives/periphrl/docs/23124406.pdf
should be the primary reference - as long as it leaves no open questions.

 
 Under normal conditions, the 8254 part of the patch above should be
 indistinguishable from previous behavior.  The 8259's IRR will
 still show up as 1 until the interrupt is actually serviced,
 and no new interrupt will be serviced after one is serviced until
 another edge is injected via the high-low-high transition of the new
 code.  (Unless the guest resets the 8259 or maybe messes with IMR,
 but real hardware would generate extra interrupts in such cases as
 well.)
 
 The new code sounds much closer to mode 2 described by
 Maciej, compared to the old code - except the duty cycle is
 effectively 100 percent instead of 99.[some number of 9's] percent.
 
 -
 But there might be some concerns in abnormal conditions:
 
* If some guest is actually depending on a 50 percent duty cycle
  (maybe some kind of polling rather than interrupts), I would
  expect it to be just as broken before this patch as after,
  unless it is really weird (handles 

Re: [PATCH v7 3/3] KVM: perf: kvm events analysis tool

2012-09-13 Thread David Ahern

On 9/13/12 7:45 AM, Arnaldo Carvalho de Melo wrote:

Em Wed, Sep 12, 2012 at 10:56:44PM -0600, David Ahern escreveu:

  static const char * const kvm_usage[] = {
+	"perf kvm [options] {top|record|report|diff|buildid-list|stat}",



The usage for the report/record sub commands of stat is never shown. e.g.,
$ perf kvm stat
-- shows help for perf-stat



$ perf kvm
-- shows the above and perf-kvm's usage


[I deleted this thread, so having to reply to one of my responses.
hopefully noone is unduly harmed by this.]

I've been using this command a bit lately -- especially on nested
virtualization -- and I think the syntax is quirky - meaning wrong.
In my case I always follow up a record with a report and end up
using a shell script wrapper that combines the 2 and running it
repeatedly. e.g.,

 $PERF kvm stat record -o $FILE -p $pid -- sleep $time
     [ $? -eq 0 ] && $PERF --no-pager kvm -i $FILE stat report

As my daughter likes to say - awkward.

That suggests what is really needed is a 'live' mode - a continual
updating of the output like perf top, not a record and analyze later
mode. Which does come back to why I responded to this email -- the
syntax is klunky and awkward.

So, I spent a fair amount of time today implementing a live mode.
And after a lot of swearing at the tracepoint processing code I


What kind of swearing? I'm working on 'perf test' entries for
tracepoints to make sure we don't regress on the perf/libtraceevent
junction, doing that as prep work for further simplifying tracepoint
tools like sched, kvm, kmem, etc.


Have you seen how the tracing initialization is done? Ugly. record 
generates tracing-data events and report uses those to do the init so 
you can access the raw_data. I ended up writing this:


static int perf_kvm__tracing_init(void)
{
	struct tracing_data *tdata;
	char temp_file[] = "/tmp/perf-XXXXXX";	/* mkstemp template */
	int fd;

	fd = mkstemp(temp_file);
	if (fd < 0) {
		pr_err("mkstemp failed\n");
		return -1;
	}
	unlink(temp_file);

	tdata = tracing_data_get(&kvm_events.evlist->entries, fd, false);
	if (!tdata)
		return -1;
	lseek(fd, 0, SEEK_SET);
	(void) trace_report(fd, &kvm_events.session->pevent, false);
	tracing_data_put(tdata);

	return 0;
}





finally have it working. And the format extends easily (meaning 
day and the next step) to a perf-based kvm_stat replacement. Example
syntax is:

perf kvm stat [-p pid|-a|...]

which defaults to an update delay of 1 second, and vmexit analysis.

The guts of the processing logic come from the existing kvm-events
code. The changes focus on combining the record and report paths
into one. The display needs some help (Arnaldo?), but it seems to
work well.

I'd like to get opinions on what next? IMO, the record/report path
should not get a foot hold from a backward compatibility perspective
and having to maintain those options. I am willing to take the
existing patches into git to maintain authorship and from there
apply patches to make the live mode work - which includes a bit of
refactoring of perf code (like the stats changes).

Before I march down this path, any objections, opinions, etc?


Can I see the code?


Let me clean it up over the weekend and send out an RFC for it.

David


[PATCHv4] KVM: optimize apic interrupt delivery

2012-09-13 Thread Gleb Natapov
Most interrupts are delivered to only one vcpu. Use pre-built tables to
find the interrupt destination instead of looping through all vcpus. In
logical mode, loop only through the vcpus in the logical cluster the irq
is sent to.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 Changelog:

  - v3-v4
   * remove inline from recalculate_apic_map()
   * add BUG_ON() to apic_cluster_id()
   * add comments to non self explanatory kvm_apic_map fields
   * MST convinced me that we do not need to optimize low prio logical
 mode with one cpu as dst, so drop it
   * fix some typo and comments
   * remove unneeded cast

  - v2-v3
   * sparse annotation for rcu usage
   * move mutex above map
   * use mask/shift to calculate cluster/dst ids
   * use gotos
   * add comment about logic behind logical table creation

  - v1 -> v2
   * fix race Avi noticed
   * rcu_read_lock() out of the block as per Avi
   * fix rcu issues pointed to by MST. All but one. Still use
 call_rcu(). Do not think this is serious issue. If it is should be
 solved by RCU subsystem.
   * Fix phys_map overflow pointed to by MST
   * recalculate_apic_map() does not return error any more.
   * add optimization for low prio logical mode with one cpu as dst (it
 happens)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64adb61..742f91b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -511,6 +511,16 @@ struct kvm_arch_memory_slot {
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 };
 
+struct kvm_apic_map {
+   struct rcu_head rcu;
+   u8 ldr_bits;
+   /* fields below are used to decode ldr values in different modes */
+   u32 cid_shift, cid_mask, lid_mask;
+   struct kvm_lapic *phys_map[256];
+   /* first index is cluster id second is cpu id in a cluster */
+   struct kvm_lapic *logical_map[16][16];
+};
+
 struct kvm_arch {
unsigned int n_used_mmu_pages;
unsigned int n_requested_mmu_pages;
@@ -528,6 +538,8 @@ struct kvm_arch {
struct kvm_ioapic *vioapic;
struct kvm_pit *vpit;
int vapics_in_nmi_mode;
+   struct mutex apic_map_lock;
+   struct kvm_apic_map *apic_map;
 
unsigned int tss_addr;
struct page *apic_access_page;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 07ad628..6e12ddd 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -139,11 +139,110 @@ static inline int apic_enabled(struct kvm_lapic *apic)
(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
 
+static inline int apic_x2apic_mode(struct kvm_lapic *apic)
+{
+   return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
+}
+
 static inline int kvm_apic_id(struct kvm_lapic *apic)
 {
return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
 }
 
+static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
+{
+   u16 cid;
+   ldr >>= 32 - map->ldr_bits;
+   cid = (ldr >> map->cid_shift) & map->cid_mask;
+
+   BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
+
+   return cid;
+}
+
+static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
+{
+   ldr >>= (32 - map->ldr_bits);
+   return ldr & map->lid_mask;
+}
+
+static void recalculate_apic_map(struct kvm *kvm)
+{
+   struct kvm_apic_map *new, *old = NULL;
+   struct kvm_vcpu *vcpu;
+   int i;
+
+   new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
+
+   mutex_lock(&kvm->arch.apic_map_lock);
+
+   if (!new)
+   goto out;
+
+   new->ldr_bits = 8;
+   /* flat mode is default */
+   new->cid_shift = 8;
+   new->cid_mask = 0;
+   new->lid_mask = 0xff;
+
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   struct kvm_lapic *apic = vcpu->arch.apic;
+   u16 cid, lid;
+   u32 ldr;
+
+   if (!kvm_apic_present(vcpu))
+   continue;
+
+   /*
+    * All APICs have to be configured in the same mode by an OS.
+    * We take advantage of this while building the logical id lookup
+    * table. After reset APICs are in xapic/flat mode, so if we
+    * find an apic with a different setting we assume this is the mode
+    * the OS wants all apics to be in; build lookup table accordingly.
+    */
+   if (apic_x2apic_mode(apic)) {
+   new->ldr_bits = 32;
+   new->cid_shift = 16;
+   new->cid_mask = new->lid_mask = 0xffff;
+   } else if (kvm_apic_sw_enabled(apic) &&
+   !new->cid_mask /* flat mode */ &&
+   kvm_apic_get_reg(apic, APIC_DFR) ==
+   APIC_DFR_CLUSTER) {
+   new->cid_shift = 4;
+   new->cid_mask = 0xf;
+   new->lid_mask = 0xf;
+   }
+
+   

Re: [PATCH v7 3/3] KVM: perf: kvm events analysis tool

2012-09-13 Thread Arnaldo Carvalho de Melo
On Thu, Sep 13, 2012 at 08:14:37AM -0600, David Ahern wrote:
 On 9/13/12 7:45 AM, Arnaldo Carvalho de Melo wrote:
 On Wed, Sep 12, 2012 at 10:56:44PM -0600, David Ahern wrote:

 So, I spent a fair amount of time today implementing a live mode.
 And after a lot of swearing at the tracepoint processing code I

 What kind of swearing? I'm working on 'perf test' entries for
 tracepoints to make sure we don't regress on the perf/libtraceevent
 junction, doing that as prep work for further simplifying tracepoint
 tools like sched, kvm, kmem, etc.
 
 Have you seen how the tracing initialization is done? ugly. record
 generates tracing data events and report uses those to do the init
 so you can access the raw_data. I ended up writing this:

And all we need is the list of fields so that we can use
perf_evsel__{int,str}val like I did in my 'perf sched' patch series (in
my perf/core branch), and even those accessors I'll tweak some more as
we don't need to check the endianness of the events, it's in the same
machine, etc.

I'm trying to get by without using a 'pevent', just using 'event_format';
it's doable when everything is local, as a single-machine top tool is.

I want to just create the tracepoint events and process them like in
'top', using code more or less like what is in test__PERF_RECORD.

This still needs more work, so I think you can continue in your path and
eventually we'll have infrastructure to do it the way I'm describing,
optimizing the case where the record and top are in the same
machine, i.e. a short circuited 'live mode' with the top machinery
completely reused for tools, be it written in C, like 'sched', 'kvm',
'kmem', etc, or in perl or python.

- Arnaldo
 
 static int perf_kvm__tracing_init(void)
 {
 struct tracing_data *tdata;
 char temp_file[] = "/tmp/perf-XXXXXX";
 int fd;
 
 fd = mkstemp(temp_file);
 if (fd < 0) {
 pr_err("mkstemp failed\n");
 return -1;
 }
 unlink(temp_file);
 
 tdata = tracing_data_get(&kvm_events.evlist->entries, fd, false);
 if (!tdata)
 return -1;
 lseek(fd, 0, SEEK_SET);
 (void) trace_report(fd, &kvm_events.session->pevent, false);
 tracing_data_put(tdata);
 
 return 0;
 }
 
 
 
 finally have it working. And the format extends easily (meaning 
 day and the next step) to a perf-based kvm_stat replacement. Example
 syntax is:
 
 perf kvm stat [-p pid|-a|...]
 
 which defaults to an update delay of 1 second, and vmexit analysis.
 
 The guts of the processing logic come from the existing kvm-events
 code. The changes focus on combining the record and report paths
 into one. The display needs some help (Arnaldo?), but it seems to
 work well.
 
 I'd like to get opinions on what next? IMO, the record/report path
 should not get a foothold from a backward compatibility perspective
 and having to maintain those options. I am willing to take the
 existing patches into git to maintain authorship and from there
 apply patches to make the live mode work - which includes a bit of
 refactoring of perf code (like the stats changes).
 
 Before I march down this path, any objections, opinions, etc?
 
 Can I see the code?
 
 Let me clean it up over the weekend and send out an RFC for it.
 
 David


Re: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM macro

2012-09-13 Thread Alexander Graf

On 09/12/2012 03:18 PM, Mihai Caraman wrote:

The current form of DO_KVM macro restricts its use to one call per input
parameter set. This is caused by kvmppc_resume_\intno\()_\srr1 symbol
definition.
Duplicate calls of DO_KVM are required by distinct implementations of
exception handlers which are delegated at runtime. Use a rare label number
to avoid conflicts with the calling contexts.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com


Thanks, applied to kvm-ppc-next.


Alex



Re: [Qemu-devel] [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic

2012-09-13 Thread Maciej W. Rozycki
On Thu, 13 Sep 2012, Jan Kiszka wrote:

  I've also just skimmed parts of the 8254 section of The Indispensable PC
  Hardware Book, by Hans-Peter Messmer, Copyright 1994 Addison-Wesley,
  although I probably ought to read it more carefully.
 
 http://download.intel.com/design/archives/periphrl/docs/23124406.pdf
 should be the primary reference - as long as it leaves no open questions.

 Oh, I'm glad they've put it online after all, so there's an ultimate 
place to refer to.  I've only got a copy of this datasheet I got from 
Intel on a CD some 15 years ago.

 And for the record -- they used to publish the 8259A datasheet as well, 
but it appears to have gone from its place.  However it can be easily 
tracked down by an Internet search engine of your choice by referring to 
its order # as 231468.pdf (no revision number is embedded in the file 
name, as there was none when it was originally published).

  Maciej


Re: [PATCH 0/3] Prepare kvm for lto

2012-09-13 Thread Andi Kleen
On Thu, Sep 13, 2012 at 11:27:43AM +0300, Avi Kivity wrote:
 On 09/12/2012 10:17 PM, Andi Kleen wrote:
  On Wed, Sep 12, 2012 at 05:50:41PM +0300, Avi Kivity wrote:
  vmx.c has an lto-unfriendly bit, fix it up.
  
  While there, clean up our asm code.
  
  Avi Kivity (3):
KVM: VMX: Make lto-friendly
KVM: VMX: Make use of asm.h
KVM: SVM: Make use of asm.h
  
  Works for me in my LTO build, thanks Avi.
  I cannot guarantee I always hit the unit splitting case, but it looks
  good so far.
 
 Actually I think patch 1 is missing a .global vmx_return.

Ok can you add it please? It always depends how the LTO partitioner
decides to split the subunits.

I can run it with randomconfig in a loop overnight. That's the best way I know
to try to cover these cases.

-Andi


Re: [PATCH v3] kvm/fpu: Enable fully eager restore kvm FPU

2012-09-13 Thread Marcelo Tosatti
On Wed, Sep 12, 2012 at 04:10:24PM +0800, Xudong Hao wrote:
 Enable KVM FPU fully eager restore, if there is other FPU state which isn't
 tracked by CR0.TS bit.
 
 v3 changes from v2:
 - Make fpu active explicitly while guest xsave is enabling and non-lazy 
 xstate bit
 exist.

How about a guest_xcr0_can_lazy_saverestore bool to control this?
It only needs to be updated when guest xcr0 is updated.

That seems cleaner. Avi?

 v2 changes from v1:
 - Expand KVM_XSTATE_LAZY to 64 bits before negating it.
 
 Signed-off-by: Xudong Hao xudong@intel.com
 ---
  arch/x86/include/asm/kvm.h |4 
  arch/x86/kvm/vmx.c |2 ++
  arch/x86/kvm/x86.c |   15 ++-
  3 files changed, 20 insertions(+), 1 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
 index 521bf25..4c27056 100644
 --- a/arch/x86/include/asm/kvm.h
 +++ b/arch/x86/include/asm/kvm.h
 @@ -8,6 +8,8 @@
  
  #include <linux/types.h>
  #include <linux/ioctl.h>
 +#include <asm/user.h>
 +#include <asm/xsave.h>
  
  /* Select x86 specific features in linux/kvm.h */
  #define __KVM_HAVE_PIT
 @@ -30,6 +32,8 @@
  /* Architectural interrupt line count. */
  #define KVM_NR_INTERRUPTS 256
  
 +#define KVM_XSTATE_LAZY  (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
 +
  struct kvm_memory_alias {
   __u32 slot;  /* this has a different namespace than memory slots */
   __u32 flags;
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 248c2b4..853e875 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -3028,6 +3028,8 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned 
 long cr0)
  
   if (!vcpu->fpu_active)
   hw_cr0 |= X86_CR0_TS | X86_CR0_MP;
 + else
 + hw_cr0 &= ~(X86_CR0_TS | X86_CR0_MP);
  
   vmcs_writel(CR0_READ_SHADOW, cr0);
   vmcs_writel(GUEST_CR0, hw_cr0);
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 20f2266..183cf60 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -560,6 +560,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 
 xcr)
   return 1;
   if (xcr0 & ~host_xcr0)
   return 1;
 + if (xcr0 & ~((u64)KVM_XSTATE_LAZY))
 + vcpu->fpu_active = 1;
   vcpu->arch.xcr0 = xcr0;
   vcpu->guest_xcr0_loaded = 0;
   return 0;
 @@ -5969,7 +5971,18 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
   vcpu->guest_fpu_loaded = 0;
   fpu_save_init(&vcpu->arch.guest_fpu);
   ++vcpu->stat.fpu_reload;
 - kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
 + /*
 +  * Currently KVM trigger FPU restore by #NM (via CR0.TS),
 +  * till now only XCR0.bit0, XCR0.bit1, XCR0.bit2 is tracked
 +  * by TS bit, there might be other FPU state is not tracked
 +  * by TS bit. Here it only make FPU deactivate request and do 
 +  * FPU lazy restore for these cases: 1)xsave isn't enabled 
 +  * in guest, 2)all guest FPU states can be tracked by TS bit.
 +  * For others, doing fully FPU eager restore.
 +  */
 + if (!kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) ||
 +     !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY)))
 + kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
   trace_kvm_fpu(0);
  }
  
 -- 
 1.5.5
 


Re: [PATCH v3] kvm/fpu: Enable fully eager restore kvm FPU

2012-09-13 Thread Marcelo Tosatti
On Thu, Sep 13, 2012 at 01:26:36PM -0300, Marcelo Tosatti wrote:
 On Wed, Sep 12, 2012 at 04:10:24PM +0800, Xudong Hao wrote:
  Enable KVM FPU fully eager restore, if there is other FPU state which isn't
  tracked by CR0.TS bit.
  
  v3 changes from v2:
  - Make fpu active explicitly while guest xsave is enabling and non-lazy 
  xstate bit
  exist.
 
 How about a guest_xcr0_can_lazy_saverestore bool to control this?
 It only needs to be updated when guest xcr0 is updated.
 
 That seems cleaner. Avi?

Reasoning below.

  v2 changes from v1:
  - Expand KVM_XSTATE_LAZY to 64 bits before negating it.
  
  Signed-off-by: Xudong Hao xudong@intel.com
  ---
   arch/x86/include/asm/kvm.h |4 
   arch/x86/kvm/vmx.c |2 ++
   arch/x86/kvm/x86.c |   15 ++-
   3 files changed, 20 insertions(+), 1 deletions(-)
  
  diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
  index 521bf25..4c27056 100644
  --- a/arch/x86/include/asm/kvm.h
  +++ b/arch/x86/include/asm/kvm.h
  @@ -8,6 +8,8 @@
   
   #include <linux/types.h>
   #include <linux/ioctl.h>
  +#include <asm/user.h>
  +#include <asm/xsave.h>
   
   /* Select x86 specific features in linux/kvm.h */
   #define __KVM_HAVE_PIT
  @@ -30,6 +32,8 @@
   /* Architectural interrupt line count. */
   #define KVM_NR_INTERRUPTS 256
   
  +#define KVM_XSTATE_LAZY(XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
  +
   struct kvm_memory_alias {
  __u32 slot;  /* this has a different namespace than memory slots */
  __u32 flags;
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 248c2b4..853e875 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -3028,6 +3028,8 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, 
  unsigned long cr0)
   
  if (!vcpu->fpu_active)
  hw_cr0 |= X86_CR0_TS | X86_CR0_MP;
  +   else
  +   hw_cr0 &= ~(X86_CR0_TS | X86_CR0_MP);
   
  vmcs_writel(CR0_READ_SHADOW, cr0);
  vmcs_writel(GUEST_CR0, hw_cr0);
  diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
  index 20f2266..183cf60 100644
  --- a/arch/x86/kvm/x86.c
  +++ b/arch/x86/kvm/x86.c
  @@ -560,6 +560,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 
  xcr)
  return 1;
  if (xcr0 & ~host_xcr0)
  return 1;
  +   if (xcr0 & ~((u64)KVM_XSTATE_LAZY))
  +   vcpu->fpu_active = 1;

This is confusing. The variable allows to decrease the number of places
the decision is made.

  vcpu->arch.xcr0 = xcr0;
  vcpu->guest_xcr0_loaded = 0;
  return 0;
  @@ -5969,7 +5971,18 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
  vcpu->guest_fpu_loaded = 0;
  fpu_save_init(&vcpu->arch.guest_fpu);
  ++vcpu->stat.fpu_reload;
  -   kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
  +   /*
  +* Currently KVM trigger FPU restore by #NM (via CR0.TS),
  +* till now only XCR0.bit0, XCR0.bit1, XCR0.bit2 is tracked
  +* by TS bit, there might be other FPU state is not tracked
  +* by TS bit. Here it only make FPU deactivate request and do 
  +* FPU lazy restore for these cases: 1)xsave isn't enabled 
  +* in guest, 2)all guest FPU states can be tracked by TS bit.
  +* For others, doing fully FPU eager restore.
  +*/
  +   if (!kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) ||
  +       !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY)))
  +   kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
  trace_kvm_fpu(0);
   }
   
  -- 
  1.5.5
  


Re: [PATCH v3] kvm/fpu: Enable fully eager restore kvm FPU

2012-09-13 Thread Avi Kivity
On 09/12/2012 11:10 AM, Xudong Hao wrote:
 Enable KVM FPU fully eager restore, if there is other FPU state which isn't
 tracked by CR0.TS bit.
 
 v3 changes from v2:
 - Make fpu active explicitly while guest xsave is enabling and non-lazy 
 xstate bit
 exist.
 
 v2 changes from v1:
 - Expand KVM_XSTATE_LAZY to 64 bits before negating it.
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 248c2b4..853e875 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -3028,6 +3028,8 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned 
 long cr0)
  
   if (!vcpu->fpu_active)
   hw_cr0 |= X86_CR0_TS | X86_CR0_MP;
 + else
 + hw_cr0 &= ~(X86_CR0_TS | X86_CR0_MP);
  

Why?  The guest may wish to receive #NM faults.

   vmcs_writel(CR0_READ_SHADOW, cr0);
   vmcs_writel(GUEST_CR0, hw_cr0);
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 20f2266..183cf60 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -560,6 +560,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 
 xcr)
   return 1;
   if (xcr0 & ~host_xcr0)
   return 1;
 + if (xcr0 & ~((u64)KVM_XSTATE_LAZY))
 + vcpu->fpu_active = 1;
   vcpu->arch.xcr0 = xcr0;
   vcpu->guest_xcr0_loaded = 0;
   return 0;
 @@ -5969,7 +5971,18 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
   vcpu->guest_fpu_loaded = 0;
   fpu_save_init(&vcpu->arch.guest_fpu);
   ++vcpu->stat.fpu_reload;
 - kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
 + /*
 +  * Currently KVM trigger FPU restore by #NM (via CR0.TS),
 +  * till now only XCR0.bit0, XCR0.bit1, XCR0.bit2 is tracked

currently, till now, don't tell someone reading the code in six
months anything.  Just say how the code works.

 +  * by TS bit, there might be other FPU state is not tracked
 +  * by TS bit. Here it only make FPU deactivate request and do 
 +  * FPU lazy restore for these cases: 1)xsave isn't enabled 
 +  * in guest, 2)all guest FPU states can be tracked by TS bit.
 +  * For others, doing fully FPU eager restore.
 +  */
 + if (!kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) ||
 +     !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY)))
 + kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
   trace_kvm_fpu(0);
  }
  
 


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH v3] kvm/fpu: Enable fully eager restore kvm FPU

2012-09-13 Thread Avi Kivity
On 09/13/2012 07:29 PM, Marcelo Tosatti wrote:
 On Thu, Sep 13, 2012 at 01:26:36PM -0300, Marcelo Tosatti wrote:
 On Wed, Sep 12, 2012 at 04:10:24PM +0800, Xudong Hao wrote:
  Enable KVM FPU fully eager restore, if there is other FPU state which isn't
  tracked by CR0.TS bit.
  
  v3 changes from v2:
  - Make fpu active explicitly while guest xsave is enabling and non-lazy 
  xstate bit
  exist.
 
 How about a guest_xcr0_can_lazy_saverestore bool to control this?
 It only needs to be updated when guest xcr0 is updated.
 
 That seems cleaner. Avi?
 
 Reasoning below.
 
  v2 changes from v1:
  - Expand KVM_XSTATE_LAZY to 64 bits before negating it.
  
  Signed-off-by: Xudong Hao xudong@intel.com
  ---
   arch/x86/include/asm/kvm.h |4 
   arch/x86/kvm/vmx.c |2 ++
   arch/x86/kvm/x86.c |   15 ++-
   3 files changed, 20 insertions(+), 1 deletions(-)
  
  diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
  index 521bf25..4c27056 100644
  --- a/arch/x86/include/asm/kvm.h
  +++ b/arch/x86/include/asm/kvm.h
  @@ -8,6 +8,8 @@
   
   #include <linux/types.h>
   #include <linux/ioctl.h>
  +#include <asm/user.h>
  +#include <asm/xsave.h>
   
   /* Select x86 specific features in linux/kvm.h */
   #define __KVM_HAVE_PIT
  @@ -30,6 +32,8 @@
   /* Architectural interrupt line count. */
   #define KVM_NR_INTERRUPTS 256
   
  +#define KVM_XSTATE_LAZY   (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
  +
   struct kvm_memory_alias {
 __u32 slot;  /* this has a different namespace than memory slots */
 __u32 flags;
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 248c2b4..853e875 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -3028,6 +3028,8 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, 
  unsigned long cr0)
   
  if (!vcpu->fpu_active)
  hw_cr0 |= X86_CR0_TS | X86_CR0_MP;
  +  else
  +  hw_cr0 &= ~(X86_CR0_TS | X86_CR0_MP);
   
 vmcs_writel(CR0_READ_SHADOW, cr0);
 vmcs_writel(GUEST_CR0, hw_cr0);
  diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
  index 20f2266..183cf60 100644
  --- a/arch/x86/kvm/x86.c
  +++ b/arch/x86/kvm/x86.c
  @@ -560,6 +560,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, 
  u64 xcr)
 return 1;
  if (xcr0 & ~host_xcr0)
  return 1;
  +  if (xcr0 & ~((u64)KVM_XSTATE_LAZY))
  +  vcpu->fpu_active = 1;
 
 This is confusing. The variable allows to decrease the number of places
 the decision is made.

Better to have a helper function (lazy_fpu_allowed(), for example).
Variables raise the question of whether they are maintained correctly.




-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 0/3] KVM: PPC: Memslot handling improvements

2012-09-13 Thread Alexander Graf

On 09/12/2012 01:26 AM, Paul Mackerras wrote:

This series of 3 patches fixes up the memslot handling for Book3S HV
style KVM on powerpc, making slot deletion and modification work and
making sure we have the appropriate SRCU synchronization against
updates.

The series is against the next branch of the kvm tree.  These patches
have all been posted before, but I am reposting them now because
Marcelo's patches that are a prerequisite for the third patch
(2df72e9bc4, KVM: split kvm_arch_flush_shadow and 12d6e7538e, KVM:
perform an invalid memslot step for gpa base change) have now gone
into the kvm next branch.


Thanks, applied all to kvm-ppc-next.


Alex



Re: graphics card pci passthrough success report

2012-09-13 Thread Alex Williamson
On Thu, 2012-09-13 at 11:40 +0200, Lennert Buytenhek wrote:
 On Thu, Sep 13, 2012 at 07:55:00AM +0200, Gerd Hoffmann wrote:
 
  Hi,
 
 Hi,
 
 
   - Apply the patches at the end of this mail to kvm and SeaBIOS to
 allow for more BAR space under 4G.  (The relevant BARs on the
 graphics cards _are_ 64 bit BARs, but kvm seemed to turn those
 into 32 bit BARs in the guest.)
  
  Which qemu/seabios versions have you used?
  
  qemu-1.2 (+ bundled seabios) should handle that just fine without
  patching.  There is no fixed I/O window any more, all memory space above
  lowmem is available for pci, i.e. if you give 2G to your guest
   everything above 0x80000000.
  
   And if there isn't enough address space below 4G (if you assign lots of
   memory to your guest so qemu keeps only the 0xe0000000 - 0xffffffff
  window free) seabios should try to map 64bit bars above 4G.
 
 This was some time ago, on (L)ubuntu 12.04, which has qemu-kvm 1.0
 and seabios 0.6.2.  We'll retry on a newer distro soon.
 
 
   - Apply the hacky patch at the end of this mail to SeaBIOS to
 always skip initialising the Radeon's option ROMs, or the VM
 would hang inside the Radeon option ROM if you boot the VM
 without the default cirrus video.
  
   A better way to handle that would probably be to add a pci passthrough
  config option to not expose the rom to the guest.
  
  Any clue *why* the rom doesn't run?
 
 No idea, we didn't look into that -- this was just a one afternoon
 hacking session.

Thanks for the report.  Spawned by your success, I tested a Radeon HD
5450 using VFIO based device assignment.  I can get it to work on
Windows XP, with no changes (from the version I'll post soon), but Win7
dies (still need to play around more with your suggestions of cpu type).
For skipping the option rom, is it sufficient to not expose it
(rombar=0) or does the guest OS driver need it as well?  Thanks,

Alex



Re: graphics card pci passthrough success report

2012-09-13 Thread Lennert Buytenhek
On Thu, Sep 13, 2012 at 11:05:07AM -0600, Alex Williamson wrote:

- Apply the hacky patch at the end of this mail to SeaBIOS to
  always skip initialising the Radeon's option ROMs, or the VM
  would hang inside the Radeon option ROM if you boot the VM
  without the default cirrus video.
   
    A better way to handle that would probably be to add a pci passthrough
   config option to not expose the rom to the guest.
   
   Any clue *why* the rom doesn't run?
  
  No idea, we didn't look into that -- this was just a one afternoon
  hacking session.
 
 Thanks for the report.  Spawned by your success, I tested a Radeon HD
 5450 using VFIO based device assignment.  I can get it to work on
 Windows XP, with no changes (from the version I'll post soon),

Yay!


 but Win7 dies (still need to play around more with your suggestions
 of cpu type).

ACK.  That's a nasty one, don't ask how we found that out...

Note that the bluescreen I described when the cpu type is wrong only
actually happened for us if the AMD drivers were installed in the
VM -- maybe you can try without AMD drivers to see if that makes it
go away.  If you still have a bluescreen without the AMD drivers
installed, it's probably a different issue.


 For skipping the option rom, is it sufficient to not expose it
 (rombar=0) or does the guest OS driver need it as well?

I don't actually know.  Something to try out when we get round to
testing this again, I suppose..


cheers,
Lennert


Re: qemu-kvm loops after kernel update

2012-09-13 Thread Jiri Slaby
On 09/13/2012 11:59 AM, Avi Kivity wrote:
 On 09/12/2012 09:11 PM, Jiri Slaby wrote:
 On 09/12/2012 10:18 AM, Avi Kivity wrote:
 On 09/12/2012 11:13 AM, Jiri Slaby wrote:

  Please provide the output of vmxcap
 (http://goo.gl/c5lUO),

   Unrestricted guest   no

 The big real mode fixes.



 and a snapshot of kvm_stat while the guest is hung.

 kvm statistics

  exits  6778198  615942
  host_state_reload 1988 187
  irq_exits 1523 138
  mmu_cache_miss   4   0
  fpu_reload   1   0

 Please run this as root so we get the tracepoint based output; and press
 'x' when it's running so we get more detailed output.

 kvm statistics

  kvm_exit  13798699  330708
  kvm_entry 13799110  330708
  kvm_page_fault13793650  330604
  kvm_exit(EXCEPTION_NMI)6188458  330604
  kvm_exit(EXTERNAL_INTERRUPT)  2169 105
  kvm_exit(TPR_BELOW_THRESHOLD)   82   0
  kvm_exit(IO_INSTRUCTION) 6   0
 
 Strange, it's unable to fault in the very first page.
 
 Please provide a trace as per http://www.linux-kvm.org/page/Tracing (but
 append -e kvmmmu to the command line).

Attached. Does it make sense? It wrote things like:
  failed to read event print fmt for kvm_mmu_unsync_page
to the stderr.

thanks,
-- 
js
suse labs
version = 6
CPU 0 is empty
cpus=2
qemu-kvm-6170  [001]   457.811896: kvm_mmu_get_page: [FAILED TO PARSE] gfn=0 role=122882 root_count=0 unsync=0 created=1
qemu-kvm-6170  [001]   457.811899: kvm_mmu_get_page: [FAILED TO PARSE] gfn=262144 role=122882 root_count=0 unsync=0 created=1
qemu-kvm-6170  [001]   457.811900: kvm_mmu_get_page: [FAILED TO PARSE] gfn=524288 role=122882 root_count=0 unsync=0 created=1
qemu-kvm-6170  [001]   457.811902: kvm_mmu_get_page: [FAILED TO PARSE] gfn=786432 role=122882 root_count=0 unsync=0 created=1
qemu-kvm-6171  [001]   462.416705: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=786432 role=122882 root_count=1 unsync=0
qemu-kvm-6171  [001]   462.416712: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=524288 role=122882 root_count=1 unsync=0
qemu-kvm-6171  [001]   462.416715: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=262144 role=122882 root_count=1 unsync=0
qemu-kvm-6171  [001]   462.416717: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=0 role=122882 root_count=1 unsync=0
qemu-kvm-6171  [001]   462.485197: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=0 role=253954 root_count=0 unsync=0
qemu-kvm-6171  [001]   462.485202: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=262144 role=253954 root_count=0 unsync=0
qemu-kvm-6171  [001]   462.485205: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=524288 role=253954 root_count=0 unsync=0
qemu-kvm-6171  [001]   462.485209: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=786432 role=253954 root_count=0 unsync=0


Re: [PATCH v2 3/4] target-i386: Allow changing of Hypervisor CPUIDs.

2012-09-13 Thread Don Slutz

On 09/12/12 13:55, Marcelo Tosatti wrote:

The problem with integrating this is that it has little or
no assurance from documentation. The Linux kernel source is a good
source, then say accordingly to VMWare guest support code in version xyz
in the changelog.
I will work on getting a list of the documentation and sources used to 
generate this.


Also extracting this information in a text file (or comment in the code)
would be better than just adding code.
I am not sure what information you are talking about here.  Are you 
asking about the known Hypervisor CPUIDs, or what a lot of Linux 
versions look at to determine the Hypervisor they are on, or something else?


On Tue, Sep 11, 2012 at 10:07:46AM -0400, Don Slutz wrote:

This is primarily done so that the guest will think it is running
under vmware when hypervisor-vendor=vmware is specified as a
property of a cpu.

Signed-off-by: Don Slutz d...@cloudswitch.com
---
  target-i386/cpu.c |  214 +
  target-i386/cpu.h |   21 +
  target-i386/kvm.c |   33 +++--
  3 files changed, 262 insertions(+), 6 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 5f9866a..9f1f390 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1135,6 +1135,36 @@ static void x86_cpuid_set_model_id(Object *obj, const char *model_id,
  }
  }
  
+static void x86_cpuid_set_vmware_extra(Object *obj)

+{
+X86CPU *cpu = X86_CPU(obj);
+
+if ((cpu->env.tsc_khz != 0) &&
+(cpu->env.cpuid_hv_level == CPUID_HV_LEVEL_VMARE_4) &&
+(cpu->env.cpuid_hv_vendor1 == CPUID_HV_VENDOR_VMWARE_1) &&
+(cpu->env.cpuid_hv_vendor2 == CPUID_HV_VENDOR_VMWARE_2) &&
+(cpu->env.cpuid_hv_vendor3 == CPUID_HV_VENDOR_VMWARE_3)) {
+const uint32_t apic_khz = 1000000L;
+
+/*
+ * From article.gmane.org/gmane.comp.emulators.kvm.devel/22643
+ *
+ *Leaf 0x40000010, Timing Information.
+ *
+ *VMware has defined the first generic leaf to provide timing
+ *information.  This leaf returns the current TSC frequency and
+ *current Bus frequency in kHz.
+ *
+ *# EAX: (Virtual) TSC frequency in kHz.
+ *# EBX: (Virtual) Bus (local apic timer) frequency in kHz.
+ *# ECX, EDX: RESERVED (Per above, reserved fields are set to zero).
+ */
+cpu->env.cpuid_hv_extra = 0x40000010;
+cpu->env.cpuid_hv_extra_a = (uint32_t)cpu->env.tsc_khz;
+cpu->env.cpuid_hv_extra_b = apic_khz;
+}
+}

What happens in case you migrate the vmware guest to a host
with different frequency? How is that transmitted to the
vmware-guest-running-on-kvm ? Or is migration not supported?
As far as I know, it would be the same as for a non-vmware guest. 
http://lists.nongnu.org/archive/html/qemu-devel/2011-07/msg01656.html is 
related to this.


I did not look to see if this has been done since then.

All this change does is allow the guest to read the tsc frequency 
instead of trying to calculate it.


I will look into the current state of migration when tsc_freq=X is 
specified.  On the machine I have been doing most of the testing on (Intel 
Xeon E3-1260L), when I add tsc_freq=2.0G or tsc_freq=2.4G the guest does 
not see any difference with accel=kvm.



+static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque,
+const char *name, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+uint32_t value;
+
+visit_type_uint32(v, &value, name, errp);
+if (error_is_set(errp)) {
+return;
+}
+
+if ((value != 0) && (value < 0x40000000)) {
+value += 0x40000000;
+}
+cpu->env.cpuid_hv_level = value;
+}
+
+static char *x86_cpuid_get_hv_vendor(Object *obj, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+CPUX86State *env = &cpu->env;
+char *value;
+int i;
+
+value = (char *)g_malloc(CPUID_VENDOR_SZ + 1);
+for (i = 0; i < 4; i++) {
+value[i + 0] = env->cpuid_hv_vendor1 >> (8 * i);
+value[i + 4] = env->cpuid_hv_vendor2 >> (8 * i);
+value[i + 8] = env->cpuid_hv_vendor3 >> (8 * i);
+}
+value[CPUID_VENDOR_SZ] = '\0';
+
+/* Convert known names */
+if (!strcmp(value, CPUID_HV_VENDOR_VMWARE)) {
+if (env->cpuid_hv_level == CPUID_HV_LEVEL_VMARE_4) {
+pstrcpy(value, sizeof(value), "vmware4");
+} else if (env->cpuid_hv_level == CPUID_HV_LEVEL_VMARE_3) {
+pstrcpy(value, sizeof(value), "vmware3");
+}
+} else if (!strcmp(value, CPUID_HV_VENDOR_XEN) &&
+   env->cpuid_hv_level == CPUID_HV_LEVEL_XEN) {
+pstrcpy(value, sizeof(value), "xen");
+} else if (!strcmp(value, CPUID_HV_VENDOR_KVM) &&
+   env->cpuid_hv_level == 0) {
+pstrcpy(value, sizeof(value), "kvm");
+}
+return value;
+}
+
+static void x86_cpuid_set_hv_vendor(Object *obj, const char *value,
+ Error **errp)
+{
+X86CPU 

Re: Multi-dimensional Paging in Nested virtualization

2012-09-13 Thread siddhesh phadke
Thanks a lot Nadav. This was really helpful.

Siddhesh

On Thu, Sep 13, 2012 at 3:49 AM, Nadav Har'El n...@math.technion.ac.il wrote:
 On Tue, Sep 11, 2012, siddhesh phadke wrote about Multi-dimensional Paging 
 in Nested virtualization:
 I read turtles project paper where they have explained  how
 multi-dimensional page tables are built on L0. L2 is launched with
 empty EPT 0-2 and EPT 0-2 is built on-the-fly.
 I tried to find out how this is done in kvm code but i could not find
 where EPT 0-2 is built.

 Nested EPT is not yet included in the mainline KVM. The original nested EPT
 code that we had written as part of the Turtles paper became obsolete when
 much of KVM's MMU code was rewritten.

 I have since rewritten the nested EPT code for the modern KVM. I sent
 the second (latest) version of these patches to the KVM mailing list in
 August, and you can find them in, for example,
 http://comments.gmane.org/gmane.comp.emulators.kvm.devel/95395

 These patches were not yet accepted into KVM. They have bugs in various
 setups (which I have not yet found the time to fix, unfortunately),
 and some known issues found by Avi Kivity on this mailing list.

 Does L1 handle ept violation first and then L0 updates its EPT0-2?
 How this is done?

 This is explained in the turtles paper, but here's the short story:

 L1 defines an EPT table for L2 which we call EPT12. L0 builds from this
 an EPT02, with L1 addresses changed to L0. Now, when L2 runs and we get
 an EPT violation, we exit to L0 (in nested vmx, any exit first gets to
 L0). L0 checks if the translation is missing in EPT12, and if it
 is, it emulates an exit into L1 - injecting the EPT violation into
 L1. But if the translation wasn't missing in EPT12, then it's L0's
 problem, and we just need to update EPT02.

 Can anybody give me some pointers about where to look into the code?

 Please look at the patches above. Each patch is also documented.

 Nadav.

 --
 Nadav Har'El|  Thursday, Sep 13 2012, 26 Elul 5772
 n...@math.technion.ac.il 
 |-
 Phone +972-523-790466, ICQ 13349191 |error compiling committee.c: too many
 http://nadav.harel.org.il   |arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] Revert "mm: have order > 0 compaction start near a pageblock with free pages"

2012-09-13 Thread Rik van Riel
On Wed, 12 Sep 2012 17:46:15 +0100
Richard Davies rich...@arachsys.com wrote:
 Mel Gorman wrote:
  I see that this is an old-ish bug but I did not read the full history.
  Is it now booting faster than 3.5.0 was? I'm asking because I'm
  interested to see if commit c67fe375 helped your particular case.
 
 Yes, I think 3.6.0-rc5 is already better than 3.5.x but can still be
 improved, as discussed.

Re-reading Mel's commit de74f1cc3b1e9730d9b58580cd11361d30cd182d,
I believe it re-introduces the quadratic behaviour that the code
was suffering from before, by not moving zone->compact_cached_free_pfn
down when no more free pfns are found in a page block.

This mail reverts that changeset, the next introduces what I hope to
be the proper fix.  Richard, would you be willing to give these patches
a try, since your system seems to reproduce this bug easily?

---8<---

Revert "mm: have order > 0 compaction start near a pageblock with free pages"

This reverts commit de74f1cc3b1e9730d9b58580cd11361d30cd182d.

Mel found a real issue with my skip ahead logic in the
compaction code, but unfortunately his approach appears to
have re-introduced quadratic behaviour in that the value
of zone-compact_cached_free_pfn is never advanced until
the compaction run wraps around the start of the zone.

This merely moved the starting point for the quadratic behaviour
further into the zone, but the behaviour has still been observed.

It looks like another fix is required.

Signed-off-by: Rik van Riel r...@redhat.com
Reported-by: Richard Davies rich...@daviesmail.org

diff --git a/mm/compaction.c b/mm/compaction.c
index 7fcd3a5..771775d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -431,20 +431,6 @@ static bool suitable_migration_target(struct page *page)
 }
 
 /*
- * Returns the start pfn of the last page block in a zone.  This is the starting
- * point for full compaction of a zone.  Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
-   unsigned long free_pfn;
-   free_pfn = zone->zone_start_pfn + zone->spanned_pages;
-   free_pfn &= ~(pageblock_nr_pages-1);
-   return free_pfn;
-}
-
-/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -483,6 +469,17 @@ static void isolate_freepages(struct zone *zone,
pfn -= pageblock_nr_pages) {
unsigned long isolated;
 
+   /*
+* Skip ahead if another thread is compacting in the area
+* simultaneously. If we wrapped around, we can only skip
+* ahead if zone->compact_cached_free_pfn also wrapped to
+* above our starting point.
+*/
+   if (cc->order > 0 && (!cc->wrapped ||
+ zone->compact_cached_free_pfn >
+ cc->start_free_pfn))
+   pfn = min(pfn, zone->compact_cached_free_pfn);
+
if (!pfn_valid(pfn))
continue;
 
@@ -533,15 +530,7 @@ static void isolate_freepages(struct zone *zone,
 */
if (isolated) {
high_pfn = max(high_pfn, pfn);
-
-   /*
-* If the free scanner has wrapped, update
-* compact_cached_free_pfn to point to the highest
-* pageblock with free pages. This reduces excessive
-* scanning of full pageblocks near the end of the
-* zone
-*/
-   if (cc->order > 0 && cc->wrapped)
+   if (cc->order > 0)
zone->compact_cached_free_pfn = high_pfn;
}
}
@@ -551,11 +540,6 @@ static void isolate_freepages(struct zone *zone,
 
cc->free_pfn = high_pfn;
cc->nr_freepages = nr_freepages;
-
-   /* If compact_cached_free_pfn is reset then set it now */
-   if (cc->order > 0 && !cc->wrapped &&
-   zone->compact_cached_free_pfn == start_free_pfn(zone))
-   zone->compact_cached_free_pfn = high_pfn;
 }
 
 /*
@@ -642,6 +626,20 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
 }
 
+/*
+ * Returns the start pfn of the last page block in a zone.  This is the starting
+ * point for full compaction of a zone.  Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+   unsigned long free_pfn;
+   free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+   free_pfn &= ~(pageblock_nr_pages-1);
+   return free_pfn;
+}
+
 

[PATCH 2/2] make the compaction skip ahead logic robust

2012-09-13 Thread Rik van Riel
Make the skip ahead logic in compaction resistant to compaction
wrapping around to the end of the zone.  This can lead to less
efficient compaction when one thread has wrapped around to the
end of the zone, and another simultaneous compactor has not done
so yet. However, it should ensure that we do not suffer quadratic
behaviour any more.

Signed-off-by: Rik van Riel r...@redhat.com
Reported-by: Richard Davies rich...@daviesmail.org

diff --git a/mm/compaction.c b/mm/compaction.c
index 771775d..0656759 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -431,6 +431,24 @@ static bool suitable_migration_target(struct page *page)
 }
 
 /*
+ * We scan the zone in a circular fashion, starting at
+ * zone->compact_cached_free_pfn. Be careful not to skip if
+ * one compacting thread has just wrapped back to the end of the
+ * zone, but another thread has not.
+ */
+static bool compaction_may_skip(struct zone *zone,
+   struct compact_control *cc)
+{
+   if (!cc->wrapped && zone->compact_free_pfn < cc->start_pfn)
+   return true;
+
+   if (cc->wrapped && zone_compact_free_pfn > cc->start_pfn)
+   return true;
+   return true;
+
+   return false;
+}
+
+/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -471,13 +489,9 @@ static void isolate_freepages(struct zone *zone,
 
/*
 * Skip ahead if another thread is compacting in the area
-* simultaneously. If we wrapped around, we can only skip
-* ahead if zone->compact_cached_free_pfn also wrapped to
-* above our starting point.
+* simultaneously, and has finished with this page block.
 */
-   if (cc->order > 0 && (!cc->wrapped ||
- zone->compact_cached_free_pfn >
- cc->start_free_pfn))
+   if (cc->order > 0 && compaction_may_skip(zone, cc))
pfn = min(pfn, zone-compact_cached_free_pfn);
 
if (!pfn_valid(pfn))



[PATCH -v2 2/2] make the compaction skip ahead logic robust

2012-09-13 Thread Rik van Riel
Argh. And of course I send out the version from _before_ the compile test,
instead of the one after! I am not used to caffeine any more and have had
way too much tea...

---8<---

Make the skip ahead logic in compaction resistant to compaction
wrapping around to the end of the zone.  This can lead to less
efficient compaction when one thread has wrapped around to the
end of the zone, and another simultaneous compactor has not done
so yet. However, it should ensure that we do not suffer quadratic
behaviour any more.

Signed-off-by: Rik van Riel r...@redhat.com
Reported-by: Richard Davies rich...@daviesmail.org

diff --git a/mm/compaction.c b/mm/compaction.c
index 771775d..0656759 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -431,6 +431,24 @@ static bool suitable_migration_target(struct page *page)
 }
 
 /*
+ * We scan the zone in a circular fashion, starting at
+ * zone->compact_cached_free_pfn. Be careful not to skip if
+ * one compacting thread has just wrapped back to the end of the
+ * zone, but another thread has not.
+ */
+static bool compaction_may_skip(struct zone *zone,
+   struct compact_control *cc)
+{
+   if (!cc->wrapped && zone->compact_cached_free_pfn < cc->start_free_pfn)
+   return true;
+
+   if (cc->wrapped && zone->compact_cached_free_pfn > cc->start_free_pfn)
+   return true;
+   return true;
+
+   return false;
+}
+
+/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -471,13 +489,9 @@ static void isolate_freepages(struct zone *zone,
 
/*
 * Skip ahead if another thread is compacting in the area
-* simultaneously. If we wrapped around, we can only skip
-* ahead if zone->compact_cached_free_pfn also wrapped to
-* above our starting point.
+* simultaneously, and has finished with this page block.
 */
-   if (cc->order > 0 && (!cc->wrapped ||
- zone->compact_cached_free_pfn >
- cc->start_free_pfn))
+   if (cc->order > 0 && compaction_may_skip(zone, cc))
pfn = min(pfn, zone-compact_cached_free_pfn);
 
if (!pfn_valid(pfn))



[PATCH v4 0/4] VFIO-based PCI device assignment

2012-09-13 Thread Alex Williamson
Here's an updated version of the VFIO PCI device assignment series.
Now that we're targeting QEMU 1.3, I've opened up support so that
vfio-pci is added to all softmmu targets supporting PCI on Linux
hosts.  Only some printf format changes were required to make this
build.

I also added a workaround for INTx support.  Ideally we'd like to know
when an EOI is written to the interrupt controller to know when to
de-assert and unmask an interrupt, but as a substitute we can consider
a BAR access to be a response to an interrupt and do the de-assert and
unmask then.  The device will re-assert the interrupt until it's been
handled.  The benefit is that the solution is generic, the draw-back
is that we can't make use of the mmap'd memory region in this mode.
The memory API conveniently has a way to toggle enabling the mmap'd
region that fits nicely with this usage.

I've also added an x-intx=off option to disable INTx support for a
device, which can be useful for devices that don't make use of any
interrupts and for which the overhead of trapping BAR access is too
high (graphics cards, including a Radeon HD 5450 which I was able to
get working under WinXP with this version).  This option should be
considered experimental, thus the x- prefix.  Future EOI acceleration
should make this option unnecessary where KVM is available.

I was also successful in passing through both a tg3 and e1000e NIC
from an x86 host to powerpc guest (g3beige) using this series.  This
guest machine doesn't appear to support MSI, so the INTx mechanism
above is necessary to trigger an EOI.

In addition to the series here, the code is available at:

git://github.com/awilliam/qemu-vfio.git branch vfio-for-qemu

as well as in the signed tag vfio-pci-for-qemu-v4.

Thanks,

Alex

---

Alex Williamson (4):
  vfio: Enable vfio-pci and mark supported
  vfio: vfio-pci device assignment driver
  Update Linux kernel headers
  Update kernel header script to include vfio


 MAINTAINERS |5 
 configure   |6 
 hw/Makefile.objs|3 
 hw/vfio_pci.c   | 1860 +++
 hw/vfio_pci_int.h   |  114 ++
 linux-headers/linux/vfio.h  |  368 
 scripts/update-linux-headers.sh |2 
 7 files changed, 2356 insertions(+), 2 deletions(-)
 create mode 100644 hw/vfio_pci.c
 create mode 100644 hw/vfio_pci_int.h
 create mode 100644 linux-headers/linux/vfio.h


[PATCH v4 1/4] Update kernel header script to include vfio

2012-09-13 Thread Alex Williamson
Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 scripts/update-linux-headers.sh |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index a639c5b..605102f 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -43,7 +43,7 @@ done
 
 rm -rf $output/linux-headers/linux
 mkdir -p $output/linux-headers/linux
-for header in kvm.h kvm_para.h vhost.h virtio_config.h virtio_ring.h; do
+for header in kvm.h kvm_para.h vfio.h vhost.h virtio_config.h virtio_ring.h; do
 cp $tmpdir/include/linux/$header $output/linux-headers/linux
 done
 rm -rf $output/linux-headers/asm-generic



[PATCH v4 2/4] Update Linux kernel headers

2012-09-13 Thread Alex Williamson
Based on Linux as of 1a95620.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 linux-headers/linux/vfio.h |  368 
 1 file changed, 368 insertions(+)
 create mode 100644 linux-headers/linux/vfio.h

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
new file mode 100644
index 000..f787b72
--- /dev/null
+++ b/linux-headers/linux/vfio.h
@@ -0,0 +1,368 @@
+/*
+ * VFIO API definition
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ * Author: Alex Williamson alex.william...@redhat.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef VFIO_H
+#define VFIO_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VFIO_API_VERSION   0
+
+
+/* Kernel & User level defines for VFIO IOCTLs. */
+
+/* Extensions */
+
+#define VFIO_TYPE1_IOMMU   1
+
+/*
+ * The IOCTL interface is designed for extensibility by embedding the
+ * structure length (argsz) and flags into structures passed between
+ * kernel and userspace.  We therefore use the _IO() macro for these
+ * defines to avoid implicitly embedding a size into the ioctl request.
+ * As structure fields are added, argsz will increase to match and flag
+ * bits will be defined to indicate additional fields with valid data.
+ * It's *always* the caller's responsibility to indicate the size of
+ * the structure passed by setting argsz appropriately.
+ */
+
+#define VFIO_TYPE  (';')
+#define VFIO_BASE  100
+
+/*  IOCTLs for VFIO file descriptor (/dev/vfio/vfio)  */
+
+/**
+ * VFIO_GET_API_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 0)
+ *
+ * Report the version of the VFIO API.  This allows us to bump the entire
+ * API version should we later need to add or change features in incompatible
+ * ways.
+ * Return: VFIO_API_VERSION
+ * Availability: Always
+ */
+#define VFIO_GET_API_VERSION   _IO(VFIO_TYPE, VFIO_BASE + 0)
+
+/**
+ * VFIO_CHECK_EXTENSION - _IOW(VFIO_TYPE, VFIO_BASE + 1, __u32)
+ *
+ * Check whether an extension is supported.
+ * Return: 0 if not supported, 1 (or some other positive integer) if supported.
+ * Availability: Always
+ */
+#define VFIO_CHECK_EXTENSION   _IO(VFIO_TYPE, VFIO_BASE + 1)
+
+/**
+ * VFIO_SET_IOMMU - _IOW(VFIO_TYPE, VFIO_BASE + 2, __s32)
+ *
+ * Set the iommu to the given type.  The type must be supported by an
+ * iommu driver as verified by calling CHECK_EXTENSION using the same
+ * type.  A group must be set to this file descriptor before this
+ * ioctl is available.  The IOMMU interfaces enabled by this call are
+ * specific to the value set.
+ * Return: 0 on success, -errno on failure
+ * Availability: When VFIO group attached
+ */
+#define VFIO_SET_IOMMU _IO(VFIO_TYPE, VFIO_BASE + 2)
+
+/*  IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP)  */
+
+/**
+ * VFIO_GROUP_GET_STATUS - _IOR(VFIO_TYPE, VFIO_BASE + 3,
+ * struct vfio_group_status)
+ *
+ * Retrieve information about the group.  Fills in provided
+ * struct vfio_group_info.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+struct vfio_group_status {
+   __u32   argsz;
+   __u32   flags;
+#define VFIO_GROUP_FLAGS_VIABLE	(1 << 0)
+#define VFIO_GROUP_FLAGS_CONTAINER_SET	(1 << 1)
+};
+#define VFIO_GROUP_GET_STATUS  _IO(VFIO_TYPE, VFIO_BASE + 3)
+
+/**
+ * VFIO_GROUP_SET_CONTAINER - _IOW(VFIO_TYPE, VFIO_BASE + 4, __s32)
+ *
+ * Set the container for the VFIO group to the open VFIO file
+ * descriptor provided.  Groups may only belong to a single
+ * container.  Containers may, at their discretion, support multiple
+ * groups.  Only when a container is set are all of the interfaces
+ * of the VFIO file descriptor and the VFIO group file descriptor
+ * available to the user.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+#define VFIO_GROUP_SET_CONTAINER   _IO(VFIO_TYPE, VFIO_BASE + 4)
+
+/**
+ * VFIO_GROUP_UNSET_CONTAINER - _IO(VFIO_TYPE, VFIO_BASE + 5)
+ *
+ * Remove the group from the attached container.  This is the
+ * opposite of the SET_CONTAINER call and returns the group to
+ * an initial state.  All device file descriptors must be released
+ * prior to calling this interface.  When removing the last group
+ * from a container, the IOMMU will be disabled and all state lost,
+ * effectively also returning the VFIO file descriptor to an initial
+ * state.
+ * Return: 0 on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_UNSET_CONTAINER _IO(VFIO_TYPE, VFIO_BASE + 5)
+
+/**
+ * VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 6, char)
+ *
+ * Return a new file descriptor for the device object described by
+ * the provided string.  The 

[PATCH v4 4/4] vfio: Enable vfio-pci and mark supported

2012-09-13 Thread Alex Williamson
Enabled for all softmmu guests supporting PCI on Linux hosts.  Note
that currently only x86 hosts have the kernel side VFIO IOMMU support
for this.  PPC (g3beige) is the only non-x86 guest known to work.
ARM (versatile) hangs in firmware, others untested.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 MAINTAINERS  |5 +
 configure|6 ++
 hw/Makefile.objs |3 ++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 61f8b45..fd3eca0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -474,6 +474,11 @@ M: Gerd Hoffmann kra...@redhat.com
 S: Maintained
 F: hw/usb*
 
+VFIO
+M: Alex Williamson alex.william...@redhat.com
+S: Supported
+F: hw/vfio*
+
 vhost
 M: Michael S. Tsirkin m...@redhat.com
 S: Supported
diff --git a/configure b/configure
index 30be784..b56e61f 100755
--- a/configure
+++ b/configure
@@ -167,6 +167,7 @@ attr=
 libattr=
 xfs=
 
+vfio_pci=no
 vhost_net=no
 kvm=no
 gprof=no
@@ -528,6 +529,7 @@ Haiku)
   usb=linux
   kvm=yes
   vhost_net=yes
+  vfio_pci=yes
   if [ $cpu = i386 -o $cpu = x86_64 ] ; then
 audio_possible_drivers=$audio_possible_drivers fmod
   fi
@@ -3180,6 +3182,7 @@ echo "libiscsi support  $libiscsi"
 echo "build guest agent $guest_agent"
 echo "seccomp support   $seccomp"
 echo "coroutine backend $coroutine_backend"
+echo "VFIO PCI support  $vfio_pci"
 
 if test $sdl_too_old = yes; then
 echo - Your SDL version is too old - please upgrade to have SDL support
@@ -3921,6 +3924,9 @@ if test $target_softmmu = yes ; then
   if test $smartcard_nss = yes ; then
 echo "subdir-$target: subdir-libcacard" >> $config_host_mak
   fi
+  if test $vfio_pci = yes ; then
+echo "CONFIG_VFIO_PCI=y" >> $config_target_mak
+  fi
   case $target_arch2 in
 i386|x86_64)
   echo "CONFIG_HAVE_CORE_DUMP=y" >> $config_target_mak
diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index 6dfebd2..7f8d3e4 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -198,7 +198,8 @@ obj-$(CONFIG_VGA) += vga.o
 obj-$(CONFIG_SOFTMMU) += device-hotplug.o
 obj-$(CONFIG_XEN) += xen_domainbuild.o xen_machine_pv.o
 
-# Inter-VM PCI shared memory
+# Inter-VM PCI shared memory  VFIO PCI device assignment
 ifeq ($(CONFIG_PCI), y)
 obj-$(CONFIG_KVM) += ivshmem.o
+obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
 endif



Re: [PATCH 0/5] vhost-scsi: Add support for host virtualized target

2012-09-13 Thread Nicholas A. Bellinger
On Tue, 2012-09-11 at 12:36 +0800, Asias He wrote:
 Hello Nicholas,
 

Hello Asias!

 On 09/07/2012 02:48 PM, Nicholas A. Bellinger wrote:
  From: Nicholas Bellinger n...@linux-iscsi.org
  
  Hello Anthony  Co,
  
  This is the fourth installment to add host virtualized target support for
  the mainline tcm_vhost fabric driver using Linux v3.6-rc into QEMU 1.3.0-rc.
  
  The series is available directly from the following git branch:
  
 git://git.kernel.org/pub/scm/virt/kvm/nab/qemu-kvm.git vhost-scsi-for-1.3
  
Note the code is cut against yesterday's QEMU head, and despite the name
  of the tree is based upon mainline qemu.org git code + has thus far been
running overnight with > 100K IOPs small block 4k workloads using v3.6-rc2+
  based target code with RAMDISK_DR backstores.
 
 Are you still seeing the performance degradation discussed in the thread
 
  vhost-scsi port to v1.1.0 + MSI-X performance regression
 

So the performance regression reported here with QEMU v1.2-rc +
virtio-scsi ended up being related to virtio interrupts being delivered
across multiple CPUs.

After explicitly setting the IRQ affinity of the virtio0-request MSI-X
vector to a specific CPU, the small block (4k) mixed random I/O
performance jumped back up to the expected ~100K IOPs for a single LUN.

FYI, I just tried this again with the most recent QEMU v1.2.50 (v1.3-rc)
code, and both cases appear to be performing as expected once again
regardless of the explicit IRQ affinity setting.

--nab



Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-13 Thread Andrew Theurer
On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
 * Andrew Theurer haban...@linux.vnet.ibm.com [2012-09-11 13:27:41]:
 
  On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
   On 09/11/2012 01:42 AM, Andrew Theurer wrote:
On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
+{
+ if (!curr->sched_class->yield_to_task)
+ return false;
+
+ if (curr->sched_class != p->sched_class)
+ return false;
   
   
Peter,
   
Should we also add a check if the runq has a skip buddy (as pointed 
out
by Raghu) and return if the skip buddy is already set.
   
Oh right, I missed that suggestion.. the performance improvement went
from 81% to 139% using this, right?
   
It might make more sense to keep that separate, outside of this
function, since its not a strict prerequisite.
   
   
+ if (task_running(p_rq, p) || p->state)
+ return false;
+
+ return true;
+}
   
   
@@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
   rq = this_rq();
   
  again:
+ /* optimistic test to avoid taking locks */
+ if (!__yield_to_candidate(curr, p))
+ goto out_irq;
+
   
So add something like:
   
    /* Optimistic, if we 'raced' with another yield_to(), don't bother */
    if (p_rq->cfs_rq->skip)
   goto out_irq;
   
   
   p_rq = task_rq(p);
   double_rq_lock(rq, p_rq);
   
   
But I do have a question on this optimization though,.. Why do we check
p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
   
That is, I'd like to see this thing explained a little better.
   
Does it go something like: p_rq is the runqueue of the task we'd like 
to
yield to, rq is our own, they might be the same. If we have a ->skip,
there's nothing we can do about it, OTOH p_rq having a ->skip and
failing the yield_to() simply means us picking the next VCPU thread,
which might be running on an entirely different cpu (rq) and could
succeed?
   
Here's two new versions, both include a __yield_to_candidate(): v3
uses the check for p_rq-curr in guest mode, and v4 uses the cfs_rq
skip check.  Raghu, I am not sure if this is exactly what you want
implemented in v4.
   
   
   Andrew, Yes that is what I had. I think there was a mis-understanding. 
   My intention was that if a directed yield already happened in a runqueue 
   (say rqA), we should not bother to directed-yield to it again. But unfortunately, as 
   PeterZ pointed out, that would have resulted in setting the next buddy of a 
   different run queue than rqA.
   So we can drop this skip idea. Pondering more over what to do? can we 
   use next buddy itself ... thinking..
  
  As I mentioned earlier today, I did not have your changes from kvm.git
  tree when I tested my changes.  Here are your changes and my changes
  compared:
  
throughput in MB/sec
  
  kvm_vcpu_on_spin changes:  4636 +/- 15.74%
  yield_to changes:  4515 +/- 12.73%
  
  I would be inclined to stick with your changes which are kept in kvm
  code.  I did try both combined, and did not get good results:
  
  both changes:  4074 +/- 19.12%
  
  So, having both is probably not a good idea.  However, I feel like
  there's more work to be done.  With no over-commit (10 VMs), total
  throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
  overhead, but a reduction to ~4500 is still terrible.  By contrast,
  8-way VMs with 2x over-commit have a total throughput roughly 10% less
  than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
  host).  We still have what appears to be scalability problems, but now
  it's not so much in runqueue locks for yield_to(), but now
  get_pid_task():
 
 
 Hi Andrew,
 IMHO, reducing the double runqueue lock overhead is a good idea,
 and may be  we see the benefits when we increase the overcommit further.
 
 The explanation for not seeing good benefit on top of the PLE handler
 optimization patch is because we filter the yield_to candidates,
 and hence resulting in less contention for double runqueue lock.
 and extra code overhead during genuine yield_to might have resulted in
 some degradation in the case you tested.
 
 However, did you use cfs.next also?. I hope it helps, when we combine.
 
 Here is the result that is showing positive benefit.
 I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s).
   
 +---+---+---++---+
 kernbench time in sec, lower is better 
 +---+---+---++---+
base  stddev patched stddev  %improve
 

Re: [PATCH 4/5] virtio-scsi: Add start/stop functionality for vhost-scsi

2012-09-13 Thread Nicholas A. Bellinger
On Tue, 2012-09-11 at 18:07 +0300, Michael S. Tsirkin wrote:
 On Tue, Sep 11, 2012 at 08:46:34AM -0500, Anthony Liguori wrote:
  On 09/10/2012 01:24 AM, Michael S. Tsirkin wrote:
  On Mon, Sep 10, 2012 at 08:16:54AM +0200, Paolo Bonzini wrote:
  Il 09/09/2012 00:40, Michael S. Tsirkin ha scritto:
  On Fri, Sep 07, 2012 at 06:00:50PM +0200, Paolo Bonzini wrote:

SNIP

  Please create a completely separate device vhost-scsi-pci instead (or
  virtio-scsi-tcm-pci, or something like that).  It is used completely
  differently from virtio-scsi-pci, it does not make sense to conflate the
  two.
  
  Ideally the name would say how it is different, not what backend it
  uses. Any good suggestions?
  
  I chose the backend name because, ideally, there would be no other
  difference.  QEMU _could_ implement all the goodies in vhost-scsi (such
  as reservations or ALUA), it just doesn't do that yet.
  
  Paolo
  
  Then why do you say "It is used completely differently from
  virtio-scsi-pci"?  Isn't it just a different backend?
  
  If yes then it should be a backend option, like it is
  for virtio-net.
  
  I don't mean to bike shed here so don't take this as a nack on
  making it a backend option, but in retrospect, the way we did
  vhost-net was a mistake even though I strongly advocated for it to
  be a backend option.
  
  The code to do it is really, really ugly.  I think it would have
  made a lot more sense to just make it a device and then have it not
  use a netdev backend or any other kind of backend split.
  
  For instance:
  
  qemu -device vhost-net-pci,tapfd=X
  
  I know this breaks the model of separate backends and frontends but
  since vhost-net absolutely requires a tap fd, I think it's better in
  the long run to not abuse the netdev backend to prevent user
  confusion.  Having a dedicated backend type that only has one
  possible option and can only be used by one device is a bit silly
  too.
  
  So I would be in favor of dropping/squashing 3/5 and radically
  simplifying how this was exposed to the user.
  
  I would just take qemu_vhost_scsi_opts and make them device properties.
  
  Regards,
  
  Anthony Liguori
 
 I'd like to clarify that I'm fine with either approach.
 Even a separate device is OK if this is what others want
 though I like it the least.
 

Hi MST, Paolo & Co,

I've been out the better part of the week with the flu, and am just now
catching up on emails from the last days..

So, to better understand the reasoning for adding a separate PCI device
for vhost-scsi ahead of implementing the code changes, here are the main
points from folks' comments:

*) Convert vhost-scsi into a separate standalone vhost-scsi-pci device

  - Lets userspace know that virtio-scsi + QEMU block and virtio-scsi + 
tcm_vhost do not track SCSI state (such as reservations + ALUA), and
hence are not interchangeable during live-migration.
  
  - Reduces complexity of adding vhost-scsi related logic into existing
virtio-scsi-pci code path.

  - Having backends with one possible option doesn’t make much sense.

*) Keep vhost-scsi as a backend to virtio-scsi-pci

  - Reduces duplicated code amongst multiple virtio-scsi backends.
  
  - Follows the split for what existing vhost-net code already does.

So that said, two quick questions for Paolo & Co..

For the standalone vhost-scsi-pci device case, can you give a brief idea
as to what extent you'd like to see virtio-scsi.c code/defs duplicated
and/or shared amongst a new vhost-scsi-pci device..?

Also to help me along, can you give an example based on the current
usage below how the QEMU command line arguments would change with a
standalone vhost-scsi-pci device..?

./x86_64-softmmu/qemu-system-x86_64 -enable-kvm -smp 4 -m 2048 \
-hda /usr/src/qemu-vhost.git/debian_squeeze_amd64_standard-old.qcow2 \
-vhost-scsi id=vhost-scsi0,wwpn=naa.600140579ad21088,tpgt=1 \
-device virtio-scsi-pci,vhost-scsi=vhost-scsi0,event_idx=off
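
Purely as a strawman for comparison (none of these flags exist today; the device name and properties are guesses modeled on the vhost-net-pci example earlier in the thread), the standalone-device case might collapse the backend option into device properties like so:

```
# Hypothetical: vhost-scsi as a standalone device, with the -vhost-scsi
# backend options (wwpn, tpgt) folded into device properties.
./x86_64-softmmu/qemu-system-x86_64 -enable-kvm -smp 4 -m 2048 \
-hda /usr/src/qemu-vhost.git/debian_squeeze_amd64_standard-old.qcow2 \
-device vhost-scsi-pci,wwpn=naa.600140579ad21088,tpgt=1,event_idx=off
```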

Thank you!

--nab

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: PPC: Book3S HV: Get/set guest SPRs using the GET/SET_ONE_REG interface

2012-09-13 Thread Alexander Graf

On 12.09.2012, at 02:18, Paul Mackerras wrote:

 This enables userspace to get and set various SPRs (special-purpose
 registers) using the KVM_[GS]ET_ONE_REG ioctls.  With this, userspace
 can get and set all the SPRs that are part of the guest state, either
 through the KVM_[GS]ET_REGS ioctls, the KVM_[GS]ET_SREGS ioctls, or
 the KVM_[GS]ET_ONE_REG ioctls.
 
 The SPRs that are added here are:
 
 - DABR:  Data address breakpoint register
 - DSCR:  Data stream control register
 - PURR:  Processor utilization of resources register
 - SPURR: Scaled PURR
 - DAR:   Data address register
 - DSISR: Data storage interrupt status register
 - AMR:   Authority mask register
 - UAMOR: User authority mask override register
 - MMCR0, MMCR1, MMCRA: Performance monitor unit control registers
 - PMC1..PMC8: Performance monitor unit counter registers
 
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
 arch/powerpc/include/asm/kvm.h |   21 
 arch/powerpc/kvm/book3s_hv.c   |  106 

Documentation/virtual/kvm/api.txt | +++

:)


Alex

 2 files changed, 127 insertions(+)
 
 diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
 index 3c14202..9557576 100644
 --- a/arch/powerpc/include/asm/kvm.h
 +++ b/arch/powerpc/include/asm/kvm.h
 @@ -338,5 +338,26 @@ struct kvm_book3e_206_tlb_params {
 #define KVM_REG_PPC_IAC4  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x5)
 #define KVM_REG_PPC_DAC1  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x6)
 #define KVM_REG_PPC_DAC2  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x7)
 +#define KVM_REG_PPC_DABR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x8)
 +#define KVM_REG_PPC_DSCR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9)
 +#define KVM_REG_PPC_PURR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa)
 +#define KVM_REG_PPC_SPURR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb)
 +#define KVM_REG_PPC_DAR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc)
 +#define KVM_REG_PPC_DSISR (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xd)
 +#define KVM_REG_PPC_AMR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xe)
 +#define KVM_REG_PPC_UAMOR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xf)
 +
 +#define KVM_REG_PPC_MMCR0 (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
 +#define KVM_REG_PPC_MMCR1 (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
 +#define KVM_REG_PPC_MMCRA (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
 +
 +#define KVM_REG_PPC_PMC1 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x18)
 +#define KVM_REG_PPC_PMC2 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x19)
 +#define KVM_REG_PPC_PMC3 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1a)
 +#define KVM_REG_PPC_PMC4 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1b)
 +#define KVM_REG_PPC_PMC5 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1c)
 +#define KVM_REG_PPC_PMC6 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1d)
 +#define KVM_REG_PPC_PMC7 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1e)
 +#define KVM_REG_PPC_PMC8 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1f)
 
 #endif /* __LINUX_KVM_POWERPC_H */
 diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
 index 83e929e..7fe5c9a 100644
 --- a/arch/powerpc/kvm/book3s_hv.c
 +++ b/arch/powerpc/kvm/book3s_hv.c
 @@ -538,11 +538,53 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg)
 {
   int r = -EINVAL;
 + long int i;
 
   switch (reg->id) {
   case KVM_REG_PPC_HIOR:
   r = put_user(0, (u64 __user *)reg->addr);
   break;
  + case KVM_REG_PPC_DABR:
  + r = put_user(vcpu->arch.dabr, (u64 __user *)reg->addr);
  + break;
  + case KVM_REG_PPC_DSCR:
  + r = put_user(vcpu->arch.dscr, (u64 __user *)reg->addr);
  + break;
  + case KVM_REG_PPC_PURR:
  + r = put_user(vcpu->arch.purr, (u64 __user *)reg->addr);
  + break;
  + case KVM_REG_PPC_SPURR:
  + r = put_user(vcpu->arch.spurr, (u64 __user *)reg->addr);
  + break;
  + case KVM_REG_PPC_DAR:
  + r = put_user(vcpu->arch.shregs.dar, (u64 __user *)reg->addr);
  + break;
  + case KVM_REG_PPC_DSISR:
  + r = put_user(vcpu->arch.shregs.dsisr, (u32 __user *)reg->addr);
  + break;
  + case KVM_REG_PPC_AMR:
  + r = put_user(vcpu->arch.amr, (u64 __user *)reg->addr);
  + break;
  + case KVM_REG_PPC_UAMOR:
  + r = put_user(vcpu->arch.uamor, (u64 __user *)reg->addr);
  + break;
  + case KVM_REG_PPC_MMCR0:
  + case KVM_REG_PPC_MMCR1:
  + case KVM_REG_PPC_MMCRA:
  + i = reg->id - KVM_REG_PPC_MMCR0;
  + r = put_user(vcpu->arch.mmcr[i], (u64 __user *)reg->addr);
  + break;
  + case KVM_REG_PPC_PMC1:
  + case KVM_REG_PPC_PMC2:
  + case KVM_REG_PPC_PMC3:
  + case KVM_REG_PPC_PMC4:
  + case KVM_REG_PPC_PMC5:
  + case KVM_REG_PPC_PMC6:
  + case KVM_REG_PPC_PMC7:
  + case KVM_REG_PPC_PMC8:
  + i = reg->id - 

Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Alexander Graf

On 12.09.2012, at 02:19, Paul Mackerras wrote:

 Currently the KVM_GET_FPU and KVM_SET_FPU ioctls return an EOPNOTSUPP
 error on powerpc.  This implements them for Book 3S processors.  Since
 those processors have more than just the 32 basic floating-point
 registers, this extends the kvm_fpu structure to have space for the
 additional registers -- the 32 vector registers (128 bits each) for
 VMX/Altivec and the 32 additional 64-bit registers that were added
 on POWER7 for the vector-scalar extension (VSX).  It also adds a
 `valid' field, which is a bitmap indicating which elements contain
 valid data.
 
 The layout of the floating-point register data in the vcpu struct is
 mostly the same between different flavors of KVM on Book 3S processors,
 but the set of supported registers may differ depending on what the
 CPU hardware supports and how much is emulated.  Therefore we have
 a flavor-specific function to work out which set of registers to
 supply for the get function.
 
 On POWER7 processors using the Book 3S HV flavor of KVM, we save the
 standard floating-point registers together with their corresponding
 VSX extension register in the vcpu->arch.vsr[] array, since each
 pair can be loaded or stored with one instruction.  This is different
 to other flavors of KVM, and to other processors (i.e. PPC970) with
 HV KVM, which store the standard FPRs in vcpu->arch.fpr[].  To cope
 with this, we use the kvmppc_core_get_fpu_valid() and
 kvmppc_core_set_fpu_valid() functions to sync between the arch.fpr[]
 and arch.vsr[] arrays as needed.
 
 Signed-off-by: Paul Mackerras pau...@samba.org

Any reason to not use ONE_REG here?


Alex



Re: [PATCH 0/3] KVM: PPC: Book3S HV: More flexible allocator for linear memory

2012-09-13 Thread Alexander Graf

On 12.09.2012, at 02:34, Paul Mackerras wrote:

 This series of 3 patches makes it possible for guests to allocate
 whatever size of HPT they need from linear memory preallocated at
 boot, rather than being restricted to a single size of HPT (by
 default, 16MB) and having to use the kernel page allocator for
 anything else -- which in practice limits them to at most 16MB given
 the default value for the maximum page order.  Instead of allocating
 many individual pieces of memory, this allocates a single contiguous
 area and uses a simple bitmap-based allocator to hand out pieces of it
 as required.

Have you tried to play with CMA for this? It sounds like it could buy us 
exactly what we need.


Alex



Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Paul Mackerras
On Fri, Sep 14, 2012 at 01:30:51AM +0200, Alexander Graf wrote:
 
 On 12.09.2012, at 02:19, Paul Mackerras wrote:
 
  Currently the KVM_GET_FPU and KVM_SET_FPU ioctls return an EOPNOTSUPP
  error on powerpc.  This implements them for Book 3S processors.  Since
  those processors have more than just the 32 basic floating-point
  registers, this extends the kvm_fpu structure to have space for the
  additional registers -- the 32 vector registers (128 bits each) for
  VMX/Altivec and the 32 additional 64-bit registers that were added
  on POWER7 for the vector-scalar extension (VSX).  It also adds a
  `valid' field, which is a bitmap indicating which elements contain
  valid data.
  
  The layout of the floating-point register data in the vcpu struct is
  mostly the same between different flavors of KVM on Book 3S processors,
  but the set of supported registers may differ depending on what the
  CPU hardware supports and how much is emulated.  Therefore we have
  a flavor-specific function to work out which set of registers to
  supply for the get function.
  
  On POWER7 processors using the Book 3S HV flavor of KVM, we save the
  standard floating-point registers together with their corresponding
  VSX extension register in the vcpu->arch.vsr[] array, since each
  pair can be loaded or stored with one instruction.  This is different
  to other flavors of KVM, and to other processors (i.e. PPC970) with
  HV KVM, which store the standard FPRs in vcpu->arch.fpr[].  To cope
  with this, we use the kvmppc_core_get_fpu_valid() and
  kvmppc_core_set_fpu_valid() functions to sync between the arch.fpr[]
  and arch.vsr[] arrays as needed.
  
  Signed-off-by: Paul Mackerras pau...@samba.org
 
 Any reason to not use ONE_REG here?

Just consistency with x86 -- they have an xmm[][] field in their
struct kvm_fpu which looks like it contains their vector state.

Paul.


Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Alexander Graf

On 14.09.2012, at 01:58, Paul Mackerras wrote:

 On Fri, Sep 14, 2012 at 01:30:51AM +0200, Alexander Graf wrote:
 
 On 12.09.2012, at 02:19, Paul Mackerras wrote:
 
 Currently the KVM_GET_FPU and KVM_SET_FPU ioctls return an EOPNOTSUPP
 error on powerpc.  This implements them for Book 3S processors.  Since
 those processors have more than just the 32 basic floating-point
 registers, this extends the kvm_fpu structure to have space for the
 additional registers -- the 32 vector registers (128 bits each) for
 VMX/Altivec and the 32 additional 64-bit registers that were added
 on POWER7 for the vector-scalar extension (VSX).  It also adds a
 `valid' field, which is a bitmap indicating which elements contain
 valid data.
 
 The layout of the floating-point register data in the vcpu struct is
 mostly the same between different flavors of KVM on Book 3S processors,
 but the set of supported registers may differ depending on what the
 CPU hardware supports and how much is emulated.  Therefore we have
 a flavor-specific function to work out which set of registers to
 supply for the get function.
 
 On POWER7 processors using the Book 3S HV flavor of KVM, we save the
 standard floating-point registers together with their corresponding
 VSX extension register in the vcpu->arch.vsr[] array, since each
 pair can be loaded or stored with one instruction.  This is different
 to other flavors of KVM, and to other processors (i.e. PPC970) with
 HV KVM, which store the standard FPRs in vcpu->arch.fpr[].  To cope
 with this, we use the kvmppc_core_get_fpu_valid() and
 kvmppc_core_set_fpu_valid() functions to sync between the arch.fpr[]
 and arch.vsr[] arrays as needed.
 
 Signed-off-by: Paul Mackerras pau...@samba.org
 
 Any reason to not use ONE_REG here?
 
 Just consistency with x86 -- they have an xmm[][] field in their
 struct kvm_fpu which looks like it contains their vector state.

Yup. Considering how different the FPU state on different ppc cores is, I'd be 
more happy with shoving it into something that allows for more dynamic control. 
Otherwise we'd end up with yet another struct sregs that can contain SPE 
registers, altivec, and a dozen additions to it :).

Please just use one_reg for all of the register synchronization you want to 
add, unless there's a compelling reason to do it differently. It will make our 
lives a lot easier in the future. If we need to transfer too much data and 
actually run into performance trouble, we can always add a GET_MANY_REG ioctl.


Alex



Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Paul Mackerras
On Fri, Sep 14, 2012 at 02:03:15AM +0200, Alexander Graf wrote:
 
 Yup. Considering how different the FPU state on different ppc cores is, I'd 
 be more happy with shoving it into something that allows for more dynamic 
 control. Otherwise we'd end up with yet another struct sregs that can contain 
 SPE registers, altivec, and a dozen additions to it :).
 
 Please just use one_reg for all of the register synchronization you want to 
 add, unless there's a compelling reason to do it differently. It will make 
 our lives a lot easier in the future. If we need to transfer too much data and 
 actually run into performance trouble, we can always add a GET_MANY_REG ioctl.

It just seems perverse to ignore the existing interface that every
other architecture uses, and instead do something unique that is
actually slower, but whatever...

Paul.


Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Alexander Graf

On 14.09.2012, at 03:36, Paul Mackerras wrote:

 On Fri, Sep 14, 2012 at 02:03:15AM +0200, Alexander Graf wrote:
 
  Yup. Considering how different the FPU state on different ppc cores is, I'd 
 be more happy with shoving it into something that allows for more dynamic 
 control. Otherwise we'd end up with yet another struct sregs that can 
 contain SPE registers, altivec, and a dozen additions to it :).
 
 Please just use one_reg for all of the register synchronization you want to 
 add, unless there's a compelling reason to do it differently. It will make 
  our lives a lot easier in the future. If we need to transfer too much data 
 and actually run into performance trouble, we can always add a GET_MANY_REG 
 ioctl.
 
 It just seems perverse to ignore the existing interface that every
 other architecture uses, and instead do something unique that is
 actually slower, but whatever...

We're slowly moving towards ONE_REG. ARM is already going full steam ahead and 
I'd like to have every new register in PPC be modeled with it as well. The old 
interface broke on us one time too often now :).

As I said, if we run into performance problems, we will implement ways to 
improve performance. At the end of the day, x86 will be the odd one out.


Alex



Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Alexander Graf

On 14.09.2012, at 03:44, Alexander Graf wrote:

 
 On 14.09.2012, at 03:36, Paul Mackerras wrote:
 
 On Fri, Sep 14, 2012 at 02:03:15AM +0200, Alexander Graf wrote:
 
  Yup. Considering how different the FPU state on different ppc cores is, I'd 
 be more happy with shoving it into something that allows for more dynamic 
 control. Otherwise we'd end up with yet another struct sregs that can 
 contain SPE registers, altivec, and a dozen additions to it :).
 
 Please just use one_reg for all of the register synchronization you want to 
 add, unless there's a compelling reason to do it differently. It will make 
  our lives a lot easier in the future. If we need to transfer too much data 
 and actually run into performance trouble, we can always add a GET_MANY_REG 
 ioctl.
 
 It just seems perverse to ignore the existing interface that every
 other architecture uses, and instead do something unique that is
 actually slower, but whatever...
 
 We're slowly moving towards ONE_REG. ARM is already going full steam ahead 
 and I'd like to have every new register in PPC be modeled with it as well. 
 The old interface broke on us one time too often now :).
 
 As I said, if we run into performance problems, we will implement ways to 
 improve performance. At the end of the day, x86 will be the odd one out.

(plus your patch breaks ABI compatibility with old user space)


Alex



Re: [PATCH v7 3/3] KVM: perf: kvm events analysis tool

2012-09-13 Thread Xiao Guangrong
On 09/13/2012 12:56 PM, David Ahern wrote:

 
 That suggests what is really needed is a 'live' mode - a continual updating 
 of the output like perf top, not a record and analyze later mode. Which does 
 come back to why I responded to this email -- the syntax is klunky and 
 awkward.
 
 So, I spent a fair amount of time today implementing a live mode. And after a 
 lot of swearing at the tracepoint processing code I finally have it working. 
 And the format extends easily (meaning  day and the next step) to a 
 perf-based kvm_stat replacement. Example syntax is:
 
perf kvm stat [-p pid|-a|...]
 
 which defaults to an update delay of 1 second, and vmexit analysis.

Hi David,

I am very glad to see the live mode; it is very similar to kvm_stat(*). I think
kvm guys will like it.

 
 The guts of the processing logic come from the existing kvm-events code. The 
 changes focus on combining the record and report paths into one. The display 
 needs some help (Arnaldo?), but it seems to work well.
 
 I'd like to get opinions on what next? IMO, the record/report path should not 
 get a foot hold from a backward compatibility perspective and having to 
 maintain those options. I am willing to take the existing patches into git to 
 maintain authorship and from there apply patches to make the live mode work - 
 which includes a bit of refactoring of perf code (like the stats changes).

We'd better keep the record/report functionality; sometimes we can only get a
perf.data file from customers whose machines we cannot access.

Especially since other tracepoints are also interesting for us: when customers
encounter a performance issue, we always ask them to use perf kvm stat -e xxx
to append other events, like lock:*. Then we can get not only the kvm events
information by using 'perf kvm stat' but also other information like 'perf
lock' or 'perf script' to get the whole sequence.

 
 Before I march down this path, any objections, opinions, etc?

Also, I think live mode would be useful for 'perf lock/sched' too; could you
implement it in the perf core?

By the way, the new version of our patchset is ready. Do you want to add your
implementation after it is accepted by Arnaldo, or are you going to post it
together with our patchset?

Thanks!

* kvm_stat can be found at scripts/kvm/kvm_stat in the Qemu source tree,
  located at https://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git.



Re: memory-hotplug : possible circular locking dependency detected

2012-09-13 Thread Wen Congyang
At 09/13/2012 02:19 PM, Yasuaki Ishimatsu Wrote:
 When I offline a memory on linux-3.6-rc5, possible circular
 locking dependency detected messages are shown.
 Are the messages known problem?

It is a known problem, but it doesn't cause a deadlock.
There are 3 locks involved: memory hotplug's lock, the memory
hotplug notifier's lock, and ksm_thread_mutex.

ksm_thread_mutex is locked when the memory is going offline
and is unlocked when the memory is offlined or the offlining
is cancelled. That is why we see the warning messages. But it
doesn't cause a deadlock, because we always lock
mem_hotplug_mutex first.

Thanks
Wen Congyang

 
 [  201.596363] Offlined Pages 32768
 [  201.596373] remove from free list 14 1024 148000
 [  201.596493] remove from free list 140400 1024 148000
 [  201.596612] remove from free list 140800 1024 148000
 [  201.596730] remove from free list 140c00 1024 148000
 [  201.596849] remove from free list 141000 1024 148000
 [  201.596968] remove from free list 141400 1024 148000
 [  201.597049] remove from free list 141800 1024 148000
 [  201.597049] remove from free list 141c00 1024 148000
 [  201.597049] remove from free list 142000 1024 148000
 [  201.597049] remove from free list 142400 1024 148000
 [  201.597049] remove from free list 142800 1024 148000
 [  201.597049] remove from free list 142c00 1024 148000
 [  201.597049] remove from free list 143000 1024 148000
 [  201.597049] remove from free list 143400 1024 148000
 [  201.597049] remove from free list 143800 1024 148000
 [  201.597049] remove from free list 143c00 1024 148000
 [  201.597049] remove from free list 144000 1024 148000
 [  201.597049] remove from free list 144400 1024 148000
 [  201.597049] remove from free list 144800 1024 148000
 [  201.597049] remove from free list 144c00 1024 148000
 [  201.597049] remove from free list 145000 1024 148000
 [  201.597049] remove from free list 145400 1024 148000
 [  201.597049] remove from free list 145800 1024 148000
 [  201.597049] remove from free list 145c00 1024 148000
 [  201.597049] remove from free list 146000 1024 148000
 [  201.597049] remove from free list 146400 1024 148000
 [  201.597049] remove from free list 146800 1024 148000
 [  201.597049] remove from free list 146c00 1024 148000
 [  201.597049] remove from free list 147000 1024 148000
 [  201.597049] remove from free list 147400 1024 148000
 [  201.597049] remove from free list 147800 1024 148000
 [  201.597049] remove from free list 147c00 1024 148000
 [  201.602143] 
 [  201.602150] ==
 [  201.602153] [ INFO: possible circular locking dependency detected ]
 [  201.602157] 3.6.0-rc5 #1 Not tainted
 [  201.602159] ---
 [  201.602162] bash/2789 is trying to acquire lock:
 [  201.602164]  ((memory_chain).rwsem){.+.+.+}, at: [8109fe16] 
 __blocking_notifier_call_chain+0x66/0xd0
 [  201.602180] 
 [  201.602180] but task is already holding lock:
 [  201.602182]  (ksm_thread_mutex/1){+.+.+.}, at: [811b41fa] 
 ksm_memory_callback+0x3a/0xc0
 [  201.602194] 
 [  201.602194] which lock already depends on the new lock.
 [  201.602194] 
 [  201.602197] 
 [  201.602197] the existing dependency chain (in reverse order) is:
 [  201.602200] 
 [  201.602200] - #1 (ksm_thread_mutex/1){+.+.+.}:
 [  201.602208][810dbee9] validate_chain+0x6d9/0x7e0
 [  201.602214][810dc2e6] __lock_acquire+0x2f6/0x4f0
 [  201.602219][810dc57d] lock_acquire+0x9d/0x190
 [  201.602223][8166b4fc] __mutex_lock_common+0x5c/0x420
 [  201.602229][8166ba2a] mutex_lock_nested+0x4a/0x60
 [  201.602234][811b41fa] ksm_memory_callback+0x3a/0xc0
 [  201.602239][81673447] notifier_call_chain+0x67/0x150
 [  201.602244][8109fe2b] 
 __blocking_notifier_call_chain+0x7b/0xd0
 [  201.602250][8109fe96] 
 blocking_notifier_call_chain+0x16/0x20
 [  201.602255][8144c53b] memory_notify+0x1b/0x20
 [  201.602261][81653c51] offline_pages+0x1b1/0x470
 [  201.602267][811bfcae] remove_memory+0x1e/0x20
 [  201.602273][8144c661] memory_block_action+0xa1/0x190
 [  201.602278][8144c7c9] memory_block_change_state+0x79/0xe0
 [  201.602282][8144c8f2] store_mem_state+0xc2/0xd0
 [  201.602287][81436980] dev_attr_store+0x20/0x30
 [  201.602293][812498d3] sysfs_write_file+0xa3/0x100
 [  201.602299][811cba80] vfs_write+0xd0/0x1a0
 [  201.602304][811cbc54] sys_write+0x54/0xa0
 [  201.602309][81678529] system_call_fastpath+0x16/0x1b
 [  201.602315] 
 [  201.602315] - #0 ((memory_chain).rwsem){.+.+.+}:
 [  201.602322][810db7e7] check_prev_add+0x527/0x550
 [  201.602326][810dbee9] validate_chain+0x6d9/0x7e0
 [  201.602331][810dc2e6] 

Re: [3.5.0 BUG] vmx_handle_exit: unexpected, valid vectoring info (0x80000b0e)

2012-09-13 Thread Xiao Guangrong
On 09/12/2012 04:15 PM, Avi Kivity wrote:
 On 09/12/2012 07:40 AM, Fengguang Wu wrote:
 Hi,

 3 of my test boxes running v3.5 kernel become unaccessible and I find
 two of them kept emitting this dmesg:

 vmx_handle_exit: unexpected, valid vectoring info (0x80000b0e) and exit 
 reason is 0x31

 The other one has frozen and the above lines are the last dmesg.
 Any ideas?
 
 First, that printk should be rate-limited.
 
 Second, we should add EXIT_REASON_EPT_MISCONFIG (0x31) to 
 
   if ((vectoring_info & VECTORING_INFO_VALID_MASK) &&
       (exit_reason != EXIT_REASON_EXCEPTION_NMI &&
        exit_reason != EXIT_REASON_EPT_VIOLATION &&
        exit_reason != EXIT_REASON_TASK_SWITCH))
           printk(KERN_WARNING "%s: unexpected, valid vectoring info "
                  "(0x%x) and exit reason is 0x%x\n",
                  __func__, vectoring_info, exit_reason);
 
 since it's easily caused by the guest.

Yes, i will do these.

 
 Third, it's really unexpected.  It seems the guest was attempting to deliver 
 a page fault exception (0x0e) but encountered an mmio page during delivery 
 (in the IDT, TSS, stack, or page tables).  Is this reproducible?  If so it's 
 easy to patch kvm to halt in that case and allow examining the guest via qemu.
 

I have no idea yet why the box froze in this case; I will try to write a test
case and hope it helps me find the reason.

 Maybe we should do so regardless (return a KVM_EXIT_INTERNAL_ERROR).

I think this is reasonable.

Thanks!




RE: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM macro

2012-09-13 Thread Caraman Mihai Claudiu-B02008
 -Original Message-
 From: Wood Scott-B07421
 Sent: Thursday, September 13, 2012 12:54 AM
 To: Alexander Graf
 Cc: Caraman Mihai Claudiu-B02008; kvm-ppc@vger.kernel.org; linuxppc-
 d...@lists.ozlabs.org; k...@vger.kernel.org
 Subject: Re: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM
 macro
 
 On 09/12/2012 04:45 PM, Alexander Graf wrote:
 
 
  On 12.09.2012, at 23:38, Scott Wood scottw...@freescale.com wrote:
 
  On 09/12/2012 01:56 PM, Alexander Graf wrote:
 
 
  On 12.09.2012, at 15:18, Mihai Caraman mihai.cara...@freescale.com
 wrote:
 
  The current form of DO_KVM macro restricts its use to one call per
 input
  parameter set. This is caused by kvmppc_resume_\intno\()_\srr1
 symbol
  definition.
  Duplicate calls of DO_KVM are required by distinct implementations
 of
  exception handlers which are delegated at runtime.
 
  Not sure I understand what you're trying to achieve here. Please
 elaborate ;)
 
  On 64-bit book3e we compile multiple versions of the TLB miss
 handlers,
  and choose from them at runtime.

The exception handler patching has been active in the __early_init_mmu()
function in powerpc/mm/tlb_nohash.c for quite a few years. For TLB miss
exceptions there are three handler versions: standard, HW tablewalk and bolted.

 I posted a patch to add another variant, for e6500-style hardware
 tablewalk, which shares the bolted prolog/epilog (besides prolog/epilog
 performance, e6500 is incompatible with the IBM tablewalk code for
 various reasons).  That caused us to have two DO_KVMs for the same
 exception type.

Sorry, I forgot to cc the kvm-ppc mailing list when I replied to that
discussion thread.

-Mike


Re: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM macro

2012-09-13 Thread Alexander Graf

On 09/12/2012 03:18 PM, Mihai Caraman wrote:

The current form of the DO_KVM macro restricts its use to one call per input
parameter set. This is caused by the kvmppc_resume_\intno\()_\srr1 symbol
definition. Duplicate calls of DO_KVM are required by distinct implementations
of exception handlers which are selected at runtime. Use a rare label number
to avoid conflicts with the calling contexts.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com


Thanks, applied to kvm-ppc-next.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Alexander Graf

On 12.09.2012, at 02:19, Paul Mackerras wrote:

 Currently the KVM_GET_FPU and KVM_SET_FPU ioctls return an EOPNOTSUPP
 error on powerpc.  This implements them for Book 3S processors.  Since
 those processors have more than just the 32 basic floating-point
 registers, this extends the kvm_fpu structure to have space for the
 additional registers -- the 32 vector registers (128 bits each) for
 VMX/Altivec and the 32 additional 64-bit registers that were added
 on POWER7 for the vector-scalar extension (VSX).  It also adds a
 `valid' field, which is a bitmap indicating which elements contain
 valid data.
 
 The layout of the floating-point register data in the vcpu struct is
 mostly the same between different flavors of KVM on Book 3S processors,
 but the set of supported registers may differ depending on what the
 CPU hardware supports and how much is emulated.  Therefore we have
 a flavor-specific function to work out which set of registers to
 supply for the get function.
 
 On POWER7 processors using the Book 3S HV flavor of KVM, we save the
 standard floating-point registers together with their corresponding
 VSX extension register in the vcpu->arch.vsr[] array, since each
 pair can be loaded or stored with one instruction.  This is different
 to other flavors of KVM, and to other processors (e.g. PPC970) with
 HV KVM, which store the standard FPRs in vcpu->arch.fpr[].  To cope
 with this, we use the kvmppc_core_get_fpu_valid() and
 kvmppc_core_set_fpu_valid() functions to sync between the arch.fpr[]
 and arch.vsr[] arrays as needed.
 
 Signed-off-by: Paul Mackerras pau...@samba.org

Any reason to not use ONE_REG here?


Alex



Re: [PATCH 0/3] KVM: PPC: Book3S HV: More flexible allocator for linear memory

2012-09-13 Thread Alexander Graf

On 12.09.2012, at 02:34, Paul Mackerras wrote:

 This series of 3 patches makes it possible for guests to allocate
 whatever size of HPT they need from linear memory preallocated at
 boot, rather than being restricted to a single size of HPT (by
 default, 16MB) and having to use the kernel page allocator for
 anything else -- which in practice limits them to at most 16MB given
 the default value for the maximum page order.  Instead of allocating
 many individual pieces of memory, this allocates a single contiguous
 area and uses a simple bitmap-based allocator to hand out pieces of it
 as required.

Have you tried to play with CMA for this? It sounds like it could buy us 
exactly what we need.


Alex



Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Paul Mackerras
On Fri, Sep 14, 2012 at 01:30:51AM +0200, Alexander Graf wrote:
 
 On 12.09.2012, at 02:19, Paul Mackerras wrote:
 
  Currently the KVM_GET_FPU and KVM_SET_FPU ioctls return an EOPNOTSUPP
  error on powerpc.  This implements them for Book 3S processors.  Since
  those processors have more than just the 32 basic floating-point
  registers, this extends the kvm_fpu structure to have space for the
  additional registers -- the 32 vector registers (128 bits each) for
  VMX/Altivec and the 32 additional 64-bit registers that were added
  on POWER7 for the vector-scalar extension (VSX).  It also adds a
  `valid' field, which is a bitmap indicating which elements contain
  valid data.
  
  The layout of the floating-point register data in the vcpu struct is
  mostly the same between different flavors of KVM on Book 3S processors,
  but the set of supported registers may differ depending on what the
  CPU hardware supports and how much is emulated.  Therefore we have
  a flavor-specific function to work out which set of registers to
  supply for the get function.
  
  On POWER7 processors using the Book 3S HV flavor of KVM, we save the
  standard floating-point registers together with their corresponding
  VSX extension register in the vcpu->arch.vsr[] array, since each
  pair can be loaded or stored with one instruction.  This is different
  to other flavors of KVM, and to other processors (e.g. PPC970) with
  HV KVM, which store the standard FPRs in vcpu->arch.fpr[].  To cope
  with this, we use the kvmppc_core_get_fpu_valid() and
  kvmppc_core_set_fpu_valid() functions to sync between the arch.fpr[]
  and arch.vsr[] arrays as needed.
  
  Signed-off-by: Paul Mackerras pau...@samba.org
 
 Any reason to not use ONE_REG here?

Just consistency with x86 -- they have an xmm[][] field in their
struct kvm_fpu which looks like it contains their vector state.

Paul.


Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Alexander Graf

On 14.09.2012, at 01:58, Paul Mackerras wrote:

 On Fri, Sep 14, 2012 at 01:30:51AM +0200, Alexander Graf wrote:
 
 On 12.09.2012, at 02:19, Paul Mackerras wrote:
 
 Currently the KVM_GET_FPU and KVM_SET_FPU ioctls return an EOPNOTSUPP
 error on powerpc.  This implements them for Book 3S processors.  Since
 those processors have more than just the 32 basic floating-point
 registers, this extends the kvm_fpu structure to have space for the
 additional registers -- the 32 vector registers (128 bits each) for
 VMX/Altivec and the 32 additional 64-bit registers that were added
 on POWER7 for the vector-scalar extension (VSX).  It also adds a
 `valid' field, which is a bitmap indicating which elements contain
 valid data.
 
 The layout of the floating-point register data in the vcpu struct is
 mostly the same between different flavors of KVM on Book 3S processors,
 but the set of supported registers may differ depending on what the
 CPU hardware supports and how much is emulated.  Therefore we have
 a flavor-specific function to work out which set of registers to
 supply for the get function.
 
 On POWER7 processors using the Book 3S HV flavor of KVM, we save the
 standard floating-point registers together with their corresponding
 VSX extension register in the vcpu->arch.vsr[] array, since each
 pair can be loaded or stored with one instruction.  This is different
 to other flavors of KVM, and to other processors (e.g. PPC970) with
 HV KVM, which store the standard FPRs in vcpu->arch.fpr[].  To cope
 with this, we use the kvmppc_core_get_fpu_valid() and
 kvmppc_core_set_fpu_valid() functions to sync between the arch.fpr[]
 and arch.vsr[] arrays as needed.
 
 Signed-off-by: Paul Mackerras pau...@samba.org
 
 Any reason to not use ONE_REG here?
 
 Just consistency with x86 -- they have an xmm[][] field in their
 struct kvm_fpu which looks like it contains their vector state.

Yup. Considering how different the FPU state on different PPC cores is, I'd be 
happier shoving it into something that allows for more dynamic control. 
Otherwise we'd end up with yet another struct sregs that contains SPE 
registers, Altivec, and a dozen additions to it :).

Please just use ONE_REG for all of the register synchronization you want to 
add, unless there's a compelling reason to do it differently. It will make our 
lives a lot easier in the future. If we need to transfer too much data and 
actually run into performance trouble, we can always add a GET_MANY_REG ioctl.


Alex



Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Paul Mackerras
On Fri, Sep 14, 2012 at 02:03:15AM +0200, Alexander Graf wrote:
 
 Yup. Considering how different the FPU state on different PPC cores is, I'd 
 be happier shoving it into something that allows for more dynamic 
 control. Otherwise we'd end up with yet another struct sregs that contains 
 SPE registers, Altivec, and a dozen additions to it :).
 
 Please just use ONE_REG for all of the register synchronization you want to 
 add, unless there's a compelling reason to do it differently. It will make 
 our lives a lot easier in the future. If we need to transfer too much data and 
 actually run into performance trouble, we can always add a GET_MANY_REG ioctl.

It just seems perverse to ignore the existing interface that every
other architecture uses, and instead do something unique that is
actually slower, but whatever...

Paul.


Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Alexander Graf

On 14.09.2012, at 03:36, Paul Mackerras wrote:

 On Fri, Sep 14, 2012 at 02:03:15AM +0200, Alexander Graf wrote:
 
 Yup. Considering how different the FPU state on different PPC cores is, I'd 
 be happier shoving it into something that allows for more dynamic 
 control. Otherwise we'd end up with yet another struct sregs that 
 contains SPE registers, Altivec, and a dozen additions to it :).
 
 Please just use ONE_REG for all of the register synchronization you want to 
 add, unless there's a compelling reason to do it differently. It will make 
 our lives a lot easier in the future. If we need to transfer too much data 
 and actually run into performance trouble, we can always add a GET_MANY_REG 
 ioctl.
 
 It just seems perverse to ignore the existing interface that every
 other architecture uses, and instead do something unique that is
 actually slower, but whatever...

We're slowly moving towards ONE_REG. ARM is already going full steam ahead and 
I'd like to have every new register in PPC be modeled with it as well. The old 
interface has broken on us one time too many :).

As I said, if we run into performance problems, we will implement ways to 
improve performance. At the end of the day, x86 will be the odd one out.


Alex



Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions

2012-09-13 Thread Alexander Graf

On 14.09.2012, at 03:44, Alexander Graf wrote:

 
 On 14.09.2012, at 03:36, Paul Mackerras wrote:
 
 On Fri, Sep 14, 2012 at 02:03:15AM +0200, Alexander Graf wrote:
 
  Yup. Considering how different the FPU state on different PPC cores is, I'd 
  be happier shoving it into something that allows for more dynamic 
  control. Otherwise we'd end up with yet another struct sregs that 
  contains SPE registers, Altivec, and a dozen additions to it :).
  
  Please just use ONE_REG for all of the register synchronization you want to 
  add, unless there's a compelling reason to do it differently. It will make 
  our lives a lot easier in the future. If we need to transfer too much data 
  and actually run into performance trouble, we can always add a GET_MANY_REG 
  ioctl.
 
 It just seems perverse to ignore the existing interface that every
 other architecture uses, and instead do something unique that is
 actually slower, but whatever...
 
 We're slowly moving towards ONE_REG. ARM is already going full steam ahead 
 and I'd like to have every new register in PPC be modeled with it as well. 
 The old interface has broken on us one time too many :).
 
 As I said, if we run into performance problems, we will implement ways to 
 improve performance. At the end of the day, x86 will be the odd one out.

(plus your patch breaks ABI compatibility with old user space)


Alex
