Re: question about napi_disable (was Re: [PATCH] virtio_net: set/cancel work on ndo_open/ndo_stop)

2012-04-05 Thread Jason Wang

On 04/04/2012 05:32 PM, Michael S. Tsirkin wrote:

On Thu, Dec 29, 2011 at 09:12:38PM +1030, Rusty Russell wrote:

  Michael S. Tsirkin noticed that we could run the refill work after
  ndo_close, which can re-enable napi - we don't disable it until
  virtnet_remove.  This is clearly wrong, so move the workqueue control
  to ndo_open and ndo_stop (aka. virtnet_open and virtnet_close).
  
  One subtle point: virtnet_probe() could simply fail if it couldn't

  allocate a receive buffer, but that's less polite in virtnet_open() so
  we schedule a refill as we do in the normal receive path if we run out
  of memory.
  
  Signed-off-by: Rusty Russell ru...@rustcorp.com.au

Doh.
napi_disable does not prevent the following
napi_schedule, does it?

Can someone confirm that I am not seeing things please?
Looks like napi_disable() does prevent subsequent scheduling:
napi_schedule_prep() returns true only when there is a 0 -> 1 transition
of the NAPI_STATE_SCHED bit.
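
For reference, the relevant helpers looked roughly like this at the time (an
abridged sketch, not verbatim kernel source): napi_disable() grabs
NAPI_STATE_SCHED itself and never releases it, so a later
napi_schedule_prep() can never observe the 0 -> 1 transition.

    /* Abridged sketch of the 2012-era helpers, not verbatim source. */
    static inline int napi_schedule_prep(struct napi_struct *n)
    {
        /* True only if no disable is pending AND we are the ones who
         * flip NAPI_STATE_SCHED from 0 to 1. */
        return !napi_disable_pending(n) &&
            !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
    }

    void napi_disable(struct napi_struct *n)
    {
        set_bit(NAPI_STATE_DISABLE, &n->state);
        /* Take NAPI_STATE_SCHED and keep it, so every subsequent
         * napi_schedule_prep() fails. */
        while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
            msleep(1);
        clear_bit(NAPI_STATE_DISABLE, &n->state);
    }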



Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-05 Thread Jan Kiszka
On 2012-04-05 05:42, Alex Williamson wrote:
 We've hit a kernel host panic, when issuing a 'system_reset' with an
 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
 
 [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
 32993
 [Hardware Error]: APEI generic hardware error status
 [Hardware Error]: severity: 1, fatal
 [Hardware Error]: section: 0, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 [Hardware Error]: section: 1, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 Kernel panic - not syncing: Fatal hardware error!
 Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
 Call Trace:
  NMI  [814f2fe5] ? panic+0xa0/0x168
  [812f919c] ? ghes_notify_nmi+0x17c/0x180
  [814f91d5] ? notifier_call_chain+0x55/0x80
  [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
  [8109667e] ? notify_die+0x2e/0x30
  [814f6e81] ? do_nmi+0x1a1/0x2b0
  [814f6760] ? nmi+0x20/0x30
  [8103762b] ? native_safe_halt+0xb/0x10
  EOE  [8101495d] ? default_idle+0x4d/0xb0
  [81009e06] ? cpu_idle+0xb6/0x110
  [814da63a] ? rest_init+0x7a/0x80
  [81c1ff7b] ? start_kernel+0x424/0x430
  [81c1f33a] ? x86_64_start_reservations+0x125/0x129
  [81c1f438] ? x86_64_start_kernel+0xfa/0x109
 
 The root cause of the problem is that the 'reset_assigned_device()' code
 first writes a 0 to the command register. Then, when qemu subsequently does
 a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
 the kernel ends up calling '__msix_mask_irq()', which performs a write to
 the memory mapped msi vector space. Since we've explicitly told the device
 to disallow mmio access (via the 0 write to the command register), we end
 up with the above 'Unsupported Request'.
 
 The fix here is to first disable MSI-X, before doing the reset.  We also
 disable MSI, leaving the device in INTx mode.  In this way, the device is
 in a known state after reset, and we avoid touching msi memory mapped space
 on any subsequent 'kvm_deassign_irq()'.
 
 Thanks to Michael S. Tsirkin for help in understanding what was going on
 here, and to Jason Baron, the original debugger of this problem.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
 Jason is out of the office for a couple weeks, so I'll try to resolve
 this while he's away.  Somehow the emulated config updates were lost
 in Jason's original posting, so I've fixed that and taken Jan's suggestion
 to simply call into the update functions instead of open coding the
 interrupt disable.  I think there still may be some disagreements about
 how to handle guest generated errors in the host, but that's a large
 project whereas this is something we should be doing at reset anyway,
 and even if only a workaround, resolves the problem above.
 
  hw/device-assignment.c |   23 +++
  1 files changed, 23 insertions(+), 0 deletions(-)
 
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 89823f1..2e6b93e 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
  const char reset[] = "1";
  int fd, ret;
  
 +/*
 + * If a guest is reset without being shutdown, MSI/MSI-X can still
 + * be running.  We want to return the device to a known state on
 + * reset, so disable those here.  We especially do not want MSI-X

To be precise: it is the specified post-reset state that we have so far
failed to restore.

 + * enabled since it lives 
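
(The hunk is cut off above. A sketch of what the rest plausibly does, based
on the commit message and Jan's suggestion to call into the update functions;
this is a reconstruction, not the verbatim patch, and the variable names are
assumed:)

    /* Hypothetical reconstruction, not the verbatim patch. */
    if (adev->assigned_irq_type == ASSIGNED_IRQ_MSIX) {
        uint16_t ctrl = pci_get_word(pci_dev->config +
                                     pci_dev->msix_cap + PCI_MSIX_FLAGS);

        pci_set_word(pci_dev->config + pci_dev->msix_cap + PCI_MSIX_FLAGS,
                     ctrl & ~PCI_MSIX_FLAGS_ENABLE);
        assigned_dev_update_msix(pci_dev);    /* drops the device to INTx */
    } else if (adev->assigned_irq_type == ASSIGNED_IRQ_MSI) {
        uint8_t ctrl = pci_get_byte(pci_dev->config +
                                    pci_dev->msi_cap + PCI_MSI_FLAGS);

        pci_set_byte(pci_dev->config + pci_dev->msi_cap + PCI_MSI_FLAGS,
                     ctrl & ~PCI_MSI_FLAGS_ENABLE);
        assigned_dev_update_msi(pci_dev);
    }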

Re: KVM and Hylafax

2012-04-05 Thread Federico Fanton

We have a Hylafax 5.2.5 CentOS 5 installation hosted inside a Xen virtual
machine. It works quite well, but now I'm in the process of
upgrading/migrating it to a KVM virtual machine running Ubuntu 10.04.

The problem I'm having is that - while receiving works fine - sending faxes
is extremely unreliable: I get lots of "No response to MPS repeated 3
tries" or "Failure to transmit clean ECM image data" errors, just what would
happen if the phone line were noisy.


For the record, I switched to a CentOS VM instead of an Ubuntu one, and
now it's working much better.



[Bug 42636] PCI passthrough does not work with AMD iommu for PCI device

2012-04-05 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42636





--- Comment #9 from Klaus Mueller kmuel...@justmail.de  2012-04-05 08:08:58 
---
I tested the same device here and got the same error as you reported.

Good to know that Xen doesn't have any problem. This really means that it is
most probably a kvm, qemu, or AMD iommu bug.

kvm 0.15, kernel: linux 3.1.10 (openSUSE)



Re: CPU softlockup due to smp_call_function()

2012-04-05 Thread Sasha Levin
On Thu, Apr 5, 2012 at 6:06 AM, Milton Miller milt...@bga.com wrote:

 On Wed, 4 Apr 2012 about 22:12:36 +0200, Sasha Levin wrote:
 I've starting seeing soft lockups resulting from smp_call_function()
 calls. I've attached two different backtraces of this happening with
 different code paths.

 This is running inside a KVM guest with the trinity fuzzer, using
 today's linux-next kernel.

 Hi Sasha.

 You have two different call sites (arch/x86/mm/pageattr.c
 cpa_flush_range and net/core/dev.c netdev_run_todo), and both
 call on_each_cpu with wait=1.

 I tried a few options but can't get close enough to your compiled
 length of 2a0 to know if the code is spinning on the first
 csd_lock_wait in csd_lock or in the second csd_lock_wait after the
 call to arch_send_call_function_ipi_mask (aka smp_ops + 0x44 in my
 x86_64 compile).  Please check your disassembly and report.

Hi Milton,

# addr2line -i -e /usr/src/linux/vmlinux 8111f30e
/usr/src/linux/kernel/smp.c:102
/usr/src/linux/kernel/smp.c:554

So it's the 2nd lock.

I'll work on adding the debug code mentioned in your mail.
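
For reference, kernel/smp.c:102 lands in csd_lock_wait(), which in kernels
of that era looked roughly like this (an abridged sketch, not verbatim
source); the stuck cpu is busy-waiting for the target cpu's IPI handler to
clear CSD_FLAG_LOCK:

    /* Abridged sketch of kernel/smp.c circa 3.4-rc1. */
    static void csd_lock_wait(struct call_single_data *data)
    {
        /* The IPI handler on the target cpu clears CSD_FLAG_LOCK once it
         * has run the function; until then we spin here. */
        while (data->flags & CSD_FLAG_LOCK)
            cpu_relax();
    }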


Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-04-05 Thread Raghavendra K T

On 04/01/2012 07:23 PM, Avi Kivity wrote:

On 04/01/2012 04:48 PM, Raghavendra K T wrote:

I have patch something like below in mind to try:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d3b98b1..5127668 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * else and called schedule in __vcpu_run.  Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
+	 * Priority is given to vcpu that are unhalted.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 3 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && !vcpu->pv_unhalted)
+				continue;
+			else if (pass == 1 && i < last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass == 2 && i > last_boosted_vcpu)
 				break;
 			if (vcpu == me)
 				continue;





[...]


I'm interested in how PLE does vs. your patches, both with PLE enabled
and disabled.



 Here are the results taken on a PLE machine. They seem to support all
our assumptions.

 Following are the observations from the results:

 1) There is a huge benefit for the non-PLE based configuration
(base_nople vs pv_ple): around 90%.

 2) The ticketlock + kvm patches do go well along with PLE; more benefit
is seen, not degradation (base_ple vs pv_ple).

 3) The ticketlock + kvm patches make a non-PLE machine behave almost
like a PLE-enabled machine (base_ple vs pv_nople).

 4) The ple handler modification patches seem to give an advantage
(pv_ple vs pv_ple_optimized). More study is needed, probably with the
higher M/N ratio Avi pointed out.

 configurations:

 base_nople       = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=n, without PLE
 base_ple         = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=n, with PLE
 pv_ple           = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y, with PLE,
                    + ticketlock + kvm patches
 pv_nople         = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y, without PLE,
                    + ticketlock + kvm patches
 pv_ple_optimized = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y, with PLE,
                    + optimization patch + ticketlock + kvm patches
                    + ple_handler modification (yield to kicked vcpu)


 Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU, 32 cores
(8 online cores) and 4*64GB RAM.

 3 guests running with 2GB RAM and 8 vCPUs each.

 Results:
 ---
 case A)
 1x: 1 kernbench 2 idle
 2x: 1 kernbench 1 while1 hog 1 idle
 3x: 1 kernbench 2 while1 hog

 Average time taken in sec for a kernbench run (std). [ lower is better ]


      base_nople          base_ple           pv_ple             pv_nople           pv_ple_optimized

 1x   72.8284 (89.8757)   70.475 (85.6979)   63.5033 (72.7041)  65.7634 (77.0504)  64.3284 (73.2688)
 2x   823.053 (1113.05)   110.971 (132.829)  105.099 (128.738)  139.058 (165.156)  106.268 (129.611)
 3x   3244.37 (4707.61)   150.265 (184.766)  138.341 (172.69)   139.106 (163.549)  133.238 (168.388)



   Percentage improvement calculation w.r.t base_nople
   ----------------------------------------------------

        base_ple   pv_ple    pv_nople   pv_ple_optimized

   1x   3.23143    12.8042   9.70089    11.6713
   2x   86.5172    87.2306   83.1046    87.0886
   3x   95.3684    95.736    95.7124    95.8933

 ---------------------------------------------------
   Percentage improvement calculation w.r.t base_ple
   --------------------------------------------------

        base_nople   pv_ple    pv_nople   pv_ple_optimized

   1x   -3.3393      9.89244   6.68549    8.72167
   2x   -641.683     5.29147   -25.3102   4.23804
   3x   -2059.1      7.93531   7.42621    11.3313


 case B)
 all 3 guests running kernbench
 Average time taken in sec for a kernbench run (std). [ lower is better ]
 Note that std is calculated over the 6*3 run averages from all 3 guests,
as given by kernbench.


 base_nople            base_ple                pv_ple                  pv_nople              pv_ple_opt
 2886.92 (18.289131)   204.80333 (7.1784039)   200.22517 (10.134804)   202.091 (12.249673)   201.60683 (7.881737)



   Percentage improvement calculation w.r.t base_nople
   ----------------------------------------------------

   base_ple   pv_ple    pv_nople   pv_ple_optimized
   92.9058    93.0644   93         93.0166



   Percentage improvement calculation w.r.t base_ple
   --------------------------------------------------

   base_nople   pv_ple   pv_nople   pv_ple_optimized
   -1309.606    2.2354   1.324      1.5607

 I hope the experimental results will convey the same message if somebody
else does the benchmarking.


 Also, as Ian pointed out in the thread, the earlier results from Attilio
and me were to convince that 

Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-04-05 Thread Avi Kivity
On 04/02/2012 12:51 PM, Raghavendra K T wrote:
 On 04/01/2012 07:23 PM, Avi Kivity wrote:
  On 04/01/2012 04:48 PM, Raghavendra K T wrote:
  I have patch something like below in mind to try:
 
  diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
  index d3b98b1..5127668 100644
  --- a/virt/kvm/kvm_main.c
  +++ b/virt/kvm/kvm_main.c
@@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * else and called schedule in __vcpu_run.  Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
+	 * Priority is given to vcpu that are unhalted.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 3 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && !vcpu->pv_unhalted)
+				continue;
+			else if (pass == 1 && i < last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass == 2 && i > last_boosted_vcpu)
 				break;
 			if (vcpu == me)
 				continue;
 
 
  Actually I think this is unneeded.  The loop tries to find vcpus that
  are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
  don't match this condition.
 

 Oh! I think I misinterpreted your statement. Hmm, I got it: you told me to
 remove the if (vcpu == me) condition.

No, the entire patch is unneeded.  My original comment was:

 "from the PLE handler, don't wake up a vcpu that is sleeping because it
 is waiting for a kick"

But the PLE handler never wakes up sleeping vcpus anyway.

There's still a conflict with PLE in that it may trigger during the spin
phase and send a random yield_to() somewhere.  Maybe it's sufficient to
tune the PLE timeout to be longer than the spinlock timeout.
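
(On Intel hosts the PLE window is the kvm-intel module parameter ple_window,
default 4096, with ple_gap defaulting to 128; something like the following
would lengthen it, though the example value is arbitrary and
workload-dependent:)

    # current values
    cat /sys/module/kvm_intel/parameters/ple_gap
    cat /sys/module/kvm_intel/parameters/ple_window
    # reload with a longer PLE window (example value only)
    rmmod kvm_intel && modprobe kvm_intel ple_window=16384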


-- 
error compiling committee.c: too many arguments to function



Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-04-05 Thread Avi Kivity
On 04/02/2012 12:26 PM, Thomas Gleixner wrote:
  One thing about it is that it can give many false positives.  Consider a
  fine-grained spinlock that is being accessed by many threads.  That is,
  the lock is taken and released with high frequency, but there is no
  contention, because each vcpu is accessing a different instance.  So the
  host scheduler will needlessly delay preemption of vcpus that happen to
  be holding a lock, even though this gains nothing.

 You're talking about per cpu locks, right? I can see the point there,
 but per cpu stuff might be worth annotating to avoid this.

Or any lock which is simply uncontended.  Say a single process is
running, the rest of the system is idle.  It will take and release many
locks; but it can be preempted at any point by the hypervisor with no
performance loss.

The overhead is arming a timer twice and an extra exit per deferred
context switch.  Perhaps not much given that you don't see tons of
context switches on virt workloads, at least without threaded interrupts
(or maybe interrupt threads should override this mechanism, by being
realtime threads).

  A second issue may happen with a lock that is taken and released with
  high frequency, with a high hold percentage.  The host scheduler may
  always sample the guest in a held state, leading it to conclude that
  it's exceeding its timeout when in fact the lock is held for a short
  time only.

 Well, no. You can avoid that.

  host                      VCPU
                              spin_lock()
                              modify_global_state()
                              exit
  check_global_state()
  mark_global_state()
  reschedule vcpu
                              spin_unlock()
                              check_global_state()
                              trigger_exit()

 So when an exit occurs in the locked section, the host can mark
 the global state to tell the guest to issue a trap on unlock.

Right.

How does this nest?  Do we trigger the exit on the outermost unlock?
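
(One way the nesting could work, as a purely hypothetical guest-side sketch
of the scheme above, not an existing interface: keep a per-vcpu depth
counter and only honor the host-set flag on the outermost unlock.)

    /* Hypothetical guest-side sketch; hypercall_trigger_exit() is made up. */
    extern void hypercall_trigger_exit(void);

    struct pv_defer_state {
        int lock_depth;        /* nesting depth of held spinlocks */
        int host_wants_exit;   /* set by the host instead of preempting */
    };

    static void pv_lock_acquired(struct pv_defer_state *s)
    {
        s->lock_depth++;
    }

    static void pv_lock_released(struct pv_defer_state *s)
    {
        /* Inner unlocks stay cheap; only the outermost unlock checks
         * whether the host asked for a deferred context switch. */
        if (--s->lock_depth == 0 && s->host_wants_exit) {
            s->host_wants_exit = 0;
            hypercall_trigger_exit();
        }
    }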

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-05 Thread Michael S. Tsirkin
On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote:
 We've hit a kernel host panic, when issuing a 'system_reset' with an
 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
 
 [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
 32993
 [Hardware Error]: APEI generic hardware error status
 [Hardware Error]: severity: 1, fatal
 [Hardware Error]: section: 0, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 [Hardware Error]: section: 1, severity: 1, fatal
 [Hardware Error]: flags: 0x01
 [Hardware Error]: primary
 [Hardware Error]: section_type: PCIe error
 [Hardware Error]: port_type: 0, PCIe end point
 [Hardware Error]: version: 1.0
 [Hardware Error]: command: 0x, status: 0x0010
 [Hardware Error]: device_id: :08:00.0
 [Hardware Error]: slot: 1
 [Hardware Error]: secondary_bus: 0x00
 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
 [Hardware Error]: class_code: 02
 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
 [Hardware Error]: Unsupported Request
 [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
 [Hardware Error]: aer_uncor_severity: 0x00067011
 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
 Kernel panic - not syncing: Fatal hardware error!
 Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
 Call Trace:
  NMI  [814f2fe5] ? panic+0xa0/0x168
  [812f919c] ? ghes_notify_nmi+0x17c/0x180
  [814f91d5] ? notifier_call_chain+0x55/0x80
  [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
  [8109667e] ? notify_die+0x2e/0x30
  [814f6e81] ? do_nmi+0x1a1/0x2b0
  [814f6760] ? nmi+0x20/0x30
  [8103762b] ? native_safe_halt+0xb/0x10
  EOE  [8101495d] ? default_idle+0x4d/0xb0
  [81009e06] ? cpu_idle+0xb6/0x110
  [814da63a] ? rest_init+0x7a/0x80
  [81c1ff7b] ? start_kernel+0x424/0x430
  [81c1f33a] ? x86_64_start_reservations+0x125/0x129
  [81c1f438] ? x86_64_start_kernel+0xfa/0x109
 
 The root cause of the problem is that the 'reset_assigned_device()' code
 first writes a 0 to the command register. Then, when qemu subsequently does
 a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
 the kernel ends up calling '__msix_mask_irq()', which performs a write to
 the memory mapped msi vector space. Since we've explicitly told the device
 to disallow mmio access (via the 0 write to the command register), we end
 up with the above 'Unsupported Request'.
 
 The fix here is to first disable MSI-X, before doing the reset.  We also
 disable MSI, leaving the device in INTx mode.  In this way, the device is
 in a known state after reset, and we avoid touching msi memory mapped space
 on any subsequent 'kvm_deassign_irq()'.
 
 Thanks to Michael S. Tsirkin for help in understanding what was going on
 here, and to Jason Baron, the original debugger of this problem.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
 Jason is out of the office for a couple weeks, so I'll try to resolve
 this while he's away.  Somehow the emulated config updates were lost
 in Jason's original posting, so I've fixed that and taken Jan's suggestion
 to simply call into the update functions instead of open coding the
 interrupt disable.  I think there still may be some disagreements about
 how to handle guest generated errors in the host, but that's a large
 project whereas this is something we should be doing at reset anyway,
 and even if only a workaround, resolves the problem above.

I don't think there's a disagreement: don't we all agree
they should be forwarded to qemu and on to the guest if possible,
ignored if not?

  hw/device-assignment.c |   23 +++
  1 files changed, 23 insertions(+), 0 deletions(-)
 
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 89823f1..2e6b93e 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
  const char reset[] = "1";
  int fd, ret;
  
 +/*
 + * If a guest is reset without being shutdown, MSI/MSI-X can still
 + * be running.  We want to return the device to a known state on
 + * reset, so disable those 

Re: [PATCH v4 1/4] kvm: Update kernel headers against kvm.git

2012-04-05 Thread Michael S. Tsirkin
On Thu, Mar 08, 2012 at 11:10:24AM +0100, Jan Kiszka wrote:
 Required for INTx masking / host IRQ sharing support of assigned
 devices.
 
 Corresponding kvm.git hash: 186195928e
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Acked-by: Michael S. Tsirkin m...@redhat.com

 ---
  linux-headers/asm-s390/kvm.h |2 ++
  linux-headers/linux/kvm.h|6 ++
  2 files changed, 8 insertions(+), 0 deletions(-)
 
 diff --git a/linux-headers/asm-s390/kvm.h b/linux-headers/asm-s390/kvm.h
 index 9acbde4..9607667 100644
 --- a/linux-headers/asm-s390/kvm.h
 +++ b/linux-headers/asm-s390/kvm.h
 @@ -44,10 +44,12 @@ struct kvm_guest_debug_arch {
  #define KVM_SYNC_PREFIX (1UL << 0)
  #define KVM_SYNC_GPRS   (1UL << 1)
  #define KVM_SYNC_ACRS   (1UL << 2)
 +#define KVM_SYNC_CRS    (1UL << 3)
  /* definition of registers in kvm_run */
  struct kvm_sync_regs {
   __u64 prefix;   /* prefix register */
   __u64 gprs[16]; /* general purpose registers */
   __u32 acrs[16]; /* access registers */
 + __u64 crs[16];  /* control registers */
  };
  #endif
 diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
 index f6b5343..cf7b098 100644
 --- a/linux-headers/linux/kvm.h
 +++ b/linux-headers/linux/kvm.h
 @@ -588,6 +588,7 @@ struct kvm_ppc_pvinfo {
  #define KVM_CAP_TSC_DEADLINE_TIMER 72
  #define KVM_CAP_S390_UCONTROL 73
  #define KVM_CAP_SYNC_REGS 74
 +#define KVM_CAP_PCI_2_3 75
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
 @@ -784,6 +785,9 @@ struct kvm_s390_ucas_mapping {
  /* Available with KVM_CAP_TSC_CONTROL */
  #define KVM_SET_TSC_KHZ   _IO(KVMIO,  0xa2)
  #define KVM_GET_TSC_KHZ   _IO(KVMIO,  0xa3)
 +/* Available with KVM_CAP_PCI_2_3 */
 +#define KVM_ASSIGN_SET_INTX_MASK  _IOW(KVMIO,  0xa4, \
 +struct kvm_assigned_pci_dev)
  
  /*
   * ioctls for vcpu fds
 @@ -857,6 +861,8 @@ struct kvm_s390_ucas_mapping {
  #define KVM_SET_ONE_REG_IOW(KVMIO,  0xac, struct kvm_one_reg)
  
  #define KVM_DEV_ASSIGN_ENABLE_IOMMU  (1 << 0)
 +#define KVM_DEV_ASSIGN_PCI_2_3       (1 << 1)
 +#define KVM_DEV_ASSIGN_MASK_INTX     (1 << 2)
  
  struct kvm_assigned_pci_dev {
   __u32 assigned_dev_id;
 -- 
 1.7.3.4


Re: [PATCH v4 2/4] kvm: Introduce kvm_has_intx_set_mask

2012-04-05 Thread Michael S. Tsirkin
On Thu, Mar 08, 2012 at 11:10:25AM +0100, Jan Kiszka wrote:
 Will be used by PCI device assignment code.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Acked-by: Michael S. Tsirkin m...@redhat.com

 ---
  kvm-all.c |8 
  kvm.h |1 +
  2 files changed, 9 insertions(+), 0 deletions(-)
 
 diff --git a/kvm-all.c b/kvm-all.c
 index 7fed5c9..471293d 100644
 --- a/kvm-all.c
 +++ b/kvm-all.c
 @@ -78,6 +78,7 @@ struct KVMState
  int xsave, xcrs;
  int many_ioeventfds;
  int pit_state2;
 +int intx_set_mask;
  int irqchip_inject_ioctl;
  #ifdef KVM_CAP_IRQ_ROUTING
  struct kvm_irq_routing *irq_routes;
 @@ -960,6 +961,8 @@ int kvm_init(void)
  s->pit_state2 = kvm_check_extension(s, KVM_CAP_PIT_STATE2);
  #endif
  
 +s->intx_set_mask = kvm_check_extension(s, KVM_CAP_PCI_2_3);
 +
  ret = kvm_arch_init(s);
  if (ret < 0) {
  goto err;
 @@ -1319,6 +1322,11 @@ int kvm_has_gsi_routing(void)
  #endif
  }
  
 +int kvm_has_intx_set_mask(void)
 +{
 +    return kvm_state->intx_set_mask;
 +}
 +
  int kvm_allows_irq0_override(void)
  {
  return !kvm_irqchip_in_kernel() || kvm_has_gsi_routing();
 diff --git a/kvm.h b/kvm.h
 index 43d4c68..1885be1 100644
 --- a/kvm.h
 +++ b/kvm.h
 @@ -57,6 +57,7 @@ int kvm_has_xcrs(void);
  int kvm_has_many_ioeventfds(void);
  int kvm_has_pit_state2(void);
  int kvm_has_gsi_routing(void);
 +int kvm_has_intx_set_mask(void);
  
  int kvm_allows_irq0_override(void);
  
 -- 
 1.7.3.4


Re: [PATCH v4 3/4] pci-assign: Use PCI-2.3-based shared legacy interrupts

2012-04-05 Thread Michael S. Tsirkin
On Thu, Mar 08, 2012 at 11:10:26AM +0100, Jan Kiszka wrote:
 Enable the new KVM feature that allows legacy interrupt sharing for
 PCI-2.3-compliant devices. This requires to synchronize any guest
 change of the INTx mask bit to the kernel.
 
 The feature is controlled by the property 'share_intx' and is off by
 default for now.
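 
 (For illustration, a hypothetical qemu-kvm invocation using the property;
 the host BDF is made up:)
 
     qemu-kvm ... -device pci-assign,host=08:00.0,share_intx=on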
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Acked-by: Michael S. Tsirkin m...@redhat.com

 ---
  hw/device-assignment.c |   25 +
  hw/device-assignment.h |   10 ++
  qemu-kvm.c |9 +
  qemu-kvm.h |2 ++
  4 files changed, 42 insertions(+), 4 deletions(-)
 
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index a5f1abb..b7cabd4 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -782,6 +782,13 @@ static int assign_device(AssignedDevice *dev)
                 "cause host memory corruption if the device issues DMA write "
                 "requests!\n");
  }
 +    if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK) {
 +        assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
 +
 +        /* hide host-side INTx masking from the guest */
 +        dev->emulate_config_read[PCI_COMMAND + 1] |=
 +            PCI_COMMAND_INTX_DISABLE >> 8;
 +}
  
      r = kvm_assign_pci_device(kvm_state, &assigned_dev_data);
      if (r < 0) {
 @@ -1121,10 +1128,26 @@ static void assigned_dev_pci_write_config(PCIDevice *pci_dev, uint32_t address,
                                            uint32_t val, int len)
  {
  AssignedDevice *assigned_dev = DO_UPCAST(AssignedDevice, dev, pci_dev);
 +    uint16_t old_cmd = pci_get_word(pci_dev->config + PCI_COMMAND);
  uint32_t emulate_mask, full_emulation_mask;
 +int ret;
  
  pci_default_write_config(pci_dev, address, val, len);
  
 +    if (kvm_has_intx_set_mask() &&
 +        range_covers_byte(address, len, PCI_COMMAND + 1)) {
 +        bool intx_masked = (pci_get_word(pci_dev->config + PCI_COMMAND) &
 +                            PCI_COMMAND_INTX_DISABLE);
 +
 +        if (intx_masked != !!(old_cmd & PCI_COMMAND_INTX_DISABLE)) {
 +            ret = kvm_device_intx_set_mask(kvm_state,
 +                                           calc_assigned_dev_id(assigned_dev),
 +                                           intx_masked);
 +            if (ret) {
 +                perror("assigned_dev_pci_write_config: set intx mask");
 +            }
 +        }
 +    }
      if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) {
          if (range_covers_byte(address, len,
                                pci_dev->msi_cap + PCI_MSI_FLAGS)) {
 @@ -1748,6 +1771,8 @@ static Property da_properties[] =
                      ASSIGNED_DEVICE_USE_IOMMU_BIT, true),
      DEFINE_PROP_BIT("prefer_msi", AssignedDevice, features,
                      ASSIGNED_DEVICE_PREFER_MSI_BIT, true),
 +    DEFINE_PROP_BIT("share_intx", AssignedDevice, features,
 +                    ASSIGNED_DEVICE_SHARE_INTX_BIT, false),
      DEFINE_PROP_INT32("bootindex", AssignedDevice, bootindex, -1),
      DEFINE_PROP_STRING("configfd", AssignedDevice, configfd_name),
      DEFINE_PROP_END_OF_LIST(),
 diff --git a/hw/device-assignment.h b/hw/device-assignment.h
 index b4bcfa6..5d271d5 100644
 --- a/hw/device-assignment.h
 +++ b/hw/device-assignment.h
 @@ -74,11 +74,13 @@ typedef struct {
  PCIRegion *region;
  } AssignedDevRegion;
  
 -#define ASSIGNED_DEVICE_USE_IOMMU_BIT0
 -#define ASSIGNED_DEVICE_PREFER_MSI_BIT   1
 +#define ASSIGNED_DEVICE_USE_IOMMU_BIT   0
 +#define ASSIGNED_DEVICE_PREFER_MSI_BIT  1
 +#define ASSIGNED_DEVICE_SHARE_INTX_BIT  2
  
 -#define ASSIGNED_DEVICE_USE_IOMMU_MASK   (1 << ASSIGNED_DEVICE_USE_IOMMU_BIT)
 -#define ASSIGNED_DEVICE_PREFER_MSI_MASK  (1 << ASSIGNED_DEVICE_PREFER_MSI_BIT)
 +#define ASSIGNED_DEVICE_USE_IOMMU_MASK  (1 << ASSIGNED_DEVICE_USE_IOMMU_BIT)
 +#define ASSIGNED_DEVICE_PREFER_MSI_MASK (1 << ASSIGNED_DEVICE_PREFER_MSI_BIT)
 +#define ASSIGNED_DEVICE_SHARE_INTX_MASK (1 << ASSIGNED_DEVICE_SHARE_INTX_BIT)
  
  typedef struct {
  uint32_t addr_lo;
 diff --git a/qemu-kvm.c b/qemu-kvm.c
 index 09a35f0..2047ebb 100644
 --- a/qemu-kvm.c
 +++ b/qemu-kvm.c
 @@ -54,6 +54,15 @@ static int kvm_old_assign_irq(KVMState *s,
  return kvm_vm_ioctl(s, KVM_ASSIGN_IRQ, assigned_irq);
  }
  
 +int kvm_device_intx_set_mask(KVMState *s, uint32_t dev_id, bool masked)
 +{
 +struct kvm_assigned_pci_dev assigned_dev;
 +
 +assigned_dev.assigned_dev_id = dev_id;
 +assigned_dev.flags = masked ? KVM_DEV_ASSIGN_MASK_INTX : 0;
 +    return kvm_vm_ioctl(s, KVM_ASSIGN_SET_INTX_MASK, &assigned_dev);
 +}
 +
  #ifdef KVM_CAP_ASSIGN_DEV_IRQ
  int kvm_assign_irq(KVMState *s, struct kvm_assigned_irq *assigned_irq)
  {
 diff --git a/qemu-kvm.h b/qemu-kvm.h
 index 3c4f023..2b23daf 100644
 --- a/qemu-kvm.h
 +++ b/qemu-kvm.h
 @@ -114,6 +114,8 @@ int kvm_assign_irq(KVMState *s, struct kvm_assigned_irq *assigned_irq);
   */
  int kvm_deassign_irq(KVMState *s, 

Re: [PATCH v4 4/4] pci_assign: Flip defaults of prefer_msi and share_intx

2012-04-05 Thread Michael S. Tsirkin
On Thu, Mar 08, 2012 at 11:10:27AM +0100, Jan Kiszka wrote:
 INTx sharing is a bit more expensive than exclusive host interrupts, but
 this channel is not supposed to be used for high-performance scenarios
 anyway. Modern devices support MSI/MSI-X and do not depend on using INTx
 under critical workloads; really old devices do not support INTx sharing
 anyway.
 
 For those in the middle, the user experience is much better if they just
 work even when IRQ sharing is required. If there is nothing to share,
 share_intx=off can still be applied as a tuning parameter.
 
 With INTx sharing as default, the primary reason for prefer_msi=on is
 gone. Make it default off, specifically as it is known to cause troubles
 with devices that have incomplete/broken MSI support or otherwise
 stumble if host IRQ configuration does not match guest driver
 expectation.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com

I'm not familiar enough with what prefer_msi does
so no comment here. Maybe Alex can ack.
If not, I'll try to find the time to understand it
a bit better, but that will take some time.

  hw/device-assignment.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index b7cabd4..caa3602 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -1770,9 +1770,9 @@ static Property da_properties[] =
      DEFINE_PROP_BIT("iommu", AssignedDevice, features,
                      ASSIGNED_DEVICE_USE_IOMMU_BIT, true),
      DEFINE_PROP_BIT("prefer_msi", AssignedDevice, features,
 -                    ASSIGNED_DEVICE_PREFER_MSI_BIT, true),
 +                    ASSIGNED_DEVICE_PREFER_MSI_BIT, false),
      DEFINE_PROP_BIT("share_intx", AssignedDevice, features,
 -                    ASSIGNED_DEVICE_SHARE_INTX_BIT, false),
 +                    ASSIGNED_DEVICE_SHARE_INTX_BIT, true),
      DEFINE_PROP_INT32("bootindex", AssignedDevice, bootindex, -1),
      DEFINE_PROP_STRING("configfd", AssignedDevice, configfd_name),
      DEFINE_PROP_END_OF_LIST(),
 -- 
 1.7.3.4


Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-04-05 Thread Raghavendra K T

On 04/05/2012 02:31 PM, Avi Kivity wrote:

On 04/02/2012 12:51 PM, Raghavendra K T wrote:

On 04/01/2012 07:23 PM, Avi Kivity wrote:

On 04/01/2012 04:48 PM, Raghavendra K T wrote:

I have patch something like below in mind to try:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d3b98b1..5127668 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * else and called schedule in __vcpu_run.  Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
+	 * Priority is given to vcpu that are unhalted.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 3 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && !vcpu->pv_unhalted)
+				continue;
+			else if (pass == 1 && i < last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass == 2 && i > last_boosted_vcpu)
 				break;
 			if (vcpu == me)
 				continue;



Actually I think this is unneeded.  The loop tries to find vcpus that
are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
don't match this condition.



Oh! I think I misinterpreted your statement. Hmm, I got it: you told me to
remove the if (vcpu == me) condition.


No, the entire patch is unneeded.  My original comment was:


"from the PLE handler, don't wake up a vcpu that is sleeping because it
is waiting for a kick"

But the PLE handler never wakes up sleeping vcpus anyway.


I agree with you. It is already doing that. But my approach here is a
little different.

We have two classes of vcpus (one is a subset of the other when we try to
do yield_to), viz:

1) runnable and kicked  (a subset of)  2) just runnable

What we are trying to do here is target 1) first so that we get good
lock progress.

Here is the sequence I was talking about:

vcpu1 releases lock -> finds that vcpu2 is the next candidate ->
kick hypercall -> kvm_pv_kick_cpu_op -> set kicked flag ->
vcpu->kick(vcpu2)

At this point vcpu2 is waiting to get scheduled, but the above yield
call can wake *anybody*.

I agree this is not serious (rather, it is overhead) when there is a
small number of vcpus, but when we have a larger number of vcpus/VMs it
may not scale well. My attempt was to fix that.

Let me know if I am completely missing something..



There's still a conflict with PLE in that it may trigger during the spin
phase and send a random yield_to() somewhere.  Maybe it's sufficient to
tune the PLE timeout to be longer than the spinlock timeout.



Ok... but we should also be cautious: we may halt more often even though
we are about to get the spinlock.

Need more study on this.



Re: CPU softlockup due to smp_call_function()

2012-04-05 Thread Avi Kivity
On 04/04/2012 11:12 PM, Sasha Levin wrote:
 Hi all,

 I've starting seeing soft lockups resulting from smp_call_function()
 calls. I've attached two different backtraces of this happening with
 different code paths.

 This is running inside a KVM guest with the trinity fuzzer, using
 today's linux-next kernel.

 [ 6540.134009] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u:1:38]
 [ 6540.134048] irq event stamp: 286811770
 [ 6540.134048] hardirqs last  enabled at (286811769):
 [82669e74] restore_args+0x0/0x30
 [ 6540.134048] hardirqs last disabled at (286811770):
 [8266b3ea] apic_timer_interrupt+0x6a/0x80
 [ 6540.134048] softirqs last  enabled at (286811768):
 [810b746e] __do_softirq+0x16e/0x190
 [ 6540.134048] softirqs last disabled at (286811749):
 [8266bdec] call_softirq+0x1c/0x30
 [ 6540.134048] CPU 0
 [ 6540.134048] Pid: 38, comm: kworker/u:1 Tainted: GW
 3.4.0-rc1-next-20120404-sasha-dirty #72
 [ 6540.134048] RIP: 0010:[8111f30e]  [8111f30e]
 smp_call_function_many+0x27e/0x2a0


This cpu is waiting for some other cpu to process a function (likely
rps_trigger_softirq(), from the trace).  Can you get a backtrace on all
cpus when this happens?

It would be good to enhance smp_call_function_*() to do this
automatically when it happens - it's spinning there anyway, so it might
as well count the iterations and NMI the lagging cpu if it waits for too
long.
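
(A hypothetical sketch of that enhancement, not actual kernel code; the spin
threshold is made up, and trigger_all_cpu_backtrace() is the existing NMI
backtrace helper:)

    #define CSD_SPIN_LIMIT  (1UL << 28)   /* arbitrary threshold */

    static void csd_lock_wait(struct call_single_data *data)
    {
        unsigned long spins = 0;

        while (data->flags & CSD_FLAG_LOCK) {
            cpu_relax();
            /* Waited too long: NMI the cpus so the laggard dumps
             * its backtrace. */
            if (++spins == CSD_SPIN_LIMIT)
                trigger_all_cpu_backtrace();
        }
    }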

-- 
error compiling committee.c: too many arguments to function



Re: Questing regarding KVM Guest PMU

2012-04-05 Thread Avi Kivity
On 04/04/2012 01:29 PM, Gleb Natapov wrote:
  
  ok. seems to be. will move over to perf as its working fine inside guest.
  
 Good riddance IMO. I managed to run it on a guest (but not on my
 host!). The thing is buggy. It does not use global ctrl MSR to enable
 counters and kvm has all of them disabled by default. I didn't find what
 value this MSR should have after reset, so this may be either kvm bug or
 real BIOSes enable all counters in global ctrl MSR for PMUv1
 compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. The
 second problem is that oprofile reprograms PMU counters without
 disabling them first, and this is explicitly prohibited by the Intel SDM.
 The patch below solves that, but oprofile is the one that should be fixed.

Both should be fixed, there may be other profilers affected.


 diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
 index a73f0c1..be05028 100644
 --- a/arch/x86/kvm/pmu.c
 +++ b/arch/x86/kvm/pmu.c
 @@ -396,6 +396,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
 		    (pmc = get_fixed_pmc(pmu, index))) {
 		data = (s64)(s32)data;
 		pmc->counter += data - read_pmc(pmc);
 +		reprogram_gp_counter(pmc, pmc->eventsel);
 		return 0;
 	} else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
 		if (data == pmc->eventsel)


-- 
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 0/1] NUMA aware scheduling per vhost thread patch

2012-04-05 Thread Michael S. Tsirkin
On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote:
 On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote:
  Hi:
  
  Thanks for the work and it looks very reasonable, some questions
  below.

Yes I am happy to see the per-cpu work resurrected.
Some comments below.

  On 03/23/2012 07:48 AM, Shirley Ma wrote:
   Sorry for being late to submit this patch. I have spent lots of time
   trying to find the best approach. This effort is still going on...
  
   This patch is built against net-next tree.
  
   This is an experimental RFC patch. The purpose of this patch is to
   address KVM networking scalability and NUMA scheduling issue.
  
  Need also test for non-NUMA machine, I see that you just choose the
  cpu 
  that initiates the work for non-numa machine which seems sub optimal.
 
 Good suggestion. I don't have any non-numa systems, but KK ran some
 tests on a non-numa system. He could see around 20% performance gain for
 a single VM, local host to guest. I hope we can run a full test on a
 non-numa system.
 
 On a non-numa system, the same per-cpu vhost thread will always be picked
 consistently for a particular vq since all cores are on the same cpu
 socket. So there will be two per-cpu vhost threads handling TX/RX
 simultaneously.
 
   The existing implementation of vhost creates a vhost thread per device
   (virtio_net). RX and TX work of a VM's device is handled by the
   same vhost thread.
  
   One of the limitations of this implementation is that with an increasing
   number of VMs or virtio-net interfaces, more vhost threads are
   created; they consume more kernel resources and induce more thread
   context switches/scheduling overhead. We noticed that KVM network
   performance doesn't scale with an increasing number of VMs.
  
   The other limitation is that with a single vhost thread processing both
   RX and TX, the work can be blocked. So we created this per-cpu vhost
   thread implementation, where the number of vhost threads is limited to
   the number of cpus on the host.
  
   To address these limitations, we are proposing a per-cpu vhost thread
   model where the number of vhost threads is limited and equal to the
   number of online cpus on the host.
  
  The number of vhost threads needs more consideration. Consider that we
  have a 1024-core host with a card that has 16 tx/rx queues; do we really
  need 1024 vhost threads?
 
 In this case, we could add a module parameter to limit the number of
 cores/sockets to be used.

Hmm. And then which cores would we run on?
Also, is the parameter different between guests?
Another idea is to scale the # of threads on demand.

Sharing the same thread between guests is also an
interesting approach; if we did this then per-cpu
wouldn't be so expensive, but making this work well
with cgroups would be a challenge.


  
   Based on our testing experience, the vcpus can be scheduled across cpu
   sockets even when the number of vcpus is smaller than the number of
   cores per cpu socket and there are no other activities besides the KVM
   networking workload. We found that if the vhost thread is scheduled on
   the same socket as the one where the work is received, the performance
   is better.
  
   So in this per-cpu vhost thread implementation, a vhost thread is
   selected dynamically based on where the TX/RX work is initiated: a vhost
   thread on the same cpu socket is selected, but not on the same cpu as
   the vcpu/interrupt thread that initiated the TX/RX work.
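
  [A minimal sketch of such a selection policy, for illustration only and
  not from the posted patch (vhost_pick_cpu and work_cpu are made-up names;
  the NUMA helpers are standard kernel API):]

      /* Pick a per-cpu vhost thread on the same NUMA node as the cpu
       * that initiated the work, but not that cpu itself. */
      static int vhost_pick_cpu(int work_cpu)
      {
          int node = cpu_to_node(work_cpu);
          int cpu;

          for_each_cpu(cpu, cpumask_of_node(node)) {
              if (cpu != work_cpu && cpu_online(cpu))
                  return cpu;   /* same socket, different core */
          }
          return work_cpu;      /* fall back to the initiating cpu */
      }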
  
   When we tested this RFC patch, the other interesting thing we found is
   that the performance results also seem related to NIC flow steering. We
   are spending time evaluating different NICs' flow director
   implementations now. We will enhance this patch based on our findings
   later.
  
   We have tried different scheduling: per-device based, per-vq based, and
   per-work-type (tx_kick, rx_kick, tx_net, rx_net) based vhost scheduling;
   we found that so far the per-vq based scheduling is good enough for now.
  
  Could you please explain more about those scheduling strategies? Does
  per-device based mean letting a dedicated vhost thread handle all work
  from that vhost device? As you mentioned, maybe an improvement would be
  for the scheduling to take the flow steering info (queue mapping, rxhash
  etc.) of the skb in the host into account.
 
 Yes, per-device scheduling means one per-cpu vhost thread handles all
 work from one particular vhost device.
 
 Yes, we think scheduling that takes flow steering info into account would
 help performance. I am studying this now.

Did anything interesting turn up?


  
   We also tried different algorithms to select which per-cpu vhost thread
   will run on a specific cpu socket: avg_load balance, random...
  
  It may be worth accounting for out-of-order packets during the test, as
  for a single stream different cpu/vhost/physical queues may be chosen to
  do the packet transmission/reception?
 
 Good point. I haven't gone through all data yet. netstat 

Re: CPU softlockup due to smp_call_function()

2012-04-05 Thread Sasha Levin
On Thu, Apr 5, 2012 at 2:24 PM, Avi Kivity a...@redhat.com wrote:
 On 04/04/2012 11:12 PM, Sasha Levin wrote:
 Hi all,

 I've starting seeing soft lockups resulting from smp_call_function()
 calls. I've attached two different backtraces of this happening with
 different code paths.

 This is running inside a KVM guest with the trinity fuzzer, using
 today's linux-next kernel.

 [ 6540.134009] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u:1:38]
 [ 6540.134048] irq event stamp: 286811770
 [ 6540.134048] hardirqs last  enabled at (286811769):
 [82669e74] restore_args+0x0/0x30
 [ 6540.134048] hardirqs last disabled at (286811770):
 [8266b3ea] apic_timer_interrupt+0x6a/0x80
 [ 6540.134048] softirqs last  enabled at (286811768):
 [810b746e] __do_softirq+0x16e/0x190
 [ 6540.134048] softirqs last disabled at (286811749):
 [8266bdec] call_softirq+0x1c/0x30
 [ 6540.134048] CPU 0
 [ 6540.134048] Pid: 38, comm: kworker/u:1 Tainted: G        W
 3.4.0-rc1-next-20120404-sasha-dirty #72
 [ 6540.134048] RIP: 0010:[8111f30e]  [8111f30e]
 smp_call_function_many+0x27e/0x2a0


 This cpu is waiting for some other cpu to process a function (likely
 rps_trigger_softirq(), from the trace).  Can you get a backtrace on all
 cpus when this happens?

 It would be good to enhance smp_call_function_*() to do this
 automatically when it happens - it's spinning there anyway, so it might
 as well count the iterations and NMI the lagging cpu if it waits for too
 long.

What do you think about modifying the softlockup detector to NMI all
CPUs if it's going to panic because it detected a lockup?


Re: Questing regarding KVM Guest PMU

2012-04-05 Thread Gleb Natapov
On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote:
 On 04/04/2012 01:29 PM, Gleb Natapov wrote:
   
   ok. seems to be. will move over to perf as its working fine inside guest.
   
  Good riddance IMO. I managed to run it on a guest (but not on my
  host!). The thing is buggy. It does not use global ctrl MSR to enable
  counters and kvm has all of them disabled by default. I didn't find what
  value this MSR should have after reset, so this may be either kvm bug or
  real BIOSes enable all counters in global ctrl MSR for PMUv1
  compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. The
  second problem is that oprofile reprograms PMU counters without
  disabling them first, and this is explicitly prohibited by the Intel SDM.
  The patch below solves that, but oprofile is the one that should be fixed.
 
 Both should be fixed, there may be other profilers affected.
 
The global ctrl MSR issue I need to investigate further, but the second
one is a bug that should be fixed in oprofile. The Intel spec clearly says:

 EN (Enable Counters) Flag (bit 22) — When set, performance counting
 is enabled in the corresponding performance-monitoring counter; when
 clear, the corresponding counter is disabled. The event logic unit
 for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0,
 before writing to IA32_PMCx.

I suspect that on real HW they get wrong results too. It is simply subtly
wrong, so it is not as noticeable as with kvm.
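
(To make the required ordering concrete, a sketch of a compliant
reprogramming sequence; write_pmc_safely is a made-up helper, the MSR
indices are the standard architectural ones:)

    #define MSR_P6_EVNTSEL0  0x186    /* IA32_PERFEVTSELx base */
    #define MSR_IA32_PMC0    0x0c1    /* IA32_PMCx base */
    #define EVTSEL_EN        (1ULL << 22)

    /* Disable the counter via EN before writing IA32_PMCx, then
     * re-enable it, as the SDM text quoted above requires. */
    static void write_pmc_safely(int idx, u64 evtsel, u64 count)
    {
        wrmsrl(MSR_P6_EVNTSEL0 + idx, evtsel & ~EVTSEL_EN);
        wrmsrl(MSR_IA32_PMC0 + idx, count);
        wrmsrl(MSR_P6_EVNTSEL0 + idx, evtsel);
    }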

 
  diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
  index a73f0c1..be05028 100644
  --- a/arch/x86/kvm/pmu.c
  +++ b/arch/x86/kvm/pmu.c
  @@ -396,6 +396,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
  		    (pmc = get_fixed_pmc(pmu, index))) {
  		data = (s64)(s32)data;
  		pmc->counter += data - read_pmc(pmc);
  +		reprogram_gp_counter(pmc, pmc->eventsel);
  		return 0;
  	} else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
  		if (data == pmc->eventsel)
 
 
 -- 
 error compiling committee.c: too many arguments to function

--
Gleb.


Re: Questing regarding KVM Guest PMU

2012-04-05 Thread Avi Kivity
On 04/05/2012 03:37 PM, Gleb Natapov wrote:
 On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote:
  On 04/04/2012 01:29 PM, Gleb Natapov wrote:

ok. seems to be. will move over to perf as its working fine inside 
guest.

   Good riddance IMO. I managed to run it on a guest (but not on my
   host!). The thing is buggy. It does not use global ctrl MSR to enable
   counters and kvm has all of them disabled by default. I didn't find what
   value this MSR should have after reset, so this may be either kvm bug or
   real BIOSes enable all counters in global ctrl MSR for PMUv1
   compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. The
   second problem is that oprofile reprograms PMU counters without
   disabling them first, and this is explicitly prohibited by the Intel SDM.
   The patch below solves that, but oprofile is the one that should be fixed.
  
  Both should be fixed, there may be other profilers affected.
  
 The global ctrl MSR issue I need to investigate further, but the second
 one is a bug that should be fixed in oprofile. The Intel spec clearly says:

  EN (Enable Counters) Flag (bit 22) — When set, performance counting
  is enabled in the corresponding performance-monitoring counter; when
  clear, the corresponding counter is disabled. The event logic unit
  for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0,
  before writing to IA32_PMCx.

 I suspect that on real HW they get wrong results too. It is simply subtly
 wrong, so it is not as noticeable as with kvm.


First, there is the standard Linus rant to never break userspace (and
for us a guest kernel is userspace).  Second, the Intel rule may have
been added later (as preparation for a change in behaviour), so it may
actually work correctly on older hardware.

-- 
error compiling committee.c: too many arguments to function



Re: CPU softlockup due to smp_call_function()

2012-04-05 Thread Avi Kivity
On 04/05/2012 03:32 PM, Sasha Levin wrote:
 
  It would be good to enhance smp_call_function_*() to do this
  automatically when it happens - it's spinning there anyway, so it might
  as well count the iterations and NMI the lagging cpu if it waits for too
  long.

 What do you think about modifying the softlockup detector to NMI all
 CPUs if it's going to panic because it detected a lockup?

Fine by me.  The triggering cpu should be dumped first though.

We lose knowledge of exactly which cpu is hung, but it should usually be
easily seen from the traces (the exceptional case would be a busy server
that is handling a lot of IPIs at the moment we trigger the traces).

-- 
error compiling committee.c: too many arguments to function



Re: Questing regarding KVM Guest PMU

2012-04-05 Thread Gleb Natapov
On Thu, Apr 05, 2012 at 03:48:51PM +0300, Avi Kivity wrote:
 On 04/05/2012 03:37 PM, Gleb Natapov wrote:
  On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote:
   On 04/04/2012 01:29 PM, Gleb Natapov wrote:
 
 ok. seems to be. will move over to perf as its working fine inside 
 guest.
 
Good riddance IMO. I managed to run it on a guest (but not on my
host!). The thing is buggy. It does not use global ctrl MSR to enable
counters and kvm has all of them disabled by default. I didn't find what
value this MSR should have after reset, so this may be either kvm bug or
real BIOSes enable all counters in global ctrl MSR for PMUv1
compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. The
second problem is that oprofile reprograms PMU counters without
disabling them first, and this is explicitly prohibited by the Intel SDM.
The patch below solves that, but oprofile is the one that should be fixed.
   
   Both should be fixed, there may be other profilers affected.
   
  The global ctrl MSR issue I need to investigate further, but the second
  one is a bug that should be fixed in oprofile. The Intel spec clearly says:
 
   EN (Enable Counters) Flag (bit 22) — When set, performance counting
   is enabled in the corresponding performance-monitoring counter; when
   clear, the corresponding counter is disabled. The event logic unit
   for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0,
   before writing to IA32_PMCx.
 
  I suspect that on real HW they get wrong results too. It is simply subtly
  wrong, so it is not as noticeable as with kvm.
 
 
 First, there is the standard Linus rant to never break userspace (and
We broke nothing. Oprofile is broken and never could have worked
according to the Intel spec. (I didn't manage to get any result from it on
a real CPU, but maybe this is unrelated.)

 for us a guest kernel is userspace).  Second, the Intel rule may have
 been added later (as preparation for a change in behaviour), so it may
 actually work correctly on older hardware.
 
Maybe (the rule is specified for PMUv1 BTW), but we emulate newer HW,
the one that the spec describes, and we report this in cpuid leaf 10. I
highly doubt Intel would change the CPU in a way that may make old software
work incorrectly on newer cpus.

--
Gleb.


Re: Questing regarding KVM Guest PMU

2012-04-05 Thread Avi Kivity
On 04/05/2012 04:26 PM, Gleb Natapov wrote:
 On Thu, Apr 05, 2012 at 03:48:51PM +0300, Avi Kivity wrote:
  On 04/05/2012 03:37 PM, Gleb Natapov wrote:
   On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote:
On 04/04/2012 01:29 PM, Gleb Natapov wrote:
  
  ok. seems to be. will move over to perf as its working fine inside 
  guest.
  
 Good riddance IMO. I managed to run it on a guest (but not on my
 host!). The thing is buggy. It does not use global ctrl MSR to enable
 counters and kvm has all of them disabled by default. I didn't find 
 what
 value this MSR should have after reset, so this may be either kvm bug 
 or
 real BIOSes enable all counters in global ctrl MSR for PMUv1
 compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. 
 The
  second problem is that oprofile reprograms PMU counters without
  disabling them first, and this is explicitly prohibited by the Intel SDM.
  The patch below solves that, but oprofile is the one that should be
  fixed.

Both should be fixed, there may be other profilers affected.

    The global ctrl MSR issue I need to investigate further, but the second
    one is a bug that should be fixed in oprofile. The Intel spec clearly says:
  
EN (Enable Counters) Flag (bit 22) — When set, performance counting
is enabled in the corresponding performance-monitoring counter; when
clear, the corresponding counter is disabled. The event logic unit
for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0,
before writing to IA32_PMCx.
  
    I suspect that on real HW they get wrong results too. It is simply subtly
    wrong, so it is not as noticeable as with kvm.
  
  
  First, there is the standard Linus rant to never break userspace (and
  We broke nothing. Oprofile is broken and never could have worked
  according to the Intel spec. (I didn't manage to get any result from it on
  a real CPU, but maybe this is unrelated.)

But it did work at some point in the past; I've used it.

  for us a guest kernel is userspace).  Second, the Intel rule may have
  been added later (as preparation for a change in behaviour), so it may
  actually work correctly on older hardware.
  
 Maybe (the rule is specified for PMUv1 BTW), but we emulate newer HW,
 the one that the spec describes, and we report this in cpuid leaf 10. I
 highly doubt Intel would change the CPU in a way that may make old software
 work incorrectly on newer CPUs.

The non-architectural PMU is version specific, so they could have made
this change before arch PMU was introduced.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questing regarding KVM Guest PMU

2012-04-05 Thread Gleb Natapov
On Thu, Apr 05, 2012 at 04:28:48PM +0300, Avi Kivity wrote:
 On 04/05/2012 04:26 PM, Gleb Natapov wrote:
  On Thu, Apr 05, 2012 at 03:48:51PM +0300, Avi Kivity wrote:
   On 04/05/2012 03:37 PM, Gleb Natapov wrote:
On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote:
 On 04/04/2012 01:29 PM, Gleb Natapov wrote:
   
   ok. seems to be. will move over to perf as it's working fine
   inside guest.
   
  Good riddance IMO. I managed to run it on a guest (but not on my
  host!). The thing is buggy. It does not use the global ctrl MSR to
  enable counters, and kvm has all of them disabled by default. I didn't
  find what value this MSR should have after reset, so this may be either
  a kvm bug or real BIOSes enabling all counters in the global ctrl MSR
  for PMUv1 compatibility. Doing wrmsr 0x38f 0x7000f solves this
  problem. The second problem is that oprofile reprograms PMU counters
  without disabling them first, and this is explicitly prohibited by the
  Intel SDM. The patch below solves that, but oprofile is the one who
  should be fixed.
 
 Both should be fixed, there may be other profilers affected.
 
 Global ctrl msr issue I need to investigate further, but the second one is
 a bug that should be fixed in oprofile. The Intel spec clearly says:
   
 EN (Enable Counters) Flag (bit 22) — When set, performance counting
 is enabled in the corresponding performance-monitoring counter; when
 clear, the corresponding counter is disabled. The event logic unit
 for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0,
 before writing to IA32_PMCx.
   
 I suspect that on real HW they got wrong results too. It is simply subtly
 wrong, so it is not as noticeable as with kvm.
   
   
   First, there is the standard Linus rant to never break userspace (and
   We broke nothing. Oprofile is broken and never could have worked
   according to the Intel spec. (I didn't manage to get any result from it on
   a real CPU, but maybe this is unrelated.)
 
 But it did work sometime in the past, I've used it.
 
Maybe it used NMI-based profiling. We should ask the oprofile developers.
As I said, I am almost sure my inability to run it on a host is probably
PEBKAC, although I ran exactly the same script on the host and the
guest (the script is from the first email of this thread).

   for us a guest kernel is userspace).  Second, the Intel rule may have
   been added later (as preparation for a change in behaviour), so it may
   actually work correctly on older hardware.
   
  Maybe (the rule is specified for PMUv1 BTW), but we emulate newer HW,
  the one that the spec describes, and we report this in cpuid leaf 10. I
  highly doubt Intel would change the CPU in a way that may make old software
  work incorrectly on newer CPUs.
 
 The non-architectural PMU is version specific, so they could have made
 this change before arch PMU was introduced.
 
So Table 18-11 in section 18.4.4.3, Writing a PEBS Interrupt Service Routine,
has:
 For processors based on Intel Core microarchitecture, counters MUST be
 stopped before writing.

"MUST" is written in all caps in the spec. For NetBurst this is optional.
Really, I fail to see how this is not an oprofile bug.
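
For reference, a compliant reprogramming sequence looks roughly like the
sketch below (untested, kernel-style rdmsrl/wrmsrl helpers; the MSR indices
are the architectural IA32_PERFEVTSEL0 = 0x186 and IA32_PMC0 = 0xc1):

#define MSR_IA32_PERFEVTSEL0	0x186
#define MSR_IA32_PMC0		0xc1
#define EVTSEL_EN		(1ULL << 22)

/* Stop GP counter idx before writing the PMC, then re-enable it. */
static void reprogram_gp_counter(int idx, u64 evtsel, u64 count)
{
	u64 v;

	rdmsrl(MSR_IA32_PERFEVTSEL0 + idx, v);
	wrmsrl(MSR_IA32_PERFEVTSEL0 + idx, v & ~EVTSEL_EN);  /* stop it */
	wrmsrl(MSR_IA32_PMC0 + idx, count);                  /* now safe */
	wrmsrl(MSR_IA32_PERFEVTSEL0 + idx, evtsel | EVTSEL_EN);
}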

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questing regarding KVM Guest PMU

2012-04-05 Thread Gleb Natapov
On Thu, Apr 05, 2012 at 04:48:57PM +0300, Gleb Natapov wrote:
 On Thu, Apr 05, 2012 at 04:28:48PM +0300, Avi Kivity wrote:
  On 04/05/2012 04:26 PM, Gleb Natapov wrote:
   On Thu, Apr 05, 2012 at 03:48:51PM +0300, Avi Kivity wrote:
On 04/05/2012 03:37 PM, Gleb Natapov wrote:
 On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote:
  On 04/04/2012 01:29 PM, Gleb Natapov wrote:

ok. seems to be. will move over to perf as it's working fine
inside guest.

   Good riddance IMO. I managed to run it on a guest (but not on my
   host!). The thing is buggy. It does not use the global ctrl MSR to
   enable counters, and kvm has all of them disabled by default. I didn't
   find what value this MSR should have after reset, so this may be either
   a kvm bug or real BIOSes enabling all counters in the global ctrl MSR
   for PMUv1 compatibility. Doing wrmsr 0x38f 0x7000f solves this
   problem. The second problem is that oprofile reprograms PMU counters
   without disabling them first, and this is explicitly prohibited by the
   Intel SDM. The patch below solves that, but oprofile is the one who
   should be fixed.
  
  Both should be fixed, there may be other profilers affected.
  
  Global ctrl msr issue I need to investigate further, but the second one is
  a bug that should be fixed in oprofile. The Intel spec clearly says:

  EN (Enable Counters) Flag (bit 22) — When set, performance counting
  is enabled in the corresponding performance-monitoring counter; when
  clear, the corresponding counter is disabled. The event logic unit
  for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0,
  before writing to IA32_PMCx.

  I suspect that on real HW they got wrong results too. It is simply subtly
  wrong, so it is not as noticeable as with kvm.


First, there is the standard Linus rant to never break userspace (and
    We broke nothing. Oprofile is broken and never could have worked
    according to the Intel spec. (I didn't manage to get any result from it on
    a real CPU, but maybe this is unrelated.)
  
  But it did work sometime in the past, I've used it.
  
 Maybe it used NMI-based profiling. We should ask the oprofile developers.
 As I said, I am almost sure my inability to run it on a host is probably
 PEBKAC, although I ran exactly the same script on the host and the
 guest (the script is from the first email of this thread).
 
After upgrading the kernel to the latest git from whatever was there, the
same script works and counts CPU_CLK_UNHALT events.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questing regarding KVM Guest PMU

2012-04-05 Thread Avi Kivity
On 04/05/2012 04:57 PM, Gleb Natapov wrote:
   
  Maybe it used NMI-based profiling. We should ask the oprofile developers.
  As I said, I am almost sure my inability to run it on a host is probably
  PEBKAC, although I ran exactly the same script on the host and the
  guest (the script is from the first email of this thread).
  
 After upgrading the kernel to the latest git from whatever was there, the
 same script works and counts CPU_CLK_UNHALT events.


This is even while it violates the Intel guidelines?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questing regarding KVM Guest PMU

2012-04-05 Thread Gleb Natapov
On Thu, Apr 05, 2012 at 05:38:40PM +0300, Avi Kivity wrote:
 On 04/05/2012 04:57 PM, Gleb Natapov wrote:

   Maybe it used NMI-based profiling. We should ask the oprofile developers.
   As I said, I am almost sure my inability to run it on a host is probably
   PEBKAC, although I ran exactly the same script on the host and the
   guest (the script is from the first email of this thread).
   
  After upgrading the kernel to the latest git from whatever was there, the
  same script works and counts CPU_CLK_UNHALT events.
 
 
 This is even while it violates the Intel guidelines?
 
Yes, but who says the result is correct :) It seems that we handle the
global ctrl msr wrong. That is, a counter can be enabled either in global
ctrl or in eventsel. Trying to confirm that.
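
What I am checking is essentially the predicate below (a sketch, assuming
the architectural PMUv2 layout where IA32_PERF_GLOBAL_CTRL bit i gates GP
counter i and EVTSEL.EN is bit 22):

/* Per the SDM a GP counter should count only when BOTH enable bits are
 * set; the oprofile behaviour suggests some setups treat the eventsel
 * enable alone as sufficient. */
static bool gp_counter_counting(u64 global_ctrl, u64 eventsel, int idx)
{
	bool global_en = global_ctrl & (1ULL << idx);
	bool evtsel_en = eventsel & (1ULL << 22);

	return global_en && evtsel_en;
}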

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-05 Thread Alex Williamson
On Thu, 2012-04-05 at 12:34 +0300, Michael S. Tsirkin wrote:
 On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote:
  We've hit a kernel host panic, when issuing a 'system_reset' with an
  82576 nic assigned and a Windows guest. Host system is a PowerEdge R815.
  
  [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 
  32993
  [Hardware Error]: APEI generic hardware error status
  [Hardware Error]: severity: 1, fatal
  [Hardware Error]: section: 0, severity: 1, fatal
  [Hardware Error]: flags: 0x01
  [Hardware Error]: primary
  [Hardware Error]: section_type: PCIe error
  [Hardware Error]: port_type: 0, PCIe end point
  [Hardware Error]: version: 1.0
  [Hardware Error]: command: 0x, status: 0x0010
  [Hardware Error]: device_id: :08:00.0
  [Hardware Error]: slot: 1
  [Hardware Error]: secondary_bus: 0x00
  [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
  [Hardware Error]: class_code: 02
  [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
  [Hardware Error]: Unsupported Request
  [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
  [Hardware Error]: aer_uncor_severity: 0x00067011
  [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
  [Hardware Error]: section: 1, severity: 1, fatal
  [Hardware Error]: flags: 0x01
  [Hardware Error]: primary
  [Hardware Error]: section_type: PCIe error
  [Hardware Error]: port_type: 0, PCIe end point
  [Hardware Error]: version: 1.0
  [Hardware Error]: command: 0x, status: 0x0010
  [Hardware Error]: device_id: :08:00.0
  [Hardware Error]: slot: 1
  [Hardware Error]: secondary_bus: 0x00
  [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9
  [Hardware Error]: class_code: 02
  [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000
  [Hardware Error]: Unsupported Request
  [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
  [Hardware Error]: aer_uncor_severity: 0x00067011
  [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100
  Kernel panic - not syncing: Fatal hardware error!
  Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1
  Call Trace:
   NMI  [814f2fe5] ? panic+0xa0/0x168
   [812f919c] ? ghes_notify_nmi+0x17c/0x180
   [814f91d5] ? notifier_call_chain+0x55/0x80
   [814f923a] ? atomic_notifier_call_chain+0x1a/0x20
   [8109667e] ? notify_die+0x2e/0x30
   [814f6e81] ? do_nmi+0x1a1/0x2b0
   [814f6760] ? nmi+0x20/0x30
   [8103762b] ? native_safe_halt+0xb/0x10
   EOE  [8101495d] ? default_idle+0x4d/0xb0
   [81009e06] ? cpu_idle+0xb6/0x110
   [814da63a] ? rest_init+0x7a/0x80
   [81c1ff7b] ? start_kernel+0x424/0x430
   [81c1f33a] ? x86_64_start_reservations+0x125/0x129
   [81c1f438] ? x86_64_start_kernel+0xfa/0x109
  
  The root cause of the problem is that the 'reset_assigned_device()' code
  first writes a 0 to the command register. Then, when qemu subsequently does
  a kvm_deassign_irq() (called by assign_irq(), in the system_reset path),
  the kernel ends up calling '__msix_mask_irq()', which performs a write to
  the memory-mapped MSI vector space. Since we've explicitly told the device
  to disallow MMIO access (via the 0 write to the command register), we end
  up with the above 'Unsupported Request'.
  
  The fix here is to first disable MSI-X before doing the reset.  We also
  disable MSI, leaving the device in INTx mode.  In this way, the device is
  in a known state after reset, and we avoid touching MSI memory-mapped space
  on any subsequent 'kvm_deassign_irq()'.
  
  Thanks to Michael S. Tsirkin for help in understanding what was going on
  here and Jason Baron, the original debugger of this problem.
  
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  ---
  
  Jason is out of the office for a couple weeks, so I'll try to resolve
  this while he's away.  Somehow the emulated config updates were lost
  in Jason's original posting, so I've fixed that and taken Jan's suggestion
  to simply call into the update functions instead of open coding the
  interrupt disable.  I think there still may be some disagreements about
  how to handle guest-generated errors in the host, but that's a large
  project, whereas this is something we should be doing at reset anyway,
  and even if only a workaround, it resolves the problem above.
 
 I don't think there's a disagreement: don't we all agree
 they should be forwarded to qemu and on to the guest if possible,
 ignored if not?

With AER perhaps, but I'm not sure we have that flexibility with APEI;
I'd need to read up on the spec.  It seems APEI defines that the
BIOS gets first shot at handling the error and gets to dictate how the
OS handles it.  In this case, deciding that it's fatal.

   hw/device-assignment.c |   23 +++
   1 files changed, 23 insertions(+), 0 deletions(-)
  
  diff --git a/hw/device-assignment.c 

Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-05 Thread Michael S. Tsirkin
On Thu, Apr 05, 2012 at 08:42:03AM -0600, Alex Williamson wrote:
  So far so good.
  
   + * We especially do not want MSI-X
   + * enabled since it lives in MMIO space, which is about to get
   + * disabled.
  
  I think we are better off dropping the above, because it's
  a bug that we disable MMIO space on the physical device:
  we did not enable it (kvm did), so let's not disable it.
 
 I'd like to investigate removing the command register clearing, but it
 did solve a bug at the time it went in.  That will be a separate patch,
 so we can remove the above sentence when that's removed.  It may be a
 bug, but it's describing the current behavior.

Fair enough.
Acked-by: Michael S. Tsirkin m...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread patch

2012-04-05 Thread Shirley Ma
On Thu, 2012-04-05 at 15:28 +0300, Michael S. Tsirkin wrote:
 On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote:
  On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote:
   Hi:
   
   Thanks for the work and it looks very reasonable, some questions
   below.
 
 Yes I am happy to see the per-cpu work resurrected.
 Some comments below.
Glad to see you have time to review this.

   On 03/23/2012 07:48 AM, Shirley Ma wrote:
Sorry for being late to submit this patch. I have spent lots of
 time
trying to find the best approach. This effort is still going
 on...
   
This patch is built against net-next tree.
   
This is an experimental RFC patch. The purpose of this patch is
 to
address KVM networking scalability and NUMA scheduling issue.
   
   Need also test for non-NUMA machine, I see that you just choose
 the
   cpu 
   that initiates the work for non-numa machine which seems sub
 optimal.
  
 Good suggestions. I don't have any non-NUMA systems, but KK ran some
 tests on a non-NUMA system. He saw around a 20% performance gain for
 a single VM, local host to guest. I hope we can run a full test on a
 non-NUMA system.
  
 On a non-NUMA system, the same per-cpu vhost thread will always be picked
 consistently for a particular vq, since all cores are on the same cpu
 socket. So there will be two per-cpu vhost threads handling TX/RX
 simultaneously.
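
 The selection logic amounts to something like the sketch below
 (illustrative only, not the patch itself; it assumes a per-cpu vhost
 worker indexed by the returned cpu):

 /* Pick a cpu on the initiator's NUMA node for the vhost worker, avoiding
  * the cpu that initiated the work (the vcpu/interrupt cpu). */
 static int choose_vhost_cpu(int initiating_cpu)
 {
	int node = cpu_to_node(initiating_cpu);
	int cpu;

	for_each_cpu(cpu, cpumask_of_node(node))
		if (cpu_online(cpu) && cpu != initiating_cpu)
			return cpu;

	return initiating_cpu;	/* fall back, e.g. a single-cpu node */
 }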
  
    The existing implementation of vhost creates a vhost thread
    per device (virtio_net). The RX and TX work of a VM's device is
    handled by the same vhost thread.

    One of the limitations of this implementation is that, with an
    increasing number of VMs or virtio-net interfaces, more vhost
    threads are created; this consumes more kernel resources and
    induces more thread context-switch/scheduling overhead. We noticed
    that KVM network performance doesn't scale with an increasing
    number of VMs.

    The other limitation is that having a single vhost thread process
    both RX and TX means the work can be blocked. So we created this
    per-cpu vhost thread implementation. The number of vhost cpu
    threads is limited to the number of cpus on the host.
   
    To address these limitations, we are proposing a per-cpu vhost
    thread model where the number of vhost threads is limited and equal
    to the number of online cpus on the host.
   
   The number of vhost threads needs more consideration. Consider that
   we have a 1024-core host with a card that has 16 tx/rx queues; do we
   really need 1024 vhost threads?
  
  In this case, we could add a module parameter to limit the number of
  cores/sockets to be used.
 
 Hmm. And then which cores would we run on?
 Also, is the parameter different between guests?
 Another idea is to scale the # of threads on demand.

If we are able to pass the number of guests/vcpus to vhost, we can
scale the vhost threads. Any API to get this info?


 Sharing the same thread between guests is also an
 interesting approach, if we did this then per-cpu
 won't be so expensive but making this work well
 with cgroups would be a challenge.

Yes, I am comparing the vhost-thread-pool-shared-among-guests approach
with the per-cpu vhost approach now.

It's a challenge to work with cgroups anyway.

 
   
    Based on our testing experience, the vcpus can be scheduled across
    cpu sockets even when the number of vcpus is smaller than the number
    of cores per cpu socket and there are no other activities besides
    the KVM networking workload. We found that if the vhost thread is
    scheduled on the same socket where the work is received, the
    performance is better.
   
    So in this per-cpu vhost thread implementation, a vhost thread is
    selected dynamically based on where the TX/RX work is initiated. A
    vhost thread on the same cpu socket is selected, but not on the same
    cpu as the vcpu/interrupt thread that initiated the TX/RX work.
   
    When we tested this RFC patch, the other interesting thing we found
    is that the performance results also seem related to NIC flow
    steering. We are spending time evaluating different NICs' flow
    director implementations now. We will enhance this patch based on
    our findings later.
   
    We have tried different scheduling: per-device based, per-vq based,
    and per-work-type (tx_kick, rx_kick, tx_net, rx_net) based vhost
    scheduling; we found that so far the per-vq based scheduling is good
    enough for now.
   
   Could you please explain more about those scheduling strategies? Does
   per-device based mean letting a dedicated vhost thread handle all
   work from that vhost device? As you mentioned, maybe the scheduling
   could be improved by taking the flow steering info (queue mapping,
   rxhash, etc.) of the skb in the host into account.
  
  Yes, per-device scheduling means one per-cpu vhost thread handles all
  work from one particular vhost device.
  
  Yes, we think 

Re: [GIT PULL] Please pull powerpc KVM fixes

2012-04-05 Thread Avi Kivity
On 04/03/2012 03:22 PM, Paul Mackerras wrote:
 On Tue, Apr 03, 2012 at 10:03:05PM +1000, Paul Mackerras wrote:
  
  The following changes since commit 592f5d87b3feee9d60411f19d583038c0c7670ad:

 OK, I messed up the git request-pull command.  The request should have
 looked like this:

 The following changes since commit b1a808ff436343956a6ae63178ea1810c5e5a3a1:

   Merge branch 'for-next' of git://gitorious.org/kernel-hsi/kernel-hsi 
 (2012-04-02 09:50:40 -0700)

 are available in the git repository at:


   git://github.com/paulusmack/linux tags/powerpc-fixes



Thanks, pulled, will push shortly.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call minutes April 3

2012-04-05 Thread Avi Kivity
On 04/04/2012 05:39 PM, Michael Roth wrote:
  Will command line take in account hot-plugged devices?

 No, that's a good point. We'd probably need to generate the options
 required to ensure the devices are created on the target, and we'd only
 be able to do that just before sending the device state. That means we
 need a way to create machines after we've completed tasks like memory
 migration, which probably has similar requirements to just being able to
 instantiate a machine from a serialized QOM composition tree.


Or we forward the hotplug events over the migration protocol.  This is
needed for devices that include memory (like ivshmem.c):

page
page
page
hotplug event
page
page for the new device
page

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] KVM fixes for 3.4-rc1

2012-04-05 Thread Avi Kivity
Linus, please pull a few fixes from

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git kvm-updates/3.4

A bunch of powerpc KVM fixes, a guest and a host RCU fix (unrelated),
and a small build fix.


Alexander Graf (3):
  KVM: PPC: Book3S: Compile fix for ppc32 in HIOR access code
  KVM: PPC: Save/Restore CR over vcpu_run
  KVM: PPC: Book3S: PR: Fix preemption

Avi Kivity (1):
  Merge tag 'powerpc-fixes' of git://github.com/paulusmack/linux
into new/master

Gleb Natapov (1):
  KVM: Resolve RCU vs. async page fault problem

Marcelo Tosatti (1):
  KVM: VMX: vmx_set_cr0 expects kvm->srcu locked

Paul Mackerras (2):
  KVM: PPC: Book3S HV: Fix kvm_alloc_linear in case where no linears
exist
  KVM: PPC: Book3S HV: Save and restore CR in __kvmppc_vcore_entry

Sasikantha babu (1):
  KVM: PMU: Fix integer constant is too large warning in
kvm_pmu_set_msr()

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread patch

2012-04-05 Thread Shirley Ma
On Thu, 2012-04-05 at 08:22 -0700, Shirley Ma wrote:

 Haven't had time to focus on single-stream results yet.

I forgot to mention that if I switch the vhost scheduling to per-device
based from per-vq based, this minor single-stream test regression goes
away. However, the improvement in tcp_rrs, udp_rrs and other stream-case
performance is also gone.

Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


New git workflow

2012-04-05 Thread Avi Kivity
In a recent conversation, Linus persuaded me that it's time for change
in our git workflow; the following will bring it in line with the
current practices of most trees.

The current 'master' branch will be abandoned (still available for
reviewing history).  The new branch structure will be as follows:

  'master': contains patches destined for fixes after the current merge
window; for example, after v3.4 has been released, master contains
fixes which will be merged during the 3.4 cycle.
  'next': contains patches destined for the next merge window; for
example, if v3.4 has just been released, then 'next' will be based on
the v3.5-rc1 tag (or thereabouts), and will be accepting patches
targeted at the 3.6 merge window (when 3.5 is released).
  'queue': holding area for patches which have not been tested yet; may
be rebased; usually merged into 'next'
  'kvm-updates/3.x': unchanged, patches destined for a stable branch

Since 'next' is based on an -rc1 release, it may lack upstream non-kvm
fixes (or features) needed for development.  To work around that, an
'auto-next' branch will be provided that is a merge of current upstream
and 'next' (this is more or less the equivalent of the old 'master'
branch).  That branch of course will be rebased often.

How to work with the new layout:


If you're a developer, usually developing against 'next' is okay.  If
'next' is unstable for you or you need a new upstream API, work against
'auto-next', but let the maintainers know that when posting your patch.

If you're working on a fix for the current cycle, work against upstream
or 'master' (they should be equivalent most of the time).

If you're a submaintainer, post git pull requests against 'next' or
'master', according to your merge target.

Please ping Marcelo or myself with any question or suggestion.  It's
better to ask a silly question than develop against a stale tree
accidentally and waste your efforts.

I'll publish the new branches tomorrow, with any luck.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/13] KVM: MMU: fask check whether page is writable

2012-04-05 Thread Xiao Guangrong
Hi Avi,

Thanks very much for your review!

Sorry for the delayed reply; I was on vacation.

On 04/01/2012 11:52 PM, Avi Kivity wrote:

 On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
 Using the PTE_LIST_WRITE_PROTECT bit in the rmap to store the
 write-protect status avoids unnecessary shadow page walking.

 Also, if no shadow page is indirect, the page is write-free.


 @@ -2262,6 +2291,9 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
  	}
  	if (need_unsync)
  		kvm_unsync_pages(vcpu, gfn);
 +
 +	*rmap &= ~PTE_LIST_WRITE_PROTECT;
 +

 
 So what are the rules for PTE_LIST_WRITE_PROTECT?  Is it a cache for
 mmu_need_write_protect?
 
 I'd like to understand it, I guess it can be set while write protection
 is unneeded, and cleared on the next check?
 


Yes, it is used as a cache for mmu_need_write_protect.

When the gfn is protected by a sync sp or a read-only host page, we set this
bit, and it is cleared when the sp becomes unsync or the host page becomes
writable.
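
In pseudo-C, the intended use is something like this (a sketch of the
invariant only, not the patch):

/* PTE_LIST_WRITE_PROTECT caches the last answer of mmu_need_write_protect()
 * in the rmap head, so callers can skip the shadow-page walk when the bit
 * is already set. */
static bool need_write_protect_cached(struct kvm_vcpu *vcpu, gfn_t gfn,
				      unsigned long *rmap)
{
	if (*rmap & PTE_LIST_WRITE_PROTECT)	/* cached "yes" */
		return true;

	if (mmu_need_write_protect(vcpu, gfn, false)) {
		*rmap |= PTE_LIST_WRITE_PROTECT;	/* cache the result */
		return true;
	}

	/* the bit is cleared again when the sp becomes unsync or the
	 * host page becomes writable */
	return false;
}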

 Maybe split into two functions, one the normal mmu_need_write_protect
 (but renamed) and a new mmu_need_write_protect(), with locked and
 unlocked variants, calling the old one.
 


Okay, I will split it by introducing a new function named mmu_unsync_gfn_sp,
which checks whether the sp can become unsync and unsyncs it if that is
allowed, under the protection of mmu-lock.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock

2012-04-05 Thread Xiao Guangrong
On 04/01/2012 11:53 PM, Avi Kivity wrote:

 On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
 It depends on the PTE_LIST_WRITE_PROTECT bit in the rmap, which lets us
 quickly know whether the page is writable out of mmu-lock.

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   17 +
  arch/x86/kvm/paging_tmpl.h |2 +-
  2 files changed, 14 insertions(+), 5 deletions(-)

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 3887a07..c029185 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1148,6 +1148,12 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)

  *rmapp |= PTE_LIST_WRITE_PROTECT;

 +/*
 + * Setting PTE_LIST_WRITE_PROTECT bit before doing page
 + * write-protect.
 + */
 +smp_mb();
 +
 
 wmb only needed.
 


We should ensure that this bit is set before reading the spte; it cooperates
with the fast page fault path to avoid this case:

On fast page fault path:            On rmap_write_protect path:

                                    read spte: old_spte = *spte
                                    (reading the spte is reordered to before
                                     setting the PTE_LIST_WRITE_PROTECT bit)

set spte.identification
smp_mb
if (!rmap.PTE_LIST_WRITE_PROTECT)
        cmpxchg(sptep, spte, spte | WRITABLE)

                                    set rmap.PTE_LIST_WRITE_PROTECT
                                    sees old_spte.identification is not set,
                                    so it does not write-protect this page

   OOPS!!!

 Would it be better to store this bit in all the sptes instead?  We're
 touching them in any case.  More work to clear them, but
 un-write-protecting a page is beneficial anyway as it can save a fault.
 

There are two reasons:
- if we set this bit in the rmap, we can do a quick check to see whether the
  page is writable before doing the shadow page walk.

- since a full barrier is needed, we should use smp_mb for every spte like this:

  while ((spte = rmap_next(rmapp, spte))) {
read spte
smp_mb
write-protect spte
  }

  smp_mb is called in the loop; I think that is not good, yes?

If you just want to save the fault, we can let all sptes be writable in
mmu_need_write_protect, but we should cache the gpte access bits into the
spte first. That should be another patchset, I think. :)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


smp option of qemu-kvm

2012-04-05 Thread Steven
Hi,
I started a kvm VM by adding the -smp 2 option. From inside the guest, I
can see that /proc/cpuinfo outputs 2 cores.
However, on the host, I only observe one qemu-kvm process for that VM.
Does that mean this VM is actually running on one core?
If so, how do I make a VM run on 2 or more cores? Thanks.

-ha
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: smp option of qemu-kvm

2012-04-05 Thread Daniel P. Berrange
On Thu, Apr 05, 2012 at 02:28:51PM -0400, Steven wrote:
 Hi,
 I started a kvm VM by adding the -smp 2 option. From inside the guest, I
 can see that /proc/cpuinfo outputs 2 cores.
 However, on the host, I only observe one qemu-kvm process for that VM.
 Does that mean this VM is actually running on one core?
 If so, how do I make a VM run on 2 or more cores? Thanks.

Each VCPU in KVM corresponds to a separate thread in the process. The
'ps' command only ever shows the thread leader by default - so you
don't see those VCPU threads in the process list, e.g. use ps -eLf to
see all threads.

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: smp option of qemu-kvm

2012-04-05 Thread Steven
Hi, Daniel,
Thanks for your quick response. However, ps -eLf shows 4 threads
for the VM, and I checked that the 4 threads have the same tgid.
But the VM I created has the -smp 2 option. Could you explain this? Thanks.


On Thu, Apr 5, 2012 at 2:40 PM, Daniel P. Berrange berra...@redhat.com wrote:
 On Thu, Apr 05, 2012 at 02:28:51PM -0400, Steven wrote:
 Hi,
 I started a kvm VM by adding the -smp 2 option. From inside the guest, I
 can see that /proc/cpuinfo outputs 2 cores.
 However, on the host, I only observe one qemu-kvm process for that VM.
 Does that mean this VM is actually running on one core?
 If so, how do I make a VM run on 2 or more cores? Thanks.

 Each VCPU in KVM corresponds to a separate thread in the process. The
 'ps' command only ever shows the thread leader by default - so you
 don't see those VCPU threads in the process list. eg ps -eLf to
 see all threads

 Daniel
 --
 |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
 |: http://libvirt.org              -o-             http://virt-manager.org :|
 |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
 |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: smp option of qemu-kvm

2012-04-05 Thread Daniel P. Berrange
On Thu, Apr 05, 2012 at 02:52:40PM -0400, Steven wrote:
 Hi, Daniel,
 Thanks for your quick response. However, ps -eLf shows 4 threads
 for the VM, and I checked that the 4 threads have the same tgid.
 But the VM I created has the -smp 2 option. Could you explain this? Thanks.

As well as the vCPU threads, QEMU creates other threads as needed, typically
for I/O - indeed the count of threads may vary over time.

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/13] KVM: MMU: fast path of handling guest page fault

2012-04-05 Thread Xiao Guangrong
On 04/02/2012 12:23 AM, Avi Kivity wrote:

 On 03/29/2012 11:27 AM, Xiao Guangrong wrote:
 If the present bit of the page fault error code is set, it indicates
 that the shadow page is populated on all levels; it means what we do is
 only modify the access bits, which can be done out of mmu-lock.

 The trick in this patch is avoiding the race between the fast page fault
 path and the write-protect path; the write-protect path is a
 read-check-modify path:
 read spte, check W bit, then clear W bit. What we do is populate an
 identification in the spte; if write-protect meets it, it modifies the
 spte even if the spte is read-only. See the comment in the code for more
 information.

 +
 +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn,
 +				   u32 error_code)
 +{
 +	unsigned long *rmap;
 +	bool write = error_code & PFERR_WRITE_MASK;
 +
 +	/*
 +	 * #PF can be fast only if the shadow page table is present, that
 +	 * means we just need change the access bits (e.g: R/W, U/S...)
 +	 * which can be done out of mmu-lock.
 +	 */
 +	if (!(error_code & PFERR_PRESENT_MASK))
 +		return false;
 +
 +	if (unlikely(vcpu->vcpu_id > max_vcpu_spte()))
 +		return false;
 +
 +	rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);
 
 What prevents sp-gfns from being freed while this is going on?  Did I
 miss the patch that makes mmu pages freed by rcu?  


sp->gfns is not used here; what we are using is the rmap, which comes from
the memslot, protected by srcu.

In this patch, we call kvm_mmu_page_get_gfn in fast_pf_fetch_direct_spte,
which is only used for direct sp; sp->gfns is also not touched in this case.

But I want to use it in the next version; it should be freed in an RCU
context, see:
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/88934

 Also need barriers in
 kvm_mmu_get_page() to make sure sp->gfns is made visible before the page
 is hashed in.
 


We do not walk the hash tables of shadow pages; we walk the shadow page
table which is currently being used by the vcpu, so we just need to care
about the order of setting the spte and setting sp->gfns, but that has a
data dependence in the current code:

set spte
if (is_shadow_present_pte(*sptep))
        set sp->gfns[]

Setting sp->gfns can not be reordered ahead of setting the spte.


 + smp_rmb()
 


Actually, I do not want to use a barrier here since a read barrier is
not a compiler barrier on X86. And we just quickly check whether the page
can be writable before the shadow page table walk. A real barrier is
used before updating the spte.

 +
 +	/* Quickly check the page can be writable. */
 +	if (write && (ACCESS_ONCE(*rmap) & PTE_LIST_WRITE_PROTECT))
 +		return false;
 +
 +	return true;
 +}
 +
 +typedef bool (*fast_pf_fetch_spte)(struct kvm_vcpu *vcpu, u64 *sptep,
 +				   u64 *new_spte, gfn_t gfn, u32 expect_access,
 +				   u64 spte);
 +
 +static bool
 +fast_pf_fetch_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 *new_spte,
 +			  gfn_t gfn, u32 expect_access, u64 spte)
 +{
 +	struct kvm_mmu_page *sp = page_header(__pa(sptep));
 +
 
 Why is this stable?  Another cpu could have removed it.
 


It is ok for the removed sp since it is not freed on this path. And if the
sp is removed, cmpxchg can see this change since the spte is zapped.


 +	WARN_ON(!sp->role.direct);
 +
 +	if (kvm_mmu_page_get_gfn(sp, sptep - sp->spt) != gfn)
 +		return false;
 
 Simpler to check if the page is sync; if it is, gfn has not changed.
 
 (but the check for a sync page is itself unstable).
 


Checking sp.sync is unsafe since there is a window between guest page write
emulation and updating the spte:

At the beginning:
gpte = gfn1
spte = gfn1's pfn
(gfn1 is write-protected, gfn2 is write-free)

VCPU 1                                  VCPU 2

modify gpte and let it point to gfn2
page fault
write-emulation: gpte = gfn2

                                        page fault on gpte:
                                        gfn = gfn2
                                        fast page fault:
                                        sees gfn2 is write-free and updates spte:
                                        spte = gfn1's pfn + W
                                        OOPS!!!
zap spte

 +
 +	set_spte(vcpu, new_spte, sp->role.access,
 +		 expect_access & ACC_USER_MASK, expect_access & ACC_WRITE_MASK,
 +		 sp->role.level, gfn, spte_to_pfn(spte), false, false,
 +		 spte & SPTE_HOST_WRITEABLE, true);
 +
 
 What if the conditions have changed meanwhile?  Should we set the spte
 atomically via set_spte?  For example an mmu notifier clearing the spte
 in parallel.
 


All of these cases are ok; we do not directly update the spte, and it is
updated like this:

new_spte = 0x0ull /* set it to nonpresent, so the rmap path can not be touched */
set_spte(vcpu, new_spte, ..) /* the expected spte is temporarily saved in new_spte */
cmpxchg(spte, spte_old, new_spte)

It depends on cmpxchg.
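
(In other words, the publish step is just the following, as a sketch:)

/* Publish the prepared spte only if nobody changed the spte under us. */
static bool try_publish_spte(u64 *sptep, u64 old_spte, u64 new_spte)
{
	return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
}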

 What if another code path has read *spte, and did 

Re: WARNING: at arch/x86/kernel/smp.c:119 native_smp_send_reschedule+0x25/0x43()

2012-04-05 Thread Tony Luck
A plain v3.3 kernel hits this when I just type reboot on a 32 cpu (2
socket * 8 core * 2 HT) system:


sd 0:0:0:0: [sda] Synchronizing SCSI cache
Restarting system.
machine restart
[ cut here ]
WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x5c/0x60()
Hardware name: S2600CP
Modules linked in:
Pid: 10068, comm: reboot Not tainted 3.3.0- #1
Call Trace:
 IRQ  [8104c37f] warn_slowpath_common+0x7f/0xc0
 [8104c3da] warn_slowpath_null+0x1a/0x20
 [810314bc] native_smp_send_reschedule+0x5c/0x60
 [81084824] trigger_load_balance+0x244/0x2f0
 [8107bf71] scheduler_tick+0x101/0x160
 [8105b1de] update_process_times+0x6e/0x90
 [8109e6a6] tick_sched_timer+0x66/0xc0
 [81072d63] __run_hrtimer+0x83/0x1d0
 [8109e640] ? tick_nohz_handler+0xf0/0xf0
 [81073146] hrtimer_interrupt+0x106/0x240
 [8157bf69] smp_apic_timer_interrupt+0x69/0x99
 [8157ac1e] apic_timer_interrupt+0x6e/0x80
 EOI  [81033276] ? default_send_IPI_mask_allbutself_phys+0xd6/0x110
 [81036397] physflat_send_IPI_allbutself+0x17/0x20
 [81031849] native_nmi_stop_other_cpus+0xa9/0x110
 [81030e24] native_machine_shutdown+0x64/0x90
 [81030a97] native_machine_restart+0x27/0x40
 [810309cf] machine_restart+0xf/0x20
 [810637de] kernel_restart+0x3e/0x60
 [810639e0] sys_reboot+0x1c0/0x240
 [811734af] ? __d_free+0x4f/0x70
 [8117353c] ? d_free+0x6c/0x80
 [81174dbd] ? d_kill+0xad/0x110
 [8117b603] ? mntput+0x23/0x40
 [8115f9b7] ? fput+0x197/0x260
 [8115ba63] ? filp_close+0x63/0x90
 [8157a169] system_call_fastpath+0x16/0x1b
---[ end trace 5e0dddabdbb21c7e ]---
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-05 Thread Xiao Guangrong
On 04/01/2012 08:58 PM, Avi Kivity wrote:

 On 03/30/2012 12:18 PM, Xiao Guangrong wrote:
 On 03/29/2012 08:57 PM, Avi Kivity wrote:

 On 03/29/2012 01:40 PM, Xiao Guangrong wrote:
 * Implementation
 We can freely walk the page tables between walk_shadow_page_lockless_begin
 and walk_shadow_page_lockless_end; that ensures all the shadow pages are
 valid.

 In most cases, cmpxchg is enough to change the access bit of the spte,
 but the write-protect path on softmmu/nested mmu is a special case: it is
 a read-check-modify path: read spte, check W bit, then clear W bit.

 We also set gpte.D and gpte.A, no? How do you handle that?



  We still need to walk the guest page table before the fast page fault to
  check whether the access is valid.

 Ok.  But that's outside the mmu lock.

  We can also skip that: if !sp->unsync, copy the access rights and gpte.A/D
  into spare bits in the spte, and use that to check.



 Great!

 gpte.A need not be copied into the spte since PFEC.P = 1 means the shadow
 page table is present; gpte.A must be set in this case.

 And we do not need to cache the gpte access rights into the spte; instead,
 we can atomically read the gpte to get this information (anyway, we need to
 read the gpte to get the gfn).
 
 Why do we need the gfn?  If !sp->unsync then it is unchanged (it's also
 in sp->gfns).
 
 Oh, needed for mark_page_dirty().
 
 Caching the gpte access in the spte will save decoding the access bits and
 sp->role.access.
 


Yes, we can directly get the gfn from sp->gfns (it is also safe for an unsync
sp, I think), but we want to have a quick check (whether the gfn can be
writable) before the shadow page table walk.

I mentioned it in the previous reply, see:
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/88934

What is your comment?


 I will do it in the next version.


  In order
  to avoid marking the spte writable after/during page write-protect, we do
  the trick like below:

    fast page fault path:
        lock RCU
        set identification in the spte

 What if you can't (already taken)?  Spin?  Slow path?


 In this patch, concurrent access to the same spte is allowed:
 each path freely sets its identification on the spte, because I did not
 want to introduce other atomic operations.

 You remind me that there may be a risk: if many vcpus fault on the
 same spte, they will retry the spte forever. Hmm, how about fixing it
 like this:

 if (spte.identification == 0) {
        set spte.identification = vcpu.id
        goto cmpxchg-path
 }

 if (spte.identification == vcpu.id)
        goto cmpxchg-path

 return to guest and retry the address again

 cmpxchg-path:
        do checks and cmpxchg

 It can ensure the spte can be updated.

 What's the difference between this and


   if test_and_set_bit(spte.lock)
return_to_guest
   else
do checks and cmpxchg

 ?



 test_and_set_bit is an atomic operation, which is what I want to avoid.
 
 Right.  Can you check what the effect is (with say
 test_and_set_bit(spte, 0) in the same path).
 
 I'm not convinced it's significant.
 


Okay, if you prefer test_and_set_bit, we may introduce two bits:
fast_pf and write_protect, and do things like this:

on fast page fault path:

old_spte = *spte

if (test_and_set_bit(spte.fast_pf))
return_to_guest

old_spte |= fast_pf

if (old_spte.wp)
clear-fast-pf bit and return_to_guest

if (!rmap.PTE_LIST_WRITE_PROTECT)
cmpxchg(spte, old_spte, old_spte +w - fast_pf)
else
clear-fast-pf bit

on page write-protect path:
        lock mmu-lock
        set rmap.PTE_LIST_WRITE_PROTECT
        smp_mb()
        if (spte.w || spte.fast_pf)
                spte -w -fast_pf + write_protect
        unlock mmu-lock

I think it works, but it can not make the page fault fast for an unsync sp
(e.g. two sptes point to the gfn, and a page fault on one of the sptes makes
the gfn's sp become unsync). It is not too bad, since we can cache the
access bits in the spte and properly make all sptes become writable on the
mmu_need_write_protect path in a new patchset.
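
In C, the fast-path side of that sketch would look roughly like the code
below (illustrative only; the fast_pf/write_protect bit positions in the
spte are assumed):

#define SPTE_FAST_PF_BIT	61	/* assumed spare bit */
#define SPTE_WP_BIT		60	/* assumed spare bit */

static bool fast_pf_try_make_writable(u64 *sptep, unsigned long *rmap)
{
	u64 old_spte;

	if (test_and_set_bit(SPTE_FAST_PF_BIT, (unsigned long *)sptep))
		return false;		/* busy: return to guest and retry */

	old_spte = ACCESS_ONCE(*sptep);

	if ((old_spte & (1ULL << SPTE_WP_BIT)) ||
	    (*rmap & PTE_LIST_WRITE_PROTECT)) {
		clear_bit(SPTE_FAST_PF_BIT, (unsigned long *)sptep);
		return false;
	}

	/* set W and drop fast_pf in a single cmpxchg */
	return cmpxchg64(sptep, old_spte,
			 (old_spte | PT_WRITABLE_MASK) &
			 ~(1ULL << SPTE_FAST_PF_BIT)) == old_spte;
}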



 The identification should be unique to avoid the below race:

 VCPU 0                     VCPU 1                     VCPU 2

 lock RCU
 spte + identification
 check conditions
                            do write-protect, clear
                            identification
                                                       lock RCU
                                                       set identification
 cmpxchg + w - identification
 OOPS!!!

 Is it not sufficient to use just two bits?

 pf_lock - taken by page fault path
 wp_lock - taken by write protect path

 pf cmpxchg checks both bits.



 If we just use two bits as identification, we have to use atomic
 operations to maintain these bits? Or did I misunderstand?

 Ah, you're using byte writes to set the identification?  Remember, I
 didn't read the patches yet.



 Yes.

 Please check the optimization manual, I remember there are warnings
 there about using different sized accesses for the 

buildbot failure in kvm on next-ppc64

2012-04-05 Thread kvm
The Buildbot has detected a new failure on builder next-ppc64 while building 
kvm.
Full details are available at:
 http://buildbot.b1-systems.de/kvm/builders/next-ppc64/builds/495

Buildbot URL: http://buildbot.b1-systems.de/kvm/

Buildslave for this Build: b1_kvm_1

Build Reason: The Nightly scheduler named 'nightly_next' triggered this build
Build Source Stamp: [branch next] HEAD
Blamelist: 

BUILD FAILED: failed compile

sincerely,
 -The Buildbot



buildbot failure in kvm on next-ppc44x

2012-04-05 Thread kvm
The Buildbot has detected a new failure on builder next-ppc44x while building 
kvm.
Full details are available at:
 http://buildbot.b1-systems.de/kvm/builders/next-ppc44x/builds/494

Buildbot URL: http://buildbot.b1-systems.de/kvm/

Buildslave for this Build: b1_kvm_1

Build Reason: The Nightly scheduler named 'nightly_next' triggered this build
Build Source Stamp: [branch next] HEAD
Blamelist: 

BUILD FAILED: failed compile

sincerely,
 -The Buildbot


Re: Questing regarding KVM Guest PMU

2012-04-05 Thread shashank rachamalla
On Thu, Apr 5, 2012 at 8:11 PM, Gleb Natapov g...@redhat.com wrote:
 On Thu, Apr 05, 2012 at 05:38:40PM +0300, Avi Kivity wrote:
 On 04/05/2012 04:57 PM, Gleb Natapov wrote:
   
    Maybe it used NMI-based profiling. We should ask the oprofile developers.
    As I said, I am almost sure my inability to run it on a host is probably
    PEBKAC, although I ran exactly the same script on the host and the
    guest (the script is from the first email of this thread).
  
   After upgrading the kernel to the latest git from whatever was there, the
   same script works and counts CPU_CLK_UNHALT events.
 

 This is even while it violates the Intel guidelines?

  Yes, but who says the result is correct :) It seems that we handle the
  global ctrl msr wrong. That is, a counter can be enabled either in global
  ctrl or in eventsel. Trying to confirm that.

If that becomes true, then will the global ctrl MSR have any significance?
 --
                        Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-05 Thread Xiao Guangrong
On 04/06/2012 05:57 AM, Xiao Guangrong wrote:



 What's the difference between this and


   if test_and_set_bit(spte.lock)
return_to_guest
   else
do checks and cmpxchg

 ?



  test_and_set_bit is an atomic operation, which is what I want to avoid.

 Right.  Can you check what the effect is (with say
 test_and_set_bit(spte, 0) in the same path).

 I'm not convinced it's significant.

 
 
  Okay, if you prefer test_and_set_bit, we may introduce two bits:
  fast_pf and write_protect, and do things like this:
 
 on fast page fault path:
 
   old_spte = *spte
 
   if (test_and_set_bit(spte.fast_pf))
   return_to_guest
 
   old_spte |= fast_pf
 
   if (old_spte.wp)
   clear-fast-pf bit and return_to_guest
 
   if (!rmap.PTE_LIST_WRITE_PROTECT)
   cmpxchg(spte, old_spte, old_spte +w - fast_pf)
   else
   clear-fast-pf bit
 
  on page write-protect path:
        lock mmu-lock
        set rmap.PTE_LIST_WRITE_PROTECT
        smp_mb()
        if (spte.w || spte.fast_pf)
                spte -w -fast_pf + write_protect
        unlock mmu-lock
 
  I think it works, but it can not make the page fault fast for an unsync sp
  (e.g. two sptes point to the gfn, and a page fault on one of the sptes
  makes the gfn's sp become unsync). It is not too bad, since we can cache
  the access bits in the spte and properly make all sptes become writable on
  the mmu_need_write_protect path in a new patchset.
 


Foolish me, I must be crazy. Sorry for my mistake. :(

Unfortunately, it can not work: we can not get a stable gfn from the gpte
or from sp->gfns[]. For example:

beginning:
Gpte = Gfn1
gfn_to_pfn(Gfn1) = Pfn
Spte = Pfn
Gfn1 is write-free
Gfn2 is write-protected

VCPU 0                          VCPU 1                       VCPU 2

fault on gpte
fast page fault path:
  set Spte.fast_pf
  get Gfn1 from Gpte/sp->gfns[]
  if (Gfn1 is writable)
                                Pfn is swapped out:
                                Spte = 0
                                Gpte is modified to Gfn2,
                                and Pfn is realloced and remapped
                                to Gfn2, so:
                                Spte = Pfn
                                                             fast page fault path:
                                                             set Spte.fast_pf
  cmpxchg Spte+w
  OOPS!!!
  we see Spte is not changed and
  happily make it writable, so Gfn2 can be writable

It seems only a unique identification can prevent this. :(

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

