Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls

2015-01-05 Thread Borislav Petkov
On Fri, Nov 07, 2014 at 03:58:17PM -0800, Andy Lutomirski wrote:
 The int_ret_from_sys_call and syscall tracing code disagrees with
 the sysret path as to the value of RCX.
 
 The Intel SDM, the AMD APM, and my laptop all agree that sysret
 returns with RCX == RIP.  The syscall tracing code does not respect
 this property.
 
 For example, this program:
 
 int main()
 {
   extern const char syscall_rip[];
   unsigned long rcx = 1;
   unsigned long orig_rcx = rcx;
   asm ("mov $-1, %%eax\n\t"
        "syscall\n\t"
        "syscall_rip:"
        : "+c" (rcx) : : "r11");
   printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
          rcx, (unsigned long)syscall_rip, orig_rcx);
   return 0;
 }
 
 prints:
 syscall: RCX = 400556  RIP = 400556  orig RCX = 1
 
 Running it under strace gives this instead:
 syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 400556  orig RCX = 1

I can trigger the same even without tracing it:

syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 40052C  orig RCX = 1

 This changes FIXUP_TOP_OF_STACK to match sysret, causing the test to
 show RCX == RIP even under strace.
 
 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  arch/x86/kernel/entry_64.S | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
 index df088bb03fb3..3710b8241945 100644
 --- a/arch/x86/kernel/entry_64.S
 +++ b/arch/x86/kernel/entry_64.S
 @@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
   movq \tmp,RSP+\offset(%rsp)
   movq $__USER_DS,SS+\offset(%rsp)
   movq $__USER_CS,CS+\offset(%rsp)
 - movq $-1,RCX+\offset(%rsp)
 + movq RIP+\offset(%rsp),\tmp  /* get rip */
 + movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
   movq R11+\offset(%rsp),\tmp  /* get eflags */
   movq \tmp,EFLAGS+\offset(%rsp)
   .endm
 --

For some reason this patch is causing ata resets on my box, see the
end of this mail. So something's not kosher yet. If I boot the kernel
without it, it all seems ok.

Btw, this change got introduced in 2002: it used to return rIP in
%rcx before, but got changed to return -1 instead, for some reason.

commit af53c7a2c81399b805b6d4eff887401a5e50feef
Author: Andi Kleen a...@muc.de
Date:   Fri Apr 19 20:23:17 2002 -0700

[PATCH] x86-64 architecture specific sync for 2.5.8

This patch brings 2.5.8 in sync with the x86-64 2.4 development tree again
(excluding device drivers)

It has lots of bug fixes and enhancements. It only touches architecture
specific files.

...

diff --git a/arch/x86_64/kernel/entry.S b/arch/x86_64/kernel/entry.S
index 6b98b90891f4..16c6e3faf5a7 100644
--- a/arch/x86_64/kernel/entry.S
+++ b/arch/x86_64/kernel/entry.S
@@ -5,7 +5,7 @@
  *  Copyright (C) 2000, 2001, 2002  Andi Kleen SuSE Labs
  *  Copyright (C) 2000  Pavel Machek pa...@suse.cz
  * 
- *  $Id: entry.S,v 1.66 2001/11/11 17:47:47 ak Exp $   
+ *  $Id$
  */
 
 /*
@@ -39,8 +39,7 @@
 #include <asm/msr.h>
 #include <asm/unistd.h>
 #include <asm/thread_info.h>
-   
-#define RIP_SYMBOL_NAME(x) x(%rip)
+#include <asm/hw_irq.h>
 
.code64
 
@@ -67,8 +66,7 @@
 	movq	\tmp,RSP(%rsp)
 	movq	$__USER_DS,SS(%rsp)
 	movq	$__USER_CS,CS(%rsp)
-	movq	RCX(%rsp),\tmp	/* get return address */
-	movq	\tmp,RIP(%rsp)
+	movq	$-1,RCX(%rsp)
 	movq	R11(%rsp),\tmp	/* get eflags */
 	movq	\tmp,EFLAGS(%rsp)
.endm
@@ -76,8 +74,6 @@
.macro RESTORE_TOP_OF_STACK tmp,offset=0
movq   RSP-\offset(%rsp),\tmp
movq   \tmp,PDAREF(pda_oldrsp)
-   movq   RIP-\offset(%rsp),\tmp
-   movq   \tmp,RCX-\offset(%rsp)
movq   EFLAGS-\offset(%rsp),\tmp
movq   \tmp,R11-\offset(%rsp)
.endm

---

[  180.059170] ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x6 
frozen
[  180.066873] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 4096 
out
[  180.072158]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
(timeout)
[  180.086912] ata1.00: status: { DRDY }
[  180.090591] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.095846] ata1.00: cmd 61/08:08:18:ae:d9/00:00:23:00:00/40 tag 1 ncq 4096 
out
[  180.095846]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
(timeout)
[  180.110603] ata1.00: status: { DRDY }
[  180.114283] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.119539] ata1.00: cmd 61/10:10:f0:b1:d9/00:00:23:00:00/40 tag 2 ncq 8192 
out
[  180.119539]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
(timeout)
[  180.134292] ata1.00: status: { DRDY }
[  180.137973] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.143226] ata1.00: cmd 61/08:18:00:98:18/00:00:1d:00:00/40 tag 3 ncq 4096 
out
[  180.143226]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
(timeout)
[  180.158105] ata1.00: status: { DRDY }
[  180.161809] 

Re: Fw: Benchmarking for vhost polling patch

2015-01-05 Thread Michael S. Tsirkin
Hi Razya,
Thanks for the update.
So that's reasonable I think, and I think it makes sense
to keep working on this in isolation - it's more
manageable at this size.

The big questions in my mind:
- What happens if system is lightly loaded?
  E.g. a ping/pong benchmark. How much extra CPU are
  we wasting?
- We see the best performance on your system is with 10usec worth of polling.
  It's OK to be able to tune it for best performance, but
  most people don't have the time or the inclination.
  So what would be the best value for other CPUs?
- Should this be tunable from userspace per vhost instance?
  Why is it only tunable globally?
- How bad is it if you don't pin vhost and vcpu threads?
  Is the scheduler smart enough to pull them apart?
- What happens in overcommit scenarios? Does polling make things
  much worse?
  Clearly polling will work worse if e.g. vhost and vcpu
  share the host cpu. How can we avoid conflicts?

  For the last two questions, better cooperation with the host scheduler will
  likely help here.
  See e.g.  http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
  I'm currently looking at pushing something similar upstream;
  if it goes in, vhost polling can do something similar.

Any data points to shed light on these questions?

On Thu, Jan 01, 2015 at 02:59:21PM +0200, Razya Ladelsky wrote:
 Hi Michael,
 Just a follow up on the polling patch numbers,..
 Please let me know if you find these numbers satisfying enough to continue 
 with submitting this patch.
 Otherwise - we'll have this patch submitted as part of the larger Elvis 
 patch set rather than independently.
 Thank you,
 Razya 
 
 - Forwarded by Razya Ladelsky/Haifa/IBM on 01/01/2015 09:37 AM -
 
 From:   Razya Ladelsky/Haifa/IBM@IBMIL
 To: m...@redhat.com
 Cc: 
 Date:   25/11/2014 02:43 PM
 Subject:Re: Benchmarking for vhost polling patch
 Sent by:kvm-ow...@vger.kernel.org
 
 
 
 Hi Michael,
 
  Hi Razya,
  On the netperf benchmark, it looks like polling=10 gives a modest but
  measurable gain.  So from that perspective it might be worth it if it's
  not too much code, though we'll need to spend more time checking the
  macro effect - we barely moved the needle on the macro benchmark and
  that is suspicious.
 
 I ran memcached with various values for the key & value arguments, and
 managed to see a bigger impact of polling than when I used the default
 values. Here are the numbers:
 
 key=250     TPS      net    vhost  vm    TPS/cpu  TPS/CPU
 value=2048           rate   util   util           change
 
 polling=0   101540   103.0  46   100   695.47
 polling=5   136747   123.0  83   100   747.25   0.074440609
 polling=7   140722   125.7  84   100   764.79   0.099663658
 polling=10  141719   126.3  87   100   757.85   0.089688003
 polling=15  142430   127.1  90   100   749.63   0.077863015
 polling=25  146347   128.7  95   100   750.49   0.079107993
 polling=50  150882   131.1  100  100   754.41   0.084733701
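 
 (Reading the table: TPS/cpu appears to be TPS divided by the total CPU
 utilization, i.e. vhost util + vm util; for example, for polling=0:
 101540 / (46 + 100) = 695.47, and for polling=5: 136747 / (83 + 100) = 747.25.
 The last column then appears to be the relative change versus polling=0,
 e.g. 747.25 / 695.47 - 1 = 0.0744.)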
 
 Macro benchmarks are less I/O intensive than the micro benchmark, which is
 why we can expect less impact for polling as compared to netperf.
 However, as shown above, we managed to get a 10% TPS/CPU improvement with
 the polling patch.
 
  Is there a chance you are actually trading latency for throughput?
  do you observe any effect on latency?
 
 No.
 
  How about trying some other benchmark, e.g. NFS?
  
 
 Tried, but didn't have enough I/O produced (vhost was at most at 15% util)

OK but was there a regression in this case?


  
  Also, I am wondering:
  
  since vhost thread is polling in kernel anyway, shouldn't
  we try and poll the host NIC?
  that would likely reduce at least the latency significantly,
  won't it?
  
 
 Yes, it could be a great addition at some point, but needs a thorough 
 investigation. In any case, not a part of this patch...
 
 Thanks,
 Razya
 


Re: cpu frequency

2015-01-05 Thread Nerijus Baliunas
Nerijus Baliunas nerijus at users.sourceforge.net writes:

 Paolo Bonzini pbonzini at redhat.com writes:
 
  In the case of Windows it's probably some timing loop that is executed
  at startup, and the result depends on frequency scaling in the host.
  Try adding this to the XML in the meanwhile, and see if the control
  panel shows the same value:
  
   Inside <features>:
   
     <hyperv>
       <relaxed state='on'/>
     </hyperv>
   
   Inside <clock offset='localtime'>:
   
     <timer name='hypervclock' present='yes'/>
 
 So far Control Panel - System shows CPU as 2.2 GHz, I rebooted once. So it 
 seems OK.

Unfortunately after the host reboot the problem reappeared once. It helped to 
reboot the VM. Any ideas what else to try?

Regards,
Nerijus




Kernel options for virtio-net

2015-01-05 Thread Brady Dean
I have a base Linux From Scratch installation in Virtualbox and I need
virtio-net in the kernel so I can use the virtio-net adapter through
Virtualbox.

I enabled the options listed here: www.linux-kvm.org/page/Virtio but
the network interface does not show up.

I was wondering if there are more kernel options I need to enable and
if there are any KVM packages I need to install in the guest.

I am using kernel 3.16.2.

Thanks a lot,

Brady


Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Marcelo Tosatti
On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
 On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
  The pvclock vdso code was too abstracted to understand easily and
  excessively paranoid.  Simplify it for a huge speedup.
 
  This opens the door for additional simplifications, as the vdso no
  longer accesses the pvti for any vcpu other than vcpu 0.
 
  Before, vclock_gettime using kvm-clock took about 64ns on my machine.
  With this change, it takes 19ns, which is almost as fast as the pure TSC
  implementation.
 
  Signed-off-by: Andy Lutomirski l...@amacapital.net
  ---
   arch/x86/vdso/vclock_gettime.c | 82 
  --
   1 file changed, 47 insertions(+), 35 deletions(-)
 
  diff --git a/arch/x86/vdso/vclock_gettime.c 
  b/arch/x86/vdso/vclock_gettime.c
  index 9793322751e0..f2e0396d5629 100644
  --- a/arch/x86/vdso/vclock_gettime.c
  +++ b/arch/x86/vdso/vclock_gettime.c
  @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info 
  *get_pvti(int cpu)
 
   static notrace cycle_t vread_pvclock(int *mode)
   {
  - const struct pvclock_vsyscall_time_info *pvti;
   + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
cycle_t ret;
  - u64 last;
  - u32 version;
  - u8 flags;
  - unsigned cpu, cpu1;
  -
  + u64 tsc, pvti_tsc;
  + u64 last, delta, pvti_system_time;
  + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
 
/*
  -  * Note: hypervisor must guarantee that:
  -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
  -  * 2. that per-CPU pvclock time info is updated if the
  -  *underlying CPU changes.
  -  * 3. that version is increased whenever underlying CPU
  -  *changes.
  +  * Note: The kernel and hypervisor must guarantee that cpu ID
  +  * number maps 1:1 to per-CPU pvclock time info.
  +  *
  +  * Because the hypervisor is entirely unaware of guest userspace
  +  * preemption, it cannot guarantee that per-CPU pvclock time
  +  * info is updated if the underlying CPU changes or that that
  +  * version is increased whenever underlying CPU changes.
  +  *
  +  * On KVM, we are guaranteed that pvti updates for any vCPU are
  +  * atomic as seen by *all* vCPUs.  This is an even stronger
  +  * guarantee than we get with a normal seqlock.
 *
  +  * On Xen, we don't appear to have that guarantee, but Xen still
  +  * supplies a valid seqlock using the version field.
  +
  +  * We only do pvclock vdso timing at all if
  +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
  +  * mean that all vCPUs have matching pvti and that the TSC is
  +  * synced, so we can just look at vCPU 0's pvti.
 */
 
  Can Xen guarantee that ?
 
 I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
 at all.  I have no idea going forward, though.
 
 Xen people?
 
 
  - do {
   - cpu = __getcpu() & VGETCPU_CPU_MASK;
  - /* TODO: We can put vcpu id into higher bits of pvti.version.
  -  * This will save a couple of cycles by getting rid of
  -  * __getcpu() calls (Gleb).
  -  */
  -
  - pvti = get_pvti(cpu);
  -
   - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
  -
  - /*
  -  * Test we're still on the cpu as well as the version.
  -  * We could have been migrated just after the first
  -  * vgetcpu but before fetching the version, so we
  -  * wouldn't notice a version change.
  -  */
   - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
   - } while (unlikely(cpu != cpu1 ||
   -   (pvti->pvti.version & 1) ||
   -   pvti->pvti.version != version));
  -
   - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
   +
   + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
*mode = VCLOCK_NONE;
  + return 0;
  + }
 
  This check must be performed after reading a stable pvti.
 
 
 We can even read it in the middle, guarded by the version checks.
 I'll do that for v2.
 
  +
  + do {
   + version = pvti->version;
  +
  + /* This is also a read barrier, so we'll read version first. 
  */
  + rdtsc_barrier();
  + tsc = __native_read_tsc();
  +
   + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
   + pvti_tsc_shift = pvti->tsc_shift;
   + pvti_system_time = pvti->system_time;
   + pvti_tsc = pvti->tsc_timestamp;
  +
  + /* Make sure that the version double-check is last. */
  + smp_rmb();
   + } while (unlikely((version & 1) || version != pvti->version));
  +
  + delta = tsc - pvti_tsc;
  + ret = 

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Andy Lutomirski
On Mon, Jan 5, 2015 at 11:17 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
 On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
  The pvclock vdso code was too abstracted to understand easily and
  excessively paranoid.  Simplify it for a huge speedup.
 
  This opens the door for additional simplifications, as the vdso no
  longer accesses the pvti for any vcpu other than vcpu 0.
 
  Before, vclock_gettime using kvm-clock took about 64ns on my machine.
  With this change, it takes 19ns, which is almost as fast as the pure TSC
  implementation.
 
  Signed-off-by: Andy Lutomirski l...@amacapital.net
  ---
   arch/x86/vdso/vclock_gettime.c | 82 
  --
   1 file changed, 47 insertions(+), 35 deletions(-)
 
  diff --git a/arch/x86/vdso/vclock_gettime.c 
  b/arch/x86/vdso/vclock_gettime.c
  index 9793322751e0..f2e0396d5629 100644
  --- a/arch/x86/vdso/vclock_gettime.c
  +++ b/arch/x86/vdso/vclock_gettime.c
  @@ -78,47 +78,59 @@ static notrace const struct 
  pvclock_vsyscall_time_info *get_pvti(int cpu)
 
   static notrace cycle_t vread_pvclock(int *mode)
   {
  - const struct pvclock_vsyscall_time_info *pvti;
  + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
cycle_t ret;
  - u64 last;
  - u32 version;
  - u8 flags;
  - unsigned cpu, cpu1;
  -
  + u64 tsc, pvti_tsc;
  + u64 last, delta, pvti_system_time;
  + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
 
/*
  -  * Note: hypervisor must guarantee that:
  -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
  -  * 2. that per-CPU pvclock time info is updated if the
  -  *underlying CPU changes.
  -  * 3. that version is increased whenever underlying CPU
  -  *changes.
  +  * Note: The kernel and hypervisor must guarantee that cpu ID
  +  * number maps 1:1 to per-CPU pvclock time info.
  +  *
  +  * Because the hypervisor is entirely unaware of guest userspace
  +  * preemption, it cannot guarantee that per-CPU pvclock time
  +  * info is updated if the underlying CPU changes or that that
  +  * version is increased whenever underlying CPU changes.
  +  *
  +  * On KVM, we are guaranteed that pvti updates for any vCPU are
  +  * atomic as seen by *all* vCPUs.  This is an even stronger
  +  * guarantee than we get with a normal seqlock.
 *
  +  * On Xen, we don't appear to have that guarantee, but Xen still
  +  * supplies a valid seqlock using the version field.
  +
  +  * We only do pvclock vdso timing at all if
  +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
  +  * mean that all vCPUs have matching pvti and that the TSC is
  +  * synced, so we can just look at vCPU 0's pvti.
 */
 
  Can Xen guarantee that ?

 I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
 at all.  I have no idea going forward, though.

 Xen people?

 
  - do {
  - cpu = __getcpu()  VGETCPU_CPU_MASK;
  - /* TODO: We can put vcpu id into higher bits of 
  pvti.version.
  -  * This will save a couple of cycles by getting rid of
  -  * __getcpu() calls (Gleb).
  -  */
  -
  - pvti = get_pvti(cpu);
  -
  - version = __pvclock_read_cycles(pvti-pvti, ret, flags);
  -
  - /*
  -  * Test we're still on the cpu as well as the version.
  -  * We could have been migrated just after the first
  -  * vgetcpu but before fetching the version, so we
  -  * wouldn't notice a version change.
  -  */
  - cpu1 = __getcpu()  VGETCPU_CPU_MASK;
  - } while (unlikely(cpu != cpu1 ||
  -   (pvti-pvti.version  1) ||
  -   pvti-pvti.version != version));
  -
  - if (unlikely(!(flags  PVCLOCK_TSC_STABLE_BIT)))
  +
  + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
*mode = VCLOCK_NONE;
  + return 0;
  + }
 
  This check must be performed after reading a stable pvti.
 

 We can even read it in the middle, guarded by the version checks.
 I'll do that for v2.

  +
  + do {
  + version = pvti-version;
  +
  + /* This is also a read barrier, so we'll read version 
  first. */
  + rdtsc_barrier();
  + tsc = __native_read_tsc();
  +
  + pvti_tsc_to_system_mul = pvti-tsc_to_system_mul;
  + pvti_tsc_shift = pvti-tsc_shift;
  + pvti_system_time = pvti-system_time;
  + pvti_tsc = pvti-tsc_timestamp;
  +
  + /* Make sure that the version double-check is last. */
  + smp_rmb();
  + } while (unlikely((version  1) || version 

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Andy Lutomirski
On Mon, Jan 5, 2015 at 2:48 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Mon, Jan 05, 2015 at 02:38:46PM -0800, Andy Lutomirski wrote:
 On Mon, Jan 5, 2015 at 11:17 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
  On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
   On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
   The pvclock vdso code was too abstracted to understand easily and
   excessively paranoid.  Simplify it for a huge speedup.
  
   This opens the door for additional simplifications, as the vdso no
   longer accesses the pvti for any vcpu other than vcpu 0.
  
   Before, vclock_gettime using kvm-clock took about 64ns on my machine.
   With this change, it takes 19ns, which is almost as fast as the pure 
   TSC
   implementation.
  
   Signed-off-by: Andy Lutomirski l...@amacapital.net
   ---
arch/x86/vdso/vclock_gettime.c | 82 
   --
1 file changed, 47 insertions(+), 35 deletions(-)
  
   diff --git a/arch/x86/vdso/vclock_gettime.c 
   b/arch/x86/vdso/vclock_gettime.c
   index 9793322751e0..f2e0396d5629 100644
   --- a/arch/x86/vdso/vclock_gettime.c
   +++ b/arch/x86/vdso/vclock_gettime.c
   @@ -78,47 +78,59 @@ static notrace const struct 
   pvclock_vsyscall_time_info *get_pvti(int cpu)
  
static notrace cycle_t vread_pvclock(int *mode)
{
   - const struct pvclock_vsyscall_time_info *pvti;
   + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
 cycle_t ret;
   - u64 last;
   - u32 version;
   - u8 flags;
   - unsigned cpu, cpu1;
   -
   + u64 tsc, pvti_tsc;
   + u64 last, delta, pvti_system_time;
   + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
  
 /*
   -  * Note: hypervisor must guarantee that:
   -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
   -  * 2. that per-CPU pvclock time info is updated if the
   -  *underlying CPU changes.
   -  * 3. that version is increased whenever underlying CPU
   -  *changes.
   +  * Note: The kernel and hypervisor must guarantee that cpu ID
   +  * number maps 1:1 to per-CPU pvclock time info.
   +  *
   +  * Because the hypervisor is entirely unaware of guest userspace
   +  * preemption, it cannot guarantee that per-CPU pvclock time
   +  * info is updated if the underlying CPU changes or that that
   +  * version is increased whenever underlying CPU changes.
   +  *
   +  * On KVM, we are guaranteed that pvti updates for any vCPU are
   +  * atomic as seen by *all* vCPUs.  This is an even stronger
   +  * guarantee than we get with a normal seqlock.
  *
   +  * On Xen, we don't appear to have that guarantee, but Xen still
   +  * supplies a valid seqlock using the version field.
   +
   +  * We only do pvclock vdso timing at all if
   +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
   +  * mean that all vCPUs have matching pvti and that the TSC is
   +  * synced, so we can just look at vCPU 0's pvti.
  */
  
   Can Xen guarantee that ?
 
  I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
  at all.  I have no idea going forward, though.
 
  Xen people?
 
  
   - do {
   - cpu = __getcpu()  VGETCPU_CPU_MASK;
   - /* TODO: We can put vcpu id into higher bits of 
   pvti.version.
   -  * This will save a couple of cycles by getting rid of
   -  * __getcpu() calls (Gleb).
   -  */
   -
   - pvti = get_pvti(cpu);
   -
   - version = __pvclock_read_cycles(pvti-pvti, ret, 
   flags);
   -
   - /*
   -  * Test we're still on the cpu as well as the version.
   -  * We could have been migrated just after the first
   -  * vgetcpu but before fetching the version, so we
   -  * wouldn't notice a version change.
   -  */
   - cpu1 = __getcpu()  VGETCPU_CPU_MASK;
   - } while (unlikely(cpu != cpu1 ||
   -   (pvti-pvti.version  1) ||
   -   pvti-pvti.version != version));
   -
   - if (unlikely(!(flags  PVCLOCK_TSC_STABLE_BIT)))
   +
   + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
 *mode = VCLOCK_NONE;
   + return 0;
   + }
  
   This check must be performed after reading a stable pvti.
  
 
  We can even read it in the middle, guarded by the version checks.
  I'll do that for v2.
 
   +
   + do {
   + version = pvti-version;
   +
   + /* This is also a read barrier, so we'll read version 
   first. */
   + rdtsc_barrier();
   + tsc = __native_read_tsc();
   +
   + pvti_tsc_to_system_mul = pvti-tsc_to_system_mul;
   + 

Re: RECEIVE YOUR ATM CARD BEFORE 23RD OF DECEMBER 2014

2015-01-05 Thread Tage Werner
 ACCESS BANK PLC ATM DEPARTMENT accessb575 at accountant.com writes:

 
 THIS ACCESS BANK PLC WANT TO INFORM YOU THAT YOUR ATM CARD IS READY, 
THAT IF YOU NEED IT, YOU MUST PAY THE $98. IF
 YOU ARE READY, MAKE SURE YOU SEND ME YOUR FULL NAMES AND YOUR DIRECT 
TELEPHONE NUMBER FOR ME TO CALL YOU SO
 THAT YOU CAN PAY DIRECTLY TO OUR ACCOUNT OFFICER.
 
 Thanks,
 
 DR. CHRIS MICHAEL
 FROM ACCESS BANK PLC
 E-MAIL: accessb575 at gmail.com
 


 
IS THIS A SCAM OR IS THAT A CORRECT MAIL??

PLEASE BE BACK WITH YOUR ANSWER.

TANKS
TAGE WERNER



Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Paolo Bonzini


On 05/01/2015 19:56, Andy Lutomirski wrote:
  1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
  1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
  transition.
  2) vCPU-1 updates its pvti with new values.
  3) vCPU-0 still has not updated its pvti with new values.
  4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
  notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
 
  The update is not actually atomic across all vCPUs, it's atomic in
  the sense of not allowing visibility of distinct
  system_timestamp/tsc_timestamp values.
 
 Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
 it gets marked unstable?  Otherwise the vdso could just as
 easily be called from vCPU-1, migrated to vCPU-0, read the data
 complete with stale stable bit, and get migrated back to vCPU-1.
 
 But I thought that KVM currently froze all vCPUs when updating pvti
 for any of them.  How can this happen?  I admit I don't really
 understand the update request code.

That was also my understanding.  I thought this was the point of
kvm_make_mclock_inprogress_request/KVM_REQ_MCLOCK_INPROGRESS.

Disabling TSC_STABLE_BIT is triggered by pvclock_gtod_update_fn but it
happens in kvm_gen_update_masterclock, and no guest entries will happen
in the meanwhile.

Paolo


Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Marcelo Tosatti
On Mon, Jan 05, 2015 at 02:38:46PM -0800, Andy Lutomirski wrote:
 On Mon, Jan 5, 2015 at 11:17 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
  On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
   On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
   The pvclock vdso code was too abstracted to understand easily and
   excessively paranoid.  Simplify it for a huge speedup.
  
   This opens the door for additional simplifications, as the vdso no
   longer accesses the pvti for any vcpu other than vcpu 0.
  
   Before, vclock_gettime using kvm-clock took about 64ns on my machine.
   With this change, it takes 19ns, which is almost as fast as the pure TSC
   implementation.
  
   Signed-off-by: Andy Lutomirski l...@amacapital.net
   ---
arch/x86/vdso/vclock_gettime.c | 82 
   --
1 file changed, 47 insertions(+), 35 deletions(-)
  
   diff --git a/arch/x86/vdso/vclock_gettime.c 
   b/arch/x86/vdso/vclock_gettime.c
   index 9793322751e0..f2e0396d5629 100644
   --- a/arch/x86/vdso/vclock_gettime.c
   +++ b/arch/x86/vdso/vclock_gettime.c
   @@ -78,47 +78,59 @@ static notrace const struct 
   pvclock_vsyscall_time_info *get_pvti(int cpu)
  
static notrace cycle_t vread_pvclock(int *mode)
{
   - const struct pvclock_vsyscall_time_info *pvti;
   + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
 cycle_t ret;
   - u64 last;
   - u32 version;
   - u8 flags;
   - unsigned cpu, cpu1;
   -
   + u64 tsc, pvti_tsc;
   + u64 last, delta, pvti_system_time;
   + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
  
 /*
   -  * Note: hypervisor must guarantee that:
   -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
   -  * 2. that per-CPU pvclock time info is updated if the
   -  *underlying CPU changes.
   -  * 3. that version is increased whenever underlying CPU
   -  *changes.
   +  * Note: The kernel and hypervisor must guarantee that cpu ID
   +  * number maps 1:1 to per-CPU pvclock time info.
   +  *
   +  * Because the hypervisor is entirely unaware of guest userspace
   +  * preemption, it cannot guarantee that per-CPU pvclock time
   +  * info is updated if the underlying CPU changes or that that
   +  * version is increased whenever underlying CPU changes.
   +  *
   +  * On KVM, we are guaranteed that pvti updates for any vCPU are
   +  * atomic as seen by *all* vCPUs.  This is an even stronger
   +  * guarantee than we get with a normal seqlock.
  *
   +  * On Xen, we don't appear to have that guarantee, but Xen still
   +  * supplies a valid seqlock using the version field.
   +
   +  * We only do pvclock vdso timing at all if
   +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
   +  * mean that all vCPUs have matching pvti and that the TSC is
   +  * synced, so we can just look at vCPU 0's pvti.
  */
  
   Can Xen guarantee that ?
 
  I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
  at all.  I have no idea going forward, though.
 
  Xen people?
 
  
   - do {
   - cpu = __getcpu()  VGETCPU_CPU_MASK;
   - /* TODO: We can put vcpu id into higher bits of 
   pvti.version.
   -  * This will save a couple of cycles by getting rid of
   -  * __getcpu() calls (Gleb).
   -  */
   -
   - pvti = get_pvti(cpu);
   -
   - version = __pvclock_read_cycles(pvti-pvti, ret, 
   flags);
   -
   - /*
   -  * Test we're still on the cpu as well as the version.
   -  * We could have been migrated just after the first
   -  * vgetcpu but before fetching the version, so we
   -  * wouldn't notice a version change.
   -  */
   - cpu1 = __getcpu()  VGETCPU_CPU_MASK;
   - } while (unlikely(cpu != cpu1 ||
   -   (pvti-pvti.version  1) ||
   -   pvti-pvti.version != version));
   -
   - if (unlikely(!(flags  PVCLOCK_TSC_STABLE_BIT)))
   +
   + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
 *mode = VCLOCK_NONE;
   + return 0;
   + }
  
   This check must be performed after reading a stable pvti.
  
 
  We can even read it in the middle, guarded by the version checks.
  I'll do that for v2.
 
   +
   + do {
   + version = pvti-version;
   +
   + /* This is also a read barrier, so we'll read version 
   first. */
   + rdtsc_barrier();
   + tsc = __native_read_tsc();
   +
   + pvti_tsc_to_system_mul = pvti-tsc_to_system_mul;
   + pvti_tsc_shift = pvti-tsc_shift;
   + pvti_system_time = pvti-system_time;
 

RE: [v3 00/26] Add VT-d Posted-Interrupts support

2015-01-05 Thread Wu, Feng
Ping...

Hi Joerg & David,

Could you please have a look at the IOMMU part of this series (patch 02 - 04, 
patch 06 - 09 , patch 26)?

Hi Thomas, Ingo, & Peter,

Could you please have a look at this series, especially for patch 01, 05, 21?

Thanks,
Feng

 -Original Message-
 From: Wu, Feng
 Sent: Friday, December 12, 2014 11:15 PM
 To: t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org;
 g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org;
 j...@8bytes.org; alex.william...@redhat.com; jiang@linux.intel.com
 Cc: eric.au...@linaro.org; linux-ker...@vger.kernel.org;
 io...@lists.linux-foundation.org; kvm@vger.kernel.org; Wu, Feng
 Subject: [v3 00/26] Add VT-d Posted-Interrupts support
 
 VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
 With VT-d Posted-Interrupts enabled, external interrupts from
 direct-assigned devices can be delivered to guests without VMM
 intervention when guest is running in non-root mode.
 
 You can find the VT-d Posted-Interrupts Spec. in the following URL:
 http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog
 y/vt-directed-io-spec.html
 
 v1->v2:
 * Use VFIO framework to enable this feature; the VFIO part of this series is
   based on Eric's patch [PATCH v3 0/8] KVM-VFIO IRQ forward control
 * Rebase this patchset on
   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git,
   then revise some irq logic based on the new hierarchy irqdomain patches
   provided by Jiang Liu jiang@linux.intel.com
 
 v2->v3:
 * Adjust the Posted-interrupts Descriptor updating logic when vCPU is
   preempted or blocked.
 * KVM_DEV_VFIO_DEVICE_POSTING_IRQ --> KVM_DEV_VFIO_DEVICE_POST_IRQ
 * __KVM_HAVE_ARCH_KVM_VFIO_POSTING --> __KVM_HAVE_ARCH_KVM_VFIO_POST
 * Add KVM_DEV_VFIO_DEVICE_UNPOST_IRQ attribute for VFIO irq, which
   can be used to change back to remapping mode.
 * Fix typo
 
 This patch series is made of the following groups:
 1-6: Some preparation changes in iommu and irq component, this is based on
 the
  new hierarchy irqdomain logic.
 7-9, 26: IOMMU changes for VT-d Posted-Interrupts, such as, feature detection,
   command line parameter.
 10-17, 22-25: Changes related to KVM itself.
 18-20: Changes in VFIO component, this part was previously sent out as
 [RFC PATCH v2 0/2] kvm-vfio: implement the vfio skeleton for VT-d
 Posted-Interrupts
 21: x86 irq related changes
 
 Feng Wu (26):
   genirq: Introduce irq_set_vcpu_affinity() to target an interrupt to a
 VCPU
   iommu: Add new member capability to struct irq_remap_ops
   iommu, x86: Define new irte structure for VT-d Posted-Interrupts
   iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip
   x86, irq: Implement irq_set_vcpu_affinity for pci_msi_ir_controller
   iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
   iommu, x86: Add cap_pi_support() to detect VT-d PI capability
   iommu, x86: Add intel_irq_remapping_capability() for Intel
   iommu, x86: define irq_remapping_cap()
   KVM: change struct pi_desc for VT-d Posted-Interrupts
   KVM: Add some helper functions for Posted-Interrupts
   KVM: Initialize VT-d Posted-Interrupts Descriptor
   KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
   KVM: Get Posted-Interrupts descriptor address from struct kvm_vcpu
   KVM: add interfaces to control PI outside vmx
   KVM: Make struct kvm_irq_routing_table accessible
   KVM: make kvm_set_msi_irq() public
   KVM: kvm-vfio: User API for VT-d Posted-Interrupts
   KVM: kvm-vfio: implement the VFIO skeleton for VT-d Posted-Interrupts
   KVM: x86: kvm-vfio: VT-d posted-interrupts setup
   x86, irq: Define a global vector for VT-d Posted-Interrupts
   KVM: Define a wakeup worker thread for vCPU
   KVM: Update Posted-Interrupts Descriptor when vCPU is preempted
   KVM: Update Posted-Interrupts Descriptor when vCPU is blocked
   KVM: Suppress posted-interrupt when 'SN' is set
   iommu/vt-d: Add a command line parameter for VT-d posted-interrupts
 
  Documentation/kernel-parameters.txt|   1 +
  Documentation/virtual/kvm/devices/vfio.txt |   9 ++
  arch/x86/include/asm/entry_arch.h  |   2 +
  arch/x86/include/asm/hardirq.h |   1 +
  arch/x86/include/asm/hw_irq.h  |   2 +
  arch/x86/include/asm/irq_remapping.h   |  11 ++
  arch/x86/include/asm/irq_vectors.h |   1 +
  arch/x86/include/asm/kvm_host.h|  12 ++
  arch/x86/kernel/apic/msi.c |   1 +
  arch/x86/kernel/entry_64.S |   2 +
  arch/x86/kernel/irq.c  |  27 
  arch/x86/kernel/irqinit.c  |   2 +
  arch/x86/kvm/Makefile  |   2 +-
  arch/x86/kvm/kvm_vfio_x86.c|  77 +
  arch/x86/kvm/vmx.c | 244
 -
  arch/x86/kvm/x86.c |  22 ++-
  drivers/iommu/intel_irq_remapping.c|  68 +++-
  drivers/iommu/irq_remapping.c  

Re: [patch 2/3] KVM: x86: add option to advance tscdeadline hrtimer expiration

2015-01-05 Thread Radim Krcmar
2014-12-23 15:58-0500, Marcelo Tosatti:
 For the hrtimer which emulates the tscdeadline timer in the guest,
 add an option to advance expiration, and busy spin on VM-entry waiting
 for the actual expiration time to elapse.
 
 This allows achieving low latencies in cyclictest (or any scenario 
 which requires strict timing regarding timer expiration).
 
 Reduces average cyclictest latency from 12us to 8us
 on Core i5 desktop.
 
 Note: this option requires tuning to find the appropriate value 
 for a particular hardware/guest combination. One method is to measure the 
 average delay between apic_timer_fn and VM-entry. 
 Another method is to start with 1000ns, and increase the value
 in say 500ns increments until avg cyclictest numbers stop decreasing.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Reviewed-by: Radim Krčmář rkrc...@redhat.com

(Other patches weren't touched, so my previous Reviewed-by holds.)

 +++ kvm/arch/x86/kvm/x86.c
 @@ -108,6 +108,10 @@ EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz)
  static u32 tsc_tolerance_ppm = 250;
  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
  
 +/* lapic timer advance (tscdeadline mode only) in nanoseconds */
 +unsigned int lapic_timer_advance_ns = 0;
 +module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
 +
  static bool backwards_tsc_observed = false;
  
  #define KVM_NR_SHARED_MSRS 16
 @@ -5625,6 +5629,10 @@ static void kvm_timer_init(void)
   __register_hotcpu_notifier(kvmclock_cpu_notifier_block);
   cpu_notifier_register_done();
  
 + if (check_tsc_unstable() && lapic_timer_advance_ns) {
 + pr_info("kvm: unstable TSC, disabling lapic_timer_advance_ns\n");
 + lapic_timer_advance_ns = 0;

Does unstable TSC invalidate this feature?
(lapic_timer_advance_ns can be overridden, so we don't differentiate
 workflows that calibrate after starting with 0.)

And the cover letter is a bit misleading: the condition does nothing to
guarantee a TSC-based __delay() loop.  (Right now, __delay() = delay_tsc()
whenever the hardware has TSC, regardless of stability, thus always.)


Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls

2015-01-05 Thread Andy Lutomirski
On Mon, Jan 5, 2015 at 4:59 AM, Borislav Petkov b...@alien8.de wrote:
 On Fri, Nov 07, 2014 at 03:58:17PM -0800, Andy Lutomirski wrote:
 The int_ret_from_sys_call and syscall tracing code disagrees with
 the sysret path as to the value of RCX.

 The Intel SDM, the AMD APM, and my laptop all agree that sysret
 returns with RCX == RIP.  The syscall tracing code does not respect
 this property.

 For example, this program:

 int main()
 {
   extern const char syscall_rip[];
   unsigned long rcx = 1;
   unsigned long orig_rcx = rcx;
    asm ("mov $-1, %%eax\n\t"
         "syscall\n\t"
         "syscall_rip:"
         : "+c" (rcx) : : "r11");
    printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
           rcx, (unsigned long)syscall_rip, orig_rcx);
   return 0;
 }

 prints:
 syscall: RCX = 400556  RIP = 400556  orig RCX = 1

 Running it under strace gives this instead:
  syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 400556  orig RCX = 1

 I can trigger the same even without tracing it:

  syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 40052C  orig RCX = 1

Do you have context tracking on?


 This changes FIXUP_TOP_OF_STACK to match sysret, causing the test to
 show RCX == RIP even under strace.

 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  arch/x86/kernel/entry_64.S | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

 diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
 index df088bb03fb3..3710b8241945 100644
 --- a/arch/x86/kernel/entry_64.S
 +++ b/arch/x86/kernel/entry_64.S
 @@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
   movq \tmp,RSP+\offset(%rsp)
   movq $__USER_DS,SS+\offset(%rsp)
   movq $__USER_CS,CS+\offset(%rsp)
 - movq $-1,RCX+\offset(%rsp)
 + movq RIP+\offset(%rsp),\tmp  /* get rip */
 + movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
   movq R11+\offset(%rsp),\tmp  /* get eflags */
   movq \tmp,EFLAGS+\offset(%rsp)
   .endm
 --

 For some reason this patch is causing ata resets on my box, see the
 end of this mail. So something's not kosher yet. If I boot the kernel
 without it, it all seems ok.

 Btw, this change got introduced in 2002: it used to return rIP in
 %rcx before, but got changed to return -1 instead, for some reason.


Thanks!  I assume that's in the historical tree?

[...]


 ---

 [  180.059170] ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 
 0x6 frozen
 [  180.066873] ata1.00: failed command: WRITE FPDMA QUEUED
 [  180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 
 4096 out
 [  180.072158]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
 (timeout)

That's really weird.  The only thing I can think of is that somehow we
returned to user mode without enabling interrupts.  This leads me to
wonder: why do we save eflags in the R11 pt_regs slot?  This seems
entirely backwards, not to mention that it accounts for two
instructions in each of FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK
for no apparent reason whatsoever.

Can you send the full output from syscall_exit_regs_64 from here:

https://gitorious.org/linux-test-utils/linux-clock-tests/source/34884122b6ebe81d9b96e3e5128b6d6d95082c6e:

with the patch applied (assuming it even gets that far for you)?  I
see results like:

[NOTE]syscall : orig RCX = 1  ss = 2b  orig_ss = 6b  flags =
217  orig_flags = 217

which seems fine.

Are you seeing this with the whole series applied or with only this patch?

--Andy

 [  180.086912] ata1.00: status: { DRDY }
 [  180.090591] ata1.00: failed command: WRITE FPDMA QUEUED
 [  180.095846] ata1.00: cmd 61/08:08:18:ae:d9/00:00:23:00:00/40 tag 1 ncq 
 4096 out
 [  180.095846]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
 (timeout)
 [  180.110603] ata1.00: status: { DRDY }
 [  180.114283] ata1.00: failed command: WRITE FPDMA QUEUED
 [  180.119539] ata1.00: cmd 61/10:10:f0:b1:d9/00:00:23:00:00/40 tag 2 ncq 
 8192 out
 [  180.119539]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
 (timeout)
 [  180.134292] ata1.00: status: { DRDY }
 [  180.137973] ata1.00: failed command: WRITE FPDMA QUEUED
 [  180.143226] ata1.00: cmd 61/08:18:00:98:18/00:00:1d:00:00/40 tag 3 ncq 
 4096 out
 [  180.143226]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
 (timeout)
 [  180.158105] ata1.00: status: { DRDY }
 [  180.161809] ata1.00: failed command: WRITE FPDMA QUEUED
 [  180.167071] ata1.00: cmd 61/10:20:18:98:18/00:00:1d:00:00/40 tag 4 ncq 
 8192 out
 [  180.167071]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
 (timeout)
 [  180.181822] ata1.00: status: { DRDY }
 [  180.185503] ata1.00: failed command: WRITE FPDMA QUEUED
 [  180.190756] ata1.00: cmd 61/a0:28:e0:7c:5d/25:00:1d:00:00/40 tag 5 ncq 
 4931584 out
 [  180.190756]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
 (timeout)
 [  180.205770] ata1.00: status: { DRDY }
 [  180.209448] ata1.00: failed command: WRITE FPDMA QUEUED
 

Re: [patch 2/3] KVM: x86: add option to advance tscdeadline hrtimer expiration

2015-01-05 Thread Radim Krcmar
2015-01-05 19:12+0100, Radim Krcmar:
  (Right now, __delay() = delay_tsc()
 whenever the hardware has TSC, regardless of stability, thus always.)

(For quantifiers' sake, there also is 'tsc_disabled' variable.)


Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Andy Lutomirski
On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
 The pvclock vdso code was too abstracted to understand easily and
 excessively paranoid.  Simplify it for a huge speedup.

 This opens the door for additional simplifications, as the vdso no
 longer accesses the pvti for any vcpu other than vcpu 0.

 Before, vclock_gettime using kvm-clock took about 64ns on my machine.
 With this change, it takes 19ns, which is almost as fast as the pure TSC
 implementation.

 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  arch/x86/vdso/vclock_gettime.c | 82 
 --
  1 file changed, 47 insertions(+), 35 deletions(-)

 diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
 index 9793322751e0..f2e0396d5629 100644
 --- a/arch/x86/vdso/vclock_gettime.c
 +++ b/arch/x86/vdso/vclock_gettime.c
 @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info 
 *get_pvti(int cpu)

  static notrace cycle_t vread_pvclock(int *mode)
  {
 - const struct pvclock_vsyscall_time_info *pvti;
  + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
   cycle_t ret;
 - u64 last;
 - u32 version;
 - u8 flags;
 - unsigned cpu, cpu1;
 -
 + u64 tsc, pvti_tsc;
 + u64 last, delta, pvti_system_time;
 + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;

   /*
 -  * Note: hypervisor must guarantee that:
 -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
 -  * 2. that per-CPU pvclock time info is updated if the
 -  *underlying CPU changes.
 -  * 3. that version is increased whenever underlying CPU
 -  *changes.
 +  * Note: The kernel and hypervisor must guarantee that cpu ID
 +  * number maps 1:1 to per-CPU pvclock time info.
 +  *
 +  * Because the hypervisor is entirely unaware of guest userspace
 +  * preemption, it cannot guarantee that per-CPU pvclock time
 +  * info is updated if the underlying CPU changes or that that
 +  * version is increased whenever underlying CPU changes.
 +  *
 +  * On KVM, we are guaranteed that pvti updates for any vCPU are
 +  * atomic as seen by *all* vCPUs.  This is an even stronger
 +  * guarantee than we get with a normal seqlock.
*
 +  * On Xen, we don't appear to have that guarantee, but Xen still
 +  * supplies a valid seqlock using the version field.
 +
 +  * We only do pvclock vdso timing at all if
 +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
 +  * mean that all vCPUs have matching pvti and that the TSC is
 +  * synced, so we can just look at vCPU 0's pvti.
*/

 Can Xen guarantee that ?

I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
at all.  I have no idea going forward, though.

Xen people?


 - do {
  - cpu = __getcpu() & VGETCPU_CPU_MASK;
 - /* TODO: We can put vcpu id into higher bits of pvti.version.
 -  * This will save a couple of cycles by getting rid of
 -  * __getcpu() calls (Gleb).
 -  */
 -
 - pvti = get_pvti(cpu);
 -
  - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
 -
 - /*
 -  * Test we're still on the cpu as well as the version.
 -  * We could have been migrated just after the first
 -  * vgetcpu but before fetching the version, so we
 -  * wouldn't notice a version change.
 -  */
  - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
  - } while (unlikely(cpu != cpu1 ||
  -   (pvti->pvti.version & 1) ||
  -   pvti->pvti.version != version));
 -
  - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
  +
  + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
   *mode = VCLOCK_NONE;
 + return 0;
 + }

 This check must be performed after reading a stable pvti.


We can even read it in the middle, guarded by the version checks.
I'll do that for v2.
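
(A minimal sketch of that idea, reusing the variable names from the patch
above plus a u8 flags local; this only illustrates moving the flags read
inside the version-guarded loop, it is not the actual v2 code:)

        do {
                version = pvti->version;

                /* Read barrier: make sure the version read happens first. */
                rdtsc_barrier();
                tsc = __native_read_tsc();

                /* Reading flags here means a racing update is caught by the
                 * version double-check below. */
                flags = pvti->flags;

                pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
                pvti_tsc_shift = pvti->tsc_shift;
                pvti_system_time = pvti->system_time;
                pvti_tsc = pvti->tsc_timestamp;

                /* Make sure that the version double-check is last. */
                smp_rmb();
        } while (unlikely((version & 1) || version != pvti->version));

        if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT))) {
                *mode = VCLOCK_NONE;
                return 0;
        }

        delta = tsc - pvti_tsc;
        ret = pvti_system_time +
                pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
                                    pvti_tsc_shift);

The stable-bit test then operates on a flags value that is known to belong
to a consistent pvti snapshot.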

 +
 + do {
  + version = pvti->version;
 +
 + /* This is also a read barrier, so we'll read version first. */
 + rdtsc_barrier();
 + tsc = __native_read_tsc();
 +
  + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
  + pvti_tsc_shift = pvti->tsc_shift;
  + pvti_system_time = pvti->system_time;
  + pvti_tsc = pvti->tsc_timestamp;
 +
 + /* Make sure that the version double-check is last. */
 + smp_rmb();
  + } while (unlikely((version & 1) || version != pvti->version));
 +
 + delta = tsc - pvti_tsc;
 + ret = pvti_system_time +
 + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
 + pvti_tsc_shift);

 The following is possible:

 1) State: all pvtis marked as 

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Marcelo Tosatti
On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
 The pvclock vdso code was too abstracted to understand easily and
 excessively paranoid.  Simplify it for a huge speedup.
 
 This opens the door for additional simplifications, as the vdso no
 longer accesses the pvti for any vcpu other than vcpu 0.
 
 Before, vclock_gettime using kvm-clock took about 64ns on my machine.
 With this change, it takes 19ns, which is almost as fast as the pure TSC
 implementation.
 
 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  arch/x86/vdso/vclock_gettime.c | 82 
 --
  1 file changed, 47 insertions(+), 35 deletions(-)
 
 diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
 index 9793322751e0..f2e0396d5629 100644
 --- a/arch/x86/vdso/vclock_gettime.c
 +++ b/arch/x86/vdso/vclock_gettime.c
 @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info 
 *get_pvti(int cpu)
  
  static notrace cycle_t vread_pvclock(int *mode)
  {
 - const struct pvclock_vsyscall_time_info *pvti;
 + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
   cycle_t ret;
 - u64 last;
 - u32 version;
 - u8 flags;
 - unsigned cpu, cpu1;
 -
 + u64 tsc, pvti_tsc;
 + u64 last, delta, pvti_system_time;
 + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
  
   /*
 -  * Note: hypervisor must guarantee that:
 -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
 -  * 2. that per-CPU pvclock time info is updated if the
 -  *underlying CPU changes.
 -  * 3. that version is increased whenever underlying CPU
 -  *changes.
 +  * Note: The kernel and hypervisor must guarantee that cpu ID
 +  * number maps 1:1 to per-CPU pvclock time info.
 +  *
 +  * Because the hypervisor is entirely unaware of guest userspace
 +  * preemption, it cannot guarantee that per-CPU pvclock time
 +  * info is updated if the underlying CPU changes or that that
 +  * version is increased whenever underlying CPU changes.
 +  *
 +  * On KVM, we are guaranteed that pvti updates for any vCPU are
 +  * atomic as seen by *all* vCPUs.  This is an even stronger
 +  * guarantee than we get with a normal seqlock.
*
 +  * On Xen, we don't appear to have that guarantee, but Xen still
 +  * supplies a valid seqlock using the version field.
 +
 +  * We only do pvclock vdso timing at all if
 +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
 +  * mean that all vCPUs have matching pvti and that the TSC is
 +  * synced, so we can just look at vCPU 0's pvti.
*/

Can Xen guarantee that ?

 - do {
 - cpu = __getcpu() & VGETCPU_CPU_MASK;
 - /* TODO: We can put vcpu id into higher bits of pvti.version.
 -  * This will save a couple of cycles by getting rid of
 -  * __getcpu() calls (Gleb).
 -  */
 -
 - pvti = get_pvti(cpu);
 -
 - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
 -
 - /*
 -  * Test we're still on the cpu as well as the version.
 -  * We could have been migrated just after the first
 -  * vgetcpu but before fetching the version, so we
 -  * wouldn't notice a version change.
 -  */
 - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
 - } while (unlikely(cpu != cpu1 ||
 -   (pvti->pvti.version & 1) ||
 -   pvti->pvti.version != version));
 -
 - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
 +
 + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
   *mode = VCLOCK_NONE;
 + return 0;
 + }

This check must be performed after reading a stable pvti.

 +
 + do {
 + version = pvti->version;
 +
 + /* This is also a read barrier, so we'll read version first. */
 + rdtsc_barrier();
 + tsc = __native_read_tsc();
 +
 + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
 + pvti_tsc_shift = pvti->tsc_shift;
 + pvti_system_time = pvti->system_time;
 + pvti_tsc = pvti->tsc_timestamp;
 +
 + /* Make sure that the version double-check is last. */
 + smp_rmb();
 + } while (unlikely((version & 1) || version != pvti->version));
 +
 + delta = tsc - pvti_tsc;
 + ret = pvti_system_time +
 + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
 + pvti_tsc_shift);

The following is possible:

1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
transition.
2) vCPU-1 updates its pvti with new values.
3) vCPU-0 still has not updated its pvti with new values.
4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
notified of a