date:20141029

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread Andrew Cooper

On 29/10/14 05:19, Andy Lutomirski wrote:
 Here's a draft CommonHV spec.  It's also on github:

 https://github.com/amluto/CommonHV

 So far, this provides a two-way RNG interface, a way to detect it, and
 a way to detect other hypervisor leaves.  The latter is because, after
 both the enormous public thread and some private discussions, it seems
 that detection of existing CPUID paravirt leaves is annoying and
 inefficient.  If we're going to define some cross-vendor CPUID leaves,
 it seems like it would be useful to offer a way to quickly enumerate
 other leaves.

 I've been told the AMD intends to update their manual to match Intel's
 so that hypervisors can use the entire 0x4F?? CPUID range.  I have
 intentionally not fixed an MSR value for the RNG because the range of
 allowed MSRs is very small in both the Intel and AMD manuals.  If any
 given hypervisor wants to ignore that small range and advertise a
 higher-numbered MSR, it is welcome to, but I don't want to codify
 something that doesn't comply with the manuals.

 Here's the draft.  Comments?  To the people who work on various
 hypervisors: Would you implement this?  Do you like it?  Is there
 anything, major or minor, that you'd like to see changed?  Do you
 think that this is a good idea at all?

As a first pass, it looks like a plausible idea.  I do however have come
comments.


 I've tried to get good coverage of various hypervisors.  There are
 Hyper-V, VMWare, KVM, and Xen people on the cc list.

 Thanks,
 Andy



 CommonHV, a common hypervisor interface
 ===

 This is CommonHV draft 1.

 The CommonHV specification is Copyright (c) 2014 Andrew Lutomirski.

 Licensing will be determined soon.  The license is expected to be extremely
 liberal.  I am currently leaning towards CC-BY-SA for the specification and
 an explicit license permitting anyone to implement the specification
 with no restrictions whatsoever.

 I have not patented, nor do I intend to patent, anything required to implement
 this specification.  I am not aware of any current or future intellectual
 property rights that would prevent a royalty-free implementation of
 this specification.

 I would like to find a stable, neutral steward of this specification
 going forward.  Help with this would be much appreciated.

 Scope
 -

 CommonHV is a simple interface for communication
 between hypervisors and their guests.

 CommonHV is intended to be very simple and to avoid interfering with
 existing paravirtual interfaces.  To that end, its scope is limited.
 CommonHV does only two types of things:

   * It provides a way to enumerate other paravirtual interfaces.
   * It provides a small, extensible set of paravirtual features that do not
 modify or replace standard system functionality.

 For example, CommonHV does not and will not define anything related to
 interrupt handling or virtual CPU management.

 For now, CommonHV is only applicable to the x86 platform.

 Discovery
 -

 A CommonHV hypervisor MUST set the hypervisor bit (bit 31 in CPUID.1H.0H.ECX)
 and provide the CPUID leaf 4F00H, containing:

   * CPUID.4F00H.0H.EAX = max_commonhv_leaf
   * CPUID.4F00H.0H.EBX = 0x6D6D6F43
   * CPUID.4F00H.0H.ECX = 0x56486E6F
   * CPUID.4F00H.0H.EDX = 0x66746e49

 EBX, ECX, and EDX form the string CommonHVIntf in little-endian ASCII.

While testing various nested combinations, XenServer has found that
modern Windows Server versions must have the hypervisor bit hidden from
them for them to be happy running HyperV, despite the fact that they
will make use of the Viridian virtual extensions also provided.

As a result, while it is certainly advisable for the hypervisor bit to
be set, CommonHV should be available to be found by paravirtualised
drivers inside an OS which can't cope with the hypervisor bit set.


 max_commonhv_leaf MUST be a number between 0x4F00 and 0x4FFF.  It
 indicates the largest leaf defined in this specification that is provided.
 Any leaves described in this specification with EAX values that exceed
 max_commonhv_leaf MUST be handled by guests as though they contain
 all zeros.

 CPUID leaf 4F01H: hypervisor interface enumeration
 --

 If max_commonhv_leaf = 0x4F01, CommonHV provides a list of tuples
 (location, signature).  Each tuple indicates the presence of another
 paravirtual interface identified by the signature at the indicated
 CPUID location.  It is expected that CPUID.location.0H will have
 (EBX, ECX, EDX) == signature, although whether this is required
 is left to the specification associated with the given signature.

 If the list contains N tuples, then, for each 0 = i  N:

   * CPUID.4F01H.i.EBX, CPUID.4F01H.i.ECX, and CPUID.4F01H.i.EDX
 are the signature.
   * CPUID.4F01H.i.EAX is the location.

 CPUID with EAX = 0x4F01 and ECX = N MUST return all zeros.

 To the extent that the

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread Paolo Bonzini



On 10/29/2014 11:37 AM, Andrew Cooper wrote:
 While testing various nested combinations, XenServer has found that
 modern Windows Server versions must have the hypervisor bit hidden from
 them for them to be happy running HyperV, despite the fact that they
 will make use of the Viridian virtual extensions also provided.

Right.

 As a result, while it is certainly advisable for the hypervisor bit to
 be set, CommonHV should be available to be found by paravirtualised
 drivers inside an OS which can't cope with the hypervisor bit set.

Microsoft should just stop putting arbitrary limitations on their
software; or pay the price which, in this case, is not being able to use
the features from the common specification.  I guess what they'd do is
reinvent the RNG as a Viridian extension (if they need it).

You can certainly do CPUID(0x4F00) even if HYPERVISOR=0.  What you
get back is undefined, but in all likelihood it won't be the
CommonHVIntf string.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread David Vrabel

On 29/10/14 13:45, Paolo Bonzini wrote:
 
 
 On 10/29/2014 11:37 AM, Andrew Cooper wrote:
 While testing various nested combinations, XenServer has found that
 modern Windows Server versions must have the hypervisor bit hidden from
 them for them to be happy running HyperV, despite the fact that they
 will make use of the Viridian virtual extensions also provided.
 
 Right.
 
 As a result, while it is certainly advisable for the hypervisor bit to
 be set, CommonHV should be available to be found by paravirtualised
 drivers inside an OS which can't cope with the hypervisor bit set.
 
 Microsoft should just stop putting arbitrary limitations on their
 software; or pay the price which, in this case, is not being able to use
 the features from the common specification.  I guess what they'd do is
 reinvent the RNG as a Viridian extension (if they need it).
 
 You can certainly do CPUID(0x4F00) even if HYPERVISOR=0.  What you
 get back is undefined, but in all likelihood it won't be the
 CommonHVIntf string.

Microsoft already has a specification to obtain a random number via an
ACPI device.  The VM Generation ID.

http://www.microsoft.com/en-us/download/details.aspx?id=30707

David
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread Paolo Bonzini

On 10/29/2014 02:57 PM, David Vrabel wrote:
 Microsoft already has a specification to obtain a random number via an
 ACPI device.  The VM Generation ID.
 
 http://www.microsoft.com/en-us/download/details.aspx?id=30707

That's a very different thing.  The VM Generation ID always returns the
same number if you run the VM from the same configuration file.

It tells the VM when it might need a reseed, but provides neither a best
effort random number like this spec does, nor an entropy source like
virtio-rng does.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v3 2/3] x86: tss: Eliminate fragile calculation of TSS segment limit

2014-10-29 Thread Josh Triplett

__set_tss_desc has a complex calculation of the TSS segment limit,
duplicating the quirky details of the I/O bitmap array length, and
requiring a complex comment to explain.  Replace that calculation with a
simpler one based on the offsetof the stack field that follows the
array.

That then removes the last use of IO_BITMAP_OFFSET, so delete it.

Signed-off-by: Josh Triplett j...@joshtriplett.org
Acked-by: Alexander van Heukelum heuke...@fastmail.fm
---
v3: No changes in this patch, only in patch 3/3.

 arch/x86/include/asm/desc.h  | 11 +--
 arch/x86/include/asm/processor.h |  4 +++-
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 50d033a..f8dc455 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -177,16 +177,7 @@ static inline void __set_tss_desc(unsigned cpu, unsigned 
int entry, void *addr)
struct desc_struct *d = get_cpu_gdt_table(cpu);
tss_desc tss;
 
-   /*
-* sizeof(unsigned long) coming from an extra long at the end
-* of the iobitmap. See tss_struct definition in processor.h
-*
-* -1? seg base+limit should be pointing to the address of the
-* last valid byte
-*/
-   set_tssldt_descriptor(tss, (unsigned long)addr, DESC_TSS,
- IO_BITMAP_OFFSET + IO_BITMAP_BYTES +
- sizeof(unsigned long) - 1);
+   set_tssldt_descriptor(tss, (unsigned long)addr, DESC_TSS, TSS_LIMIT);
write_gdt_entry(d, entry, tss, DESC_TSS);
 }
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eb71ec7..76751cd 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -258,9 +258,11 @@ struct x86_hw_tss {
 #define IO_BITMAP_BITS 65536
 #define IO_BITMAP_BYTES(IO_BITMAP_BITS/8)
 #define IO_BITMAP_LONGS(IO_BITMAP_BYTES/sizeof(long))
-#define IO_BITMAP_OFFSET   offsetof(struct tss_struct, io_bitmap)
 #define INVALID_IO_BITMAP_OFFSET   0x8000
 
+/* Segment limits point to the last valid byte, hence the -1 */
+#define TSS_LIMIT  (offsetof(struct tss_struct, stack) - 1)
+
 struct tss_struct {
/*
 * The hardware state:
-- 
2.1.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v3 1/3] x86: process: Unify 32-bit and 64-bit copy_thread I/O bitmap handling

2014-10-29 Thread Josh Triplett

The 32-bit and 64-bit versions of copy_thread have functionally
identical handling for copying the I/O bitmap, modulo differences in
error handling.  Clean up the error paths in both by moving the copy of
the I/O bitmap to the end, to eliminate the need to free it if
subsequent copy steps fail; move the resulting identical code to a
static inline in a common header.

Signed-off-by: Josh Triplett j...@joshtriplett.org
Acked-by: Kees Cook keesc...@chromium.org
---
v3: No changes in this patch, only in patch 3/3.

 arch/x86/kernel/process-io.h | 22 ++
 arch/x86/kernel/process_32.c | 28 
 arch/x86/kernel/process_64.c | 25 +
 3 files changed, 35 insertions(+), 40 deletions(-)
 create mode 100644 arch/x86/kernel/process-io.h

diff --git a/arch/x86/kernel/process-io.h b/arch/x86/kernel/process-io.h
new file mode 100644
index 000..d88
--- /dev/null
+++ b/arch/x86/kernel/process-io.h
@@ -0,0 +1,22 @@
+#ifndef _X86_KERNEL_PROCESS_IO_H
+#define _X86_KERNEL_PROCESS_IO_H
+
+static inline int copy_io_bitmap(struct task_struct *me,
+struct task_struct *p)
+{
+   if (unlikely(test_tsk_thread_flag(me, TIF_IO_BITMAP))) {
+   p-thread.io_bitmap_ptr = kmemdup(me-thread.io_bitmap_ptr,
+ IO_BITMAP_BYTES, GFP_KERNEL);
+   if (!p-thread.io_bitmap_ptr) {
+   p-thread.io_bitmap_max = 0;
+   return -ENOMEM;
+   }
+   set_tsk_thread_flag(p, TIF_IO_BITMAP);
+   } else {
+   p-thread.io_bitmap_ptr = NULL;
+   }
+
+   return 0;
+}
+
+#endif /* _X86_KERNEL_PROCESS_IO_H */
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 8f3ebfe..07550ff 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -55,6 +55,8 @@
 #include asm/debugreg.h
 #include asm/switch_to.h
 
+#include process-io.h
+
 asmlinkage void ret_from_fork(void) __asm__(ret_from_fork);
 asmlinkage void ret_from_kernel_thread(void) __asm__(ret_from_kernel_thread);
 
@@ -134,7 +136,6 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 {
struct pt_regs *childregs = task_pt_regs(p);
struct task_struct *tsk;
-   int err;
 
p-thread.sp = (unsigned long) childregs;
p-thread.sp0 = (unsigned long) (childregs+1);
@@ -166,32 +167,19 @@ int copy_thread(unsigned long clone_flags, unsigned long 
sp,
 
p-thread.io_bitmap_ptr = NULL;
tsk = current;
-   err = -ENOMEM;
-
-   if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) {
-   p-thread.io_bitmap_ptr = kmemdup(tsk-thread.io_bitmap_ptr,
-   IO_BITMAP_BYTES, GFP_KERNEL);
-   if (!p-thread.io_bitmap_ptr) {
-   p-thread.io_bitmap_max = 0;
-   return -ENOMEM;
-   }
-   set_tsk_thread_flag(p, TIF_IO_BITMAP);
-   }
-
-   err = 0;
 
/*
 * Set a new TLS for the child thread?
 */
-   if (clone_flags  CLONE_SETTLS)
+   if (clone_flags  CLONE_SETTLS) {
+   int err;
err = do_set_thread_area(p, -1,
(struct user_desc __user *)childregs-si, 0);
-
-   if (err  p-thread.io_bitmap_ptr) {
-   kfree(p-thread.io_bitmap_ptr);
-   p-thread.io_bitmap_max = 0;
+   if(err)
+   return err;
}
-   return err;
+
+   return copy_io_bitmap(tsk, p);
 }
 
 void
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3ed4a68..b1babb4 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -50,6 +50,8 @@
 #include asm/debugreg.h
 #include asm/switch_to.h
 
+#include process-io.h
+
 asmlinkage extern void ret_from_fork(void);
 
 __visible DEFINE_PER_CPU(unsigned long, old_rsp);
@@ -154,7 +156,6 @@ static inline u32 read_32bit_tls(struct task_struct *t, int 
tls)
 int copy_thread(unsigned long clone_flags, unsigned long sp,
unsigned long arg, struct task_struct *p)
 {
-   int err;
struct pt_regs *childregs;
struct task_struct *me = current;
 
@@ -191,21 +192,11 @@ int copy_thread(unsigned long clone_flags, unsigned long 
sp,
if (sp)
childregs-sp = sp;
 
-   err = -ENOMEM;
-   if (unlikely(test_tsk_thread_flag(me, TIF_IO_BITMAP))) {
-   p-thread.io_bitmap_ptr = kmemdup(me-thread.io_bitmap_ptr,
- IO_BITMAP_BYTES, GFP_KERNEL);
-   if (!p-thread.io_bitmap_ptr) {
-   p-thread.io_bitmap_max = 0;
-   return -ENOMEM;
-   }
-   set_tsk_thread_flag(p, TIF_IO_BITMAP);
-   }
-
/*
 * Set a new TLS for the child thread?
 */

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread H. Peter Anvin

On 10/29/2014 03:37 AM, Andrew Cooper wrote:

 CPUID with EAX = 0x4F01 and ECX = N MUST return all zeros.

 To the extent that the hypervisor prefers a given interface, it should
 specify that interface earlier in the list.  For example, KVM might place
 its KVMKVMKVM signature first in the list to indicate that it should be
 used by guests in preference to other supported interfaces.  Other 
 hypervisors
 would likely use a different order.

 The exact semantics of the ordering of the list is beyond the scope of
 this specification.
 
 How do you evaluate N?
 
 It would make more sense for CPUID.4F01[ECX=0] to return N in one
 register, and perhaps prefered interface index in another.  The
 signatures can then be obtained from CPUID.4F01[ECX={1 to N}].
 
 That way, a consumer can be confident that they have found all the
 signatures, without relying on an unbounded loop and checking for zeroes

Yes.  Specifically, it should return it in EAX.  That is the preferred
interface and we are trying to push for that going forward.

-hpa


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread H. Peter Anvin

On 10/29/2014 03:37 AM, Andrew Cooper wrote:

 CPUID with EAX = 0x4F01 and ECX = N MUST return all zeros.

 To the extent that the hypervisor prefers a given interface, it should
 specify that interface earlier in the list.  For example, KVM might place
 its KVMKVMKVM signature first in the list to indicate that it should be
 used by guests in preference to other supported interfaces.  Other 
 hypervisors
 would likely use a different order.

 The exact semantics of the ordering of the list is beyond the scope of
 this specification.
 
 How do you evaluate N?
 
 It would make more sense for CPUID.4F01[ECX=0] to return N in one
 register, and perhaps prefered interface index in another.  The
 signatures can then be obtained from CPUID.4F01[ECX={1 to N}].
 
 That way, a consumer can be confident that they have found all the
 signatures, without relying on an unbounded loop and checking for zeroes

Yes.  Specifically, it should return it in EAX.  That is the preferred
interface and we are trying to push for that going forward.

-hpa


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread H. Peter Anvin

On 10/29/2014 03:37 AM, Andrew Cooper wrote:

 CPUID with EAX = 0x4F01 and ECX = N MUST return all zeros.

 To the extent that the hypervisor prefers a given interface, it should
 specify that interface earlier in the list.  For example, KVM might place
 its KVMKVMKVM signature first in the list to indicate that it should be
 used by guests in preference to other supported interfaces.  Other 
 hypervisors
 would likely use a different order.

 The exact semantics of the ordering of the list is beyond the scope of
 this specification.
 
 How do you evaluate N?
 
 It would make more sense for CPUID.4F01[ECX=0] to return N in one
 register, and perhaps prefered interface index in another.  The
 signatures can then be obtained from CPUID.4F01[ECX={1 to N}].
 
 That way, a consumer can be confident that they have found all the
 signatures, without relying on an unbounded loop and checking for zeroes

Yes.  Specifically, it should return it in EAX.  That is the preferred
interface and we are trying to push for that going forward.

-hpa


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)

2014-10-29 Thread Josh Triplett

On the vast majority of modern systems, no processes will use the
userspsace I/O syscalls, iopl and ioperm.  Add a new config option,
CONFIG_X86_IOPORT, to support configuring them out of the kernel
entirely.  Most current systems do not run programs using these
syscalls, so X86_IOPORT does not depend on EXPERT, though it does still
default to y.

In addition to saving a significant amount of space, this also reduces
the size of several major kernel data structures, drops a bit of code
from several task-related hot paths, and reduces the attack surface of
the kernel.

All of the task-related code dealing with userspace I/O permissions now
lives in process-io.h as several new inline functions, which become
no-ops without CONFIG_X86_IOPORT.

bloat-o-meter results:

add/remove: 0/4 grow/shrink: 4/9 up/down: 83/-9074 (-8991)
function old new   delta
drop_fpu  80 160 +80
perf_event_init_task 638 639  +1
perf_event_exec  192 193  +1
perf_event_aux   132 133  +1
test_tsk_thread_flag  10   - -10
ioperm_active 14   3 -11
init_task916 904 -12
flush_thread 168 149 -19
cpu_init 364 339 -25
__drop_fpu60   - -60
ioperm_get92   6 -86
exit_thread  105  10 -95
sys_iopl 106   --106
copy_thread  364 257-107
__switch_to_xtra 234 123-111
sys_ioperm   240   --240
init_tss8576 384   -8192

Signed-off-by: Josh Triplett j...@joshtriplett.org
---
v3: Eliminated several #ifdefs, and in particular almost all #ifdefs in
.c files, by adding a macro INIT_SET_IOPL_MASK to use in place of
the initializer for set_iopl_mask, and using __maybe_unused rather
than wrapping function definitions in #ifdef.  Rebased on v3.18-rc1.
Recomputed bloat-o-meter.

I assume this should go through the x86 tree rather than the tiny tree.

 arch/x86/Kconfig  | 10 +
 arch/x86/include/asm/paravirt.h   |  2 +
 arch/x86/include/asm/paravirt_types.h |  5 +++
 arch/x86/include/asm/processor.h  | 51 +
 arch/x86/include/asm/syscalls.h   |  3 ++
 arch/x86/kernel/Makefile  |  3 +-
 arch/x86/kernel/cpu/common.c  | 12 +-
 arch/x86/kernel/entry_64.S|  9 +++--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/kernel/process-io.h  | 71 +++
 arch/x86/kernel/process.c | 34 ++---
 arch/x86/kernel/process_32.c  | 13 ++-
 arch/x86/kernel/process_64.c  |  2 +-
 arch/x86/kernel/ptrace.c  |  8 
 arch/x86/xen/enlighten.c  |  4 +-
 drivers/tty/vt/vt_ioctl.c |  2 +-
 kernel/sys_ni.c   |  5 +++
 17 files changed, 169 insertions(+), 67 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f2327e8..a7de2eb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -984,6 +984,16 @@ config X86_ESPFIX64
def_bool y
depends on X86_16BIT  X86_64
 
+config X86_IOPORT
+   bool iopl and ioperm system calls
+   default y
+   ---help---
+ This option enables the iopl and ioperm system calls, which allow
+ privileged userspace processes to directly access I/O ports. This
+ is used by software that drives hardware directly from userspace
+ without using a kernel driver. Unless you intend to run such
+ software, you can safely say N here.
+
 config TOSHIBA
tristate Toshiba Laptop support
depends on X86_32
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index cd6e161..52d2ec8 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -299,10 +299,12 @@ static inline void write_idt_entry(gate_desc *dt, int 
entry, const gate_desc *g)
 {
PVOP_VCALL3(pv_cpu_ops.write_idt_entry, dt, entry, g);
 }
+#ifdef CONFIG_X86_IOPORT
 static inline void set_iopl_mask(unsigned mask)
 {
PVOP_VCALL1(pv_cpu_ops.set_iopl_mask, mask);
 }
+#endif /* CONFIG_X86_IOPORT */
 
 /* The paravirtualized I/O functions */
 static inline void slow_down_io(void)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..a22a5d4 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -142,7 +142,12 @@ struct pv_cpu_ops

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread Andy Lutomirski

On Oct 29, 2014 8:17 AM, Ian Jackson ian.jack...@eu.citrix.com wrote:

 Andy Lutomirski writes ([Xen-devel] [RFC] Hypervisor RNG and enumeration):
  Here's a draft CommonHV spec.  It's also on github:
  https://github.com/amluto/CommonHV

 This a worthwhile direction to investigate, and an interesting
 proposal.  From a Xen point of view I have some concerns, though.

 I think in Xen we would want to implement the bulk of the provision of
 random numbers to guests outside the hypervisor.  That is, the
 hypervisor itself should not have random number pool code, associated
 policy, and so on.  We would like to avoid adding too much code to the
 hypervisor.

 That functionality should live in the lower toolstack layers.  For the
 benefit of people who want to do radical disaggregation (for security
 reasons), it should be capable of being provided by a different domain
 to dom0.

 I think a fully general interface based purely on MSRs makes that
 difficult in a number of ways:

  * Currently I don't think there is any way in Xen to cause MSR
accesses to be passed to toolstack support programs.

  * In some configurations, Xen PV domains do not have a suitable
longrunning service process to handle requests of this kind.

  * MSR reads of this kind might be expected to be fast but if they
involve trapping to a service domain they might be very slow.

I have no objection to specifying that these reads may be quite slow.
Guests should only use them at boot and if they have some reason to
distrust their RNG pool.

The latter can legitimately happen after various types of suspend or
after migration (detected by VM Generation ID, for example).


  * This interface is very x86-specific.


Agreed.

Part of the motivation is to allow guests to use this mechanism very
early in boot for stack canaries, ASLR, etc.  I don't know a good way
to do that without something very platform specific.


 It seems to me that the real need for this facility is to provide a
 good seed for the guest's own cryptographic PRNG.  If we restrict
 ourselves to that use case, we can sidestep the difficulties.

 In particular, the parts of this proposal that are most difficult are:

  * The facility for the guest to provide random numbers back to the
host.  I think this can be dealt with easily by hypervisor-specific
mechanisms, if it is desirable.

Xen can implement this is a no-op.  If we use feature bits, wrmsr
support could be separately enumerated.


  * The implication that a hypervisor ought to be providing a unbounded
stream of random numbers, rather than a fixed amount of seed.


I don't expect hypervisors to estimate the entropy available through
this mechanism.  Given that, the length of the stream is largely
irrelevant, except that an unbounded stream allowed reseeding after
boot.


 I think the most obvious approach would be to provide the VM, at
 startup, with a page containing a fixed amount of random number seed,
 along with some metatdata.

 Some platform-specific way of discovering the location of the page
 would have to be defined.  (That might an MSR but more likely it would
 be Device Tree or ACPI.)

 After the guest has read the page, it would be free to treat it as
 normal memory.

 The metadata need do little more than specify the length and the
 amount of provided entropy.  There should be some room for expansion.

ACPI is not useful early in boot.  DT might be, but that could be a
separate spec.


 The specification should say that the provided seed MUST be
 cryptographically secure, MUST have as much entropy as stated and that
 that amount of entropy MUST be at least (say) 64 bits and SHOULD be at
 least (say) 256 bits.

I don't think this is practical.  It will require hypervisors to
throttle guest startup to ensure that they have that much entropy.

I'm not fundamentally opposed to allowing hypervisors to provide more
than 64 bits of data per invocation, which would help when the trap is
very slow, but I don't know of a suitably simple way to do that.

--Andy
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread Andy Lutomirski

On Wed, Oct 29, 2014 at 9:07 AM, H. Peter Anvin h...@zytor.com wrote:
 On 10/29/2014 03:37 AM, Andrew Cooper wrote:

 CPUID with EAX = 0x4F01 and ECX = N MUST return all zeros.

 To the extent that the hypervisor prefers a given interface, it should
 specify that interface earlier in the list.  For example, KVM might place
 its KVMKVMKVM signature first in the list to indicate that it should be
 used by guests in preference to other supported interfaces.  Other 
 hypervisors
 would likely use a different order.

 The exact semantics of the ordering of the list is beyond the scope of
 this specification.

 How do you evaluate N?

 It would make more sense for CPUID.4F01[ECX=0] to return N in one
 register, and perhaps prefered interface index in another.  The
 signatures can then be obtained from CPUID.4F01[ECX={1 to N}].

 That way, a consumer can be confident that they have found all the
 signatures, without relying on an unbounded loop and checking for zeroes

 Yes.  Specifically, it should return it in EAX.  That is the preferred
 interface and we are trying to push for that going forward.


I'm okay with that.

I'm inclined to leave EBX, ECX, and EDX reserved for now, though.
Barring an actual use case in which the order of the list isn't
sufficient to determine preference, I don't see the need to give
another preference indication.

I updated the copy of github to make this change and to use an
explicit feature bit for the RNG.

--Andy

 -hpa





-- 
Andy Lutomirski
AMA Capital Management, LLC
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

RE: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread Jake Oshins


I have no objection to specifying that these reads may be quite slow.
Guests should only use them at boot and if they have some reason to
distrust their RNG pool.

The latter can legitimately happen after various types of suspend or
after migration (detected by VM Generation ID, for example).

Just as a point of clarification, the VM Generation ID changes (at least in the 
Hyper-V implementation) only when the VM may have observed a different future, 
as when a VM backup is restored, a checkpoint is applied, etc.  It does not 
change during migration, when the VM is suspended or when it is rebooted.  I've 
heard anecdotes from application vendors saying that there is some other 
hypervisor that actually does change the ID at these moments and they wanted us 
to us to fix that, until I explained that I only control Hyper-V.

-- Jake Oshins

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread Andy Lutomirski

On Wed, Oct 29, 2014 at 9:29 AM, Jake Oshins ja...@microsoft.com wrote:

I have no objection to specifying that these reads may be quite slow.
Guests should only use them at boot and if they have some reason to
distrust their RNG pool.

The latter can legitimately happen after various types of suspend or
after migration (detected by VM Generation ID, for example).

 Just as a point of clarification, the VM Generation ID changes (at least in 
 the Hyper-V implementation) only when the VM may have observed a different 
 future, as when a VM backup is restored, a checkpoint is applied, etc.  It 
 does not change during migration, when the VM is suspended or when it is 
 rebooted.  I've heard anecdotes from application vendors saying that there is 
 some other hypervisor that actually does change the ID at these moments and 
 they wanted us to us to fix that, until I explained that I only control 
 Hyper-V.


Fair enough.

If the VM may indeed have observed a different future, then I would
argue that reseeding the RNG is very important -- more so than after a
normal migration.

If the VM trusts that its other history hasn't been compromised, then
merely mixing in a unique value would get most of the benefit.

--Andy

 -- Jake Oshins




-- 
Andy Lutomirski
AMA Capital Management, LLC
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)

2014-10-29 Thread Kees Cook

On Wed, Oct 29, 2014 at 9:10 AM, Josh Triplett j...@joshtriplett.org wrote:
 On the vast majority of modern systems, no processes will use the
 userspsace I/O syscalls, iopl and ioperm.  Add a new config option,
 CONFIG_X86_IOPORT, to support configuring them out of the kernel
 entirely.  Most current systems do not run programs using these
 syscalls, so X86_IOPORT does not depend on EXPERT, though it does still
 default to y.

 In addition to saving a significant amount of space, this also reduces
 the size of several major kernel data structures, drops a bit of code
 from several task-related hot paths, and reduces the attack surface of
 the kernel.

 All of the task-related code dealing with userspace I/O permissions now
 lives in process-io.h as several new inline functions, which become
 no-ops without CONFIG_X86_IOPORT.

 bloat-o-meter results:

 add/remove: 0/4 grow/shrink: 4/9 up/down: 83/-9074 (-8991)
 function old new   delta
 drop_fpu  80 160 +80
 perf_event_init_task 638 639  +1
 perf_event_exec  192 193  +1
 perf_event_aux   132 133  +1
 test_tsk_thread_flag  10   - -10
 ioperm_active 14   3 -11
 init_task916 904 -12
 flush_thread 168 149 -19
 cpu_init 364 339 -25
 __drop_fpu60   - -60
 ioperm_get92   6 -86
 exit_thread  105  10 -95
 sys_iopl 106   --106
 copy_thread  364 257-107
 __switch_to_xtra 234 123-111
 sys_ioperm   240   --240
 init_tss8576 384   -8192

 Signed-off-by: Josh Triplett j...@joshtriplett.org
 ---
 v3: Eliminated several #ifdefs, and in particular almost all #ifdefs in
 .c files, by adding a macro INIT_SET_IOPL_MASK to use in place of
 the initializer for set_iopl_mask, and using __maybe_unused rather
 than wrapping function definitions in #ifdef.  Rebased on v3.18-rc1.
 Recomputed bloat-o-meter.

 I assume this should go through the x86 tree rather than the tiny tree.

  arch/x86/Kconfig  | 10 +
  arch/x86/include/asm/paravirt.h   |  2 +
  arch/x86/include/asm/paravirt_types.h |  5 +++
  arch/x86/include/asm/processor.h  | 51 +
  arch/x86/include/asm/syscalls.h   |  3 ++
  arch/x86/kernel/Makefile  |  3 +-
  arch/x86/kernel/cpu/common.c  | 12 +-
  arch/x86/kernel/entry_64.S|  9 +++--
  arch/x86/kernel/paravirt.c|  2 +-
  arch/x86/kernel/process-io.h  | 71 
 +++
  arch/x86/kernel/process.c | 34 ++---
  arch/x86/kernel/process_32.c  | 13 ++-
  arch/x86/kernel/process_64.c  |  2 +-
  arch/x86/kernel/ptrace.c  |  8 
  arch/x86/xen/enlighten.c  |  4 +-
  drivers/tty/vt/vt_ioctl.c |  2 +-
  kernel/sys_ni.c   |  5 +++
  17 files changed, 169 insertions(+), 67 deletions(-)

 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index f2327e8..a7de2eb 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -984,6 +984,16 @@ config X86_ESPFIX64
 def_bool y
 depends on X86_16BIT  X86_64

 +config X86_IOPORT
 +   bool iopl and ioperm system calls
 +   default y
 +   ---help---
 + This option enables the iopl and ioperm system calls, which allow
 + privileged userspace processes to directly access I/O ports. This
 + is used by software that drives hardware directly from userspace
 + without using a kernel driver. Unless you intend to run such
 + software, you can safely say N here.
 +
  config TOSHIBA
 tristate Toshiba Laptop support
 depends on X86_32
 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
 index cd6e161..52d2ec8 100644
 --- a/arch/x86/include/asm/paravirt.h
 +++ b/arch/x86/include/asm/paravirt.h
 @@ -299,10 +299,12 @@ static inline void write_idt_entry(gate_desc *dt, int 
 entry, const gate_desc *g)
  {
 PVOP_VCALL3(pv_cpu_ops.write_idt_entry, dt, entry, g);
  }
 +#ifdef CONFIG_X86_IOPORT
  static inline void set_iopl_mask(unsigned mask)
  {
 PVOP_VCALL1(pv_cpu_ops.set_iopl_mask, mask);
  }
 +#endif /* CONFIG_X86_IOPORT */

  /* The paravirtualized I/O functions */
  static inline void slow_down_io(void)
 diff --git a/arch/x86/include/asm/paravirt_types.h

Re: [PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)

2014-10-29 Thread josh

On Wed, Oct 29, 2014 at 09:59:25AM -0700, Kees Cook wrote:
 On Wed, Oct 29, 2014 at 9:10 AM, Josh Triplett j...@joshtriplett.org wrote:
  --- a/arch/x86/kernel/process-io.h
  +++ b/arch/x86/kernel/process-io.h
  @@ -1,9 +1,17 @@
   #ifndef _X86_KERNEL_PROCESS_IO_H
   #define _X86_KERNEL_PROCESS_IO_H
 
  +static inline void clear_thread_io_bitmap(struct task_struct *p)
  +{
  +#ifdef CONFIG_X86_IOPORT
  +   p-thread.io_bitmap_ptr = NULL;
  +#endif /* CONFIG_X86_IOPORT */
  +}
 
 Personally, I prefer seeing these kinds of optional functions declared
 in a single block rather than having the #ifdefs inside the functions:
 
 #ifdef CONFIG_X86_IOPORT
 static inline void clear_thread_io_bitmap(struct task_struct *p)
 {
 ...
 }
 
 static inline int copy_io_bitmap(struct task_struct *me,
   struct task_struct *p)
 {
 ...
 }
 
 ...remaining_functions...
 
 #else
 static inline void clear_thread_io_bitmap(struct task_struct *p) { }
 static inline int copy_io_bitmap(struct task_struct *me,
   struct task_struct *p)
 {
 return 0;
 }
 ...remaining functions...
 #endif /* CONFIG_X86_IOPORT */
 
 But this is entirely a style decision, so I leave it up to the x86
 maintainers ...

I can certainly do that if the x86 maintainers prefer, but that tends to
produce a net increase in lines of code, as well as duplicating all the
function prototypes, which to me seems more error-prone.  If the
stub versions contained any code, rather than just becoming no-ops, I'd
definitely do that.

 Another nit may be that we should call this CONFIG_SYSCALL_IOPL or
 CONFIG_SYSCALL_IOPERM in keeping with the other CONFIG_SYSCALL_*
 naming thread? Again, I don't really care strongly beyond really
 wanting to use this new feature! :)

I don't feel strongly about the naming.  Ingo?

 Thanks for working on this!

No problem.  I look forward to seeing it used, in Chrome OS and
elsewhere. :)

- Josh Triplett
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)

2014-10-29 Thread josh

On Wed, Oct 29, 2014 at 10:20:28AM -0700, H. Peter Anvin wrote:
 On 10/29/2014 10:17 AM, j...@joshtriplett.org wrote:
 
  But this is entirely a style decision, so I leave it up to the x86
  maintainers ...
  
  I can certainly do that if the x86 maintainers prefer, but that tends to
  produce a net increase in lines of code, as well as duplicating all the
  function prototypes, which to me seems more error-prone.  If the
  stub versions contained any code, rather than just becoming no-ops, I'd
  definitely do that.
  
 
 I concur with this style choice.

To clarify: you concur with Kees's suggested change or with the style I
used in my patch?

  Another nit may be that we should call this CONFIG_SYSCALL_IOPL or
  CONFIG_SYSCALL_IOPERM in keeping with the other CONFIG_SYSCALL_*
  naming thread? Again, I don't really care strongly beyond really
  wanting to use this new feature! :)
  
  I don't feel strongly about the naming.  Ingo?
 
 It is sort of a special case here, as this reflects more than one syscall.

As well as four VT ioctls. :)

- Josh Triplett
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

2014-10-29 Thread Paolo Bonzini

On 10/29/2014 05:29 PM, Jake Oshins wrote:
 Just as a point of clarification, the VM Generation ID changes (at
 least in the Hyper-V implementation) only when the VM may have
 observed a different future, as when a VM backup is restored, a
 checkpoint is applied, etc.  It does not change during migration,
 when the VM is suspended or when it is rebooted.  I've heard
 anecdotes from application vendors saying that there is some other
 hypervisor that actually does change the ID at these moments and they
 wanted us to us to fix that, until I explained that I only control
 Hyper-V.

This is indeed the only reasonable way you can read the vmgenid spec.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-29 Thread Waiman Long


On 10/27/2014 05:22 PM, Waiman Long wrote:

On 10/27/2014 02:04 PM, Peter Zijlstra wrote:

On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:

On 10/24/2014 04:54 AM, Peter Zijlstra wrote:

On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:

Since enabling paravirt spinlock will disable unlock function 
inlining,

a jump label can be added to the unlock function without adding patch
sites all over the kernel.

But you don't have to. My patches allowed for the inline to remain,
again reducing the overhead of enabling PV spinlocks while running 
on a

real machine.

Look at:

   http://lkml.kernel.org/r/20140615130154.213923...@chello.nl

In particular this hunk:

Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
===
--- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
+++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
@@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, swapgs)
  DEF_NATIVE(, mov32, mov %edi, %eax);
  DEF_NATIVE(, mov64, mov %rdi, %rax);

+#if defined(CONFIG_PARAVIRT_SPINLOCKS)   
defined(CONFIG_QUEUE_SPINLOCK)

+DEF_NATIVE(pv_lock_ops, queue_unlock, movb $0, (%rdi));
+#endif
+
  unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
  {
 return paravirt_patch_insns(insnbuf, len,
@@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
 PATCH_SITE(pv_cpu_ops, clts);
 PATCH_SITE(pv_mmu_ops, flush_tlb_single);
 PATCH_SITE(pv_cpu_ops, wbinvd);
+#if defined(CONFIG_PARAVIRT_SPINLOCKS)   
defined(CONFIG_QUEUE_SPINLOCK)

+   PATCH_SITE(pv_lock_ops, queue_unlock);
+#endif

 patch_site:
 ret = paravirt_patch_insns(ibuf, len, start, end);


That makes sure to overwrite the callee-saved call to the
pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi).


Therefore you can retain the inlined unlock with hardly (there 
might be

some NOP padding) any overhead at all. On PV it reverts to a callee
saved function call.
My concern is that spin_unlock() can be called in many places, 
including
loadable kernel modules. Can the paravirt_patch_ident_32() function 
able to
patch all of them in reasonable time? How about a kernel module 
loaded later

at run time?

modules should be fine, see arch/x86/kernel/module.c:module_finalize()
-  apply_paravirt().

Also note that the 'default' text is an indirect call into the paravirt
ops table which routes to the 'right' function, so even if the text
patching would be 'late' calls would 'work' as expected, just slower.


Thanks for letting me know about that. I have this concern because 
your patch didn't change the current configuration of disabling unlock 
inlining when paravirt_spinlock is enabled. With that, I think it is 
worthwhile to reduce the performance delta between the PV and non-PV 
kernel on bare metal.


I am sorry that the unlock call sites patching code doesn't work in a 
virtual guest. Your pvqspinlock patch did an unconditional patching even 
in a virtual guest. I added check for the paravirt_spinlocks_enabled, 
but it turned out that some spin_unlock() seemed to be called before 
paravirt_spinlocks_enabled is set. As a result, some call sites were 
still patched resulting in missed wake up's and system hang.


At this point, I am going to leave out that change from my patch set 
until we can figure out a better way of doing that.


-Longman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v13 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-10-29 Thread Waiman Long

This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suit to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending in one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities.  By allocating a set of per-cpu queue nodes and used them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index)
leaving one byte for the lock.

Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 include/asm-generic/qspinlock.h   |  118 +++
 include/asm-generic/qspinlock_types.h |   58 +
 kernel/Kconfig.locks  |7 +
 kernel/locking/Makefile   |1 +
 kernel/locking/mcs_spinlock.h |1 +
 kernel/locking/qspinlock.c|  207 +
 6 files changed, 392 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 000..e8a7ae8
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,118 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long waiman.l...@hp.com
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include asm-generic/qspinlock_types.h
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+   return atomic_read(lock-val);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ *  locked wrt the lockref code to avoid lock stealing by the lockref
+ *  code and change things underneath the lock. This also allows some
+ *  optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+   return !atomic_read(lock.val);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+   return atomic_read(lock-val)  ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+   if (!atomic_read(lock-val) 
+  (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) == 0))
+   return 1;
+   return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+   u32 val;
+
+

[PATCH v13 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock

2014-10-29 Thread Waiman Long

This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket. The benefit of queue
spinlock may not be apparent. So queue spinlock is not enabled.

Currently, there is some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimization which will make the queue spinlock code perform
better than the generic implementation.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/Kconfig  |1 +
 arch/x86/include/asm/qspinlock.h  |   25 +
 arch/x86/include/asm/spinlock.h   |5 +
 arch/x86/include/asm/spinlock_types.h |4 
 4 files changed, 35 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f2327e8..9ee8324 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -122,6 +122,7 @@ config X86
select MODULES_USE_ELF_RELA if X86_64
select CLONE_BACKWARDS if X86_32
select ARCH_USE_BUILTIN_BSWAP
+   select ARCH_USE_QUEUE_SPINLOCK
select ARCH_USE_QUEUE_RWLOCK
select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
select OLD_SIGACTION if X86_32
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 000..a6a8762
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,25 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include asm-generic/qspinlock_types.h
+
+#ifndef CONFIG_X86_PPRO_FENCE
+
+#definequeue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * An effective smp_store_release() on the least-significant byte.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+   barrier();
+   ACCESS_ONCE(*(u8 *)lock) = 0;
+}
+
+#endif /* !CONFIG_X86_PPRO_FENCE */
+
+#include asm-generic/qspinlock.h
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 9295016..5899483 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include asm/qspinlock.h
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -180,6 +184,7 @@ static __always_inline void 
arch_spin_lock_flags(arch_spinlock_t *lock,
 {
arch_spin_lock(lock);
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index 5f9d757..5d654a1 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include asm-generic/qspinlock_types.h
+#else
 typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include asm-generic/qrwlock_types.h
 
-- 
1.7.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v13 00/11] qspinlock: a 4-byte queue spinlock with PV support

2014-10-29 Thread Waiman Long

v12-v13:
 - Change patch 9 to generate separate versions of the
   queue_spin_lock_slowpath functions for bare metal and PV guest. This
   reduces the performance impact of the PV code on bare metal systems.

v11-v12:
 - Based on PeterZ's version of the qspinlock patch
   (https://lkml.org/lkml/2014/6/15/63).
 - Incorporated many of the review comments from Konrad Wilk and
   Paolo Bonzini.
 - The pvqspinlock code is largely from my previous version with
   PeterZ's way of going from queue tail to head and his idea of
   using callee saved calls to KVM and XEN codes.

v10-v11:
  - Use a simple test-and-set unfair lock to simplify the code,
but performance may suffer a bit for large guest with many CPUs.
  - Take out Raghavendra KT's test results as the unfair lock changes
may render some of his results invalid.
  - Add PV support without increasing the size of the core queue node
structure.
  - Other minor changes to address some of the feedback comments.

v9-v10:
  - Make some minor changes to qspinlock.c to accommodate review feedback.
  - Change author to PeterZ for 2 of the patches.
  - Include Raghavendra KT's test results in patch 18.

v8-v9:
  - Integrate PeterZ's version of the queue spinlock patch with some
modification:
http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
  - Break the more complex patches into smaller ones to ease review effort.
  - Fix a racing condition in the PV qspinlock code.

v7-v8:
  - Remove one unneeded atomic operation from the slowpath, thus
improving performance.
  - Simplify some of the codes and add more comments.
  - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
unfair lock.
  - Reduce unfair lock slowpath lock stealing frequency depending
on its distance from the queue head.
  - Add performance data for IvyBridge-EX CPU.

v6-v7:
  - Remove an atomic operation from the 2-task contending code
  - Shorten the names of some macros
  - Make the queue waiter to attempt to steal lock when unfair lock is
enabled.
  - Remove lock holder kick from the PV code and fix a race condition
  - Run the unfair lock  PV code on overcommitted KVM guests to collect
performance data.

v5-v6:
 - Change the optimized 2-task contending code to make it fairer at the
   expense of a bit of performance.
 - Add a patch to support unfair queue spinlock for Xen.
 - Modify the PV qspinlock code to follow what was done in the PV
   ticketlock.
 - Add performance data for the unfair lock as well as the PV
   support code.

v4-v5:
 - Move the optimized 2-task contending code to the generic file to
   enable more architectures to use it without code duplication.
 - Address some of the style-related comments by PeterZ.
 - Allow the use of unfair queue spinlock in a real para-virtualized
   execution environment.
 - Add para-virtualization support to the qspinlock code by ensuring
   that the lock holder and queue head stay alive as much as possible.

v3-v4:
 - Remove debugging code and fix a configuration error
 - Simplify the qspinlock structure and streamline the code to make it
   perform a bit better
 - Add an x86 version of asm/qspinlock.h for holding x86 specific
   optimization.
 - Add an optimized x86 code path for 2 contending tasks to improve
   low contention performance.

v2-v3:
 - Simplify the code by using numerous mode only without an unfair option.
 - Use the latest smp_load_acquire()/smp_store_release() barriers.
 - Move the queue spinlock code to kernel/locking.
 - Make the use of queue spinlock the default for x86-64 without user
   configuration.
 - Additional performance tuning.

v1-v2:
 - Add some more comments to document what the code does.
 - Add a numerous CPU mode to support = 16K CPUs
 - Add a configuration option to allow lock stealing which can further
   improve performance in many cases.
 - Enable wakeup of queue head CPU at unlock time for non-numerous
   CPU mode.

This patch set has 3 different sections:
 1) Patches 1-6: Introduces a queue-based spinlock implementation that
can replace the default ticket spinlock without increasing the
size of the spinlock data structure. As a result, critical kernel
data structures that embed spinlock won't increase in size and
break data alignments.
 2) Patch 7: Enables the use of unfair lock in a virtual guest. This
can resolve some of the locking related performance issues due to
the fact that the next CPU to get the lock may have been scheduled
out for a period of time.
 3) Patches 8-11: Enable qspinlock para-virtualization support by
halting the waiting CPUs after spinning for a certain amount of
time. The unlock code will detect the a sleeping waiter and wake it
up. This is essentially the same logic as the PV ticketlock code.

The queue spinlock has slightly better performance than the ticket
spinlock in uncontended case. Its performance can be much better
with moderate to heavy contention.  This patch has the

[PATCH v13 03/11] qspinlock: Add pending bit

2014-10-29 Thread Waiman Long

From: Peter Zijlstra pet...@infradead.org

Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]); add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.

It is possible so observe the pending bit without the locked bit when
the last owner has just released but the pending owner has not yet
taken ownership.

In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed
to be released 'soon', therefore wait for it and avoid queueing.

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   12 +++-
 kernel/locking/qspinlock.c|  119 +++--
 2 files changed, 107 insertions(+), 24 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index 67a2110..4196694 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -36,8 +36,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ * 8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define_Q_SET_MASK(type)   (((1U  _Q_ ## type ## _BITS) - 1)\
   _Q_ ## type ## _OFFSET)
@@ -45,7 +46,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS 8
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS1
+#define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS   2
 #define _Q_TAIL_IDX_MASK   _Q_SET_MASK(TAIL_IDX)
 
@@ -54,5 +59,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL (1U  _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index c114076..226b11d 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -92,24 +92,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
return per_cpu_ptr(mcs_nodes[idx], cpu);
 }
 
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
  *
- * (queue tail, lock value)
- *
- *  fast  :slow  :
unlock
- *:  :
- * uncontended  (0,0)   --:-- (0,1) :-- (*,0)
- *:   | ^./  :
- *:   v   \   |  :
- * uncontended:(n,x) --+-- (n,0) |  :
- *   queue:   | ^--'  |  :
- *:   v   |  :
- * contended  :(*,x) --+-- (*,0) - (*,1) ---'  :
- *   queue: ^--' :
+ * (queue tail, pending bit, lock value)
  *
+ *  fast :slow  :unlock
+ *   :  :
+ * uncontended  (0,0,0) -:-- (0,0,1) --:-- 
(*,*,0)
+ *   :   | ^.--. /  :
+ *   :   v   \  \|  :
+ * pending   :(0,1,1) +-- (0,1,0)   \   |  :
+ *   :   | ^--'  |   |  :
+ *   :   v   |   |  :
+ * uncontended   :(n,x,y) +-- (n,0,0) --'   |  :
+ *   queue   :   | ^--'  |  :
+ *   :   v   |  :
+ * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -'  :
+ *   queue   : ^--' :
  */
 void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
@@ -119,6 +123,75 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
BUILD_BUG_ON(CONFIG_NR_CPUS = (1U  _Q_TAIL_CPU_BITS));
 
+   /*
+* wait for in-progress pending-locked hand-overs
+*
+* 0,1,0 - 0,0,1
+*/
+   if (val == _Q_PENDING_VAL) {
+   while ((val = atomic_read(lock-val)) == _Q_PENDING_VAL)
+

[PATCH v13 04/11] qspinlock: Extract out code snippets for the next patch

2014-10-29 Thread Waiman Long

This is a preparatory patch that extracts out the following 2 code
snippets to prepare for the next performance optimization patch.

 1) the logic for the exchange of new and previous tail code words
into a new xchg_tail() function.
 2) the logic for clearing the pending bit and setting the locked bit
into a new clear_pending_set_locked() function.

This patch also simplifies the trylock operation before queuing by
calling queue_spin_trylock() directly.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 include/asm-generic/qspinlock_types.h |2 +
 kernel/locking/qspinlock.c|   91 +---
 2 files changed, 62 insertions(+), 31 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index 4196694..88d647c 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -58,6 +58,8 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL (1U  _Q_PENDING_OFFSET)
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 226b11d..48bd2ad 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -95,6 +95,54 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
 /**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 - *,0,1
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+   u32 new, old;
+
+   for (;;) {
+   new = (val  ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
+
+   old = atomic_cmpxchg(lock-val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+}
+
+/**
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* - n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   u32 old, new, val = atomic_read(lock-val);
+
+   for (;;) {
+   new = (val  _Q_LOCKED_PENDING_MASK) | tail;
+   old = atomic_cmpxchg(lock-val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+   return old;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -176,15 +224,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 *
 * *,1,0 - *,0,1
 */
-   for (;;) {
-   new = (val  ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-
-   old = atomic_cmpxchg(lock-val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   clear_pending_set_locked(lock, val);
return;
 
/*
@@ -201,37 +241,26 @@ queue:
node-next = NULL;
 
/*
-* We have already touched the queueing cacheline; don't bother with
-* pending stuff.
-*
-* trylock || xchg(lock, node)
-*
-* 0,0,0 - 0,0,1 ; no tail, not locked - no tail, locked.
-* p,y,x - n,y,x ; tail was p - tail is n; preserving locked.
+* We touched a (possibly) cold cacheline in the per-cpu queue node;
+* attempt the trylock once more in the hope someone let go while we
+* weren't watching.
 */
-   for (;;) {
-   new = _Q_LOCKED_VAL;
-   if (val)
-   new = tail | (val  _Q_LOCKED_PENDING_MASK);
-
-   old = atomic_cmpxchg(lock-val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   if (queue_spin_trylock(lock))
+   goto release;
 
/*
-* we won the trylock; forget about queueing.
+* We have already touched the queueing cacheline; don't bother with
+* pending stuff.
+*
+* p,*,* - n,*,*
 */
-   if (new == _Q_LOCKED_VAL)
-   goto release;
+   old = xchg_tail(lock, tail);
 
/*
 * if there was a previous node; link it and wait until reaching the
 * head of the waitqueue.
 */
-   if (old  ~_Q_LOCKED_PENDING_MASK) {
+   if (old  _Q_TAIL_MASK) {
prev =

[PATCH v13 06/11] qspinlock: Use a simple write to grab the lock

2014-10-29 Thread Waiman Long

Currently, atomic_cmpxchg() is used to get the lock. However, this
is not really necessary if there is more than one task in the queue
and the queue head don't need to reset the tail code. For that case,
a simple write to set the lock bit is enough as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending bit
waiting code will ensure that the bit will not be set as soon as the
tail code in the lock is set.

With that change, the are some slight improvement in the performance
of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
Westere-EX machine as shown in the tables below.

[Standalone/Embedded - same node]
  # of tasksBefore patchAfter patch %Change
  ----- --  ---
   3 2324/2321  2248/2265-3%/-2%
   4 2890/2896  2819/2831-2%/-2%
   5 3611/3595  3522/3512-2%/-2%
   6 4281/4276  4173/4160-3%/-3%
   7 5018/5001  4875/4861-3%/-3%
   8 5759/5750  5563/5568-3%/-3%

[Standalone/Embedded - different nodes]
  # of tasksBefore patchAfter patch %Change
  ----- --  ---
   312242/12237 12087/12093  -1%/-1%
   410688/10696 10507/10521  -2%/-2%

It was also found that this change produced a much bigger performance
improvement in the newer IvyBridge-EX chip and was essentially to close
the performance gap between the ticket spinlock and queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:

AIM7 XFS Disk Test
  kernel JPMReal Time   Sys TimeUsr Time
  -  ----   
  ticketlock56782333.17   96.61   5.81
  qspinlock 57507993.13   94.83   5.97

AIM7 EXT4 Disk Test
  kernel JPMReal Time   Sys TimeUsr Time
  -  ----   
  ticketlock1114551   16.15  509.72   7.11
  qspinlock 21844668.24  232.99   6.01

The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.

The ebizzy -m test was also run with the following results:

  kernel   records/s  Real Time   Sys TimeUsr Time
  --  -   
  ticketlock 2075   10.00  216.35   3.49
  qspinlock  3023   10.00  198.20   4.80

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 kernel/locking/qspinlock.c |   59 
 1 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7c127b4..fb0e988 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -103,24 +103,33 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
  * By using the whole 2nd least significant byte for the pending bit, we
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
  */
-#if _Q_PENDING_BITS == 8
-
 struct __qspinlock {
union {
atomic_t val;
-   struct {
 #ifdef __LITTLE_ENDIAN
+   u8   locked;
+   struct {
u16 locked_pending;
u16 tail;
+   };
 #else
+   struct {
u16 tail;
u16 locked_pending;
-#endif
};
+   struct {
+   u8  reserved[3];
+   u8  locked;
+   };
+#endif
};
 };
 
+#if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -207,6 +216,19 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,*,0 - *,0,1
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   ACCESS_ONCE(l-locked) = _Q_LOCKED_VAL;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue

[PATCH v13 07/11] qspinlock: Revert to test-and-set on hypervisors

2014-10-29 Thread Waiman Long

From: Peter Zijlstra pet...@infradead.org

When we detect a hypervisor (!paravirt, see qspinlock paravirt support
patches), revert to a simple test-and-set lock to avoid the horrors
of queue preemption.

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/qspinlock.h |   14 ++
 include/asm-generic/qspinlock.h  |7 +++
 kernel/locking/qspinlock.c   |3 +++
 3 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index a6a8762..05a77fe 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_QSPINLOCK_H
 #define _ASM_X86_QSPINLOCK_H
 
+#include asm/cpufeature.h
 #include asm-generic/qspinlock_types.h
 
 #ifndef CONFIG_X86_PPRO_FENCE
@@ -20,6 +21,19 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 
 #endif /* !CONFIG_X86_PPRO_FENCE */
 
+#define virt_queue_spin_lock virt_queue_spin_lock
+
+static inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+   return false;
+
+   while (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) != 0)
+   cpu_relax();
+
+   return true;
+}
+
 #include asm-generic/qspinlock.h
 
 #endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index e8a7ae8..a53a7bb 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -98,6 +98,13 @@ static __always_inline void queue_spin_unlock(struct 
qspinlock *lock)
 }
 #endif
 
+#ifndef virt_queue_spin_lock
+static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   return false;
+}
+#endif
+
 /*
  * Initializier
  */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fb0e988..1c1926a 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -257,6 +257,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
BUILD_BUG_ON(CONFIG_NR_CPUS = (1U  _Q_TAIL_CPU_BITS));
 
+   if (virt_queue_spin_lock(lock))
+   return;
+
/*
 * wait for in-progress pending-locked hand-overs
 *
-- 
1.7.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v13 05/11] qspinlock: Optimize for smaller NR_CPUS

2014-10-29 Thread Waiman Long

From: Peter Zijlstra pet...@infradead.org

When we allow for a max NR_CPUS  2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.

By growing the pending bit to a byte, we reduce the tail to 16bit.
This means we can use xchg16 for the tail part and do away with all
the repeated compxchg() operations.

This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.

This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.

All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   13 ++
 kernel/locking/qspinlock.c|   71 -
 2 files changed, 83 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index 88d647c..01b46df 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -35,6 +35,14 @@ typedef struct qspinlock {
 /*
  * Bitfields in the atomic value:
  *
+ * When NR_CPUS  16K
+ *  0- 7: locked byte
+ * 8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS = 16K
  *  0- 7: locked byte
  * 8: pending
  *  9-10: tail index
@@ -47,7 +55,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS  (1U  14)
+#define _Q_PENDING_BITS8
+#else
 #define _Q_PENDING_BITS1
+#endif
 #define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -58,6 +70,7 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
 #define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 48bd2ad..7c127b4 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -22,6 +22,7 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/mutex.h
+#include asm/byteorder.h
 #include asm/qspinlock.h
 
 /*
@@ -54,6 +55,10 @@
  * node; whereby avoiding the need to carry a node from lock to unlock, and
  * preserving existing lock API. This also makes the unlock code simpler and
  * faster.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ *  atomic operations on smaller 8-bit and 16-bit data types.
+ *
  */
 
 #include mcs_spinlock.h
@@ -94,6 +99,64 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
+/*
+ * By using the whole 2nd least significant byte for the pending bit, we
+ * can allow better optimization of the lock acquisition for the pending
+ * bit holder.
+ */
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+   union {
+   atomic_t val;
+   struct {
+#ifdef __LITTLE_ENDIAN
+   u16 locked_pending;
+   u16 tail;
+#else
+   u16 tail;
+   u16 locked_pending;
+#endif
+   };
+   };
+};
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 - *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   ACCESS_ONCE(l-locked_pending) = _Q_LOCKED_VAL;
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* - n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   return (u32)xchg(l-tail, tail  _Q_TAIL_OFFSET)  _Q_TAIL_OFFSET;
+}
+
+#else /* _Q_PENDING_BITS == 8 */
+
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -141,6 +204,7 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
}

[PATCH v13 08/11] qspinlock, x86: Rename paravirt_ticketlocks_enabled

2014-10-29 Thread Waiman Long

This patch renames the paravirt_ticketlocks_enabled static key to a
more generic paravirt_spinlocks_enabled name.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/include/asm/spinlock.h  |4 ++--
 arch/x86/kernel/kvm.c|2 +-
 arch/x86/kernel/paravirt-spinlocks.c |4 ++--
 arch/x86/xen/spinlock.c  |2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 5899483..928751e 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -39,7 +39,7 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD (1  15)
 
-extern struct static_key paravirt_ticketlocks_enabled;
+extern struct static_key paravirt_spinlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
 #ifdef CONFIG_QUEUE_SPINLOCK
@@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t 
*lock,
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
if (TICKET_SLOWPATH_FLAG 
-   static_key_false(paravirt_ticketlocks_enabled)) {
+   static_key_false(paravirt_spinlocks_enabled)) {
arch_spinlock_t prev;
 
prev = *lock;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index f6945be..eaa064c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -827,7 +827,7 @@ static __init int kvm_spinlock_init_jump(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return 0;
 
-   static_key_slow_inc(paravirt_ticketlocks_enabled);
+   static_key_slow_inc(paravirt_spinlocks_enabled);
printk(KERN_INFO KVM setup paravirtual spinlock\n);
 
return 0;
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index bbb6c73..e434f24 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -16,5 +16,5 @@ struct pv_lock_ops pv_lock_ops = {
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_spinlocks_enabled);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 23b45eb..d332ae0 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jump(void)
if (!xen_domain())
return 0;
 
-   static_key_slow_inc(paravirt_ticketlocks_enabled);
+   static_key_slow_inc(paravirt_spinlocks_enabled);
return 0;
 }
 early_initcall(xen_init_spinlocks_jump);
-- 
1.7.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-29 Thread Waiman Long


On 10/29/2014 03:05 PM, Waiman Long wrote:

On 10/27/2014 05:22 PM, Waiman Long wrote:

On 10/27/2014 02:04 PM, Peter Zijlstra wrote:

On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:

On 10/24/2014 04:54 AM, Peter Zijlstra wrote:

On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:

Since enabling paravirt spinlock will disable unlock function 
inlining,
a jump label can be added to the unlock function without adding 
patch

sites all over the kernel.

But you don't have to. My patches allowed for the inline to remain,
again reducing the overhead of enabling PV spinlocks while running 
on a

real machine.

Look at:

   http://lkml.kernel.org/r/20140615130154.213923...@chello.nl

In particular this hunk:

Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
===
--- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
+++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
@@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, swapgs)
  DEF_NATIVE(, mov32, mov %edi, %eax);
  DEF_NATIVE(, mov64, mov %rdi, %rax);

+#if defined(CONFIG_PARAVIRT_SPINLOCKS)   
defined(CONFIG_QUEUE_SPINLOCK)

+DEF_NATIVE(pv_lock_ops, queue_unlock, movb $0, (%rdi));
+#endif
+
  unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
  {
 return paravirt_patch_insns(insnbuf, len,
@@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
 PATCH_SITE(pv_cpu_ops, clts);
 PATCH_SITE(pv_mmu_ops, flush_tlb_single);
 PATCH_SITE(pv_cpu_ops, wbinvd);
+#if defined(CONFIG_PARAVIRT_SPINLOCKS)   
defined(CONFIG_QUEUE_SPINLOCK)

+   PATCH_SITE(pv_lock_ops, queue_unlock);
+#endif

 patch_site:
 ret = paravirt_patch_insns(ibuf, len, start, end);


That makes sure to overwrite the callee-saved call to the
pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi).


Therefore you can retain the inlined unlock with hardly (there 
might be

some NOP padding) any overhead at all. On PV it reverts to a callee
saved function call.
My concern is that spin_unlock() can be called in many places, 
including
loadable kernel modules. Can the paravirt_patch_ident_32() function 
able to
patch all of them in reasonable time? How about a kernel module 
loaded later

at run time?

modules should be fine, see arch/x86/kernel/module.c:module_finalize()
-  apply_paravirt().

Also note that the 'default' text is an indirect call into the paravirt
ops table which routes to the 'right' function, so even if the text
patching would be 'late' calls would 'work' as expected, just slower.


Thanks for letting me know about that. I have this concern because 
your patch didn't change the current configuration of disabling 
unlock inlining when paravirt_spinlock is enabled. With that, I think 
it is worthwhile to reduce the performance delta between the PV and 
non-PV kernel on bare metal.


I am sorry that the unlock call sites patching code doesn't work in a 
virtual guest. Your pvqspinlock patch did an unconditional patching 
even in a virtual guest. I added check for the 
paravirt_spinlocks_enabled, but it turned out that some spin_unlock() 
seemed to be called before paravirt_spinlocks_enabled is set. As a 
result, some call sites were still patched resulting in missed wake 
up's and system hang.


At this point, I am going to leave out that change from my patch set 
until we can figure out a better way of doing that.




Below was a partial kernel log with the unlock call site patch code in a 
KVM guest:


[0.438006] native_patch: patch out pv_queue_unlock!
[0.438565] native_patch: patch out pv_queue_unlock!
[0.439006] native_patch: patch out pv_queue_unlock!
[0.439638] native_patch: patch out pv_queue_unlock!
[0.440052] native_patch: patch out pv_queue_unlock!
[0.441006] native_patch: patch out pv_queue_unlock!
[0.441566] native_patch: patch out pv_queue_unlock!
[0.442035] ftrace: allocating 24168 entries in 95 pages
[0.451208] Switched APIC routing to physical flat.
[0.453202] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[0.454002] smpboot: CPU0: Intel QEMU Virtual CPU version 1.5.3 (fam: 
06, model: 06, stepping: 03)
[0.456000] Performance Events: Broken PMU hardware detected, using 
software events only.

[0.456003] Failed to access perfctr msr (MSR c1 is 0)
[0.457151] KVM setup paravirtual spinlock
[0.460039] NMI watchdog: disabled (cpu0): hardware events not enabled

It could be seen that some unlock call sites were patched before the KVM 
setup code set the paravirt_spinlocks_enabled flag.


-Longman

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

[PATCH v3 2/3] x86: tss: Eliminate fragile calculation of TSS segment limit

[PATCH v3 1/3] x86: process: Unify 32-bit and 64-bit copy_thread I/O bitmap handling

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

[PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

RE: [Xen-devel] [RFC] Hypervisor RNG and enumeration

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

Re: [PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)

Re: [PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)

Re: [PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)

Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration

Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

[PATCH v13 01/11] qspinlock: A simple generic 4-byte queue spinlock

[PATCH v13 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock

[PATCH v13 00/11] qspinlock: a 4-byte queue spinlock with PV support

[PATCH v13 03/11] qspinlock: Add pending bit

[PATCH v13 04/11] qspinlock: Extract out code snippets for the next patch

[PATCH v13 06/11] qspinlock: Use a simple write to grab the lock

[PATCH v13 07/11] qspinlock: Revert to test-and-set on hypervisors

[PATCH v13 05/11] qspinlock: Optimize for smaller NR_CPUS

[PATCH v13 08/11] qspinlock, x86: Rename paravirt_ticketlocks_enabled

Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

29 matches

Site Navigation

Mail list logo

Footer information