Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On 29/10/14 05:19, Andy Lutomirski wrote: Here's a draft CommonHV spec. It's also on github: https://github.com/amluto/CommonHV So far, this provides a two-way RNG interface, a way to detect it, and a way to detect other hypervisor leaves. The latter is because, after both the enormous public thread and some private discussions, it seems that detection of existing CPUID paravirt leaves is annoying and inefficient. If we're going to define some cross-vendor CPUID leaves, it seems like it would be useful to offer a way to quickly enumerate other leaves. I've been told the AMD intends to update their manual to match Intel's so that hypervisors can use the entire 0x4F?? CPUID range. I have intentionally not fixed an MSR value for the RNG because the range of allowed MSRs is very small in both the Intel and AMD manuals. If any given hypervisor wants to ignore that small range and advertise a higher-numbered MSR, it is welcome to, but I don't want to codify something that doesn't comply with the manuals. Here's the draft. Comments? To the people who work on various hypervisors: Would you implement this? Do you like it? Is there anything, major or minor, that you'd like to see changed? Do you think that this is a good idea at all? As a first pass, it looks like a plausible idea. I do however have come comments. I've tried to get good coverage of various hypervisors. There are Hyper-V, VMWare, KVM, and Xen people on the cc list. Thanks, Andy CommonHV, a common hypervisor interface === This is CommonHV draft 1. The CommonHV specification is Copyright (c) 2014 Andrew Lutomirski. Licensing will be determined soon. The license is expected to be extremely liberal. I am currently leaning towards CC-BY-SA for the specification and an explicit license permitting anyone to implement the specification with no restrictions whatsoever. I have not patented, nor do I intend to patent, anything required to implement this specification. I am not aware of any current or future intellectual property rights that would prevent a royalty-free implementation of this specification. I would like to find a stable, neutral steward of this specification going forward. Help with this would be much appreciated. Scope - CommonHV is a simple interface for communication between hypervisors and their guests. CommonHV is intended to be very simple and to avoid interfering with existing paravirtual interfaces. To that end, its scope is limited. CommonHV does only two types of things: * It provides a way to enumerate other paravirtual interfaces. * It provides a small, extensible set of paravirtual features that do not modify or replace standard system functionality. For example, CommonHV does not and will not define anything related to interrupt handling or virtual CPU management. For now, CommonHV is only applicable to the x86 platform. Discovery - A CommonHV hypervisor MUST set the hypervisor bit (bit 31 in CPUID.1H.0H.ECX) and provide the CPUID leaf 4F00H, containing: * CPUID.4F00H.0H.EAX = max_commonhv_leaf * CPUID.4F00H.0H.EBX = 0x6D6D6F43 * CPUID.4F00H.0H.ECX = 0x56486E6F * CPUID.4F00H.0H.EDX = 0x66746e49 EBX, ECX, and EDX form the string CommonHVIntf in little-endian ASCII. While testing various nested combinations, XenServer has found that modern Windows Server versions must have the hypervisor bit hidden from them for them to be happy running HyperV, despite the fact that they will make use of the Viridian virtual extensions also provided. As a result, while it is certainly advisable for the hypervisor bit to be set, CommonHV should be available to be found by paravirtualised drivers inside an OS which can't cope with the hypervisor bit set. max_commonhv_leaf MUST be a number between 0x4F00 and 0x4FFF. It indicates the largest leaf defined in this specification that is provided. Any leaves described in this specification with EAX values that exceed max_commonhv_leaf MUST be handled by guests as though they contain all zeros. CPUID leaf 4F01H: hypervisor interface enumeration -- If max_commonhv_leaf = 0x4F01, CommonHV provides a list of tuples (location, signature). Each tuple indicates the presence of another paravirtual interface identified by the signature at the indicated CPUID location. It is expected that CPUID.location.0H will have (EBX, ECX, EDX) == signature, although whether this is required is left to the specification associated with the given signature. If the list contains N tuples, then, for each 0 = i N: * CPUID.4F01H.i.EBX, CPUID.4F01H.i.ECX, and CPUID.4F01H.i.EDX are the signature. * CPUID.4F01H.i.EAX is the location. CPUID with EAX = 0x4F01 and ECX = N MUST return all zeros. To the extent that the
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On 10/29/2014 11:37 AM, Andrew Cooper wrote: While testing various nested combinations, XenServer has found that modern Windows Server versions must have the hypervisor bit hidden from them for them to be happy running HyperV, despite the fact that they will make use of the Viridian virtual extensions also provided. Right. As a result, while it is certainly advisable for the hypervisor bit to be set, CommonHV should be available to be found by paravirtualised drivers inside an OS which can't cope with the hypervisor bit set. Microsoft should just stop putting arbitrary limitations on their software; or pay the price which, in this case, is not being able to use the features from the common specification. I guess what they'd do is reinvent the RNG as a Viridian extension (if they need it). You can certainly do CPUID(0x4F00) even if HYPERVISOR=0. What you get back is undefined, but in all likelihood it won't be the CommonHVIntf string. Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On 29/10/14 13:45, Paolo Bonzini wrote: On 10/29/2014 11:37 AM, Andrew Cooper wrote: While testing various nested combinations, XenServer has found that modern Windows Server versions must have the hypervisor bit hidden from them for them to be happy running HyperV, despite the fact that they will make use of the Viridian virtual extensions also provided. Right. As a result, while it is certainly advisable for the hypervisor bit to be set, CommonHV should be available to be found by paravirtualised drivers inside an OS which can't cope with the hypervisor bit set. Microsoft should just stop putting arbitrary limitations on their software; or pay the price which, in this case, is not being able to use the features from the common specification. I guess what they'd do is reinvent the RNG as a Viridian extension (if they need it). You can certainly do CPUID(0x4F00) even if HYPERVISOR=0. What you get back is undefined, but in all likelihood it won't be the CommonHVIntf string. Microsoft already has a specification to obtain a random number via an ACPI device. The VM Generation ID. http://www.microsoft.com/en-us/download/details.aspx?id=30707 David ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On 10/29/2014 02:57 PM, David Vrabel wrote: Microsoft already has a specification to obtain a random number via an ACPI device. The VM Generation ID. http://www.microsoft.com/en-us/download/details.aspx?id=30707 That's a very different thing. The VM Generation ID always returns the same number if you run the VM from the same configuration file. It tells the VM when it might need a reseed, but provides neither a best effort random number like this spec does, nor an entropy source like virtio-rng does. Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 2/3] x86: tss: Eliminate fragile calculation of TSS segment limit
__set_tss_desc has a complex calculation of the TSS segment limit, duplicating the quirky details of the I/O bitmap array length, and requiring a complex comment to explain. Replace that calculation with a simpler one based on the offsetof the stack field that follows the array. That then removes the last use of IO_BITMAP_OFFSET, so delete it. Signed-off-by: Josh Triplett j...@joshtriplett.org Acked-by: Alexander van Heukelum heuke...@fastmail.fm --- v3: No changes in this patch, only in patch 3/3. arch/x86/include/asm/desc.h | 11 +-- arch/x86/include/asm/processor.h | 4 +++- 2 files changed, 4 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h index 50d033a..f8dc455 100644 --- a/arch/x86/include/asm/desc.h +++ b/arch/x86/include/asm/desc.h @@ -177,16 +177,7 @@ static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr) struct desc_struct *d = get_cpu_gdt_table(cpu); tss_desc tss; - /* -* sizeof(unsigned long) coming from an extra long at the end -* of the iobitmap. See tss_struct definition in processor.h -* -* -1? seg base+limit should be pointing to the address of the -* last valid byte -*/ - set_tssldt_descriptor(tss, (unsigned long)addr, DESC_TSS, - IO_BITMAP_OFFSET + IO_BITMAP_BYTES + - sizeof(unsigned long) - 1); + set_tssldt_descriptor(tss, (unsigned long)addr, DESC_TSS, TSS_LIMIT); write_gdt_entry(d, entry, tss, DESC_TSS); } diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index eb71ec7..76751cd 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -258,9 +258,11 @@ struct x86_hw_tss { #define IO_BITMAP_BITS 65536 #define IO_BITMAP_BYTES(IO_BITMAP_BITS/8) #define IO_BITMAP_LONGS(IO_BITMAP_BYTES/sizeof(long)) -#define IO_BITMAP_OFFSET offsetof(struct tss_struct, io_bitmap) #define INVALID_IO_BITMAP_OFFSET 0x8000 +/* Segment limits point to the last valid byte, hence the -1 */ +#define TSS_LIMIT (offsetof(struct tss_struct, stack) - 1) + struct tss_struct { /* * The hardware state: -- 2.1.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 1/3] x86: process: Unify 32-bit and 64-bit copy_thread I/O bitmap handling
The 32-bit and 64-bit versions of copy_thread have functionally identical handling for copying the I/O bitmap, modulo differences in error handling. Clean up the error paths in both by moving the copy of the I/O bitmap to the end, to eliminate the need to free it if subsequent copy steps fail; move the resulting identical code to a static inline in a common header. Signed-off-by: Josh Triplett j...@joshtriplett.org Acked-by: Kees Cook keesc...@chromium.org --- v3: No changes in this patch, only in patch 3/3. arch/x86/kernel/process-io.h | 22 ++ arch/x86/kernel/process_32.c | 28 arch/x86/kernel/process_64.c | 25 + 3 files changed, 35 insertions(+), 40 deletions(-) create mode 100644 arch/x86/kernel/process-io.h diff --git a/arch/x86/kernel/process-io.h b/arch/x86/kernel/process-io.h new file mode 100644 index 000..d88 --- /dev/null +++ b/arch/x86/kernel/process-io.h @@ -0,0 +1,22 @@ +#ifndef _X86_KERNEL_PROCESS_IO_H +#define _X86_KERNEL_PROCESS_IO_H + +static inline int copy_io_bitmap(struct task_struct *me, +struct task_struct *p) +{ + if (unlikely(test_tsk_thread_flag(me, TIF_IO_BITMAP))) { + p-thread.io_bitmap_ptr = kmemdup(me-thread.io_bitmap_ptr, + IO_BITMAP_BYTES, GFP_KERNEL); + if (!p-thread.io_bitmap_ptr) { + p-thread.io_bitmap_max = 0; + return -ENOMEM; + } + set_tsk_thread_flag(p, TIF_IO_BITMAP); + } else { + p-thread.io_bitmap_ptr = NULL; + } + + return 0; +} + +#endif /* _X86_KERNEL_PROCESS_IO_H */ diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 8f3ebfe..07550ff 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -55,6 +55,8 @@ #include asm/debugreg.h #include asm/switch_to.h +#include process-io.h + asmlinkage void ret_from_fork(void) __asm__(ret_from_fork); asmlinkage void ret_from_kernel_thread(void) __asm__(ret_from_kernel_thread); @@ -134,7 +136,6 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, { struct pt_regs *childregs = task_pt_regs(p); struct task_struct *tsk; - int err; p-thread.sp = (unsigned long) childregs; p-thread.sp0 = (unsigned long) (childregs+1); @@ -166,32 +167,19 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, p-thread.io_bitmap_ptr = NULL; tsk = current; - err = -ENOMEM; - - if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) { - p-thread.io_bitmap_ptr = kmemdup(tsk-thread.io_bitmap_ptr, - IO_BITMAP_BYTES, GFP_KERNEL); - if (!p-thread.io_bitmap_ptr) { - p-thread.io_bitmap_max = 0; - return -ENOMEM; - } - set_tsk_thread_flag(p, TIF_IO_BITMAP); - } - - err = 0; /* * Set a new TLS for the child thread? */ - if (clone_flags CLONE_SETTLS) + if (clone_flags CLONE_SETTLS) { + int err; err = do_set_thread_area(p, -1, (struct user_desc __user *)childregs-si, 0); - - if (err p-thread.io_bitmap_ptr) { - kfree(p-thread.io_bitmap_ptr); - p-thread.io_bitmap_max = 0; + if(err) + return err; } - return err; + + return copy_io_bitmap(tsk, p); } void diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 3ed4a68..b1babb4 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -50,6 +50,8 @@ #include asm/debugreg.h #include asm/switch_to.h +#include process-io.h + asmlinkage extern void ret_from_fork(void); __visible DEFINE_PER_CPU(unsigned long, old_rsp); @@ -154,7 +156,6 @@ static inline u32 read_32bit_tls(struct task_struct *t, int tls) int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg, struct task_struct *p) { - int err; struct pt_regs *childregs; struct task_struct *me = current; @@ -191,21 +192,11 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, if (sp) childregs-sp = sp; - err = -ENOMEM; - if (unlikely(test_tsk_thread_flag(me, TIF_IO_BITMAP))) { - p-thread.io_bitmap_ptr = kmemdup(me-thread.io_bitmap_ptr, - IO_BITMAP_BYTES, GFP_KERNEL); - if (!p-thread.io_bitmap_ptr) { - p-thread.io_bitmap_max = 0; - return -ENOMEM; - } - set_tsk_thread_flag(p, TIF_IO_BITMAP); - } - /* * Set a new TLS for the child thread? */
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On 10/29/2014 03:37 AM, Andrew Cooper wrote: CPUID with EAX = 0x4F01 and ECX = N MUST return all zeros. To the extent that the hypervisor prefers a given interface, it should specify that interface earlier in the list. For example, KVM might place its KVMKVMKVM signature first in the list to indicate that it should be used by guests in preference to other supported interfaces. Other hypervisors would likely use a different order. The exact semantics of the ordering of the list is beyond the scope of this specification. How do you evaluate N? It would make more sense for CPUID.4F01[ECX=0] to return N in one register, and perhaps prefered interface index in another. The signatures can then be obtained from CPUID.4F01[ECX={1 to N}]. That way, a consumer can be confident that they have found all the signatures, without relying on an unbounded loop and checking for zeroes Yes. Specifically, it should return it in EAX. That is the preferred interface and we are trying to push for that going forward. -hpa ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On 10/29/2014 03:37 AM, Andrew Cooper wrote: CPUID with EAX = 0x4F01 and ECX = N MUST return all zeros. To the extent that the hypervisor prefers a given interface, it should specify that interface earlier in the list. For example, KVM might place its KVMKVMKVM signature first in the list to indicate that it should be used by guests in preference to other supported interfaces. Other hypervisors would likely use a different order. The exact semantics of the ordering of the list is beyond the scope of this specification. How do you evaluate N? It would make more sense for CPUID.4F01[ECX=0] to return N in one register, and perhaps prefered interface index in another. The signatures can then be obtained from CPUID.4F01[ECX={1 to N}]. That way, a consumer can be confident that they have found all the signatures, without relying on an unbounded loop and checking for zeroes Yes. Specifically, it should return it in EAX. That is the preferred interface and we are trying to push for that going forward. -hpa ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On 10/29/2014 03:37 AM, Andrew Cooper wrote: CPUID with EAX = 0x4F01 and ECX = N MUST return all zeros. To the extent that the hypervisor prefers a given interface, it should specify that interface earlier in the list. For example, KVM might place its KVMKVMKVM signature first in the list to indicate that it should be used by guests in preference to other supported interfaces. Other hypervisors would likely use a different order. The exact semantics of the ordering of the list is beyond the scope of this specification. How do you evaluate N? It would make more sense for CPUID.4F01[ECX=0] to return N in one register, and perhaps prefered interface index in another. The signatures can then be obtained from CPUID.4F01[ECX={1 to N}]. That way, a consumer can be confident that they have found all the signatures, without relying on an unbounded loop and checking for zeroes Yes. Specifically, it should return it in EAX. That is the preferred interface and we are trying to push for that going forward. -hpa ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)
On the vast majority of modern systems, no processes will use the userspsace I/O syscalls, iopl and ioperm. Add a new config option, CONFIG_X86_IOPORT, to support configuring them out of the kernel entirely. Most current systems do not run programs using these syscalls, so X86_IOPORT does not depend on EXPERT, though it does still default to y. In addition to saving a significant amount of space, this also reduces the size of several major kernel data structures, drops a bit of code from several task-related hot paths, and reduces the attack surface of the kernel. All of the task-related code dealing with userspace I/O permissions now lives in process-io.h as several new inline functions, which become no-ops without CONFIG_X86_IOPORT. bloat-o-meter results: add/remove: 0/4 grow/shrink: 4/9 up/down: 83/-9074 (-8991) function old new delta drop_fpu 80 160 +80 perf_event_init_task 638 639 +1 perf_event_exec 192 193 +1 perf_event_aux 132 133 +1 test_tsk_thread_flag 10 - -10 ioperm_active 14 3 -11 init_task916 904 -12 flush_thread 168 149 -19 cpu_init 364 339 -25 __drop_fpu60 - -60 ioperm_get92 6 -86 exit_thread 105 10 -95 sys_iopl 106 --106 copy_thread 364 257-107 __switch_to_xtra 234 123-111 sys_ioperm 240 --240 init_tss8576 384 -8192 Signed-off-by: Josh Triplett j...@joshtriplett.org --- v3: Eliminated several #ifdefs, and in particular almost all #ifdefs in .c files, by adding a macro INIT_SET_IOPL_MASK to use in place of the initializer for set_iopl_mask, and using __maybe_unused rather than wrapping function definitions in #ifdef. Rebased on v3.18-rc1. Recomputed bloat-o-meter. I assume this should go through the x86 tree rather than the tiny tree. arch/x86/Kconfig | 10 + arch/x86/include/asm/paravirt.h | 2 + arch/x86/include/asm/paravirt_types.h | 5 +++ arch/x86/include/asm/processor.h | 51 + arch/x86/include/asm/syscalls.h | 3 ++ arch/x86/kernel/Makefile | 3 +- arch/x86/kernel/cpu/common.c | 12 +- arch/x86/kernel/entry_64.S| 9 +++-- arch/x86/kernel/paravirt.c| 2 +- arch/x86/kernel/process-io.h | 71 +++ arch/x86/kernel/process.c | 34 ++--- arch/x86/kernel/process_32.c | 13 ++- arch/x86/kernel/process_64.c | 2 +- arch/x86/kernel/ptrace.c | 8 arch/x86/xen/enlighten.c | 4 +- drivers/tty/vt/vt_ioctl.c | 2 +- kernel/sys_ni.c | 5 +++ 17 files changed, 169 insertions(+), 67 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f2327e8..a7de2eb 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -984,6 +984,16 @@ config X86_ESPFIX64 def_bool y depends on X86_16BIT X86_64 +config X86_IOPORT + bool iopl and ioperm system calls + default y + ---help--- + This option enables the iopl and ioperm system calls, which allow + privileged userspace processes to directly access I/O ports. This + is used by software that drives hardware directly from userspace + without using a kernel driver. Unless you intend to run such + software, you can safely say N here. + config TOSHIBA tristate Toshiba Laptop support depends on X86_32 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index cd6e161..52d2ec8 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -299,10 +299,12 @@ static inline void write_idt_entry(gate_desc *dt, int entry, const gate_desc *g) { PVOP_VCALL3(pv_cpu_ops.write_idt_entry, dt, entry, g); } +#ifdef CONFIG_X86_IOPORT static inline void set_iopl_mask(unsigned mask) { PVOP_VCALL1(pv_cpu_ops.set_iopl_mask, mask); } +#endif /* CONFIG_X86_IOPORT */ /* The paravirtualized I/O functions */ static inline void slow_down_io(void) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 7549b8b..a22a5d4 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -142,7 +142,12 @@ struct pv_cpu_ops
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On Oct 29, 2014 8:17 AM, Ian Jackson ian.jack...@eu.citrix.com wrote: Andy Lutomirski writes ([Xen-devel] [RFC] Hypervisor RNG and enumeration): Here's a draft CommonHV spec. It's also on github: https://github.com/amluto/CommonHV This a worthwhile direction to investigate, and an interesting proposal. From a Xen point of view I have some concerns, though. I think in Xen we would want to implement the bulk of the provision of random numbers to guests outside the hypervisor. That is, the hypervisor itself should not have random number pool code, associated policy, and so on. We would like to avoid adding too much code to the hypervisor. That functionality should live in the lower toolstack layers. For the benefit of people who want to do radical disaggregation (for security reasons), it should be capable of being provided by a different domain to dom0. I think a fully general interface based purely on MSRs makes that difficult in a number of ways: * Currently I don't think there is any way in Xen to cause MSR accesses to be passed to toolstack support programs. * In some configurations, Xen PV domains do not have a suitable longrunning service process to handle requests of this kind. * MSR reads of this kind might be expected to be fast but if they involve trapping to a service domain they might be very slow. I have no objection to specifying that these reads may be quite slow. Guests should only use them at boot and if they have some reason to distrust their RNG pool. The latter can legitimately happen after various types of suspend or after migration (detected by VM Generation ID, for example). * This interface is very x86-specific. Agreed. Part of the motivation is to allow guests to use this mechanism very early in boot for stack canaries, ASLR, etc. I don't know a good way to do that without something very platform specific. It seems to me that the real need for this facility is to provide a good seed for the guest's own cryptographic PRNG. If we restrict ourselves to that use case, we can sidestep the difficulties. In particular, the parts of this proposal that are most difficult are: * The facility for the guest to provide random numbers back to the host. I think this can be dealt with easily by hypervisor-specific mechanisms, if it is desirable. Xen can implement this is a no-op. If we use feature bits, wrmsr support could be separately enumerated. * The implication that a hypervisor ought to be providing a unbounded stream of random numbers, rather than a fixed amount of seed. I don't expect hypervisors to estimate the entropy available through this mechanism. Given that, the length of the stream is largely irrelevant, except that an unbounded stream allowed reseeding after boot. I think the most obvious approach would be to provide the VM, at startup, with a page containing a fixed amount of random number seed, along with some metatdata. Some platform-specific way of discovering the location of the page would have to be defined. (That might an MSR but more likely it would be Device Tree or ACPI.) After the guest has read the page, it would be free to treat it as normal memory. The metadata need do little more than specify the length and the amount of provided entropy. There should be some room for expansion. ACPI is not useful early in boot. DT might be, but that could be a separate spec. The specification should say that the provided seed MUST be cryptographically secure, MUST have as much entropy as stated and that that amount of entropy MUST be at least (say) 64 bits and SHOULD be at least (say) 256 bits. I don't think this is practical. It will require hypervisors to throttle guest startup to ensure that they have that much entropy. I'm not fundamentally opposed to allowing hypervisors to provide more than 64 bits of data per invocation, which would help when the trap is very slow, but I don't know of a suitably simple way to do that. --Andy ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On Wed, Oct 29, 2014 at 9:07 AM, H. Peter Anvin h...@zytor.com wrote: On 10/29/2014 03:37 AM, Andrew Cooper wrote: CPUID with EAX = 0x4F01 and ECX = N MUST return all zeros. To the extent that the hypervisor prefers a given interface, it should specify that interface earlier in the list. For example, KVM might place its KVMKVMKVM signature first in the list to indicate that it should be used by guests in preference to other supported interfaces. Other hypervisors would likely use a different order. The exact semantics of the ordering of the list is beyond the scope of this specification. How do you evaluate N? It would make more sense for CPUID.4F01[ECX=0] to return N in one register, and perhaps prefered interface index in another. The signatures can then be obtained from CPUID.4F01[ECX={1 to N}]. That way, a consumer can be confident that they have found all the signatures, without relying on an unbounded loop and checking for zeroes Yes. Specifically, it should return it in EAX. That is the preferred interface and we are trying to push for that going forward. I'm okay with that. I'm inclined to leave EBX, ECX, and EDX reserved for now, though. Barring an actual use case in which the order of the list isn't sufficient to determine preference, I don't see the need to give another preference indication. I updated the copy of github to make this change and to use an explicit feature bit for the RNG. --Andy -hpa -- Andy Lutomirski AMA Capital Management, LLC ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
RE: [Xen-devel] [RFC] Hypervisor RNG and enumeration
I have no objection to specifying that these reads may be quite slow. Guests should only use them at boot and if they have some reason to distrust their RNG pool. The latter can legitimately happen after various types of suspend or after migration (detected by VM Generation ID, for example). Just as a point of clarification, the VM Generation ID changes (at least in the Hyper-V implementation) only when the VM may have observed a different future, as when a VM backup is restored, a checkpoint is applied, etc. It does not change during migration, when the VM is suspended or when it is rebooted. I've heard anecdotes from application vendors saying that there is some other hypervisor that actually does change the ID at these moments and they wanted us to us to fix that, until I explained that I only control Hyper-V. -- Jake Oshins ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On Wed, Oct 29, 2014 at 9:29 AM, Jake Oshins ja...@microsoft.com wrote: I have no objection to specifying that these reads may be quite slow. Guests should only use them at boot and if they have some reason to distrust their RNG pool. The latter can legitimately happen after various types of suspend or after migration (detected by VM Generation ID, for example). Just as a point of clarification, the VM Generation ID changes (at least in the Hyper-V implementation) only when the VM may have observed a different future, as when a VM backup is restored, a checkpoint is applied, etc. It does not change during migration, when the VM is suspended or when it is rebooted. I've heard anecdotes from application vendors saying that there is some other hypervisor that actually does change the ID at these moments and they wanted us to us to fix that, until I explained that I only control Hyper-V. Fair enough. If the VM may indeed have observed a different future, then I would argue that reseeding the RNG is very important -- more so than after a normal migration. If the VM trusts that its other history hasn't been compromised, then merely mixing in a unique value would get most of the benefit. --Andy -- Jake Oshins -- Andy Lutomirski AMA Capital Management, LLC ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)
On Wed, Oct 29, 2014 at 9:10 AM, Josh Triplett j...@joshtriplett.org wrote: On the vast majority of modern systems, no processes will use the userspsace I/O syscalls, iopl and ioperm. Add a new config option, CONFIG_X86_IOPORT, to support configuring them out of the kernel entirely. Most current systems do not run programs using these syscalls, so X86_IOPORT does not depend on EXPERT, though it does still default to y. In addition to saving a significant amount of space, this also reduces the size of several major kernel data structures, drops a bit of code from several task-related hot paths, and reduces the attack surface of the kernel. All of the task-related code dealing with userspace I/O permissions now lives in process-io.h as several new inline functions, which become no-ops without CONFIG_X86_IOPORT. bloat-o-meter results: add/remove: 0/4 grow/shrink: 4/9 up/down: 83/-9074 (-8991) function old new delta drop_fpu 80 160 +80 perf_event_init_task 638 639 +1 perf_event_exec 192 193 +1 perf_event_aux 132 133 +1 test_tsk_thread_flag 10 - -10 ioperm_active 14 3 -11 init_task916 904 -12 flush_thread 168 149 -19 cpu_init 364 339 -25 __drop_fpu60 - -60 ioperm_get92 6 -86 exit_thread 105 10 -95 sys_iopl 106 --106 copy_thread 364 257-107 __switch_to_xtra 234 123-111 sys_ioperm 240 --240 init_tss8576 384 -8192 Signed-off-by: Josh Triplett j...@joshtriplett.org --- v3: Eliminated several #ifdefs, and in particular almost all #ifdefs in .c files, by adding a macro INIT_SET_IOPL_MASK to use in place of the initializer for set_iopl_mask, and using __maybe_unused rather than wrapping function definitions in #ifdef. Rebased on v3.18-rc1. Recomputed bloat-o-meter. I assume this should go through the x86 tree rather than the tiny tree. arch/x86/Kconfig | 10 + arch/x86/include/asm/paravirt.h | 2 + arch/x86/include/asm/paravirt_types.h | 5 +++ arch/x86/include/asm/processor.h | 51 + arch/x86/include/asm/syscalls.h | 3 ++ arch/x86/kernel/Makefile | 3 +- arch/x86/kernel/cpu/common.c | 12 +- arch/x86/kernel/entry_64.S| 9 +++-- arch/x86/kernel/paravirt.c| 2 +- arch/x86/kernel/process-io.h | 71 +++ arch/x86/kernel/process.c | 34 ++--- arch/x86/kernel/process_32.c | 13 ++- arch/x86/kernel/process_64.c | 2 +- arch/x86/kernel/ptrace.c | 8 arch/x86/xen/enlighten.c | 4 +- drivers/tty/vt/vt_ioctl.c | 2 +- kernel/sys_ni.c | 5 +++ 17 files changed, 169 insertions(+), 67 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f2327e8..a7de2eb 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -984,6 +984,16 @@ config X86_ESPFIX64 def_bool y depends on X86_16BIT X86_64 +config X86_IOPORT + bool iopl and ioperm system calls + default y + ---help--- + This option enables the iopl and ioperm system calls, which allow + privileged userspace processes to directly access I/O ports. This + is used by software that drives hardware directly from userspace + without using a kernel driver. Unless you intend to run such + software, you can safely say N here. + config TOSHIBA tristate Toshiba Laptop support depends on X86_32 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index cd6e161..52d2ec8 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -299,10 +299,12 @@ static inline void write_idt_entry(gate_desc *dt, int entry, const gate_desc *g) { PVOP_VCALL3(pv_cpu_ops.write_idt_entry, dt, entry, g); } +#ifdef CONFIG_X86_IOPORT static inline void set_iopl_mask(unsigned mask) { PVOP_VCALL1(pv_cpu_ops.set_iopl_mask, mask); } +#endif /* CONFIG_X86_IOPORT */ /* The paravirtualized I/O functions */ static inline void slow_down_io(void) diff --git a/arch/x86/include/asm/paravirt_types.h
Re: [PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)
On Wed, Oct 29, 2014 at 09:59:25AM -0700, Kees Cook wrote: On Wed, Oct 29, 2014 at 9:10 AM, Josh Triplett j...@joshtriplett.org wrote: --- a/arch/x86/kernel/process-io.h +++ b/arch/x86/kernel/process-io.h @@ -1,9 +1,17 @@ #ifndef _X86_KERNEL_PROCESS_IO_H #define _X86_KERNEL_PROCESS_IO_H +static inline void clear_thread_io_bitmap(struct task_struct *p) +{ +#ifdef CONFIG_X86_IOPORT + p-thread.io_bitmap_ptr = NULL; +#endif /* CONFIG_X86_IOPORT */ +} Personally, I prefer seeing these kinds of optional functions declared in a single block rather than having the #ifdefs inside the functions: #ifdef CONFIG_X86_IOPORT static inline void clear_thread_io_bitmap(struct task_struct *p) { ... } static inline int copy_io_bitmap(struct task_struct *me, struct task_struct *p) { ... } ...remaining_functions... #else static inline void clear_thread_io_bitmap(struct task_struct *p) { } static inline int copy_io_bitmap(struct task_struct *me, struct task_struct *p) { return 0; } ...remaining functions... #endif /* CONFIG_X86_IOPORT */ But this is entirely a style decision, so I leave it up to the x86 maintainers ... I can certainly do that if the x86 maintainers prefer, but that tends to produce a net increase in lines of code, as well as duplicating all the function prototypes, which to me seems more error-prone. If the stub versions contained any code, rather than just becoming no-ops, I'd definitely do that. Another nit may be that we should call this CONFIG_SYSCALL_IOPL or CONFIG_SYSCALL_IOPERM in keeping with the other CONFIG_SYSCALL_* naming thread? Again, I don't really care strongly beyond really wanting to use this new feature! :) I don't feel strongly about the naming. Ingo? Thanks for working on this! No problem. I look forward to seeing it used, in Chrome OS and elsewhere. :) - Josh Triplett ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)
On Wed, Oct 29, 2014 at 10:20:28AM -0700, H. Peter Anvin wrote: On 10/29/2014 10:17 AM, j...@joshtriplett.org wrote: But this is entirely a style decision, so I leave it up to the x86 maintainers ... I can certainly do that if the x86 maintainers prefer, but that tends to produce a net increase in lines of code, as well as duplicating all the function prototypes, which to me seems more error-prone. If the stub versions contained any code, rather than just becoming no-ops, I'd definitely do that. I concur with this style choice. To clarify: you concur with Kees's suggested change or with the style I used in my patch? Another nit may be that we should call this CONFIG_SYSCALL_IOPL or CONFIG_SYSCALL_IOPERM in keeping with the other CONFIG_SYSCALL_* naming thread? Again, I don't really care strongly beyond really wanting to use this new feature! :) I don't feel strongly about the naming. Ingo? It is sort of a special case here, as this reflects more than one syscall. As well as four VT ioctls. :) - Josh Triplett ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On 10/29/2014 05:29 PM, Jake Oshins wrote: Just as a point of clarification, the VM Generation ID changes (at least in the Hyper-V implementation) only when the VM may have observed a different future, as when a VM backup is restored, a checkpoint is applied, etc. It does not change during migration, when the VM is suspended or when it is rebooted. I've heard anecdotes from application vendors saying that there is some other hypervisor that actually does change the ID at these moments and they wanted us to us to fix that, until I explained that I only control Hyper-V. This is indeed the only reasonable way you can read the vmgenid spec. Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
On 10/27/2014 05:22 PM, Waiman Long wrote: On 10/27/2014 02:04 PM, Peter Zijlstra wrote: On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote: On 10/24/2014 04:54 AM, Peter Zijlstra wrote: On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote: Since enabling paravirt spinlock will disable unlock function inlining, a jump label can be added to the unlock function without adding patch sites all over the kernel. But you don't have to. My patches allowed for the inline to remain, again reducing the overhead of enabling PV spinlocks while running on a real machine. Look at: http://lkml.kernel.org/r/20140615130154.213923...@chello.nl In particular this hunk: Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c === --- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c +++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c @@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, swapgs) DEF_NATIVE(, mov32, mov %edi, %eax); DEF_NATIVE(, mov64, mov %rdi, %rax); +#if defined(CONFIG_PARAVIRT_SPINLOCKS) defined(CONFIG_QUEUE_SPINLOCK) +DEF_NATIVE(pv_lock_ops, queue_unlock, movb $0, (%rdi)); +#endif + unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len) { return paravirt_patch_insns(insnbuf, len, @@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb PATCH_SITE(pv_cpu_ops, clts); PATCH_SITE(pv_mmu_ops, flush_tlb_single); PATCH_SITE(pv_cpu_ops, wbinvd); +#if defined(CONFIG_PARAVIRT_SPINLOCKS) defined(CONFIG_QUEUE_SPINLOCK) + PATCH_SITE(pv_lock_ops, queue_unlock); +#endif patch_site: ret = paravirt_patch_insns(ibuf, len, start, end); That makes sure to overwrite the callee-saved call to the pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi). Therefore you can retain the inlined unlock with hardly (there might be some NOP padding) any overhead at all. On PV it reverts to a callee saved function call. My concern is that spin_unlock() can be called in many places, including loadable kernel modules. Can the paravirt_patch_ident_32() function able to patch all of them in reasonable time? How about a kernel module loaded later at run time? modules should be fine, see arch/x86/kernel/module.c:module_finalize() - apply_paravirt(). Also note that the 'default' text is an indirect call into the paravirt ops table which routes to the 'right' function, so even if the text patching would be 'late' calls would 'work' as expected, just slower. Thanks for letting me know about that. I have this concern because your patch didn't change the current configuration of disabling unlock inlining when paravirt_spinlock is enabled. With that, I think it is worthwhile to reduce the performance delta between the PV and non-PV kernel on bare metal. I am sorry that the unlock call sites patching code doesn't work in a virtual guest. Your pvqspinlock patch did an unconditional patching even in a virtual guest. I added check for the paravirt_spinlocks_enabled, but it turned out that some spin_unlock() seemed to be called before paravirt_spinlocks_enabled is set. As a result, some call sites were still patched resulting in missed wake up's and system hang. At this point, I am going to leave out that change from my patch set until we can figure out a better way of doing that. -Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v13 01/11] qspinlock: A simple generic 4-byte queue spinlock
This patch introduces a new generic queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be almost as fair as the ticket spinlock. It has about the same speed in single-thread and it can be much faster in high contention situations especially when the spinlock is embedded within the data structure to be protected. Only in light to moderate contention where the average queue depth is around 1-3 will this queue spinlock be potentially a bit slower due to the higher slowpath overhead. This queue spinlock is especially suit to NUMA machines with a large number of cores as the chance of spinlock contention is much higher in those machines. The cost of contention is also higher because of slower inter-node memory traffic. Due to the fact that spinlocks are acquired with preemption disabled, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending in one spinlock at any one time. Counting soft IRQ, hard IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting activities. By allocating a set of per-cpu queue nodes and used them to form a waiting queue, we can encode the queue node address into a much smaller 24-bit size (including CPU number and queue node index) leaving one byte for the lock. Please note that the queue node is only needed when waiting for the lock. Once the lock is acquired, the queue node can be released to be used later. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- include/asm-generic/qspinlock.h | 118 +++ include/asm-generic/qspinlock_types.h | 58 + kernel/Kconfig.locks |7 + kernel/locking/Makefile |1 + kernel/locking/mcs_spinlock.h |1 + kernel/locking/qspinlock.c| 207 + 6 files changed, 392 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 include/asm-generic/qspinlock_types.h create mode 100644 kernel/locking/qspinlock.c diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h new file mode 100644 index 000..e8a7ae8 --- /dev/null +++ b/include/asm-generic/qspinlock.h @@ -0,0 +1,118 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long waiman.l...@hp.com + */ +#ifndef __ASM_GENERIC_QSPINLOCK_H +#define __ASM_GENERIC_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +/** + * queue_spin_is_locked - is the spinlock locked? + * @lock: Pointer to queue spinlock structure + * Return: 1 if it is locked, 0 otherwise + */ +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) +{ + return atomic_read(lock-val); +} + +/** + * queue_spin_value_unlocked - is the spinlock structure unlocked? + * @lock: queue spinlock structure + * Return: 1 if it is unlocked, 0 otherwise + * + * N.B. Whenever there are tasks waiting for the lock, it is considered + * locked wrt the lockref code to avoid lock stealing by the lockref + * code and change things underneath the lock. This also allows some + * optimizations to be applied without conflict with lockref. + */ +static __always_inline int queue_spin_value_unlocked(struct qspinlock lock) +{ + return !atomic_read(lock.val); +} + +/** + * queue_spin_is_contended - check if the lock is contended + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) +{ + return atomic_read(lock-val) ~_Q_LOCKED_MASK; +} +/** + * queue_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock(struct qspinlock *lock) +{ + if (!atomic_read(lock-val) + (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) == 0)) + return 1; + return 0; +} + +extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); + +/** + * queue_spin_lock - acquire a queue spinlock + * @lock: Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_lock(struct qspinlock *lock) +{ + u32 val; + +
[PATCH v13 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock
This patch makes the necessary changes at the x86 architecture specific layer to enable the use of queue spinlock for x86-64. As x86-32 machines are typically not multi-socket. The benefit of queue spinlock may not be apparent. So queue spinlock is not enabled. Currently, there is some incompatibilities between the para-virtualized spinlock code (which hard-codes the use of ticket spinlock) and the queue spinlock. Therefore, the use of queue spinlock is disabled when the para-virtualized spinlock is enabled. The arch/x86/include/asm/qspinlock.h header file includes some x86 specific optimization which will make the queue spinlock code perform better than the generic implementation. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- arch/x86/Kconfig |1 + arch/x86/include/asm/qspinlock.h | 25 + arch/x86/include/asm/spinlock.h |5 + arch/x86/include/asm/spinlock_types.h |4 4 files changed, 35 insertions(+), 0 deletions(-) create mode 100644 arch/x86/include/asm/qspinlock.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f2327e8..9ee8324 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -122,6 +122,7 @@ config X86 select MODULES_USE_ELF_RELA if X86_64 select CLONE_BACKWARDS if X86_32 select ARCH_USE_BUILTIN_BSWAP + select ARCH_USE_QUEUE_SPINLOCK select ARCH_USE_QUEUE_RWLOCK select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION select OLD_SIGACTION if X86_32 diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h new file mode 100644 index 000..a6a8762 --- /dev/null +++ b/arch/x86/include/asm/qspinlock.h @@ -0,0 +1,25 @@ +#ifndef _ASM_X86_QSPINLOCK_H +#define _ASM_X86_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +#ifndef CONFIG_X86_PPRO_FENCE + +#definequeue_spin_unlock queue_spin_unlock +/** + * queue_spin_unlock - release a queue spinlock + * @lock : Pointer to queue spinlock structure + * + * An effective smp_store_release() on the least-significant byte. + */ +static inline void queue_spin_unlock(struct qspinlock *lock) +{ + barrier(); + ACCESS_ONCE(*(u8 *)lock) = 0; +} + +#endif /* !CONFIG_X86_PPRO_FENCE */ + +#include asm-generic/qspinlock.h + +#endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 9295016..5899483 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -42,6 +42,10 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm/qspinlock.h +#else + #ifdef CONFIG_PARAVIRT_SPINLOCKS static inline void __ticket_enter_slowpath(arch_spinlock_t *lock) @@ -180,6 +184,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock, { arch_spin_lock(lock); } +#endif /* CONFIG_QUEUE_SPINLOCK */ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) { diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 5f9d757..5d654a1 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -23,6 +23,9 @@ typedef u32 __ticketpair_t; #define TICKET_SHIFT (sizeof(__ticket_t) * 8) +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm-generic/qspinlock_types.h +#else typedef struct arch_spinlock { union { __ticketpair_t head_tail; @@ -33,6 +36,7 @@ typedef struct arch_spinlock { } arch_spinlock_t; #define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } +#endif /* CONFIG_QUEUE_SPINLOCK */ #include asm-generic/qrwlock_types.h -- 1.7.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v13 00/11] qspinlock: a 4-byte queue spinlock with PV support
v12-v13: - Change patch 9 to generate separate versions of the queue_spin_lock_slowpath functions for bare metal and PV guest. This reduces the performance impact of the PV code on bare metal systems. v11-v12: - Based on PeterZ's version of the qspinlock patch (https://lkml.org/lkml/2014/6/15/63). - Incorporated many of the review comments from Konrad Wilk and Paolo Bonzini. - The pvqspinlock code is largely from my previous version with PeterZ's way of going from queue tail to head and his idea of using callee saved calls to KVM and XEN codes. v10-v11: - Use a simple test-and-set unfair lock to simplify the code, but performance may suffer a bit for large guest with many CPUs. - Take out Raghavendra KT's test results as the unfair lock changes may render some of his results invalid. - Add PV support without increasing the size of the core queue node structure. - Other minor changes to address some of the feedback comments. v9-v10: - Make some minor changes to qspinlock.c to accommodate review feedback. - Change author to PeterZ for 2 of the patches. - Include Raghavendra KT's test results in patch 18. v8-v9: - Integrate PeterZ's version of the queue spinlock patch with some modification: http://lkml.kernel.org/r/20140310154236.038181...@infradead.org - Break the more complex patches into smaller ones to ease review effort. - Fix a racing condition in the PV qspinlock code. v7-v8: - Remove one unneeded atomic operation from the slowpath, thus improving performance. - Simplify some of the codes and add more comments. - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable unfair lock. - Reduce unfair lock slowpath lock stealing frequency depending on its distance from the queue head. - Add performance data for IvyBridge-EX CPU. v6-v7: - Remove an atomic operation from the 2-task contending code - Shorten the names of some macros - Make the queue waiter to attempt to steal lock when unfair lock is enabled. - Remove lock holder kick from the PV code and fix a race condition - Run the unfair lock PV code on overcommitted KVM guests to collect performance data. v5-v6: - Change the optimized 2-task contending code to make it fairer at the expense of a bit of performance. - Add a patch to support unfair queue spinlock for Xen. - Modify the PV qspinlock code to follow what was done in the PV ticketlock. - Add performance data for the unfair lock as well as the PV support code. v4-v5: - Move the optimized 2-task contending code to the generic file to enable more architectures to use it without code duplication. - Address some of the style-related comments by PeterZ. - Allow the use of unfair queue spinlock in a real para-virtualized execution environment. - Add para-virtualization support to the qspinlock code by ensuring that the lock holder and queue head stay alive as much as possible. v3-v4: - Remove debugging code and fix a configuration error - Simplify the qspinlock structure and streamline the code to make it perform a bit better - Add an x86 version of asm/qspinlock.h for holding x86 specific optimization. - Add an optimized x86 code path for 2 contending tasks to improve low contention performance. v2-v3: - Simplify the code by using numerous mode only without an unfair option. - Use the latest smp_load_acquire()/smp_store_release() barriers. - Move the queue spinlock code to kernel/locking. - Make the use of queue spinlock the default for x86-64 without user configuration. - Additional performance tuning. v1-v2: - Add some more comments to document what the code does. - Add a numerous CPU mode to support = 16K CPUs - Add a configuration option to allow lock stealing which can further improve performance in many cases. - Enable wakeup of queue head CPU at unlock time for non-numerous CPU mode. This patch set has 3 different sections: 1) Patches 1-6: Introduces a queue-based spinlock implementation that can replace the default ticket spinlock without increasing the size of the spinlock data structure. As a result, critical kernel data structures that embed spinlock won't increase in size and break data alignments. 2) Patch 7: Enables the use of unfair lock in a virtual guest. This can resolve some of the locking related performance issues due to the fact that the next CPU to get the lock may have been scheduled out for a period of time. 3) Patches 8-11: Enable qspinlock para-virtualization support by halting the waiting CPUs after spinning for a certain amount of time. The unlock code will detect the a sleeping waiter and wake it up. This is essentially the same logic as the PV ticketlock code. The queue spinlock has slightly better performance than the ticket spinlock in uncontended case. Its performance can be much better with moderate to heavy contention. This patch has the
[PATCH v13 03/11] qspinlock: Add pending bit
From: Peter Zijlstra pet...@infradead.org Because the qspinlock needs to touch a second cacheline (the per-cpu mcs_nodes[]); add a pending bit and allow a single in-word spinner before we punt to the second cacheline. It is possible so observe the pending bit without the locked bit when the last owner has just released but the pending owner has not yet taken ownership. In this case we would normally queue -- because the pending bit is already taken. However, in this case the pending bit is guaranteed to be released 'soon', therefore wait for it and avoid queueing. Signed-off-by: Peter Zijlstra pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 12 +++- kernel/locking/qspinlock.c| 119 +++-- 2 files changed, 107 insertions(+), 24 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index 67a2110..4196694 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -36,8 +36,9 @@ typedef struct qspinlock { * Bitfields in the atomic value: * * 0- 7: locked byte - * 8- 9: tail index - * 10-31: tail cpu (+1) + * 8: pending + * 9-10: tail index + * 11-31: tail cpu (+1) */ #define_Q_SET_MASK(type) (((1U _Q_ ## type ## _BITS) - 1)\ _Q_ ## type ## _OFFSET) @@ -45,7 +46,11 @@ typedef struct qspinlock { #define _Q_LOCKED_BITS 8 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) -#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_BITS1 +#define _Q_PENDING_MASK_Q_SET_MASK(PENDING) + +#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) #define _Q_TAIL_IDX_BITS 2 #define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX) @@ -54,5 +59,6 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) +#define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index c114076..226b11d 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -92,24 +92,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) return per_cpu_ptr(mcs_nodes[idx], cpu); } +#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word * - * (queue tail, lock value) - * - * fast :slow : unlock - *: : - * uncontended (0,0) --:-- (0,1) :-- (*,0) - *: | ^./ : - *: v \ | : - * uncontended:(n,x) --+-- (n,0) | : - * queue: | ^--' | : - *: v | : - * contended :(*,x) --+-- (*,0) - (*,1) ---' : - * queue: ^--' : + * (queue tail, pending bit, lock value) * + * fast :slow :unlock + * : : + * uncontended (0,0,0) -:-- (0,0,1) --:-- (*,*,0) + * : | ^.--. / : + * : v \ \| : + * pending :(0,1,1) +-- (0,1,0) \ | : + * : | ^--' | | : + * : v | | : + * uncontended :(n,x,y) +-- (n,0,0) --' | : + * queue : | ^--' | : + * : v | : + * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -' : + * queue : ^--' : */ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) { @@ -119,6 +123,75 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + /* +* wait for in-progress pending-locked hand-overs +* +* 0,1,0 - 0,0,1 +*/ + if (val == _Q_PENDING_VAL) { + while ((val = atomic_read(lock-val)) == _Q_PENDING_VAL) +
[PATCH v13 04/11] qspinlock: Extract out code snippets for the next patch
This is a preparatory patch that extracts out the following 2 code snippets to prepare for the next performance optimization patch. 1) the logic for the exchange of new and previous tail code words into a new xchg_tail() function. 2) the logic for clearing the pending bit and setting the locked bit into a new clear_pending_set_locked() function. This patch also simplifies the trylock operation before queuing by calling queue_spin_trylock() directly. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- include/asm-generic/qspinlock_types.h |2 + kernel/locking/qspinlock.c| 91 +--- 2 files changed, 62 insertions(+), 31 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index 4196694..88d647c 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -58,6 +58,8 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) + #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) #define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 226b11d..48bd2ad 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -95,6 +95,54 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) /** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + u32 new, old; + + for (;;) { + new = (val ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; + + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } +} + +/** + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + u32 old, new, val = atomic_read(lock-val); + + for (;;) { + new = (val _Q_LOCKED_PENDING_MASK) | tail; + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } + return old; +} + +/** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word @@ -176,15 +224,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) * * *,1,0 - *,0,1 */ - for (;;) { - new = (val ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + clear_pending_set_locked(lock, val); return; /* @@ -201,37 +241,26 @@ queue: node-next = NULL; /* -* We have already touched the queueing cacheline; don't bother with -* pending stuff. -* -* trylock || xchg(lock, node) -* -* 0,0,0 - 0,0,1 ; no tail, not locked - no tail, locked. -* p,y,x - n,y,x ; tail was p - tail is n; preserving locked. +* We touched a (possibly) cold cacheline in the per-cpu queue node; +* attempt the trylock once more in the hope someone let go while we +* weren't watching. */ - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + if (queue_spin_trylock(lock)) + goto release; /* -* we won the trylock; forget about queueing. +* We have already touched the queueing cacheline; don't bother with +* pending stuff. +* +* p,*,* - n,*,* */ - if (new == _Q_LOCKED_VAL) - goto release; + old = xchg_tail(lock, tail); /* * if there was a previous node; link it and wait until reaching the * head of the waitqueue. */ - if (old ~_Q_LOCKED_PENDING_MASK) { + if (old _Q_TAIL_MASK) { prev =
[PATCH v13 06/11] qspinlock: Use a simple write to grab the lock
Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head don't need to reset the tail code. For that case, a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The current pending bit waiting code will ensure that the bit will not be set as soon as the tail code in the lock is set. With that change, the are some slight improvement in the performance of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket Westere-EX machine as shown in the tables below. [Standalone/Embedded - same node] # of tasksBefore patchAfter patch %Change ----- -- --- 3 2324/2321 2248/2265-3%/-2% 4 2890/2896 2819/2831-2%/-2% 5 3611/3595 3522/3512-2%/-2% 6 4281/4276 4173/4160-3%/-3% 7 5018/5001 4875/4861-3%/-3% 8 5759/5750 5563/5568-3%/-3% [Standalone/Embedded - different nodes] # of tasksBefore patchAfter patch %Change ----- -- --- 312242/12237 12087/12093 -1%/-1% 410688/10696 10507/10521 -2%/-2% It was also found that this change produced a much bigger performance improvement in the newer IvyBridge-EX chip and was essentially to close the performance gap between the ticket spinlock and queue spinlock. The disk workload of the AIM7 benchmark was run on a 4-socket Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users on a 3.14 based kernel. The results of the test runs were: AIM7 XFS Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock56782333.17 96.61 5.81 qspinlock 57507993.13 94.83 5.97 AIM7 EXT4 Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock1114551 16.15 509.72 7.11 qspinlock 21844668.24 232.99 6.01 The ext4 filesystem run had a much higher spinlock contention than the xfs filesystem run. The ebizzy -m test was also run with the following results: kernel records/s Real Time Sys TimeUsr Time -- - ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- kernel/locking/qspinlock.c | 59 1 files changed, 43 insertions(+), 16 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 7c127b4..fb0e988 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -103,24 +103,33 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) * By using the whole 2nd least significant byte for the pending bit, we * can allow better optimization of the lock acquisition for the pending * bit holder. + * + * This internal structure is also used by the set_locked function which + * is not restricted to _Q_PENDING_BITS == 8. */ -#if _Q_PENDING_BITS == 8 - struct __qspinlock { union { atomic_t val; - struct { #ifdef __LITTLE_ENDIAN + u8 locked; + struct { u16 locked_pending; u16 tail; + }; #else + struct { u16 tail; u16 locked_pending; -#endif }; + struct { + u8 reserved[3]; + u8 locked; + }; +#endif }; }; +#if _Q_PENDING_BITS == 8 /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -207,6 +216,19 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) #endif /* _Q_PENDING_BITS == 8 */ /** + * set_locked - Set the lock bit and own the lock + * @lock: Pointer to queue spinlock structure + * + * *,*,0 - *,0,1 + */ +static __always_inline void set_locked(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + ACCESS_ONCE(l-locked) = _Q_LOCKED_VAL; +} + +/** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue
[PATCH v13 07/11] qspinlock: Revert to test-and-set on hypervisors
From: Peter Zijlstra pet...@infradead.org When we detect a hypervisor (!paravirt, see qspinlock paravirt support patches), revert to a simple test-and-set lock to avoid the horrors of queue preemption. Signed-off-by: Peter Zijlstra pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/qspinlock.h | 14 ++ include/asm-generic/qspinlock.h |7 +++ kernel/locking/qspinlock.c |3 +++ 3 files changed, 24 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index a6a8762..05a77fe 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -1,6 +1,7 @@ #ifndef _ASM_X86_QSPINLOCK_H #define _ASM_X86_QSPINLOCK_H +#include asm/cpufeature.h #include asm-generic/qspinlock_types.h #ifndef CONFIG_X86_PPRO_FENCE @@ -20,6 +21,19 @@ static inline void queue_spin_unlock(struct qspinlock *lock) #endif /* !CONFIG_X86_PPRO_FENCE */ +#define virt_queue_spin_lock virt_queue_spin_lock + +static inline bool virt_queue_spin_lock(struct qspinlock *lock) +{ + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) + return false; + + while (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) != 0) + cpu_relax(); + + return true; +} + #include asm-generic/qspinlock.h #endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h index e8a7ae8..a53a7bb 100644 --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -98,6 +98,13 @@ static __always_inline void queue_spin_unlock(struct qspinlock *lock) } #endif +#ifndef virt_queue_spin_lock +static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock) +{ + return false; +} +#endif + /* * Initializier */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index fb0e988..1c1926a 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -257,6 +257,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + if (virt_queue_spin_lock(lock)) + return; + /* * wait for in-progress pending-locked hand-overs * -- 1.7.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v13 05/11] qspinlock: Optimize for smaller NR_CPUS
From: Peter Zijlstra pet...@infradead.org When we allow for a max NR_CPUS 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated compxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. This optimization is needed to make the qspinlock achieve performance parity with ticket spinlock at light load. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Signed-off-by: Peter Zijlstra pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 13 ++ kernel/locking/qspinlock.c| 71 - 2 files changed, 83 insertions(+), 1 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index 88d647c..01b46df 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -35,6 +35,14 @@ typedef struct qspinlock { /* * Bitfields in the atomic value: * + * When NR_CPUS 16K + * 0- 7: locked byte + * 8: pending + * 9-15: not used + * 16-17: tail index + * 18-31: tail cpu (+1) + * + * When NR_CPUS = 16K * 0- 7: locked byte * 8: pending * 9-10: tail index @@ -47,7 +55,11 @@ typedef struct qspinlock { #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) #define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#if CONFIG_NR_CPUS (1U 14) +#define _Q_PENDING_BITS8 +#else #define _Q_PENDING_BITS1 +#endif #define _Q_PENDING_MASK_Q_SET_MASK(PENDING) #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) @@ -58,6 +70,7 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 48bd2ad..7c127b4 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -22,6 +22,7 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/mutex.h +#include asm/byteorder.h #include asm/qspinlock.h /* @@ -54,6 +55,10 @@ * node; whereby avoiding the need to carry a node from lock to unlock, and * preserving existing lock API. This also makes the unlock code simpler and * faster. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. + * */ #include mcs_spinlock.h @@ -94,6 +99,64 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; + u16 locked_pending; +#endif + }; + }; +}; + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + * + * Lock stealing is not allowed if this function is used. + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + struct __qspinlock *l = (void *)lock; + + ACCESS_ONCE(l-locked_pending) = _Q_LOCKED_VAL; +} + +/* + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + struct __qspinlock *l = (void *)lock; + + return (u32)xchg(l-tail, tail _Q_TAIL_OFFSET) _Q_TAIL_OFFSET; +} + +#else /* _Q_PENDING_BITS == 8 */ + /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -141,6 +204,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) }
[PATCH v13 08/11] qspinlock, x86: Rename paravirt_ticketlocks_enabled
This patch renames the paravirt_ticketlocks_enabled static key to a more generic paravirt_spinlocks_enabled name. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- arch/x86/include/asm/spinlock.h |4 ++-- arch/x86/kernel/kvm.c|2 +- arch/x86/kernel/paravirt-spinlocks.c |4 ++-- arch/x86/xen/spinlock.c |2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 5899483..928751e 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -39,7 +39,7 @@ /* How long a lock should spin before we consider blocking */ #define SPIN_THRESHOLD (1 15) -extern struct static_key paravirt_ticketlocks_enabled; +extern struct static_key paravirt_spinlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); #ifdef CONFIG_QUEUE_SPINLOCK @@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock, static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { if (TICKET_SLOWPATH_FLAG - static_key_false(paravirt_ticketlocks_enabled)) { + static_key_false(paravirt_spinlocks_enabled)) { arch_spinlock_t prev; prev = *lock; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index f6945be..eaa064c 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -827,7 +827,7 @@ static __init int kvm_spinlock_init_jump(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return 0; - static_key_slow_inc(paravirt_ticketlocks_enabled); + static_key_slow_inc(paravirt_spinlocks_enabled); printk(KERN_INFO KVM setup paravirtual spinlock\n); return 0; diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index bbb6c73..e434f24 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -16,5 +16,5 @@ struct pv_lock_ops pv_lock_ops = { }; EXPORT_SYMBOL(pv_lock_ops); -struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE; -EXPORT_SYMBOL(paravirt_ticketlocks_enabled); +struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE; +EXPORT_SYMBOL(paravirt_spinlocks_enabled); diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 23b45eb..d332ae0 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jump(void) if (!xen_domain()) return 0; - static_key_slow_inc(paravirt_ticketlocks_enabled); + static_key_slow_inc(paravirt_spinlocks_enabled); return 0; } early_initcall(xen_init_spinlocks_jump); -- 1.7.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
On 10/29/2014 03:05 PM, Waiman Long wrote: On 10/27/2014 05:22 PM, Waiman Long wrote: On 10/27/2014 02:04 PM, Peter Zijlstra wrote: On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote: On 10/24/2014 04:54 AM, Peter Zijlstra wrote: On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote: Since enabling paravirt spinlock will disable unlock function inlining, a jump label can be added to the unlock function without adding patch sites all over the kernel. But you don't have to. My patches allowed for the inline to remain, again reducing the overhead of enabling PV spinlocks while running on a real machine. Look at: http://lkml.kernel.org/r/20140615130154.213923...@chello.nl In particular this hunk: Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c === --- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c +++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c @@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, swapgs) DEF_NATIVE(, mov32, mov %edi, %eax); DEF_NATIVE(, mov64, mov %rdi, %rax); +#if defined(CONFIG_PARAVIRT_SPINLOCKS) defined(CONFIG_QUEUE_SPINLOCK) +DEF_NATIVE(pv_lock_ops, queue_unlock, movb $0, (%rdi)); +#endif + unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len) { return paravirt_patch_insns(insnbuf, len, @@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb PATCH_SITE(pv_cpu_ops, clts); PATCH_SITE(pv_mmu_ops, flush_tlb_single); PATCH_SITE(pv_cpu_ops, wbinvd); +#if defined(CONFIG_PARAVIRT_SPINLOCKS) defined(CONFIG_QUEUE_SPINLOCK) + PATCH_SITE(pv_lock_ops, queue_unlock); +#endif patch_site: ret = paravirt_patch_insns(ibuf, len, start, end); That makes sure to overwrite the callee-saved call to the pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi). Therefore you can retain the inlined unlock with hardly (there might be some NOP padding) any overhead at all. On PV it reverts to a callee saved function call. My concern is that spin_unlock() can be called in many places, including loadable kernel modules. Can the paravirt_patch_ident_32() function able to patch all of them in reasonable time? How about a kernel module loaded later at run time? modules should be fine, see arch/x86/kernel/module.c:module_finalize() - apply_paravirt(). Also note that the 'default' text is an indirect call into the paravirt ops table which routes to the 'right' function, so even if the text patching would be 'late' calls would 'work' as expected, just slower. Thanks for letting me know about that. I have this concern because your patch didn't change the current configuration of disabling unlock inlining when paravirt_spinlock is enabled. With that, I think it is worthwhile to reduce the performance delta between the PV and non-PV kernel on bare metal. I am sorry that the unlock call sites patching code doesn't work in a virtual guest. Your pvqspinlock patch did an unconditional patching even in a virtual guest. I added check for the paravirt_spinlocks_enabled, but it turned out that some spin_unlock() seemed to be called before paravirt_spinlocks_enabled is set. As a result, some call sites were still patched resulting in missed wake up's and system hang. At this point, I am going to leave out that change from my patch set until we can figure out a better way of doing that. Below was a partial kernel log with the unlock call site patch code in a KVM guest: [0.438006] native_patch: patch out pv_queue_unlock! [0.438565] native_patch: patch out pv_queue_unlock! [0.439006] native_patch: patch out pv_queue_unlock! [0.439638] native_patch: patch out pv_queue_unlock! [0.440052] native_patch: patch out pv_queue_unlock! [0.441006] native_patch: patch out pv_queue_unlock! [0.441566] native_patch: patch out pv_queue_unlock! [0.442035] ftrace: allocating 24168 entries in 95 pages [0.451208] Switched APIC routing to physical flat. [0.453202] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1 [0.454002] smpboot: CPU0: Intel QEMU Virtual CPU version 1.5.3 (fam: 06, model: 06, stepping: 03) [0.456000] Performance Events: Broken PMU hardware detected, using software events only. [0.456003] Failed to access perfctr msr (MSR c1 is 0) [0.457151] KVM setup paravirtual spinlock [0.460039] NMI watchdog: disabled (cpu0): hardware events not enabled It could be seen that some unlock call sites were patched before the KVM setup code set the paravirt_spinlocks_enabled flag. -Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization