Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
On Fri, Feb 3, 2017 at 2:28 PM, Russell King - ARM Linuxwrote: > On Fri, Feb 03, 2017 at 01:08:40PM -0800, Kees Cook wrote: >> On Fri, Feb 3, 2017 at 12:29 PM, Russell King - ARM Linux >> wrote: >> > On Fri, Feb 03, 2017 at 11:45:56AM -0800, Kees Cook wrote: >> >> On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbott wrote: >> >> > diff --git a/arch/Kconfig b/arch/Kconfig >> >> > index 99839c2..22ee01e 100644 >> >> > --- a/arch/Kconfig >> >> > +++ b/arch/Kconfig >> >> > @@ -781,4 +781,32 @@ config VMAP_STACK >> >> > the stack to map directly to the KASAN shadow map using a >> >> > formula >> >> > that is incorrect if the stack is in vmalloc space. >> >> > >> >> > +config ARCH_NO_STRICT_RWX_DEFAULTS >> >> > + def_bool n >> >> > + >> >> > +config ARCH_HAS_STRICT_KERNEL_RWX >> >> > + def_bool n >> >> > + >> >> > +config DEBUG_RODATA >> >> > + def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS >> >> > + prompt "Make kernel text and rodata read-only" if >> >> > ARCH_NO_STRICT_RWX_DEFAULTS >> >> >> >> Ah! Yes, perfect. I totally forgot about using conditional "prompt" >> >> lines. Nice! >> > >> > It's no different from the more usual: >> > >> > bool "Make kernel text and rodata read-only" if >> > ARCH_NO_STRICT_RWX_DEFAULTS >> > default y if !ARCH_NO_STRICT_RWX_DEFAULTS >> > depends on ARCH_HAS_STRICT_KERNEL_RWX >> > >> > But... I really don't like this - way too many negations and negatives >> > which make it difficult to figure out what's going on here. >> > >> > The situation we have today is: >> > >> > -config DEBUG_RODATA >> > - bool "Make kernel text and rodata read-only" >> > - depends on MMU && !XIP_KERNEL >> > - default y if CPU_V7 >> > >> > which is "allow the user to select DEBUG_RODATA if building a MMU non-XIP >> > kernel", suggesting that the user turns it on for ARMv7 CPUs. 
>> > >> > That changes with this and the above: >> > >> > + select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL >> > + select ARCH_HAS_STRICT_MODULE_RWX if MMU >> > + select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7 >> > >> > This means that ARCH_HAS_STRICT_KERNEL_RWX is set for a MMU non-XIP >> > kernel, which carries the same pre-condition for DEBUG_RODATA - no >> > problem there. >> > >> > However, ARCH_NO_STRICT_RWX_DEFAULTS is set for non-ARMv7 CPUs, which >> > means the "Make kernel text and rodata read-only" prompt _is_ provided >> > for those. However, for all ARMv7 systems, we go from "suggesting that >> > the user enables the option" to "you don't have a choice, you get this >> > whether you want it or not." >> > >> > I'd prefer to keep it off for my development systems, where I don't >> > care about kernel security. If we don't wish to do that as a general >> > rule, can we make it dependent on EMBEDDED? >> > >> > Given that on ARM it can add up to 4MB to the kernel image - there >> > _will_ be about 1MB before the .text section, the padding on between >> > __modver and __ex_table which for me is around 626k, the padding >> > between .notes and the init sections start with .vectors (the space >> > between __ex_table and end of .notes is only 4124, which gets padded >> > up to 1MB) and lastly the padding between the .init section and the >> > data section (for me around 593k). This all adds up to an increase >> > in kernel image size of 3.2MB on 14.2MB - an increase of 22%. >> > >> > So no, I'm really not happy with that. >> >> Ah yeah, good point. We have three cases: unsupported, mandatory, >> optional, but we have the case of setting the default for the optional >> case. Maybe something like this? 
>> >> config STRICT_KERNEL_RWX >> bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX >> depends on ARCH_HAS_STRICT_KERNEL_RWX >> default ARCH_OPTIONAL_KERNEL_RWX_DEFAULT >> >> unsupported: >> !ARCH_HAS_STRICT_KERNEL_RWX >> >> mandatory: >> ARCH_HAS_STRICT_KERNEL_RWX >> !ARCH_OPTIONAL_KERNEL_RWX >> >> optional: >> ARCH_HAS_STRICT_KERNEL_RWX >> ARCH_OPTIONAL_KERNEL_RWX >> with default controlled by ARCH_OPTIONAL_KERNEL_RWX_DEFAULT >> >> Then arm is: >> select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL >> select ARCH_HAS_STRICT_MODULE_RWX if MMU >> select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX >> select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT if CPU_V7 >> >> x86 and arm64 are: >> select ARCH_HAS_STRICT_KERNEL_RWX >> select ARCH_HAS_STRICT_MODULE_RWX > > Looks to me like it will do the job. > > In passing, I noticed that, on ARM: > > 3 .rodata 002212b4 c0703000 c0703000 00703000 2**6 > CONTENTS, ALLOC, LOAD, DATA > > a lack of READONLY there - which suggests something in this lot isn't > actually read-only data: > > .rodata : AT(ADDR(.rodata) - LOAD_OFFSET) { \ > VMLINUX_SYMBOL(__start_rodata) = .; \ >
Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
On Fri, Feb 03, 2017 at 01:08:40PM -0800, Kees Cook wrote: > On Fri, Feb 3, 2017 at 12:29 PM, Russell King - ARM Linux >wrote: > > On Fri, Feb 03, 2017 at 11:45:56AM -0800, Kees Cook wrote: > >> On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbott wrote: > >> > diff --git a/arch/Kconfig b/arch/Kconfig > >> > index 99839c2..22ee01e 100644 > >> > --- a/arch/Kconfig > >> > +++ b/arch/Kconfig > >> > @@ -781,4 +781,32 @@ config VMAP_STACK > >> > the stack to map directly to the KASAN shadow map using a > >> > formula > >> > that is incorrect if the stack is in vmalloc space. > >> > > >> > +config ARCH_NO_STRICT_RWX_DEFAULTS > >> > + def_bool n > >> > + > >> > +config ARCH_HAS_STRICT_KERNEL_RWX > >> > + def_bool n > >> > + > >> > +config DEBUG_RODATA > >> > + def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS > >> > + prompt "Make kernel text and rodata read-only" if > >> > ARCH_NO_STRICT_RWX_DEFAULTS > >> > >> Ah! Yes, perfect. I totally forgot about using conditional "prompt" > >> lines. Nice! > > > > It's no different from the more usual: > > > > bool "Make kernel text and rodata read-only" if > > ARCH_NO_STRICT_RWX_DEFAULTS > > default y if !ARCH_NO_STRICT_RWX_DEFAULTS > > depends on ARCH_HAS_STRICT_KERNEL_RWX > > > > But... I really don't like this - way too many negations and negatives > > which make it difficult to figure out what's going on here. > > > > The situation we have today is: > > > > -config DEBUG_RODATA > > - bool "Make kernel text and rodata read-only" > > - depends on MMU && !XIP_KERNEL > > - default y if CPU_V7 > > > > which is "allow the user to select DEBUG_RODATA if building a MMU non-XIP > > kernel", suggesting that the user turns it on for ARMv7 CPUs. 
> > > > That changes with this and the above: > > > > + select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL > > + select ARCH_HAS_STRICT_MODULE_RWX if MMU > > + select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7 > > > > This means that ARCH_HAS_STRICT_KERNEL_RWX is set for a MMU non-XIP > > kernel, which carries the same pre-condition for DEBUG_RODATA - no > > problem there. > > > > However, ARCH_NO_STRICT_RWX_DEFAULTS is set for non-ARMv7 CPUs, which > > means the "Make kernel text and rodata read-only" prompt _is_ provided > > for those. However, for all ARMv7 systems, we go from "suggesting that > > the user enables the option" to "you don't have a choice, you get this > > whether you want it or not." > > > > I'd prefer to keep it off for my development systems, where I don't > > care about kernel security. If we don't wish to do that as a general > > rule, can we make it dependent on EMBEDDED? > > > > Given that on ARM it can add up to 4MB to the kernel image - there > > _will_ be about 1MB before the .text section, the padding on between > > __modver and __ex_table which for me is around 626k, the padding > > between .notes and the init sections start with .vectors (the space > > between __ex_table and end of .notes is only 4124, which gets padded > > up to 1MB) and lastly the padding between the .init section and the > > data section (for me around 593k). This all adds up to an increase > > in kernel image size of 3.2MB on 14.2MB - an increase of 22%. > > > > So no, I'm really not happy with that. > > Ah yeah, good point. We have three cases: unsupported, mandatory, > optional, but we have the case of setting the default for the optional > case. Maybe something like this? 
> > config STRICT_KERNEL_RWX > bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX > depends on ARCH_HAS_STRICT_KERNEL_RWX > default ARCH_OPTIONAL_KERNEL_RWX_DEFAULT > > unsupported: > !ARCH_HAS_STRICT_KERNEL_RWX > > mandatory: > ARCH_HAS_STRICT_KERNEL_RWX > !ARCH_OPTIONAL_KERNEL_RWX > > optional: > ARCH_HAS_STRICT_KERNEL_RWX > ARCH_OPTIONAL_KERNEL_RWX > with default controlled by ARCH_OPTIONAL_KERNEL_RWX_DEFAULT > > Then arm is: > select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL > select ARCH_HAS_STRICT_MODULE_RWX if MMU > select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX > select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT if CPU_V7 > > x86 and arm64 are: > select ARCH_HAS_STRICT_KERNEL_RWX > select ARCH_HAS_STRICT_MODULE_RWX Looks to me like it will do the job. In passing, I noticed that, on ARM: 3 .rodata 002212b4 c0703000 c0703000 00703000 2**6 CONTENTS, ALLOC, LOAD, DATA a lack of READONLY there - which suggests something in this lot isn't actually read-only data: .rodata : AT(ADDR(.rodata) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start_rodata) = .; \ *(.rodata) *(.rodata.*) \ RO_AFTER_INIT_DATA /* Read only after init */ \ KEEP(*(__vermagic)) /* Kernel version magic */ \
Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
On Fri, Feb 3, 2017 at 12:29 PM, Russell King - ARM Linuxwrote: > On Fri, Feb 03, 2017 at 11:45:56AM -0800, Kees Cook wrote: >> On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbott wrote: >> > diff --git a/arch/Kconfig b/arch/Kconfig >> > index 99839c2..22ee01e 100644 >> > --- a/arch/Kconfig >> > +++ b/arch/Kconfig >> > @@ -781,4 +781,32 @@ config VMAP_STACK >> > the stack to map directly to the KASAN shadow map using a formula >> > that is incorrect if the stack is in vmalloc space. >> > >> > +config ARCH_NO_STRICT_RWX_DEFAULTS >> > + def_bool n >> > + >> > +config ARCH_HAS_STRICT_KERNEL_RWX >> > + def_bool n >> > + >> > +config DEBUG_RODATA >> > + def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS >> > + prompt "Make kernel text and rodata read-only" if >> > ARCH_NO_STRICT_RWX_DEFAULTS >> >> Ah! Yes, perfect. I totally forgot about using conditional "prompt" >> lines. Nice! > > It's no different from the more usual: > > bool "Make kernel text and rodata read-only" if > ARCH_NO_STRICT_RWX_DEFAULTS > default y if !ARCH_NO_STRICT_RWX_DEFAULTS > depends on ARCH_HAS_STRICT_KERNEL_RWX > > But... I really don't like this - way too many negations and negatives > which make it difficult to figure out what's going on here. > > The situation we have today is: > > -config DEBUG_RODATA > - bool "Make kernel text and rodata read-only" > - depends on MMU && !XIP_KERNEL > - default y if CPU_V7 > > which is "allow the user to select DEBUG_RODATA if building a MMU non-XIP > kernel", suggesting that the user turns it on for ARMv7 CPUs. > > That changes with this and the above: > > + select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL > + select ARCH_HAS_STRICT_MODULE_RWX if MMU > + select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7 > > This means that ARCH_HAS_STRICT_KERNEL_RWX is set for a MMU non-XIP > kernel, which carries the same pre-condition for DEBUG_RODATA - no > problem there. 
> > However, ARCH_NO_STRICT_RWX_DEFAULTS is set for non-ARMv7 CPUs, which > means the "Make kernel text and rodata read-only" prompt _is_ provided > for those. However, for all ARMv7 systems, we go from "suggesting that > the user enables the option" to "you don't have a choice, you get this > whether you want it or not." > > I'd prefer to keep it off for my development systems, where I don't > care about kernel security. If we don't wish to do that as a general > rule, can we make it dependent on EMBEDDED? > > Given that on ARM it can add up to 4MB to the kernel image - there > _will_ be about 1MB before the .text section, the padding on between > __modver and __ex_table which for me is around 626k, the padding > between .notes and the init sections start with .vectors (the space > between __ex_table and end of .notes is only 4124, which gets padded > up to 1MB) and lastly the padding between the .init section and the > data section (for me around 593k). This all adds up to an increase > in kernel image size of 3.2MB on 14.2MB - an increase of 22%. > > So no, I'm really not happy with that. Ah yeah, good point. We have three cases: unsupported, mandatory, optional, but we have the case of setting the default for the optional case. Maybe something like this? config STRICT_KERNEL_RWX bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX depends on ARCH_HAS_STRICT_KERNEL_RWX default ARCH_OPTIONAL_KERNEL_RWX_DEFAULT unsupported: !ARCH_HAS_STRICT_KERNEL_RWX mandatory: ARCH_HAS_STRICT_KERNEL_RWX !ARCH_OPTIONAL_KERNEL_RWX optional: ARCH_HAS_STRICT_KERNEL_RWX ARCH_OPTIONAL_KERNEL_RWX with default controlled by ARCH_OPTIONAL_KERNEL_RWX_DEFAULT Then arm is: select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL select ARCH_HAS_STRICT_MODULE_RWX if MMU select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT if CPU_V7 x86 and arm64 are: select ARCH_HAS_STRICT_KERNEL_RWX select ARCH_HAS_STRICT_MODULE_RWX ? 
-Kees -- Kees Cook Pixel Security -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
On Fri, Feb 03, 2017 at 11:45:56AM -0800, Kees Cook wrote: > On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbottwrote: > > diff --git a/arch/Kconfig b/arch/Kconfig > > index 99839c2..22ee01e 100644 > > --- a/arch/Kconfig > > +++ b/arch/Kconfig > > @@ -781,4 +781,32 @@ config VMAP_STACK > > the stack to map directly to the KASAN shadow map using a formula > > that is incorrect if the stack is in vmalloc space. > > > > +config ARCH_NO_STRICT_RWX_DEFAULTS > > + def_bool n > > + > > +config ARCH_HAS_STRICT_KERNEL_RWX > > + def_bool n > > + > > +config DEBUG_RODATA > > + def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS > > + prompt "Make kernel text and rodata read-only" if > > ARCH_NO_STRICT_RWX_DEFAULTS > > Ah! Yes, perfect. I totally forgot about using conditional "prompt" > lines. Nice! It's no different from the more usual: bool "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS default y if !ARCH_NO_STRICT_RWX_DEFAULTS depends on ARCH_HAS_STRICT_KERNEL_RWX But... I really don't like this - way too many negations and negatives which make it difficult to figure out what's going on here. The situation we have today is: -config DEBUG_RODATA - bool "Make kernel text and rodata read-only" - depends on MMU && !XIP_KERNEL - default y if CPU_V7 which is "allow the user to select DEBUG_RODATA if building a MMU non-XIP kernel", suggesting that the user turns it on for ARMv7 CPUs. That changes with this and the above: + select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL + select ARCH_HAS_STRICT_MODULE_RWX if MMU + select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7 This means that ARCH_HAS_STRICT_KERNEL_RWX is set for a MMU non-XIP kernel, which carries the same pre-condition for DEBUG_RODATA - no problem there. However, ARCH_NO_STRICT_RWX_DEFAULTS is set for non-ARMv7 CPUs, which means the "Make kernel text and rodata read-only" prompt _is_ provided for those. 
However, for all ARMv7 systems, we go from "suggesting that the user enables the option" to "you don't have a choice, you get this whether you want it or not."

I'd prefer to keep it off for my development systems, where I don't care about kernel security. If we don't wish to do that as a general rule, can we make it dependent on EMBEDDED?

Given that on ARM it can add up to 4MB to the kernel image - there _will_ be about 1MB before the .text section, the padding between __modver and __ex_table which for me is around 626k, the padding between .notes and the start of the init sections with .vectors (the space between __ex_table and the end of .notes is only 4124, which gets padded up to 1MB) and lastly the padding between the .init section and the data section (for me around 593k). This all adds up to an increase in kernel image size of 3.2MB on 14.2MB - an increase of 22%.

So no, I'm really not happy with that.

--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
Re: [PATCHv2 2/2] arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX
On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbott wrote:
>
> Both of these options are poorly named. The features they provide are
> necessary for system security and should not be considered debug only.
> Change the name to something that accurately describes what these
> options do.

It may help to explicitly call out the name change from/to in the commit
message.

>
> Signed-off-by: Laura Abbott
> ---
> [...]
> diff --git a/arch/arm/configs/aspeed_g4_defconfig b/arch/arm/configs/aspeed_g4_defconfig
> index ca39c04..beea2cc 100644
> --- a/arch/arm/configs/aspeed_g4_defconfig
> +++ b/arch/arm/configs/aspeed_g4_defconfig
> @@ -25,7 +25,6 @@ CONFIG_MODULE_UNLOAD=y
>  # CONFIG_ARCH_MULTI_V7 is not set
>  CONFIG_ARCH_ASPEED=y
>  CONFIG_MACH_ASPEED_G4=y
> -CONFIG_DEBUG_RODATA=y
>  CONFIG_AEABI=y
>  CONFIG_UACCESS_WITH_MEMCPY=y
>  CONFIG_SECCOMP=y

Are these defconfig cases correct (dropping DEBUG_RODATA without adding
STRICT_KERNEL_RWX)?

Who should carry this series, btw?

-Kees

--
Kees Cook
Pixel Security
Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbott wrote:
> There are multiple architectures that support CONFIG_DEBUG_RODATA and
> CONFIG_SET_MODULE_RONX. These options also now have the ability to be
> turned off at runtime. Move these to an architecture independent
> location and make these options def_bool y for almost all of those
> arches.
>
> Signed-off-by: Laura Abbott
> ---
> v2: This patch is now doing just the refactor of the existing config options.
> ---
>  arch/Kconfig              | 28 
>  arch/arm/Kconfig          |  3 +++
>  arch/arm/Kconfig.debug    | 11 ---
>  arch/arm/mm/Kconfig       | 12 
>  arch/arm64/Kconfig        |  5 ++---
>  arch/arm64/Kconfig.debug  | 11 ---
>  arch/parisc/Kconfig       |  1 +
>  arch/parisc/Kconfig.debug | 11 ---
>  arch/s390/Kconfig         |  5 ++---
>  arch/s390/Kconfig.debug   |  3 ---
>  arch/x86/Kconfig          |  5 ++---
>  arch/x86/Kconfig.debug    | 11 ---
>  12 files changed, 38 insertions(+), 68 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 99839c2..22ee01e 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -781,4 +781,32 @@ config VMAP_STACK
>          the stack to map directly to the KASAN shadow map using a formula
>          that is incorrect if the stack is in vmalloc space.
>
> +config ARCH_NO_STRICT_RWX_DEFAULTS
> +       def_bool n
> +
> +config ARCH_HAS_STRICT_KERNEL_RWX
> +       def_bool n
> +
> +config DEBUG_RODATA
> +       def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
> +       prompt "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS

Ah! Yes, perfect. I totally forgot about using conditional "prompt"
lines. Nice!

Acked-by: Kees Cook

-Kees

--
Kees Cook
Pixel Security
Re: [PATCH-tip v5 17/21] TP-futex: Group readers together in wait queue
On Fri, 03 Feb 2017 13:42:46 -0500, Waiman Long said:
> This patch set does guarantee some minimum performance level, but it
> can't guarantee fairness for all the lock waiters.

OK, sounds like it's a situation that's statistically unlikely, but it
has protections against starvation so the system will eventually dig
itself out of the hole, and the scenario is at least known and
understood. The improved numbers for average-case probably outweigh
the worst-case then...
Re: [PATCH-tip v5 17/21] TP-futex: Group readers together in wait queue
On 02/03/2017 01:23 PM, valdis.kletni...@vt.edu wrote:
> On Fri, 03 Feb 2017 13:03:50 -0500, Waiman Long said:
>
>> On a 2-socket 36-core E5-2699 v3 system (HT off) running on a 4.10
>> kernel:
>>
>>                         WW futex     TP futex     Glibc
>>   ---------------------------------------------------------
>>   Total locking ops     35,707,234   58,645,434   10,930,422
>>   Per-thread avg/sec        99,149      162,887       30,362
>>   Per-thread min/sec        93,190       38,641       29,872
>>   Per-thread max/sec       104,213      225,983       30,708
>
> Do we understand where the 38K number came from? I'm a bit concerned that the
> min-to-max has such a large dispersion compared to all the other numbers. Was
> that a worst-case issue, and is the worst-case something likely to happen in
> production, or requires special effort to trigger?

Because the lock isn't fair, and depending on the placement of the lock,
some CPUs have a higher likelihood of getting the lock than the others.
This is reflected in the different locking rates as reported by the
micro-benchmark. As the microbenchmark is included in this patch set,
you can play around with it if you want.

This patch set does guarantee some minimum performance level, but it
can't guarantee fairness for all the lock waiters.

Regards,
Longman
Re: [PATCHv2 2/2] arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX
On Fri, Feb 03, 2017 at 09:52:22AM -0800, Laura Abbott wrote:
>
> Both of these options are poorly named. The features they provide are
> necessary for system security and should not be considered debug only.
> Change the name to something that accurately describes what these
> options do.
>
> Signed-off-by: Laura Abbott

As with patch 1, this looks good to me. FWIW:

Acked-by: Mark Rutland

> diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
> index 939815e..560a8d8 100644
> --- a/arch/arm64/Kconfig.debug
> +++ b/arch/arm64/Kconfig.debug
> @@ -72,7 +72,7 @@ config DEBUG_WX
>           If in doubt, say "Y".
>
>  config DEBUG_ALIGN_RODATA
> -       depends on DEBUG_RODATA
> +       depends on STRICT_KERNEL_RWX
>         bool "Align linker sections up to SECTION_SIZE"
>         help
>           If this option is enabled, sections that may potentially be marked as

> diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> index 94b62c1..67f9cb9 100644
> --- a/arch/arm64/kernel/insn.c
> +++ b/arch/arm64/kernel/insn.c
> @@ -93,7 +93,7 @@ static void __kprobes *patch_map(void *addr, int fixmap)
>         bool module = !core_kernel_text(uintaddr);
>         struct page *page;
>
> -       if (module && IS_ENABLED(CONFIG_DEBUG_SET_MODULE_RONX))
> +       if (module && IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
>                 page = vmalloc_to_page(addr);
>         else if (!module)
>                 page = pfn_to_page(PHYS_PFN(__pa(addr)));

Since CONFIG_STRICT_MODULE_RWX is mandatory for arm64, we should be able
to simplify this, but that can wait for a later patch.

Thanks,
Mark.
Re: [PATCH-tip v5 17/21] TP-futex: Group readers together in wait queue
On Fri, 03 Feb 2017 13:03:50 -0500, Waiman Long said:
> On a 2-socket 36-core E5-2699 v3 system (HT off) running on a 4.10
> kernel:
>
>                        WW futex     TP futex     Glibc
>  ---------------------------------------------------------
>  Total locking ops     35,707,234   58,645,434   10,930,422
>  Per-thread avg/sec        99,149      162,887       30,362
>  Per-thread min/sec        93,190       38,641       29,872
>  Per-thread max/sec       104,213      225,983       30,708

Do we understand where the 38K number came from? I'm a bit concerned that the
min-to-max has such a large dispersion compared to all the other numbers. Was
that a worst-case issue, and is the worst-case something likely to happen in
production, or requires special effort to trigger?
Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
On Fri, Feb 03, 2017 at 09:52:21AM -0800, Laura Abbott wrote: > There are multiple architectures that support CONFIG_DEBUG_RODATA and > CONFIG_SET_MODULE_RONX. These options also now have the ability to be > turned off at runtime. Move these to an architecture independent > location and make these options def_bool y for almost all of those > arches. > > Signed-off-by: Laura Abbott>From my POV this looks good. FWIW: Acked-by: Mark Rutland Mark. > --- > v2: This patch is now doing just the refactor of the existing config options. > --- > arch/Kconfig | 28 > arch/arm/Kconfig | 3 +++ > arch/arm/Kconfig.debug| 11 --- > arch/arm/mm/Kconfig | 12 > arch/arm64/Kconfig| 5 ++--- > arch/arm64/Kconfig.debug | 11 --- > arch/parisc/Kconfig | 1 + > arch/parisc/Kconfig.debug | 11 --- > arch/s390/Kconfig | 5 ++--- > arch/s390/Kconfig.debug | 3 --- > arch/x86/Kconfig | 5 ++--- > arch/x86/Kconfig.debug| 11 --- > 12 files changed, 38 insertions(+), 68 deletions(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 99839c2..22ee01e 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -781,4 +781,32 @@ config VMAP_STACK > the stack to map directly to the KASAN shadow map using a formula > that is incorrect if the stack is in vmalloc space. > > +config ARCH_NO_STRICT_RWX_DEFAULTS > + def_bool n > + > +config ARCH_HAS_STRICT_KERNEL_RWX > + def_bool n > + > +config DEBUG_RODATA > + def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS > + prompt "Make kernel text and rodata read-only" if > ARCH_NO_STRICT_RWX_DEFAULTS > + depends on ARCH_HAS_STRICT_KERNEL_RWX > + help > + If this is set, kernel text and rodata memory will be made read-only, > + and non-text memory will be made non-executable. This provides > + protection against certain security exploits (e.g. 
executing the heap > + or modifying text) > + > +config ARCH_HAS_STRICT_MODULE_RWX > + def_bool n > + > +config DEBUG_SET_MODULE_RONX > + def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS > + prompt "Set loadable kenrel module data as NX and text as RO" if > ARCH_NO_STRICT_RWX_DEFAULTS > + depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES > + help > + If this is set, module text and rodata memory will be made read-only, > + and non-text memory will be made non-executable. This provides > + protection against certain security exploits (e.g. writing to text) > + > source "kernel/gcov/Kconfig" > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig > index 186c4c2..aa73ca8 100644 > --- a/arch/arm/Kconfig > +++ b/arch/arm/Kconfig > @@ -4,10 +4,13 @@ config ARM > select ARCH_CLOCKSOURCE_DATA > select ARCH_HAS_DEVMEM_IS_ALLOWED > select ARCH_HAS_ELF_RANDOMIZE > + select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL > + select ARCH_HAS_STRICT_MODULE_RWX if MMU > select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST > select ARCH_HAVE_CUSTOM_GPIO_H > select ARCH_HAS_GCOV_PROFILE_ALL > select ARCH_MIGHT_HAVE_PC_PARPORT > + select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7 > select ARCH_SUPPORTS_ATOMIC_RMW > select ARCH_USE_BUILTIN_BSWAP > select ARCH_USE_CMPXCHG_LOCKREF > diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug > index d83f7c3..426d271 100644 > --- a/arch/arm/Kconfig.debug > +++ b/arch/arm/Kconfig.debug > @@ -1738,17 +1738,6 @@ config PID_IN_CONTEXTIDR > additional instructions during context switch. Say Y here only if you > are planning to use hardware trace tools with this kernel. > > -config DEBUG_SET_MODULE_RONX > - bool "Set loadable kernel module data as NX and text as RO" > - depends on MODULES && MMU > - ---help--- > - This option helps catch unintended modifications to loadable > - kernel module's text and read-only data. It also prevents execution > - of module data. 
Such protection may interfere with run-time code > - patching and dynamic kernel tracing - and they might also protect > - against certain classes of kernel exploits. > - If in doubt, say "N". > - > source "drivers/hwtracing/coresight/Kconfig" > > endmenu > diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig > index f68e8ec..419a035 100644 > --- a/arch/arm/mm/Kconfig > +++ b/arch/arm/mm/Kconfig > @@ -1051,18 +1051,6 @@ config ARCH_SUPPORTS_BIG_ENDIAN > This option specifies the architecture can support big endian > operation. > > -config DEBUG_RODATA > - bool "Make kernel text and rodata read-only" > - depends on MMU && !XIP_KERNEL > - default y if CPU_V7 > - help > - If this is set, kernel text and rodata memory will be made > - read-only, and non-text kernel memory will be made non-executable. > -
[PATCH-tip v5 01/21] perf bench: New microbenchmark for userspace mutex performance
This microbenchmark simulates how the use of different futex types can affect the actual performanace of userspace mutex locks. The usage is: perf bench futex mutex Three sets of simple mutex lock and unlock functions are implemented using the wait-wake and PI futexes respectively. This microbenchmark then runs the locking rate measurement tests using either one of those mutexes or all of them consecutively. An example output from this microbenchmark was as follows: [PID 7352]: 72 threads doing WW2 futex lockings (load=1) for 10 secs. Locking statistics: Test run time= 10.00s Total exclusive locking ops = 90,875,633 Exclusive lock futex calls = 10,714,994 Exclusive unlock futex calls = 8,047,536 EAGAIN lock errors = 10,653,695 Process wakeups = 61,299 Percentages: Exclusive lock futex calls = 11.8% Exclusive unlock futex calls = 8.9% EAGAIN lock errors = 99.4% Process wakeups = 0.8% Per-thread Locking Rates: Avg = 126,202 ops/sec (+- 0.21%) Min = 120,114 ops/sec Max = 131,375 ops/sec Signed-off-by: Waiman Long--- tools/perf/Documentation/perf-bench.txt | 2 + tools/perf/bench/Build | 1 + tools/perf/bench/bench.h| 1 + tools/perf/bench/futex-locks.c | 759 tools/perf/builtin-bench.c | 10 + 5 files changed, 773 insertions(+) create mode 100644 tools/perf/bench/futex-locks.c diff --git a/tools/perf/Documentation/perf-bench.txt b/tools/perf/Documentation/perf-bench.txt index 34750fc..1fa5a74 100644 --- a/tools/perf/Documentation/perf-bench.txt +++ b/tools/perf/Documentation/perf-bench.txt @@ -203,6 +203,8 @@ Suite for evaluating requeue calls. *lock-pi*:: Suite for evaluating futex lock_pi calls. +*mutex*:: +Suite for evaluating futex calls for implementing userspace mutexes. 
SEE ALSO diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build index 60bf119..7ce1cd7 100644 --- a/tools/perf/bench/Build +++ b/tools/perf/bench/Build @@ -6,6 +6,7 @@ perf-y += futex-wake.o perf-y += futex-wake-parallel.o perf-y += futex-requeue.o perf-y += futex-lock-pi.o +perf-y += futex-locks.o perf-$(CONFIG_X86_64) += mem-memcpy-x86-64-asm.o perf-$(CONFIG_X86_64) += mem-memset-x86-64-asm.o diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h index 579a592..b0632df 100644 --- a/tools/perf/bench/bench.h +++ b/tools/perf/bench/bench.h @@ -36,6 +36,7 @@ int bench_futex_requeue(int argc, const char **argv, const char *prefix); /* pi futexes */ int bench_futex_lock_pi(int argc, const char **argv, const char *prefix); +int bench_futex_mutex(int argc, const char **argv, const char *prefix); #define BENCH_FORMAT_DEFAULT_STR "default" #define BENCH_FORMAT_DEFAULT 0 diff --git a/tools/perf/bench/futex-locks.c b/tools/perf/bench/futex-locks.c new file mode 100644 index 000..622a57c --- /dev/null +++ b/tools/perf/bench/futex-locks.c @@ -0,0 +1,759 @@ +/* + * Copyright (C) 2016-2017 Waiman Long + * + * This microbenchmark simulates how the use of different futex types can + * affect the actual performanace of userspace locking primitives like mutex. + * + * The raw throughput of the futex lock and unlock calls is not a good + * indication of actual throughput of the mutex code as it may not really + * need to call into the kernel. Therefore, 3 sets of simple mutex lock and + * unlock functions are written to implenment a mutex lock using the + * wait-wake (2 versions) and PI futexes respectively. These functions serve + * as the basis for measuring the locking throughput. 
+ */ + +#include + +#include +#include +#include "../util/stat.h" +#include "../perf-sys.h" +#include +#include +#include +#include +#include "bench.h" +#include "futex.h" + +#include +#include +#include + +#define CACHELINE_SIZE 64 +#define gettid() syscall(SYS_gettid) +#define __cacheline_aligned __attribute__((__aligned__(CACHELINE_SIZE))) + +typedef u32 futex_t; +typedef void (*lock_fn_t)(futex_t *futex, int tid); +typedef void (*unlock_fn_t)(futex_t *futex, int tid); + +/* + * Statistical count list + */ +enum { + STAT_OPS, /* # of exclusive locking operations */ + STAT_LOCKS, /* # of exclusive lock futex calls */ + STAT_UNLOCKS, /* # of exclusive unlock futex calls */ + STAT_SLEEPS, /* # of exclusive lock sleeps */ + STAT_EAGAINS, /* # of EAGAIN errors */ + STAT_WAKEUPS, /* # of wakeups (unlock return) */ + STAT_LOCKERRS, /* # of exclusive lock errors */ + STAT_UNLKERRS, /* # of exclusive unlock errors */ + STAT_NUM /* Total # of statistical counts */ +}; + +/* + * Syscall time list + */ +enum { + TIME_LOCK, /* Total exclusive lock syscall time */ +
[PATCH-tip v5 00/21] futex: Introducing throughput-optimized (TP) futexes
v4->v5:
 - Fix 0-day kernel build test compilation warnings.
 - Extract out non-TP futexes futex-mutex and futex-rwlock microbenchmarks as separate patches, as suggested by Arnaldo.
 - Rebased to the latest tip tree & use running_clock() instead of sched_clock().

v3->v4:
 - Properly handle EFAULT error due to page fault.
 - Extend the use cases of TP futexes to userspace rwlock.
 - Change the lock handoff trigger from number of spinnings to elapsed time (5ms).
 - Various updates to the "perf bench futex mutex" microbenchmark.
 - Add a new "perf bench futex rwlock" microbenchmark for measuring rwlock performance.
 - Add mutex_owner to futex_state object to hold the serialization mutex owner.
 - Streamline a number of helper functions and other miscellaneous coding improvements.
 - Rebase the patchset to the 4.10 kernel.

v2->v3:
 - Use the abbreviation TP for the new futexes instead of TO.
 - Make a number of changes according to review comments from ThomasG, PeterZ and MikeG.
 - Break the main futex patch into smaller pieces to make them easier to review.

v1->v2:
 - Add an explicit lock hand-off mechanism.
 - Add timeout support.
 - Simplify the required userspace code.
 - Fix a number of problems in the v1 code.

This patchset introduces a new futex implementation called throughput-optimized (TP) futexes. It is similar to PI futexes in its calling convention, but provides better throughput than the wait-wake (WW) futexes by encouraging lock stealing and optimistic spinning. The new TP futexes can be used in implementing both userspace mutexes and rwlocks. They provide better performance while simplifying the userspace locking implementation at the same time. The WW futexes are still needed to implement other synchronization primitives, like condition variables and semaphores, that cannot be handled by the TP futexes. 
Another advantage of TP futexes is that they have a built-in lock handoff mechanism to prevent lock starvation from happening, as long as the underlying kernel mutex doesn't have a lock starvation problem.

Patches 1-2 implement userspace exclusive lock and read/write lock benchmarks in perf bench using futexes.

Patches 3-8 are preparatory patches that pave the way to implement the TP futexes.

Patch 9 implements the basic TP futex that can support userspace mutexes.

Patch 10 adds robust handling to the TP futex to handle the death of TP futex exclusive lock owners.

Patch 11 adds a lock hand-off mechanism that can prevent lock starvation from happening while introducing minimal runtime overhead.

Patch 12 enables the FUTEX_LOCK futex(2) syscall to return status information, such as how the lock was acquired and how many times the task needed to sleep in the spinning loop, that can be used by userspace utilities to monitor how the TP futexes are performing.

Patch 13 enables userspace applications to supply a timeout value to abort the lock acquisition attempt after the specified time period has passed.

Patch 14 adds a new document tp-futex.txt in the Documentation directory to describe the new TP futexes.

Patch 15 extends the TP futexes to support userspace rwlocks.

Patch 16 enables more reader lock stealing of TP futexes in the kernel.

Patch 17 groups readers together as a spin group to enhance reader throughput.

Patch 18 updates the tp-futex.txt file to add information about rwlock support.

Patch 19 extends the perf bench futex locking benchmark to include variants using the new TP futexes.

Patch 20 updates the wake_up_q() function to return the number of woken tasks.

Patch 21 enables the dumping of internal futex state information via debugfs.

All the benchmark results shown in the change logs were produced by the microbenchmarks included in this patchset, so everyone can run the microbenchmarks to see how the TP futexes perform and behave on their own test systems. 
Once this patchset is finalized, an updated manpage patch to document the new TP futexes will be sent out. The next step will then be to make Glibc NPTL use the new TP futexes.

Performance highlights in terms of average locking rates (ops/sec) on a 2-socket system are as follows:

             WW futex   TP futex    Glibc
  - mutex     121,129    152,242
  - rwlock     99,149    162,887   30,362

Patches 9 and 17 contain more detailed information about the performance characteristics of the TP futexes when implementing userspace mutexes and rwlocks respectively, compared with other possible ways of doing so via the wait-wake futexes.

Waiman Long (21):
  perf bench: New microbenchmark for userspace mutex performance
  perf bench: New microbenchmark for userspace rwlock performance
  futex: Consolidate duplicated timer setup code
  futex: Rename futex_pi_state to futex_state
  futex: Add helpers to get & cmpxchg futex value without lock
  futex: Consolidate pure pi_state_list add &
[PATCH-tip v5 04/21] futex: Rename futex_pi_state to futex_state
The futex_pi_state structure will be overloaded in later patches to store state information about non-PI futexes. The structure name itself is therefore no longer a good description of its purpose, so it is renamed to futex_state, a more generic name. Some of the functions that process the futex states are also renamed.

Signed-off-by: Waiman Long
---
 include/linux/sched.h |   4 +-
 kernel/futex.c        | 107 +-
 2 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h index e9d5503..27a0b77 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -59,7 +59,7 @@ struct sched_attr; struct sched_param; -struct futex_pi_state; +struct futex_state; struct robust_list_head; struct bio_list; struct fs_struct; @@ -1354,7 +1354,7 @@ struct task_struct { struct compat_robust_list_head __user *compat_robust_list; #endif struct list_head pi_state_list; - struct futex_pi_state *pi_state_cache; + struct futex_state *pi_state_cache; #endif #ifdef CONFIG_PERF_EVENTS struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts]; diff --git a/kernel/futex.c b/kernel/futex.c index 07886ba..b21c6a0 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -194,11 +194,12 @@ #define FLAGS_HAS_TIMEOUT 0x04 /* - * Priority Inheritance state: + * Futex state object: + * - Priority Inheritance state */ -struct futex_pi_state { +struct futex_state { /* -* list of 'owned' pi_state instances - these have to be +* list of 'owned' state instances - these have to be * cleaned up in do_exit() if the task exits prematurely: */ struct list_head list; @@ -242,7 +243,7 @@ struct futex_q { struct task_struct *task; spinlock_t *lock_ptr; union futex_key key; - struct futex_pi_state *pi_state; + struct futex_state *pi_state; struct rt_mutex_waiter *rt_waiter; union futex_key *requeue_pi_key; u32 bitset; @@ -808,76 +809,76 @@ static int get_futex_value_locked(u32 *dest, u32 __user *from) /* * PI code: */ -static int refill_pi_state_cache(void) +static int 
refill_futex_state_cache(void) { - struct futex_pi_state *pi_state; + struct futex_state *state; if (likely(current->pi_state_cache)) return 0; - pi_state = kzalloc(sizeof(*pi_state), GFP_KERNEL); + state = kzalloc(sizeof(*state), GFP_KERNEL); - if (!pi_state) + if (!state) return -ENOMEM; - INIT_LIST_HEAD(&pi_state->list); + INIT_LIST_HEAD(&state->list); /* pi_mutex gets initialized later */ - pi_state->owner = NULL; - atomic_set(&pi_state->refcount, 1); - pi_state->key = FUTEX_KEY_INIT; + state->owner = NULL; + atomic_set(&state->refcount, 1); + state->key = FUTEX_KEY_INIT; - current->pi_state_cache = pi_state; + current->pi_state_cache = state; return 0; } -static struct futex_pi_state * alloc_pi_state(void) +static struct futex_state *alloc_futex_state(void) { - struct futex_pi_state *pi_state = current->pi_state_cache; + struct futex_state *state = current->pi_state_cache; - WARN_ON(!pi_state); + WARN_ON(!state); current->pi_state_cache = NULL; - return pi_state; + return state; } /* - * Drops a reference to the pi_state object and frees or caches it + * Drops a reference to the futex state object and frees or caches it * when the last reference is gone. * * Must be called with the hb lock held. 
*/ -static void put_pi_state(struct futex_pi_state *pi_state) +static void put_futex_state(struct futex_state *state) { - if (!pi_state) + if (!state) return; - if (!atomic_dec_and_test(&pi_state->refcount)) + if (!atomic_dec_and_test(&state->refcount)) return; /* -* If pi_state->owner is NULL, the owner is most probably dying -* and has cleaned up the pi_state already +* If state->owner is NULL, the owner is most probably dying +* and has cleaned up the futex state already */ - if (pi_state->owner) { - raw_spin_lock_irq(&pi_state->owner->pi_lock); - list_del_init(&pi_state->list); - raw_spin_unlock_irq(&pi_state->owner->pi_lock); + if (state->owner) { + raw_spin_lock_irq(&state->owner->pi_lock); + list_del_init(&state->list); + raw_spin_unlock_irq(&state->owner->pi_lock); - rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner); + rt_mutex_proxy_unlock(&state->pi_mutex, state->owner); } if (current->pi_state_cache) - kfree(pi_state); + kfree(state); else { /* -* pi_state->list is already empty. -* clear
[PATCH-tip v5 05/21] futex: Add helpers to get & cmpxchg futex value without lock
Two new helper functions cmpxchg_futex_value() and get_futex_value() are added to access and change the futex value without the hash bucket lock. As a result, page fault is enabled and the page will be faulted in if not present yet. Signed-off-by: Waiman Long--- kernel/futex.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/kernel/futex.c b/kernel/futex.c index b21c6a0..203a388 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -805,6 +805,21 @@ static int get_futex_value_locked(u32 *dest, u32 __user *from) return ret ? -EFAULT : 0; } +/* + * The equivalents of the above cmpxchg_futex_value_locked() and + * get_futex_value_locked which are called without the hash bucket lock + * and so can have page fault enabled. + */ +static inline int cmpxchg_futex_value(u32 *curval, u32 __user *uaddr, + u32 uval, u32 newval) +{ + return futex_atomic_cmpxchg_inatomic(curval, uaddr, uval, newval); +} + +static inline int get_futex_value(u32 *dest, u32 __user *from) +{ + return __get_user(*dest, from); +} /* * PI code: -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH-tip v5 02/21] perf bench: New microbenchmark for userspace rwlock performance
This microbenchmark simulates how the use of different futex types can affect the actual performance of userspace rwlock locks. The usage is:

  perf bench futex rwlock

Two sets of simple rwlock lock and unlock functions are implemented using the wait-wake futexes and the glibc rwlock respectively. This microbenchmark then runs the locking rate measurement tests using either one of those rwlocks or all of them consecutively. Sample output from this microbenchmark was as follows:

[PID 7434]: 72 threads doing WW futex lockings (load=1) for 10 secs.

Locking statistics:
Test run time                = 10.00s
Total exclusive locking ops  = 27,147,143
Total shared locking ops     = 27,147,143
Exclusive lock futex calls   = 8,812,299
Exclusive unlock futex calls = 9,034,693
Shared lock futex calls      = 1,685,538
Shared unlock futex calls    = 4,605,343
EAGAIN lock errors           = 10,472,722
Process wakeups              = 25,115

Percentages:
Exclusive lock futex calls   = 32.5%
Shared lock futex calls      = 6.2%
Exclusive unlock futex calls = 33.3%
Shared unlock futex calls    = 17.0%
EAGAIN lock errors           = 99.8%
Process wakeups              = 0.2%

Per-thread Locking Rates:
Avg = 75,392 ops/sec (+- 0.30%)
Min = 71,259 ops/sec
Max = 78,211 ops/sec

Signed-off-by: Waiman Long
---
 tools/perf/Documentation/perf-bench.txt |   3 +
 tools/perf/bench/bench.h                |   1 +
 tools/perf/bench/futex-locks.c          | 683 +++-
 tools/perf/builtin-bench.c              |   1 +
 4 files changed, 672 insertions(+), 16 deletions(-)

diff --git a/tools/perf/Documentation/perf-bench.txt b/tools/perf/Documentation/perf-bench.txt index 1fa5a74..e5a7079 100644 --- a/tools/perf/Documentation/perf-bench.txt +++ b/tools/perf/Documentation/perf-bench.txt @@ -206,6 +206,9 @@ Suite for evaluating futex lock_pi calls. *mutex*:: Suite for evaluating futex calls for implementing userspace mutexes. +*rwlock*:: +Suite for evaluating futex calls for implementing userspace rwlocks. 
+ SEE ALSO linkperf:perf[1] diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h index b0632df..a7e2037 100644 --- a/tools/perf/bench/bench.h +++ b/tools/perf/bench/bench.h @@ -37,6 +37,7 @@ /* pi futexes */ int bench_futex_lock_pi(int argc, const char **argv, const char *prefix); int bench_futex_mutex(int argc, const char **argv, const char *prefix); +int bench_futex_rwlock(int argc, const char **argv, const char *prefix); #define BENCH_FORMAT_DEFAULT_STR "default" #define BENCH_FORMAT_DEFAULT 0 diff --git a/tools/perf/bench/futex-locks.c b/tools/perf/bench/futex-locks.c index 622a57c..18dc71d 100644 --- a/tools/perf/bench/futex-locks.c +++ b/tools/perf/bench/futex-locks.c @@ -10,6 +10,10 @@ * unlock functions are written to implement a mutex lock using the * wait-wake (2 versions) and PI futexes respectively. These functions serve * as the basis for measuring the locking throughput. + * + * Two sets of simple reader/writer lock and unlock functions are also + * implemented using the wait-wake futexes as well as the Glibc rwlock + * respectively for performance measurement purpose. 
*/ #include @@ -21,11 +25,13 @@ #include #include #include +#include #include #include "bench.h" #include "futex.h" #include +#include #include #include @@ -42,13 +48,19 @@ */ enum { STAT_OPS, /* # of exclusive locking operations*/ + STAT_SOPS, /* # of shared locking operations */ STAT_LOCKS, /* # of exclusive lock futex calls */ STAT_UNLOCKS, /* # of exclusive unlock futex calls*/ STAT_SLEEPS,/* # of exclusive lock sleeps */ + STAT_SLOCKS,/* # of shared lock futex calls */ + STAT_SUNLOCKS, /* # of shared unlock futex calls */ + STAT_SSLEEPS, /* # of shared lock sleeps */ STAT_EAGAINS, /* # of EAGAIN errors */ STAT_WAKEUPS, /* # of wakeups (unlock return) */ STAT_LOCKERRS, /* # of exclusive lock errors */ STAT_UNLKERRS, /* # of exclusive unlock errors */ + STAT_SLOCKERRS, /* # of shared lock errors */ + STAT_SUNLKERRS, /* # of shared unlock errors*/ STAT_NUM/* Total # of statistical count */ }; @@ -58,6 +70,8 @@ enum { enum { TIME_LOCK, /* Total exclusive lock syscall time*/ TIME_UNLK, /* Total exclusive unlock syscall time */ + TIME_SLOCK, /* Total shared lock syscall time */ + TIME_SUNLK, /* Total shared unlock syscall time */ TIME_NUM, }; @@ -100,7 +114,11 @@ struct worker { static unsigned int threads_stopping; static struct stats throughput_stats; static lock_fn_t mutex_lock_fn; +static lock_fn_t read_lock_fn; +static
[PATCH-tip v5 11/21] TP-futex: Implement lock handoff to prevent lock starvation
The current TP futexes have no guarantee that the top waiter (serialization mutex owner) can get the lock within a finite time. As a result, lock starvation can happen. A lock handoff mechanism is added to the TP futexes to prevent lock starvation from happening. The idea is that the top waiter can set a special handoff_pid variable with its own pid. The futex owner will then check this variable at unlock time to see if it should free the futex or change its value to the designated pid to transfer the lock directly to the top waiter. This change does have the effect of limiting the amount of unfairness and hence may adversely affect performance depending on the workloads. Currently the handoff mechanism is triggered when the top waiter fails to acquire the lock after about 5ms of spinning and sleeping. So the maximum lock handoff rate is about 200 handoffs per second.

Signed-off-by: Waiman Long
---
 kernel/futex.c | 59 --
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c index 46a1a4b..348b44c 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -63,6 +63,7 @@ #include #include #include +#include #include #include #include @@ -237,6 +238,8 @@ struct futex_state { struct task_struct *mutex_owner;/* For TP futexes only */ atomic_t refcount; + u32 handoff_pid;/* For TP futexes only */ + enum futex_type type; union futex_key key; }; @@ -914,6 +917,7 @@ static int refill_futex_state_cache(void) /* pi_mutex gets initialized later */ state->owner = NULL; state->mutex_owner = NULL; + state->handoff_pid = 0; atomic_set(&state->refcount, 1); state->key = FUTEX_KEY_INIT; @@ -3337,8 +3341,9 @@ void exit_robust_list(struct task_struct *curr) * and unlocking. * * The purpose of this FUTEX_WAITERS bit is to make the unlocker wake up the - * serialization mutex owner. Not having the FUTEX_WAITERS bit set doesn't - * mean there is no waiter in the kernel. + * serialization mutex owner or to hand off the lock directly to the top + * waiter. 
Not having the FUTEX_WAITERS bit set doesn't mean there is no + * waiter in the kernel. * * Like PI futexes, TP futexes are orthogonal to robust futexes and can be * used together. @@ -3354,6 +3359,16 @@ void exit_robust_list(struct task_struct *curr) * the ownership of the futex. */ +/* + * Timeout value for enabling lock handoff and prevent lock starvation. + * + * Currently, lock handoff will be enabled if the top waiter can't get the + * futex after about 5ms. To reduce time checking overhead, it is only done + * after every 128 spins or right after wakeup. As a result, the actual + * elapsed time will be a bit longer than the specified value. + */ +#define TP_HANDOFF_TIMEOUT 500 /* 5ms */ + /** * lookup_futex_state - Looking up the futex state structure. * @hb: hash bucket @@ -3433,6 +3448,7 @@ static inline int put_futex_state_unlocked(struct futex_state *state) * If waiter is true * then * don't preserve the flag bits; + * check for handoff (futex word == own pid) * else * preserve the flag bits * endif @@ -3451,6 +3467,9 @@ static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval, uval = *puval; + if (waiter && (uval & FUTEX_TID_MASK) == vpid) + return 1; + if (uval & FUTEX_TID_MASK) return 0; /* Trylock fails */ @@ -3544,10 +3563,12 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, "futex: owner pid %d of TP futex 0x%lx was %s.\n" \ "\tLock is now acquired by pid %d!\n" - int ret; + int ret, loopcnt = 1; + bool handoff_set = false; u32 uval; u32 owner_pid = 0; struct task_struct *owner_task = NULL; + u64 handoff_time = running_clock() + TP_HANDOFF_TIMEOUT; preempt_disable(); WRITE_ONCE(state->mutex_owner, current); @@ -3602,6 +3623,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, if (need_resched()) { __set_current_state(TASK_RUNNING); schedule_preempt_disabled(); + loopcnt = 0; continue; } @@ -3610,6 +3632,23 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, break; } + /* +* 
Enable lock handoff if the elapsed time exceeds the timeout +* value. We also need to set the FUTEX_WAITERS bit to make +* sure that the futex lock holder will initiate the handoff at +* unlock time. +*/ + if (!handoff_set && !(loopcnt++ & 0x7f)) { + if (running_clock() >
[PATCH-tip v5 06/21] futex: Consolidate pure pi_state_list add & delete codes to helpers
Two new helper functions (task_pi_list_add & task_pi_list_del) are created to consolidate all the pure pi_state_list addition and deletion code. The set_owner argument in task_pi_list_add() will be needed in a later patch.

Signed-off-by: Waiman Long
---
 kernel/futex.c | 64 +++---
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c index 203a388..25881e6 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -824,6 +824,39 @@ static inline int get_futex_value(u32 *dest, u32 __user *from) /* * PI code: */ + +/** + * task_pi_list_add - add futex state object to a task's pi_state_list + * @task : task structure + * @state: futex state object + */ +static inline void task_pi_list_add(struct task_struct *task, + struct futex_state *state) +{ + raw_spin_lock_irq(&task->pi_lock); + WARN_ON(!list_empty(&state->list)); + list_add(&state->list, &task->pi_state_list); + state->owner = task; + raw_spin_unlock_irq(&task->pi_lock); +} + +/** + * task_pi_list_del - delete futex state object from a task's pi_state_list + * @state: futex state object + * @warn : warn if list is empty when set + */ +static inline void task_pi_list_del(struct futex_state *state, const bool warn) +{ + struct task_struct *task = state->owner; + + raw_spin_lock_irq(&task->pi_lock); + if (warn) + WARN_ON(list_empty(&state->list)); + list_del_init(&state->list); + state->owner = NULL; + raw_spin_unlock_irq(&task->pi_lock); +} + static int refill_futex_state_cache(void) { struct futex_state *state; @@ -876,9 +909,7 @@ static void put_futex_state(struct futex_state *state) * and has cleaned up the futex state already */ if (state->owner) { - raw_spin_lock_irq(&state->owner->pi_lock); - list_del_init(&state->list); - raw_spin_unlock_irq(&state->owner->pi_lock); + task_pi_list_del(state, false); rt_mutex_proxy_unlock(&state->pi_mutex, state->owner); } @@ -1399,16 +1430,8 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this, return ret; } - raw_spin_lock(&pi_state->owner->pi_lock); - WARN_ON(list_empty(&pi_state->list)); -
list_del_init(&pi_state->list); - raw_spin_unlock(&pi_state->owner->pi_lock); - - raw_spin_lock(&new_owner->pi_lock); - WARN_ON(!list_empty(&pi_state->list)); - list_add(&pi_state->list, &new_owner->pi_state_list); - pi_state->owner = new_owner; - raw_spin_unlock(&new_owner->pi_lock); + task_pi_list_del(pi_state, true); + task_pi_list_add(new_owner, pi_state); raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); @@ -2213,19 +2236,10 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, * We fixed up user space. Now we need to fix the pi_state * itself. */ - if (pi_state->owner != NULL) { - raw_spin_lock_irq(&pi_state->owner->pi_lock); - WARN_ON(list_empty(&pi_state->list)); - list_del_init(&pi_state->list); - raw_spin_unlock_irq(&pi_state->owner->pi_lock); - } + if (pi_state->owner != NULL) + task_pi_list_del(pi_state, true); - pi_state->owner = newowner; - - raw_spin_lock_irq(&newowner->pi_lock); - WARN_ON(!list_empty(&pi_state->list)); - list_add(&pi_state->list, &newowner->pi_state_list); - raw_spin_unlock_irq(&newowner->pi_lock); + task_pi_list_add(newowner, pi_state); return 0; /* -- 1.8.3.1
[PATCH-tip v5 14/21] TP-futex, doc: Add TP futexes documentation
This patch adds a new document file on how to use the TP futexes. Signed-off-by: Waiman Long--- Documentation/00-INDEX | 2 + Documentation/tp-futex.txt | 161 + 2 files changed, 163 insertions(+) create mode 100644 Documentation/tp-futex.txt diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index c8a8eb1..326b68c 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -416,6 +416,8 @@ this_cpu_ops.txt - List rationale behind and the way to use this_cpu operations. thermal/ - directory with information on managing thermal issues (CPU/temp) +tp-futex.txt + - Documentation on lightweight throughput-optimized futexes. trace/ - directory with info on tracing technologies within linux translations/ diff --git a/Documentation/tp-futex.txt b/Documentation/tp-futex.txt new file mode 100644 index 000..5324ee0 --- /dev/null +++ b/Documentation/tp-futex.txt @@ -0,0 +1,161 @@ +Started by: Waiman Long + +Throughput-Optimized Futexes + + +There are two main problems for a wait-wake futex (FUTEX_WAIT and +FUTEX_WAKE) when used for creating user-space locking primitives: + + 1) With a wait-wake futex, tasks waiting for a lock are put to sleep +in the futex queue to be woken up by the lock owner when it is done +with the lock. Waking up a sleeping task, however, introduces some +additional latency which can be large especially if the critical +section protected by the lock is relatively short. This may cause +a performance bottleneck on large systems with many CPUs running +applications that need a lot of inter-thread synchronization. + + 2) The performance of the wait-wake futex is currently +spinlock-constrained. When many threads are contending for a +futex in a large system with many CPUs, it is not unusual to have +spinlock contention accounting for more than 90% of the total +CPU cycles consumed at various points in time. 
+ +These two problems can create performance bottlenecks with a +futex-constrained workload especially on systems with a large number +of CPUs. + +The goal of the throughput-optimized (TP) futexes is to maximize the +locking throughput at the expense of fairness and deterministic +latency. This is done by encouraging lock stealing and optimistic +spinning on a locked futex when the futex owner is running. This is +the same optimistic spinning mechanism used by the kernel mutex and rw +semaphore implementations to improve performance. Optimistic spinning +is done without taking any lock. + +Lock stealing is known to be a performance enhancement technique as +long as the safeguards are in place to make sure that there will be no +lock starvation. The TP futexes have a built-in lock hand-off mechanism +to prevent lock starvation from happening as long as the underlying +kernel mutexes that the TP futexes use have no lock starvation problem. + +When the top lock waiter has failed to acquire the lock within a +certain time threshold, it will initiate the hand-off mechanism by +forcing the unlocker to transfer the lock to itself instead of freeing +it for others to grab. This limits the maximum latency a waiter has +to wait. + +The downside of this improved throughput is the increased variance +of the actual response times of the locking operations. Some locking +operations will be very fast, while others may be considerably slower. +The average response time should be better than the wait-wake futexes. + +Performance-wise, TP futexes should be faster than wait-wake futexes +especially if the futex lock holders do not sleep. For workloads +that do a lot of sleeping within the critical sections, the TP +futexes may not be faster than the wait-wake futexes. + +Implementation +-------------- + +Like the PI and robust futexes, an exclusive lock acquirer has to +atomically put its thread ID (TID) into the lower 30 bits of the +32-bit futex which should have an original value of 0. 
If it succeeds, +it will be the owner of the futex. Otherwise, it has to call into +the kernel using the FUTEX_LOCK futex(2) syscall. + + futex(uaddr, FUTEX_LOCK, 0, timeout, NULL, 0); + +Only the optional timeout parameter is being used by this futex(2) +syscall. + +Inside the kernel, a kernel mutex is used for serialization among +the futex waiters. Only the top lock waiter which is the owner of +the serialization mutex is allowed to continuously spin and attempt +to acquire the lock. Other lock waiters will have one attempt to +steal the lock before entering the mutex queues. + +When the exclusive futex lock owner is no longer running, the top +waiter will set the FUTEX_WAITERS bit before going to sleep. This is +to make sure the futex owner will go into the kernel at unlock time +to wake up the top waiter. + +The return values of the above futex locking syscall, if non-negative, +are status code that
[PATCH-tip v5 08/21] futex: Allow direct attachment of futex_state objects to hash bucket
Currently, the futex state objects can only be located indirectly as hash bucket => futex_q => futex state Actually it can be beneficial in some cases to locate the futex state object directly from the hash bucket without the futex_q middleman. Therefore, a new list head to link the futex state objects as well as a new spinlock to manage them are added to the hash bucket. To limit size increase for UP systems, these new fields are only for SMP machines where the cacheline alignment of the hash bucket leaves it with enough empty space for the new fields. Signed-off-by: Waiman Long--- kernel/futex.c | 24 1 file changed, 24 insertions(+) diff --git a/kernel/futex.c b/kernel/futex.c index 928212c..2ced6c0 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -209,6 +209,11 @@ struct futex_state { struct list_head list; /* +* Can link to fs_head in the owning hash bucket. +*/ + struct list_head fs_list; + + /* * The PI object: */ struct rt_mutex pi_mutex; @@ -264,11 +269,24 @@ struct futex_q { * Hash buckets are shared by all the futex_keys that hash to the same * location. Each key may have multiple futex_q structures, one for each task * waiting on a futex. + * + * Alternatively (in SMP), a key can be associated with a unique futex_state + * object where multiple waiters waiting for that futex can queue up in that + * futex_state object without using the futex_q structure. A separate + * futex_state lock (fs_lock) is used for processing those futex_state objects. 
*/ struct futex_hash_bucket { atomic_t waiters; spinlock_t lock; struct plist_head chain; + +#ifdef CONFIG_SMP + /* +* Fields for managing futex_state object list +*/ + spinlock_t fs_lock; + struct list_head fs_head; +#endif } ____cacheline_aligned_in_smp; /* @@ -875,6 +893,8 @@ static int refill_futex_state_cache(void) return -ENOMEM; INIT_LIST_HEAD(&state->list); + INIT_LIST_HEAD(&state->fs_list); + /* pi_mutex gets initialized later */ state->owner = NULL; atomic_set(&state->refcount, 1); @@ -3365,6 +3385,10 @@ static int __init futex_init(void) atomic_set(&futex_queues[i].waiters, 0); plist_head_init(&futex_queues[i].chain); spin_lock_init(&futex_queues[i].lock); +#ifdef CONFIG_SMP + INIT_LIST_HEAD(&futex_queues[i].fs_head); + spin_lock_init(&futex_queues[i].fs_lock); +#endif } return 0; -- 1.8.3.1
[PATCH-tip v5 12/21] TP-futex: Return status code on FUTEX_LOCK calls
To better understand how the TP futexes are performing, it is useful to return the internal status of the TP futexes. The FUTEX_LOCK futex(2) syscall will now return a positive status code if no error happens. The status code consists of the following 3 fields: 1) Bits 00-07: code on how the lock is acquired. 2) Bits 08-15: reserved 3) Bits 16-30: how many times the task sleeps in the optimistic spinning loop. By returning the TP status code, an external monitoring or tracking program can have a macro view of how the TP futexes are performing.

Signed-off-by: Waiman Long
---
 kernel/futex.c | 41 +++--
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c index 348b44c..22f7906 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3369,6 +3369,22 @@ void exit_robust_list(struct task_struct *curr) */ #define TP_HANDOFF_TIMEOUT 500 /* 5ms */ +/* + * The futex_lock() function returns the internal status of the TP futex. + * The status code consists of the following 3 fields: + * 1) bits 00-07: code on how the lock is acquired + *0 - steals the lock + *1 - top waiter (mutex owner) acquires the lock + *2 - handed off the lock + * 2) bits 08-15: reserved + * 3) bits 16-30: how many times the task has slept or yielded to the scheduler + * in futex_spin_on_owner(). + */ +#define TP_LOCK_STOLEN 0 +#define TP_LOCK_ACQUIRED 1 +#define TP_LOCK_HANDOFF2 +#define TP_STATUS_SLEEP(val, sleep)((val)|((sleep) << 16)) + /** * lookup_futex_state - Looking up the futex state structure. * @hb: hash bucket @@ -3453,9 +3469,11 @@ static inline int put_futex_state_unlocked(struct futex_state *state) * preserve the flag bits * endif * - * Return: 1 if lock acquired; + * Return: TP_LOCK_ACQUIRED if lock acquired; + *TP_LOCK_HANDOFF if lock was handed off; *0 if lock acquisition failed; *-EFAULT if an error happened. + **puval will contain the latest futex value when trylock fails. 
*/ static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval, const bool waiter) @@ -3468,7 +3486,7 @@ static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval, uval = *puval; if (waiter && (uval & FUTEX_TID_MASK) == vpid) - return 1; + return TP_LOCK_HANDOFF; if (uval & FUTEX_TID_MASK) return 0; /* Trylock fails */ @@ -3479,7 +3497,7 @@ static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval, if (unlikely(cmpxchg_futex_value(puval, uaddr, uval, vpid|flags))) return -EFAULT; - return *puval == uval; + return (*puval == uval) ? TP_LOCK_ACQUIRED : 0; } /** @@ -3493,7 +3511,8 @@ static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval, * of faulting in the futex word. This function should only be called from * within futex_spin_on_owner(). * - * Return: 1 if lock acquired; + * Return: TP_LOCK_ACQUIRED if lock acquired; + *TP_LOCK_HANDOFF if lock was handed off; *0 if lock acquisition failed; *-EFAULT if an error happened. */ @@ -3554,7 +3573,7 @@ static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval) * unless the pid wraps around and the perceived owner is not the real owner. * To guard against this case, we will have to use the robust futex feature. * - * Return: 0 if futex acquired, < 0 if an error happens. + * Return: TP status code if lock acquired, < 0 if an error happens. 
*/ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, struct futex_state *state) @@ -3564,6 +3583,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, "\tLock is now acquired by pid %d!\n" int ret, loopcnt = 1; + int nsleep = 0; bool handoff_set = false; u32 uval; u32 owner_pid = 0; @@ -3623,6 +3643,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, if (need_resched()) { __set_current_state(TASK_RUNNING); schedule_preempt_disabled(); + nsleep++; loopcnt = 0; continue; } @@ -3693,6 +3714,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, */ if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS)) { schedule_preempt_disabled(); + nsleep++; loopcnt = 0; } __set_current_state(TASK_RUNNING); @@ -3719,7 +3741,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
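The packed status word described in this patch can be unpacked in userspace with a pair of helpers. A minimal sketch follows; the `TP_*` values mirror the patch, while the two decode helpers are our own names, not part of the proposed ABI.

```c
#include <assert.h>

/* Values and layout copied from the patch above. */
#define TP_LOCK_STOLEN		0
#define TP_LOCK_ACQUIRED	1
#define TP_LOCK_HANDOFF		2
#define TP_STATUS_SLEEP(val, sleep)	((val) | ((sleep) << 16))

/* Bits 0-7: how the lock was acquired. */
static inline int tp_status_acq_code(int status)
{
	return status & 0xff;
}

/* Bits 16-30: number of sleeps in the optimistic spinning loop. */
static inline int tp_status_nsleep(int status)
{
	return (status >> 16) & 0x7fff;
}
```

A monitoring program would call these on any non-negative FUTEX_LOCK return value; negative values remain plain error codes.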
[PATCH-tip v5 07/21] futex: Add a new futex type field into futex_state
As the futex_state structure will be overloaded in later patches to be used by non-PI futexes, it is necessary to add a type field to distinguish among different types of futexes. Signed-off-by: Waiman Long--- kernel/futex.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index 25881e6..928212c 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -193,6 +193,10 @@ #define FLAGS_CLOCKRT 0x02 #define FLAGS_HAS_TIMEOUT 0x04 +enum futex_type { + TYPE_PI = 0, +}; + /* * Futex state object: * - Priority Inheritance state @@ -212,6 +216,7 @@ struct futex_state { struct task_struct *owner; atomic_t refcount; + enum futex_type type; union futex_key key; }; @@ -905,13 +910,14 @@ static void put_futex_state(struct futex_state *state) return; /* -* If state->owner is NULL, the owner is most probably dying -* and has cleaned up the futex state already +* If state->owner is NULL and the type is TYPE_PI, the owner is +* most probably dying and has cleaned up the futex state already. 
 */
 	if (state->owner) {
 		task_pi_list_del(state, false);
-		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
+		if (state->type == TYPE_PI)
+			rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
 	}

 	if (current->pi_state_cache)
@@ -1064,7 +1070,7 @@ static int attach_to_pi_state(u32 uval, struct futex_state *pi_state,
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
 	 */
-	if (unlikely(!pi_state))
+	if (unlikely(!pi_state || (pi_state->type != TYPE_PI)))
 		return -EINVAL;

 	WARN_ON(!atomic_read(&pi_state->refcount));
@@ -1182,6 +1188,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	/* Store the key for possible exit cleanups: */
 	pi_state->key = *key;
+	pi_state->type = TYPE_PI;

 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &p->pi_state_list);
--
1.8.3.1
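The guard this patch adds to attach_to_pi_state() — refuse to attach to a state object whose type tag doesn't match — can be illustrated with a small userspace sketch. The enum mirrors the patch; the attach helper below is our own illustration of the pattern, not kernel code, and it reports failure as -1 rather than -EINVAL.

```c
#include <assert.h>
#include <stddef.h>

enum futex_type {
	TYPE_PI = 0,
	/* later patches add more types here, e.g. TYPE_TP */
};

struct futex_state {
	enum futex_type type;	/* tag set once at creation */
	int refcount;
};

/* Attach to a PI state; fail if userspace mixed up futex types. */
static int attach_to_pi_state(struct futex_state *state)
{
	if (!state || state->type != TYPE_PI)
		return -1;
	state->refcount++;
	return 0;
}
```

Tagging the state object once at creation lets every later lookup cheaply reject a futex word that is being used with the wrong operation family.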
[PATCH-tip v5 13/21] TP-futex: Add timeout support
Unlike other futexes, TP futexes do not accept the specification of a timeout value. To align with the other futexes, timeout support is now added. However, the timeout isn't as precise as the other futex types due to the fact that timer expiration can't be detected when the thread is waiting in the serialization mutex. Signed-off-by: Waiman Long--- kernel/futex.c | 50 ++ 1 file changed, 42 insertions(+), 8 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index 22f7906..91b2e02 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3558,9 +3558,10 @@ static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval) /** * futex_spin_on_owner - Optimistically spin on futex owner - * @uaddr: futex address - * @vpid: PID of current task - * @state: futex state object + * @uaddr: futex address + * @vpid:PID of current task + * @state: futex state object + * @timeout: hrtimer_sleeper structure * * Spin on the futex word while the futex owner is active. Otherwise, set * the FUTEX_WAITERS bit and go to sleep. @@ -3576,7 +3577,8 @@ static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval) * Return: TP status code if lock acquired, < 0 if an error happens. */ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, - struct futex_state *state) + struct futex_state *state, + struct hrtimer_sleeper *timeout) { #define OWNER_DEAD_MESSAGE \ "futex: owner pid %d of TP futex 0x%lx was %s.\n" \ @@ -3648,6 +3650,12 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, continue; } + + if (timeout && !timeout->task) { + ret = -ETIMEDOUT; + break; + } + if (signal_pending(current)) { ret = -EINTR; break; @@ -3755,10 +3763,18 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, * This function is not inlined so that it can show up separately in perf * profile for performance analysis purpose. 
 *
+ * The timeout functionality isn't as precise as the other futex types
+ * as timer expiration will not be detected while waiting in the
+ * serialization mutex. One possible solution is to modify the expiration
+ * function to send out a signal as well so as to break the thread out of
+ * the mutex code if we really want more precise timeout.
+ *
  * Return: TP status code if lock acquired, < 0 if an error happens.
  */
-static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
+static noinline int
+futex_lock(u32 __user *uaddr, unsigned int flags, ktime_t *time)
 {
+	struct hrtimer_sleeper timeout, *to;
 	struct futex_hash_bucket *hb;
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_state *state;
@@ -3782,6 +3798,8 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 	if (refill_futex_state_cache())
 		return -ENOMEM;

+	to = futex_setup_timer(time, &timeout, flags, current->timer_slack_ns);
+
 	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
 	if (unlikely(ret))
 		goto out;
@@ -3807,6 +3825,9 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 		goto out_put_state_key;
 	}

+	if (to)
+		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
+
 	/*
 	 * Acquiring the serialization mutex.
 	 *
@@ -3818,10 +3839,19 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 		goto out_put_state_key;

 	/*
+	 * Check for timeout.
+	 */
+	if (to && !timeout.task) {
+		ret = -ETIMEDOUT;
+		mutex_unlock(&state->mutex);
+		goto out_put_state_key;
+	}
+
+	/*
 	 * As the mutex owner, we can now spin on the futex word as well as
 	 * the active-ness of the futex owner.
 */
-	ret = futex_spin_on_owner(uaddr, vpid, state);
+	ret = futex_spin_on_owner(uaddr, vpid, state, to);

 	mutex_unlock(&state->mutex);
@@ -3838,6 +3868,10 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 	put_futex_key(&key);
 out:
+	if (to) {
+		hrtimer_cancel(&to->timer);
+		destroy_hrtimer_on_stack(&to->timer);
+	}
 	return ret;
 }

@@ -3991,7 +4025,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &curval, 1);
 #ifdef CONFIG_SMP
 	case FUTEX_LOCK:
-		return futex_lock(uaddr, flags);
+		return futex_lock(uaddr, flags, timeout);
 	case FUTEX_UNLOCK:
 		return
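With timeout support, a caller of the timed FUTEX_LOCK has to handle -ETIMEDOUT alongside -EINTR. The retry policy can be sketched as below; since the TP futex opcodes are not in mainline kernels, the actual `futex(uaddr, FUTEX_LOCK, 0, timeout, NULL, 0)` call is represented by an injected function pointer (a stand-in for illustration and testing, not the real syscall).

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the futex(2) FUTEX_LOCK call. */
typedef int (*tp_lock_op)(void);

/*
 * Retry the lock on EINTR (a signal interrupted the wait); give up
 * and propagate ETIMEDOUT or any other error.  A non-negative return
 * is the TP status code described in the previous patch.
 */
static int tp_lock_with_timeout(tp_lock_op op)
{
	for (;;) {
		int ret = op();

		if (ret >= 0)
			return ret;	/* TP status code */
		if (ret == -EINTR)
			continue;	/* signal: retry the lock */
		return ret;		/* -ETIMEDOUT, -EFAULT, ... */
	}
}

/* Example stub ops used to exercise the policy. */
static int op_count;
static int op_eintr_then_ok(void)
{
	return (op_count++ == 0) ? -EINTR : 1 /* TP_LOCK_ACQUIRED */;
}
static int op_timeout(void)
{
	return -ETIMEDOUT;
}
```

The point is that a timeout is terminal for the caller while a signal is not; the kernel side makes the same distinction in futex_spin_on_owner().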
[PATCH-tip v5 17/21] TP-futex: Group readers together in wait queue
All the TP futex lock waiters are serialized in the kernel using a kernel mutex which acts like a wait queue. The order in which the waiters pop out of the wait queue will affect performance when exclusive (writer) and shared (reader) lock waiters are mixed in the queue. The worst case scenario will be something like RWRWRW... in terms of ordering, where no parallelism in terms of lock ownership can happen.

To improve throughput, the readers are now grouped together as a single entity in the wait queue. The first reader that enters the mutex wait queue will become the leader of the group. The other readers will spin on the group leader via an OSQ lock. When the futex is put in the shared mode, either by the group leader or by an external reader, the spinning readers in the reader group will then acquire the read lock successively.

The spinning readers in the group will get disbanded when the group leader goes to sleep. In this case, all those readers will go into the mutex wait queue alone and wait for their turn to acquire the TP futex.

On a 2-socket 36-core E5-2699 v3 system (HT off) running on a 4.10 based kernel, the performance of a TP rwlock with 1:1 reader/writer ratio versus one based on the wait-wake futexes as well as the Glibc rwlock with a microbenchmark (1 worker thread per cpu core) running for 10s was as follows:

                               WW futex     TP futex        Glibc
                               --------     --------        -----
  Total locking ops          35,707,234   58,645,434   10,930,422
  Per-thread avg/sec             99,149      162,887       30,362
  Per-thread min/sec             93,190       38,641       29,872
  Per-thread max/sec            104,213      225,983       30,708
  Write lock futex calls     11,161,534       14,094            -
  Write unlock futex calls    8,696,121          167            -
  Read lock futex calls       1,717,863        6,659            -
  Read unlock futex calls     4,316,418          323            -

It can be seen that the throughput of the TP futex is close to 2X the WW futex and almost 6X the Glibc version in this particular case.
The following table shows the CPU core scaling for the average per-thread locking rates (ops/sec):

                               WW futex     TP futex        Glibc
                               --------     --------        -----
   9 threads (1 socket)         422,014      647,006      190,236
  18 threads (2 sockets)        197,145      330,353       66,934
  27 threads (2 sockets)        127,947      213,417       43,641
  36 threads (2 sockets)         99,149      162,887       30,362

The following table shows the average per-thread locking rates (36 threads) with different reader percentages:

                               WW futex     TP futex        Glibc
                               --------     --------        -----
   90% readers                  124,426      159,657       60,208
   95% readers                  148,905      152,315       68,666
  100% reader                   210,029      191,759       84,316

The rwlocks based on the WW futexes and Glibc prefer readers by default. So it can be seen that the performance of the WW futex and Glibc rwlocks increased with higher reader percentages. The TP futex rwlock, however, prefers writers a bit more than readers. So its performance didn't increase as the reader percentage rose.

The WW futexes and Glibc rwlocks also have a writer-preferring version. Their performance with the same tests is as follows:

                               WW futex     TP futex        Glibc
                               --------     --------        -----
   90% readers                   87,866      159,657       26,057
   95% readers                   93,193      152,315       32,611
  100% reader                   193,267      191,759       88,440

With separate 18 reader and 18 writer threads, the average per-thread reader and writer locking rates with different load latencies (L, default = 1) and reader-preferring rwlocks are:

                               WW futex     TP futex        Glibc
                               --------     --------        -----
  Reader rate, L=1              381,411       74,059      164,184
  Writer rate, L=1                    0      240,841            0
  Reader rate, L=5              330,732       57,361      150,691
  Writer rate, L=5                    0      175,400            0
  Reader rate, L=50             304,505       17,355       97,066
  Writer rate, L=50                   0      114,504            0

The corresponding locking rates with writer-preferring rwlocks are:

                               WW futex     TP futex        Glibc
                               --------     --------        -----
  Reader rate, L=1              138,805       74,059           54
  Writer rate, L=1               31,113      240,841       56,424
  Reader rate, L=5              114,414       57,361           24
  Writer rate, L=5               28,062      175,400       52,039
  Reader rate, L=50              88,619       51,483
[PATCH-tip v5 15/21] TP-futex: Support userspace reader/writer locks
The TP futexes were designed to support userspace mutually exclusive locks. They are now extended to support userspace reader/writer locks as well. Two new futex command codes are added:

 1) FUTEX_LOCK_SHARED - to acquire a shared lock (reader-lock)
 2) FUTEX_UNLOCK_SHARED - to release a shared lock (reader-unlock)

The original FUTEX_LOCK and FUTEX_UNLOCK command codes act like the writer lock and unlock methods in a userspace rwlock. A special bit FUTEX_SHARED (bit 29) is used to indicate if a lock is an exclusive (writer) or shared (reader) lock. Because of that, the maximum TID value is now reduced to 2^29 instead of 2^30. With a shared lock, the lower 24 bits are used to track the reader counts. Bit 28 is a special FUTEX_SHARED_UNLOCK bit that should be used by userspace to prevent racing in the FUTEX_UNLOCK_SHARED futex(2) syscall. Bits 24-27 are reserved for future extension.

When a TP futex becomes reader-owned, there is no way to tell if the readers are running or not. So a fixed time interval (50us) is now used for spinning on a reader-owned futex. The top waiter will abort the spinning and go to sleep when no activity is observed in the futex value itself after the spinning timer expires. Activity here means any change in the reader count of the futex indicating that some new readers are coming in or old readers are leaving.

Sample code to acquire/release a reader lock in userspace is:

  void read_lock(int *faddr)
  {
	int val = cmpxchg(faddr, 0, FUTEX_SHARED + 1);

	if (!val)
		return;
	for (;;) {
		int old = val, new;

		if (!old)
			new = FUTEX_SHARED + 1;
		else if ((old & (~FUTEX_TID_SHARED_MASK|FUTEX_SHARED_UNLOCK)) ||
			 !(old & FUTEX_SHARED))
			break;
		else
			new = old + 1;
		val = cmpxchg(faddr, old, new);
		if (val == old)
			return;
	}
	while (futex(faddr, FUTEX_LOCK_SHARED, ...) < 0)
		;
  }

  void read_unlock(int *faddr)
  {
	int val = xadd(faddr, -1);
	int old;

	for (;;) {
		if ((val & (FUTEX_SCNT_MASK|FUTEX_SHARED_UNLOCK)) ||
		    !(val & FUTEX_SHARED))
			return;	/* Not the last reader or shared */
		if (val & ~FUTEX_SHARED_TID_MASK) {
			old = cmpxchg(faddr, val, val|FUTEX_SHARED_UNLOCK);
			if (old == val)
				break;
		} else {
			old = cmpxchg(faddr, val, 0);
			if (old == val)
				return;
		}
		val = old;
	}
	futex(faddr, FUTEX_UNLOCK_SHARED, ...);
  }

The read lock/unlock code is a bit more complicated than the corresponding write lock/unlock code. This is by design to minimize any additional overhead for mutex lock and unlock. As a result, the TP futex rwlock prefers writers a bit more than readers.

Signed-off-by: Waiman Long
---
 include/uapi/linux/futex.h |  28 +-
 kernel/futex.c             | 225 ++++++++++++++++++++++++++++++++++++++------
 2 files changed, 218 insertions(+), 35 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 7f676ac..1a22ecc 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -22,6 +22,8 @@
 #define FUTEX_CMP_REQUEUE_PI	12
 #define FUTEX_LOCK		13
 #define FUTEX_UNLOCK		14
+#define FUTEX_LOCK_SHARED	15
+#define FUTEX_UNLOCK_SHARED	16

 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
@@ -43,6 +45,9 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_LOCK_PRIVATE	(FUTEX_LOCK | FUTEX_PRIVATE_FLAG)
 #define FUTEX_UNLOCK_PRIVATE	(FUTEX_UNLOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_SHARED_PRIVATE	(FUTEX_LOCK_SHARED | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_SHARED_PRIVATE	(FUTEX_UNLOCK_SHARED | \
+					 FUTEX_PRIVATE_FLAG)

 /*
  * Support for robust futexes: the kernel cleans up held futexes at
@@ -111,9 +116,28 @@ struct robust_list_head {
 #define FUTEX_OWNER_DIED	0x40000000

 /*
- * The rest of the robust-futex field is for the TID:
+ * Shared locking bit (for readers).
+ * Robust futexes cannot be used in combination with shared locking.
+ *
+ * The FUTEX_SHARED_UNLOCK bit can be set by userspace to indicate that
+ * a shared unlock is in progress and so no shared locking is allowed
+ * within the kernel.
+ */
+#define FUTEX_SHARED		0x20000000
+#define FUTEX_SHARED_UNLOCK	0x10000000
+
+/*
+ * The rest of the bits are for the TID or, in the case of readers, the
+ * shared locking bit and the number of active readers present. IOW, the
+ * max supported
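The commit message shows the read-side sample code; for comparison, the exclusive (writer) fast path is a single cmpxchg of 0 <-> TID. The sketch below illustrates that, with the `futex(FUTEX_LOCK/FUTEX_UNLOCK)` slow paths replaced by stubs since the TP futex opcodes are not in mainline kernels; the `FUTEX_SHARED` constant follows the commit message (bit 29), and all function names here are ours.

```c
#include <assert.h>
#include <stdatomic.h>

#define FUTEX_SHARED	0x20000000	/* bit 29, per the commit message */

static int slow_path_calls;		/* stands in for futex(2) */
static void futex_lock_slow(atomic_int *f)   { (void)f; slow_path_calls++; }
static void futex_unlock_slow(atomic_int *f) { (void)f; slow_path_calls++; }

static void write_lock(atomic_int *faddr, int tid)
{
	int expected = 0;

	/* uncontended: 0 -> TID, no syscall needed */
	if (atomic_compare_exchange_strong(faddr, &expected, tid))
		return;
	futex_lock_slow(faddr);		/* futex(faddr, FUTEX_LOCK, ...) */
}

static void write_unlock(atomic_int *faddr, int tid)
{
	int expected = tid;

	/* uncontended: TID -> 0; waiters/handoff bits force the syscall */
	if (atomic_compare_exchange_strong(faddr, &expected, 0))
		return;
	futex_unlock_slow(faddr);	/* futex(faddr, FUTEX_UNLOCK, ...) */
}
```

This asymmetry is the design point the commit message makes: the writer path stays a single atomic in the common case, while readers pay the extra counting logic.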
[PATCH-tip v5 16/21] TP-futex: Enable kernel reader lock stealing
The TP-futexes prefer writers as the code path is much simpler for them. This may not be good for overall system performance especially if there is a fair amount of reader activities. This patch enables kernel reader to steal the lock when the futex is currently reader-owned and the lock handoff mechanism hasn't been enabled yet. A new field locksteal_disabled is added to the futex state object for controlling reader lock stealing. So the waiting reader must retrieve the futex state object first before doing it. Signed-off-by: Waiman Long--- kernel/futex.c | 23 +-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index 1f97c85..90c8c80 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -239,6 +239,7 @@ struct futex_state { atomic_t refcount; u32 handoff_pid;/* For TP futexes only */ + int locksteal_disabled; /* For TP futexes only */ enum futex_type type; union futex_key key; @@ -3438,6 +3439,7 @@ void exit_robust_list(struct task_struct *curr) state = alloc_futex_state(); state->type = TYPE_TP; state->key = *key; + state->locksteal_disabled = false; list_add(>fs_list, >fs_head); WARN_ON(atomic_read(>refcount) != 1); @@ -3783,9 +3785,10 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, WRITE_ONCE(state->handoff_pid, vpid); /* -* Disable handoff check. +* Disable handoff check and reader lock +* stealing. */ - handoff_set = true; + state->locksteal_disabled = handoff_set = true; } } @@ -3888,6 +3891,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, */ WRITE_ONCE(state->mutex_owner, NULL); WRITE_ONCE(state->handoff_pid, 0); + state->locksteal_disabled = false; preempt_enable(); return (ret < 0) ? 
ret : TP_STATUS_SLEEP(ret, nsleep);
@@ -3968,6 +3972,21 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
 		goto out_put_state_key;
 	}

+	/*
+	 * For reader, we will try to steal the lock here as if it is the
+	 * top waiter without taking the serialization mutex if reader
+	 * lock stealing isn't disabled even though the FUTEX_WAITERS bit
+	 * is set.
+	 */
+	if (shared && !state->locksteal_disabled) {
+		ret = futex_trylock(uaddr, vpid, &uval, true);
+		if (ret) {
+			if (ret > 0)
+				ret = TP_LOCK_STOLEN;
+			goto out_put_state_key;
+		}
+	}
+
 	if (to)
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
--
1.8.3.1
[PATCH-tip v5 21/21] futex: Dump internal futex state via debugfs
For debugging purposes, it is sometimes useful to dump the internal states in the futex hash bucket table. This patch adds a file "futex_hash_table" in the debugfs root filesystem to dump the internal futex states.

Signed-off-by: Waiman Long
---
 kernel/futex.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 6bf5304..8b7a591 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -4428,3 +4428,85 @@ static int __init futex_init(void)
 	return 0;
 }
 __initcall(futex_init);
+
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_SMP)
+/*
+ * Debug code to dump selected content of in-kernel futex hash bucket table.
+ */
+#include <linux/debugfs.h>
+
+static int futex_dump_show(struct seq_file *m, void *arg)
+{
+	struct futex_hash_bucket *hb = arg;
+	struct futex_state *state;
+	int i;
+
+	if (list_empty(&hb->fs_head))
+		return 0;
+
+	seq_printf(m, "\nHash bucket %d:\n", (int)(hb - futex_queues));
+	spin_lock(&hb->fs_lock);
+	i = 0;
+	list_for_each_entry(state, &hb->fs_head, fs_list) {
+		seq_printf(m, "  Futex state %d\n", i++);
+		if (state->owner)
+			seq_printf(m, "    owner PID = %d\n",
+				   task_pid_vnr(state->owner));
+		if (state->mutex_owner)
+			seq_printf(m, "    mutex owner PID = %d\n",
+				   task_pid_vnr(state->mutex_owner));
+		seq_printf(m, "    reference count = %d\n",
+			   atomic_read(&state->refcount));
+		seq_printf(m, "    handoff PID = %d\n", state->handoff_pid);
+	}
+	spin_unlock(&hb->fs_lock);
+	return 0;
+}
+
+static void *futex_dump_start(struct seq_file *m, loff_t *pos)
+{
+	return (*pos < futex_hashsize) ? &futex_queues[*pos] : NULL;
+}
+
+static void *futex_dump_next(struct seq_file *m, void *arg, loff_t *pos)
+{
+	(*pos)++;
+	return (*pos < futex_hashsize) ? &futex_queues[*pos] : NULL;
+}
+
+static void futex_dump_stop(struct seq_file *m, void *arg)
+{
+}
+
+static const struct seq_operations futex_dump_op = {
+	.start = futex_dump_start,
+	.next  = futex_dump_next,
+	.stop  = futex_dump_stop,
+	.show  = futex_dump_show,
+};
+
+static int futex_dump_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &futex_dump_op);
+}
+
+static const struct file_operations fops_futex_dump = {
+	.open	 = futex_dump_open,
+	.read	 = seq_read,
+	.llseek  = seq_lseek,
+	.release = seq_release,
+};
+
+/*
+ * Initialize debugfs for the futex hash bucket table dump.
+ */
+static int __init init_futex_dump(void)
+{
+	if (!debugfs_create_file("futex_hash_table", 0400, NULL, NULL,
+				 &fops_futex_dump))
+		return -ENOMEM;
+	return 0;
+}
+fs_initcall(init_futex_dump);
+
+#endif /* CONFIG_DEBUG_FS && CONFIG_SMP */
--
1.8.3.1
[PATCH-tip v5 18/21] TP-futex, doc: Update TP futexes document on shared locking
The tp-futex.txt file is updated to add a description of the shared locking support.

Signed-off-by: Waiman Long
---
 Documentation/tp-futex.txt | 163 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 143 insertions(+), 20 deletions(-)

diff --git a/Documentation/tp-futex.txt b/Documentation/tp-futex.txt
index 5324ee0..d4f0ed6 100644
--- a/Documentation/tp-futex.txt
+++ b/Documentation/tp-futex.txt
@@ -68,22 +68,58 @@ the kernel using the FUTEX_LOCK futex(2) syscall.

 Only the optional timeout parameter is being used by this futex(2)
 syscall.

-Inside the kernel, a kernel mutex is used for serialization among
-the futex waiters. Only the top lock waiter which is the owner of
-the serialization mutex is allowed to continuously spin and attempt
-to acquire the lock. Other lock waiters will have one attempt to
-steal the lock before entering the mutex queues.
+A shared lock acquirer, on the other hand, has to do one of the
+following 2 actions depending on the current value of the futex:
+
+ 1) If the current futex value is 0, atomically put (FUTEX_SHARED + 1)
+    into the lower 30 bits of the futex.
+
+ 2) If the FUTEX_SHARED bit is set and the FUTEX_SHARED_UNLOCK bit
+    isn't, atomically increment the reader count in the lower 24 bits.
+    The other unused bits (24-27) can be used for other management purpose
+    and will be ignored by the kernel.
+
+Failure of both of the above actions will lead to the following new
+FUTEX_LOCK_SHARED futex(2) syscall.
+
+	futex(uaddr, FUTEX_LOCK_SHARED, 0, timeout, NULL, 0);
+
+The special FUTEX_SHARED bit (bit 30) is used to distinguish between a
+shared lock and an exclusive lock. If the FUTEX_SHARED bit isn't set,
+the lower 29 bits contain the TID of the exclusive lock owner. If
+the FUTEX_SHARED bit is set, the lower 29 bits contain the number of
+tasks owning the shared lock. Therefore, only 29 bits are allowed
+to be used by the TID.
+ +The FUTEX_SHARED_UNLOCK bit can be used by userspace to prevent racing +and to indicate that a shared unlock is in progress as it will disable +any shared trylock attempt in the kernel. + +Inside the kernel, a kernel mutex is used for serialization among the +futex waiters. Only the top lock waiter which is the owner of the +serialization mutex is allowed to continuously spin and attempt to +acquire the lock. Other lock waiters will have one or two attempts +to steal the lock before entering the mutex queues. When the exclusive futex lock owner is no longer running, the top waiter will set the FUTEX_WAITERS bit before going to sleep. This is to make sure the futex owner will go into the kernel at unlock time to wake up the top waiter. +When the futex is shared by multiple owners, there is no way to +determine if all the shared owners are running or not. In this case, +the top waiter will spin for a fixed amount of time if no reader +activity is observed before giving up and going to sleep. Any change +in the reader count will be considered reader activities and so will +reset the timer. However, when the hand-off threshold has been reached, +reader optimistic spinning timer reset will be disabled automatically. + The return values of the above futex locking syscall, if non-negative, -are status code that consists of 2 fields - the lock acquisition code -(bits 0-7) and the number of sleeps (bits 8-30) in the optimistic -spinning loop before acquiring the futex. A negative returned value -means an error has happened. +are status code that consists of 3 fields - the lock acquisition code +(bits 0-7) and the number of sleeps (bits 16-30) in the optimistic +spinning loop before acquiring the futex. Bits 8-15 are reserved +for future extension. A negative returned value means an error has +happened. 
The lock acquisition code can have the following values:

 a) 0 - lock stolen as non-top waiter
@@ -96,14 +132,21 @@ issue a FUTEX_UNLOCK futex(2) syscall to wake up the top waiter.

 	futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);

-A return value of 1 from the FUTEX_UNLOCK futex(2) syscall indicates
-a task has been woken up. The syscall returns 0 if no sleeping task
-is woken. A negative value will be returned if an error happens.
+For a shared lock owner, unlock is done by atomically decrementing
+the reader count by one. If the resultant futex value has no reader
+remaining, the process has to atomically change the futex value from
+FUTEX_SHARED to 0. If that fails, it has to issue a FUTEX_UNLOCK_SHARED
+futex(2) syscall to wake up the top waiter.
+
+A return value of 1 from the FUTEX_UNLOCK or FUTEX_UNLOCK_SHARED
+futex(2) syscall indicates a task has been woken up. The syscall
+returns 0 if no sleeping task is woken. A negative value will be
+returned if an error happens.

-The error number returned by a FUTEX_UNLOCK syscall on an empty futex
-can be used to decide if the TP futex functionality is implemented
-in the kernel. If it is present, an EPERM error will be returned.
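The documentation text quoted above describes a runtime probe: a FUTEX_UNLOCK call on an empty (value 0) futex fails with EPERM when the TP futex code is present. A hedged sketch of that probe follows; the FUTEX_UNLOCK value (14) is taken from the uapi hunk in patch 15/21 and is not a mainline opcode, so on unpatched kernels the call fails with a different errno (typically ENOSYS) and the probe reports no support.

```c
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef FUTEX_UNLOCK
#define FUTEX_UNLOCK	14	/* from the uapi hunk in patch 15/21 */
#endif

/*
 * Returns 1 if the kernel implements the proposed TP futexes,
 * 0 otherwise.  Per the doc above, FUTEX_UNLOCK on an empty futex
 * yields EPERM only when the TP futex code is present.
 */
static int tp_futex_supported(void)
{
	unsigned int word = 0;

	if (syscall(SYS_futex, &word, FUTEX_UNLOCK, 0, NULL, NULL, 0) == -1
	    && errno == EPERM)
		return 1;
	return 0;
}
```

A program built against these patches could use this once at startup and fall back to a wait-wake lock implementation when the probe reports 0.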
[PATCH-tip v5 10/21] TP-futex: Enable robust handling
The TP futexes don't have code to handle the death of futex owners. There are 2 different cases that need to be considered.

As the top waiter gets a reference to the task structure of the futex owner, the task structure will never go away even if the owner dies. When the futex owner died while the top waiter is spinning, the task structure will be marked dead, or the pid won't have a matching task structure if the task died before a reference was taken. Alternatively, if the robust futex attribute is enabled, the FUTEX_OWNER_DIED bit of the futex word may also be set. In all those cases, what the top waiter needs to do is to grab the futex directly. An informational message will be printed to highlight this event.

If the futex owner died while the top waiter is sleeping, we need to make the exit processing code wake up the top waiter. This is done by chaining the futex state object into the pi_state_list of the futex owner before the top waiter sleeps, so that if exit_pi_state_list() is called, the wakeup will happen. The top waiter needs to remove its futex state object from the pi_state_list of the old owner if the ownership changes hands or when the lock is acquired.

Signed-off-by: Waiman Long
---
 kernel/futex.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 79 insertions(+), 6 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 6a59e6d..46a1a4b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1006,7 +1006,7 @@ static struct task_struct * futex_find_get_task(pid_t pid)
 }

 /*
- * This task is holding PI mutexes at exit time => bad.
+ * This task is holding PI or TP mutexes at exit time => bad.
  * Kernel cleans up PI-state, but userspace is likely hosed.
  * (Robust-futex cleanup is separate and might save the day for userspace.)
  */
@@ -1023,12 +1023,31 @@ void exit_pi_state_list(struct task_struct *curr)
 	 * We are a ZOMBIE and nobody can enqueue itself on
 	 * pi_state_list anymore, but we have to be careful
 	 * versus waiters unqueueing themselves:
+	 *
+	 * For TP futexes, the only purpose of showing up in the
+	 * pi_state_list is for this function to wake up the serialization
+	 * mutex owner (state->mutex_owner). We don't actually need to take
+	 * the HB lock. The futex state and task struct won't go away as long
+	 * as we hold the pi_lock.
 	 */
 	raw_spin_lock_irq(&curr->pi_lock);
 	while (!list_empty(head)) {
 		next = head->next;
 		pi_state = list_entry(next, struct futex_state, list);
+
+		if (pi_state->type == TYPE_TP) {
+			struct task_struct *owner;
+
+			owner = READ_ONCE(pi_state->mutex_owner);
+			WARN_ON(list_empty(&pi_state->list));
+			list_del_init(&pi_state->list);
+			pi_state->owner = NULL;
+			if (owner)
+				wake_up_process(owner);
+			continue;
+		}
+
 		key = pi_state->key;
 		hb = hash_futex(&key);
 		raw_spin_unlock_irq(&curr->pi_lock);
@@ -3185,8 +3204,8 @@ int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi)
 		goto retry;

 	/*
-	 * Wake robust non-PI futexes here. The wakeup of
-	 * PI futexes happens in exit_pi_state():
+	 * Wake robust wait-wake futexes here. The wakeup of
+	 * PI and TP futexes happens in exit_pi_state():
 	 */
 	if (!pi && (uval & FUTEX_WAITERS))
 		futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
@@ -3327,6 +3346,12 @@ void exit_robust_list(struct task_struct *curr)
  * Unlike the other futexes, the futex_q structures aren't used. Instead,
  * they will queue up in the serialization mutex of the futex state container
  * queued in the hash bucket.
+ *
+ * To handle the exceptional case that the futex owner died, the robust
+ * futexes list mechanism is used for waking up the sleeping top waiter.
+ * Checks are also made in the futex_spin_on_owner() loop for a dead task
+ * structure or an invalid pid. In both cases, the top waiter will take over
+ * the ownership of the futex.
*/ /** @@ -3515,6 +3540,10 @@ static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval) static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, struct futex_state *state) { +#define OWNER_DEAD_MESSAGE \ + "futex: owner pid %d of TP futex 0x%lx was %s.\n" \ + "\tLock is now acquired by pid %d!\n" + int ret; u32 uval; u32 owner_pid = 0; @@ -3529,13 +3558,47 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, break; if ((uval & FUTEX_TID_MASK) !=
[PATCH-tip v5 19/21] perf bench: Extend mutex/rwlock futex suite to test TP futexes
This patch extends the futex-mutex and futex-rwlock microbenchmarks to test userspace mutexes and rwlocks built on top of the TP futexes. We can then compare the relative performance of those userspace locks based on different types of futexes.

Signed-off-by: Waiman Long
---
 tools/perf/bench/futex-locks.c | 258 +++++++++++++++++++++++++++++++++++++--
 tools/perf/bench/futex.h       |  43 +++++++
 tools/perf/check-headers.sh    |   4 +
 3 files changed, 298 insertions(+), 7 deletions(-)

diff --git a/tools/perf/bench/futex-locks.c b/tools/perf/bench/futex-locks.c
index 18dc71d..42be894 100644
--- a/tools/perf/bench/futex-locks.c
+++ b/tools/perf/bench/futex-locks.c
@@ -8,12 +8,14 @@
  * indication of actual throughput of the mutex code as it may not really
  * need to call into the kernel. Therefore, 3 sets of simple mutex lock and
  * unlock functions are written to implement a mutex lock using the
- * wait-wake (2 versions) and PI futexes respectively. These functions serve
- * as the basis for measuring the locking throughput.
+ * wait-wake, PI and TP futexes respectively. These functions serve as the
+ * basis for measuring the locking throughput.
  *
- * Two sets of simple reader/writer lock and unlock functions are also
- * implemented using the wait-wake futexes as well as the Glibc rwlock
- * respectively for performance measurement purpose.
+ * Three sets of simple reader/writer lock and unlock functions are also
+ * implemented using the wait-wake and TP futexes as well as the Glibc
+ * rwlock respectively for performance measurement purpose. The TP futex
+ * writer lock/unlock functions are the same as that of the mutex functions.
+ * That is not the case for the wait-wake futexes.
 */
#include
@@ -57,6 +59,12 @@ enum {
        STAT_SSLEEPS,   /* # of shared lock sleeps      */
        STAT_EAGAINS,   /* # of EAGAIN errors           */
        STAT_WAKEUPS,   /* # of wakeups (unlock return) */
+       STAT_HANDOFFS,  /* # of lock handoff (TP only)  */
+       STAT_STEALS,    /* # of lock steals (TP only)   */
+       STAT_SLRETRIES, /* # of shared lock retries     */
+       STAT_SURETRIES, /* # of shared unlock retries   */
+       STAT_SALONES,   /* # of shared locking alone ops*/
+       STAT_SGROUPS,   /* # of shared locking group ops*/
        STAT_LOCKERRS,  /* # of exclusive lock errors   */
        STAT_UNLKERRS,  /* # of exclusive unlock errors */
        STAT_SLOCKERRS, /* # of shared lock errors      */
@@ -192,7 +200,7 @@ static inline void stat_inc(int tid __maybe_unused, int item __maybe_unused)
  */
 static const struct option mutex_options[] = {
        OPT_INTEGER ('d', "locklat",, "Specify inter-locking latency (default = 1)"),
-       OPT_STRING  ('f', "ftype",  ,"type", "Specify futex type: WW, PI, all (default)"),
+       OPT_STRING  ('f', "ftype",  ,"type", "Specify futex type: WW, PI, TP, all (default)"),
        OPT_INTEGER ('L', "loadlat",, "Specify load latency (default = 1)"),
        OPT_UINTEGER('r', "runtime",,"Specify runtime (in seconds, default = 10s)"),
        OPT_BOOLEAN ('S', "shared", , "Use shared futexes instead of private ones"),
@@ -210,7 +218,7 @@ static inline void stat_inc(int tid __maybe_unused, int item __maybe_unused)
 static const struct option rwlock_options[] = {
        OPT_INTEGER ('d', "locklat",, "Specify inter-locking latency (default = 1)"),
-       OPT_STRING  ('f', "ftype",  ,"type", "Specify futex type: WW, GC, all (default)"),
+       OPT_STRING  ('f', "ftype",  ,"type", "Specify futex type: WW, TP, GC, all (default)"),
        OPT_INTEGER ('L', "loadlat",, "Specify load latency (default = 1)"),
        OPT_UINTEGER('R', "read-%", , "Specify reader percentage (default 50%)"),
        OPT_UINTEGER('r', "runtime",,"Specify runtime (in seconds, default = 10s)"),
@@ -504,6 +512,72 @@ static void pi_mutex_unlock(futex_t *futex, int tid)
        stat_inc(tid, STAT_UNLOCKS);
 }

+/*
+ * TP futex lock/unlock functions
+ */
+static void tp_mutex_lock(futex_t *futex, int tid)
+{
+       struct timespec stime, etime;
+       futex_t val;
+       int ret;
+
+       val = futex_cmpxchg(futex, 0, thread_id);
+       if (val == 0)
+               return;
+
+       /*
+        * Retry if an error happens
+        */
+       for (;;) {
+               if (timestat) {
+                       clock_gettime(CLOCK_REALTIME, &stime);
+                       ret = futex_lock(futex, NULL, flags);
+                       clock_gettime(CLOCK_REALTIME, &etime);
+                       systime_add(tid, TIME_LOCK, &stime, &etime);
+               } else {
+                       ret = futex_lock(futex, NULL, flags);
+               }
+               stat_inc(tid, STAT_LOCKS);
+               if (ret >= 0)
+                       break;
+               stat_inc(tid,
[PATCH-tip v5 09/21] futex: Introduce throughput-optimized (TP) futexes
A new futex implementation called throughput-optimized (TP) futexes is
introduced. The goal of this new futex type is to maximize locking
throughput at the expense of fairness and deterministic latency. Its
throughput is higher than that of the wait-wake futexes, especially on
systems with a large number of CPUs and for workloads where the lock
owners are unlikely to sleep. The downside of this new futex is an
increase in response-time variance.

The new TP futex type is designed to be used for mutex-like exclusive
userspace locks. It is not as versatile as the wait-wake futexes, which
can also be used for other locking primitives like condition variables
or read-write locks, so it is not a direct replacement for wait-wake
futexes.

A futex locking microbenchmark was written where separate userspace
mutexes are implemented using wait-wake (WW), PI and TP futexes
respectively. This microbenchmark was instructed to run 10s of locking
operations with a load latency of 5 as the critical section on a
2-socket 36-core E5-2699 v3 system (HT off). The system was running a
4.10-based kernel. The results of the benchmark runs were as follows:

                       WW futex     PI futex     TP futex
  Total locking ops  43,607,917      303,047   54,812,918
  Per-thread avg/sec    121,129          841      152,242
  Per-thread min/sec    114,842          841       37,440
  Per-thread max/sec    127,201          842      455,677
  % Stddev                0.50%        0.00%        9.89%
  lock futex calls   10,262,001      303,041      134,989
  unlock futex calls 12,956,025      303,042           21

The TP futexes had better throughput, but also larger variance.
For the TP futexes, the heavy hitters in the perf profile were:

  98.18%  7.25%  [kernel.vmlinux]  [k] futex_lock
  87.31% 87.31%  [kernel.vmlinux]  [k] osq_lock
   3.62%  3.62%  [kernel.vmlinux]  [k] mutex_spin_on_owner

For the WW futexes, the heavy hitters were:

  86.38% 86.38%  [kernel.vmlinux]  [k] queued_spin_lock_slowpath
  48.39%  2.65%  [kernel.vmlinux]  [k] futex_wake
  45.13%  3.57%  [kernel.vmlinux]  [k] futex_wait_setup

By increasing the load latency, the performance discrepancies between WW
and TP futexes actually increase, as shown in the table below.

  Load Latency   WW locking ops   TP locking ops   % change
  ------------   --------------   --------------   --------
        5          43,607,917       54,812,918       +26%
       10          32,822,045       45,685,420       +39%
       20          23,841,429       39,439,048       +65%
       30          18,360,775       31,494,788       +72%
       40          15,383,536       27,687,160       +80%
       50          13,290,368       23,096,937       +74%
      100           8,577,763       14,410,909       +68%
  1us sleep           176,450          179,660        +2%

Signed-off-by: Waiman Long
---
 include/uapi/linux/futex.h |   4 +
 kernel/futex.c             | 559 -
 2 files changed, 560 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0b1f716..7f676ac 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,8 @@
 #define FUTEX_WAKE_BITSET      10
 #define FUTEX_WAIT_REQUEUE_PI  11
 #define FUTEX_CMP_REQUEUE_PI   12
+#define FUTEX_LOCK             13
+#define FUTEX_UNLOCK           14

 #define FUTEX_PRIVATE_FLAG     128
 #define FUTEX_CLOCK_REALTIME   256
@@ -39,6 +41,8 @@
                                         FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE   (FUTEX_CMP_REQUEUE_PI | \
                                         FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PRIVATE     (FUTEX_LOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PRIVATE   (FUTEX_UNLOCK | FUTEX_PRIVATE_FLAG)

 /*
  * Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index 2ced6c0..6a59e6d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -30,6 +30,9 @@
  *  "The futexes are also cursed."
  *  "But they come in a choice of three flavours!"
  *
+ * TP-futex support started by Waiman Long
+ * Copyright (C) 2016-2017 Red Hat, Inc., Waiman Long
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
@@ -195,6 +198,7 @@
 enum futex_type {
        TYPE_PI = 0,
+       TYPE_TP,
 };

 /*
@@ -214,11 +218,23 @@ struct futex_state {
        struct list_head fs_list;

        /*
-        * The PI object:
+        * The PI or mutex object:
         */
-       struct rt_mutex pi_mutex;
+       union {
+               struct rt_mutex pi_mutex;
+               struct mutex mutex;
+       };
+
+       /*
+        * For both TP/PI futexes,
[PATCH-tip v5 20/21] sched, TP-futex: Make wake_up_q() return wakeup count
Unlike wake_up_process(), wake_up_q() doesn't tell us how many tasks have
been woken up. This information can sometimes be useful for tracking
purposes, so wake_up_q() is now modified to return that information.

Signed-off-by: Waiman Long
---
 include/linux/sched/wake_q.h | 2 +-
 kernel/futex.c               | 8 +++-
 kernel/sched/core.c          | 6 --
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 9a32f17..2b2bc9d 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -49,6 +49,6 @@ static inline void wake_q_init(struct wake_q_head *head)
 extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
-extern void wake_up_q(struct wake_q_head *head);
+extern int wake_up_q(struct wake_q_head *head);

 #endif /* _LINUX_SCHED_WAKE_Q_H */
diff --git a/kernel/futex.c b/kernel/futex.c
index 16c63b8..6bf5304 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -4257,11 +4257,9 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags,
 out_put_key:
        put_futex_key();
        if (owner) {
-               /*
-                * No error would have happened if owner defined.
-                */
-               wake_up_q(_q);
-               return ret ? ret : 1;
+               int cnt = wake_up_q(_q);
+
+               return ret ? ret : cnt;
        }

        return ret;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 679093d..668a094 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -452,9 +452,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
        head->lastp = >next;
 }

-void wake_up_q(struct wake_q_head *head)
+int wake_up_q(struct wake_q_head *head)
 {
        struct wake_q_node *node = head->first;
+       int wakecnt = 0;

        while (node != WAKE_Q_TAIL) {
                struct task_struct *task;
@@ -469,9 +470,10 @@ void wake_up_q(struct wake_q_head *head)
                 * wake_up_process() implies a wmb() to pair with the queueing
                 * in wake_q_add() so as not to miss wakeups.
                 */
-               wake_up_process(task);
+               wakecnt += wake_up_process(task);
                put_task_struct(task);
        }
+       return wakecnt;
 }

 /*
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCHv2 0/2] Hardening configs refactor/rename
Hi,

This is a follow up to my proposal to rename/refactor CONFIG_DEBUG_RODATA
and CONFIG_DEBUG_SET_MODULE_RONX. Among other objections, there shouldn't
be 'debug' in the name since these provide necessary kernel protection.

v2 takes a slightly different approach to this per feedback. Patch #1
moves CONFIG_DEBUG_RODATA and CONFIG_DEBUG_SET_MODULE_RONX to a common
arch config. These configs are def_bool y for every arch, except that on
arm CONFIG_DEBUG_RODATA stays optional for !CPU_V7. I think this also
mitigates another concern about changing the name, since these are
basically internal configs at this point and not end-user selectable.
Patch #2 does the rename to something more descriptive. Hopefully this
separates the discussion more clearly into two parts (refactor and
rename).

Thanks,
Laura

Laura Abbott (2):
  arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
  arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX

 Documentation/DocBook/kgdb.tmpl            |  8
 Documentation/security/self-protection.txt |  4 ++--
 arch/Kconfig                               | 28
 arch/arm/Kconfig                           |  3 +++
 arch/arm/Kconfig.debug                     | 11 ---
 arch/arm/configs/aspeed_g4_defconfig       |  3 +--
 arch/arm/configs/aspeed_g5_defconfig       |  3 +--
 arch/arm/include/asm/cacheflush.h          |  2 +-
 arch/arm/kernel/patch.c                    |  4 ++--
 arch/arm/kernel/vmlinux.lds.S              |  8
 arch/arm/mm/Kconfig                        | 14 +-
 arch/arm/mm/init.c                         |  4 ++--
 arch/arm64/Kconfig                         |  5 ++---
 arch/arm64/Kconfig.debug                   | 13 +
 arch/arm64/kernel/insn.c                   |  2 +-
 arch/parisc/Kconfig                        |  1 +
 arch/parisc/Kconfig.debug                  | 11 ---
 arch/parisc/configs/712_defconfig          |  1 -
 arch/parisc/configs/c3000_defconfig        |  1 -
 arch/parisc/mm/init.c                      |  2 +-
 arch/s390/Kconfig                          |  5 ++---
 arch/s390/Kconfig.debug                    |  3 ---
 arch/x86/Kconfig                           |  5 ++---
 arch/x86/Kconfig.debug                     | 11 ---
 include/linux/filter.h                     |  4 ++--
 include/linux/init.h                       |  4 ++--
 include/linux/module.h                     |  2 +-
 init/main.c                                |  4 ++--
 kernel/configs/android-recommended.config  |  2 +-
 kernel/module.c                            |  6 +++---
 kernel/power/hibernate.c                   |  2 +-
 kernel/power/power.h                       |  4 ++--
 kernel/power/snapshot.c                    |  4 ++--
 33 files changed, 75 insertions(+), 109 deletions(-)

--
2.7.4
[PATCHv2 2/2] arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX
Both of these options are poorly named. The features they provide are necessary for system security and should not be considered debug only. Change the name to something that accurately describes what these options do. Signed-off-by: Laura Abbott--- v2: This patch is now doing the renaming of CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX --- Documentation/DocBook/kgdb.tmpl| 8 Documentation/security/self-protection.txt | 4 ++-- arch/Kconfig | 4 ++-- arch/arm/configs/aspeed_g4_defconfig | 3 +-- arch/arm/configs/aspeed_g5_defconfig | 3 +-- arch/arm/include/asm/cacheflush.h | 2 +- arch/arm/kernel/patch.c| 4 ++-- arch/arm/kernel/vmlinux.lds.S | 8 arch/arm/mm/Kconfig| 2 +- arch/arm/mm/init.c | 4 ++-- arch/arm64/Kconfig.debug | 2 +- arch/arm64/kernel/insn.c | 2 +- arch/parisc/configs/712_defconfig | 1 - arch/parisc/configs/c3000_defconfig| 1 - arch/parisc/mm/init.c | 2 +- include/linux/filter.h | 4 ++-- include/linux/init.h | 4 ++-- include/linux/module.h | 2 +- init/main.c| 4 ++-- kernel/configs/android-recommended.config | 2 +- kernel/module.c| 6 +++--- kernel/power/hibernate.c | 2 +- kernel/power/power.h | 4 ++-- kernel/power/snapshot.c| 4 ++-- 24 files changed, 39 insertions(+), 43 deletions(-) diff --git a/Documentation/DocBook/kgdb.tmpl b/Documentation/DocBook/kgdb.tmpl index f3abca7..856ac20 100644 --- a/Documentation/DocBook/kgdb.tmpl +++ b/Documentation/DocBook/kgdb.tmpl @@ -115,12 +115,12 @@ If the architecture that you are using supports the kernel option -CONFIG_DEBUG_RODATA, you should consider turning it off. This +CONFIG_STRICT_KERNEL_RWX, you should consider turning it off. This option will prevent the use of software breakpoints because it marks certain regions of the kernel's memory space as read-only. 
If kgdb supports it for the architecture you are using, you can use hardware breakpoints if you desire to run with the -CONFIG_DEBUG_RODATA option turned on, else you need to turn off +CONFIG_STRICT_KERNEL_RWX option turned on, else you need to turn off this option. @@ -135,7 +135,7 @@ Here is an example set of .config symbols to enable or disable for kgdb: -# CONFIG_DEBUG_RODATA is not set +# CONFIG_STRICT_KERNEL_RWX is not set CONFIG_FRAME_POINTER=y CONFIG_KGDB=y CONFIG_KGDB_SERIAL_CONSOLE=y @@ -166,7 +166,7 @@ Here is an example set of .config symbols to enable/disable kdb: -# CONFIG_DEBUG_RODATA is not set +# CONFIG_STRICT_KERNEL_RWX is not set CONFIG_FRAME_POINTER=y CONFIG_KGDB=y CONFIG_KGDB_SERIAL_CONSOLE=y diff --git a/Documentation/security/self-protection.txt b/Documentation/security/self-protection.txt index 3010576..dd2a3b1 100644 --- a/Documentation/security/self-protection.txt +++ b/Documentation/security/self-protection.txt @@ -51,8 +51,8 @@ kernel, they are implemented in a way where the memory is temporarily made writable during the update, and then returned to the original permissions.) -In support of this are (the poorly named) CONFIG_DEBUG_RODATA and -CONFIG_DEBUG_SET_MODULE_RONX, which seek to make sure that code is not +In support of this are CONFIG_STRICT_KERNEL_RWX and +CONFIG_STRICT_MODULE_RWX, which seek to make sure that code is not writable, data is not executable, and read-only data is neither writable nor executable. 
diff --git a/arch/Kconfig b/arch/Kconfig index 22ee01e..406f6cd 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -787,7 +787,7 @@ config ARCH_NO_STRICT_RWX_DEFAULTS config ARCH_HAS_STRICT_KERNEL_RWX def_bool n -config DEBUG_RODATA +config STRICT_KERNEL_RWX def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS prompt "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS depends on ARCH_HAS_STRICT_KERNEL_RWX @@ -800,7 +800,7 @@ config DEBUG_RODATA config ARCH_HAS_STRICT_MODULE_RWX def_bool n -config DEBUG_SET_MODULE_RONX +config STRICT_MODULE_RWX def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS prompt "Set loadable kenrel module data as NX and text as RO" if ARCH_NO_STRICT_RWX_DEFAULTS depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES diff --git a/arch/arm/configs/aspeed_g4_defconfig b/arch/arm/configs/aspeed_g4_defconfig index ca39c04..beea2cc 100644 --- a/arch/arm/configs/aspeed_g4_defconfig +++ b/arch/arm/configs/aspeed_g4_defconfig @@ -25,7 +25,6 @@ CONFIG_MODULE_UNLOAD=y # CONFIG_ARCH_MULTI_V7 is not set CONFIG_ARCH_ASPEED=y CONFIG_MACH_ASPEED_G4=y
[PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
There are multiple architectures that support CONFIG_DEBUG_RODATA and
CONFIG_SET_MODULE_RONX. These options also now have the ability to be
turned off at runtime. Move these to an architecture independent
location and make these options def_bool y for almost all of those
arches.

Signed-off-by: Laura Abbott
---
v2: This patch is now doing just the refactor of the existing config
options.
---
 arch/Kconfig              | 28
 arch/arm/Kconfig          |  3 +++
 arch/arm/Kconfig.debug    | 11 ---
 arch/arm/mm/Kconfig       | 12
 arch/arm64/Kconfig        |  5 ++---
 arch/arm64/Kconfig.debug  | 11 ---
 arch/parisc/Kconfig       |  1 +
 arch/parisc/Kconfig.debug | 11 ---
 arch/s390/Kconfig         |  5 ++---
 arch/s390/Kconfig.debug   |  3 ---
 arch/x86/Kconfig          |  5 ++---
 arch/x86/Kconfig.debug    | 11 ---
 12 files changed, 38 insertions(+), 68 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 99839c2..22ee01e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -781,4 +781,32 @@ config VMAP_STACK
          the stack to map directly to the KASAN shadow map using a formula
          that is incorrect if the stack is in vmalloc space.

+config ARCH_NO_STRICT_RWX_DEFAULTS
+       def_bool n
+
+config ARCH_HAS_STRICT_KERNEL_RWX
+       def_bool n
+
+config DEBUG_RODATA
+       def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
+       prompt "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS
+       depends on ARCH_HAS_STRICT_KERNEL_RWX
+       help
+         If this is set, kernel text and rodata memory will be made read-only,
+         and non-text memory will be made non-executable. This provides
+         protection against certain security exploits (e.g. executing the heap
+         or modifying text)
+
+config ARCH_HAS_STRICT_MODULE_RWX
+       def_bool n
+
+config DEBUG_SET_MODULE_RONX
+       def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
+       prompt "Set loadable kernel module data as NX and text as RO" if ARCH_NO_STRICT_RWX_DEFAULTS
+       depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES
+       help
+         If this is set, module text and rodata memory will be made read-only,
+         and non-text memory will be made non-executable. This provides
+         protection against certain security exploits (e.g. writing to text)
+
 source "kernel/gcov/Kconfig"

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 186c4c2..aa73ca8 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -4,10 +4,13 @@ config ARM
        select ARCH_CLOCKSOURCE_DATA
        select ARCH_HAS_DEVMEM_IS_ALLOWED
        select ARCH_HAS_ELF_RANDOMIZE
+       select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
+       select ARCH_HAS_STRICT_MODULE_RWX if MMU
        select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
        select ARCH_HAVE_CUSTOM_GPIO_H
        select ARCH_HAS_GCOV_PROFILE_ALL
        select ARCH_MIGHT_HAVE_PC_PARPORT
+       select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7
        select ARCH_SUPPORTS_ATOMIC_RMW
        select ARCH_USE_BUILTIN_BSWAP
        select ARCH_USE_CMPXCHG_LOCKREF
diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug
index d83f7c3..426d271 100644
--- a/arch/arm/Kconfig.debug
+++ b/arch/arm/Kconfig.debug
@@ -1738,17 +1738,6 @@ config PID_IN_CONTEXTIDR
          additional instructions during context switch. Say Y here only if you
          are planning to use hardware trace tools with this kernel.

-config DEBUG_SET_MODULE_RONX
-       bool "Set loadable kernel module data as NX and text as RO"
-       depends on MODULES && MMU
-       ---help---
-         This option helps catch unintended modifications to loadable
-         kernel module's text and read-only data. It also prevents execution
-         of module data. Such protection may interfere with run-time code
-         patching and dynamic kernel tracing - and they might also protect
-         against certain classes of kernel exploits.
-         If in doubt, say "N".
-
 source "drivers/hwtracing/coresight/Kconfig"

 endmenu
diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig
index f68e8ec..419a035 100644
--- a/arch/arm/mm/Kconfig
+++ b/arch/arm/mm/Kconfig
@@ -1051,18 +1051,6 @@ config ARCH_SUPPORTS_BIG_ENDIAN
          This option specifies the architecture can support big endian
          operation.

-config DEBUG_RODATA
-       bool "Make kernel text and rodata read-only"
-       depends on MMU && !XIP_KERNEL
-       default y if CPU_V7
-       help
-         If this is set, kernel text and rodata memory will be made
-         read-only, and non-text kernel memory will be made non-executable.
-         The tradeoff is that each region is padded to section-size (1MiB)
-         boundaries (because their permissions are different and splitting
-         the 1M pages into 4K ones causes TLB performance problems), which
-         can waste memory.
-
 config DEBUG_ALIGN_RODATA
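The three cases that come up in the follow-up discussion — unsupported,
mandatory, and optional with an arch-chosen default — can also be expressed
without the double-negative ARCH_NO_STRICT_RWX_DEFAULTS symbol that Russell
objects to, by splitting "the arch offers a choice" from "the arch's
suggested default". A sketch of that shape (symbol names are illustrative,
not taken from this series):

```kconfig
config ARCH_OPTIONAL_KERNEL_RWX
	def_bool n

config ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	def_bool n

config ARCH_HAS_STRICT_KERNEL_RWX
	def_bool n

config DEBUG_RODATA
	bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX
	depends on ARCH_HAS_STRICT_KERNEL_RWX
	default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
```

An arch that selects nothing gets no option; one that selects only
ARCH_HAS_STRICT_KERNEL_RWX gets the protection forced on; one that also
selects ARCH_OPTIONAL_KERNEL_RWX gets a prompt, with
ARCH_OPTIONAL_KERNEL_RWX_DEFAULT supplying the suggested default (e.g. y
for ARMv7, n for Russell's development systems).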
[PATCH] Add 'nodoc' option
I've written this patch, and it seems to work, but I don't really know
what I'm doing and I fear I may have broken something. I don't know what:

    required_arguments = 1
    optional_arguments = 4

mean, so I don't know whether I should have adjusted them when adding
this new option.

>From 26896bd43618e30cb3563114663ad6c39cef7dc6 Mon Sep 17 00:00:00 2001
From: Matthew Wilcox
Date: Thu, 26 Jan 2017 11:43:24 -0500
Subject: [PATCH] kerneldoc: Add :nodoc: option

This is functionality the perl script already has; we just need to
expose it to our rst files.
---
 Documentation/sphinx/kerneldoc.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/sphinx/kerneldoc.py b/Documentation/sphinx/kerneldoc.py
index d15e07f36881..d86d88da1d75 100644
--- a/Documentation/sphinx/kerneldoc.py
+++ b/Documentation/sphinx/kerneldoc.py
@@ -47,6 +47,7 @@ class KernelDocDirective(Directive):
     optional_arguments = 4
     option_spec = {
         'doc': directives.unchanged_required,
+        'nodoc': directives.unchanged,
         'functions': directives.unchanged_required,
         'export': directives.unchanged,
         'internal': directives.unchanged,
@@ -74,6 +75,8 @@ class KernelDocDirective(Directive):
             export_file_patterns = str(self.options.get('internal')).split()
         elif 'doc' in self.options:
             cmd += ['-function', str(self.options.get('doc'))]
+        elif 'nodoc' in self.options:
+            cmd += ['-no-doc-sections']
+        elif 'functions' in self.options:
             for f in str(self.options.get('functions')).split():
                 cmd += ['-function', f]
--
2.11.0
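To answer the question in the cover text: in Docutils, required_arguments
and optional_arguments count the whitespace-separated positional arguments
on the directive line itself (e.g. the filename in ".. kernel-doc:: file"),
while option_spec entries such as the new :nodoc: are field-list options
parsed separately, so adding one does not require touching those counts. A
pure-Python model of that distinction (a sketch; the real logic lives in
docutils.parsers.rst.Directive):

```python
class DirectiveModel:
    """Minimal stand-in for how Docutils treats directive arguments."""
    required_arguments = 1   # ".. kernel-doc:: <file>" needs one argument
    optional_arguments = 4   # up to 4 further positional arguments allowed
    # ":nodoc:" lives here, as an option rather than an argument, so the
    # argument counts above are unaffected by adding it:
    option_spec = {"doc": str, "nodoc": lambda arg: arg}

    def check_args(self, args):
        """True iff the positional argument count is acceptable."""
        return (self.required_arguments
                <= len(args)
                <= self.required_arguments + self.optional_arguments)


if __name__ == "__main__":
    model = DirectiveModel()
    print(model.check_args(["kernel/futex.c"]))   # one required argument
    print(model.check_args([]))                   # missing required argument
```

So the patch as posted is fine on this point: only option_spec needed the
new entry.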
Re: [PATCH RESEND] Documentation/sphinx: prevent generation of .pyc files in the source tree
On Wed, 01 Feb 2017, Jonathan Corbet wrote:
> On Tue, 31 Jan 2017 20:18:05 +0200
> Jani Nikula wrote:
>
>> Considering the small amount of python code to compile (assuming sphinx
>> itself has .pyc around), the impact on build is negligible.
>
> Hey... don't you know that performance-impacting patches need benchmarks?

Heh, I was briefly worried at this point! :)

> Sphinx-only htmldocs build before:
>
> real    2m26.063s
> user    2m19.184s
> sys     0m5.429s
>
> After:
>
> real    2m26.342s
> user    2m20.445s
> sys     0m5.611s
>
> You owe me 0.279 seconds :)
>
> (Applied, thanks).
>
> jon

--
Jani Nikula, Intel Open Source Technology Center
Re: [PATCH 1/1] MicroSemi Switchtec management interface driver
On 2 February 2017 at 16:37, Logan Gunthorpe wrote:
>
> On 01/02/17 05:10 AM, Emil Velikov wrote:
>> You can keep it roughly as-is if you're ~reasonably certain one won't
>> change it in the future.
>
> I've made the change anyway. I think it's better now.
>
>> Some teams frown upon adding new IOCTL(s) where existing ones can be
>> made backward/forward compatible.
>> I'm not fully aware of the general direction/consensus on the topic,
>> so it might be a minority.
>
> Sure, I just don't know what might be needed in the future so it's hard
> to add a version or flags ioctl now.
>
Yes, knowing how things will need to change in the future is hard. That's
why the documentation suggests adding a flag to the ioctl structs. It [the
flag] might imply certain functional/implementation change, support for
new/deprecation of old features and others.

>> On the other hand, reading through sysfs for module version in order
>> to use IOCTL A or B sounds quite hacky. Do you have an example where
>> this is used or pointed out as good approach ?
>
> I don't know of anything doing it that way now. But it sure would be
> easy and make a bit of sense. (We'd actually use the module version for
> something useful.) Either way, it would really depend on if and how
> things change in the future. The point is there are options to expand if
> needed.
>
The fact that nobody else is doing such a thing should ring a bell ;-)
It's not my call, but if it were I'd stick with the existing approach and
not "reinvent the wheel", so to speak.

Thanks,
Emil
Re: [PATCH v8 07/12] dt-bindings: i2c: i2c-mux-simple: document i2c-mux-simple bindings
On 2017-02-02 17:08, Rob Herring wrote:
> On Tue, Jan 31, 2017 at 1:36 AM, Peter Rosin wrote:
>> If you see this new driver as something that is superseding the existing
>> i2c-mux-gpio driver, I'm sad to inform you that the code is simply
>> not there. i2c-mux-gpio has acpi support and users may provide platform
>> data from code. The existing i2c-mux-gpio driver also has the below
>> mentioned locking heuristic. Adding all those things to the new driver
>> would make it big and unwieldy and having even more unwanted tentacles
>> into other subsystems. And why should it be only i2c-mux-gpio that is
>> merged into this new i2c-mux driver? Why not implement a mux-pinctrl
>> driver for the new mux subsystem and then merge i2c-mux-pinctrl as well?
>> I say no, that way lies madness.
>
> Sounds like a good idea to me. I'm not saying you need to merge any of
> them right now though (that's Wolfram's call).

If we're pedantic I probably have some stake in it too, being the i2c-mux
maintainer and all. But, agreed, I arrived quite late to the Linux kernel
party and my opinion might perhaps not carry all that much weight...

> None of this has anything to do with the binding though. Compatible
> strings should be specific. That's not up for debate. Whether the

Ok, I'm going to focus on the compatible string for a minute and leave
the implementation details for some other discussion.

> driver bound to a compatible string is common or specific to that
> compatible string is completely up to the OS. That decision can change
> over time, but the binding should not.

So, there's the existing compatible "i2c-mux-gpio" ("i2c-gpio-mux" is
wrong) that you seem to suggest is what I should stick to. I object to
that. As you say, the bindings and compatible strings should describe
hardware, and you also state they should be specific. I agree. But, why
are you then apparently suggesting (by implication) that for this
(hypothetical) hardware...

 .------.                    .---------.
 |SDA0  |--------------------|         |
 |SCL0  |------------------. |         |
 |      |                  '-| adg792a |
 |ADC0  |--------------------|D1    S1A|--- signal A
 |      |                    |      S1B|--- signal B
 |      |                    |      S1C|--- signal C
 |      |                    |      S1D|--- signal D
 |      |                    |         |
 |SDA1  |---+----------------|D2    S2A|--- i2c segment foo A
 |SCL1  |-. |                |      S2B|--- i2c segment foo B
 '------' | |                |      S2C|--- i2c segment foo C
          | |                |      S2D|--- i2c segment foo D
          | |                |         |
          | '----------------|D3    S3A|--- i2c segment bar A
          |                  |      S3B|--- i2c segment bar B
          |                  |      S3C|--- i2c segment bar C
          |                  |      S3D|--- i2c segment bar D
          |                  '---------'
          |    A B C D   A B C D          (feed SCL1 to each of
          |    | | | |   | | | |           the 8 muxed segments)
          '----+-+-+-+---+-+-+-'

...the devicetree should be like below?

	{
		mux: mux-controller@50 {
			compatible = "adi,adg792a";
			reg = <0x50>;
			#mux-control-cells = <1>;
		};
	};

	adc-mux {
		compatible = "io-channel-mux";
		io-channels = < 0>;
		io-channel-names = "parent";
		mux-controls = < 0>;
		...
	};

	i2c-mux-foo {
		compatible = "i2c-mux-gpio";
		i2c-parent = <>;
		mux-controls = < 1>;
		...
	};

	i2c-mux-bar {
		compatible = "i2c-mux-gpio";
		i2c-parent = <>;
		mux-controls = < 2>;
		...
	};

There must be some disconnect, because those "i2c-mux-gpio" compatible
strings are just dead wrong. There simply are no gpio pins involved at
all and that "gpio" suffix is just totally out of place.

So, since you are not happy with "i2c-mux-simple", "i2c-mux-generic" or
"i2c-mux" that I have suggested, can you please come up with something
that is good enough for the above? Or, are you /really/ suggesting
"i2c-mux-gpio"?

Cheers,
peda