Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common

2017-02-03 Thread Kees Cook
On Fri, Feb 3, 2017 at 2:28 PM, Russell King - ARM Linux
 wrote:
> On Fri, Feb 03, 2017 at 01:08:40PM -0800, Kees Cook wrote:
>> On Fri, Feb 3, 2017 at 12:29 PM, Russell King - ARM Linux
>>  wrote:
>> > On Fri, Feb 03, 2017 at 11:45:56AM -0800, Kees Cook wrote:
>> >> On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbott  wrote:
>> >> > diff --git a/arch/Kconfig b/arch/Kconfig
>> >> > index 99839c2..22ee01e 100644
>> >> > --- a/arch/Kconfig
>> >> > +++ b/arch/Kconfig
>> >> > @@ -781,4 +781,32 @@ config VMAP_STACK
>> >> >   the stack to map directly to the KASAN shadow map using a formula
>> >> >   that is incorrect if the stack is in vmalloc space.
>> >> >
>> >> > +config ARCH_NO_STRICT_RWX_DEFAULTS
>> >> > +   def_bool n
>> >> > +
>> >> > +config ARCH_HAS_STRICT_KERNEL_RWX
>> >> > +   def_bool n
>> >> > +
>> >> > +config DEBUG_RODATA
>> >> > +   def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
>> >> > +   prompt "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS
>> >>
>> >> Ah! Yes, perfect. I totally forgot about using conditional "prompt"
>> >> lines. Nice!
>> >
>> > It's no different from the more usual:
>> >
>> > bool "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS
>> > default y if !ARCH_NO_STRICT_RWX_DEFAULTS
>> > depends on ARCH_HAS_STRICT_KERNEL_RWX
>> >
>> > But... I really don't like this - way too many negations and negatives
>> > which make it difficult to figure out what's going on here.
>> >
>> > The situation we have today is:
>> >
>> > -config DEBUG_RODATA
>> > -   bool "Make kernel text and rodata read-only"
>> > -   depends on MMU && !XIP_KERNEL
>> > -   default y if CPU_V7
>> >
>> > which is "allow the user to select DEBUG_RODATA if building a MMU non-XIP
>> > kernel", suggesting that the user turns it on for ARMv7 CPUs.
>> >
>> > That changes with this and the above:
>> >
>> > +   select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
>> > +   select ARCH_HAS_STRICT_MODULE_RWX if MMU
>> > +   select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7
>> >
>> > This means that ARCH_HAS_STRICT_KERNEL_RWX is set for a MMU non-XIP
>> > kernel, which carries the same pre-condition for DEBUG_RODATA - no
>> > problem there.
>> >
>> > However, ARCH_NO_STRICT_RWX_DEFAULTS is set for non-ARMv7 CPUs, which
>> > means the "Make kernel text and rodata read-only" prompt _is_ provided
>> > for those.  However, for all ARMv7 systems, we go from "suggesting that
>> > the user enables the option" to "you don't have a choice, you get this
>> > whether you want it or not."
>> >
>> > I'd prefer to keep it off for my development systems, where I don't
>> > care about kernel security.  If we don't wish to do that as a general
>> > rule, can we make it dependent on EMBEDDED?
>> >
>> > On ARM it can add up to 4MB to the kernel image: there _will_ be
>> > about 1MB before the .text section; the padding between __modver and
>> > __ex_table, which for me is around 626k; the padding between .notes
>> > and the start of the init sections at .vectors (the space between
>> > __ex_table and the end of .notes is only 4124 bytes, which gets
>> > padded up to 1MB); and lastly the padding between the .init section
>> > and the data section (for me around 593k).  This all adds up to an
>> > increase in kernel image size of 3.2MB on 14.2MB - an increase of 22%.
>> >
>> > So no, I'm really not happy with that.
>>
>> Ah yeah, good point. We have three cases: unsupported, mandatory, and
>> optional, and we also need a way to set the default for the optional
>> case. Maybe something like this?
>>
>> config STRICT_KERNEL_RWX
>>   bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX
>>   depends on ARCH_HAS_STRICT_KERNEL_RWX
>>   default ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
>>
>> unsupported:
>> !ARCH_HAS_STRICT_KERNEL_RWX
>>
>> mandatory:
>> ARCH_HAS_STRICT_KERNEL_RWX
>> !ARCH_OPTIONAL_KERNEL_RWX
>>
>> optional:
>> ARCH_HAS_STRICT_KERNEL_RWX
>> ARCH_OPTIONAL_KERNEL_RWX
>> with default controlled by ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
>>
>> Then arm is:
>>   select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
>>   select ARCH_HAS_STRICT_MODULE_RWX if MMU
>>   select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
>>   select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT if CPU_V7
>>
>> x86 and arm64 are:
>>   select ARCH_HAS_STRICT_KERNEL_RWX
>>   select ARCH_HAS_STRICT_MODULE_RWX
>
> Looks to me like it will do the job.
>
> In passing, I noticed that, on ARM:
>
>   3 .rodata   002212b4  c0703000  c0703000  00703000  2**6
>   CONTENTS, ALLOC, LOAD, DATA
>
> a lack of READONLY there - which suggests something in this lot isn't
> actually read-only data:
>
> .rodata   : AT(ADDR(.rodata) - LOAD_OFFSET) {   \
> VMLINUX_SYMBOL(__start_rodata) = .; \
> 

Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common

2017-02-03 Thread Russell King - ARM Linux
On Fri, Feb 03, 2017 at 01:08:40PM -0800, Kees Cook wrote:
> On Fri, Feb 3, 2017 at 12:29 PM, Russell King - ARM Linux
>  wrote:
> > On Fri, Feb 03, 2017 at 11:45:56AM -0800, Kees Cook wrote:
> >> On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbott  wrote:
> >> > diff --git a/arch/Kconfig b/arch/Kconfig
> >> > index 99839c2..22ee01e 100644
> >> > --- a/arch/Kconfig
> >> > +++ b/arch/Kconfig
> >> > @@ -781,4 +781,32 @@ config VMAP_STACK
> >> >   the stack to map directly to the KASAN shadow map using a formula
> >> >   that is incorrect if the stack is in vmalloc space.
> >> >
> >> > +config ARCH_NO_STRICT_RWX_DEFAULTS
> >> > +   def_bool n
> >> > +
> >> > +config ARCH_HAS_STRICT_KERNEL_RWX
> >> > +   def_bool n
> >> > +
> >> > +config DEBUG_RODATA
> >> > +   def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
> >> > +   prompt "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS
> >>
> >> Ah! Yes, perfect. I totally forgot about using conditional "prompt"
> >> lines. Nice!
> >
> > It's no different from the more usual:
> >
> > bool "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS
> > default y if !ARCH_NO_STRICT_RWX_DEFAULTS
> > depends on ARCH_HAS_STRICT_KERNEL_RWX
> >
> > But... I really don't like this - way too many negations and negatives
> > which make it difficult to figure out what's going on here.
> >
> > The situation we have today is:
> >
> > -config DEBUG_RODATA
> > -   bool "Make kernel text and rodata read-only"
> > -   depends on MMU && !XIP_KERNEL
> > -   default y if CPU_V7
> >
> > which is "allow the user to select DEBUG_RODATA if building a MMU non-XIP
> > kernel", suggesting that the user turns it on for ARMv7 CPUs.
> >
> > That changes with this and the above:
> >
> > +   select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
> > +   select ARCH_HAS_STRICT_MODULE_RWX if MMU
> > +   select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7
> >
> > This means that ARCH_HAS_STRICT_KERNEL_RWX is set for a MMU non-XIP
> > kernel, which carries the same pre-condition for DEBUG_RODATA - no
> > problem there.
> >
> > However, ARCH_NO_STRICT_RWX_DEFAULTS is set for non-ARMv7 CPUs, which
> > means the "Make kernel text and rodata read-only" prompt _is_ provided
> > for those.  However, for all ARMv7 systems, we go from "suggesting that
> > the user enables the option" to "you don't have a choice, you get this
> > whether you want it or not."
> >
> > I'd prefer to keep it off for my development systems, where I don't
> > care about kernel security.  If we don't wish to do that as a general
> > rule, can we make it dependent on EMBEDDED?
> >
> > On ARM it can add up to 4MB to the kernel image: there _will_ be
> > about 1MB before the .text section; the padding between __modver and
> > __ex_table, which for me is around 626k; the padding between .notes
> > and the start of the init sections at .vectors (the space between
> > __ex_table and the end of .notes is only 4124 bytes, which gets
> > padded up to 1MB); and lastly the padding between the .init section
> > and the data section (for me around 593k).  This all adds up to an
> > increase in kernel image size of 3.2MB on 14.2MB - an increase of 22%.
> >
> > So no, I'm really not happy with that.
> 
> Ah yeah, good point. We have three cases: unsupported, mandatory, and
> optional, and we also need a way to set the default for the optional
> case. Maybe something like this?
> 
> config STRICT_KERNEL_RWX
>   bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX
>   depends on ARCH_HAS_STRICT_KERNEL_RWX
>   default ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
> 
> unsupported:
> !ARCH_HAS_STRICT_KERNEL_RWX
> 
> mandatory:
> ARCH_HAS_STRICT_KERNEL_RWX
> !ARCH_OPTIONAL_KERNEL_RWX
> 
> optional:
> ARCH_HAS_STRICT_KERNEL_RWX
> ARCH_OPTIONAL_KERNEL_RWX
> with default controlled by ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
> 
> Then arm is:
>   select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
>   select ARCH_HAS_STRICT_MODULE_RWX if MMU
>   select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
>   select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT if CPU_V7
> 
> x86 and arm64 are:
>   select ARCH_HAS_STRICT_KERNEL_RWX
>   select ARCH_HAS_STRICT_MODULE_RWX

Looks to me like it will do the job.

In passing, I noticed that, on ARM:

  3 .rodata   002212b4  c0703000  c0703000  00703000  2**6
  CONTENTS, ALLOC, LOAD, DATA

a lack of READONLY there - which suggests something in this lot isn't
actually read-only data:

.rodata   : AT(ADDR(.rodata) - LOAD_OFFSET) {   \
VMLINUX_SYMBOL(__start_rodata) = .; \
*(.rodata) *(.rodata.*) \
RO_AFTER_INIT_DATA  /* Read only after init */  \
KEEP(*(__vermagic)) /* Kernel version magic */  \
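One quick way to audit this from a build is to scan `objdump -h` output for allocated, loaded sections whose flags line lacks READONLY. A small sketch follows; it parses canned output shaped like the `.rodata` entry above, since no vmlinux is at hand here, and the `.text` sample line is hypothetical:

```python
# Sketch: parse `objdump -h`-style output and report sections that are
# loaded into the image but not flagged READONLY. The .rodata sample
# mirrors the entry quoted above; the .text line is made-up data.
sample = """\
  3 .rodata       002212b4  c0703000  c0703000  00703000  2**6
                  CONTENTS, ALLOC, LOAD, DATA
  4 .text         00400000  c0208000  c0208000  00208000  2**5
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
"""

def writable_sections(objdump_h: str) -> list:
    lines = objdump_h.splitlines()
    hits = []
    # objdump -h prints a header line per section, then a flags line
    for header, flags_line in zip(lines[::2], lines[1::2]):
        name = header.split()[1]                      # e.g. ".rodata"
        flags = {f.strip() for f in flags_line.split(",")}
        if "LOAD" in flags and "READONLY" not in flags:
            hits.append(name)
    return hits

print(writable_sections(sample))   # -> ['.rodata']
```

Running the same loop over real `objdump -h vmlinux` output (after trimming objdump's leading banner lines) points at which output section lost the READONLY flag; finding the offending input section still needs the linker map.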

Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common

2017-02-03 Thread Kees Cook
On Fri, Feb 3, 2017 at 12:29 PM, Russell King - ARM Linux
 wrote:
> On Fri, Feb 03, 2017 at 11:45:56AM -0800, Kees Cook wrote:
>> On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbott  wrote:
>> > diff --git a/arch/Kconfig b/arch/Kconfig
>> > index 99839c2..22ee01e 100644
>> > --- a/arch/Kconfig
>> > +++ b/arch/Kconfig
>> > @@ -781,4 +781,32 @@ config VMAP_STACK
>> >   the stack to map directly to the KASAN shadow map using a formula
>> >   that is incorrect if the stack is in vmalloc space.
>> >
>> > +config ARCH_NO_STRICT_RWX_DEFAULTS
>> > +   def_bool n
>> > +
>> > +config ARCH_HAS_STRICT_KERNEL_RWX
>> > +   def_bool n
>> > +
>> > +config DEBUG_RODATA
>> > +   def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
>> > +   prompt "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS
>>
>> Ah! Yes, perfect. I totally forgot about using conditional "prompt"
>> lines. Nice!
>
> It's no different from the more usual:
>
> bool "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS
> default y if !ARCH_NO_STRICT_RWX_DEFAULTS
> depends on ARCH_HAS_STRICT_KERNEL_RWX
>
> But... I really don't like this - way too many negations and negatives
> which make it difficult to figure out what's going on here.
>
> The situation we have today is:
>
> -config DEBUG_RODATA
> -   bool "Make kernel text and rodata read-only"
> -   depends on MMU && !XIP_KERNEL
> -   default y if CPU_V7
>
> which is "allow the user to select DEBUG_RODATA if building a MMU non-XIP
> kernel", suggesting that the user turns it on for ARMv7 CPUs.
>
> That changes with this and the above:
>
> +   select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
> +   select ARCH_HAS_STRICT_MODULE_RWX if MMU
> +   select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7
>
> This means that ARCH_HAS_STRICT_KERNEL_RWX is set for a MMU non-XIP
> kernel, which carries the same pre-condition for DEBUG_RODATA - no
> problem there.
>
> However, ARCH_NO_STRICT_RWX_DEFAULTS is set for non-ARMv7 CPUs, which
> means the "Make kernel text and rodata read-only" prompt _is_ provided
> for those.  However, for all ARMv7 systems, we go from "suggesting that
> the user enables the option" to "you don't have a choice, you get this
> whether you want it or not."
>
> I'd prefer to keep it off for my development systems, where I don't
> care about kernel security.  If we don't wish to do that as a general
> rule, can we make it dependent on EMBEDDED?
>
> On ARM it can add up to 4MB to the kernel image: there _will_ be
> about 1MB before the .text section; the padding between __modver and
> __ex_table, which for me is around 626k; the padding between .notes
> and the start of the init sections at .vectors (the space between
> __ex_table and the end of .notes is only 4124 bytes, which gets
> padded up to 1MB); and lastly the padding between the .init section
> and the data section (for me around 593k).  This all adds up to an
> increase in kernel image size of 3.2MB on 14.2MB - an increase of 22%.
>
> So no, I'm really not happy with that.

Ah yeah, good point. We have three cases: unsupported, mandatory, and
optional, and we also need a way to set the default for the optional
case. Maybe something like this?

config STRICT_KERNEL_RWX
  bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX
  depends on ARCH_HAS_STRICT_KERNEL_RWX
  default ARCH_OPTIONAL_KERNEL_RWX_DEFAULT

unsupported:
!ARCH_HAS_STRICT_KERNEL_RWX

mandatory:
ARCH_HAS_STRICT_KERNEL_RWX
!ARCH_OPTIONAL_KERNEL_RWX

optional:
ARCH_HAS_STRICT_KERNEL_RWX
ARCH_OPTIONAL_KERNEL_RWX
with default controlled by ARCH_OPTIONAL_KERNEL_RWX_DEFAULT

Then arm is:
  select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
  select ARCH_HAS_STRICT_MODULE_RWX if MMU
  select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
  select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT if CPU_V7

x86 and arm64 are:
  select ARCH_HAS_STRICT_KERNEL_RWX
  select ARCH_HAS_STRICT_MODULE_RWX

?
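The three cases can be sanity-checked with a toy model of the proposed Kconfig semantics (plain Python, not kconfig; the mandatory case is assumed to force the option on, since no prompt is offered there):

```python
# Toy model of the proposed STRICT_KERNEL_RWX semantics. `user` stands
# for an explicit choice at the prompt; None means "left at default".
# Assumption: in the mandatory case the symbol is always enabled.
def strict_kernel_rwx(has_strict, optional, optional_default, user=None):
    if not has_strict:
        return False                 # unsupported: depends on fails
    if not optional:
        return True                  # mandatory: no prompt offered
    if user is not None:
        return user                  # optional: the user decides
    return optional_default          # optional: arch-chosen default

# arm, ARMv7, MMU, non-XIP: optional, defaulting on
assert strict_kernel_rwx(True, True, True) is True
# ...but a development box can still turn it off (the objection above)
assert strict_kernel_rwx(True, True, True, user=False) is False
# arm, pre-v7: optional, defaulting off
assert strict_kernel_rwx(True, True, False) is False
# x86 / arm64: supported everywhere, not optional
assert strict_kernel_rwx(True, False, False) is True
# arm XIP kernel: unsupported
assert strict_kernel_rwx(False, False, False) is False
```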

-Kees

-- 
Kees Cook
Pixel Security
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common

2017-02-03 Thread Russell King - ARM Linux
On Fri, Feb 03, 2017 at 11:45:56AM -0800, Kees Cook wrote:
> On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbott  wrote:
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 99839c2..22ee01e 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -781,4 +781,32 @@ config VMAP_STACK
> >   the stack to map directly to the KASAN shadow map using a formula
> >   that is incorrect if the stack is in vmalloc space.
> >
> > +config ARCH_NO_STRICT_RWX_DEFAULTS
> > +   def_bool n
> > +
> > +config ARCH_HAS_STRICT_KERNEL_RWX
> > +   def_bool n
> > +
> > +config DEBUG_RODATA
> > +   def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
> > +   prompt "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS
> 
> Ah! Yes, perfect. I totally forgot about using conditional "prompt"
> lines. Nice!

It's no different from the more usual:

bool "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS
default y if !ARCH_NO_STRICT_RWX_DEFAULTS
depends on ARCH_HAS_STRICT_KERNEL_RWX

But... I really don't like this - way too many negations and negatives
which make it difficult to figure out what's going on here.

The situation we have today is:

-config DEBUG_RODATA
-   bool "Make kernel text and rodata read-only"
-   depends on MMU && !XIP_KERNEL
-   default y if CPU_V7

which is "allow the user to select DEBUG_RODATA if building a MMU non-XIP
kernel", suggesting that the user turns it on for ARMv7 CPUs.

That changes with this and the above:

+   select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
+   select ARCH_HAS_STRICT_MODULE_RWX if MMU
+   select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7

This means that ARCH_HAS_STRICT_KERNEL_RWX is set for a MMU non-XIP
kernel, which carries the same pre-condition for DEBUG_RODATA - no
problem there.

However, ARCH_NO_STRICT_RWX_DEFAULTS is set for non-ARMv7 CPUs, which
means the "Make kernel text and rodata read-only" prompt _is_ provided
for those.  However, for all ARMv7 systems, we go from "suggesting that
the user enables the option" to "you don't have a choice, you get this
whether you want it or not."

I'd prefer to keep it off for my development systems, where I don't
care about kernel security.  If we don't wish to do that as a general
rule, can we make it dependent on EMBEDDED?

On ARM it can add up to 4MB to the kernel image: there _will_ be
about 1MB before the .text section; the padding between __modver and
__ex_table, which for me is around 626k; the padding between .notes
and the start of the init sections at .vectors (the space between
__ex_table and the end of .notes is only 4124 bytes, which gets
padded up to 1MB); and lastly the padding between the .init section
and the data section (for me around 593k).  This all adds up to an
increase in kernel image size of 3.2MB on 14.2MB - an increase of 22%.

So no, I'm really not happy with that.
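The arithmetic behind those figures checks out; a rough tally (in KiB, treating the quoted numbers as estimates from that particular build):

```python
# Rough tally of the padding figures above, in KiB. The individual
# numbers are the build-specific estimates quoted in the message; the
# 1 MiB figures come from ARM's section-alignment requirement.
MiB = 1024                          # KiB per MiB
before_text = 1 * MiB               # hole before .text
modver_gap  = 626                   # __modver .. __ex_table padding
notes_gap   = MiB - 4124 // 1024    # ~4 KiB of notes padded out to 1 MiB
init_gap    = 593                   # .init .. .data padding

total_kib = before_text + modver_gap + notes_gap + init_gap
print(round(total_kib / MiB, 1))            # ~3.2 MiB of padding
print(round(total_kib / MiB / 14.2 * 100))  # ~22% of a 14.2 MiB image
```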

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [PATCHv2 2/2] arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX

2017-02-03 Thread Kees Cook
On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbott  wrote:
>
> Both of these options are poorly named. The features they provide are
> necessary for system security and should not be considered debug only.
> Change the name to something that accurately describes what these
> options do.

It may help to explicitly call out the name change from/to in the
commit message.

>
> Signed-off-by: Laura Abbott 
> ---
> [...]
> diff --git a/arch/arm/configs/aspeed_g4_defconfig b/arch/arm/configs/aspeed_g4_defconfig
> index ca39c04..beea2cc 100644
> --- a/arch/arm/configs/aspeed_g4_defconfig
> +++ b/arch/arm/configs/aspeed_g4_defconfig
> @@ -25,7 +25,6 @@ CONFIG_MODULE_UNLOAD=y
>  # CONFIG_ARCH_MULTI_V7 is not set
>  CONFIG_ARCH_ASPEED=y
>  CONFIG_MACH_ASPEED_G4=y
> -CONFIG_DEBUG_RODATA=y
>  CONFIG_AEABI=y
>  CONFIG_UACCESS_WITH_MEMCPY=y
>  CONFIG_SECCOMP=y

Are these defconfig cases correct (dropping DEBUG_RODATA without
adding STRICT_KERNEL_RWX)?

Who should carry this series, btw?

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common

2017-02-03 Thread Kees Cook
On Fri, Feb 3, 2017 at 9:52 AM, Laura Abbott  wrote:
> There are multiple architectures that support CONFIG_DEBUG_RODATA and
> CONFIG_SET_MODULE_RONX. These options also now have the ability to be
> turned off at runtime. Move these to an architecture independent
> location and make these options def_bool y for almost all of those
> arches.
>
> Signed-off-by: Laura Abbott 
> ---
> v2: This patch is now doing just the refactor of the existing config options.
> ---
>  arch/Kconfig  | 28 
>  arch/arm/Kconfig  |  3 +++
>  arch/arm/Kconfig.debug| 11 ---
>  arch/arm/mm/Kconfig   | 12 
>  arch/arm64/Kconfig|  5 ++---
>  arch/arm64/Kconfig.debug  | 11 ---
>  arch/parisc/Kconfig   |  1 +
>  arch/parisc/Kconfig.debug | 11 ---
>  arch/s390/Kconfig |  5 ++---
>  arch/s390/Kconfig.debug   |  3 ---
>  arch/x86/Kconfig  |  5 ++---
>  arch/x86/Kconfig.debug| 11 ---
>  12 files changed, 38 insertions(+), 68 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 99839c2..22ee01e 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -781,4 +781,32 @@ config VMAP_STACK
>   the stack to map directly to the KASAN shadow map using a formula
>   that is incorrect if the stack is in vmalloc space.
>
> +config ARCH_NO_STRICT_RWX_DEFAULTS
> +   def_bool n
> +
> +config ARCH_HAS_STRICT_KERNEL_RWX
> +   def_bool n
> +
> +config DEBUG_RODATA
> +   def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
> +   prompt "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS

Ah! Yes, perfect. I totally forgot about using conditional "prompt" lines. Nice!

Acked-by: Kees Cook 

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH-tip v5 17/21] TP-futex: Group readers together in wait queue

2017-02-03 Thread valdis . kletnieks
On Fri, 03 Feb 2017 13:42:46 -0500, Waiman Long said:

> This patch set does guarantee some minimum performance level, but it
> can't guarantee fairness for all the lock waiters.

OK, sounds like it's a situation that's statistically unlikely, but it has
protections against starvation so the system will eventually dig itself out of
the hole, and the scenario is at least known and understood.  The improved
numbers for average-case probably outweigh the worst-case then...





Re: [PATCH-tip v5 17/21] TP-futex: Group readers together in wait queue

2017-02-03 Thread Waiman Long
On 02/03/2017 01:23 PM, valdis.kletni...@vt.edu wrote:
> On Fri, 03 Feb 2017 13:03:50 -0500, Waiman Long said:
>
>> On a 2-socket 36-core E5-2699 v3 system (HT off) running on a 4.10
>>                        WW futex    TP futex       Glibc
>>                      ----------  ----------  ----------
>> Total locking ops    35,707,234  58,645,434  10,930,422
>> Per-thread avg/sec       99,149     162,887      30,362
>> Per-thread min/sec       93,190      38,641      29,872
>> Per-thread max/sec      104,213     225,983      30,708
> Do we understand where the 38K number came from?  I'm a bit concerned that the
> min-to-max has such a large dispersion compared to all the other numbers.  Was
> that a worst-case issue, and is the worst-case something likely to happen in
> production, or requires special effort to trigger?
>
Because the lock isn't fair and depending on the placement of the lock,
you will see some CPUs have higher likelihood of getting the lock than
the others. This is reflected in the different locking rates as reported
by the micro-benchmark.  As the microbenchmark is included in this patch
set, you can play around with it if you want.

This patch set does guarantee some minimum performance level, but it
can't guarantee fairness for all the lock waiters.

Regards,
Longman




Re: [PATCHv2 2/2] arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX

2017-02-03 Thread Mark Rutland
On Fri, Feb 03, 2017 at 09:52:22AM -0800, Laura Abbott wrote:
> 
> Both of these options are poorly named. The features they provide are
> necessary for system security and should not be considered debug only.
> Change the name to something that accurately describes what these
> options do.
> 
> Signed-off-by: Laura Abbott 

As with patch 1, this looks good to me. FWIW:

Acked-by: Mark Rutland 

> diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
> index 939815e..560a8d8 100644
> --- a/arch/arm64/Kconfig.debug
> +++ b/arch/arm64/Kconfig.debug
> @@ -72,7 +72,7 @@ config DEBUG_WX
> If in doubt, say "Y".
>  
>  config DEBUG_ALIGN_RODATA
> - depends on DEBUG_RODATA
> + depends on STRICT_KERNEL_RWX
>   bool "Align linker sections up to SECTION_SIZE"
>   help
> If this option is enabled, sections that may potentially be marked as
> diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> index 94b62c1..67f9cb9 100644
> --- a/arch/arm64/kernel/insn.c
> +++ b/arch/arm64/kernel/insn.c
> @@ -93,7 +93,7 @@ static void __kprobes *patch_map(void *addr, int fixmap)
>   bool module = !core_kernel_text(uintaddr);
>   struct page *page;
>  
> - if (module && IS_ENABLED(CONFIG_DEBUG_SET_MODULE_RONX))
> + if (module && IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
>   page = vmalloc_to_page(addr);
>   else if (!module)
>   page = pfn_to_page(PHYS_PFN(__pa(addr)));

Since CONFIG_STRICT_MODULE_RWX is mandatory for arm64, we should be able
to simplify this, but that can wait for a later patch.

Thanks,
Mark.


Re: [PATCH-tip v5 17/21] TP-futex: Group readers together in wait queue

2017-02-03 Thread valdis . kletnieks
On Fri, 03 Feb 2017 13:03:50 -0500, Waiman Long said:

> On a 2-socket 36-core E5-2699 v3 system (HT off) running on a 4.10

>                        WW futex    TP futex       Glibc
>                      ----------  ----------  ----------
> Total locking ops    35,707,234  58,645,434  10,930,422
> Per-thread avg/sec       99,149     162,887      30,362
> Per-thread min/sec       93,190      38,641      29,872
> Per-thread max/sec      104,213     225,983      30,708

Do we understand where the 38K number came from?  I'm a bit concerned that the
min-to-max has such a large dispersion compared to all the other numbers.  Was
that a worst-case issue, and is the worst-case something likely to happen in
production, or requires special effort to trigger?





Re: [PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common

2017-02-03 Thread Mark Rutland
On Fri, Feb 03, 2017 at 09:52:21AM -0800, Laura Abbott wrote:
> There are multiple architectures that support CONFIG_DEBUG_RODATA and
> CONFIG_SET_MODULE_RONX. These options also now have the ability to be
> turned off at runtime. Move these to an architecture independent
> location and make these options def_bool y for almost all of those
> arches.
> 
> Signed-off-by: Laura Abbott 

From my POV this looks good. FWIW:

Acked-by: Mark Rutland 

Mark.

> ---
> v2: This patch is now doing just the refactor of the existing config options.
> ---
>  arch/Kconfig  | 28 
>  arch/arm/Kconfig  |  3 +++
>  arch/arm/Kconfig.debug| 11 ---
>  arch/arm/mm/Kconfig   | 12 
>  arch/arm64/Kconfig|  5 ++---
>  arch/arm64/Kconfig.debug  | 11 ---
>  arch/parisc/Kconfig   |  1 +
>  arch/parisc/Kconfig.debug | 11 ---
>  arch/s390/Kconfig |  5 ++---
>  arch/s390/Kconfig.debug   |  3 ---
>  arch/x86/Kconfig  |  5 ++---
>  arch/x86/Kconfig.debug| 11 ---
>  12 files changed, 38 insertions(+), 68 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 99839c2..22ee01e 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -781,4 +781,32 @@ config VMAP_STACK
> the stack to map directly to the KASAN shadow map using a formula
> that is incorrect if the stack is in vmalloc space.
>  
> +config ARCH_NO_STRICT_RWX_DEFAULTS
> + def_bool n
> +
> +config ARCH_HAS_STRICT_KERNEL_RWX
> + def_bool n
> +
> +config DEBUG_RODATA
> + def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
> + prompt "Make kernel text and rodata read-only" if ARCH_NO_STRICT_RWX_DEFAULTS
> + depends on ARCH_HAS_STRICT_KERNEL_RWX
> + help
> +   If this is set, kernel text and rodata memory will be made read-only,
> +   and non-text memory will be made non-executable. This provides
> +   protection against certain security exploits (e.g. executing the heap
> +   or modifying text)
> +
> +config ARCH_HAS_STRICT_MODULE_RWX
> + def_bool n
> +
> +config DEBUG_SET_MODULE_RONX
> + def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
> + prompt "Set loadable kernel module data as NX and text as RO" if ARCH_NO_STRICT_RWX_DEFAULTS
> + depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES
> + help
> +   If this is set, module text and rodata memory will be made read-only,
> +   and non-text memory will be made non-executable. This provides
> +   protection against certain security exploits (e.g. writing to text)
> +
>  source "kernel/gcov/Kconfig"
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 186c4c2..aa73ca8 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -4,10 +4,13 @@ config ARM
>   select ARCH_CLOCKSOURCE_DATA
>   select ARCH_HAS_DEVMEM_IS_ALLOWED
>   select ARCH_HAS_ELF_RANDOMIZE
> + select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
> + select ARCH_HAS_STRICT_MODULE_RWX if MMU
>   select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>   select ARCH_HAVE_CUSTOM_GPIO_H
>   select ARCH_HAS_GCOV_PROFILE_ALL
>   select ARCH_MIGHT_HAVE_PC_PARPORT
> + select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7
>   select ARCH_SUPPORTS_ATOMIC_RMW
>   select ARCH_USE_BUILTIN_BSWAP
>   select ARCH_USE_CMPXCHG_LOCKREF
> diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug
> index d83f7c3..426d271 100644
> --- a/arch/arm/Kconfig.debug
> +++ b/arch/arm/Kconfig.debug
> @@ -1738,17 +1738,6 @@ config PID_IN_CONTEXTIDR
> additional instructions during context switch. Say Y here only if you
> are planning to use hardware trace tools with this kernel.
>  
> -config DEBUG_SET_MODULE_RONX
> - bool "Set loadable kernel module data as NX and text as RO"
> - depends on MODULES && MMU
> - ---help---
> -   This option helps catch unintended modifications to loadable
> -   kernel module's text and read-only data. It also prevents execution
> -   of module data. Such protection may interfere with run-time code
> -   patching and dynamic kernel tracing - and they might also protect
> -   against certain classes of kernel exploits.
> -   If in doubt, say "N".
> -
>  source "drivers/hwtracing/coresight/Kconfig"
>  
>  endmenu
> diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig
> index f68e8ec..419a035 100644
> --- a/arch/arm/mm/Kconfig
> +++ b/arch/arm/mm/Kconfig
> @@ -1051,18 +1051,6 @@ config ARCH_SUPPORTS_BIG_ENDIAN
> This option specifies the architecture can support big endian
> operation.
>  
> -config DEBUG_RODATA
> - bool "Make kernel text and rodata read-only"
> - depends on MMU && !XIP_KERNEL
> - default y if CPU_V7
> - help
> -   If this is set, kernel text and rodata memory will be made
> -   read-only, and non-text kernel memory will be made non-executable.
> -   

[PATCH-tip v5 01/21] perf bench: New microbenchmark for userspace mutex performance

2017-02-03 Thread Waiman Long
This microbenchmark simulates how the use of different futex types
can affect the actual performance of userspace mutex locks. The
usage is:

perf bench futex mutex 

Three sets of simple mutex lock and unlock functions are implemented
using the wait-wake and PI futexes respectively. This microbenchmark
then runs the locking rate measurement tests using either one of
those mutexes or all of them consecutively.

An example output from this microbenchmark was as follows:

  [PID 7352]: 72 threads doing WW2 futex lockings (load=1) for 10 secs.

  Locking statistics:
  Test run time= 10.00s
  Total exclusive locking ops  = 90,875,633
  Exclusive lock futex calls   = 10,714,994
  Exclusive unlock futex calls = 8,047,536
  EAGAIN lock errors   = 10,653,695
  Process wakeups  = 61,299

  Percentages:
  Exclusive lock futex calls   = 11.8%
  Exclusive unlock futex calls = 8.9%
  EAGAIN lock errors   = 99.4%
  Process wakeups  = 0.8%

  Per-thread Locking Rates:
  Avg = 126,202 ops/sec (+- 0.21%)
  Min = 120,114 ops/sec
  Max = 131,375 ops/sec
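The percentages in the sample output use different denominators (futex calls against total locking ops, EAGAIN against lock futex calls, wakeups against unlock futex calls); that attribution is inferred from the numbers rather than stated, but it reproduces them exactly:

```python
# Reproducing the percentages from the sample run above. Note the
# different denominators; the mapping is inferred from the figures,
# so treat the attribution as an assumption.
ops, lock, unlock = 90_875_633, 10_714_994, 8_047_536
eagain, wakeups   = 10_653_695, 61_299

print(f"{lock / ops:.1%}")        # 11.8% of locking ops
print(f"{unlock / ops:.1%}")      # 8.9% of locking ops
print(f"{eagain / lock:.1%}")     # 99.4% of lock futex calls
print(f"{wakeups / unlock:.1%}")  # 0.8% of unlock futex calls
```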

Signed-off-by: Waiman Long 
---
 tools/perf/Documentation/perf-bench.txt |   2 +
 tools/perf/bench/Build  |   1 +
 tools/perf/bench/bench.h|   1 +
 tools/perf/bench/futex-locks.c  | 759 
 tools/perf/builtin-bench.c  |  10 +
 5 files changed, 773 insertions(+)
 create mode 100644 tools/perf/bench/futex-locks.c

diff --git a/tools/perf/Documentation/perf-bench.txt b/tools/perf/Documentation/perf-bench.txt
index 34750fc..1fa5a74 100644
--- a/tools/perf/Documentation/perf-bench.txt
+++ b/tools/perf/Documentation/perf-bench.txt
@@ -203,6 +203,8 @@ Suite for evaluating requeue calls.
 *lock-pi*::
 Suite for evaluating futex lock_pi calls.
 
+*mutex*::
+Suite for evaluating futex calls for implementing userspace mutexes.
 
 SEE ALSO
 
diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index 60bf119..7ce1cd7 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -6,6 +6,7 @@ perf-y += futex-wake.o
 perf-y += futex-wake-parallel.o
 perf-y += futex-requeue.o
 perf-y += futex-lock-pi.o
+perf-y += futex-locks.o
 
 perf-$(CONFIG_X86_64) += mem-memcpy-x86-64-asm.o
 perf-$(CONFIG_X86_64) += mem-memset-x86-64-asm.o
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 579a592..b0632df 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -36,6 +36,7 @@
 int bench_futex_requeue(int argc, const char **argv, const char *prefix);
 /* pi futexes */
 int bench_futex_lock_pi(int argc, const char **argv, const char *prefix);
+int bench_futex_mutex(int argc, const char **argv, const char *prefix);
 
 #define BENCH_FORMAT_DEFAULT_STR   "default"
 #define BENCH_FORMAT_DEFAULT   0
diff --git a/tools/perf/bench/futex-locks.c b/tools/perf/bench/futex-locks.c
new file mode 100644
index 000..622a57c
--- /dev/null
+++ b/tools/perf/bench/futex-locks.c
@@ -0,0 +1,759 @@
+/*
+ * Copyright (C) 2016-2017 Waiman Long 
+ *
+ * This microbenchmark simulates how the use of different futex types can
+ * affect the actual performance of userspace locking primitives like mutex.
+ *
+ * The raw throughput of the futex lock and unlock calls is not a good
+ * indication of actual throughput of the mutex code as it may not really
+ * need to call into the kernel. Therefore, 3 sets of simple mutex lock and
+ * unlock functions are written to implement a mutex lock using the
+ * wait-wake (2 versions) and PI futexes respectively. These functions serve
+ * as the basis for measuring the locking throughput.
+ */
+
+#include 
+
+#include 
+#include 
+#include "../util/stat.h"
+#include "../perf-sys.h"
+#include 
+#include 
+#include 
+#include 
+#include "bench.h"
+#include "futex.h"
+
+#include 
+#include 
+#include 
+
+#define CACHELINE_SIZE 64
+#define gettid()   syscall(SYS_gettid)
+#define __cacheline_aligned __attribute__((__aligned__(CACHELINE_SIZE)))
+
+typedef u32 futex_t;
+typedef void (*lock_fn_t)(futex_t *futex, int tid);
+typedef void (*unlock_fn_t)(futex_t *futex, int tid);
+
+/*
+ * Statistical count list
+ */
+enum {
+   STAT_OPS,   /* # of exclusive locking operations*/
+   STAT_LOCKS, /* # of exclusive lock futex calls  */
+   STAT_UNLOCKS,   /* # of exclusive unlock futex calls*/
+   STAT_SLEEPS,/* # of exclusive lock sleeps   */
+   STAT_EAGAINS,   /* # of EAGAIN errors   */
+   STAT_WAKEUPS,   /* # of wakeups (unlock return) */
+   STAT_LOCKERRS,  /* # of exclusive lock errors   */
+   STAT_UNLKERRS,  /* # of exclusive unlock errors */
+   STAT_NUM/* Total # of statistical count */
+};
+
+/*
+ * Syscall time list
+ */
+enum {
+   TIME_LOCK,  /* Total exclusive lock syscall time*/
+ 

[PATCH-tip v5 00/21] futex: Introducing throughput-optimized (TP) futexes

2017-02-03 Thread Waiman Long
 v4->v5:
  - Fix 0-day kernel build test compilation warnings.
  - Extract out the non-TP futex-mutex and futex-rwlock
microbenchmarks as separate patches, as suggested by Arnaldo.
  - Rebased to the latest tip tree & use running_clock() instead
of sched_clock().

 v3->v4:
  - Properly handle EFAULT error due to page fault.
  - Extend the use cases of TP futexes to userspace rwlock.
  - Change the lock handoff trigger from number of spinnings to
elapsed time (5ms).
  - Various updates to the "perf bench futex mutex" microbenchmark.
  - Add a new "perf bench futex rwlock" microbenchmark for measuring
rwlock performance.
  - Add mutex_owner to futex_state object to hold the serialization
mutex owner.
  - Streamline a number of helper functions and other miscellaneous
coding improvements.
  - Rebase the patchset to the 4.10 kernel.

 v2->v3:
  - Use the abbreviation TP for the new futexes instead of TO.
  - Make a number of changes according to review comments from
ThomasG, PeterZ and MikeG.
  - Break the main futex patch into smaller pieces to make them easier
to review.

 v1->v2:
  - Adds an explicit lock hand-off mechanism.
  - Adds timeout support.
  - Simplifies the required userspace code.
  - Fixes a number of problems in the v1 code.

This patchset introduces a new futex implementation called
throughput-optimized (TP) futexes. It is similar to PI futexes in its
calling convention, but provides better throughput than the wait-wake
(WW) futexes by encouraging lock stealing and optimistic spinning.

The new TP futexes can be used in implementing both userspace mutexes
and rwlocks. They provide better performance while simplifying the
userspace locking implementation at the same time. The WW futexes
are still needed to implement other synchronization primitives like
condition variables and semaphores that cannot be handled by the
TP futexes.

Another advantage of the TP futexes is that they have a built-in lock
handoff mechanism to prevent lock starvation from happening as long
as the underlying kernel mutex doesn't have a lock starvation problem.

Patches 1-2 implement userspace exclusive lock and read/write lock
benchmarks in perf bench using futexes.

Patches 3-8 are preparatory patches that pave the way to implement
the TP futexes.

Patch 9 implements the basic TP futex that can support userspace
mutexes.

Patch 10 adds robust handling to the TP futexes to handle the death of TP
futex exclusive lock owners.

Patch 11 adds a lock hand-off mechanism that can prevent lock starvation
from happening while introducing minimal runtime overhead.

Patch 12 enables the FUTEX_LOCK futex(2) syscall to return status
information, such as how the lock was acquired and how many times
the task needed to sleep in the spinning loop, that can be used by
userspace utilities to monitor how the TP futexes are performing.

Patch 13 enables userspace applications to supply a timeout value to
abort the lock acquisition attempt after the specified time period
has passed.

Patch 14 adds a new document tp-futex.txt in the Documentation
directory to describe the new TP futexes.

Patch 15 extends the TP futexes to support userspace rwlocks.

Patch 16 enables more reader lock stealing of TP futexes in the kernel.

Patch 17 groups readers together as a spin group to enhance reader
throughput.

Patch 18 updates the tp-futex.txt file to add information about
rwlock support.

Patch 19 extends the perf bench futex locking benchmark to include
variants using the new TP futexes.

Patch 20 updates the wake_up_q() function to return the number of
woken tasks.

Patch 21 enables the dumping of internal futex state information
via debugfs.

All the benchmark results shown in the change logs were produced by
the microbenchmarks included in this patchset, so everyone can run
the microbenchmarks to see how the TP futexes perform and behave on
their own test systems.

Once this patchset is finalized, an updated manpage patch to document
the new TP futexes will be sent out. The next step will then be to
make glibc NPTL use the new TP futexes.

Performance highlights in terms of average locking rates (ops/sec)
on a 2-socket system are as follows:

             WW futex   TP futex    Glibc
             --------   --------    -----
  mutex       121,129    152,242        -
  rwlock       99,149    162,887   30,362

Patches 9 and 17 contain more detailed information about the
performance characteristics of the TP futexes when implementing
userspace mutexes and rwlocks respectively, compared with other
possible ways of doing so via the wait-wake futexes.

Waiman Long (21):
  perf bench: New microbenchmark for userspace mutex performance
  perf bench: New microbenchmark for userspace rwlock performance
  futex: Consolidate duplicated timer setup code
  futex: Rename futex_pi_state to futex_state
  futex: Add helpers to get & cmpxchg futex value without lock
  futex: Consolidate pure pi_state_list add & 

[PATCH-tip v5 04/21] futex: Rename futex_pi_state to futex_state

2017-02-03 Thread Waiman Long
The futex_pi_state structure will be overloaded in later patches to
store state information about non-PI futexes, so the structure name
itself is no longer a good description of its purpose. It is therefore
renamed to futex_state, a more generic name.

Some of the functions that process the futex states are also renamed.

Signed-off-by: Waiman Long 
---
 include/linux/sched.h |   4 +-
 kernel/futex.c| 107 +-
 2 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e9d5503..27a0b77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -59,7 +59,7 @@
 struct sched_attr;
 struct sched_param;
 
-struct futex_pi_state;
+struct futex_state;
 struct robust_list_head;
 struct bio_list;
 struct fs_struct;
@@ -1354,7 +1354,7 @@ struct task_struct {
struct compat_robust_list_head __user *compat_robust_list;
 #endif
struct list_head pi_state_list;
-   struct futex_pi_state *pi_state_cache;
+   struct futex_state *pi_state_cache;
 #endif
 #ifdef CONFIG_PERF_EVENTS
struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
diff --git a/kernel/futex.c b/kernel/futex.c
index 07886ba..b21c6a0 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -194,11 +194,12 @@
 #define FLAGS_HAS_TIMEOUT  0x04
 
 /*
- * Priority Inheritance state:
+ * Futex state object:
+ *  - Priority Inheritance state
  */
-struct futex_pi_state {
+struct futex_state {
/*
-* list of 'owned' pi_state instances - these have to be
+* list of 'owned' state instances - these have to be
 * cleaned up in do_exit() if the task exits prematurely:
 */
struct list_head list;
@@ -242,7 +243,7 @@ struct futex_q {
struct task_struct *task;
spinlock_t *lock_ptr;
union futex_key key;
-   struct futex_pi_state *pi_state;
+   struct futex_state *pi_state;
struct rt_mutex_waiter *rt_waiter;
union futex_key *requeue_pi_key;
u32 bitset;
@@ -808,76 +809,76 @@ static int get_futex_value_locked(u32 *dest, u32 __user 
*from)
 /*
  * PI code:
  */
-static int refill_pi_state_cache(void)
+static int refill_futex_state_cache(void)
 {
-   struct futex_pi_state *pi_state;
+   struct futex_state *state;
 
if (likely(current->pi_state_cache))
return 0;
 
-   pi_state = kzalloc(sizeof(*pi_state), GFP_KERNEL);
+   state = kzalloc(sizeof(*state), GFP_KERNEL);
 
-   if (!pi_state)
+   if (!state)
return -ENOMEM;
 
-   INIT_LIST_HEAD(&pi_state->list);
+   INIT_LIST_HEAD(&state->list);
/* pi_mutex gets initialized later */
-   pi_state->owner = NULL;
-   atomic_set(&pi_state->refcount, 1);
-   pi_state->key = FUTEX_KEY_INIT;
+   state->owner = NULL;
+   atomic_set(&state->refcount, 1);
+   state->key = FUTEX_KEY_INIT;
 
-   current->pi_state_cache = pi_state;
+   current->pi_state_cache = state;
 
return 0;
 }
 
-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_state *alloc_futex_state(void)
 {
-   struct futex_pi_state *pi_state = current->pi_state_cache;
+   struct futex_state *state = current->pi_state_cache;
 
-   WARN_ON(!pi_state);
+   WARN_ON(!state);
current->pi_state_cache = NULL;
 
-   return pi_state;
+   return state;
 }
 
 /*
- * Drops a reference to the pi_state object and frees or caches it
+ * Drops a reference to the futex state object and frees or caches it
  * when the last reference is gone.
  *
  * Must be called with the hb lock held.
  */
-static void put_pi_state(struct futex_pi_state *pi_state)
+static void put_futex_state(struct futex_state *state)
 {
-   if (!pi_state)
+   if (!state)
return;
 
-   if (!atomic_dec_and_test(&pi_state->refcount))
+   if (!atomic_dec_and_test(&state->refcount))
return;
 
/*
-* If pi_state->owner is NULL, the owner is most probably dying
-* and has cleaned up the pi_state already
+* If state->owner is NULL, the owner is most probably dying
+* and has cleaned up the futex state already
 */
-   if (pi_state->owner) {
-   raw_spin_lock_irq(&pi_state->owner->pi_lock);
-   list_del_init(&pi_state->list);
-   raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+   if (state->owner) {
+   raw_spin_lock_irq(&state->owner->pi_lock);
+   list_del_init(&state->list);
+   raw_spin_unlock_irq(&state->owner->pi_lock);

-   rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner);
+   rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
}
 
if (current->pi_state_cache)
-   kfree(pi_state);
+   kfree(state);
else {
/*
-* pi_state->list is already empty.
-* clear 

[PATCH-tip v5 05/21] futex: Add helpers to get & cmpxchg futex value without lock

2017-02-03 Thread Waiman Long
Two new helper functions, cmpxchg_futex_value() and get_futex_value(),
are added to access and change the futex value without holding the
hash bucket lock. As a result, page faults are enabled and the page
will be faulted in if it is not present yet.

Signed-off-by: Waiman Long 
---
 kernel/futex.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index b21c6a0..203a388 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -805,6 +805,21 @@ static int get_futex_value_locked(u32 *dest, u32 __user 
*from)
return ret ? -EFAULT : 0;
 }
 
+/*
+ * The equivalents of the above cmpxchg_futex_value_locked() and
+ * get_futex_value_locked which are called without the hash bucket lock
+ * and so can have page fault enabled.
+ */
+static inline int cmpxchg_futex_value(u32 *curval, u32 __user *uaddr,
+ u32 uval, u32 newval)
+{
+   return futex_atomic_cmpxchg_inatomic(curval, uaddr, uval, newval);
+}
+
+static inline int get_futex_value(u32 *dest, u32 __user *from)
+{
+   return __get_user(*dest, from);
+}
 
 /*
  * PI code:
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-tip v5 02/21] perf bench: New microbenchmark for userspace rwlock performance

2017-02-03 Thread Waiman Long
This microbenchmark simulates how the use of different futex types
can affect the actual performance of userspace rwlocks. The
usage is:

perf bench futex rwlock 

Two sets of simple rwlock lock and unlock functions are implemented
using the wait-wake futexes and glibc rwlock respectively. This
microbenchmark then runs the locking rate measurement tests using
either one of those rwlocks or all of them consecutively.

Sample output from this microbenchmark was as follows:

  [PID 7434]: 72 threads doing WW futex lockings (load=1) for 10 secs.

  Locking statistics:
  Test run time= 10.00s
  Total exclusive locking ops  = 27,147,143
  Total shared locking ops = 27,147,143
  Exclusive lock futex calls   = 8,812,299
  Exclusive unlock futex calls = 9,034,693
  Shared lock futex calls  = 1,685,538
  Shared unlock futex calls= 4,605,343
  EAGAIN lock errors   = 10,472,722
  Process wakeups  = 25,115

  Percentages:
  Exclusive lock futex calls   = 32.5%
  Shared lock futex calls  = 6.2%
  Exclusive unlock futex calls = 33.3%
  Shared unlock futex calls= 17.0%
  EAGAIN lock errors   = 99.8%
  Process wakeups  = 0.2%

  Per-thread Locking Rates:
  Avg = 75,392 ops/sec (+- 0.30%)
  Min = 71,259 ops/sec
  Max = 78,211 ops/sec

Signed-off-by: Waiman Long 
---
 tools/perf/Documentation/perf-bench.txt |   3 +
 tools/perf/bench/bench.h|   1 +
 tools/perf/bench/futex-locks.c  | 683 +++-
 tools/perf/builtin-bench.c  |   1 +
 4 files changed, 672 insertions(+), 16 deletions(-)

diff --git a/tools/perf/Documentation/perf-bench.txt 
b/tools/perf/Documentation/perf-bench.txt
index 1fa5a74..e5a7079 100644
--- a/tools/perf/Documentation/perf-bench.txt
+++ b/tools/perf/Documentation/perf-bench.txt
@@ -206,6 +206,9 @@ Suite for evaluating futex lock_pi calls.
 *mutex*::
 Suite for evaluating futex calls for implementing userspace mutexes.
 
+*rwlock*::
+Suite for evaluating futex calls for implementing userspace rwlocks.
+
 SEE ALSO
 
 linkperf:perf[1]
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index b0632df..a7e2037 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -37,6 +37,7 @@
 /* pi futexes */
 int bench_futex_lock_pi(int argc, const char **argv, const char *prefix);
 int bench_futex_mutex(int argc, const char **argv, const char *prefix);
+int bench_futex_rwlock(int argc, const char **argv, const char *prefix);
 
 #define BENCH_FORMAT_DEFAULT_STR   "default"
 #define BENCH_FORMAT_DEFAULT   0
diff --git a/tools/perf/bench/futex-locks.c b/tools/perf/bench/futex-locks.c
index 622a57c..18dc71d 100644
--- a/tools/perf/bench/futex-locks.c
+++ b/tools/perf/bench/futex-locks.c
@@ -10,6 +10,10 @@
 * unlock functions are written to implement a mutex lock using the
  * wait-wake (2 versions) and PI futexes respectively. These functions serve
  * as the basis for measuring the locking throughput.
+ *
+ * Two sets of simple reader/writer lock and unlock functions are also
+ * implemented using the wait-wake futexes as well as the Glibc rwlock
+ * respectively for performance measurement purpose.
  */
 
 #include 
@@ -21,11 +25,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "bench.h"
 #include "futex.h"
 
 #include 
+#include 
 #include 
 #include 
 
@@ -42,13 +48,19 @@
  */
 enum {
STAT_OPS,   /* # of exclusive locking operations*/
+   STAT_SOPS,  /* # of shared locking operations   */
STAT_LOCKS, /* # of exclusive lock futex calls  */
STAT_UNLOCKS,   /* # of exclusive unlock futex calls*/
STAT_SLEEPS,/* # of exclusive lock sleeps   */
+   STAT_SLOCKS,/* # of shared lock futex calls */
+   STAT_SUNLOCKS,  /* # of shared unlock futex calls   */
+   STAT_SSLEEPS,   /* # of shared lock sleeps  */
STAT_EAGAINS,   /* # of EAGAIN errors   */
STAT_WAKEUPS,   /* # of wakeups (unlock return) */
STAT_LOCKERRS,  /* # of exclusive lock errors   */
STAT_UNLKERRS,  /* # of exclusive unlock errors */
+   STAT_SLOCKERRS, /* # of shared lock errors  */
+   STAT_SUNLKERRS, /* # of shared unlock errors*/
STAT_NUM/* Total # of statistical count */
 };
 
@@ -58,6 +70,8 @@ enum {
 enum {
TIME_LOCK,  /* Total exclusive lock syscall time*/
TIME_UNLK,  /* Total exclusive unlock syscall time  */
+   TIME_SLOCK, /* Total shared lock syscall time   */
+   TIME_SUNLK, /* Total shared unlock syscall time */
TIME_NUM,
 };
 
@@ -100,7 +114,11 @@ struct worker {
 static unsigned int threads_stopping;
 static struct stats throughput_stats;
 static lock_fn_t mutex_lock_fn;
+static lock_fn_t read_lock_fn;
+static 

[PATCH-tip v5 11/21] TP-futex: Implement lock handoff to prevent lock starvation

2017-02-03 Thread Waiman Long
The current TP futexes have no guarantee that the top waiter
(serialization mutex owner) can get the lock within a finite time.
As a result, lock starvation can happen.

A lock handoff mechanism is added to the TP futexes to prevent lock
starvation from happening. The idea is that the top waiter can set a
special handoff_pid variable with its own pid. The futex owner will
then check this variable at unlock time to see if it should free the
futex or change its value to the designated pid to transfer the lock
directly to the top waiter.

This change does have the effect of limiting the amount of unfairness
and hence may adversely affect performance depending on the workloads.

Currently the handoff mechanism is triggered when the top waiter fails
to acquire the lock after about 5ms of spinning and sleeping. So the
maximum lock handoff rate is about 200 handoffs per second.

Signed-off-by: Waiman Long 
---
 kernel/futex.c | 59 --
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 46a1a4b..348b44c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,6 +63,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -237,6 +238,8 @@ struct futex_state {
struct task_struct *mutex_owner;/* For TP futexes only */
atomic_t refcount;
 
+   u32 handoff_pid;/* For TP futexes only */
+
enum futex_type type;
union futex_key key;
 };
@@ -914,6 +917,7 @@ static int refill_futex_state_cache(void)
/* pi_mutex gets initialized later */
state->owner = NULL;
state->mutex_owner = NULL;
+   state->handoff_pid = 0;
+   atomic_set(&state->refcount, 1);
state->key = FUTEX_KEY_INIT;
 
@@ -3337,8 +3341,9 @@ void exit_robust_list(struct task_struct *curr)
  * and unlocking.
  *
  * The purpose of this FUTEX_WAITERS bit is to make the unlocker wake up the
- * serialization mutex owner. Not having the FUTEX_WAITERS bit set doesn't
- * mean there is no waiter in the kernel.
+ * serialization mutex owner or to hand off the lock directly to the top
+ * waiter. Not having the FUTEX_WAITERS bit set doesn't mean there is no
+ * waiter in the kernel.
  *
 * Like PI futexes, TP futexes are orthogonal to robust futexes and can be
  * used together.
@@ -3354,6 +3359,16 @@ void exit_robust_list(struct task_struct *curr)
  * the ownership of the futex.
  */
 
+/*
+ * Timeout value for enabling lock handoff and preventing lock starvation.
+ *
+ * Currently, lock handoff will be enabled if the top waiter can't get the
+ * futex after about 5ms. To reduce time checking overhead, it is only done
+ * after every 128 spins or right after wakeup. As a result, the actual
+ * elapsed time will be a bit longer than the specified value.
+ */
+#define TP_HANDOFF_TIMEOUT 5000000 /* 5ms  */
+
 /**
  * lookup_futex_state - Looking up the futex state structure.
  * @hb: hash bucket
@@ -3433,6 +3448,7 @@ static inline int put_futex_state_unlocked(struct 
futex_state *state)
  * If waiter is true
  * then
  *   don't preserve the flag bits;
+ *   check for handoff (futex word == own pid)
  * else
  *   preserve the flag bits
  * endif
@@ -3451,6 +3467,9 @@ static inline int futex_trylock(u32 __user *uaddr, const 
u32 vpid, u32 *puval,
 
uval = *puval;
 
+   if (waiter && (uval & FUTEX_TID_MASK) == vpid)
+   return 1;
+
if (uval & FUTEX_TID_MASK)
return 0;   /* Trylock fails */
 
@@ -3544,10 +3563,12 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,
"futex: owner pid %d of TP futex 0x%lx was %s.\n"   \
"\tLock is now acquired by pid %d!\n"
 
-   int ret;
+   int ret, loopcnt = 1;
+   bool handoff_set = false;
u32 uval;
u32 owner_pid = 0;
struct task_struct *owner_task = NULL;
+   u64 handoff_time = running_clock() + TP_HANDOFF_TIMEOUT;
 
preempt_disable();
WRITE_ONCE(state->mutex_owner, current);
@@ -3602,6 +3623,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,
if (need_resched()) {
__set_current_state(TASK_RUNNING);
schedule_preempt_disabled();
+   loopcnt = 0;
continue;
}
 
@@ -3610,6 +3632,23 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,
break;
}
 
+   /*
+* Enable lock handoff if the elapsed time exceeds the timeout
+* value. We also need to set the FUTEX_WAITERS bit to make
+* sure that futex lock holder will initiate the handoff at
+* unlock time.
+*/
+   if (!handoff_set && !(loopcnt++ & 0x7f)) {
+   if (running_clock() > 

[PATCH-tip v5 06/21] futex: Consolidate pure pi_state_list add & delete codes to helpers

2017-02-03 Thread Waiman Long
Two new helper functions (task_pi_list_add & task_pi_list_del)
are created to consolidate all the pure pi_state_list addition and
insertion codes. The set_owner argument in task_pi_list_add() will
be needed in a later patch.

Signed-off-by: Waiman Long 
---
 kernel/futex.c | 64 +++---
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 203a388..25881e6 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -824,6 +824,39 @@ static inline int get_futex_value(u32 *dest, u32 __user 
*from)
 /*
  * PI code:
  */
+
+/**
+ * task_pi_list_add - add futex state object to a task's pi_state_list
+ * @task : task structure
+ * @state: futex state object
+ */
+static inline void task_pi_list_add(struct task_struct *task,
+   struct futex_state *state)
+{
+   raw_spin_lock_irq(&task->pi_lock);
+   WARN_ON(!list_empty(&state->list));
+   list_add(&state->list, &task->pi_state_list);
+   state->owner = task;
+   raw_spin_unlock_irq(&task->pi_lock);
+}
+
+/**
+ * task_pi_list_del - delete futex state object from a task's pi_state_list
+ * @state: futex state object
+ * @warn : warn if list is empty when set
+ */
+static inline void task_pi_list_del(struct futex_state *state, const bool warn)
+{
+   struct task_struct *task = state->owner;
+
+   raw_spin_lock_irq(&task->pi_lock);
+   if (warn)
+   WARN_ON(list_empty(&state->list));
+   list_del_init(&state->list);
+   state->owner = NULL;
+   raw_spin_unlock_irq(&task->pi_lock);
+}
+
 static int refill_futex_state_cache(void)
 {
struct futex_state *state;
@@ -876,9 +909,7 @@ static void put_futex_state(struct futex_state *state)
 * and has cleaned up the futex state already
 */
if (state->owner) {
-   raw_spin_lock_irq(&state->owner->pi_lock);
-   list_del_init(&state->list);
-   raw_spin_unlock_irq(&state->owner->pi_lock);
+   task_pi_list_del(state, false);

rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
}
@@ -1399,16 +1430,8 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, 
struct futex_q *this,
return ret;
}
 
-   raw_spin_lock(&pi_state->owner->pi_lock);
-   WARN_ON(list_empty(&pi_state->list));
-   list_del_init(&pi_state->list);
-   raw_spin_unlock(&pi_state->owner->pi_lock);
-
-   raw_spin_lock(&new_owner->pi_lock);
-   WARN_ON(!list_empty(&pi_state->list));
-   list_add(&pi_state->list, &new_owner->pi_state_list);
-   pi_state->owner = new_owner;
-   raw_spin_unlock(&new_owner->pi_lock);
+   task_pi_list_del(pi_state, true);
+   task_pi_list_add(new_owner, pi_state);

raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
@@ -2213,19 +2236,10 @@ static int fixup_pi_state_owner(u32 __user *uaddr, 
struct futex_q *q,
 * We fixed up user space. Now we need to fix the pi_state
 * itself.
 */
-   if (pi_state->owner != NULL) {
-   raw_spin_lock_irq(&pi_state->owner->pi_lock);
-   WARN_ON(list_empty(&pi_state->list));
-   list_del_init(&pi_state->list);
-   raw_spin_unlock_irq(&pi_state->owner->pi_lock);
-   }
+   if (pi_state->owner != NULL)
+   task_pi_list_del(pi_state, true);

-   pi_state->owner = newowner;
-
-   raw_spin_lock_irq(&newowner->pi_lock);
-   WARN_ON(!list_empty(&pi_state->list));
-   list_add(&pi_state->list, &newowner->pi_state_list);
-   raw_spin_unlock_irq(&newowner->pi_lock);
+   task_pi_list_add(newowner, pi_state);
return 0;
 
/*
-- 
1.8.3.1



[PATCH-tip v5 14/21] TP-futex, doc: Add TP futexes documentation

2017-02-03 Thread Waiman Long
This patch adds a new documentation file on how to use the TP futexes.

Signed-off-by: Waiman Long 
---
 Documentation/00-INDEX |   2 +
 Documentation/tp-futex.txt | 161 +
 2 files changed, 163 insertions(+)
 create mode 100644 Documentation/tp-futex.txt

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index c8a8eb1..326b68c 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -416,6 +416,8 @@ this_cpu_ops.txt
- List rationale behind and the way to use this_cpu operations.
 thermal/
- directory with information on managing thermal issues (CPU/temp)
+tp-futex.txt
+   - Documentation on lightweight throughput-optimized futexes.
 trace/
- directory with info on tracing technologies within linux
 translations/
diff --git a/Documentation/tp-futex.txt b/Documentation/tp-futex.txt
new file mode 100644
index 000..5324ee0
--- /dev/null
+++ b/Documentation/tp-futex.txt
@@ -0,0 +1,161 @@
+Started by: Waiman Long 
+
+Throughput-Optimized Futexes
+
+
+There are two main problems for a wait-wake futex (FUTEX_WAIT and
+FUTEX_WAKE) when used for creating user-space locking primitives:
+
+ 1) With a wait-wake futex, tasks waiting for a lock are put to sleep
+in the futex queue to be woken up by the lock owner when it is done
+with the lock. Waking up a sleeping task, however, introduces some
+additional latency which can be large especially if the critical
+section protected by the lock is relatively short. This may cause
+a performance bottleneck on large systems with many CPUs running
+applications that need a lot of inter-thread synchronization.
+
+ 2) The performance of the wait-wake futex is currently
+spinlock-constrained.  When many threads are contending for a
+futex in a large system with many CPUs, it is not unusual to have
+spinlock contention accounting for more than 90% of the total
+CPU cycles consumed at various points in time.
+
+These two problems can create performance bottlenecks for a
+futex-constrained workload, especially on systems with a large
+number of CPUs.
+
+The goal of the throughput-optimized (TP) futexes is to maximize the
+locking throughput at the expense of fairness and deterministic
+latency. This is done by encouraging lock stealing and optimistic
+spinning on a locked futex when the futex owner is running.  This is
+the same optimistic spinning mechanism used by the kernel mutex and rw
+semaphore implementations to improve performance. Optimistic spinning
+is done without taking any lock.
+
+Lock stealing is known to be a performance enhancement technique as
+long as the safeguards are in place to make sure that there will be no
+lock starvation.  The TP futexes have a built-in lock hand-off mechanism
+to prevent lock starvation from happening as long as the underlying
+kernel mutexes that the TP futexes use have no lock starvation problem.
+
+When the top lock waiter has failed to acquire the lock within a
+certain time threshold, it will initiate the hand-off mechanism by
+forcing the unlocker to transfer the lock to itself instead of freeing
+it for others to grab. This limits the maximum time a waiter has
+to wait.
+
+The downside of this improved throughput is the increased variance
+of the actual response times of the locking operations. Some locking
+operations will be very fast, while others may be considerably slower.
+The average response time should be better than the wait-wake futexes.
+
+Performance-wise, TP futexes should be faster than wait-wake futexes
+especially if the futex lock holders do not sleep. For workloads
+that do a lot of sleeping within the critical sections, the TP
+futexes may not be faster than the wait-wake futexes.
+
+Implementation
+--
+
+Like the PI and robust futexes, an exclusive lock acquirer has to
+atomically put its thread ID (TID) into the lower 30 bits of the
+32-bit futex which should have an initial value of 0. If it succeeds,
+it will be the owner of the futex. Otherwise, it has to call into
+the kernel using the FUTEX_LOCK futex(2) syscall.
+
+  futex(uaddr, FUTEX_LOCK, 0, timeout, NULL, 0);
+
+Only the optional timeout parameter is being used by this futex(2)
+syscall.
+
+Inside the kernel, a kernel mutex is used for serialization among
+the futex waiters. Only the top lock waiter which is the owner of
+the serialization mutex is allowed to continuously spin and attempt
+to acquire the lock.  Other lock waiters will have one attempt to
+steal the lock before entering the mutex queues.
+
+When the exclusive futex lock owner is no longer running, the top
+waiter will set the FUTEX_WAITERS bit before going to sleep. This is
+to make sure the futex owner will go into the kernel at unlock time
+to wake up the top waiter.
+
+The return values of the above futex locking syscall, if non-negative,
+are status code that 

[PATCH-tip v5 08/21] futex: Allow direct attachment of futex_state objects to hash bucket

2017-02-03 Thread Waiman Long
Currently, the futex state objects can only be located indirectly as

hash bucket => futex_q => futex state

It can be beneficial in some cases to locate the futex state
object directly from the hash bucket without the futex_q middleman.
Therefore, a new list head to link the futex state objects as well
as a new spinlock to manage them are added to the hash bucket.

To limit size increase for UP systems, these new fields are only for
SMP machines where the cacheline alignment of the hash bucket leaves
it with enough empty space for the new fields.

Signed-off-by: Waiman Long 
---
 kernel/futex.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 928212c..2ced6c0 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -209,6 +209,11 @@ struct futex_state {
struct list_head list;
 
/*
+* Can link to fs_head in the owning hash bucket.
+*/
+   struct list_head fs_list;
+
+   /*
 * The PI object:
 */
struct rt_mutex pi_mutex;
@@ -264,11 +269,24 @@ struct futex_q {
  * Hash buckets are shared by all the futex_keys that hash to the same
  * location.  Each key may have multiple futex_q structures, one for each task
  * waiting on a futex.
+ *
+ * Alternatively (in SMP), a key can be associated with a unique futex_state
+ * object where multiple waiters waiting for that futex can queue up in that
+ * futex_state object without using the futex_q structure. A separate
+ * futex_state lock (fs_lock) is used for processing those futex_state objects.
  */
 struct futex_hash_bucket {
atomic_t waiters;
spinlock_t lock;
struct plist_head chain;
+
+#ifdef CONFIG_SMP
+   /*
+* Fields for managing futex_state object list
+*/
+   spinlock_t fs_lock;
+   struct list_head fs_head;
+#endif
 } cacheline_aligned_in_smp;
 
 /*
@@ -875,6 +893,8 @@ static int refill_futex_state_cache(void)
return -ENOMEM;
 
INIT_LIST_HEAD(&state->list);
+   INIT_LIST_HEAD(&state->fs_list);
+
/* pi_mutex gets initialized later */
state->owner = NULL;
atomic_set(&state->refcount, 1);
@@ -3365,6 +3385,10 @@ static int __init futex_init(void)
atomic_set(&futex_queues[i].waiters, 0);
plist_head_init(&futex_queues[i].chain);
spin_lock_init(&futex_queues[i].lock);
+#ifdef CONFIG_SMP
+   INIT_LIST_HEAD(&futex_queues[i].fs_head);
+   spin_lock_init(&futex_queues[i].fs_lock);
+#endif
}
 
return 0;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-tip v5 12/21] TP-futex: Return status code on FUTEX_LOCK calls

2017-02-03 Thread Waiman Long
To better understand how the TP futexes are performing, it is useful to
return the internal status on the TP futexes. The FUTEX_LOCK futex(2)
syscall will now return a positive status code if no error happens. The
status code consists of the following 3 fields:

 1) Bits 00-07: code on how the lock is acquired.
 2) Bits 08-15: reserved
 3) Bits 16-30: how many times the task sleeps in the optimistic
spinning loop.

By returning the TP status code, an external monitoring or tracking
program can have a macro view of how the TP futexes are performing.

Signed-off-by: Waiman Long 
---
 kernel/futex.c | 41 +++--
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 348b44c..22f7906 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3369,6 +3369,22 @@ void exit_robust_list(struct task_struct *curr)
  */
 #define TP_HANDOFF_TIMEOUT 500 /* 5ms  */
 
+/*
+ * The futex_lock() function returns the internal status of the TP futex.
+ * The status code consists of the following 3 fields:
+ * 1) bits 00-07: code on how the lock is acquired
+ *0 - steals the lock
+ *1 - top waiter (mutex owner) acquires the lock
+ *2 - handed off the lock
+ * 2) bits 08-15: reserved
+ * 3) bits 16-30: how many times the task has slept or yielded to the
+ *   scheduler in futex_spin_on_owner().
+ */
+#define TP_LOCK_STOLEN 0
+#define TP_LOCK_ACQUIRED   1
+#define TP_LOCK_HANDOFF2
+#define TP_STATUS_SLEEP(val, sleep)((val)|((sleep) << 16))
+
 /**
  * lookup_futex_state - Looking up the futex state structure.
  * @hb: hash bucket
@@ -3453,9 +3469,11 @@ static inline int put_futex_state_unlocked(struct 
futex_state *state)
  *   preserve the flag bits
  * endif
  *
- * Return: 1 if lock acquired;
+ * Return: TP_LOCK_ACQUIRED if lock acquired;
+ *TP_LOCK_HANDOFF if lock was handed off;
  *0 if lock acquisition failed;
  *-EFAULT if an error happened.
+ **puval will contain the latest futex value when trylock fails.
  */
 static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval,
const bool waiter)
@@ -3468,7 +3486,7 @@ static inline int futex_trylock(u32 __user *uaddr, const 
u32 vpid, u32 *puval,
uval = *puval;
 
if (waiter && (uval & FUTEX_TID_MASK) == vpid)
-   return 1;
+   return TP_LOCK_HANDOFF;
 
if (uval & FUTEX_TID_MASK)
return 0;   /* Trylock fails */
@@ -3479,7 +3497,7 @@ static inline int futex_trylock(u32 __user *uaddr, const 
u32 vpid, u32 *puval,
if (unlikely(cmpxchg_futex_value(puval, uaddr, uval, vpid|flags)))
return -EFAULT;
 
-   return *puval == uval;
+   return (*puval == uval) ? TP_LOCK_ACQUIRED : 0;
 }
 
 /**
@@ -3493,7 +3511,8 @@ static inline int futex_trylock(u32 __user *uaddr, const 
u32 vpid, u32 *puval,
  * of faulting in the futex word. This function should only be called from
  * within futex_spin_on_owner().
  *
- * Return: 1 if lock acquired;
+ * Return: TP_LOCK_ACQUIRED if lock acquired;
+ *TP_LOCK_HANDOFF if lock was handed off;
  *0 if lock acquisition failed;
  *-EFAULT if an error happened.
  */
@@ -3554,7 +3573,7 @@ static inline int futex_set_waiters_bit(u32 __user 
*uaddr, u32 *puval)
  * unless the pid wraps around and the perceived owner is not the real owner.
  * To guard against this case, we will have to use the robust futex feature.
  *
- * Return: 0 if futex acquired, < 0 if an error happens.
+ * Return: TP status code if lock acquired, < 0 if an error happens.
  */
 static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
   struct futex_state *state)
@@ -3564,6 +3583,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,
"\tLock is now acquired by pid %d!\n"
 
int ret, loopcnt = 1;
+   int nsleep = 0;
bool handoff_set = false;
u32 uval;
u32 owner_pid = 0;
@@ -3623,6 +3643,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,
if (need_resched()) {
__set_current_state(TASK_RUNNING);
schedule_preempt_disabled();
+   nsleep++;
loopcnt = 0;
continue;
}
@@ -3693,6 +3714,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,
 */
if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS)) {
schedule_preempt_disabled();
+   nsleep++;
loopcnt = 0;
}
__set_current_state(TASK_RUNNING);
@@ -3719,7 +3741,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,

[PATCH-tip v5 07/21] futex: Add a new futex type field into futex_state

2017-02-03 Thread Waiman Long
As the futex_state structure will be overloaded in later patches
to be used by non-PI futexes, it is necessary to add a type field to
distinguish among different types of futexes.

Signed-off-by: Waiman Long 
---
 kernel/futex.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 25881e6..928212c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -193,6 +193,10 @@
 #define FLAGS_CLOCKRT  0x02
 #define FLAGS_HAS_TIMEOUT  0x04
 
+enum futex_type {
+   TYPE_PI = 0,
+};
+
 /*
  * Futex state object:
  *  - Priority Inheritance state
@@ -212,6 +216,7 @@ struct futex_state {
struct task_struct *owner;
atomic_t refcount;
 
+   enum futex_type type;
union futex_key key;
 };
 
@@ -905,13 +910,14 @@ static void put_futex_state(struct futex_state *state)
return;
 
/*
-* If state->owner is NULL, the owner is most probably dying
-* and has cleaned up the futex state already
+* If state->owner is NULL and the type is TYPE_PI, the owner is
+* most probably dying and has cleaned up the futex state already.
 */
if (state->owner) {
task_pi_list_del(state, false);
 
-   rt_mutex_proxy_unlock(>pi_mutex, state->owner);
+   if (state->type == TYPE_PI)
+   rt_mutex_proxy_unlock(>pi_mutex, state->owner);
}
 
if (current->pi_state_cache)
@@ -1064,7 +1070,7 @@ static int attach_to_pi_state(u32 uval, struct 
futex_state *pi_state,
/*
 * Userspace might have messed up non-PI and PI futexes [3]
 */
-   if (unlikely(!pi_state))
+   if (unlikely(!pi_state || (pi_state->type != TYPE_PI)))
return -EINVAL;
 
WARN_ON(!atomic_read(&pi_state->refcount));
@@ -1182,6 +1188,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key 
*key,
 
/* Store the key for possible exit cleanups: */
pi_state->key = *key;
+   pi_state->type = TYPE_PI;
 
WARN_ON(!list_empty(&pi_state->list));
list_add(&pi_state->list, &p->pi_state_list);
-- 
1.8.3.1



[PATCH-tip v5 13/21] TP-futex: Add timeout support

2017-02-03 Thread Waiman Long
Unlike the other futexes, TP futexes do not accept the specification
of a timeout value. To align with the other futexes, timeout support
is now added.  However, the timeout isn't as precise as for the other
futex types because timer expiration can't be detected while the
thread is waiting on the serialization mutex.

Signed-off-by: Waiman Long 
---
 kernel/futex.c | 50 ++
 1 file changed, 42 insertions(+), 8 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 22f7906..91b2e02 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3558,9 +3558,10 @@ static inline int futex_set_waiters_bit(u32 __user 
*uaddr, u32 *puval)
 
 /**
  * futex_spin_on_owner - Optimistically spin on futex owner
- * @uaddr: futex address
- * @vpid:  PID of current task
- * @state: futex state object
+ * @uaddr:   futex address
+ * @vpid:PID of current task
+ * @state:   futex state object
+ * @timeout: hrtimer_sleeper structure
  *
  * Spin on the futex word while the futex owner is active. Otherwise, set
  * the FUTEX_WAITERS bit and go to sleep.
@@ -3576,7 +3577,8 @@ static inline int futex_set_waiters_bit(u32 __user 
*uaddr, u32 *puval)
  * Return: TP status code if lock acquired, < 0 if an error happens.
  */
 static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
-  struct futex_state *state)
+  struct futex_state *state,
+  struct hrtimer_sleeper *timeout)
 {
 #define OWNER_DEAD_MESSAGE \
"futex: owner pid %d of TP futex 0x%lx was %s.\n"   \
@@ -3648,6 +3650,12 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,
continue;
}
 
+
+   if (timeout && !timeout->task) {
+   ret = -ETIMEDOUT;
+   break;
+   }
+
if (signal_pending(current)) {
ret = -EINTR;
break;
@@ -3755,10 +3763,18 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,
  * This function is not inlined so that it can show up separately in perf
  * profile for performance analysis purpose.
  *
+ * The timeout functionality isn't as precise as that of the other futex
+ * types, as timer expiration will not be detected while waiting in the
+ * serialization mutex. One possible solution is to modify the expiration
+ * function to also send out a signal so as to break the thread out of
+ * the mutex code if we really want a more precise timeout.
+ *
  * Return: TP status code if lock acquired, < 0 if an error happens.
  */
-static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
+static noinline int
+futex_lock(u32 __user *uaddr, unsigned int flags, ktime_t *time)
 {
+   struct hrtimer_sleeper timeout, *to;
struct futex_hash_bucket *hb;
union futex_key key = FUTEX_KEY_INIT;
struct futex_state *state;
@@ -3782,6 +3798,8 @@ static noinline int futex_lock(u32 __user *uaddr, 
unsigned int flags)
if (refill_futex_state_cache())
return -ENOMEM;
 
+   to = futex_setup_timer(time, &timeout, flags, current->timer_slack_ns);
+
ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
if (unlikely(ret))
goto out;
@@ -3807,6 +3825,9 @@ static noinline int futex_lock(u32 __user *uaddr, 
unsigned int flags)
goto out_put_state_key;
}
 
+   if (to)
+   hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
+
/*
 * Acquiring the serialization mutex.
 *
@@ -3818,10 +3839,19 @@ static noinline int futex_lock(u32 __user *uaddr, 
unsigned int flags)
goto out_put_state_key;
 
/*
+* Check for timeout.
+*/
+   if (to && !timeout.task) {
+   ret = -ETIMEDOUT;
+   mutex_unlock(&state->mutex);
+   goto out_put_state_key;
+   }
+
+   /*
 * As the mutex owner, we can now spin on the futex word as well as
 * the active-ness of the futex owner.
 */
-   ret = futex_spin_on_owner(uaddr, vpid, state);
+   ret = futex_spin_on_owner(uaddr, vpid, state, to);
 
mutex_unlock(&state->mutex);
 
@@ -3838,6 +3868,10 @@ static noinline int futex_lock(u32 __user *uaddr, 
unsigned int flags)
put_futex_key(&key);
 
 out:
+   if (to) {
+   hrtimer_cancel(&to->timer);
+   destroy_hrtimer_on_stack(&to->timer);
+   }
return ret;
 }
 
@@ -3991,7 +4025,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t 
*timeout,
return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
 #ifdef CONFIG_SMP
case FUTEX_LOCK:
-   return futex_lock(uaddr, flags);
+   return futex_lock(uaddr, flags, timeout);
case FUTEX_UNLOCK:
return 

[PATCH-tip v5 17/21] TP-futex: Group readers together in wait queue

2017-02-03 Thread Waiman Long
All the TP futex lock waiters are serialized in the kernel using a
kernel mutex which acts like a wait queue. The order in which the
waiters pop out from the wait queue will affect performance when
exclusive (writer) and shared (reader) lock waiters are mixed in the
queue. The worst case scenario will be something like RWRWRW... in
terms of ordering, where no parallelism in terms of lock ownership
can happen.

To improve throughput, the readers are now grouped together as a single
entity in the wait queue. The first reader that enters the mutex wait
queue becomes the leader of the group. The other readers will spin
on the group leader via an OSQ lock. When the futex is put into
shared mode, either by the group leader or by an external reader,
the spinning readers in the reader group will then acquire the read
lock successively.

The spinning readers in the group will get disbanded when the group
leader goes to sleep. In this case, all those readers will go into the
mutex wait queue alone and wait for their turn to acquire the TP futex.

On a 2-socket 36-core E5-2699 v3 system (HT off) running on a 4.10
based kernel, the performance of TP rwlock with 1:1 reader/writer
ratio versus one based on the wait-wake futexes as well as the Glibc
rwlock with a microbenchmark (1 worker thread per cpu core) running
for 10s were as follows:

                            WW futex    TP futex       Glibc
                            --------    --------       -----
 Total locking ops        35,707,234  58,645,434  10,930,422
 Per-thread avg/sec           99,149     162,887      30,362
 Per-thread min/sec           93,190      38,641      29,872
 Per-thread max/sec          104,213     225,983      30,708
 Write lock futex calls   11,161,534      14,094           -
 Write unlock futex calls  8,696,121         167           -
 Read lock futex calls     1,717,863       6,659           -
 Read unlock futex calls   4,316,418         323           -

It can be seen that the throughput of the TP futex is close to 2X
the WW futex and almost 6X the Glibc version in this particular case.

The following table shows the CPU cores scaling for the average
per-thread locking rates (ops/sec):

                          WW futex   TP futex     Glibc
                          --------   --------     -----
   9 threads (1 socket)    422,014    647,006   190,236
  18 threads (2 sockets)   197,145    330,353    66,934
  27 threads (2 sockets)   127,947    213,417    43,641
  36 threads (2 sockets)    99,149    162,887    30,362

The following tables shows the average per-thread locking rates
(36 threads) with different reader percentages:

                  WW futex   TP futex    Glibc
                  --------   --------    -----
    90% readers    124,426    159,657   60,208
    95% readers    148,905    152,315   68,666
   100% readers    210,029    191,759   84,316

The rwlocks based on the WW futexes and Glibc prefer readers by
default. So it can be seen that the performance of the WW futex and
Glibc rwlocks increased with higher reader percentages. The TP futex
rwlock, however, prefers writers a bit more than readers, so its
performance didn't increase as the reader percentage rose.

The WW futexes and Glibc rwlocks also have writer-preferring versions.
Their performance in the same tests is as follows:

                  WW futex   TP futex    Glibc
                  --------   --------    -----
    90% readers     87,866    159,657   26,057
    95% readers     93,193    152,315   32,611
   100% readers    193,267    191,759   88,440

With separate 18 reader and 18 writer threads, the average
per-thread reader and writer locking rates with different load
latencies (L, default = 1) and reader-preferring rwlocks are:

                       WW futex   TP futex     Glibc
                       --------   --------     -----
  Reader rate, L=1      381,411     74,059   164,184
  Writer rate, L=1            0    240,841         0

  Reader rate, L=5      330,732     57,361   150,691
  Writer rate, L=5            0    175,400         0

  Reader rate, L=50     304,505     17,355    97,066
  Writer rate, L=50           0    114,504         0

The corresponding locking rates with writer-preferring rwlocks are:

                       WW futex   TP futex     Glibc
                       --------   --------     -----
  Reader rate, L=1      138,805     74,059        54
  Writer rate, L=1       31,113    240,841    56,424

  Reader rate, L=5      114,414     57,361        24
  Writer rate, L=5       28,062    175,400    52,039

  Reader rate, L=50      88,619     51,483

[PATCH-tip v5 15/21] TP-futex: Support userspace reader/writer locks

2017-02-03 Thread Waiman Long
The TP futexes were designed to support userspace mutually exclusive
locks. They are now extended to support userspace reader/writer
locks as well.

Two new futex command codes are added:
 1) FUTEX_LOCK_SHARED - to acquire a shared lock (reader-lock)
 2) FUTEX_UNLOCK_SHARED - to release a shared lock (reader-unlock)

The original FUTEX_LOCK and FUTEX_UNLOCK command codes act as the
writer lock and unlock methods in a userspace rwlock.

A special bit FUTEX_SHARED (bit 29) is used to indicate if a lock is
an exclusive (writer) or shared (reader) lock. Because of that, the
maximum TID value is now reduced to 2^29 instead of 2^30. With a shared
lock, the lower 24 bits are used to track the reader count. Bit 28
is a special FUTEX_SHARED_UNLOCK bit that should be used by userspace
to prevent racing in the FUTEX_UNLOCK_SHARED futex(2) syscall. Bits
24-27 are reserved for future extension.

When a TP futex becomes reader-owned, there is no way to tell if the
readers are running or not. So a fixed time interval (50us) is now
used for spinning on a reader-owned futex. The top waiter will abort
the spinning and go to sleep when no activity is observed in the
futex value itself after the spinning timer has expired. Activity
here means any change in the reader count of the futex, indicating
that some new readers are coming in or old readers are leaving.

Sample code to acquire/release a reader lock in userspace is:

  void read_lock(int *faddr)
  {
int val = cmpxchg(faddr, 0, FUTEX_SHARED + 1);

if (!val)
return;
for (;;) {
int old = val, new;

if (!old)
new = FUTEX_SHARED + 1;
else if ((old & (~FUTEX_TID_SHARED_MASK|
  FUTEX_SHARED_UNLOCK)) ||
!(old & FUTEX_SHARED))
break;
else
new = old + 1;
val = cmpxchg(faddr, old, new);
if (val == old)
return;
}
while (futex(faddr, FUTEX_LOCK_SHARED, ...) < 0)
;
  }

  void read_unlock(int *faddr)
  {
int val = xadd(faddr, -1);
int old;

for (;;) {
if ((val & (FUTEX_SCNT_MASK|FUTEX_SHARED_UNLOCK)) ||
   !(val & FUTEX_SHARED))
return; /* Not the last reader or shared */
if (val & ~FUTEX_SHARED_TID_MASK) {
old = cmpxchg(faddr, val, val|FUTEX_SHARED_UNLOCK);
if (old == val)
break;
} else {
old = cmpxchg(faddr, val, 0);
if (old == val)
return;
}
val = old;
}
futex(faddr, FUTEX_UNLOCK_SHARED, ...);
  }

The read lock/unlock code is a bit more complicated than the
corresponding write lock/unlock code. This is by design to minimize
any additional overhead for mutex lock and unlock. As a result,
the TP futex rwlock prefers writers a bit more than readers.

Signed-off-by: Waiman Long 
---
 include/uapi/linux/futex.h |  28 +-
 kernel/futex.c | 225 ++---
 2 files changed, 218 insertions(+), 35 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 7f676ac..1a22ecc 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -22,6 +22,8 @@
 #define FUTEX_CMP_REQUEUE_PI   12
 #define FUTEX_LOCK 13
 #define FUTEX_UNLOCK   14
+#define FUTEX_LOCK_SHARED  15
+#define FUTEX_UNLOCK_SHARED16
 
 #define FUTEX_PRIVATE_FLAG 128
 #define FUTEX_CLOCK_REALTIME   256
@@ -43,6 +45,9 @@
 FUTEX_PRIVATE_FLAG)
 #define FUTEX_LOCK_PRIVATE (FUTEX_LOCK | FUTEX_PRIVATE_FLAG)
 #define FUTEX_UNLOCK_PRIVATE   (FUTEX_UNLOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_SHARED_PRIVATE  (FUTEX_LOCK_SHARED | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_SHARED_PRIVATE(FUTEX_UNLOCK_SHARED | \
+FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
@@ -111,9 +116,28 @@ struct robust_list_head {
#define FUTEX_OWNER_DIED   0x40000000
 
 /*
- * The rest of the robust-futex field is for the TID:
+ * Shared locking bit (for readers).
+ * Robust futexes cannot be used in combination with shared locking.
+ *
+ * The FUTEX_SHARED_UNLOCK bit can be set by userspace to indicate that
+ * a shared unlock is in progress and so no shared locking is allowed
+ * within the kernel.
+ */
+#define FUTEX_SHARED   0x20000000
+#define FUTEX_SHARED_UNLOCK0x10000000
+
+/*
+ * The rest of the bits are for the TID or, in the case of readers, the
+ * shared locking bit and the number of active readers present. IOW, the
+ * max supported 

[PATCH-tip v5 16/21] TP-futex: Enable kernel reader lock stealing

2017-02-03 Thread Waiman Long
The TP-futexes prefer writers as the code path is much simpler for
them. This may not be good for overall system performance especially
if there is a fair amount of reader activities.

This patch enables a kernel reader to steal the lock when the futex is
currently reader-owned and the lock handoff mechanism hasn't been
enabled yet. A new field, locksteal_disabled, is added to the futex
state object to control reader lock stealing. The waiting reader must
therefore retrieve the futex state object before attempting the steal.

Signed-off-by: Waiman Long 
---
 kernel/futex.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 1f97c85..90c8c80 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -239,6 +239,7 @@ struct futex_state {
atomic_t refcount;
 
u32 handoff_pid;/* For TP futexes only */
+   int locksteal_disabled; /* For TP futexes only */
 
enum futex_type type;
union futex_key key;
@@ -3438,6 +3439,7 @@ void exit_robust_list(struct task_struct *curr)
state = alloc_futex_state();
state->type = TYPE_TP;
state->key = *key;
+   state->locksteal_disabled = false;
list_add(&state->fs_list, &hb->fs_head);
WARN_ON(atomic_read(&state->refcount) != 1);
 
@@ -3783,9 +3785,10 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,
WRITE_ONCE(state->handoff_pid, vpid);
 
/*
-* Disable handoff check.
+* Disable handoff check and reader lock
+* stealing.
 */
-   handoff_set = true;
+   state->locksteal_disabled = handoff_set = true;
}
}
 
@@ -3888,6 +3891,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,
 */
WRITE_ONCE(state->mutex_owner, NULL);
WRITE_ONCE(state->handoff_pid, 0);
+   state->locksteal_disabled = false;
 
preempt_enable();
return (ret < 0) ? ret : TP_STATUS_SLEEP(ret, nsleep);
@@ -3968,6 +3972,21 @@ static noinline int futex_lock(u32 __user *uaddr, 
unsigned int flags,
goto out_put_state_key;
}
 
+   /*
+* For a reader, we will try to steal the lock here as if it were
+* the top waiter, without taking the serialization mutex, when
+* reader lock stealing isn't disabled, even though the
+* FUTEX_WAITERS bit is set.
+*/
+   if (shared && !state->locksteal_disabled) {
+   ret = futex_trylock(uaddr, vpid, &uval, true);
+   if (ret) {
+   if (ret > 0)
+   ret = TP_LOCK_STOLEN;
+   goto out_put_state_key;
+   }
+   }
+
if (to)
hrtimer_start_expires(>timer, HRTIMER_MODE_ABS);
 
-- 
1.8.3.1



[PATCH-tip v5 21/21] futex: Dump internal futex state via debugfs

2017-02-03 Thread Waiman Long
For debugging purposes, it is sometimes useful to dump the internal
states in the futex hash bucket table. This patch adds a file,
"futex_hash_table", in the debugfs root filesystem to dump the
internal futex states.

Signed-off-by: Waiman Long 
---
 kernel/futex.c | 82 ++
 1 file changed, 82 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 6bf5304..8b7a591 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -4428,3 +4428,85 @@ static int __init futex_init(void)
return 0;
 }
 __initcall(futex_init);
+
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_SMP)
+/*
+ * Debug code to dump selected content of in-kernel futex hash bucket table.
+ */
+#include 
+
+static int futex_dump_show(struct seq_file *m, void *arg)
+{
+   struct futex_hash_bucket *hb = arg;
+   struct futex_state *state;
+   int i;
+
+   if (list_empty(&hb->fs_head))
+   return 0;
+
+   seq_printf(m, "\nHash bucket %d:\n", (int)(hb - futex_queues));
+   spin_lock(&hb->fs_lock);
+   i = 0;
+   list_for_each_entry(state, &hb->fs_head, fs_list) {
+   seq_printf(m, "  Futex state %d\n", i++);
+   if (state->owner)
+   seq_printf(m, "owner PID = %d\n",
+  task_pid_vnr(state->owner));
+   if (state->mutex_owner)
+   seq_printf(m, "mutex owner PID = %d\n",
+  task_pid_vnr(state->mutex_owner));
+   seq_printf(m, "reference count = %d\n",
+  atomic_read(&state->refcount));
+   seq_printf(m, "handoff PID = %d\n", state->handoff_pid);
+   }
+   spin_unlock(&hb->fs_lock);
+   return 0;
+}
+
+static void *futex_dump_start(struct seq_file *m, loff_t *pos)
+{
+   return (*pos < futex_hashsize) ? &futex_queues[*pos] : NULL;
+}
+
+static void *futex_dump_next(struct seq_file *m, void *arg, loff_t *pos)
+{
+   (*pos)++;
+   return (*pos < futex_hashsize) ? &futex_queues[*pos] : NULL;
+}
+
+static void futex_dump_stop(struct seq_file *m, void *arg)
+{
+}
+
+static const struct seq_operations futex_dump_op = {
+   .start = futex_dump_start,
+   .next  = futex_dump_next,
+   .stop  = futex_dump_stop,
+   .show  = futex_dump_show,
+};
+
+static int futex_dump_open(struct inode *inode, struct file *file)
+{
+   return seq_open(file, &futex_dump_op);
+}
+
+static const struct file_operations fops_futex_dump = {
+   .open= futex_dump_open,
+   .read= seq_read,
+   .llseek  = seq_lseek,
+   .release = seq_release,
+};
+
+/*
+ * Initialize debugfs for the futex hash bucket table dump.
+ */
+static int __init init_futex_dump(void)
+{
+   if (!debugfs_create_file("futex_hash_table", 0400, NULL, NULL,
+&fops_futex_dump))
+   return -ENOMEM;
+   return 0;
+}
+fs_initcall(init_futex_dump);
+
+#endif /* CONFIG_DEBUG_FS && CONFIG_SMP */
-- 
1.8.3.1



[PATCH-tip v5 18/21] TP-futex, doc: Update TP futexes document on shared locking

2017-02-03 Thread Waiman Long
The tp-futex.txt document is updated to add a description of the
shared locking support.

Signed-off-by: Waiman Long 
---
 Documentation/tp-futex.txt | 163 +++--
 1 file changed, 143 insertions(+), 20 deletions(-)

diff --git a/Documentation/tp-futex.txt b/Documentation/tp-futex.txt
index 5324ee0..d4f0ed6 100644
--- a/Documentation/tp-futex.txt
+++ b/Documentation/tp-futex.txt
@@ -68,22 +68,58 @@ the kernel using the FUTEX_LOCK futex(2) syscall.
 Only the optional timeout parameter is being used by this futex(2)
 syscall.
 
-Inside the kernel, a kernel mutex is used for serialization among
-the futex waiters. Only the top lock waiter which is the owner of
-the serialization mutex is allowed to continuously spin and attempt
-to acquire the lock.  Other lock waiters will have one attempt to
-steal the lock before entering the mutex queues.
+A shared lock acquirer, on the other hand, has to perform one of
+the following 2 actions depending on the current value of the futex:
+
+ 1) If current futex value is 0, atomically put (FUTEX_SHARED + 1)
+into the lower 30 bits of the futex.
+
+ 2) If the FUTEX_SHARED bit is set and the FUTEX_SHARED_UNLOCK bit
+isn't, atomically increment the reader count in the lower 24 bits.
+The other unused bits (24-27) can be used for other management purposes
+and will be ignored by the kernel.
+
+Failure of both of the above actions will lead to the following
+new FUTEX_LOCK_SHARED futex(2) syscall.
+
+  futex(uaddr, FUTEX_LOCK_SHARED, 0, timeout, NULL, 0);
+
+The special FUTEX_SHARED bit (bit 29) is used to distinguish between a
+shared lock and an exclusive lock. If the FUTEX_SHARED bit isn't set,
+the lower 29 bits contain the TID of the exclusive lock owner. If
+the FUTEX_SHARED bit is set, the lower 29 bits contain the number of
+tasks owning the shared lock.  Therefore, only 29 bits are allowed
+to be used by the TID.
+
+The FUTEX_SHARED_UNLOCK bit can be used by userspace to prevent racing
+and to indicate that a shared unlock is in progress as it will disable
+any shared trylock attempt in the kernel.
+
+Inside the kernel, a kernel mutex is used for serialization among the
+futex waiters. Only the top lock waiter, which is the owner of the
+serialization mutex, is allowed to continuously spin and attempt to
+acquire the lock.  Other lock waiters will have one or two attempts
+to steal the lock before entering the mutex queues.
 
 When the exclusive futex lock owner is no longer running, the top
 waiter will set the FUTEX_WAITERS bit before going to sleep. This is
 to make sure the futex owner will go into the kernel at unlock time
 to wake up the top waiter.
 
+When the futex is shared by multiple owners, there is no way to
+determine if all the shared owners are running or not. In this case,
+the top waiter will spin for a fixed amount of time if no reader
+activity is observed before giving up and going to sleep. Any change
+in the reader count will be considered reader activity and so will
+reset the timer. However, when the hand-off threshold has been reached,
+reader optimistic spinning timer reset will be disabled automatically.
+
 The return values of the above futex locking syscall, if non-negative,
-are status code that consists of 2 fields - the lock acquisition code
-(bits 0-7) and the number of sleeps (bits 8-30) in the optimistic
-spinning loop before acquiring the futex. A negative returned value
-means an error has happened.
+are a status code that consists of 3 fields - the lock acquisition code
+(bits 0-7) and the number of sleeps (bits 16-30) in the optimistic
+spinning loop before acquiring the futex. Bits 8-15 are reserved
+for future extension. A negative returned value means an error has
+happened.
 
 The lock acquisition code can have the following values:
  a) 0   - lock stolen as non-top waiter
@@ -96,14 +132,21 @@ issue a FUTEX_UNLOCK futex(2) syscall to wake up the top 
waiter.
 
   futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);
 
-A return value of 1 from the FUTEX_UNLOCK futex(2) syscall indicates
-a task has been woken up. The syscall returns 0 if no sleeping task
-is woken. A negative value will be returned if an error happens.
+For a shared lock owner, unlock is done by atomically decrementing
+the reader count by one. If the resultant futex value has no reader
+remaining, the process has to atomically change the futex value from
+FUTEX_SHARED to 0. If that fails, it has to issue a FUTEX_UNLOCK_SHARED
+futex(2) syscall to wake up the top waiter.
+
+A return value of 1 from the FUTEX_UNLOCK or FUTEX_UNLOCK_SHARED
+futex(2) syscall indicates a task has been woken up. The syscall
+returns 0 if no sleeping task is woken. A negative value will be
+returned if an error happens.
 
-The error number returned by a FUTEX_UNLOCK syscall on an empty futex
-can be used to decide if the TP futex functionality is implemented
-in the kernel. If it is present, an EPERFM error will be returned.

[PATCH-tip v5 10/21] TP-futex: Enable robust handling

2017-02-03 Thread Waiman Long
The TP futexes don't have code to handle the death of futex
owners. There are 2 different cases that need to be considered.

As the top waiter gets a reference to the task structure of the futex
owner, the task structure will never go away even if the owner dies.
If the futex owner dies while the top waiter is spinning, the task
structure will be marked dead, or the pid won't have a matching task
structure if the task died before a reference was taken. Alternatively,
if the robust futex attribute is enabled, the FUTEX_OWNER_DIED bit of
the futex word may also be set. In all those cases, all the top waiter
needs to do is grab the futex directly. An informational message
will be printed to highlight this event.

If the futex owner dies while the top waiter is sleeping, we need to
make the exit processing code wake up the top waiter. This is done
by chaining the futex state object into the pi_state_list of the futex
owner before the top waiter sleeps so that, if exit_pi_state_list()
is called, the wakeup will happen. The top waiter needs to remove
its futex state object from the pi_state_list of the old owner if
the ownership changes hands or when the lock is acquired.

Signed-off-by: Waiman Long 
---
 kernel/futex.c | 85 +-
 1 file changed, 79 insertions(+), 6 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 6a59e6d..46a1a4b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1006,7 +1006,7 @@ static struct task_struct * futex_find_get_task(pid_t pid)
 }
 
 /*
- * This task is holding PI mutexes at exit time => bad.
+ * This task is holding PI or TP mutexes at exit time => bad.
  * Kernel cleans up PI-state, but userspace is likely hosed.
  * (Robust-futex cleanup is separate and might save the day for userspace.)
  */
@@ -1023,12 +1023,31 @@ void exit_pi_state_list(struct task_struct *curr)
 * We are a ZOMBIE and nobody can enqueue itself on
 * pi_state_list anymore, but we have to be careful
 * versus waiters unqueueing themselves:
+*
+* For TP futexes, the only purpose of showing up in the
+* pi_state_list is for this function to wake up the serialization
+* mutex owner (state->mutex_owner). We don't actually need to take
+* the HB lock. The futex state and task struct won't go away as long
+* as we hold the pi_lock.
 */
raw_spin_lock_irq(&curr->pi_lock);
while (!list_empty(head)) {
 
next = head->next;
pi_state = list_entry(next, struct futex_state, list);
+
+   if (pi_state->type == TYPE_TP) {
+   struct task_struct *owner;
+
+   owner = READ_ONCE(pi_state->mutex_owner);
+   WARN_ON(list_empty(&pi_state->list));
+   list_del_init(&pi_state->list);
+   pi_state->owner = NULL;
+   if (owner)
+   wake_up_process(owner);
+   continue;
+   }
+
key = pi_state->key;
hb = hash_futex(&key);
raw_spin_unlock_irq(&curr->pi_lock);
@@ -3185,8 +3204,8 @@ int handle_futex_death(u32 __user *uaddr, struct 
task_struct *curr, int pi)
goto retry;
 
/*
-* Wake robust non-PI futexes here. The wakeup of
-* PI futexes happens in exit_pi_state():
+* Wake robust wait-wake futexes here. The wakeup of
+* PI and TP futexes happens in exit_pi_state():
 */
if (!pi && (uval & FUTEX_WAITERS))
futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
@@ -3327,6 +3346,12 @@ void exit_robust_list(struct task_struct *curr)
  * Unlike the other futexes, the futex_q structures aren't used. Instead,
  * they will queue up in the serialization mutex of the futex state container
  * queued in the hash bucket.
+ *
+ * To handle the exceptional case that the futex owner died, the robust
+ * futexes list mechanism is used for waking up the sleeping top waiter.
+ * Checks are also made in the futex_spin_on_owner() loop for a dead task
+ * structure or an invalid pid. In both cases, the top waiter will take over
+ * the ownership of the futex.
  */
 
 /**
@@ -3515,6 +3540,10 @@ static inline int futex_set_waiters_bit(u32 __user 
*uaddr, u32 *puval)
 static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
   struct futex_state *state)
 {
+#define OWNER_DEAD_MESSAGE \
+   "futex: owner pid %d of TP futex 0x%lx was %s.\n"   \
+   "\tLock is now acquired by pid %d!\n"
+
int ret;
u32 uval;
u32 owner_pid = 0;
@@ -3529,13 +3558,47 @@ static int futex_spin_on_owner(u32 __user *uaddr, const 
u32 vpid,
break;
 
if ((uval & FUTEX_TID_MASK) != 

[PATCH-tip v5 19/21] perf bench: Extend mutex/rwlock futex suite to test TP futexes

2017-02-03 Thread Waiman Long
This patch extends the futex-mutex and futex-rwlock microbenchmarks to
test userspace mutexes and rwlocks built on top of the TP futexes. We
can then compare the relative performance of those userspace locks
based on different type of futexes.

Signed-off-by: Waiman Long 
---
 tools/perf/bench/futex-locks.c | 258 +++--
 tools/perf/bench/futex.h   |  43 +++
 tools/perf/check-headers.sh|   4 +
 3 files changed, 298 insertions(+), 7 deletions(-)

diff --git a/tools/perf/bench/futex-locks.c b/tools/perf/bench/futex-locks.c
index 18dc71d..42be894 100644
--- a/tools/perf/bench/futex-locks.c
+++ b/tools/perf/bench/futex-locks.c
@@ -8,12 +8,14 @@
  * indication of actual throughput of the mutex code as it may not really
  * need to call into the kernel. Therefore, 3 sets of simple mutex lock and
 * unlock functions are written to implement a mutex lock using the
- * wait-wake (2 versions) and PI futexes respectively. These functions serve
- * as the basis for measuring the locking throughput.
+ * wait-wake, PI and TP futexes respectively. These functions serve as the
+ * basis for measuring the locking throughput.
  *
- * Two sets of simple reader/writer lock and unlock functions are also
- * implemented using the wait-wake futexes as well as the Glibc rwlock
- * respectively for performance measurement purpose.
+ * Three sets of simple reader/writer lock and unlock functions are also
+ * implemented using the wait-wake and TP futexes as well as the Glibc
+ * rwlock respectively for performance measurement purposes. The TP futex
+ * writer lock/unlock functions are the same as that of the mutex functions.
+ * That is not the case for the wait-wake futexes.
  */
 
 #include 
@@ -57,6 +59,12 @@ enum {
STAT_SSLEEPS,   /* # of shared lock sleeps  */
STAT_EAGAINS,   /* # of EAGAIN errors   */
STAT_WAKEUPS,   /* # of wakeups (unlock return) */
+   STAT_HANDOFFS,  /* # of lock handoff (TP only)  */
+   STAT_STEALS,/* # of lock steals (TP only)   */
+   STAT_SLRETRIES, /* # of shared lock retries */
+   STAT_SURETRIES, /* # of shared unlock retries   */
+   STAT_SALONES,   /* # of shared locking alone ops*/
+   STAT_SGROUPS,   /* # of shared locking group ops*/
STAT_LOCKERRS,  /* # of exclusive lock errors   */
STAT_UNLKERRS,  /* # of exclusive unlock errors */
STAT_SLOCKERRS, /* # of shared lock errors  */
@@ -192,7 +200,7 @@ static inline void stat_inc(int tid __maybe_unused, int 
item __maybe_unused)
  */
 static const struct option mutex_options[] = {
OPT_INTEGER ('d', "locklat",,  "Specify inter-locking 
latency (default = 1)"),
-   OPT_STRING  ('f', "ftype",  ,"type", "Specify futex type: 
WW, PI, all (default)"),
+   OPT_STRING  ('f', "ftype",  ,"type", "Specify futex type: 
WW, PI, TP, all (default)"),
OPT_INTEGER ('L', "loadlat",,  "Specify load latency 
(default = 1)"),
OPT_UINTEGER('r', "runtime",,"Specify runtime (in 
seconds, default = 10s)"),
OPT_BOOLEAN ('S', "shared", ,  "Use shared futexes instead 
of private ones"),
@@ -210,7 +218,7 @@ static inline void stat_inc(int tid __maybe_unused, int 
item __maybe_unused)
 
 static const struct option rwlock_options[] = {
OPT_INTEGER ('d', "locklat",,  "Specify inter-locking 
latency (default = 1)"),
-   OPT_STRING  ('f', "ftype",  ,"type", "Specify futex type: 
WW, GC, all (default)"),
+   OPT_STRING  ('f', "ftype",  ,"type", "Specify futex type: 
WW, TP, GC, all (default)"),
OPT_INTEGER ('L', "loadlat",,  "Specify load latency 
(default = 1)"),
OPT_UINTEGER('R', "read-%", , "Specify reader percentage 
(default 50%)"),
OPT_UINTEGER('r', "runtime",,"Specify runtime (in 
seconds, default = 10s)"),
@@ -504,6 +512,72 @@ static void pi_mutex_unlock(futex_t *futex, int tid)
stat_inc(tid, STAT_UNLOCKS);
 }
 
+/*
+ * TP futex lock/unlock functions
+ */
+static void tp_mutex_lock(futex_t *futex, int tid)
+{
+   struct timespec stime, etime;
+   futex_t val;
+   int ret;
+
+   val = futex_cmpxchg(futex, 0, thread_id);
+   if (val == 0)
+   return;
+
+   /*
+* Retry if an error happens
+*/
+   for (;;) {
+   if (timestat) {
+   clock_gettime(CLOCK_REALTIME, &stime);
+   ret = futex_lock(futex, NULL, flags);
+   clock_gettime(CLOCK_REALTIME, &etime);
+   systime_add(tid, TIME_LOCK, &stime, &etime);
+   } else {
+   ret = futex_lock(futex, NULL, flags);
+   }
+   stat_inc(tid, STAT_LOCKS);
+   if (ret >= 0)
+   break;
+   stat_inc(tid, 

[PATCH-tip v5 09/21] futex: Introduce throughput-optimized (TP) futexes

2017-02-03 Thread Waiman Long
A new futex implementation called throughput-optimized (TP) futexes
is introduced. The goal of this new futex type is to maximize locking
throughput at the expense of fairness and deterministic latency. Its
throughput is higher than that of the wait-wake futexes especially
on systems with a large number of CPUs and for workloads where the
lock owners are unlikely to sleep.  The downside of this new futex
is the increase in the response time variance.

The new TP futex type is designed to be used for mutex-like exclusive
user space locks. It is not as versatile as the wait-wake futexes which
can be used for other locking primitives like conditional variables,
or read-write locks. So it is not a direct replacement of wait-wake
futexes.
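Userspace usage of the new FUTEX_LOCK/FUTEX_UNLOCK opcodes might look roughly like the following sketch. The opcode values come from the uapi additions in this patch; the wrapper names are invented, and error handling, waiter accounting in the futex word, and the robust attribute are all elided:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define FUTEX_LOCK   13		/* from this patch's uapi additions */
#define FUTEX_UNLOCK 14

/* Fast path: take a free lock by writing our tid into the futex word. */
static int tp_mutex_trylock(_Atomic uint32_t *futex, uint32_t tid)
{
	uint32_t expected = 0;
	return atomic_compare_exchange_strong(futex, &expected, tid);
}

static void tp_mutex_lock(_Atomic uint32_t *futex, uint32_t tid)
{
	if (tp_mutex_trylock(futex, tid))
		return;		/* uncontended: no kernel entry */
	/* contended: the kernel spins on or sleeps for the owner */
	while (syscall(SYS_futex, futex, FUTEX_LOCK, 0, NULL, NULL, 0) < 0)
		;		/* a real implementation would inspect errno */
}

static void tp_mutex_unlock(_Atomic uint32_t *futex, uint32_t tid)
{
	uint32_t expected = tid;
	if (atomic_compare_exchange_strong(futex, &expected, 0))
		return;		/* no waiters recorded in the futex word */
	syscall(SYS_futex, futex, FUTEX_UNLOCK, 0, NULL, NULL, 0);
}
```

This also shows why the benchmark below records so few lock/unlock futex calls for TP: with a spinning kernel slow path, most handoffs resolve while the previous owner is still running.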

A futex locking microbenchmark was written where separate userspace
mutexes are implemented using wait-wake (WW), PI and TP futexes
respectively. This microbenchmark was instructed to run 10s of
locking operations with a load latency of 5 as the critical section
on a 2-socket 36-core E5-2699 v3 system (HT off).  The system was
running a 4.10 based kernel.  The results of the benchmark runs
were as follows:

                        WW futex    PI futex    TP futex
                        --------    --------    --------
Total locking ops     43,607,917     303,047  54,812,918
Per-thread avg/sec       121,129         841     152,242
Per-thread min/sec       114,842         841      37,440
Per-thread max/sec       127,201         842     455,677
% Stddev                   0.50%       0.00%       9.89%
lock futex calls      10,262,001     303,041     134,989
unlock futex calls    12,956,025     303,042          21

The TP futexes had better throughput, but they also had larger
variance.

For the TP futexes, the heavy hitters in the perf profile were:

 98.18%   7.25% [kernel.vmlinux] [k] futex_lock
 87.31%  87.31% [kernel.vmlinux] [k] osq_lock
  3.62%   3.62% [kernel.vmlinux] [k] mutex_spin_on_owner

For the WW futexes, the heavy hitters were:

 86.38%  86.38% [kernel.vmlinux] [k] queued_spin_lock_slowpath
 48.39%   2.65% [kernel.vmlinux] [k] futex_wake
 45.13%   3.57% [kernel.vmlinux] [k] futex_wait_setup

By increasing the load latency, the performance discrepancy between
WW and TP futexes actually increases, as shown in the table below.

  Load Latency  WW locking ops  TP locking ops  % change
  ------------  --------------  --------------  --------
        5         43,607,917      54,812,918      +26%
       10         32,822,045      45,685,420      +39%
       20         23,841,429      39,439,048      +65%
       30         18,360,775      31,494,788      +72%
       40         15,383,536      27,687,160      +80%
       50         13,290,368      23,096,937      +74%
      100          8,577,763      14,410,909      +68%
   1us sleep         176,450         179,660       +2%

Signed-off-by: Waiman Long 
---
 include/uapi/linux/futex.h |   4 +
 kernel/futex.c | 559 -
 2 files changed, 560 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0b1f716..7f676ac 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,8 @@
 #define FUTEX_WAKE_BITSET  10
 #define FUTEX_WAIT_REQUEUE_PI  11
 #define FUTEX_CMP_REQUEUE_PI   12
+#define FUTEX_LOCK 13
+#define FUTEX_UNLOCK   14
 
 #define FUTEX_PRIVATE_FLAG 128
 #define FUTEX_CLOCK_REALTIME   256
@@ -39,6 +41,8 @@
 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE   (FUTEX_CMP_REQUEUE_PI | \
 FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PRIVATE (FUTEX_LOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PRIVATE   (FUTEX_UNLOCK | FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index 2ced6c0..6a59e6d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -30,6 +30,9 @@
  *  "The futexes are also cursed."
  *  "But they come in a choice of three flavours!"
  *
+ *  TP-futex support started by Waiman Long
+ *  Copyright (C) 2016-2017 Red Hat, Inc., Waiman Long 
+ *
  *  This program is free software; you can redistribute it and/or modify
  *  it under the terms of the GNU General Public License as published by
  *  the Free Software Foundation; either version 2 of the License, or
@@ -195,6 +198,7 @@
 
 enum futex_type {
TYPE_PI = 0,
+   TYPE_TP,
 };
 
 /*
@@ -214,11 +218,23 @@ struct futex_state {
struct list_head fs_list;
 
/*
-* The PI object:
+* The PI or mutex object:
 */
-   struct rt_mutex pi_mutex;
+   union {
+   struct rt_mutex pi_mutex;
+   struct mutex mutex;
+   };
 
+   /*
+* For both TP/PI futexes, 

[PATCH-tip v5 20/21] sched, TP-futex: Make wake_up_q() return wakeup count

2017-02-03 Thread Waiman Long
Unlike wake_up_process(), wake_up_q() doesn't tell us how many
tasks have been woken up. This information can sometimes be useful
for tracking purposes, so wake_up_q() is now modified to return that
information.

Signed-off-by: Waiman Long 
---
 include/linux/sched/wake_q.h | 2 +-
 kernel/futex.c   | 8 +++-
 kernel/sched/core.c  | 6 --
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 9a32f17..2b2bc9d 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -49,6 +49,6 @@ static inline void wake_q_init(struct wake_q_head *head)
 
 extern void wake_q_add(struct wake_q_head *head,
   struct task_struct *task);
-extern void wake_up_q(struct wake_q_head *head);
+extern int wake_up_q(struct wake_q_head *head);
 
 #endif /* _LINUX_SCHED_WAKE_Q_H */
diff --git a/kernel/futex.c b/kernel/futex.c
index 16c63b8..6bf5304 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -4257,11 +4257,9 @@ static int futex_unlock(u32 __user *uaddr, unsigned int 
flags,
 out_put_key:
put_futex_key(&key);
if (owner) {
-   /*
-* No error would have happened if owner defined.
-*/
-   wake_up_q(&wake_q);
-   return ret ? ret : 1;
+   int cnt = wake_up_q(&wake_q);
+
+   return ret ? ret : cnt;
}
 
return ret;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 679093d..668a094 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -452,9 +452,10 @@ void wake_q_add(struct wake_q_head *head, struct 
task_struct *task)
head->lastp = &node->next;
 }
 
-void wake_up_q(struct wake_q_head *head)
+int wake_up_q(struct wake_q_head *head)
 {
struct wake_q_node *node = head->first;
+   int wakecnt = 0;
 
while (node != WAKE_Q_TAIL) {
struct task_struct *task;
@@ -469,9 +470,10 @@ void wake_up_q(struct wake_q_head *head)
 * wake_up_process() implies a wmb() to pair with the queueing
 * in wake_q_add() so as not to miss wakeups.
 */
-   wake_up_process(task);
+   wakecnt += wake_up_process(task);
put_task_struct(task);
}
+   return wakecnt;
 }
 
 /*
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2 0/2] Hardening configs refactor/rename

2017-02-03 Thread Laura Abbott
Hi,

This is a follow up to my proposal to rename/refactor CONFIG_DEBUG_RODATA
and CONFIG_DEBUG_SET_MODULE_RONX. Among other objections, there shouldn't
be 'debug' in the name since these provide necessary kernel protection.

v2 takes a slightly different approach to this per feedback. Patch #1 moves
CONFIG_DEBUG_RODATA and CONFIG_DEBUG_SET_MODULE_RONX to a common arch config.
These configs are def_bool y for every arch except !CPU_V7 for arm
CONFIG_DEBUG_RODATA. I think this also mitigates another concern about changing
the name since these are basically internal configs at this point and not end
user selectable. Patch #2 does the rename to something more descriptive.
Hopefully this should separate the discussion more clearly into two parts
(refactor and rename).

Thanks,
Laura


Laura Abbott (2):
  arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
  arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX

 Documentation/DocBook/kgdb.tmpl|  8 
 Documentation/security/self-protection.txt |  4 ++--
 arch/Kconfig   | 28 
 arch/arm/Kconfig   |  3 +++
 arch/arm/Kconfig.debug | 11 ---
 arch/arm/configs/aspeed_g4_defconfig   |  3 +--
 arch/arm/configs/aspeed_g5_defconfig   |  3 +--
 arch/arm/include/asm/cacheflush.h  |  2 +-
 arch/arm/kernel/patch.c|  4 ++--
 arch/arm/kernel/vmlinux.lds.S  |  8 
 arch/arm/mm/Kconfig| 14 +-
 arch/arm/mm/init.c |  4 ++--
 arch/arm64/Kconfig |  5 ++---
 arch/arm64/Kconfig.debug   | 13 +
 arch/arm64/kernel/insn.c   |  2 +-
 arch/parisc/Kconfig|  1 +
 arch/parisc/Kconfig.debug  | 11 ---
 arch/parisc/configs/712_defconfig  |  1 -
 arch/parisc/configs/c3000_defconfig|  1 -
 arch/parisc/mm/init.c  |  2 +-
 arch/s390/Kconfig  |  5 ++---
 arch/s390/Kconfig.debug|  3 ---
 arch/x86/Kconfig   |  5 ++---
 arch/x86/Kconfig.debug | 11 ---
 include/linux/filter.h |  4 ++--
 include/linux/init.h   |  4 ++--
 include/linux/module.h |  2 +-
 init/main.c|  4 ++--
 kernel/configs/android-recommended.config  |  2 +-
 kernel/module.c|  6 +++---
 kernel/power/hibernate.c   |  2 +-
 kernel/power/power.h   |  4 ++--
 kernel/power/snapshot.c|  4 ++--
 33 files changed, 75 insertions(+), 109 deletions(-)

-- 
2.7.4



[PATCHv2 2/2] arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX

2017-02-03 Thread Laura Abbott

Both of these options are poorly named. The features they provide are
necessary for system security and should not be considered debug only.
Change the name to something that accurately describes what these
options do.

Signed-off-by: Laura Abbott 
---
v2: This patch is now doing the renaming of CONFIG_DEBUG_RODATA and
CONFIG_DEBUG_MODULE_RONX
---
 Documentation/DocBook/kgdb.tmpl| 8 
 Documentation/security/self-protection.txt | 4 ++--
 arch/Kconfig   | 4 ++--
 arch/arm/configs/aspeed_g4_defconfig   | 3 +--
 arch/arm/configs/aspeed_g5_defconfig   | 3 +--
 arch/arm/include/asm/cacheflush.h  | 2 +-
 arch/arm/kernel/patch.c| 4 ++--
 arch/arm/kernel/vmlinux.lds.S  | 8 
 arch/arm/mm/Kconfig| 2 +-
 arch/arm/mm/init.c | 4 ++--
 arch/arm64/Kconfig.debug   | 2 +-
 arch/arm64/kernel/insn.c   | 2 +-
 arch/parisc/configs/712_defconfig  | 1 -
 arch/parisc/configs/c3000_defconfig| 1 -
 arch/parisc/mm/init.c  | 2 +-
 include/linux/filter.h | 4 ++--
 include/linux/init.h   | 4 ++--
 include/linux/module.h | 2 +-
 init/main.c| 4 ++--
 kernel/configs/android-recommended.config  | 2 +-
 kernel/module.c| 6 +++---
 kernel/power/hibernate.c   | 2 +-
 kernel/power/power.h   | 4 ++--
 kernel/power/snapshot.c| 4 ++--
 24 files changed, 39 insertions(+), 43 deletions(-)

diff --git a/Documentation/DocBook/kgdb.tmpl b/Documentation/DocBook/kgdb.tmpl
index f3abca7..856ac20 100644
--- a/Documentation/DocBook/kgdb.tmpl
+++ b/Documentation/DocBook/kgdb.tmpl
@@ -115,12 +115,12 @@
 
 
 If the architecture that you are using supports the kernel option
-CONFIG_DEBUG_RODATA, you should consider turning it off.  This
+CONFIG_STRICT_KERNEL_RWX, you should consider turning it off.  This
 option will prevent the use of software breakpoints because it
 marks certain regions of the kernel's memory space as read-only.
 If kgdb supports it for the architecture you are using, you can
 use hardware breakpoints if you desire to run with the
-CONFIG_DEBUG_RODATA option turned on, else you need to turn off
+CONFIG_STRICT_KERNEL_RWX option turned on, else you need to turn off
 this option.
 
 
@@ -135,7 +135,7 @@
 Here is an example set of .config symbols to enable or
 disable for kgdb:
 
-# CONFIG_DEBUG_RODATA is not set
+# CONFIG_STRICT_KERNEL_RWX is not set
 CONFIG_FRAME_POINTER=y
 CONFIG_KGDB=y
 CONFIG_KGDB_SERIAL_CONSOLE=y
@@ -166,7 +166,7 @@
 
 Here is an example set of .config symbols to enable/disable kdb:
 
-# CONFIG_DEBUG_RODATA is not set
+# CONFIG_STRICT_KERNEL_RWX is not set
 CONFIG_FRAME_POINTER=y
 CONFIG_KGDB=y
 CONFIG_KGDB_SERIAL_CONSOLE=y
diff --git a/Documentation/security/self-protection.txt 
b/Documentation/security/self-protection.txt
index 3010576..dd2a3b1 100644
--- a/Documentation/security/self-protection.txt
+++ b/Documentation/security/self-protection.txt
@@ -51,8 +51,8 @@ kernel, they are implemented in a way where the memory is 
temporarily
 made writable during the update, and then returned to the original
 permissions.)
 
-In support of this are (the poorly named) CONFIG_DEBUG_RODATA and
-CONFIG_DEBUG_SET_MODULE_RONX, which seek to make sure that code is not
+In support of this are CONFIG_STRICT_KERNEL_RWX and
+CONFIG_STRICT_MODULE_RWX, which seek to make sure that code is not
 writable, data is not executable, and read-only data is neither writable
 nor executable.
 
diff --git a/arch/Kconfig b/arch/Kconfig
index 22ee01e..406f6cd 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -787,7 +787,7 @@ config ARCH_NO_STRICT_RWX_DEFAULTS
 config ARCH_HAS_STRICT_KERNEL_RWX
def_bool n
 
-config DEBUG_RODATA
+config STRICT_KERNEL_RWX
def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
prompt "Make kernel text and rodata read-only" if 
ARCH_NO_STRICT_RWX_DEFAULTS
depends on ARCH_HAS_STRICT_KERNEL_RWX
@@ -800,7 +800,7 @@ config DEBUG_RODATA
 config ARCH_HAS_STRICT_MODULE_RWX
def_bool n
 
-config DEBUG_SET_MODULE_RONX
+config STRICT_MODULE_RWX
def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
prompt "Set loadable kernel module data as NX and text as RO" if 
ARCH_NO_STRICT_RWX_DEFAULTS
depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES
diff --git a/arch/arm/configs/aspeed_g4_defconfig 
b/arch/arm/configs/aspeed_g4_defconfig
index ca39c04..beea2cc 100644
--- a/arch/arm/configs/aspeed_g4_defconfig
+++ b/arch/arm/configs/aspeed_g4_defconfig
@@ -25,7 +25,6 @@ CONFIG_MODULE_UNLOAD=y
 # CONFIG_ARCH_MULTI_V7 is not set
 CONFIG_ARCH_ASPEED=y
 CONFIG_MACH_ASPEED_G4=y

[PATCHv2 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common

2017-02-03 Thread Laura Abbott
There are multiple architectures that support CONFIG_DEBUG_RODATA and
CONFIG_SET_MODULE_RONX. These options also now have the ability to be
turned off at runtime. Move these to an architecture independent
location and make these options def_bool y for almost all of those
arches.

Signed-off-by: Laura Abbott 
---
v2: This patch is now doing just the refactor of the existing config options.
---
 arch/Kconfig  | 28 
 arch/arm/Kconfig  |  3 +++
 arch/arm/Kconfig.debug| 11 ---
 arch/arm/mm/Kconfig   | 12 
 arch/arm64/Kconfig|  5 ++---
 arch/arm64/Kconfig.debug  | 11 ---
 arch/parisc/Kconfig   |  1 +
 arch/parisc/Kconfig.debug | 11 ---
 arch/s390/Kconfig |  5 ++---
 arch/s390/Kconfig.debug   |  3 ---
 arch/x86/Kconfig  |  5 ++---
 arch/x86/Kconfig.debug| 11 ---
 12 files changed, 38 insertions(+), 68 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 99839c2..22ee01e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -781,4 +781,32 @@ config VMAP_STACK
  the stack to map directly to the KASAN shadow map using a formula
  that is incorrect if the stack is in vmalloc space.
 
+config ARCH_NO_STRICT_RWX_DEFAULTS
+   def_bool n
+
+config ARCH_HAS_STRICT_KERNEL_RWX
+   def_bool n
+
+config DEBUG_RODATA
+   def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
+   prompt "Make kernel text and rodata read-only" if 
ARCH_NO_STRICT_RWX_DEFAULTS
+   depends on ARCH_HAS_STRICT_KERNEL_RWX
+   help
+ If this is set, kernel text and rodata memory will be made read-only,
+ and non-text memory will be made non-executable. This provides
+ protection against certain security exploits (e.g. executing the heap
+ or modifying text)
+
+config ARCH_HAS_STRICT_MODULE_RWX
+   def_bool n
+
+config DEBUG_SET_MODULE_RONX
+   def_bool y if !ARCH_NO_STRICT_RWX_DEFAULTS
+   prompt "Set loadable kernel module data as NX and text as RO" if 
ARCH_NO_STRICT_RWX_DEFAULTS
+   depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES
+   help
+ If this is set, module text and rodata memory will be made read-only,
+ and non-text memory will be made non-executable. This provides
+ protection against certain security exploits (e.g. writing to text)
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 186c4c2..aa73ca8 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -4,10 +4,13 @@ config ARM
select ARCH_CLOCKSOURCE_DATA
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
+   select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
+   select ARCH_HAS_STRICT_MODULE_RWX if MMU
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAVE_CUSTOM_GPIO_H
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_MIGHT_HAVE_PC_PARPORT
+   select ARCH_NO_STRICT_RWX_DEFAULTS if !CPU_V7
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF
diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug
index d83f7c3..426d271 100644
--- a/arch/arm/Kconfig.debug
+++ b/arch/arm/Kconfig.debug
@@ -1738,17 +1738,6 @@ config PID_IN_CONTEXTIDR
  additional instructions during context switch. Say Y here only if you
  are planning to use hardware trace tools with this kernel.
 
-config DEBUG_SET_MODULE_RONX
-   bool "Set loadable kernel module data as NX and text as RO"
-   depends on MODULES && MMU
-   ---help---
- This option helps catch unintended modifications to loadable
- kernel module's text and read-only data. It also prevents execution
- of module data. Such protection may interfere with run-time code
- patching and dynamic kernel tracing - and they might also protect
- against certain classes of kernel exploits.
- If in doubt, say "N".
-
 source "drivers/hwtracing/coresight/Kconfig"
 
 endmenu
diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig
index f68e8ec..419a035 100644
--- a/arch/arm/mm/Kconfig
+++ b/arch/arm/mm/Kconfig
@@ -1051,18 +1051,6 @@ config ARCH_SUPPORTS_BIG_ENDIAN
  This option specifies the architecture can support big endian
  operation.
 
-config DEBUG_RODATA
-   bool "Make kernel text and rodata read-only"
-   depends on MMU && !XIP_KERNEL
-   default y if CPU_V7
-   help
- If this is set, kernel text and rodata memory will be made
- read-only, and non-text kernel memory will be made non-executable.
- The tradeoff is that each region is padded to section-size (1MiB)
- boundaries (because their permissions are different and splitting
- the 1M pages into 4K ones causes TLB performance problems), which
- can waste memory.
-
 config DEBUG_ALIGN_RODATA
   

[PATCH] Add 'nodoc' option

2017-02-03 Thread Matthew Wilcox

I've written this patch, and it seems to work, but I don't really know
what I'm doing and I fear I may have broken something.  I don't know what:

required_arguments = 1
optional_arguments = 4

mean, so I don't know whether I should have adjusted them when adding
this new option.

From 26896bd43618e30cb3563114663ad6c39cef7dc6 Mon Sep 17 00:00:00 2001
From: Matthew Wilcox 
Date: Thu, 26 Jan 2017 11:43:24 -0500
Subject: [PATCH] kerneldoc: Add :nodoc: option

This is functionality the perl script already has, we just need to
expose it to our rst files.
---
 Documentation/sphinx/kerneldoc.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/sphinx/kerneldoc.py 
b/Documentation/sphinx/kerneldoc.py
index d15e07f36881..d86d88da1d75 100644
--- a/Documentation/sphinx/kerneldoc.py
+++ b/Documentation/sphinx/kerneldoc.py
@@ -47,6 +47,7 @@ class KernelDocDirective(Directive):
 optional_arguments = 4
 option_spec = {
 'doc': directives.unchanged_required,
+'nodoc': directives.unchanged,
 'functions': directives.unchanged_required,
 'export': directives.unchanged,
 'internal': directives.unchanged,
@@ -74,6 +75,8 @@ class KernelDocDirective(Directive):
 export_file_patterns = str(self.options.get('internal')).split()
 elif 'doc' in self.options:
 cmd += ['-function', str(self.options.get('doc'))]
+elif 'nodoc' in self.options:
+cmd += ['-no-doc-sections']
 elif 'functions' in self.options:
 for f in str(self.options.get('functions')).split():
 cmd += ['-function', f]
-- 
2.11.0
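With the option in place, an .rst file could then suppress a source file's DOC: sections with something like the following (the file path is hypothetical, and note that because the options form an elif chain, :nodoc: only takes effect when none of the earlier options are given):

```
.. kernel-doc:: drivers/foo/bar.c
   :nodoc:
```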



Re: [PATCH RESEND] Documentation/sphinx: prevent generation of .pyc files in the source tree

2017-02-03 Thread Jani Nikula
On Wed, 01 Feb 2017, Jonathan Corbet  wrote:
> On Tue, 31 Jan 2017 20:18:05 +0200
> Jani Nikula  wrote:
>
>> Considering the small amount of python code to compile (assuming sphinx
>> itself has .pyc around), the impact on build is neglible.
>
> Hey...don't you know that performance-impacting patches need benchmarks?

Heh, I was briefly worried at this point! :)

>
> Sphinx-only htmldocs build before:
>
>   real2m26.063s
>   user2m19.184s
>   sys 0m5.429s
>
> After:
>
>   real2m26.342s
>   user2m20.445s
>   sys 0m5.611s
>
> You owe me 0.279 seconds :)
>
> (Applied, thanks).
>
> jon

-- 
Jani Nikula, Intel Open Source Technology Center


Re: [PATCH 1/1] MicroSemi Switchtec management interface driver

2017-02-03 Thread Emil Velikov
On 2 February 2017 at 16:37, Logan Gunthorpe  wrote:
>
>
> On 01/02/17 05:10 AM, Emil Velikov wrote:
>> You can keep it roughly as-is if you're ~reasonably certain one won't
>> change it in the future.
>
> I've made the change anyway. I think it's better now.
>
>> Some teams frown upon adding new IOCTL(s) where existing ones can be
>> made backward/forward compatible.
>> I'm not fully aware of the general direction/consensus on the topic,
>> so it might be a minority.
>
> Sure, I just don't know what might be needed in the future so it's hard
> to add a version or flags ioctl now.
>
Yes, knowing how things will need to change in the future is hard. That's
why the documentation suggests adding a flags field to the ioctl structs.
It [the flag] might imply certain functional/implementation changes,
support for new features, deprecation of old ones, and so on.
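The convention being discussed (an explicitly zeroed flags word plus reserved space in the ioctl payload, so the interface can grow compatibly) can be sketched with a hypothetical struct; every name below is invented for illustration:

```c
#include <stdint.h>

/*
 * Hypothetical ioctl argument struct, sized for future growth.
 * New behaviour is keyed off newly-defined bits in 'flags';
 * userspace must zero 'flags' and 'rsvd' today so the kernel can
 * reject unknown bits now and give them meaning later without
 * having to introduce a new ioctl number.
 */
struct foo_ioctl_cmd {
	uint32_t arg;		/* existing input  */
	uint32_t result;	/* existing output */
	uint32_t flags;		/* must be zero until a bit is defined */
	uint32_t rsvd[5];	/* explicit reserved space, must be zero */
};
```

Because every member is a naturally aligned uint32_t, the struct has no implicit padding and the same layout for 32-bit and 64-bit userspace, which is the other property such payloads usually need.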

>> On the other hand, reading through sysfs for module version in order
>> to use IOCTL A or B sounds quite hacky. Do you have an example where
>> this is used or pointed out as good approach ?
>
> I don't know of anything doing it that way now. But it sure would be
> easy and make a bit of sense. (We'd actually use the module version for
> something useful.) Either way, it would really depend on if and how
> things change in the future. The point is there are options to expand if
> needed.
>
The part that nobody else is doing such a thing should ring a bell ;-)

It's not my call, but if it were I'd stick with the existing approach
and not "reinvent the wheel", so to speak.

Thanks
Emil


Re: [PATCH v8 07/12] dt-bindings: i2c: i2c-mux-simple: document i2c-mux-simple bindings

2017-02-03 Thread Peter Rosin
On 2017-02-02 17:08, Rob Herring wrote:
> On Tue, Jan 31, 2017 at 1:36 AM, Peter Rosin  wrote:
>> If you see this new driver as something that is superseding the existing
>> i2c-mux-gpio driver, I'm sad to inform you that the code is simply
>> not there. i2c-mux-gpio has acpi support and users may provide platform
>> data from code. The existing i2c-mux-gpio driver also has the below
>> mentioned locking heuristic. Adding all those things to the new driver
>> would make it big and unwieldy and having even more unwanted tentacles
>> into other subsystems. And why should it be only i2c-mux-gpio that is
>> merged into this new i2c-mux driver? Why not implement a mux-pinctrl
>> driver for the new mux subsystem and then merge i2c-mux-pinctrl as well?
>> I say no, that way lies madness.
> 
> Sounds like a good idea to me. I'm not saying you need to merge any of
> them right now though (that's Wolfram's call).

If we're pedantic I probably have some stake in it too, being the i2c-mux
maintainer and all. But, agreed, I arrived quite late to the Linux kernel
party and my opinion might perhaps not carry all that much weight...

> None of this has anything to do with the binding though. Compatible
> strings should be specific. That's not up for debate. Whether the

Ok, I'm going to focus on the compatible string for a minute and leave
the implementation details for some other discussion.

> driver bound to a compatible string is common or specific to that
> compatible string is completely up to the OS. That decision can change
> over time, but the binding should not.

So, there's the existing compatible "i2c-mux-gpio" ("i2c-gpio-mux" is
wrong) that you seem to suggest is what I should stick to. I object to
that.

As you say, the bindings and compatible strings should describe hardware,
and you also state they should be specific. I agree. But, why are you
then apparently suggesting (by implication) that for this (hypothetical)
hardware...

 .--------.
 |        |
 |    SDA0|---------.
 |    SCL0|-------. |
 |        |       | |
 |        |     .-+-+-------.
 |        |     |  adg792a  |
 |        |     |           |
 |    ADC0|-----|D1      S1A|---- signal A
 |        |     |        S1B|---- signal B
 |        |     |        S1C|---- signal C
 |        |     |        S1D|---- signal D
 |        |     |           |
 |    SDA1|---+-|D2      S2A|---- i2c segment foo A
 |    SCL1|-. | |        S2B|---- i2c segment foo B
 '--------' | | |        S2C|---- i2c segment foo C
            | | |        S2D|---- i2c segment foo D
            | | |           |
            | '-|D3      S3A|---- i2c segment bar A
            |   |        S3B|---- i2c segment bar B
            |   |        S3C|---- i2c segment bar C
            |   |        S3D|---- i2c segment bar D
            |   '-----------'
            |     A B C D   A B C D    (feed SCL1 to each of
            |     | | | |   | | | |     the 8 muxed segments)
            '-----+-+-+-+---+-+-+-'

...the devicetree should look like the below?

&i2c0 {
	mux: mux-controller@50 {
		compatible = "adi,adg792a";
		reg = <0x50>;
		#mux-control-cells = <1>;
	};
};

adc-mux {
	compatible = "io-channel-mux";
	io-channels = <&adc 0>;
	io-channel-names = "parent";

	mux-controls = <&mux 0>;

	...
};

i2c-mux-foo {
	compatible = "i2c-mux-gpio";
	i2c-parent = <&i2c1>;

	mux-controls = <&mux 1>;

	...
};

i2c-mux-bar {
	compatible = "i2c-mux-gpio";
	i2c-parent = <&i2c1>;

	mux-controls = <&mux 2>;

	...
};

There must be some disconnect, because those "i2c-mux-gpio" compatible
strings are just dead wrong. There simply are no gpio pins involved at
all and that "gpio" suffix is just totally out of place.

So, since you are not happy with "i2c-mux-simple", "i2c-mux-generic" or
"i2c-mux" that I have suggested, can you please come up with something
that is good enough for the above?

Or, are you /really/ suggesting "i2c-mux-gpio"?

Cheers,
peda
