Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Joel Fernandes
On Fri, Mar 30, 2018 at 8:07 PM, Steven Rostedt  wrote:
> On Fri, 30 Mar 2018 19:18:57 -0700
> Matthew Wilcox  wrote:
>
>> Again though, this is the same pattern as vmalloc.  There are any number
>> of places where userspace can cause an arbitrarily large vmalloc to be
>> attempted (grep for kvmalloc_array for a list of promising candidates).
>> I'm pretty sure that just changing your GFP flags to GFP_KERNEL |
>> __GFP_NOWARN will give you the exact behaviour that you want with no
>> need to grub around in the VM to find out if your huge allocation is
>> likely to succeed.
>
> Not sure how this helps. Note, I don't care about consecutive pages, so
> this is not an array. It's a linked list of thousands of pages. How do
> you suggest allocating them? The ring buffer is a linked list of pages.

Yeah, I didn't understand the suggestion either. If I remember correctly,
using plain GFP_KERNEL without either NO_RETRY or RETRY_MAY_FAIL was
precisely what caused a write to buffer_size_kb to trigger an OOM in my
testing. So I think Steven's patch does the right thing by checking in
advance.

thanks,

- Joel


Re: [PATCH 00/11] Use global pages with PTI

2018-03-30 Thread Ingo Molnar

* Dave Hansen  wrote:

> On 03/30/2018 01:32 PM, Thomas Gleixner wrote:
> > On Fri, 30 Mar 2018, Dave Hansen wrote:
> > 
> >> On 03/30/2018 05:17 AM, Ingo Molnar wrote:
> >>> BTW., the expectation on !PCID Intel hardware would be for global pages
> >>> to help even more than the 0.6% and 1.7% you measured on PCID hardware:
> >>> PCID already _reduces_ the cost of TLB flushes - so if there's not even
> >>> PCID then global pages should help even more.
> >>>
> >>> In theory at least. Would still be nice to measure it.
> >>
> >> I did the lseek test on a modern, non-PCID system:
> >>
> >> No Global pages (baseline): 6077741 lseeks/sec
> >> 94 Global pages (this set): 8433111 lseeks/sec
> >>   +2355370 lseeks/sec (+38.8%)
> > 
> > That's all kernel text, right? What's the result for the case where global
> > is only set for all user/kernel shared pages?
> 
> Yes, that's all kernel text (94 global entries).  Here's the number with
> just the entry data/text set global (88 global entries on this system):
> 
> No Global pages (baseline): 6077741 lseeks/sec
> 88 Global Pages (kentry  ): 7528609 lseeks/sec (+23.9%)
> 94 Global pages (this set): 8433111 lseeks/sec (+38.8%)
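
(For reference, those percentages follow directly from the raw rates:
(7528609 - 6077741) / 6077741 ~= +23.9%, and (8433111 - 6077741) / 6077741
~= +38.8%.)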

Very impressive!

Please incorporate the performance numbers in patches #9 and #11.

There were a couple of valid review comments which need to be addressed as
well, but other than that it all looks good to me and I plan to apply the
next iteration.

In fact I think I'll try to put it into the backporting tree: PGE was really
the pre-PTI status quo, so we should expect few quirks/bugs in this area,
plus we still want to share as much core PTI logic with the -stable kernels
as possible. The performance plus doesn't hurt either ... after so much lost
performance.

Thanks,

Ingo


Re: [PATCH v6] kernel.h: Retain constant expression output for max()/min()

2018-03-30 Thread Ingo Molnar

* Kees Cook  wrote:

> On Mon, Mar 26, 2018 at 10:47 PM, Ingo Molnar  wrote:
> >
> > * Kees Cook  wrote:
> >
> >> In the effort to remove all VLAs from the kernel[1], it is desirable to
> >> build with -Wvla. However, this warning is overly pessimistic, in that
> >> it is only happy with stack array sizes that are declared as constant
> >> expressions, and not constant values. One case of this is the evaluation
> >> of the max() macro which, due to its construction, ends up converting
> >> constant expression arguments into a constant value result.
> >>
> >> All attempts to rewrite this macro with __builtin_constant_p() failed with
> >> older compilers (e.g. gcc 4.4)[2]. However, Martin Uecker constructed[3] a
> >> mind-shattering solution that works everywhere. Cthulhu fhtagn!
> >>
> >> This patch updates the min()/max() macros to evaluate to a constant
> >> expression when called on constant expression arguments. This removes
> >> several false-positive stack VLA warnings from an x86 allmodconfig
> >> build when -Wvla is added:
> >
> > Cool!
> >
> > Acked-by: Ingo Molnar 
> >
> > How many warnings are left in an allmodconfig build?
> 
> For -Wvla? Out of the original 112 files with VLAs, 42 haven't had a
> patch applied yet. Doing a linux-next allmodconfig build with the
> max() patch and my latest ecc patch, we've gone from 316 warning
> instances to 205. More than half of those are in
> include/crypto/skcipher.h and include/crypto/hash.h.

Great - once the number of warnings is zero, is the plan to enable the
warning unconditionally?

Thanks,

Ingo


Re: [03/10] genksyms: generate lexer and parser during build instead of shipping

2018-03-30 Thread Andrei Vagin
On Sat, Mar 31, 2018 at 11:20:22AM +0900, Masahiro Yamada wrote:
> 2018-03-31 7:21 GMT+09:00 Andrei Vagin :
> > On Fri, Mar 30, 2018 at 10:40:22AM -0700, Andrei Vagin wrote:
> >> On Fri, Mar 23, 2018 at 10:04:32PM +0900, Masahiro Yamada wrote:
> >> > Now that the kernel build supports flex and bison, remove the _shipped
> >> > files and generate them during the build instead.
> >> >
> >> > There are no more shipped lexer and parser files, so I ripped out the
> >> > rules in scripts/Makefile.lib that were used for REGENERATE_PARSERS.
> >> >
> >> > The genksyms parser has an ambiguous grammar, which would emit warnings:
> >> >
> >> >  scripts/genksyms/parse.y: warning: 9 shift/reduce conflicts [-Wconflicts-sr]
> >> >  scripts/genksyms/parse.y: warning: 5 reduce/reduce conflicts [-Wconflicts-rr]
> >> >
> >> > They are normally suppressed, but displayed when W=1 is given.
> >> >
> >> > Signed-off-by: Masahiro Yamada 
> >> > ---
> >> >
> >> >  scripts/Makefile.lib |   24 +-
> >> >  scripts/genksyms/Makefile|   23 +
> >> >  scripts/genksyms/lex.lex.c_shipped   | 2291 
> >> >  scripts/genksyms/parse.tab.c_shipped | 2394 --
> >> >  scripts/genksyms/parse.tab.h_shipped |  119 --
> >> >  5 files changed, 26 insertions(+), 4825 deletions(-)
> >> >  delete mode 100644 scripts/genksyms/lex.lex.c_shipped
> >> >  delete mode 100644 scripts/genksyms/parse.tab.c_shipped
> >> >  delete mode 100644 scripts/genksyms/parse.tab.h_shipped
> >> >
> >> > diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
> >> > index 2fde810..b7d2c97 100644
> >> > --- a/scripts/Makefile.lib
> >> > +++ b/scripts/Makefile.lib
> >> > @@ -183,14 +183,8 @@ endef
> >> >  quiet_cmd_flex = LEX $@
> >> >cmd_flex = $(LEX) -o$@ -L $<
> >> >
> >> > -ifdef REGENERATE_PARSERS
> >> > -.PRECIOUS: $(src)/%.lex.c_shipped
> >> > -$(src)/%.lex.c_shipped: $(src)/%.l
> >> > -   $(call cmd,flex)
> >> > -endif
> >> > -
> >> >  .PRECIOUS: $(obj)/%.lex.c
> >> > -$(filter %.lex.c,$(targets)): $(obj)/%.lex.c: $(src)/%.l FORCE
> >> > +$(obj)/%.lex.c: $(src)/%.l FORCE
> >> > $(call if_changed,flex)
> >> >
> >> >  # YACC
> >> > @@ -198,27 +192,15 @@ $(filter %.lex.c,$(targets)): $(obj)/%.lex.c: $(src)/%.l FORCE
> >> >  quiet_cmd_bison = YACC$@
> >> >cmd_bison = $(YACC) -o$@ -t -l $<
> >> >
> >> > -ifdef REGENERATE_PARSERS
> >> > -.PRECIOUS: $(src)/%.tab.c_shipped
> >> > -$(src)/%.tab.c_shipped: $(src)/%.y
> >> > -   $(call cmd,bison)
> >> > -endif
> >> > -
> >> >  .PRECIOUS: $(obj)/%.tab.c
> >> > -$(filter %.tab.c,$(targets)): $(obj)/%.tab.c: $(src)/%.y FORCE
> >> > +$(obj)/%.tab.c: $(src)/%.y FORCE
> >> > $(call if_changed,bison)
> >> >
> >> >  quiet_cmd_bison_h = YACC$@
> >> >cmd_bison_h = bison -o/dev/null --defines=$@ -t -l $<
> >> >
> >> > -ifdef REGENERATE_PARSERS
> >> > -.PRECIOUS: $(src)/%.tab.h_shipped
> >> > -$(src)/%.tab.h_shipped: $(src)/%.y
> >> > -   $(call cmd,bison_h)
> >> > -endif
> >> > -
> >> >  .PRECIOUS: $(obj)/%.tab.h
> >> > -$(filter %.tab.h,$(targets)): $(obj)/%.tab.h: $(src)/%.y FORCE
> >> > +$(obj)/%.tab.h: $(src)/%.y FORCE
> >> > $(call if_changed,bison_h)
> >> >
> >> >  # Shipped files
> >> > diff --git a/scripts/genksyms/Makefile b/scripts/genksyms/Makefile
> >> > index 0ccac51..f4749e8 100644
> >> > --- a/scripts/genksyms/Makefile
> >> > +++ b/scripts/genksyms/Makefile
> >> > @@ -5,9 +5,32 @@ always := $(hostprogs-y)
> >> >
> >> >  genksyms-objs  := genksyms.o parse.tab.o lex.lex.o
> >> >
> >> > +# FIXME: fix the ambiguous grammar in parse.y and delete this hack
> >> > +#
> >> > +# Suppress shift/reduce, reduce/reduce conflicts warnings
> >> > +# unless W=1 is specified.
> >> > +ifeq ($(findstring 1,$(KBUILD_ENABLE_EXTRA_GCC_CHECKS)),)
> >> > +SUPPRESS_BISON_WARNING := 2>/dev/null
> >>
> >> We have a robot which runs CRIU tests on linux-next.
> >> Yesterday it failed with this error:
> >>
> >>   HOSTCC  scripts/genksyms/genksyms.o
> >> make[2]: *** [scripts/genksyms/parse.tab.c] Error 127
> >>
> >> scripts/genksyms/Makefile:20: recipe for target 'scripts/genksyms/parse.tab.c' failed
> >> scripts/Makefile.build:559: recipe for target 'scripts/genksyms' failed
> >> Makefile:1073: recipe for target 'scripts' failed
> >> make[1]: *** [scripts/genksyms] Error 2
> >> make: *** [scripts] Error 2
> >> make: *** Waiting for unfinished jobs
> >>
> >> https://travis-ci.org/avagin/linux/jobs/360056903
> >>
> >> From this output, it is very hard to understand what was going wrong.
> >
> >
> > The reason was that bison and flex were not installed, but I think the
> > error message should be clearer.
> >
> >>
> >> Thanks,
> >> Andrei
> >>
> 
> Thanks for the report.
> 
> 
> OK, I will apply the fix-up attached below.
> 
> If bison is not installed, it will fail with a clear message.

Thank you!

> 
>   HOSTCC  scripts/genksyms/genksyms.o
> /bin/sh: 1: bison: not 
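
The fix-up itself is truncated in this archive. As a sketch of the general
shape such a guard can take (hypothetical, not necessarily Masahiro's
actual patch), the rule can probe for the tool before invoking it, so a
missing bison produces an explicit diagnostic instead of a bare Error 127:

# Hypothetical sketch only: fail with a clear message when bison is absent.
quiet_cmd_bison = YACC    $@
      cmd_bison = if ! command -v $(YACC) >/dev/null 2>&1; then           \
                        echo "*** 'bison' is missing; please install it" >&2; \
                        exit 1;                                           \
                  fi;                                                     \
                  $(YACC) -o$@ -t -l $<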

Re: [PATCH v6] kernel.h: Retain constant expression output for max()/min()

2018-03-30 Thread Kees Cook
On Mon, Mar 26, 2018 at 10:47 PM, Ingo Molnar  wrote:
>
> * Kees Cook  wrote:
>
>> In the effort to remove all VLAs from the kernel[1], it is desirable to
>> build with -Wvla. However, this warning is overly pessimistic, in that
>> it is only happy with stack array sizes that are declared as constant
>> expressions, and not constant values. One case of this is the evaluation
>> of the max() macro which, due to its construction, ends up converting
>> constant expression arguments into a constant value result.
>>
>> All attempts to rewrite this macro with __builtin_constant_p() failed with
>> older compilers (e.g. gcc 4.4)[2]. However, Martin Uecker constructed[3] a
>> mind-shattering solution that works everywhere. Cthulhu fhtagn!
>>
>> This patch updates the min()/max() macros to evaluate to a constant
>> expression when called on constant expression arguments. This removes
>> several false-positive stack VLA warnings from an x86 allmodconfig
>> build when -Wvla is added:
>
> Cool!
>
> Acked-by: Ingo Molnar 
>
> How many warnings are left in an allmodconfig build?

For -Wvla? Out of the original 112 files with VLAs, 42 haven't had a
patch applied yet. Doing a linux-next allmodconfig build with the
max() patch and my latest ecc patch, we've gone from 316 warning
instances to 205. More than half of those are in
include/crypto/skcipher.h and include/crypto/hash.h.
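
For anyone curious, a minimal standalone sketch of the trick (simplified;
the kernel's final version layers more on top of this): __is_constexpr(x)
is itself an integer constant expression that evaluates to 1 exactly when
x is one, by exploiting the null-pointer-constant rules of the ?: operator.

#include <stdio.h>

/*
 * If x is an integer constant expression, (void *)((long)(x) * 0l) is a
 * null pointer constant, so the ?: result has type int * and the sizeof
 * comparison is true.  Otherwise the result has type void *, and
 * sizeof(*(void *)) is 1 under GCC, so the comparison is false.  Either
 * way, the whole expression is itself a constant expression.
 */
#define __is_constexpr(x) \
	(sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))

int main(void)
{
	int n = 16;

	printf("%d %d\n", (int)__is_constexpr(16), (int)__is_constexpr(n)); /* 1 0 */
	char buf[__is_constexpr(16) ? 16 : 1];	/* constant expression: not a VLA */
	return (int)sizeof(buf) - 16;		/* 0 */
}

min()/max() can then use __builtin_choose_expr() to pick a bare,
constant-folding form when both arguments are constant expressions, and
the classic type-checking form otherwise.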

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH v4 01/14] soc: qcom: Separate kryo l2 accessors from PMU driver

2018-03-30 Thread kbuild test robot
Hi Ilia,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on clk/clk-next]
[also build test ERROR on v4.16-rc7 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Ilia-Lin/soc-qcom-Separate-kryo-l2-accessors-from-PMU-driver/20180331-093947
base:   https://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git clk-next
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm

All errors (new ones prefixed by >>):

>> drivers/soc/qcom/kryo-l2-accessors.c:17:10: fatal error: asm/sysreg.h: No
>> such file or directory
    #include <asm/sysreg.h>
             ^~~~~~~~~~~~~~
   compilation terminated.

vim +17 drivers/soc/qcom/kryo-l2-accessors.c

  > 17  #include <asm/sysreg.h>
18  #include 
19  

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [kbuild-all] [PATCH] OPTIONAL: cpufreq/intel_pstate: fix debugfs_simple_attr.cocci warnings

2018-03-30 Thread Nicolai Stange
Julia Lawall  writes:

> On Fri, 30 Mar 2018, Nicolai Stange wrote:
>
>> Julia Lawall  writes:
>>
>> > On Thu, 29 Mar 2018, Fabio Estevam wrote:
>> >
>> >> Hi Julia,
>> >>
>> >> On Thu, Mar 29, 2018 at 4:12 PM, Julia Lawall  
>> >> wrote:
>> >> >  Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
>> >> >  for debugfs files.
>> >> >
>> >> > Semantic patch information:
>> >> >  Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file()
>> >> >  imposes some significant overhead as compared to
>> >> >  DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe().
>> >>
>> >> Just curious: could you please expand on what "imposes some
>> >> significant overhead" means?
>> >
>> > I don't know.  I didn't write this rule.  Nicolai, can you explain?
>>
>> From commit 49d200deaa68 ("debugfs: prevent access to removed files' private
>> data"):
>>
>> Upon return of debugfs_remove()/debugfs_remove_recursive(), it might
>> still be attempted to access associated private file data through
>> previously opened struct file objects. If that data has been freed by
>> the caller of debugfs_remove*() in the meanwhile, the reading/writing
>> process would either encounter a fault or, if the memory address in
>> question has been reassigned again, unrelated data structures could get
>> overwritten.
>> [...]
>> Currently, there are ~1000 call sites of debugfs_create_file() spread
>> throughout the whole tree and touching all of those struct 
>> file_operations
>> in order to make them file removal aware by means of checking the result 
>> of
>> debugfs_use_file_start() from within their methods is unfeasible.
>>
>> Instead, wrap the struct file_operations by a lifetime managing proxy at
>> file open [...]
>>
>> The additional overhead comes in terms of additional memory needed: for
>> debugs files created through debugfs_create_file(), one such struct
>> file_operations proxy is allocated for each struct file instantiation,
>> c.f. full_proxy_open().
>>
>> This was needed to "repair" the ~1000 call sites without touching them.
>>
>> New debugfs users should make their file_operations removal aware
>> themselves by means of DEFINE_DEBUGFS_ATTRIBUTE() and signal that fact to
>> the debugfs core by instantiating them through
>> debugfs_create_file_unsafe().
>>
>> See commit c64688081490 ("debugfs: add support for self-protecting
>> attribute file fops") for further information.
>
> Thanks.  Perhaps it would be good to add a reference to this commit in
> the message generated by the semantic patch.

Thanks for doing this!


>
> Would it be sufficient to just apply the semantic patch everywhere and
> submit the patches?

In principle yes. But I'm not sure whether such a mass application is
worth it: the proxy allocation happens only at file open, and the
expectation is that there aren't that many debugfs files kept open at a
time. OTOH, a struct file_operations consumes 256 bytes with
sizeof(long) == 8.

In my opinion, new users should avoid this overhead as it's easily
doable. For existing ones, I don't know.
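
To make the comparison concrete, a minimal sketch of the two patterns
(the value, fops names, and init function are made up for illustration;
the macros and debugfs calls are the existing kernel APIs):

#include <linux/debugfs.h>

static u64 my_val;	/* hypothetical value exposed via debugfs */

static int my_get(void *data, u64 *val) { *val = *(u64 *)data; return 0; }
static int my_set(void *data, u64 val)  { *(u64 *)data = val;  return 0; }

/* Old pattern: removal-unaware fops; debugfs wraps them in a
 * lifetime-managing proxy allocated at every open(). */
DEFINE_SIMPLE_ATTRIBUTE(my_fops_old, my_get, my_set, "%llu\n");

/* New pattern: removal-aware fops; no per-open proxy, hence the
 * _unsafe() creation call. */
DEFINE_DEBUGFS_ATTRIBUTE(my_fops_new, my_get, my_set, "%llu\n");

static void my_debugfs_init(struct dentry *dir)
{
	debugfs_create_file("old", 0644, dir, &my_val, &my_fops_old);
	debugfs_create_file_unsafe("new", 0644, dir, &my_val, &my_fops_new);
}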

Thanks,

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)


[PATCH v8 03/18] block, dax: remove dead code in blkdev_writepages()

2018-03-30 Thread Dan Williams
Block device inodes never have S_DAX set, so kill the check for DAX and
diversion to dax_writeback_mapping_range().

Cc: Jeff Moyer 
Cc: Ross Zwisler 
Cc: Matthew Wilcox 
Cc: Dave Chinner 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Dan Williams 
---
 fs/block_dev.c |5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index fe09ef9c21f3..846ee2d31781 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1946,11 +1946,6 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
 static int blkdev_writepages(struct address_space *mapping,
 struct writeback_control *wbc)
 {
-   if (dax_mapping(mapping)) {
-   struct block_device *bdev = I_BDEV(mapping->host);
-
-   return dax_writeback_mapping_range(mapping, bdev, wbc);
-   }
return generic_writepages(mapping, wbc);
 }
 



[PATCH v8 04/18] xfs, dax: introduce xfs_dax_aops

2018-03-30 Thread Dan Williams
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G   O 4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96

...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.

Cc: Jeff Moyer 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Suggested-by: Jan Kara 
Suggested-by: Dave Chinner 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_aops.c |   34 ++
 fs/xfs/xfs_aops.h |1 +
 fs/xfs/xfs_iops.c |5 -
 3 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9c6a830da0ee..e7a56c4786ff 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1194,16 +1194,22 @@ xfs_vm_writepages(
int ret;
 
xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
-   if (dax_mapping(mapping))
-   return dax_writeback_mapping_range(mapping,
-   xfs_find_bdev_for_inode(mapping->host), wbc);
-
ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
if (wpc.ioend)
ret = xfs_submit_ioend(wbc, wpc.ioend, ret);
return ret;
 }
 
+STATIC int
+xfs_dax_writepages(
+   struct address_space*mapping,
+   struct writeback_control *wbc)
+{
+   xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
+   return dax_writeback_mapping_range(mapping,
+   xfs_find_bdev_for_inode(mapping->host), wbc);
+}
+
 /*
  * Called to move a page into cleanable state - and from there
  * to be released. The page should already be clean. We always
@@ -1367,17 +1373,6 @@ xfs_get_blocks(
return error;
 }
 
-STATIC ssize_t
-xfs_vm_direct_IO(
-   struct kiocb*iocb,
-   struct iov_iter *iter)
-{
-   /*
-* We just need the method present so that open/fcntl allow direct I/O.
-*/
-   return -EINVAL;
-}
-
 STATIC sector_t
 xfs_vm_bmap(
struct address_space*mapping,
@@ -1500,8 +1495,15 @@ const struct address_space_operations xfs_address_space_operations = {
.releasepage= xfs_vm_releasepage,
.invalidatepage = xfs_vm_invalidatepage,
.bmap   = xfs_vm_bmap,
-   .direct_IO  = xfs_vm_direct_IO,
+   .direct_IO  = noop_direct_IO,
.migratepage= buffer_migrate_page,
.is_partially_uptodate  = block_is_partially_uptodate,
.error_remove_page  = generic_error_remove_page,
 };
+
+const struct address_space_operations xfs_dax_aops = {
+   .writepages = xfs_dax_writepages,
+   .direct_IO  = noop_direct_IO,
+   .set_page_dirty = noop_set_page_dirty,
+   .invalidatepage = noop_invalidatepage,
+};
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 88c85ea63da0..69346d460dfa 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -54,6 +54,7 @@ struct xfs_ioend {
 };
 
 extern const struct address_space_operations xfs_address_space_operations;
+extern const struct address_space_operations xfs_dax_aops;
 
 int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 56475fcd76f2..951e84df5576 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1272,7 +1272,10 @@ xfs_setup_iops(
case S_IFREG:
inode->i_op = &xfs_inode_operations;
inode->i_fop = &xfs_file_operations;
-   inode->i_mapping->a_ops = &xfs_address_space_operations;
+   if (IS_DAX(inode))
+   inode->i_mapping->a_ops = &xfs_dax_aops;
+   else
+   inode->i_mapping->a_ops = &xfs_address_space_operations;
break;
case S_IFDIR:
if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))



[PATCH v8 05/18] ext4, dax: introduce ext4_dax_aops

2018-03-30 Thread Dan Williams
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: "Theodore Ts'o" 
Cc: Andreas Dilger 
Cc: linux-e...@vger.kernel.org
Cc: Jan Kara 
Signed-off-by: Dan Williams 
---
 fs/ext4/inode.c |   42 +++---
 1 file changed, 31 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c94780075b04..249a97b19181 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2725,12 +2725,6 @@ static int ext4_writepages(struct address_space *mapping,
percpu_down_read(&sbi->s_journal_flag_rwsem);
trace_ext4_writepages(inode, wbc);
 
-   if (dax_mapping(mapping)) {
-   ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
- wbc);
-   goto out_writepages;
-   }
-
/*
 * No pages to write? This is mainly a kludge to avoid starting
 * a transaction for special inodes like journal inode on last iput()
@@ -2955,6 +2949,27 @@ static int ext4_writepages(struct address_space *mapping,
return ret;
 }
 
+static int ext4_dax_writepages(struct address_space *mapping,
+  struct writeback_control *wbc)
+{
+   int ret;
+   long nr_to_write = wbc->nr_to_write;
+   struct inode *inode = mapping->host;
+   struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
+
+   if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+   return -EIO;
+
+   percpu_down_read(&sbi->s_journal_flag_rwsem);
+   trace_ext4_writepages(inode, wbc);
+
+   ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+   trace_ext4_writepages_result(inode, wbc, ret,
+nr_to_write - wbc->nr_to_write);
+   percpu_up_read(&sbi->s_journal_flag_rwsem);
+   return ret;
+}
+
 static int ext4_nonda_switch(struct super_block *sb)
 {
s64 free_clusters, dirty_clusters;
@@ -3857,10 +3872,6 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
if (ext4_has_inline_data(inode))
return 0;
 
-   /* DAX uses iomap path now */
-   if (WARN_ON_ONCE(IS_DAX(inode)))
-   return 0;
-
trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
if (iov_iter_rw(iter) == READ)
ret = ext4_direct_IO_read(iocb, iter);
@@ -3946,6 +3957,13 @@ static const struct address_space_operations ext4_da_aops = {
.error_remove_page  = generic_error_remove_page,
 };
 
+static const struct address_space_operations ext4_dax_aops = {
+   .writepages = ext4_dax_writepages,
+   .direct_IO  = noop_direct_IO,
+   .set_page_dirty = noop_set_page_dirty,
+   .invalidatepage = noop_invalidatepage,
+};
+
 void ext4_set_aops(struct inode *inode)
 {
switch (ext4_inode_journal_mode(inode)) {
@@ -3958,7 +3976,9 @@ void ext4_set_aops(struct inode *inode)
default:
BUG();
}
-   if (test_opt(inode->i_sb, DELALLOC))
+   if (IS_DAX(inode))
+   inode->i_mapping->a_ops = &ext4_dax_aops;
+   else if (test_opt(inode->i_sb, DELALLOC))
inode->i_mapping->a_ops = &ext4_da_aops;
else
inode->i_mapping->a_ops = &ext4_aops;



[PATCH v8 06/18] ext2, dax: introduce ext2_dax_aops

2018-03-30 Thread Dan Williams
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: Jan Kara 
Reported-by: kbuild test robot 
Signed-off-by: Dan Williams 
---
 fs/ext2/ext2.h  |1 +
 fs/ext2/inode.c |   46 +++---
 fs/ext2/namei.c |   18 ++
 3 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 032295e1d386..cc40802ddfa8 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -814,6 +814,7 @@ extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
 
 /* inode.c */
+extern void ext2_set_file_ops(struct inode *inode);
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_nobh_aops;
 extern const struct iomap_ops ext2_iomap_ops;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 9b2ac55ac34f..1e01fabef130 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -940,9 +940,6 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
loff_t offset = iocb->ki_pos;
ssize_t ret;
 
-   if (WARN_ON_ONCE(IS_DAX(inode)))
-   return -EIO;
-
ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block);
if (ret < 0 && iov_iter_rw(iter) == WRITE)
ext2_write_failed(mapping, offset + count);
@@ -952,17 +949,16 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
-#ifdef CONFIG_FS_DAX
-   if (dax_mapping(mapping)) {
-   return dax_writeback_mapping_range(mapping,
-  mapping->host->i_sb->s_bdev,
-  wbc);
-   }
-#endif
-
return mpage_writepages(mapping, wbc, ext2_get_block);
 }
 
+static int
+ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
+{
+   return dax_writeback_mapping_range(mapping,
+   mapping->host->i_sb->s_bdev, wbc);
+}
+
 const struct address_space_operations ext2_aops = {
.readpage   = ext2_readpage,
.readpages  = ext2_readpages,
@@ -990,6 +986,13 @@ const struct address_space_operations ext2_nobh_aops = {
.error_remove_page  = generic_error_remove_page,
 };
 
+static const struct address_space_operations ext2_dax_aops = {
+   .writepages = ext2_dax_writepages,
+   .direct_IO  = noop_direct_IO,
+   .set_page_dirty = noop_set_page_dirty,
+   .invalidatepage = noop_invalidatepage,
+};
+
 /*
  * Probably it should be a library function... search for first non-zero word
  * or memcmp with zero_page, whatever is better for particular architecture.
@@ -1388,6 +1391,18 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_DAX;
 }
 
+void ext2_set_file_ops(struct inode *inode)
+{
+   inode->i_op = &ext2_file_inode_operations;
+   inode->i_fop = &ext2_file_operations;
+   if (IS_DAX(inode))
+   inode->i_mapping->a_ops = &ext2_dax_aops;
+   else if (test_opt(inode->i_sb, NOBH))
+   inode->i_mapping->a_ops = &ext2_nobh_aops;
+   else
+   inode->i_mapping->a_ops = &ext2_aops;
+}
+
 struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 {
struct ext2_inode_info *ei;
@@ -1480,14 +1495,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
ei->i_data[n] = raw_inode->i_block[n];
 
if (S_ISREG(inode->i_mode)) {
-   inode->i_op = &ext2_file_inode_operations;
-   if (test_opt(inode->i_sb, NOBH)) {
-   inode->i_mapping->a_ops = &ext2_nobh_aops;
-   inode->i_fop = &ext2_file_operations;
-   } else {
-   inode->i_mapping->a_ops = &ext2_aops;
-   inode->i_fop = &ext2_file_operations;
-   }
+   ext2_set_file_ops(inode);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext2_dir_inode_operations;
inode->i_fop = &ext2_dir_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index e078075dc66f..55f7caadb093 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -107,14 +107,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
if (IS_ERR(inode))
return PTR_ERR(inode);
 
-   inode->i_op = &ext2_file_inode_operations;
-   if (test_opt(inode->i_sb, NOBH)) {
-   inode->i_mapping->a_ops = &ext2_nobh_aops;
-   inode->i_fop = &ext2_file_operations;
-   } else {
-   inode->i_mapping->a_ops = &ext2_aops;
-   inode->i_fop = 

[PATCH v8 08/18] dax: introduce CONFIG_DAX_DRIVER

2018-03-30 Thread Dan Williams
In support of allowing device-mapper to compile out idle/dead code when
there are no dax providers in the system, introduce the DAX_DRIVER
symbol. This is selected by all leaf drivers that device-mapper might be
layered on top of. This allows device-mapper to conditionally 'select DAX'
only when a provider is present.
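
For illustration, a hypothetical leaf driver's Kconfig entry would then
select the new symbol instead of DAX directly (MY_DAX_BLOCK is a made-up
example, not part of this patch):

config MY_DAX_BLOCK
	tristate "Example block driver that provides a dax_device"
	select DAX_DRIVER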

Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Reported-by: Bart Van Assche 
Reviewed-by: Mike Snitzer 
Signed-off-by: Dan Williams 
---
 drivers/dax/Kconfig|5 -
 drivers/nvdimm/Kconfig |2 +-
 drivers/s390/block/Kconfig |2 +-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index b79aa8f7a497..e0700bf4893a 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,3 +1,7 @@
+config DAX_DRIVER
+   select DAX
+   bool
+
 menuconfig DAX
tristate "DAX: direct access to differentiated memory"
select SRCU
@@ -16,7 +20,6 @@ config DEV_DAX
  baseline memory pool.  Mappings of a /dev/daxX.Y device impose
  restrictions that make the mapping behavior deterministic.
 
-
 config DEV_DAX_PMEM
tristate "PMEM DAX: direct access to persistent memory"
depends on LIBNVDIMM && NVDIMM_DAX && DEV_DAX
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index a65f2e1d9f53..40cbdb16e23e 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -20,7 +20,7 @@ if LIBNVDIMM
 config BLK_DEV_PMEM
tristate "PMEM: Persistent memory block device support"
default LIBNVDIMM
-   select DAX
+   select DAX_DRIVER
select ND_BTT if BTT
select ND_PFN if NVDIMM_PFN
help
diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 1444333210c7..9ac7574e3cfb 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -15,8 +15,8 @@ config BLK_DEV_XPRAM
 
 config DCSSBLK
def_tristate m
-   select DAX
select FS_DAX_LIMITED
+   select DAX_DRIVER
prompt "DCSSBLK support"
depends on S390 && BLOCK
help



[PATCH v8 11/18] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks

2018-03-30 Thread Dan Williams
In order to resolve collisions between filesystem operations and DMA to
DAX mapped pages we need a callback when DMA completes. With a callback
we can hold off filesystem operations while DMA is in-flight and then
resume those operations when the last put_page() occurs on a DMA page.

Recall that the 'struct page' entries for DAX memory are created with
devm_memremap_pages(). That routine arranges for the pages to be
allocated, but never onlined, so a DAX page is DMA-idle when its
reference count reaches one.

Also recall that the HMM sub-system added infrastructure to trap the
page-idle (2-to-1 reference count) transition of the pages allocated by
devm_memremap_pages() and trigger a callback via the 'struct
dev_pagemap' associated with the page range. Whereas the HMM callbacks
go to a device driver to manage bounce pages in device memory, in the
filesystem-dax case we will call back to a filesystem-specified
callback.

Since the callback is not known at devm_memremap_pages() time we arrange
for the filesystem to install it at mount time. No functional changes
are expected as this only registers a nop handler for the ->page_free()
event for device-mapped pages.
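
For reference, the nop handler being registered has this shape (a
sketch; a later patch in this series fills in the page-idle wakeup):

static void generic_dax_pagefree(struct page *page, void *data)
{
	/* TODO: wakeup page-idle waiters */
}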

Cc: Michal Hocko 
Reviewed-by: "Jérôme Glisse" 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c   |   21 +++--
 drivers/nvdimm/pmem.c |3 ++-
 fs/ext2/super.c   |6 +++---
 fs/ext4/super.c   |6 +++---
 fs/xfs/xfs_super.c|   20 ++--
 include/linux/dax.h   |   23 ++-
 6 files changed, 43 insertions(+), 36 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4cf284dfe1c..7d260f118a39 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -63,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 }
 EXPORT_SYMBOL(bdev_dax_pgoff);
 
-#if IS_ENABLED(CONFIG_FS_DAX)
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
-{
-   if (!blk_queue_dax(bdev->bd_queue))
-   return NULL;
-   return fs_dax_get_by_host(bdev->bd_disk->disk_name);
-}
-EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
-#endif
-
 /**
  * __bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
@@ -579,6 +569,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+   const struct dax_operations *ops, struct dev_pagemap *pgmap)
+{
+   struct dax_device *dax_dev = alloc_dax(private, host, ops);
+
+   if (dax_dev)
+   dax_dev->pgmap = pgmap;
+   return dax_dev;
+}
+EXPORT_SYMBOL_GPL(alloc_dax_devmap);
+
 void put_dax(struct dax_device *dax_dev)
 {
if (!dax_dev)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 06f8dcc52ca6..e6d7351f3379 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -408,7 +408,8 @@ static int pmem_attach_disk(struct device *dev,
nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
disk->bb = &pmem->bb;
 
-   dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+   dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
+   &pmem->pgmap);
if (!dax_dev) {
put_disk(disk);
return -ENOMEM;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7666c065b96f..6ae20e319bc4 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -172,7 +172,7 @@ static void ext2_put_super (struct super_block * sb)
brelse (sbi->s_sbh);
sb->s_fs_info = NULL;
kfree(sbi->s_blockgroup_lock);
-   fs_put_dax(sbi->s_daxdev);
+   fs_dax_release(sbi->s_daxdev, sb);
kfree(sbi);
 }
 
@@ -817,7 +817,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
 
 static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 {
-   struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+   struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
struct buffer_head * bh;
struct ext2_sb_info * sbi;
struct ext2_super_block * es;
@@ -1213,7 +1213,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
kfree(sbi->s_blockgroup_lock);
kfree(sbi);
 failed:
-   fs_put_dax(dax_dev);
+   fs_dax_release(dax_dev, sb);
return ret;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 39bf464c35f1..315a323729e3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -952,7 +952,7 @@ static void ext4_put_super(struct super_block *sb)
if (sbi->s_chksum_driver)
crypto_free_shash(sbi->s_chksum_driver);
kfree(sbi->s_blockgroup_lock);
-   fs_put_dax(sbi->s_daxdev);
+   fs_dax_release(sbi->s_daxdev, sb);
kfree(sbi);
 }
 
@@ -3398,7 +3398,7 @@ static void ext4_set_resv_clusters(struct 

[PATCH v8 15/18] mm, fs, dax: handle layout changes to pinned dax mappings

2018-03-30 Thread Dan Williams
Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesystem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient, if that assumption is violated the system would
have likely hung from the uncompleted I/O.
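
For illustration, the expected filesystem-side pattern is roughly (a
sketch, not the final xfs code):

	struct page *page;

	/* with locks held to block the establishment of new mappings */
	while ((page = dax_layout_busy_page(mapping)) != NULL) {
		/* drop locks, wait for page_ref_count(page) to reach 1, retry */
	}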

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Dave Chinner 
Cc: Matthew Wilcox 
Cc: Alexander Viro 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Dave Hansen 
Cc: Andrew Morton 
Reported-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c |2 +
 fs/dax.c|   92 +++
 include/linux/dax.h |   25 ++
 mm/gup.c|5 +++
 4 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 3bafaddd02f1..91bfc34e3ca7 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,7 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
-   /* TODO: wakeup page-idle waiters */
+   wake_up_var(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner)
diff --git a/fs/dax.c b/fs/dax.c
index a77394fe586e..c01f7989e0aa 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -355,6 +355,19 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
}
 }
 
+static struct page *dax_busy_page(void *entry)
+{
+   unsigned long pfn;
+
+   for_each_mapped_pfn(entry, pfn) {
+   struct page *page = pfn_to_page(pfn);
+
+   if (page_ref_count(page) > 1)
+   return page;
+   }
+   return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -496,6 +509,85 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
return entry;
 }
 
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true. It expects that get_user_pages() pte
+ * walks are performed under rcu_read_lock().
+ */
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+   pgoff_t indices[PAGEVEC_SIZE];
+   struct page *page = NULL;
+   struct pagevec pvec;
+   pgoff_t index, end;
+   unsigned i;
+
+   /*
+* In the 'limited' case get_user_pages() for dax is disabled.
+*/
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+   return NULL;
+
+   if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+   return NULL;
+
+   

[PATCH v8 16/18] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL

2018-03-30 Thread Dan Williams
In preparation for adding coordination between extent unmap operations
and busy dax-pages, update xfs_break_layouts() to permit it to be called
with the mmap lock held. This lock scheme will be required for
coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
pages mapped into the file's address space). Breaking dax layouts will
be added to xfs_break_layouts() in a future patch, for now this preps
the unmap call sites to take and hold XFS_MMAPLOCK_EXCL over the call to
xfs_break_layouts().
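
As a sketch, the resulting call-site pattern takes both locks up front
and holds them across the break (illustrative only, not the exact diff):

	uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;

	xfs_ilock(ip, iolock);
	error = xfs_break_layouts(inode, &iolock);
	if (error)
		goto out_unlock;
	/* ... modify the extent map ... */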

Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Dave Chinner 
Suggested-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Reviewed-by: "Darrick J. Wong" 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_file.c  |5 +
 fs/xfs/xfs_ioctl.c |5 +
 fs/xfs/xfs_iops.c  |   10 +++---
 fs/xfs/xfs_pnfs.c  |3 ++-
 4 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9ea08326f876..18edf04811d0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -768,7 +768,7 @@ xfs_file_fallocate(
struct xfs_inode *ip = XFS_I(inode);
long error;
enum xfs_prealloc_flags flags = 0;
-   uint iolock = XFS_IOLOCK_EXCL;
+   uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
loff_t  new_size = 0;
bool do_file_insert = false;
 
@@ -782,9 +782,6 @@ xfs_file_fallocate(
if (error)
goto out_unlock;
 
-   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-   iolock |= XFS_MMAPLOCK_EXCL;
-
if (mode & FALLOC_FL_PUNCH_HOLE) {
error = xfs_free_file_space(ip, offset, len);
if (error)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 89fb1eb80aae..4151fade4bb1 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -614,7 +614,7 @@ xfs_ioc_space(
struct xfs_inode *ip = XFS_I(inode);
struct iattr iattr;
enum xfs_prealloc_flags flags = 0;
-   uint iolock = XFS_IOLOCK_EXCL;
+   uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
int error;
 
/*
@@ -648,9 +648,6 @@ xfs_ioc_space(
if (error)
goto out_unlock;
 
-   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-   iolock |= XFS_MMAPLOCK_EXCL;
-
switch (bf->l_whence) {
case 0: /*SEEK_SET*/
break;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 951e84df5576..d23aa08426f9 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1028,13 +1028,17 @@ xfs_vn_setattr(
 
if (iattr->ia_valid & ATTR_SIZE) {
struct xfs_inode *ip = XFS_I(d_inode(dentry));
-   uint iolock = XFS_IOLOCK_EXCL;
+   uint iolock;
+
+   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+   iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 
error = xfs_break_layouts(d_inode(dentry), &iolock);
-   if (error)
+   if (error) {
+       xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
return error;
+   }
 
-   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
error = xfs_vn_setattr_size(dentry, iattr);
xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
} else {
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index aa6c5c193f45..6ea7b0b55d02 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -43,7 +43,8 @@ xfs_break_layouts(
while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
xfs_iunlock(ip, *iolock);
error = break_layout(inode, true);
-   *iolock = XFS_IOLOCK_EXCL;
+   *iolock &= ~XFS_IOLOCK_SHARED;
+   *iolock |= XFS_IOLOCK_EXCL;
xfs_ilock(ip, *iolock);
}
 



[PATCH v8 14/18] memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL

2018-03-30 Thread Dan Williams
The devm_memremap_pages() facility is tightly integrated with the
kernel's memory hotplug functionality. It injects an altmap argument
deep into the architecture specific vmemmap implementation to allow
allocating from specific reserved pages, and it has Linux specific
assumptions about page structure reference counting relative to
get_user_pages() and get_user_pages_fast(). It was an oversight that
this was not marked EXPORT_SYMBOL_GPL from the outset.

Cc: Michal Hocko 
Cc: "Jérôme Glisse" 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 kernel/memremap.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07a6a405cf3d..4b0e17df8981 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -257,7 +257,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
devres_free(pgmap);
return ERR_PTR(error);
 }
-EXPORT_SYMBOL(devm_memremap_pages);
+EXPORT_SYMBOL_GPL(devm_memremap_pages);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 {



[PATCH v8 17/18] xfs: prepare xfs_break_layouts() for another layout type

2018-03-30 Thread Dan Williams
When xfs is operating as the back-end of a pNFS block server, it
prevents collisions between local and remote operations by requiring a
lease to be held for remotely accessed blocks. Local filesystem
operations break those leases before writing or mutating the extent map
of the file.

A similar mechanism is needed to prevent operations on pinned dax
mappings, like device-DMA, from colliding with extent unmap operations.

BREAK_WRITE and BREAK_UNMAP are introduced as two distinct levels of
layout breaking.

Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
do not collide with local writes. Additionally, layouts are broken in
the BREAK_UNMAP case to make sure the layout-holder has a consistent
view of the file's extent map. While BREAK_WRITE breaks can be satisfied
by recalling FL_LAYOUT leases, BREAK_UNMAP breaks additionally require
waiting for busy dax-pages to go idle while holding XFS_MMAPLOCK_EXCL.

After this refactoring xfs_break_layouts() becomes the entry point for
coordinating both types of breaks. Finally, xfs_break_leased_layouts()
becomes just the BREAK_WRITE handler.

Note that the unlock tracking is needed in a follow on change. That will
coordinate retrying either break handler until both successfully test
for a lease break while maintaining the lock state.
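
Illustrative call sites after this change (a sketch of the two levels):

	/* write paths: recall FL_LAYOUT leases only */
	error = xfs_break_layouts(inode, &iolock, BREAK_WRITE);

	/* unmap paths: will additionally wait for busy dax pages */
	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);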

Cc: Ross Zwisler 
Cc: "Darrick J. Wong" 
Reported-by: Dave Chinner 
Reported-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_file.c  |   30 --
 fs/xfs/xfs_inode.h |   16 
 fs/xfs/xfs_ioctl.c |3 +--
 fs/xfs/xfs_iops.c  |6 +++---
 fs/xfs/xfs_pnfs.c  |   13 +++--
 fs/xfs/xfs_pnfs.h  |6 --
 6 files changed, 59 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 18edf04811d0..51e6506bdcb1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -350,7 +350,7 @@ xfs_file_aio_write_checks(
if (error <= 0)
return error;
 
-   error = xfs_break_layouts(inode, iolock);
+   error = xfs_break_layouts(inode, iolock, BREAK_WRITE);
if (error)
return error;
 
@@ -752,6 +752,32 @@ xfs_file_write_iter(
return ret;
 }
 
+int
+xfs_break_layouts(
+   struct inode *inode,
+   uint *iolock,
+   enum layout_break_reason reason)
+{
+   bool retry = false;
+   int error = 0;
+
+   ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
+
+   switch (reason) {
+   case BREAK_UNMAP:
+   ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
+   /* fall through */
+   case BREAK_WRITE:
+   error = xfs_break_leased_layouts(inode, iolock, &retry);
+   break;
+   default:
+   WARN_ON_ONCE(1);
+   return -EINVAL;
+   }
+
+   return error;
+}
+
#define XFS_FALLOC_FL_SUPPORTED \
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |  \
@@ -778,7 +804,7 @@ xfs_file_fallocate(
return -EOPNOTSUPP;
 
xfs_ilock(ip, iolock);
-   error = xfs_break_layouts(inode, &iolock);
+   error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
if (error)
goto out_unlock;
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 3e8dc990d41c..7e1a077dfc04 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -379,6 +379,20 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
>> XFS_ILOCK_SHIFT)
 
 /*
+ * Layouts are broken in the BREAK_WRITE case to ensure that
+ * layout-holders do not collide with local writes. Additionally,
+ * layouts are broken in the BREAK_UNMAP case to make sure the
+ * layout-holder has a consistent view of the file's extent map. While
+ * BREAK_WRITE breaks can be satisfied be recalling FL_LAYOUT leases,
+ * BREAK_UNMAP breaks additionally require waiting for busy dax-pages to
+ * go idle.
+ */
+enum layout_break_reason {
+   BREAK_WRITE,
+   BREAK_UNMAP,
+};
+
+/*
  * For multiple groups support: if S_ISGID bit is set in the parent
  * directory, group of new file is set to that of the parent, and
  * new subdirectory gets S_ISGID bit from parent.
@@ -447,6 +461,8 @@ int xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
 xfs_fsize_t isize, bool *did_zeroing);
int xfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
bool *did_zero);
+int xfs_break_layouts(struct inode *inode, uint *iolock,
+   enum layout_break_reason reason);
 
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 4151fade4bb1..91e73d663099 100644
--- a/fs/xfs/xfs_ioctl.c

[PATCH v8 18/18] xfs, dax: introduce xfs_break_dax_layouts()

2018-03-30 Thread Dan Williams
xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
for busy / pinned dax pages and waits for those pages to go idle before
any potential extent unmap operation.

dax_layout_busy_page() handles synchronizing against new page-busy
events (get_user_pages). It invalidates all mappings to trigger the
get_user_pages slow path which will eventually block on the xfs inode
lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
busy page it returns it for xfs to wait for the page-idle event that
will fire when the page reference count reaches 1 (recall ZONE_DEVICE
pages are idle at count 1, see generic_dax_pagefree()).

While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
deadlock the process that might be trying to elevate the page count of
more pages before arranging for any of them to go idle. I.e. the typical
case of submitting I/O is that iov_iter_get_pages() elevates the
reference count of all pages in the I/O before starting I/O on the first
page. The process of elevating the reference count of all pages involved
in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.
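
The underlying handshake is the wait_var_event()/wake_up_var() pairing;
a simplified sketch of the two halves, without the lock juggling done
below:

	/* waker, from the dax ->page_free() path: */
	wake_up_var(&page->_refcount);

	/* waiter, blocking until the page is idle: */
	wait_var_event(&page->_refcount, page_ref_count(page) == 1);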

Cc: Jan Kara 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_file.c |   60 +++--
 1 file changed, 49 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 51e6506bdcb1..0342f6fb782f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -752,6 +752,38 @@ xfs_file_write_iter(
return ret;
 }
 
+static void
+xfs_wait_var_event(
+   struct inode *inode,
+   uint iolock,
+   bool *did_unlock)
+{
+   struct xfs_inode *ip = XFS_I(inode);
+
+   *did_unlock = true;
+   xfs_iunlock(ip, iolock);
+   schedule();
+   xfs_ilock(ip, iolock);
+}
+
+static int
+xfs_break_dax_layouts(
+   struct inode *inode,
+   uint iolock,
+   bool *did_unlock)
+{
+   struct page *page;
+
+   *did_unlock = false;
+   page = dax_layout_busy_page(inode->i_mapping);
+   if (!page)
+   return 0;
+
+   return ___wait_var_event(&page->_refcount,
+   atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
+   0, 0, xfs_wait_var_event(inode, iolock, did_unlock));
+}
+
 int
 xfs_break_layouts(
struct inode *inode,
@@ -763,17 +795,23 @@ xfs_break_layouts(
 
ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
 
-   switch (reason) {
-   case BREAK_UNMAP:
-   ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
-   /* fall through */
-   case BREAK_WRITE:
-   error = xfs_break_leased_layouts(inode, iolock, &retry);
-   break;
-   default:
-   WARN_ON_ONCE(1);
-   return -EINVAL;
-   }
+   do {
+   switch (reason) {
+   case BREAK_UNMAP:
+   ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
+
+   error = xfs_break_dax_layouts(inode, *iolock, &retry);
+   /* fall through */
+   case BREAK_WRITE:
+   if (error || retry)
+   break;
+   error = xfs_break_leased_layouts(inode, iolock, &retry);
+   break;
+   default:
+   WARN_ON_ONCE(1);
+   return -EINVAL;
+   }
+   } while (error == 0 && retry);
 
return error;
 }



[PATCH v8 12/18] memremap: split devm_memremap_pages() and memremap() infrastructure

2018-03-30 Thread Dan Williams
Currently, kernel/memremap.c contains generic code for supporting
memremap() (CONFIG_HAS_IOMEM) and devm_memremap_pages()
(CONFIG_ZONE_DEVICE). This causes ongoing build maintenance problems as
additions to memremap.c, especially for the ZONE_DEVICE case, need to be
careful about being placed in ifdef guards. Remove the need for these
ifdef guards by moving the ZONE_DEVICE support functions to their own
compilation unit.
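
For context, a minimal (hypothetical) caller of the now-relocated
memremap() API, assuming 'res' is a struct resource the driver owns:

	void *addr = memremap(res->start, resource_size(res), MEMREMAP_WB);

	if (!addr)
		return -ENOMEM;
	/* ... use the cacheable mapping ... */
	memunmap(addr);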

Cc: Jan Kara 
Cc: Christoph Hellwig 
Cc: "Jérôme Glisse" 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 kernel/Makefile   |3 +
 kernel/iomem.c|  167 ++
 kernel/memremap.c |  178 +
 3 files changed, 171 insertions(+), 177 deletions(-)
 create mode 100644 kernel/iomem.c

diff --git a/kernel/Makefile b/kernel/Makefile
index f85ae5dfa474..9b9241361311 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,7 +112,8 @@ obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
 
-obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_HAS_IOMEM) += iomem.o
+obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/iomem.c b/kernel/iomem.c
new file mode 100644
index ..f7525e14ebc6
--- /dev/null
+++ b/kernel/iomem.c
@@ -0,0 +1,167 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/device.h>
+#include <linux/types.h>
+#include <linux/io.h>
+#include <linux/mm.h>
+
+#ifndef ioremap_cache
+/* temporary while we convert existing ioremap_cache users to memremap */
+__weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)
+{
+   return ioremap(offset, size);
+}
+#endif
+
+#ifndef arch_memremap_wb
+static void *arch_memremap_wb(resource_size_t offset, unsigned long size)
+{
+   return (__force void *)ioremap_cache(offset, size);
+}
+#endif
+
+#ifndef arch_memremap_can_ram_remap
+static bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size,
+   unsigned long flags)
+{
+   return true;
+}
+#endif
+
+static void *try_ram_remap(resource_size_t offset, size_t size,
+  unsigned long flags)
+{
+   unsigned long pfn = PHYS_PFN(offset);
+
+   /* In the simple case just return the existing linear address */
+   if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)) &&
+   arch_memremap_can_ram_remap(offset, size, flags))
+   return __va(offset);
+
+   return NULL; /* fallback to arch_memremap_wb */
+}
+
+/**
+ * memremap() - remap an iomem_resource as cacheable memory
+ * @offset: iomem resource start address
+ * @size: size of remap
+ * @flags: any of MEMREMAP_WB, MEMREMAP_WT, MEMREMAP_WC,
+ *   MEMREMAP_ENC, MEMREMAP_DEC
+ *
+ * memremap() is "ioremap" for cases where it is known that the resource
+ * being mapped does not have i/o side effects and the __iomem
+ * annotation is not applicable. In the case of multiple flags, the different
+ * mapping types will be attempted in the order listed below until one of
+ * them succeeds.
+ *
+ * MEMREMAP_WB - matches the default mapping for System RAM on
+ * the architecture.  This is usually a read-allocate write-back cache.
+ * Moreover, if MEMREMAP_WB is specified and the requested remap region is RAM
+ * memremap() will bypass establishing a new mapping and instead return
+ * a pointer into the direct map.
+ *
+ * MEMREMAP_WT - establish a mapping whereby writes either bypass the
+ * cache or are written through to memory and never exist in a
+ * cache-dirty state with respect to program visibility.  Attempts to
+ * map System RAM with this mapping type will fail.
+ *
+ * MEMREMAP_WC - establish a writecombine mapping, whereby writes may
+ * be coalesced together (e.g. in the CPU's write buffers), but is otherwise
+ * uncached. Attempts to map System RAM with this mapping type will fail.
+ */
+void *memremap(resource_size_t offset, size_t size, unsigned long flags)
+{
+   int is_ram = region_intersects(offset, size,
+  IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
+   void *addr = NULL;
+
+   if (!flags)
+   return NULL;
+
+   if (is_ram == REGION_MIXED) {
+   WARN_ONCE(1, "memremap attempted on mixed range %pa size: %#lx\n",
+   &offset, (unsigned long) size);
+   return NULL;
+   }
+
+   /* Try all mapping types requested until one returns non-NULL */
+   if (flags & MEMREMAP_WB) {
+   /*
+* MEMREMAP_WB is special in that it can be satisfied
+* from the direct map.  Some archs depend on the
+* capability of memremap() to autodetect cases where
+* the requested range is potentially in System RAM.
+*/
+   if (is_ram == REGION_INTERSECTS)
+   addr = try_ram_remap(offset, size, flags);

[PATCH v8 07/18] fs, dax: use page->mapping to warn if truncate collides with a busy page

2018-03-30 Thread Dan Williams
Catch cases where extent unmap operations encounter pages that are
pinned / busy. Typically these are pinned pages under active dma.
This warning is a canary for potential data corruption as truncated
blocks could be allocated to a new file while the device is still
performing i/o.

Here is an example of a collision that this implementation catches:

 WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
 [..]
 Call Trace:
  __dax_invalidate_mapping_entry+0x6c/0xf0
  dax_delete_mapping_entry+0xf/0x20
  truncate_exceptional_pvec_entries.part.12+0x1af/0x200
  truncate_inode_pages_range+0x268/0x970
  ? tlb_gather_mmu+0x10/0x20
  ? up_write+0x1c/0x40
  ? unmap_mapping_range+0x73/0x140
  xfs_free_file_space+0x1b6/0x5b0 [xfs]
  ? xfs_file_fallocate+0x7f/0x320 [xfs]
  ? down_write_nested+0x40/0x70
  ? xfs_ilock+0x21d/0x2f0 [xfs]
  xfs_file_fallocate+0x162/0x320 [xfs]
  ? rcu_read_lock_sched_held+0x3f/0x70
  ? rcu_sync_lockdep_assert+0x2a/0x50
  ? __sb_start_write+0xd0/0x1b0
  ? vfs_fallocate+0x20c/0x270
  vfs_fallocate+0x154/0x270
  SyS_fallocate+0x43/0x80
  entry_SYSCALL_64_fastpath+0x1f/0x96
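
To make the sequence behind this warning concrete, here is the lifecycle
the association tracking assumes, condensed into pseudo-steps (the calls
are real, the interleaving is illustrative):

	/* fault: dax_associate_entry() ties the page to the file */
	page->mapping = mapping;

	/* dma setup: get_user_pages() pins the page */
	get_page(page);			/* page_ref_count(page) > 1 now */

	/* truncate: dax_disassociate_entry(..., trunc == true) */
	WARN_ON_ONCE(trunc && page_ref_count(page) > 1);	/* fires */
	page->mapping = NULL;

The warning fires exactly when the third step runs before the reference
taken in the second step has been dropped.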

Cc: Jeff Moyer 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Reviewed-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 fs/dax.c |   63 ++
 1 file changed, 63 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index b646a46e4d12..a77394fe586e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -298,6 +298,63 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+   if (dax_is_zero_entry(entry))
+   return 0;
+   else if (dax_is_empty_entry(entry))
+   return 0;
+   else if (dax_is_pmd_entry(entry))
+   return PMD_SIZE;
+   else
+   return PAGE_SIZE;
+}
+
+static unsigned long dax_radix_end_pfn(void *entry)
+{
+   return dax_radix_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
+}
+
+/*
+ * Iterate through all mapped pfns represented by an entry, i.e. skip
+ * 'empty' and 'zero' entries.
+ */
+#define for_each_mapped_pfn(entry, pfn) \
+   for (pfn = dax_radix_pfn(entry); \
+   pfn < dax_radix_end_pfn(entry); pfn++)
+
+static void dax_associate_entry(void *entry, struct address_space *mapping)
+{
+   unsigned long pfn;
+
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+   return;
+
+   for_each_mapped_pfn(entry, pfn) {
+   struct page *page = pfn_to_page(pfn);
+
+   WARN_ON_ONCE(page->mapping);
+   page->mapping = mapping;
+   }
+}
+
+static void dax_disassociate_entry(void *entry, struct address_space *mapping,
+   bool trunc)
+{
+   unsigned long pfn;
+
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+   return;
+
+   for_each_mapped_pfn(entry, pfn) {
+   struct page *page = pfn_to_page(pfn);
+
+   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+   page->mapping = NULL;
+   }
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -404,6 +461,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
}
 
if (pmd_downgrade) {
+   dax_disassociate_entry(entry, mapping, false);
	radix_tree_delete(&mapping->page_tree, index);
mapping->nrexceptional--;
dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -453,6 +511,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
(radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
goto out;
+   dax_disassociate_entry(entry, mapping, trunc);
radix_tree_delete(page_tree, index);
mapping->nrexceptional--;
ret = 1;
@@ -547,6 +606,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 
	spin_lock_irq(&mapping->tree_lock);
new_entry = dax_radix_locked_entry(pfn, flags);
+   if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+   dax_disassociate_entry(entry, mapping, false);
+   dax_associate_entry(new_entry, mapping);
+   }
 
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
/*



[PATCH v8 10/18] dax, dm: introduce ->fs_{claim, release}() dax_device infrastructure

2018-03-30 Thread Dan Williams
In preparation for allowing filesystems to augment the dev_pagemap
associated with a dax_device, add an ->fs_claim() callback. The
->fs_claim() callback is leveraged by the device-mapper dax
implementation to iterate all member devices in the map and repeat the
claim operation across the array.

In order to resolve collisions between filesystem operations and DMA to
DAX mapped pages we need a callback when DMA completes. With a callback
we can hold off filesystem operations while DMA is in-flight and then
resume those operations when the last put_page() occurs on a DMA page.
The ->fs_claim() operation arranges for this callback to be registered,
although that implementation is saved for a later patch.
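
A sketch of the intended pairing from a filesystem's mount and unmount
paths (the example_fs_* hook names are illustrative; the actual call
sites appear later in the series):

static struct dax_device *example_fs_mount_claim(struct super_block *sb)
{
	/* 'sb' is the owner token for the lifetime of the mount */
	return fs_dax_claim_bdev(sb->s_bdev, sb);
}

static void example_fs_unmount_release(struct super_block *sb,
		struct dax_device *dax_dev)
{
	/* tears down the ->page_free() registration taken at claim time */
	fs_dax_release(dax_dev, sb);
}

Device-mapper forwards these through its ->fs_claim() / ->fs_release()
ops so every member device gets the same treatment.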

Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Cc: "Jérôme Glisse" 
Cc: Christoph Hellwig 
Cc: Jan Kara 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c  |   80 ++
 drivers/md/dm.c  |   56 
 include/linux/dax.h  |   16 +
 include/linux/memremap.h |8 +
 4 files changed, 160 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..c4cf284dfe1c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
+static DEFINE_MUTEX(devmap_lock);
 
 #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
 static struct hlist_head dax_host_list[DAX_HASH_SIZE];
@@ -169,9 +170,88 @@ struct dax_device {
const char *host;
void *private;
unsigned long flags;
+   struct dev_pagemap *pgmap;
const struct dax_operations *ops;
 };
 
+#if IS_ENABLED(CONFIG_FS_DAX)
+static void generic_dax_pagefree(struct page *page, void *data)
+{
+   /* TODO: wakeup page-idle waiters */
+}
+
+struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner)
+{
+   struct dev_pagemap *pgmap;
+
+   if (!dax_dev->pgmap)
+   return dax_dev;
+   pgmap = dax_dev->pgmap;
+
+	mutex_lock(&devmap_lock);
+	if (pgmap->data && pgmap->data == owner) {
+		/* dm might try to claim the same device more than once... */
+		mutex_unlock(&devmap_lock);
+		return dax_dev;
+	} else if (pgmap->page_free || pgmap->page_fault
+			|| pgmap->type != MEMORY_DEVICE_HOST) {
+		put_dax(dax_dev);
+		mutex_unlock(&devmap_lock);
+		return NULL;
+	}
+
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+	pgmap->page_free = generic_dax_pagefree;
+	pgmap->data = owner;
+	mutex_unlock(&devmap_lock);
+
+   return dax_dev;
+}
+EXPORT_SYMBOL_GPL(fs_dax_claim);
+
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
+{
+   struct dax_device *dax_dev;
+
+   if (!blk_queue_dax(bdev->bd_queue))
+   return NULL;
+   dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
+   if (dax_dev->ops->fs_claim)
+   return dax_dev->ops->fs_claim(dax_dev, owner);
+   else
+   return fs_dax_claim(dax_dev, owner);
+}
+EXPORT_SYMBOL_GPL(fs_dax_claim_bdev);
+
+void __fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+   struct dev_pagemap *pgmap = dax_dev ? dax_dev->pgmap : NULL;
+
+   put_dax(dax_dev);
+   if (!pgmap)
+   return;
+   if (!pgmap->data)
+   return;
+
+	mutex_lock(&devmap_lock);
+	WARN_ON(pgmap->data != owner);
+	pgmap->type = MEMORY_DEVICE_HOST;
+	pgmap->page_free = NULL;
+	pgmap->data = NULL;
+	mutex_unlock(&devmap_lock);
+}
+EXPORT_SYMBOL_GPL(__fs_dax_release);
+
+void fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+   if (dax_dev->ops->fs_release)
+   dax_dev->ops->fs_release(dax_dev, owner);
+   else
+   __fs_dax_release(dax_dev, owner);
+}
+EXPORT_SYMBOL_GPL(fs_dax_release);
+#endif
+
 static ssize_t write_cache_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ffc93aecc02a..964cb7537f11 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1090,6 +1090,60 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
return ret;
 }
 
+static int dm_dax_dev_claim(struct dm_target *ti, struct dm_dev *dev,
+   sector_t start, sector_t len, void *owner)
+{
+   if (fs_dax_claim(dev->dax_dev, owner))
+   return 0;
+   /*
+* Outside of a kernel bug there is no reason a dax_dev should
+* fail a claim attempt. Device-mapper should have exclusive
+* ownership of the dm_dev and the filesystem should have
+* exclusive ownership of the dm_target.
+*/
+   WARN_ON_ONCE(1);

Re: [PATCH][next] apparmor: fix memory leak on buffer on error exit path

2018-03-30 Thread John Johansen
On 03/27/2018 06:35 AM, Colin King wrote:
> From: Colin Ian King 
> 
> Currently on the error exit path the allocated buffer is not free'd
> causing a memory leak. Fix this by kfree'ing it.
> 
> Detected by CoverityScan, CID#1466876 ("Resource leaks")
> 
Fixes: 1180b4c757aa ("apparmor: fix dangling symlinks to policy rawdata after replacement")
> Signed-off-by: Colin Ian King 

thanks Colin

I've pulled it into apparmor-next

> ---
>  security/apparmor/apparmorfs.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
> index 96bb6b73af65..949dd8a48164 100644
> --- a/security/apparmor/apparmorfs.c
> +++ b/security/apparmor/apparmorfs.c
> @@ -1497,8 +1497,10 @@ static char *gen_symlink_name(int depth, const char *dirname, const char *fname)
>   }
>  
>   error = snprintf(s, size, "raw_data/%s/%s", dirname, fname);
> - if (error >= size || error < 0)
> + if (error >= size || error < 0) {
> + kfree(buffer);
>   return ERR_PTR(-ENAMETOOLONG);
> + }
>  
>   return buffer;
>  }
> 
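
For comparison, the same fix in the kernel's common goto-unwind style,
which keeps a single exit path if more error cases grow later (condensed
from the function above; 'buffer' and 's' follow the existing names):

	error = snprintf(s, size, "raw_data/%s/%s", dirname, fname);
	if (error >= size || error < 0) {
		error = -ENAMETOOLONG;
		goto out_free;
	}

	return buffer;

out_free:
	kfree(buffer);
	return ERR_PTR(error);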



[PATCH v8 09/18] dax, dm: allow device-mapper to operate without dax support

2018-03-30 Thread Dan Williams
Change device-mapper's DAX dependency to require the presence of at
least one DAX_DRIVER. This allows device-mapper to be built without
bringing the DAX core along which is especially wasteful when there are
no DAX drivers, like BLK_DEV_PMEM, configured.
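
The recurring pattern in this patch: compile the real methods when a DAX
driver can be present, otherwise define the method names to NULL so the
target_type tables still build. Condensed illustration (the example_*
names are placeholders):

#if IS_ENABLED(CONFIG_DAX_DRIVER)
static long example_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
		long nr_pages, void **kaddr, pfn_t *pfn)
{
	/* real implementation; see the dm-linear.c hunks below */
	return -EIO;
}
#else
#define example_dax_direct_access NULL
#endif

static struct target_type example_target = {
	.name		= "example",
	.direct_access	= example_dax_direct_access,	/* NULL when !DAX_DRIVER */
};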

Cc: Alasdair Kergon 
Reported-by: Bart Van Assche 
Reported-by: kbuild test robot 
Reviewed-by: Mike Snitzer 
Signed-off-by: Dan Williams 
---
 drivers/md/Kconfig |1 
 drivers/md/dm-linear.c |6 +++
 drivers/md/dm-log-writes.c |   95 +++-
 drivers/md/dm-stripe.c |6 +++
 drivers/md/dm.c|   10 +++--
 include/linux/dax.h|   30 +++---
 6 files changed, 92 insertions(+), 56 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 2c8ac3688815..6dfc328b8f99 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -201,7 +201,6 @@ config BLK_DEV_DM_BUILTIN
 config BLK_DEV_DM
tristate "Device mapper support"
select BLK_DEV_DM_BUILTIN
-   select DAX
---help---
  Device-mapper is a low level volume manager.  It works by allowing
  people to specify mappings for ranges of logical sectors.  Various
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index d5f8eff7c11d..89443e0ededa 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -154,6 +154,7 @@ static int linear_iterate_devices(struct dm_target *ti,
return fn(ti, lc->dev, lc->start, ti->len, data);
 }
 
+#if IS_ENABLED(CONFIG_DAX_DRIVER)
 static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn)
 {
@@ -184,6 +185,11 @@ static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
 }
 
+#else
+#define linear_dax_direct_access NULL
+#define linear_dax_copy_from_iter NULL
+#endif
+
 static struct target_type linear_target = {
.name   = "linear",
.version = {1, 4, 0},
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 3362d866793b..7fcb4216973f 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -610,51 +610,6 @@ static int log_mark(struct log_writes_c *lc, char *data)
return 0;
 }
 
-static int log_dax(struct log_writes_c *lc, sector_t sector, size_t bytes,
-  struct iov_iter *i)
-{
-   struct pending_block *block;
-
-   if (!bytes)
-   return 0;
-
-   block = kzalloc(sizeof(struct pending_block), GFP_KERNEL);
-   if (!block) {
-   DMERR("Error allocating dax pending block");
-   return -ENOMEM;
-   }
-
-   block->data = kzalloc(bytes, GFP_KERNEL);
-   if (!block->data) {
-   DMERR("Error allocating dax data space");
-   kfree(block);
-   return -ENOMEM;
-   }
-
-   /* write data provided via the iterator */
-   if (!copy_from_iter(block->data, bytes, i)) {
-   DMERR("Error copying dax data");
-   kfree(block->data);
-   kfree(block);
-   return -EIO;
-   }
-
-   /* rewind the iterator so that the block driver can use it */
-   iov_iter_revert(i, bytes);
-
-   block->datalen = bytes;
-   block->sector = bio_to_dev_sectors(lc, sector);
-   block->nr_sectors = ALIGN(bytes, lc->sectorsize) >> lc->sectorshift;
-
-	atomic_inc(&lc->pending_blocks);
-	spin_lock_irq(&lc->blocks_lock);
-	list_add_tail(&block->list, &lc->unflushed_blocks);
-	spin_unlock_irq(&lc->blocks_lock);
-   wake_up_process(lc->log_kthread);
-
-   return 0;
-}
-
 static void log_writes_dtr(struct dm_target *ti)
 {
struct log_writes_c *lc = ti->private;
@@ -920,6 +875,52 @@ static void log_writes_io_hints(struct dm_target *ti, struct queue_limits *limit
limits->io_min = limits->physical_block_size;
 }
 
+#if IS_ENABLED(CONFIG_DAX_DRIVER)
+static int log_dax(struct log_writes_c *lc, sector_t sector, size_t bytes,
+  struct iov_iter *i)
+{
+   struct pending_block *block;
+
+   if (!bytes)
+   return 0;
+
+   block = kzalloc(sizeof(struct pending_block), GFP_KERNEL);
+   if (!block) {
+   DMERR("Error allocating dax pending block");
+   return -ENOMEM;
+   }
+
+   block->data = kzalloc(bytes, GFP_KERNEL);
+   if (!block->data) {
+   DMERR("Error allocating dax data space");
+   kfree(block);
+   return -ENOMEM;
+   }
+
+   /* write data provided via the iterator */
+   if (!copy_from_iter(block->data, bytes, i)) {
+   DMERR("Error copying dax data");
+   kfree(block->data);
+   kfree(block);
+   return -EIO;
+   }
+
+   /* rewind the iterator so that the block driver can use it */
+   iov_iter_revert(i, bytes);
+
+   block->datalen = 

[PATCH v8 13/18] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS

2018-03-30 Thread Dan Williams
The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1 it requires an additional branch to
check the page-type at put_page() time. Given put_page() is a hot-path
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.

Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.
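
Condensed shape of the hot-path gate (see the include/linux/mm.h hunk
for the full version; this fragment elides the compound-page and
zero-count handling, and the _sketch suffix marks it as illustrative):

DEFINE_STATIC_KEY_FALSE(devmap_managed_key);

static inline bool put_devmap_managed_page_sketch(struct page *page)
{
	/* compiled to a nop until HMM or FS_DAX arms the key */
	if (!static_branch_unlikely(&devmap_managed_key))
		return false;
	if (!is_zone_device_page(page))
		return false;
	/* ZONE_DEVICE pages are idle at a count of 1, not 0 */
	if (page_ref_dec_return(page) == 1)
		page->pgmap->page_free(page, page->pgmap->data);
	return true;
}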

For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
However, we still need to support FS_DAX in the FS_DAX_LIMITED case
implemented by the s390/dcssblk driver.

Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Michal Hocko 
Reported-by: Thomas Meyer 
Reviewed-by: Christoph Hellwig 
Reviewed-by: "Jérôme Glisse" 
Reviewed-by: Jan Kara 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c  |4 ++-
 fs/Kconfig   |1 +
 include/linux/dax.h  |   35 ---
 include/linux/memremap.h |   17 ---
 include/linux/mm.h   |   71 ++
 kernel/memremap.c|   30 +--
 mm/Kconfig   |5 +++
 mm/hmm.c |   13 +---
 mm/swap.c|3 +-
 9 files changed, 116 insertions(+), 63 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 7d260f118a39..3bafaddd02f1 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -164,7 +164,7 @@ struct dax_device {
const struct dax_operations *ops;
 };
 
-#if IS_ENABLED(CONFIG_FS_DAX)
+#if IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
/* TODO: wakeup page-idle waiters */
@@ -190,6 +190,7 @@ struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner)
return NULL;
}
 
+   dev_pagemap_get_ops();
pgmap->type = MEMORY_DEVICE_FS_DAX;
pgmap->page_free = generic_dax_pagefree;
pgmap->data = owner;
@@ -228,6 +229,7 @@ void __fs_dax_release(struct dax_device *dax_dev, void *owner)
pgmap->type = MEMORY_DEVICE_HOST;
pgmap->page_free = NULL;
pgmap->data = NULL;
+   dev_pagemap_put_ops();
	mutex_unlock(&devmap_lock);
 }
 EXPORT_SYMBOL_GPL(__fs_dax_release);
diff --git a/fs/Kconfig b/fs/Kconfig
index bc821a86d965..1e050e012eb9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
bool "Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
+   select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
select FS_IOMAP
select DAX
help
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a88ff009e2a1..a36b74aa96e8 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -87,12 +88,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
return dax_get_by_host(host);
 }
 
-struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
-void fs_dax_release(struct dax_device *dax_dev, void *owner);
 int dax_writeback_mapping_range(struct address_space *mapping,
struct block_device *bdev, struct writeback_control *wbc);
-struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
-void __fs_dax_release(struct dax_device *dax_dev, void *owner);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -104,26 +101,42 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
return NULL;
 }
 
+static inline int dax_writeback_mapping_range(struct address_space *mapping,
+   struct block_device *bdev, struct writeback_control *wbc)
+{
+   return -EOPNOTSUPP;
+}
+#endif
+
+#if IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
+struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
+void __fs_dax_release(struct dax_device *dax_dev, void *owner);
+void fs_dax_release(struct dax_device *dax_dev, void *owner);
+#else
+#ifdef CONFIG_BLOCK
 static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
void *owner)
 {
-   return NULL;
+   return fs_dax_get_by_host(bdev->bd_disk->disk_name);
 }
-
-static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
+#else
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+   void *owner)
 {
+   return NULL;
 }
+#endif
 
-static inline int dax_writeback_mapping_range(struct address_space *mapping,
-   struct 

[PATCH v8 01/18] dax: store pfns in the radix

2018-03-30 Thread Dan Williams
In preparation for examining the busy state of dax pages in the truncate
path, switch from sectors to pfns in the radix.
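
A quick round-trip example with the helpers changed below (the values
are arbitrary; RADIX_DAX_SHIFT leaves room for the exceptional, lock,
and size flag bits below the pfn, and RADIX_DAX_PMD is the existing PMD
flag in fs/dax.c):

	void *entry;

	/* PTE-sized entry for pfn 0x1234, no extra flags */
	entry = dax_radix_locked_entry(0x1234, 0);
	BUG_ON(dax_radix_pfn(entry) != 0x1234);

	/* PMD-sized entry: only the flag bits change, not the pfn encoding */
	entry = dax_radix_locked_entry(0x1234, RADIX_DAX_PMD);
	BUG_ON(dax_radix_pfn(entry) != 0x1234);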

Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Reviewed-by: Jan Kara 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c |   15 +++--
 fs/dax.c|   83 +++
 2 files changed, 43 insertions(+), 55 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ecdc292aa4e4..2b2332b605e4 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -124,10 +124,19 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
return len < 0 ? len : -EIO;
}
 
-   if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
-   || pfn_t_devmap(pfn))
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
+   /*
+* An arch that has enabled the pmem api should also
+* have its drivers support pfn_t_devmap()
+*
+* This is a developer warning and should not trigger in
+* production. dax_flush() will crash since it depends
+* on being able to do (page_address(pfn_to_page())).
+*/
+   WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
+   } else if (pfn_t_devmap(pfn)) {
/* pass */;
-   else {
+   } else {
pr_debug("VFS (%s): error: dax support not enabled\n",
sb->s_id);
return -EOPNOTSUPP;
diff --git a/fs/dax.c b/fs/dax.c
index 0276df90e86c..b646a46e4d12 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -73,16 +73,15 @@ fs_initcall(init_dax_wait_table);
 #define RADIX_DAX_ZERO_PAGE	(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
 #define RADIX_DAX_EMPTY	(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
 
-static unsigned long dax_radix_sector(void *entry)
+static unsigned long dax_radix_pfn(void *entry)
 {
return (unsigned long)entry >> RADIX_DAX_SHIFT;
 }
 
-static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
+static void *dax_radix_locked_entry(unsigned long pfn, unsigned long flags)
 {
return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
-   ((unsigned long)sector << RADIX_DAX_SHIFT) |
-   RADIX_DAX_ENTRY_LOCK);
+   (pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);
 }
 
 static unsigned int dax_radix_order(void *entry)
@@ -526,12 +525,13 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
  struct vm_fault *vmf,
- void *entry, sector_t sector,
+ void *entry, pfn_t pfn_t,
  unsigned long flags, bool dirty)
 {
	struct radix_tree_root *page_tree = &mapping->page_tree;
-   void *new_entry;
+   unsigned long pfn = pfn_t_to_pfn(pfn_t);
pgoff_t index = vmf->pgoff;
+   void *new_entry;
 
if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -546,7 +546,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
}
 
	spin_lock_irq(&mapping->tree_lock);
-   new_entry = dax_radix_locked_entry(sector, flags);
+   new_entry = dax_radix_locked_entry(pfn, flags);
 
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
/*
@@ -657,17 +657,14 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
i_mmap_unlock_read(mapping);
 }
 
-static int dax_writeback_one(struct block_device *bdev,
-   struct dax_device *dax_dev, struct address_space *mapping,
-   pgoff_t index, void *entry)
+static int dax_writeback_one(struct dax_device *dax_dev,
+   struct address_space *mapping, pgoff_t index, void *entry)
 {
	struct radix_tree_root *page_tree = &mapping->page_tree;
-   void *entry2, **slot, *kaddr;
-   long ret = 0, id;
-   sector_t sector;
-   pgoff_t pgoff;
+   void *entry2, **slot;
+   unsigned long pfn;
+   long ret = 0;
size_t size;
-   pfn_t pfn;
 
/*
 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -683,10 +680,10 @@ static int dax_writeback_one(struct block_device *bdev,
goto put_unlocked;
/*
 * Entry got reallocated elsewhere? No need to writeback. We have to
-* compare sectors as we must not bail out due to difference in lockbit
+* compare pfns as we must not bail out due to difference in lockbit
 * or entry type.
 */
-   if (dax_radix_sector(entry2) != dax_radix_sector(entry))
+   if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
goto 

[PATCH v8 02/18] fs, dax: prepare for dax-specific address_space_operations

2018-03-30 Thread Dan Williams
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Define some generic VFS aops
helpers for dax. These noop implementations are there in the dax case to
prevent the VFS from falling back to operations with page-cache
assumptions; note that dax_writeback_mapping_range() may not be
referenced in the FS_DAX=n case.
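
For illustration, this is how a filesystem is expected to wire the
helpers into a dax-specific aops instance (patterned on the xfs_dax_aops
added later in the series; example_dax_writepages is a placeholder):

static const struct address_space_operations example_dax_aops = {
	.writepages	= example_dax_writepages, /* wraps dax_writeback_mapping_range() */
	.direct_IO	= noop_direct_IO,	  /* advertise O_DIRECT support only */
	.set_page_dirty	= noop_set_page_dirty,	  /* dirtying tracked in the radix */
	.invalidatepage	= noop_invalidatepage,	  /* no page cache to invalidate */
};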

Cc: Jeff Moyer 
Cc: Ross Zwisler 
Suggested-by: Matthew Wilcox 
Suggested-by: Jan Kara 
Suggested-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Suggested-by: Dave Chinner 
Signed-off-by: Dan Williams 
---
 fs/libfs.c  |   39 +++
 include/linux/dax.h |   12 +---
 include/linux/fs.h  |4 
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 7ff3cb904acd..0fb590d79f30 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1060,6 +1060,45 @@ int noop_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 }
 EXPORT_SYMBOL(noop_fsync);
 
+int noop_set_page_dirty(struct page *page)
+{
+   /*
+* Unlike __set_page_dirty_no_writeback that handles dirty page
+* tracking in the page object, dax does all dirty tracking in
+* the inode address_space in response to mkwrite faults. In the
+* dax case we only need to worry about potentially dirty CPU
+* caches, not dirty page cache pages to write back.
+*
+* This callback is defined to prevent fallback to
+* __set_page_dirty_buffers() in set_page_dirty().
+*/
+   return 0;
+}
+EXPORT_SYMBOL_GPL(noop_set_page_dirty);
+
+void noop_invalidatepage(struct page *page, unsigned int offset,
+   unsigned int length)
+{
+   /*
+* There is no page cache to invalidate in the dax case, however
+* we need this callback defined to prevent falling back to
+* block_invalidatepage() in do_invalidatepage().
+*/
+}
+EXPORT_SYMBOL_GPL(noop_invalidatepage);
+
+ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+{
+   /*
+* iomap based filesystems support direct I/O without need for
+* this callback. However, it still needs to be set in
+* inode->a_ops so that open/fcntl know that direct I/O is
+* generally supported.
+*/
+   return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(noop_direct_IO);
+
 /* Because kfree isn't assignment-compatible with void(void*) ;-/ */
 void kfree_link(void *p)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0185ecdae135..ae27a7efe7ab 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -38,6 +38,7 @@ static inline void put_dax(struct dax_device *dax_dev)
 }
 #endif
 
+struct writeback_control;
 int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
 #if IS_ENABLED(CONFIG_FS_DAX)
 int __bdev_dax_supported(struct super_block *sb, int blocksize);
@@ -57,6 +58,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 }
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+int dax_writeback_mapping_range(struct address_space *mapping,
+   struct block_device *bdev, struct writeback_control *wbc);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -76,6 +79,12 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
 {
return NULL;
 }
+
+static inline int dax_writeback_mapping_range(struct address_space *mapping,
+   struct block_device *bdev, struct writeback_control *wbc)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 
 int dax_read_lock(void);
@@ -121,7 +130,4 @@ static inline bool dax_mapping(struct address_space *mapping)
return mapping->host && IS_DAX(mapping->host);
 }
 
-struct writeback_control;
-int dax_writeback_mapping_range(struct address_space *mapping,
-   struct block_device *bdev, struct writeback_control *wbc);
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79c413985305..44f7f7080faa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3129,6 +3129,10 @@ extern int simple_rmdir(struct inode *, struct dentry *);
 extern int simple_rename(struct inode *, struct dentry *,
 struct inode *, struct dentry *, unsigned int);
 extern int noop_fsync(struct file *, loff_t, loff_t, int);
+extern int noop_set_page_dirty(struct page *page);
+extern void noop_invalidatepage(struct page *page, unsigned int offset,
+   unsigned int length);
+extern ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
 extern int simple_empty(struct dentry *);
 extern int simple_readpage(struct file *file, struct page *page);
 extern 

[PATCH v8 00/18] dax: fix dma vs truncate/hole-punch

2018-03-30 Thread Dan Williams
Changes since v7 [1]:

* Introduce noop_direct_IO() and use it to clean up xfs_dax_aops,
  ext4_dax_aops, and ext2_dax_aops (Jan, Christoph)

* Clarify dax_associate_entry() vs zero-page and empty entries with
  for_each_mapped_pfn() and a comment (Jan)

* Collect reviewed-by's from Jan and Darrick

* Fix an ARCH=UML build failure that made me realize that the patch to
  enable filesystems to trigger ->page_free() callbacks was incomplete
  with respect to the device-mapper dax enabling.

  The investigation of adding support for device-mapper and
  DEV_PAGEMAP_OPS resulted in a wider rework that includes 1) picking up
  the CONFIG_DAX_DRIVER patches that missed the 4.16 merge window. 2)
  Refactoring the build implementation to allow FS_DAX_LIMITED in the s390
  case with the dcssblk driver, and full blown FS_DAX + DEV_PAGEMAP_OPS
  for everyone else with the pmem driver.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014921.html 

---

Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesystem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient, if that assumption is violated the system would
have likely hung from the uncompleted I/O.
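
In pseudo-code, the filesystem-side loop looks roughly like this
(wait_for_page_idle() is a hypothetical stand-in for the
___wait_var_event() mechanics introduced by the xfs patches at the end
of the series):

	for (;;) {
		struct page *page;

		page = dax_layout_busy_page(inode->i_mapping);
		if (!page)
			break;	/* nothing pinned, layout change can proceed */

		/* drop locks and sleep until the refcount drops to 1 */
		wait_for_page_idle(page);
	}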

---

Dan Williams (18):
  dax: store pfns in the radix
  fs, dax: prepare for dax-specific address_space_operations
  block, dax: remove dead code in blkdev_writepages()
  xfs, dax: introduce xfs_dax_aops
  ext4, dax: introduce ext4_dax_aops
  ext2, dax: introduce ext2_dax_aops
  fs, dax: use page->mapping to warn if truncate collides with a busy page
  dax: introduce CONFIG_DAX_DRIVER
  dax, dm: allow device-mapper to operate without dax support
  dax, dm: introduce ->fs_{claim,release}() dax_device infrastructure
  mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  memremap: split devm_memremap_pages() and memremap() infrastructure
  mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
  mm, fs, dax: handle layout changes to pinned dax mappings
  xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  xfs: prepare xfs_break_layouts() for another layout type
  xfs, dax: introduce xfs_break_dax_layouts()


 drivers/dax/Kconfig|5 +
 drivers/dax/super.c|  118 +++---
 drivers/md/Kconfig |1 
 drivers/md/dm-linear.c |6 +
 drivers/md/dm-log-writes.c |   95 +-
 drivers/md/dm-stripe.c |6 +
 drivers/md/dm.c|   66 +++-
 drivers/nvdimm/Kconfig |2 
 drivers/nvdimm/pmem.c  |3 -
 drivers/s390/block/Kconfig |2 
 fs/Kconfig |1 
 fs/block_dev.c |5 -
 fs/dax.c   |  238 ++--
 fs/ext2/ext2.h |1 
 fs/ext2/inode.c|   46 +
 fs/ext2/namei.c|   18 ---
 fs/ext2/super.c|6 +
 fs/ext4/inode.c|   42 ++--
 fs/ext4/super.c|6 +
 fs/libfs.c |   39 +++
 fs/xfs/xfs_aops.c  |   34 

[PATCH v8 00/18] dax: fix dma vs truncate/hole-punch

2018-03-30 Thread Dan Williams
Changes since v7 [1]:

* Introduce noop_direct_IO() and use it to clean up xfs_dax_aops,
  ext4_dax_aops, and ext2_dax_aops (Jan, Christoph)

* Clarify dax_associcate_entry() vs zero-page and empty entries with
  for_each_mapped_pfn() and a comment (Jan)

* Collect reviewed-by's from Jan and Darrick

* Fix an ARCH=UML build failure that made me realize that the patch to
  enable filesystems to trigger ->page_free() callbacks was incomplete
  with respect to the device-mapper dax enabling.

  The investigation of adding support for device-mapper and
  DEV_PAGEMAP_OPS resulted in a wider rework that includes 1) picking up
  the CONFIG_DAX_DRIVER patches that missed the 4.16 merge window. 2)
  Refactoring the build implementation to allow FS_DAX_LIMITED in the s390
  case with the dcssblk driver, and full blown FS_DAX + DEV_PAGEMAP_OPS
  for everyone else with the pmem driver.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014921.html 

---

Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesytem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient; if that assumption is violated, the system would
likely have hung from the uncompleted I/O.
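
To make that concrete, here is a minimal sketch of the filesystem-side
retry step described above. This is illustrative only: the
fs_break_dax_layouts() name is made up for the example, the real
implementation in this series is xfs_break_dax_layouts(), and the real
code drops and re-takes the filesystem locks around the sleep.

/*
 * Sketch: with the mapping lock held, look for a pinned dax page and
 * wait for its pin to drop. The caller loops until this returns 0.
 */
static int fs_break_dax_layouts(struct inode *inode)
{
	struct page *page;

	/* caller holds the lock against mm faults (i_mmap_lock) */
	page = dax_layout_busy_page(inode->i_mapping);
	if (!page)
		return 0;	/* no busy pages, layout may change */

	/* dma pins are assumed transient: sleep until the pin drops */
	return ___wait_var_event(&page->_refcount,
				 atomic_read(&page->_refcount) == 1,
				 TASK_INTERRUPTIBLE, 0, 0, schedule());
}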

---

Dan Williams (18):
  dax: store pfns in the radix
  fs, dax: prepare for dax-specific address_space_operations
  block, dax: remove dead code in blkdev_writepages()
  xfs, dax: introduce xfs_dax_aops
  ext4, dax: introduce ext4_dax_aops
  ext2, dax: introduce ext2_dax_aops
  fs, dax: use page->mapping to warn if truncate collides with a busy page
  dax: introduce CONFIG_DAX_DRIVER
  dax, dm: allow device-mapper to operate without dax support
  dax, dm: introduce ->fs_{claim,release}() dax_device infrastructure
  mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  memremap: split devm_memremap_pages() and memremap() infrastructure
  mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
  mm, fs, dax: handle layout changes to pinned dax mappings
  xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  xfs: prepare xfs_break_layouts() for another layout type
  xfs, dax: introduce xfs_break_dax_layouts()


 drivers/dax/Kconfig|5 +
 drivers/dax/super.c|  118 +++---
 drivers/md/Kconfig |1 
 drivers/md/dm-linear.c |6 +
 drivers/md/dm-log-writes.c |   95 +-
 drivers/md/dm-stripe.c |6 +
 drivers/md/dm.c|   66 +++-
 drivers/nvdimm/Kconfig |2 
 drivers/nvdimm/pmem.c  |3 -
 drivers/s390/block/Kconfig |2 
 fs/Kconfig |1 
 fs/block_dev.c |5 -
 fs/dax.c   |  238 ++--
 fs/ext2/ext2.h |1 
 fs/ext2/inode.c|   46 +
 fs/ext2/namei.c|   18 ---
 fs/ext2/super.c|6 +
 fs/ext4/inode.c|   42 ++--
 fs/ext4/super.c|6 +
 fs/libfs.c |   39 +++
 fs/xfs/xfs_aops.c  |   34 

[PATCH v8 02/18] fs, dax: prepare for dax-specific address_space_operations

2018-03-30 Thread Dan Williams
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Define some generic VFS aops
helpers for dax. These noop implementations are there in the dax case to
prevent the VFS from falling back to operations with page-cache
assumptions; note that dax_writeback_mapping_range() may not be
referenced in the FS_DAX=n case.

Cc: Jeff Moyer 
Cc: Ross Zwisler 
Suggested-by: Matthew Wilcox 
Suggested-by: Jan Kara 
Suggested-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Suggested-by: Dave Chinner 
Signed-off-by: Dan Williams 
---
 fs/libfs.c  |   39 +++
 include/linux/dax.h |   12 +---
 include/linux/fs.h  |4 
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 7ff3cb904acd..0fb590d79f30 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1060,6 +1060,45 @@ int noop_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 }
 EXPORT_SYMBOL(noop_fsync);
 
+int noop_set_page_dirty(struct page *page)
+{
+   /*
+* Unlike __set_page_dirty_no_writeback that handles dirty page
+* tracking in the page object, dax does all dirty tracking in
+* the inode address_space in response to mkwrite faults. In the
+* dax case we only need to worry about potentially dirty CPU
+* caches, not dirty page cache pages to write back.
+*
+* This callback is defined to prevent fallback to
+* __set_page_dirty_buffers() in set_page_dirty().
+*/
+   return 0;
+}
+EXPORT_SYMBOL_GPL(noop_set_page_dirty);
+
+void noop_invalidatepage(struct page *page, unsigned int offset,
+   unsigned int length)
+{
+   /*
+* There is no page cache to invalidate in the dax case, however
+* we need this callback defined to prevent falling back to
+* block_invalidatepage() in do_invalidatepage().
+*/
+}
+EXPORT_SYMBOL_GPL(noop_invalidatepage);
+
+ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+{
+   /*
+* iomap based filesystems support direct I/O without need for
+* this callback. However, it still needs to be set in
+* inode->a_ops so that open/fcntl know that direct I/O is
+* generally supported.
+*/
+   return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(noop_direct_IO);
+
 /* Because kfree isn't assignment-compatible with void(void*) ;-/ */
 void kfree_link(void *p)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0185ecdae135..ae27a7efe7ab 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -38,6 +38,7 @@ static inline void put_dax(struct dax_device *dax_dev)
 }
 #endif
 
+struct writeback_control;
 int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
 #if IS_ENABLED(CONFIG_FS_DAX)
 int __bdev_dax_supported(struct super_block *sb, int blocksize);
@@ -57,6 +58,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 }
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+int dax_writeback_mapping_range(struct address_space *mapping,
+   struct block_device *bdev, struct writeback_control *wbc);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -76,6 +79,12 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
 {
return NULL;
 }
+
+static inline int dax_writeback_mapping_range(struct address_space *mapping,
+   struct block_device *bdev, struct writeback_control *wbc)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 
 int dax_read_lock(void);
@@ -121,7 +130,4 @@ static inline bool dax_mapping(struct address_space *mapping)
return mapping->host && IS_DAX(mapping->host);
 }
 
-struct writeback_control;
-int dax_writeback_mapping_range(struct address_space *mapping,
-   struct block_device *bdev, struct writeback_control *wbc);
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79c413985305..44f7f7080faa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3129,6 +3129,10 @@ extern int simple_rmdir(struct inode *, struct dentry *);
 extern int simple_rename(struct inode *, struct dentry *,
 struct inode *, struct dentry *, unsigned int);
 extern int noop_fsync(struct file *, loff_t, loff_t, int);
+extern int noop_set_page_dirty(struct page *page);
+extern void noop_invalidatepage(struct page *page, unsigned int offset,
+   unsigned int length);
+extern ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
 extern int simple_empty(struct dentry *);
 extern int simple_readpage(struct file *file, struct page *page);
 extern int simple_write_begin(struct file *file, struct address_space *mapping,



Re: [PATCH] staging: lustre: libcfs: use dynamic minors for /dev/{lnet,obd}

2018-03-30 Thread NeilBrown
On Fri, Mar 30 2018, James Simmons wrote:

> From: "John L. Hammond" 
>
> Request dynamic minor allocation when registering /dev/lnet and
> /dev/obd.
>
> Signed-off-by: John L. Hammond 
> Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-100086
> Reviewed-on: https://review.whamcloud.com/29741
> Reviewed-by: Andreas Dilger 
> Reviewed-by: Jian Yu 
> Reviewed-by: Oleg Drokin 
> Signed-off-by: James Simmons 

Yes, this is a much better fix than my kconfig change.

 Reviewed-by: NeilBrown 

and thanks for your quick review on my last series!
Thanks,
NeilBrown


> ---
>  drivers/staging/lustre/include/linux/libcfs/linux/libcfs.h|  1 -
>  drivers/staging/lustre/include/uapi/linux/lnet/lnetctl.h  | 11 ---
>  .../staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h   |  2 --
>  drivers/staging/lustre/lnet/libcfs/linux/linux-debug.c|  1 -
>  drivers/staging/lustre/lnet/libcfs/linux/linux-module.c   |  5 ++---
>  drivers/staging/lustre/lnet/libcfs/module.c   |  1 +
>  drivers/staging/lustre/lustre/obdclass/class_obd.c|  6 --
>  drivers/staging/lustre/lustre/obdclass/linux/linux-module.c   |  3 +--
>  8 files changed, 8 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/staging/lustre/include/linux/libcfs/linux/libcfs.h 
> b/drivers/staging/lustre/include/linux/libcfs/linux/libcfs.h
> index 30e333a..cf4c606 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/linux/libcfs.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/linux/libcfs.h
> @@ -50,7 +50,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnetctl.h 
> b/drivers/staging/lustre/include/uapi/linux/lnet/lnetctl.h
> index d9da625..cccb32d 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lnet/lnetctl.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnetctl.h
> @@ -119,16 +119,5 @@ struct lnet_fault_stat {
>  
>  #define LNET_DEV_ID 0
>  #define LNET_DEV_PATH "/dev/lnet"
> -#define LNET_DEV_MAJOR 10
> -#define LNET_DEV_MINOR 240
> -#define OBD_DEV_ID 1
> -#define OBD_DEV_NAME "obd"
> -#define OBD_DEV_PATH "/dev/" OBD_DEV_NAME
> -#define OBD_DEV_MAJOR 10
> -#define OBD_DEV_MINOR 241
> -#define SMFS_DEV_ID  2
> -#define SMFS_DEV_PATH "/dev/snapdev"
> -#define SMFS_DEV_MAJOR 10
> -#define SMFS_DEV_MINOR 242
>  
>  #endif
> diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h 
> b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
> index 9590864..6e4e109 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
> @@ -51,8 +51,6 @@ enum md_echo_cmd {
>  #define OBD_DEV_ID 1
>  #define OBD_DEV_NAME "obd"
>  #define OBD_DEV_PATH "/dev/" OBD_DEV_NAME
> -#define OBD_DEV_MAJOR 10
> -#define OBD_DEV_MINOR 241
>  
>  #define OBD_IOCTL_VERSION0x00010004
>  #define OBD_DEV_BY_DEVNAME   0xd0de
> diff --git a/drivers/staging/lustre/lnet/libcfs/linux/linux-debug.c 
> b/drivers/staging/lustre/lnet/libcfs/linux/linux-debug.c
> index 0092166..1d728f1 100644
> --- a/drivers/staging/lustre/lnet/libcfs/linux/linux-debug.c
> +++ b/drivers/staging/lustre/lnet/libcfs/linux/linux-debug.c
> @@ -48,7 +48,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  
>  # define DEBUG_SUBSYSTEM S_LNET
>  
> diff --git a/drivers/staging/lustre/lnet/libcfs/linux/linux-module.c 
> b/drivers/staging/lustre/lnet/libcfs/linux/linux-module.c
> index ddf6256..c8908e8 100644
> --- a/drivers/staging/lustre/lnet/libcfs/linux/linux-module.c
> +++ b/drivers/staging/lustre/lnet/libcfs/linux/linux-module.c
> @@ -33,10 +33,9 @@
>  
>  #define DEBUG_SUBSYSTEM S_LNET
>  
> +#include 
>  #include 
>  
> -#define LNET_MINOR 240
> -
>  static inline size_t libcfs_ioctl_packlen(struct libcfs_ioctl_data *data)
>  {
>   size_t len = sizeof(*data);
> @@ -191,7 +190,7 @@ int libcfs_ioctl_getdata(struct libcfs_ioctl_hdr **hdr_pp,
>  };
>  
>  struct miscdevice libcfs_dev = {
> - .minor = LNET_MINOR,
> + .minor = MISC_DYNAMIC_MINOR,
>   .name = "lnet",
>   .fops = &libcfs_ioctl_fops,
>  };
> diff --git a/drivers/staging/lustre/lnet/libcfs/module.c 
> b/drivers/staging/lustre/lnet/libcfs/module.c
> index a03f924..4b9acd7 100644
> --- a/drivers/staging/lustre/lnet/libcfs/module.c
> +++ b/drivers/staging/lustre/lnet/libcfs/module.c
> @@ -30,6 +30,7 @@
>   * This file is part of Lustre, http://www.lustre.org/
>   * Lustre is a trademark of Sun Microsystems, Inc.
>   */
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/drivers/staging/lustre/lustre/obdclass/class_obd.c 
> b/drivers/staging/lustre/lustre/obdclass/class_obd.c
> index 3e24b76..7b5be6b 100644
> --- a/drivers/staging/lustre/lustre/obdclass/class_obd.c
> +++ b/drivers/staging/lustre/lustre/obdclass/class_obd.c
> @@ -32,7 +32,9 @@
>   */
>  
>  #define DEBUG_SUBSYSTEM S_CLASS
> -# include 
> +
> 

[GIT PULL] Kbuild fixes for v4.16

2018-03-30 Thread Masahiro Yamada
Hi Linus,

Please pull a few more Kbuild fixes for v4.16.


The following changes since commit 0c8efd610b58cb23cefdfa12015799079aef94ae:

  Linux 4.16-rc5 (2018-03-11 17:25:09 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git kbuild-fixes-v4.16-3

for you to fetch changes up to 28913ee8191adf4bbc01cbfb9ee18cca782ab141:

  netfilter: nf_nat_snmp_basic: add correct dependency to Makefile (2018-03-29 09:42:32 +0900)


Kbuild fixes for v4.16 (3rd)

- fix missed rebuild of TRIM_UNUSED_KSYMS

- fix rpm-pkg for GNU tar >= 1.29

- include scripts/dtc/include-prefixes/* to kernel header deb-pkg

- add -no-integrated-as option earlier to fix building with Clang

- fix netfilter Makefile for parallel building


Jan Kiszka (1):
  builddeb: Fix header package regarding dtc source links

Jason Gunthorpe (1):
  kbuild: rpm-pkg: Support GNU tar >= 1.29

Masahiro Yamada (1):
  netfilter: nf_nat_snmp_basic: add correct dependency to Makefile

Nicolas Pitre (1):
  kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races

Stefan Agner (1):
  kbuild: set no-integrated-as before incl. arch Makefile

 Makefile| 4 ++--
 net/ipv4/netfilter/Makefile | 2 +-
 scripts/adjust_autoksyms.sh | 7 +++
 scripts/package/builddeb| 2 +-
 scripts/package/mkspec  | 2 +-
 5 files changed, 12 insertions(+), 5 deletions(-)

-- 
Best Regards
Masahiro Yamada


[PATCH 02/10] sparc: Convert local_softirq_pending() to use per-cpu op

2018-03-30 Thread Frederic Weisbecker
In order to consolidate and optimize generic softirq mask accesses, we
first need to convert architectures to use per-cpu operations when
possible.

Signed-off-by: Frederic Weisbecker 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Sebastian Andrzej Siewior 
Cc: David S. Miller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: James E.J. Bottomley 
Cc: Helge Deller 
Cc: Tony Luck 
Cc: Fenghua Yu 
---
 arch/sparc/include/asm/hardirq_64.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/sparc/include/asm/hardirq_64.h 
b/arch/sparc/include/asm/hardirq_64.h
index f565402..6aba904 100644
--- a/arch/sparc/include/asm/hardirq_64.h
+++ b/arch/sparc/include/asm/hardirq_64.h
@@ -11,7 +11,7 @@
 
 #define __ARCH_IRQ_STAT
 #define local_softirq_pending() \
-   (local_cpu_data().__softirq_pending)
+   (*this_cpu_ptr(&__cpu_data.__softirq_pending))
 
 void ack_bad_irq(unsigned int irq);
 
-- 
2.7.4



[PATCH 06/10] parisc: Switch to generic local_softirq_pending() implementation

2018-03-30 Thread Frederic Weisbecker
Remove the ad-hoc implementation; the generic code now allows us not to
reinvent the wheel.

Signed-off-by: Frederic Weisbecker 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Sebastian Andrzej Siewior 
Cc: David S. Miller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: James E.J. Bottomley 
Cc: Helge Deller 
Cc: Tony Luck 
Cc: Fenghua Yu 
---
 arch/parisc/include/asm/hardirq.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/parisc/include/asm/hardirq.h 
b/arch/parisc/include/asm/hardirq.h
index 0778151..1a1235a 100644
--- a/arch/parisc/include/asm/hardirq.h
+++ b/arch/parisc/include/asm/hardirq.h
@@ -34,14 +34,6 @@ DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
 #define __IRQ_STAT(cpu, member) (irq_stat[cpu].member)
 #define inc_irq_stat(member)   this_cpu_inc(irq_stat.member)
 #define __inc_irq_stat(member) __this_cpu_inc(irq_stat.member)
-#define local_softirq_pending()	this_cpu_read(irq_stat.__softirq_pending)
-
-#define __ARCH_SET_SOFTIRQ_PENDING
-
-#define set_softirq_pending(x) \
-   this_cpu_write(irq_stat.__softirq_pending, (x))
-#define or_softirq_pending(x)  this_cpu_or(irq_stat.__softirq_pending, (x))
-
 #define ack_bad_irq(irq) WARN(1, "unexpected IRQ trap at vector %02x\n", irq)
 
 #endif /* _PARISC_HARDIRQ_H */
-- 
2.7.4



[PATCH 04/10] softirq: Consolidate default local_softirq_pending() implementations

2018-03-30 Thread Frederic Weisbecker
Consolidate and optimize default softirq mask API implementations.
Per-CPU operations are expected to be faster and a few architectures
already rely on them to implement local_softirq_pending() and related
accessors/mutators. Those will be migrated to the new generic code.

Signed-off-by: Frederic Weisbecker 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Sebastian Andrzej Siewior 
Cc: David S. Miller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: James E.J. Bottomley 
Cc: Helge Deller 
Cc: Tony Luck 
Cc: Fenghua Yu 
---
 include/linux/interrupt.h   | 14 ++
 include/linux/irq_cpustat.h |  6 +-
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 69c2382..01caeca 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -434,11 +434,25 @@ extern bool force_irqthreads;
 #define force_irqthreads   (0)
 #endif
 
+#ifndef local_softirq_pending
+
+#ifndef local_softirq_pending_ref
+#define local_softirq_pending_ref irq_stat.__softirq_pending
+#endif
+
+#define local_softirq_pending()	(__this_cpu_read(local_softirq_pending_ref))
+#define set_softirq_pending(x)	(__this_cpu_write(local_softirq_pending_ref, (x)))
+#define or_softirq_pending(x)  (__this_cpu_or(local_softirq_pending_ref, (x)))
+
+#else /* local_softirq_pending */
+
 #ifndef __ARCH_SET_SOFTIRQ_PENDING
 #define set_softirq_pending(x) (local_softirq_pending() = (x))
 #define or_softirq_pending(x)  (local_softirq_pending() |= (x))
 #endif
 
+#endif /* local_softirq_pending */
+
 /* Some architectures might implement lazy enabling/disabling of
  * interrupts. In some cases, such as stop_machine, we might want
  * to ensure that after a local_irq_disable(), interrupts have
diff --git a/include/linux/irq_cpustat.h b/include/linux/irq_cpustat.h
index ddea03c..6e8895c 100644
--- a/include/linux/irq_cpustat.h
+++ b/include/linux/irq_cpustat.h
@@ -22,11 +22,7 @@ DECLARE_PER_CPU_ALIGNED(irq_cpustat_t, irq_stat);	/* defined in asm/hardirq.h */
 #define __IRQ_STAT(cpu, member)	(per_cpu(irq_stat.member, cpu))
 #endif
 
-  /* arch independent irq_stat fields */
-#define local_softirq_pending() \
-   __IRQ_STAT(smp_processor_id(), __softirq_pending)
-
-  /* arch dependent irq_stat fields */
+/* arch dependent irq_stat fields */
 #define nmi_count(cpu) __IRQ_STAT((cpu), __nmi_count)  /* i386 */
 
 #endif /* __irq_cpustat_h */
-- 
2.7.4



[PATCH 03/10] softirq: Turn default irq_cpustat_t to standard per-cpu

2018-03-30 Thread Frederic Weisbecker
In order to optimize and consolidate softirq mask accesses, let's
convert the default irq_cpustat_t implementation to per-CPU standard API.

Signed-off-by: Frederic Weisbecker 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Sebastian Andrzej Siewior 
Cc: David S. Miller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: James E.J. Bottomley 
Cc: Helge Deller 
Cc: Tony Luck 
Cc: Fenghua Yu 
---
 include/linux/irq_cpustat.h | 4 ++--
 kernel/softirq.c| 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/irq_cpustat.h b/include/linux/irq_cpustat.h
index 4954948..ddea03c 100644
--- a/include/linux/irq_cpustat.h
+++ b/include/linux/irq_cpustat.h
@@ -18,8 +18,8 @@
  */
 
 #ifndef __ARCH_IRQ_STAT
-extern irq_cpustat_t irq_stat[];   /* defined in asm/hardirq.h */
-#define __IRQ_STAT(cpu, member)(irq_stat[cpu].member)
+DECLARE_PER_CPU_ALIGNED(irq_cpustat_t, irq_stat);	/* defined in asm/hardirq.h */
+#define __IRQ_STAT(cpu, member)	(per_cpu(irq_stat.member, cpu))
 #endif
 
   /* arch independent irq_stat fields */
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 24d243e..fdbb171 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -49,8 +49,8 @@
  */
 
 #ifndef __ARCH_IRQ_STAT
-irq_cpustat_t irq_stat[NR_CPUS] cacheline_aligned;
-EXPORT_SYMBOL(irq_stat);
+DEFINE_PER_CPU_ALIGNED(irq_cpustat_t, irq_stat);
+EXPORT_PER_CPU_SYMBOL(irq_stat);
 #endif
 
 static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
-- 
2.7.4



[PATCH 08/10] sparc: Switch to generic local_softirq_pending() implementation

2018-03-30 Thread Frederic Weisbecker
Benefit from the generic softirq mask implementation that relies on
per-CPU mutators instead of working with raw operators on top of
this_cpu_ptr().

Signed-off-by: Frederic Weisbecker 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Sebastian Andrzej Siewior 
Cc: David S. Miller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: James E.J. Bottomley 
Cc: Helge Deller 
Cc: Tony Luck 
Cc: Fenghua Yu 
---
 arch/sparc/include/asm/hardirq_64.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/include/asm/hardirq_64.h 
b/arch/sparc/include/asm/hardirq_64.h
index 6aba904..75b92bf 100644
--- a/arch/sparc/include/asm/hardirq_64.h
+++ b/arch/sparc/include/asm/hardirq_64.h
@@ -10,8 +10,9 @@
 #include 
 
 #define __ARCH_IRQ_STAT
-#define local_softirq_pending() \
-   (*this_cpu_ptr(&__cpu_data.__softirq_pending))
+
+#define local_softirq_pending_ref \
+   __cpu_data.__softirq_pending
 
 void ack_bad_irq(unsigned int irq);
 
-- 
2.7.4



[PATCH 09/10] x86: Switch to generic local_softirq_pending() implementation

2018-03-30 Thread Frederic Weisbecker
Remove the ad-hoc implementation; the generic code now allows us not to
reinvent the wheel.

Signed-off-by: Frederic Weisbecker 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Sebastian Andrzej Siewior 
Cc: David S. Miller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: James E.J. Bottomley 
Cc: Helge Deller 
Cc: Tony Luck 
Cc: Fenghua Yu 
---
 arch/x86/include/asm/hardirq.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 7c341a7..fd73beb 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -49,14 +49,6 @@ DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
 
 #define inc_irq_stat(member)   this_cpu_inc(irq_stat.member)
 
-#define local_softirq_pending()	this_cpu_read(irq_stat.__softirq_pending)
-
-#define __ARCH_SET_SOFTIRQ_PENDING
-
-#define set_softirq_pending(x) \
-   this_cpu_write(irq_stat.__softirq_pending, (x))
-#define or_softirq_pending(x)  this_cpu_or(irq_stat.__softirq_pending, (x))
-
 extern void ack_bad_irq(unsigned int irq);
 
 extern u64 arch_irq_stat_cpu(unsigned int cpu);
-- 
2.7.4



[PATCH 07/10] powerpc: Switch to generic local_softirq_pending() implementation

2018-03-30 Thread Frederic Weisbecker
Remove the ad-hoc implementation; the generic code now allows us not to
reinvent the wheel.

Signed-off-by: Frederic Weisbecker 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Sebastian Andrzej Siewior 
Cc: David S. Miller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: James E.J. Bottomley 
Cc: Helge Deller 
Cc: Tony Luck 
Cc: Fenghua Yu 
---
 arch/powerpc/include/asm/hardirq.h | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/arch/powerpc/include/asm/hardirq.h 
b/arch/powerpc/include/asm/hardirq.h
index 5986d47..383f628 100644
--- a/arch/powerpc/include/asm/hardirq.h
+++ b/arch/powerpc/include/asm/hardirq.h
@@ -25,15 +25,8 @@ typedef struct {
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
 
 #define __ARCH_IRQ_STAT
-
-#define local_softirq_pending()	__this_cpu_read(irq_stat.__softirq_pending)
-
-#define __ARCH_SET_SOFTIRQ_PENDING
 #define __ARCH_IRQ_EXIT_IRQS_DISABLED
 
-#define set_softirq_pending(x) __this_cpu_write(irq_stat.__softirq_pending, (x))
-#define or_softirq_pending(x) __this_cpu_or(irq_stat.__softirq_pending, (x))
-
 static inline void ack_bad_irq(unsigned int irq)
 {
printk(KERN_CRIT "unexpected IRQ trap at vector %02x\n", irq);
-- 
2.7.4



[PATCH 05/10] ia64: Switch to generic local_softirq_pending() implementation

2018-03-30 Thread Frederic Weisbecker
Benefit from the generic softirq mask implementation that relies on
per-CPU mutators instead of working with raw operators on top of
this_cpu_ptr().

Signed-off-by: Frederic Weisbecker 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Sebastian Andrzej Siewior 
Cc: David S. Miller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: James E.J. Bottomley 
Cc: Helge Deller 
Cc: Tony Luck 
Cc: Fenghua Yu 
---
 arch/ia64/include/asm/hardirq.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/ia64/include/asm/hardirq.h b/arch/ia64/include/asm/hardirq.h
index 22fae71..ccde7c2 100644
--- a/arch/ia64/include/asm/hardirq.h
+++ b/arch/ia64/include/asm/hardirq.h
@@ -13,7 +13,7 @@
 
 #define __ARCH_IRQ_STAT1
 
-#define local_softirq_pending()	(*this_cpu_ptr(&ia64_cpu_info.softirq_pending))
+#define local_softirq_pending_ref  ia64_cpu_info.softirq_pending
 
 #include 
 #include 
-- 
2.7.4



[PATCH 00/10] softirq: Consolidate and optimize softirq mask v2

2018-03-30 Thread Frederic Weisbecker
Only the last patch has changed since v1 to integrate review from peterz.

Quote from the v1 summary:

The softirq mask and its accessors/mutators have many implementations
scattered across many architectures. Most do the same thing, keeping a
field in a per-cpu struct (often irq_cpustat_t) accessed through
per-cpu ops. We can instead provide a generic, efficient version that
most of them can use. In fact s390 is the only exception, because its
field is stored in lowcore.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git softirq/mask-v2

HEAD: cce39c380f2dbe3e92db6a018c8ba65579c6311b

Thanks,
Frederic
---

Frederic Weisbecker (10):
  ia64: Convert local_softirq_pending() to per-cpu ops
  sparc: Convert local_softirq_pending() to use per-cpu op
  softirq: Turn default irq_cpustat_t to standard per-cpu
  softirq: Consolidate default local_softirq_pending() implementations
  ia64: Switch to generic local_softirq_pending() implementation
  parisc: Switch to generic local_softirq_pending() implementation
  powerpc: Switch to generic local_softirq_pending() implementation
  sparc: Switch to generic local_softirq_pending() implementation
  x86: Switch to generic local_softirq_pending() implementation
  softirq/s390: Move default mutators of overwritten softirq mask to s390


 arch/ia64/include/asm/hardirq.h |  2 +-
 arch/parisc/include/asm/hardirq.h   |  8 
 arch/powerpc/include/asm/hardirq.h  |  7 ---
 arch/s390/include/asm/hardirq.h |  2 ++
 arch/sparc/include/asm/hardirq_64.h |  5 +++--
 arch/x86/include/asm/hardirq.h  |  8 
 include/linux/interrupt.h   | 13 ++---
 include/linux/irq_cpustat.h | 10 +++---
 kernel/softirq.c|  4 ++--
 9 files changed, 21 insertions(+), 38 deletions(-)


[PATCH 01/10] ia64: Convert local_softirq_pending() to per-cpu ops

2018-03-30 Thread Frederic Weisbecker
In order to consolidate and optimize generic softirq mask accesses, we
first need to convert architectures to use per-cpu operations when
possible.

Signed-off-by: Frederic Weisbecker 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Sebastian Andrzej Siewior 
Cc: David S. Miller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: James E.J. Bottomley 
Cc: Helge Deller 
Cc: Tony Luck 
Cc: Fenghua Yu 
---
 arch/ia64/include/asm/hardirq.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/ia64/include/asm/hardirq.h b/arch/ia64/include/asm/hardirq.h
index bdc4669..22fae71 100644
--- a/arch/ia64/include/asm/hardirq.h
+++ b/arch/ia64/include/asm/hardirq.h
@@ -13,7 +13,7 @@
 
 #define __ARCH_IRQ_STAT1
 
-#define local_softirq_pending()		(local_cpu_data->softirq_pending)
+#define local_softirq_pending()		(*this_cpu_ptr(&ia64_cpu_info.softirq_pending))
 
 #include 
 #include 
-- 
2.7.4



[PATCH 10/10] softirq/s390: Move default mutators of overwritten softirq mask to s390

2018-03-30 Thread Frederic Weisbecker
s390 is now the last architecture that entirely overwrites
local_softirq_pending() and uses the corresponding default definitions
of set_softirq_pending() and or_softirq_pending().

Just move these to s390 to debloat the generic code complexity.

Suggested-by: Peter Zijlstra 
Signed-off-by: Frederic Weisbecker 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Sebastian Andrzej Siewior 
Cc: David S. Miller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: James E.J. Bottomley 
Cc: Helge Deller 
Cc: Tony Luck 
Cc: Fenghua Yu 
---
 arch/s390/include/asm/hardirq.h | 2 ++
 include/linux/interrupt.h   | 7 ---
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/hardirq.h b/arch/s390/include/asm/hardirq.h
index a296c6a..dfbc3c6c0 100644
--- a/arch/s390/include/asm/hardirq.h
+++ b/arch/s390/include/asm/hardirq.h
@@ -14,6 +14,8 @@
 #include 
 
 #define local_softirq_pending() (S390_lowcore.softirq_pending)
+#define set_softirq_pending(x) (S390_lowcore.softirq_pending = (x))
+#define or_softirq_pending(x)  (S390_lowcore.softirq_pending |= (x))
 
 #define __ARCH_IRQ_STAT
 #define __ARCH_HAS_DO_SOFTIRQ
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 01caeca..df35a26 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -444,13 +444,6 @@ extern bool force_irqthreads;
 #define set_softirq_pending(x)	(__this_cpu_write(local_softirq_pending_ref, (x)))
 #define or_softirq_pending(x)  (__this_cpu_or(local_softirq_pending_ref, (x)))
 
-#else /* local_softirq_pending */
-
-#ifndef __ARCH_SET_SOFTIRQ_PENDING
-#define set_softirq_pending(x) (local_softirq_pending() = (x))
-#define or_softirq_pending(x)  (local_softirq_pending() |= (x))
-#endif
-
 #endif /* local_softirq_pending */
 
 /* Some architectures might implement lazy enabling/disabling of
-- 
2.7.4



Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Steven Rostedt
On Fri, 30 Mar 2018 19:18:57 -0700
Matthew Wilcox  wrote:

> Again though, this is the same pattern as vmalloc.  There are any number
> of places where userspace can cause an arbitrarily large vmalloc to be
> attempted (grep for kvmalloc_array for a list of promising candidates).
> I'm pretty sure that just changing your GFP flags to GFP_KERNEL |
> __GFP_NOWARN will give you the exact behaviour that you want with no
> need to grub around in the VM to find out if your huge allocation is
> likely to succeed.

Not sure how this helps. Note, I don't care about consecutive pages, so
this is not an array. It's a link list of thousands of pages. How do
you suggest allocating them? The ring buffer is a link list of pages.

What I currently do is see how many more pages I need, allocate them
one at a time, and put them in a temporary list. If that succeeds, I
add them to the ring buffer; if not, I free the entire list (it's an
all-or-nothing operation).
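
For reference, the pattern looks roughly like this (a sketch, not the
actual ring_buffer.c code; struct buffer_page is assumed to be the ring
buffer's per-page bookkeeping struct with a list_head member):

/* All-or-nothing allocation of nr_pages list entries. */
static int alloc_buffer_pages(struct list_head *pages, long nr_pages)
{
	struct buffer_page *bpage, *tmp;
	long i;

	for (i = 0; i < nr_pages; i++) {
		bpage = kzalloc(sizeof(*bpage),
				GFP_KERNEL | __GFP_RETRY_MAYFAIL);
		if (!bpage)
			goto free_pages;	/* one failure undoes all */
		list_add(&bpage->list, pages);
	}
	return 0;	/* caller splices the list into the ring buffer */

free_pages:
	list_for_each_entry_safe(bpage, tmp, pages, list) {
		list_del_init(&bpage->list);
		kfree(bpage);
	}
	return -ENOMEM;
}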

The allocation I'm making doesn't warn. The problem is the
GFP_RETRY_MAYFAIL, which will try to allocate any possible memory in
the system. When it succeeds, the ring buffer allocation logic will
then try to allocate another page. If we add too many pages, we will
allocate all possible pages and then try to allocate more. This
allocation will fail without causing an OOM. That's not the problem.
The problem is if the system is active during this time, and something
else tries to do any allocation, after all memory has been consumed,
that allocation will fail. Then it will trigger an OOM.

I showed this in my Call Trace, that the allocation that failed during
my test was something completely unrelated, and that failure caused an
OOM.

What this last patch does is see if there's space available before it
even starts the process.
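
Roughly this (a sketch; the patch under discussion uses
si_mem_available(), though the exact check may differ):

/* Refuse a ring buffer resize that cannot possibly be satisfied. */
static int check_available_mem(unsigned long nr_pages_needed)
{
	/* si_mem_available() estimates free + reclaimable pages */
	if (nr_pages_needed >= (unsigned long)si_mem_available())
		return -ENOMEM;
	return 0;
}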

Maybe I'm missing something, but I don't see how NOWARN can help. My
allocations are not what is giving the warning.

-- Steve



[PATCH] autofs4: use wake_up() instead of wake_up_interruptible

2018-03-30 Thread Andrei Vagin
In "autofs4: use wait_event_killable",  wait_event_interruptible() was
replaced by wait_event_killable(), but in this case we have to use
wake_up() instead of wake_up_interruptible().

Cc: Matthew Wilcox 
Cc: Ian Kent 
Cc: Andrew Morton 
Cc: Stephen Rothwell 
Signed-off-by: Andrei Vagin 
---
 fs/autofs4/waitq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index c160e9b3aa0f..be9c3dc048ab 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -549,7 +549,7 @@ int autofs4_wait_release(struct autofs_sb_info *sbi, autofs_wqt_t wait_queue_tok
 	kfree(wq->name.name);
 	wq->name.name = NULL;	/* Do not wait on this queue */
 	wq->status = status;
-	wake_up_interruptible(&wq->queue);
+	wake_up(&wq->queue);
 	if (!--wq->wait_ctr)
 		kfree(wq);
 	mutex_unlock(&sbi->wq_mutex);
-- 
2.13.6
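
For context on why the wake function matters here (an illustrative
sketch with made-up demo_* names, not autofs4 code): wait_event_killable()
sleeps in TASK_KILLABLE, and wake_up_interruptible() only wakes
TASK_INTERRUPTIBLE sleepers, so the waker must use a plain wake_up():

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wq);
static int demo_done;

static int demo_wait(void)
{
	/* Sleeps in TASK_KILLABLE; returns -ERESTARTSYS on a fatal signal. */
	return wait_event_killable(demo_wq, READ_ONCE(demo_done));
}

static void demo_complete(void)
{
	WRITE_ONCE(demo_done, 1);
	wake_up(&demo_wq);	/* a wake_up_interruptible() would be missed */
}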



Re: [03/10] genksyms: generate lexer and parser during build instead of shipping

2018-03-30 Thread Masahiro Yamada
2018-03-31 7:21 GMT+09:00 Andrei Vagin :
> On Fri, Mar 30, 2018 at 10:40:22AM -0700, Andrei Vagin wrote:
>> On Fri, Mar 23, 2018 at 10:04:32PM +0900, Masahiro Yamada wrote:
>> > Now that the kernel build supports flex and bison, remove the _shipped
>> > files and generate them during the build instead.
>> >
>> > There are no more shipped lexer and parser, so I ripped off the rules
>> > in scripts/Makefile.lib that were used for REGENERATE_PARSERS.
>> >
>> > The genksyms parser has ambiguous grammar, which would emit warnings:
>> >
>> >  scripts/genksyms/parse.y: warning: 9 shift/reduce conflicts 
>> > [-Wconflicts-sr]
>> >  scripts/genksyms/parse.y: warning: 5 reduce/reduce conflicts 
>> > [-Wconflicts-rr]
>> >
>> > They are normally suppressed, but displayed when W=1 is given.
>> >
>> > Signed-off-by: Masahiro Yamada 
>> > ---
>> >
>> >  scripts/Makefile.lib |   24 +-
>> >  scripts/genksyms/Makefile|   23 +
>> >  scripts/genksyms/lex.lex.c_shipped   | 2291 
>> > 
>> >  scripts/genksyms/parse.tab.c_shipped | 2394 
>> > --
>> >  scripts/genksyms/parse.tab.h_shipped |  119 --
>> >  5 files changed, 26 insertions(+), 4825 deletions(-)
>> >  delete mode 100644 scripts/genksyms/lex.lex.c_shipped
>> >  delete mode 100644 scripts/genksyms/parse.tab.c_shipped
>> >  delete mode 100644 scripts/genksyms/parse.tab.h_shipped
>> >
>> > diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
>> > index 2fde810..b7d2c97 100644
>> > --- a/scripts/Makefile.lib
>> > +++ b/scripts/Makefile.lib
>> > @@ -183,14 +183,8 @@ endef
>> >  quiet_cmd_flex = LEX $@
>> >cmd_flex = $(LEX) -o$@ -L $<
>> >
>> > -ifdef REGENERATE_PARSERS
>> > -.PRECIOUS: $(src)/%.lex.c_shipped
>> > -$(src)/%.lex.c_shipped: $(src)/%.l
>> > -   $(call cmd,flex)
>> > -endif
>> > -
>> >  .PRECIOUS: $(obj)/%.lex.c
>> > -$(filter %.lex.c,$(targets)): $(obj)/%.lex.c: $(src)/%.l FORCE
>> > +$(obj)/%.lex.c: $(src)/%.l FORCE
>> > $(call if_changed,flex)
>> >
>> >  # YACC
>> > @@ -198,27 +192,15 @@ $(filter %.lex.c,$(targets)): $(obj)/%.lex.c: 
>> > $(src)/%.l FORCE
>> >  quiet_cmd_bison = YACC$@
>> >cmd_bison = $(YACC) -o$@ -t -l $<
>> >
>> > -ifdef REGENERATE_PARSERS
>> > -.PRECIOUS: $(src)/%.tab.c_shipped
>> > -$(src)/%.tab.c_shipped: $(src)/%.y
>> > -   $(call cmd,bison)
>> > -endif
>> > -
>> >  .PRECIOUS: $(obj)/%.tab.c
>> > -$(filter %.tab.c,$(targets)): $(obj)/%.tab.c: $(src)/%.y FORCE
>> > +$(obj)/%.tab.c: $(src)/%.y FORCE
>> > $(call if_changed,bison)
>> >
>> >  quiet_cmd_bison_h = YACC$@
>> >cmd_bison_h = bison -o/dev/null --defines=$@ -t -l $<
>> >
>> > -ifdef REGENERATE_PARSERS
>> > -.PRECIOUS: $(src)/%.tab.h_shipped
>> > -$(src)/%.tab.h_shipped: $(src)/%.y
>> > -   $(call cmd,bison_h)
>> > -endif
>> > -
>> >  .PRECIOUS: $(obj)/%.tab.h
>> > -$(filter %.tab.h,$(targets)): $(obj)/%.tab.h: $(src)/%.y FORCE
>> > +$(obj)/%.tab.h: $(src)/%.y FORCE
>> > $(call if_changed,bison_h)
>> >
>> >  # Shipped files
>> > diff --git a/scripts/genksyms/Makefile b/scripts/genksyms/Makefile
>> > index 0ccac51..f4749e8 100644
>> > --- a/scripts/genksyms/Makefile
>> > +++ b/scripts/genksyms/Makefile
>> > @@ -5,9 +5,32 @@ always := $(hostprogs-y)
>> >
>> >  genksyms-objs  := genksyms.o parse.tab.o lex.lex.o
>> >
>> > +# FIXME: fix the ambiguous grammar in parse.y and delete this hack
>> > +#
>> > +# Suppress shift/reduce, reduce/reduce conflicts warnings
>> > +# unless W=1 is specified.
>> > +ifeq ($(findstring 1,$(KBUILD_ENABLE_EXTRA_GCC_CHECKS)),)
>> > +SUPPRESS_BISON_WARNING := 2>/dev/null
>>
>> We have a robot which runs CRIU tests on linux-next.
>> Yesterday it failed with this error:
>>
>>   HOSTCC  scripts/genksyms/genksyms.o
>> make[2]: *** [scripts/genksyms/parse.tab.c] Error 127
>>
>> scripts/genksyms/Makefile:20: recipe for target 
>> 'scripts/genksyms/parse.tab.c' failed
>> scripts/Makefile.build:559: recipe for target 'scripts/genksyms' failed
>> Makefile:1073: recipe for target 'scripts' failed
>> make[1]: *** [scripts/genksyms] Error 2
>> make: *** [scripts] Error 2
>> make: *** Waiting for unfinished jobs
>>
>> https://travis-ci.org/avagin/linux/jobs/360056903
>>
>> From this output, it is very hard to understand what was going wrong.
>
>
> The reason was that bison and flex were not installed, but I think the
> error message should be clearer.
>
>>
>> Thanks,
>> Andrei
>>

Thanks for the report.


OK, I will apply the fix-up attached below.

If bison is not installed, it will fail with a clear message.

  HOSTCC  scripts/genksyms/genksyms.o
/bin/sh: 1: bison: not found
make[2]: *** [scripts/genksyms/Makefile:18:
scripts/genksyms/parse.tab.c] Error 127
make[1]: *** [scripts/Makefile.build:559: scripts/genksyms] Error 2
make: *** [Makefile:1073: scripts] Error 2


BTW, without flex and bison, how did you build Kconfig?

Since commit 29c833061c1d8c2d1d23a62e7061561eadd76cdb,
Kconfig requires flex and bison, 

Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Matthew Wilcox
On Fri, Mar 30, 2018 at 09:41:51PM -0400, Steven Rostedt wrote:
> On Fri, 30 Mar 2018 16:38:52 -0700
> Joel Fernandes  wrote:
> 
> > > --- a/kernel/trace/ring_buffer.c
> > > +++ b/kernel/trace/ring_buffer.c
> > > @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
> > > struct buffer_page *bpage, *tmp;
> > > long i;
> > >
> > > +   /* Check if the available memory is there first */
> > > +   i = si_mem_available();
> > > +   if (i < nr_pages)  
> > 
> > Does it make sense to add a small margin here so that after ftrace
> > finishes allocating, we still have some memory left for the system?
> > But then we have to define a magic number :-|
> 
> I don't think so. The memory is allocated by user defined numbers. They
> can do "free" to see what is available. The original patch from
> Zhaoyang was due to a script that would just try a very large number
> and cause issues.
> 
> If the memory is available, I just say let them have it. This is
> borderline user space issue and not a kernel one.

Again though, this is the same pattern as vmalloc.  There are any number
of places where userspace can cause an arbitrarily large vmalloc to be
attempted (grep for kvmalloc_array for a list of promising candidates).
I'm pretty sure that just changing your GFP flags to GFP_KERNEL |
__GFP_NOWARN will give you the exact behaviour that you want with no
need to grub around in the VM to find out if your huge allocation is
likely to succeed.
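
A minimal sketch of that suggestion (the demo_* names and struct are
hypothetical, not a proposed patch): attempt the possibly huge
allocation directly with __GFP_NOWARN and report -ENOMEM cleanly if it
fails, instead of probing the VM first:

#include <linux/mm.h>
#include <linux/slab.h>

struct demo_buf {		/* hypothetical buffer object */
	void *data;
	size_t size;
};

static int demo_resize(struct demo_buf *b, size_t size)
{
	/* __GFP_NOWARN suppresses the allocation-failure warning. */
	void *mem = kvmalloc(size, GFP_KERNEL | __GFP_NOWARN);

	if (!mem)
		return -ENOMEM;	/* let the caller fail the request */

	kvfree(b->data);
	b->data = mem;
	b->size = size;
	return 0;
}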



Re: [REVIEW][PATCH 12/11] ipc: Directly call the security hook in ipc_ops.associate

2018-03-30 Thread James Morris
On Sat, 24 Mar 2018, Eric W. Biederman wrote:

> 
> After the last round of cleanups the shm, sem, and msg associate
> operations just became trivial wrappers around the appropriate security
> method.  Simplify things further by just calling the security method
> directly.
> 
> Signed-off-by: "Eric W. Biederman" 


Reviewed-by: James Morris 


-- 
James Morris




[PATCH] ACPI / PM: Fix wake up by PS2 keyboard fail on ASUS UX331UA

2018-03-30 Thread Chris Chiu
This issue happens on the new ASUS laptop UX331UA, which has a modern
standby mode (suspend-to-idle). Pressing keys on the PS2 keyboard
can't wake the system from suspend-to-idle, which is not expected.
Pressing the power button, however, wakes it up without problem.

Per the ASUS engineers, the keypress event is routed to the Embedded
Controller (EC) in standby mode. The EC then signals the SCI event to
the BIOS, so the BIOS can Notify() the power button to wake up the
system. That is the BIOS's perspective. What we observe is that the
kernel receives the SCI event in the SCI interrupt handler, which
indicates that the GPE status bit belonging to the EC needs to be
handled; the kernel then queries the EC to find out which event is
pending and executes the following ACPI _QDF method, defined in the
ACPI DSDT, for the EC to notify the power button.

Method (_QDF, 0, NotSerialized)  // _Qxx: EC Query
{
    Notify (PWRB, 0x80) // Status Change
}

With more debug messages added to analyze this problem, we find that
the keypress does wake up the system from suspend-to-idle, but it goes
back to suspend again almost immediately. As we see in the following
messages, acpi_button_notify() is invoked but acpi_pm_wakeup_event()
cannot really wake up the system here, because acpi_s2idle_wakeup() is
false. acpi_s2idle_wakeup() returned false because acpi_s2idle_sync()
had already exited.

[   52.987048] s2idle_loop going s2idle
[   59.713392] acpi_s2idle_wake enter
[   59.713394] acpi_s2idle_wake exit
[   59.760888] acpi_ev_gpe_detect enter
[   59.760893] acpi_s2idle_sync enter
[   59.760893] acpi_ec_query_flushed ec pending queries 0
[   59.760953] Read registers for GPE 50-57: Status=01, Enable=01, RunEnable=01, WakeEnable=00
[   59.760955] ACPI: EC: = IRQ (1) =
[   59.760972] ACPI: EC: EC_SC(R) = 0x28 SCI_EVT=1 BURST=0 CMD=1 IBF=0 OBF=0
[   59.760979] ACPI: EC: + Polling enabled +
[   59.760979] ACPI: EC: # Command(QR_EC) submitted/blocked #
[   59.761003] acpi_s2idle_sync exit
[   59.769587] ACPI: EC: # Query(0xdf) started #
[   59.769611] ACPI: EC: # Query(0xdf) stopped #
[   59.774154] acpi_button_notify button type 1
[   59.813175] s2idle_loop going s2idle

acpi_s2idle_sync() already makes an effort to flush the EC event
queue, but in this case, the EC event has yet to be generated when
the call to acpi_ec_flush_work() is made. The event is generated
shortly after, through the ongoing handling of the SCI interrupt
which is happening on another CPU, and we must synchronize that
to make sure that it has run and completed. Adding another call to
acpi_os_wait_events_complete() solves this issue, since that
function synchronizes with SCI interrupt completion.

Signed-off-by: Chris Chiu 

https://phabricator.endlessm.com/T21599
---
 drivers/acpi/sleep.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index 8082871..c6e1b4b 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -982,8 +982,9 @@ static void acpi_s2idle_sync(void)
 * The EC driver uses the system workqueue and an additional special
 * one, so those need to be flushed too.
 */
+   acpi_os_wait_events_complete(); /* synchronize SCI IRQ handling */
acpi_ec_flush_work();
-   acpi_os_wait_events_complete();
+   acpi_os_wait_events_complete(); /* synchronize Notify handling */
s2idle_wakeup = false;
 }
 
-- 
2.7.4



4.15.14 crash with iscsi target and dvd

2018-03-30 Thread Wakko Warner
I reported this before but no one responded.

I have an iscsi target setup with /dev/sr[012] using pscsi.  On the
initiator, I mount only 1 disc.  Then I issue find -type f | xargs cat >
/dev/null.  Then after a few seconds, I get 2 oopses and the system has
to be hard reset.

I noticed if I cat /dev/sr1 from the initiator, it doesn't crash.  I was
also able to burn without crashing.

Here's the OOPS:
[2692.733468] WARNING: CPU: 8 PID: 0 at 
/usr/src/linux/dist/4.15.14-nobklcd/drivers/scsi/scsi_lib.c:1068 
scsi_init_io+0x111/0x1a0 [scsi_mod]
[2692.734154] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison 
dm_bufio raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor 
raid6_pq libcrc32c crc32c_generic md_mod dm_crypt algif_skcipher af_alg dm_mod 
dax ext4 crc16 mbcache jbd2 af_packet iscsi_target_mod tcm_loop vhost_scsi 
vhost target_core_file target_core_iblock target_core_pscsi target_core_mod 
nfsd exportfs dummy bridge stp llc ib_iser rdma_cm iw_cm ib_cm ib_core ipv6 
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi netconsole configfs sr_mod 
cdrom adt7475 hwmon_vid sd_mod sg coretemp x86_pkg_temp_thermal kvm_intel kvm 
irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc 
snd_hda_codec_realtek snd_hda_codec_generic nouveau video led_class 
drm_kms_helper cfbfillrect syscopyarea cfbimgblt
[2692.737388]  sysfillrect sysimgblt snd_hda_intel fb_sys_fops cfbcopyarea 
snd_hda_codec ttm snd_hda_core snd_pcm_oss drm snd_mixer_oss agpgart snd_pcm 
igb snd_timer snd aesni_intel soundcore hwmon aes_x86_64 i2c_algo_bit ahci 
mpt3sas crypto_simd i2c_core libahci glue_helper mptsas raid_class libata 
mptscsih scsi_transport_sas mptbase scsi_mod wmi button unix
[2692.737388] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.15.14 #1
[2692.737388] Hardware name: Dell Inc. Precision T5610/0WN7Y6, BIOS A16 
02/05/2018
[2692.737388] RIP: 0010:scsi_init_io+0x111/0x1a0 [scsi_mod]
[2692.737388] RSP: 0018:8806b7a03d78 EFLAGS: 00010046
[2692.737388] RAX:  RBX: 8806aa4a9c00 RCX: 
[2692.737388] RDX:  RSI: 8806aa4a9c00 RDI: 8806aa4a9d48
[2692.737388] RBP: 8806aa4a9d48 R08:  R09: 8806aa4a9d80
[2692.737388] R10: 8806ab949088 R11:  R12: 8806b29bb000
[2692.737388] R13:  R14: 8806b29bb000 R15: 8806ac4f6950
[2692.737388] FS:  () GS:8806b7a0() 
knlGS:
[2692.737388] CS:  0010 DS:  ES:  CR0: 80050033
[2692.737388] CR2: 7f1359a8b756 CR3: 01c09003 CR4: 001606e0
[2692.737388] Call Trace:
[2692.737388]  
[2692.737388]  ? scsi_setup_cmnd+0xb3/0x140 [scsi_mod]
[2692.737388]  ? scsi_prep_fn+0x53/0x130 [scsi_mod]
[2692.737388]  ? blk_peek_request+0x136/0x220
[2692.737388]  ? scsi_request_fn+0x2b/0x510 [scsi_mod]
[2692.737388]  ? __blk_run_queue+0x34/0x50
[2692.737388]  ? blk_run_queue+0x26/0x40
[2692.737388]  ? scsi_run_queue+0x229/0x2b0 [scsi_mod]
[2692.737388]  ? scsi_io_completion+0x3ce/0x5a0 [scsi_mod]
[2692.737388]  ? blk_done_softirq+0x67/0x80
[2692.737388]  ? __do_softirq+0xdb/0x1dd
[2692.737388]  ? irq_exit+0xa3/0xb0
[2692.737388]  ? do_IRQ+0x45/0xc0
[2692.737388]  ? common_interrupt+0x77/0x77
[2692.737388]  
[2692.737388]  ? cpuidle_enter_state+0x124/0x200
[2692.737388]  ? cpuidle_enter_state+0x119/0x200
[2692.737388]  ? do_idle+0xdc/0x180
[2692.737388]  ? cpu_startup_entry+0x14/0x20
[2692.737388]  ? secondary_startup_64+0xa5/0xb0
[2692.737388] Code: 8b 7b 30 e8 d2 6b 20 e1 49 8b 17 4c 89 ff 89 c6 89 44 24 04 
e8 81 81 22 e1 85 c0 41 89 c4 74 55 41 bc 02 00 00 00 e9 39 ff ff ff <0f> 0b 41 
bc 01 00 00 00 e9 2c ff ff ff 48 8b 3d 6b dc 00 00 be 
[2692.737388] ---[ end trace 9801970f9b249e10 ]---
[2692.737388] [ cut here ]
[2692.737388] kernel BUG at 
/usr/src/linux/dist/4.15.14-nobklcd/block/blk-core.c:3235!
[2692.737388] invalid opcode:  [#1] PREEMPT SMP
[2692.737388] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison 
dm_bufio raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor 
raid6_pq libcrc32c crc32c_generic md_mod dm_crypt algif_skcipher af_alg dm_mod 
dax ext4 crc16 mbcache jbd2 af_packet iscsi_target_mod tcm_loop vhost_scsi 
vhost target_core_file target_core_iblock target_core_pscsi target_core_mod 
nfsd exportfs dummy bridge stp llc ib_iser rdma_cm iw_cm ib_cm ib_core ipv6 
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi netconsole configfs sr_mod 
cdrom adt7475 hwmon_vid sd_mod sg coretemp x86_pkg_temp_thermal kvm_intel kvm 
irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc 
snd_hda_codec_realtek snd_hda_codec_generic nouveau video led_class 
drm_kms_helper cfbfillrect syscopyarea cfbimgblt
[2692.737388]  sysfillrect sysimgblt snd_hda_intel fb_sys_fops cfbcopyarea 
snd_hda_codec ttm snd_hda_core snd_pcm_oss drm snd_mixer_oss agpgart snd_pcm 
igb snd_timer snd 


[PATCH] f2fs: truncate preallocated blocks in error case

2018-03-30 Thread Jaegeuk Kim
If a write fails, we must deallocate the blocks that we couldn't write.

Cc: sta...@vger.kernel.org
Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/file.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 8068b015ece5..f18f62dd60a3 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -2911,6 +2911,8 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
ret = generic_write_checks(iocb, from);
if (ret > 0) {
+   bool preallocated = false;
+   size_t target_size;
int err;
 
if (iov_iter_fault_in_readable(from, iov_iter_count(from)))
@@ -2927,6 +2929,9 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
}
 
} else {
+   preallocated = true;
+   target_size = iocb->ki_pos + iov_iter_count(from);
+
err = f2fs_preallocate_blocks(iocb, from);
if (err) {
clear_inode_flag(inode, FI_NO_PREALLOC);
@@ -2939,6 +2944,10 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
blk_finish_plug();
clear_inode_flag(inode, FI_NO_PREALLOC);
 
+   /* if we couldn't write data, we should deallocate blocks. */
+   if (preallocated && i_size_read(inode) < target_size)
+   f2fs_truncate(inode);
+
if (ret > 0)
f2fs_update_iostat(F2FS_I_SB(inode), APP_WRITE_IO, ret);
}
-- 
2.15.0.531.g2ccb3012c9-goog
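
The generic shape of the fix, as an illustrative sketch (the demo_*
helpers are hypothetical stand-ins for the f2fs routines in the diff):
record the intended end-of-write offset before preallocating, then trim
anything beyond the actual file size if the write fails or comes up
short:

#include <linux/fs.h>
#include <linux/uio.h>

/* Hypothetical stand-ins for the real f2fs routines. */
ssize_t demo_do_write(struct kiocb *iocb, struct iov_iter *from);
void demo_preallocate(struct inode *inode, loff_t to);
void demo_truncate(struct inode *inode);

static ssize_t demo_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	loff_t target_size = iocb->ki_pos + iov_iter_count(from);
	ssize_t ret;

	demo_preallocate(inode, target_size);
	ret = demo_do_write(iocb, from);

	/* On a short or failed write, release blocks past the new EOF. */
	if (i_size_read(inode) < target_size)
		demo_truncate(inode);

	return ret;
}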



[PATCH v7] kernel.h: Retain constant expression output for max()/min()

2018-03-30 Thread Kees Cook
In the effort to remove all VLAs from the kernel[1], it is desirable to
build with -Wvla. However, this warning is overly pessimistic, in that
it is only happy with stack array sizes that are declared as constant
expressions, and not constant values. One case of this is the evaluation
of the max() macro which, due to its construction, ends up converting
constant expression arguments into a constant value result.

All attempts to rewrite this macro with __builtin_constant_p() failed with
older compilers (e.g. gcc 4.4)[2]. However, Martin Uecker constructed[3]
a mind-shattering solution that works everywhere. Cthulhu fhtagn!

This patch updates the min()/max() macros to evaluate to a constant
expression when called on constant expression arguments. This removes
several false-positive stack VLA warnings from an x86 allmodconfig build
when -Wvla is added:

$ diff -u before.txt after.txt | grep ^-
-drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
variable length array ‘ids’ [-Wvla]
-fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length array 
‘namebuf’ [-Wvla]
-lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
[-Wvla]
-net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ 
[-Wvla]
-net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ 
[-Wvla]
-net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array ‘buff64’ 
[-Wvla]

This also updates the one case where different enums were being compared
and explicitly casts them to int (which matches the old side-effect of
the single-evaluation code).

[1] https://lkml.org/lkml/2018/3/7/621
[2] https://lkml.org/lkml/2018/3/10/170
[3] https://lkml.org/lkml/2018/3/20/845

Co-Developed-by: Linus Torvalds 
Co-Developed-by: Martin Uecker 
Signed-off-by: Kees Cook 
Acked-by: Ingo Molnar 
Acked-by: Miguel Ojeda 
---
v7:
- __is_constant() renamed to __is_constexpr() (Miguel Ojeda)
- adjust memory offset from 1 to 8 (David Laight)
- min_t()/max_t() "t" renamed back to "type" (0-day bot)
- add Acks
---
 drivers/char/tpm/tpm_tis_core.h |  8 ++---
 include/linux/kernel.h  | 71 -
 2 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/drivers/char/tpm/tpm_tis_core.h b/drivers/char/tpm/tpm_tis_core.h
index d5c6a2e952b3..f6e1dbe212a7 100644
--- a/drivers/char/tpm/tpm_tis_core.h
+++ b/drivers/char/tpm/tpm_tis_core.h
@@ -62,10 +62,10 @@ enum tis_defaults {
 /* Some timeout values are needed before it is known whether the chip is
  * TPM 1.0 or TPM 2.0.
  */
-#define TIS_TIMEOUT_A_MAX  max(TIS_SHORT_TIMEOUT, TPM2_TIMEOUT_A)
-#define TIS_TIMEOUT_B_MAX  max(TIS_LONG_TIMEOUT, TPM2_TIMEOUT_B)
-#define TIS_TIMEOUT_C_MAX  max(TIS_SHORT_TIMEOUT, TPM2_TIMEOUT_C)
-#define TIS_TIMEOUT_D_MAX  max(TIS_SHORT_TIMEOUT, TPM2_TIMEOUT_D)
+#define TIS_TIMEOUT_A_MAX  max_t(int, TIS_SHORT_TIMEOUT, TPM2_TIMEOUT_A)
+#define TIS_TIMEOUT_B_MAX  max_t(int, TIS_LONG_TIMEOUT, TPM2_TIMEOUT_B)
+#define TIS_TIMEOUT_C_MAX  max_t(int, TIS_SHORT_TIMEOUT, TPM2_TIMEOUT_C)
+#define TIS_TIMEOUT_D_MAX  max_t(int, TIS_SHORT_TIMEOUT, TPM2_TIMEOUT_D)
 
 #define TPM_ACCESS(l)		(0x0000 | ((l) << 12))
 #define TPM_INT_ENABLE(l)	(0x0008 | ((l) << 12))
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3fd291503576..87acb8b58ae9 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -783,41 +783,58 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
 #endif /* CONFIG_TRACING */
 
 /*
- * min()/max()/clamp() macros that also do
- * strict type-checking.. See the
- * "unnecessary" pointer comparison.
+ * min()/max()/clamp() macros must accomplish three things:
+ *
+ * - avoid multiple evaluations of the arguments (so side-effects like
+ *   "x++" happen only once) when non-constant.
+ * - perform strict type-checking (to generate warnings instead of
+ *   nasty runtime surprises). See the "unnecessary" pointer comparison
+ *   in __typecheck().
+ * - retain result as a constant expressions when called with only
+ *   constant expressions (to avoid tripping VLA warnings in stack
+ *   allocation usage).
+ */
+#define __typecheck(x, y) \
+   (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
+
+/*
+ * This returns a constant expression while determining if an argument is
+ * a constant expression, most importantly without evaluating the argument.
+ * Glory to Martin Uecker 
  */
-#define __min(t1, t2, min1, min2, x, y) ({	\
-	t1 min1 = (x);				\
-	t2 min2 = (y);				\
-	(void) (&min1 == &min2);		\
-	min1 < min2 ? min1 : min2; })
+#define __is_constexpr(x) \
+   (sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))
+
+#define __no_side_effects(x, y) \
+   (__is_constexpr(x) && __is_constexpr(y))
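
To see why the __is_constexpr() trick works (an explanatory demo, not
part of the patch): when x is an integer constant expression,
(long)(x) * 0l is an integer constant expression with value 0, so the
(void *) cast yields a null pointer constant, the ?: result type is
int *, and sizeof(*...) equals sizeof(int). When x is non-constant, the
second operand is an ordinary void * expression, the ?: result type is
void *, and sizeof(*(void *)...) is 1 under GCC's extension:

#define __is_constexpr(x) \
	(sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))

static int demo(int n)
{
	int a = __is_constexpr(42);	/* 1: integer constant expression */
	int b = __is_constexpr(n);	/* 0: runtime value */

	return a + b;			/* always 1 */
}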
