Re: [PATCH 06/38] Annotate hardware config module parameters in drivers/clocksource/

2017-04-14 Thread David Howells
Thomas Gleixner  wrote:

> > Btw, is it possible to use IRQ grants to prevent a device that has limited
> > IRQ options from being drivable?
> 
> What do you mean with 'IRQ grants' ?

request_irq().

David


Re: [PATCH 06/38] Annotate hardware config module parameters in drivers/clocksource/

2017-04-14 Thread Thomas Gleixner
On Fri, 14 Apr 2017, David Howells wrote:
> Thomas Gleixner  wrote:
> 
> > > -module_param_named(irq, timer_irq, int, 0644);
> > > +module_param_hw_named(irq, timer_irq, int, irq, 0644);
> > >  MODULE_PARM_DESC(irq, "Which IRQ to use for the clock source MFGPT 
> > > ticks.");
> > 
> > I'm not sure about this. AFAIR the parameter is required to work on
> > anything other than some arbitrary hardware which has it mapped to 0.
> 
> Should it then be set through in-kernel platform initialisation since the
> AMD Geode is an embedded chip?

I think so. 

> Btw, is it possible to use IRQ grants to prevent a device that has limited IRQ
> options from being drivable?

What do you mean with 'IRQ grants' ?

Thanks

tglx


Re: [tip:x86/cpu 8/12] arch/x86/kernel/cpu/intel_rdt.c:63: error: unknown field 'cache' specified in initializer

2017-04-14 Thread Thomas Gleixner
On Sat, 15 Apr 2017, kbuild test robot wrote:

> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/cpu
> head:   64e8ed3d4a6dcd6139a869a3e760e625cb0d3022
> commit: 05b93417ce5b924c6652de19fdcc27439ab37c90 [8/12] x86/intel_rdt/mba: 
> Add primary support for Memory Bandwidth Allocation (MBA)
> config: x86_64-randconfig-s0-04150438 (attached as .config)
> compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
> reproduce:
> git checkout 05b93417ce5b924c6652de19fdcc27439ab37c90
> # save the attached .config to linux build tree
> make ARCH=x86_64 

That's weird.

> c1c7c3f9 Fenghua Yu  2016-10-22  57   {
> c1c7c3f9 Fenghua Yu  2016-10-22  58   .name   
> = "L3",
> c1c7c3f9 Fenghua Yu  2016-10-22  59   .domains
> = domain_init(RDT_RESOURCE_L3),
> c1c7c3f9 Fenghua Yu  2016-10-22  60   .msr_base   
> = IA32_L3_CBM_BASE,
> 0921c547 Thomas Gleixner 2017-04-14  61   .msr_update 
> = cat_wrmsr,
> c1c7c3f9 Fenghua Yu  2016-10-22  62   .cache_level
> = 3,
> d3e11b4d Thomas Gleixner 2017-04-14 @63   .cache = {
> d3e11b4d Thomas Gleixner 2017-04-14 @64   .min_cbm_bits   
> = 1,
> d3e11b4d Thomas Gleixner 2017-04-14 @65   .cbm_idx_mult   
> = 1,
> d3e11b4d Thomas Gleixner 2017-04-14  66   .cbm_idx_offset 
> = 0,
> d3e11b4d Thomas Gleixner 2017-04-14  67   },
> c1c7c3f9 Fenghua Yu  2016-10-22  68   },
> 
> :: The code at line 63 was first introduced by commit
> :: d3e11b4d6ffd363747ac6e6b5522baa9ca5a20c0 x86/intel_rdt: Move CBM 
> specific data into a struct
>

So the compiler fails to handle the anon union, which was introduced in
05b93417ce5b924. No idea why, but the concept is not new and is already
widely used in the kernel.

Thanks,

tglx




Re: [PATCH] remove return statement

2017-04-14 Thread Joe Perches
On Sat, 2017-04-15 at 10:35 +0530, surenderpolsani wrote:
> staging : rtl8188e : remove return in void function

Your patch subject isn't correct.

It should be something like:

Subject: [PATCH] staging: rtl8188e: Remove void function return

> kernel coding style doesn't allow a return statement
> in a void function.
[]
> diff --git a/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c 
> b/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c
[]
> @@ -165,7 +165,6 @@ void rtw_hal_dm_watchdog(struct adapter *Adapter)
>  skip_dm:
>   /*  Check GPIO to determine current RF on/off and Pbc status. */
>   /*  Check Hardware Radio ON/OFF or not */
> - return;

And the comments?
Are those supposed to be reminders of code to write?



[PATCH] remove return statement

2017-04-14 Thread surenderpolsani
staging : rtl8188e : remove return in void function

kernel coding style doesn't allow a return statement
in a void function.

Signed-off-by: surenderpolsani 
---
 drivers/staging/rtl8188eu/hal/rtl8188e_dm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c 
b/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c
index d04b7fb..6db0e19 100644
--- a/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c
+++ b/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c
@@ -165,7 +165,6 @@ void rtw_hal_dm_watchdog(struct adapter *Adapter)
 skip_dm:
/*  Check GPIO to determine current RF on/off and Pbc status. */
/*  Check Hardware Radio ON/OFF or not */
-   return;
 }
 
 void rtw_hal_dm_init(struct adapter *Adapter)
-- 
1.9.1



Re: [PATCH 08/22] crypto: chcr: Make use of the new sg_map helper function

2017-04-14 Thread Harsh Jain
On Fri, Apr 14, 2017 at 3:35 AM, Logan Gunthorpe  wrote:
> The get_page in this area looks *highly* suspect due to there being no
> corresponding put_page. However, I've left that as is to avoid breaking
> things.
The chcr driver posts the request to the LLD driver cxgb4, where put_page()
is implemented, so it does no harm. In any case, we have removed the code
below from the driver:

http://www.mail-archive.com/linux-crypto@vger.kernel.org/msg24561.html

Once that change is merged we can ignore your patch. Thanks

>
> I've also removed the KMAP_ATOMIC_ARGS check as it appears to be dead
> code that dates back to when it was first committed...


>
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/crypto/chelsio/chcr_algo.c | 28 +++-
>  1 file changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/crypto/chelsio/chcr_algo.c 
> b/drivers/crypto/chelsio/chcr_algo.c
> index 41bc7f4..a993d1d 100644
> --- a/drivers/crypto/chelsio/chcr_algo.c
> +++ b/drivers/crypto/chelsio/chcr_algo.c
> @@ -1489,22 +1489,21 @@ static struct sk_buff *create_authenc_wr(struct 
> aead_request *req,
> return ERR_PTR(-EINVAL);
>  }
>
> -static void aes_gcm_empty_pld_pad(struct scatterlist *sg,
> - unsigned short offset)
> +static int aes_gcm_empty_pld_pad(struct scatterlist *sg,
> +unsigned short offset)
>  {
> -   struct page *spage;
> unsigned char *addr;
>
> -   spage = sg_page(sg);
> -   get_page(spage); /* so that it is not freed by NIC */
> -#ifdef KMAP_ATOMIC_ARGS
> -   addr = kmap_atomic(spage, KM_SOFTIRQ0);
> -#else
> -   addr = kmap_atomic(spage);
> -#endif
> -   memset(addr + sg->offset, 0, offset + 1);
> +   get_page(sg_page(sg)); /* so that it is not freed by NIC */
> +
> +   addr = sg_map(sg, SG_KMAP_ATOMIC);
> +   if (IS_ERR(addr))
> +   return PTR_ERR(addr);
> +
> +   memset(addr, 0, offset + 1);
> +   sg_unmap(sg, addr, SG_KMAP_ATOMIC);
>
> -   kunmap_atomic(addr);
> +   return 0;
>  }
>
>  static int set_msg_len(u8 *block, unsigned int msglen, int csize)
> @@ -1940,7 +1939,10 @@ static struct sk_buff *create_gcm_wr(struct 
> aead_request *req,
> if (req->cryptlen) {
> write_sg_to_skb(skb, , src, req->cryptlen);
> } else {
> -   aes_gcm_empty_pld_pad(req->dst, authsize - 1);
> +   err = aes_gcm_empty_pld_pad(req->dst, authsize - 1);
> +   if (err)
> +   goto dstmap_fail;
> +
> write_sg_to_skb(skb, , reqctx->dst, crypt_len);
>
> }
> --
> 2.1.4
>


[PATCH] dt-bindings: input: rotary-encoder: fix typo

2017-04-14 Thread Rahul Bedarkar
s/rollove/rollover/

Signed-off-by: Rahul Bedarkar 
---
 Documentation/devicetree/bindings/input/rotary-encoder.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/input/rotary-encoder.txt 
b/Documentation/devicetree/bindings/input/rotary-encoder.txt
index e85ce3d..f99fe5c 100644
--- a/Documentation/devicetree/bindings/input/rotary-encoder.txt
+++ b/Documentation/devicetree/bindings/input/rotary-encoder.txt
@@ -12,7 +12,7 @@ Optional properties:
 - rotary-encoder,relative-axis: register a relative axis rather than an
   absolute one. Relative axis will only generate +1/-1 events on the input
   device, hence no steps need to be passed.
-- rotary-encoder,rollover: Automatic rollove when the rotary value becomes
+- rotary-encoder,rollover: Automatic rollover when the rotary value becomes
   greater than the specified steps or smaller than 0. For absolute axis only.
 - rotary-encoder,steps-per-period: Number of steps (stable states) per period.
   The values have the following meaning:
-- 
2.7.4



Re: [PATCH] clocksource: Use GENMASK_ULL in definition of CLOCKSOURCE_MASK

2017-04-14 Thread Dmitry Torokhov
Hi Matthias,

On Tue, Apr 11, 2017 at 12:17 PM, Matthias Kaehlcke  wrote:
> Besides reusing existing code this removes the special case handling
> for 64-bit masks, which causes clang to raise a shift count overflow
> warning due to https://bugs.llvm.org//show_bug.cgi?id=10030.
>
> Suggested-by: Dmitry Torokhov 
> Signed-off-by: Matthias Kaehlcke 
> ---
>  include/linux/clocksource.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index cfc75848a35d..06e604b9e9dc 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -120,7 +120,7 @@ struct clocksource {
>  #define CLOCK_SOURCE_RESELECT  0x100
>
>  /* simplify initialization of mask field */
> -#define CLOCKSOURCE_MASK(bits) (u64)((bits) < 64 ? ((1ULL<<(bits))-1) : -1)
> +#define CLOCKSOURCE_MASK(bits) (u64)GENMASK_ULL((bits) - 1, 0)

I do not think cast to u64 is needed for GENMASK_ULL.

>
>  static inline u32 clocksource_freq2mult(u32 freq, u32 shift_constant, u64 
> from)
>  {
> --
> 2.12.2.715.g7642488e1d-goog
>

Thanks,
Dmitry


Re: [PATCH 3/4] of: be consistent in form of file mode

2017-04-14 Thread Frank Rowand
Adding Stephen.

On 04/14/17 20:55, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> checkpatch whined about using S_IRUGO instead of octal equivalent
> when adding phandle sysfs code, so used octal in that patch.
> Change other instances of the S_* constants in the same file to
> the octal form.
> 
> Signed-off-by: Frank Rowand 
> ---
>  drivers/of/base.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 197946615503..4a8bd9623140 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -168,7 +168,7 @@ int __of_add_property_sysfs(struct device_node *np, 
> struct property *pp)
>  
>   sysfs_bin_attr_init(>attr);
>   pp->attr.attr.name = safe_name(>kobj, pp->name);
> - pp->attr.attr.mode = secure ? S_IRUSR : S_IRUGO;
> + pp->attr.attr.mode = secure ? 0400 : 0444;
>   pp->attr.size = secure ? 0 : pp->length;
>   pp->attr.read = of_node_property_read;
>  
> 



Re: [PATCH 4/4] of: detect invalid phandle in overlay

2017-04-14 Thread Frank Rowand
Adding Stephen.

On 04/14/17 20:55, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> Overlays are not allowed to modify phandle values of previously existing
> nodes because there is no information available to allow fixing up
> properties that use the previously existing phandle.
> 
> Signed-off-by: Frank Rowand 
> ---
>  drivers/of/overlay.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> index ca0b85f5deb1..20ab49d2f7a4 100644
> --- a/drivers/of/overlay.c
> +++ b/drivers/of/overlay.c
> @@ -130,6 +130,10 @@ static int of_overlay_apply_single_device_node(struct 
> of_overlay *ov,
>   /* NOTE: Multiple mods of created nodes not supported */
>   tchild = of_get_child_by_name(target, cname);
>   if (tchild != NULL) {
> + /* new overlay phandle value conflicts with existing value */
> + if (child->phandle)
> + return -EINVAL;
> +
>   /* apply overlay recursively */
>   ret = of_overlay_apply_one(ov, tchild, child);
>   of_node_put(tchild);
> 



Re: [PATCH 2/4] of: make __of_attach_node() static

2017-04-14 Thread Frank Rowand
Adding Stephen.

On 04/14/17 20:55, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> __of_attach_node() is not used outside of drivers/of/dynamic.c.  Make
> it static and remove it from drivers/of/of_private.h.
> 
> Signed-off-by: Frank Rowand 
> ---
>  drivers/of/dynamic.c| 2 +-
>  drivers/of/of_private.h | 1 -
>  2 files changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index c6fd3f32bfcb..74aafe594ad5 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -216,7 +216,7 @@ int of_property_notify(int action, struct device_node *np,
>   return of_reconfig_notify(action, );
>  }
>  
> -void __of_attach_node(struct device_node *np)
> +static void __of_attach_node(struct device_node *np)
>  {
>   np->child = NULL;
>   np->sibling = np->parent->child;
> diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
> index 18bbb4517e25..efcedcff7dba 100644
> --- a/drivers/of/of_private.h
> +++ b/drivers/of/of_private.h
> @@ -78,7 +78,6 @@ extern int __of_update_property(struct device_node *np,
>  extern void __of_update_property_sysfs(struct device_node *np,
>   struct property *newprop, struct property *oldprop);
>  
> -extern void __of_attach_node(struct device_node *np);
>  extern int __of_attach_node_sysfs(struct device_node *np);
>  extern void __of_detach_node(struct device_node *np);
>  extern void __of_detach_node_sysfs(struct device_node *np);
> 



Re: [PATCH 1/4] of: remove *phandle properties from expanded device tree

2017-04-14 Thread Frank Rowand
Adding Stephen.

On 04/14/17 20:55, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> Remove "phandle" and "linux,phandle" properties from the internal
> device tree.  The phandle will still be in the struct device_node
> phandle field.
> 
> This is to resolve the issue found by Stephen Boyd [1] when he changed
> the type of struct property.value from void * to const void *.  As
> a result of the type change, the overlay code had compile errors
> where the resolver updates phandle values.
> 
>   [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html
> 
> - Add sysfs infrastructure to report np->phandle, as if it was a property.
> - Do not create "phandle", "ibm,phandle", and "linux,phandle" properties
>   in the expanded device tree.
> - Remove no longer needed checks to exclude "phandle" and "linux,phandle"
>   properties in several locations.
> - A side effect of these changes is that the obsolete "linux,phandle"
>   properties will no longer appear in /proc/device-tree
> 
> Signed-off-by: Frank Rowand 
> ---
>  drivers/of/base.c | 51 
> ---
>  drivers/of/dynamic.c  | 29 -
>  drivers/of/fdt.c  | 40 
>  drivers/of/overlay.c  |  4 +---
>  drivers/of/resolver.c | 23 +--
>  include/linux/of.h|  1 +
>  6 files changed, 91 insertions(+), 57 deletions(-)
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index d7c4629a3a2d..197946615503 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -116,6 +116,19 @@ static ssize_t of_node_property_read(struct file *filp, 
> struct kobject *kobj,
>   return memory_read_from_buffer(buf, count, , pp->value, 
> pp->length);
>  }
>  
> +static ssize_t of_node_phandle_read(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *bin_attr, char *buf,
> + loff_t offset, size_t count)
> +{
> + phandle phandle;
> + struct device_node *np;
> +
> + np = container_of(bin_attr, struct device_node, attr_phandle);
> + phandle = cpu_to_be32(np->phandle);
> + return memory_read_from_buffer(buf, count, , ,
> +sizeof(phandle));
> +}
> +
>  /* always return newly allocated name, caller must free after use */
>  static const char *safe_name(struct kobject *kobj, const char *orig_name)
>  {
> @@ -164,6 +177,38 @@ int __of_add_property_sysfs(struct device_node *np, 
> struct property *pp)
>   return rc;
>  }
>  
> +/*
> + * In the imported device tree (fdt), phandle is a property.  In the
> + * internal data structure it is instead stored in the struct device_node.
> + * Make phandle visible in sysfs as if it was a property.
> + */
> +static int __of_add_phandle_sysfs(struct device_node *np)
> +{
> + int rc;
> +
> + if (IS_ENABLED(CONFIG_PPC_PSERIES))
> + return 0;
> +
> + if (!IS_ENABLED(CONFIG_SYSFS))
> + return 0;
> +
> + if (!of_kset || !of_node_is_attached(np))
> + return 0;
> +
> + if (!np->phandle || np->phandle == 0x)
> + return 0;
> +
> + sysfs_bin_attr_init(>attr);
> + np->attr_phandle.attr.name = "phandle";
> + np->attr_phandle.attr.mode = 0444;
> + np->attr_phandle.size = sizeof(np->phandle);
> + np->attr_phandle.read = of_node_phandle_read;
> +
> + rc = sysfs_create_bin_file(>kobj, >attr_phandle);
> + WARN(rc, "error adding attribute phandle to node %s\n", np->full_name);
> + return rc;
> +}
> +
>  int __of_attach_node_sysfs(struct device_node *np)
>  {
>   const char *name;
> @@ -193,6 +238,8 @@ int __of_attach_node_sysfs(struct device_node *np)
>   if (rc)
>   return rc;
>  
> + __of_add_phandle_sysfs(np);
> +
>   for_each_property_of_node(np, pp)
>   __of_add_property_sysfs(np, pp);
>  
> @@ -2097,9 +2144,7 @@ void of_alias_scan(void * (*dt_alloc)(u64 size, u64 
> align))
>   int id, len;
>  
>   /* Skip those we do not want to proceed */
> - if (!strcmp(pp->name, "name") ||
> - !strcmp(pp->name, "phandle") ||
> - !strcmp(pp->name, "linux,phandle"))
> + if (!strcmp(pp->name, "name"))
>   continue;
>  
>   np = of_find_node_by_path(pp->value);
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index 888fdbc09992..c6fd3f32bfcb 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -218,19 +218,6 @@ int of_property_notify(int action, struct device_node 
> *np,
>  
>  void __of_attach_node(struct device_node *np)
>  {
> - const __be32 *phandle;
> - int sz;
> -
> - np->name = __of_get_property(np, "name", NULL) ? : "";
> - np->type = __of_get_property(np, "device_type", NULL) ? : "";
> -
> - phandle = __of_get_property(np, "phandle", 

Re: [PATCH 1/4] of: remove *phandle properties from expanded device tree

2017-04-14 Thread Frank Rowand
Adding Stephen.

On 04/14/17 20:55, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> Remove "phandle" and "linux,phandle" properties from the internal
> device tree.  The phandle will still be in the struct device_node
> phandle field.
> 
> This is to resolve the issue found by Stephen Boyd [1] when he changed
> the type of struct property.value from void * to const void *.  As
> a result of the type change, the overlay code had compile errors
> where the resolver updates phandle values.
> 
>   [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html
> 
> - Add sysfs infrastructure to report np->phandle, as if it was a property.
> - Do not create "phandle" "ibm,phandle", and "linux,phandle" properties
>   in the expanded device tree.
> - Remove no longer needed checks to exclude "phandle" and "linux,phandle"
>   properties in several locations.
> - A side effect of these changes is that the obsolete "linux,phandle"
>   properties will no longer appear in /proc/device-tree
> 
> Signed-off-by: Frank Rowand 
> ---
>  drivers/of/base.c | 51 
> ---
>  drivers/of/dynamic.c  | 29 -
>  drivers/of/fdt.c  | 40 
>  drivers/of/overlay.c  |  4 +---
>  drivers/of/resolver.c | 23 +--
>  include/linux/of.h|  1 +
>  6 files changed, 91 insertions(+), 57 deletions(-)
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index d7c4629a3a2d..197946615503 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -116,6 +116,19 @@ static ssize_t of_node_property_read(struct file *filp, 
> struct kobject *kobj,
>   return memory_read_from_buffer(buf, count, &offset, pp->value, pp->length);
>  }
>  
> +static ssize_t of_node_phandle_read(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *bin_attr, char *buf,
> + loff_t offset, size_t count)
> +{
> + phandle phandle;
> + struct device_node *np;
> +
> + np = container_of(bin_attr, struct device_node, attr_phandle);
> + phandle = cpu_to_be32(np->phandle);
> + return memory_read_from_buffer(buf, count, &offset, &phandle, sizeof(phandle));
> +}
> +
>  /* always return newly allocated name, caller must free after use */
>  static const char *safe_name(struct kobject *kobj, const char *orig_name)
>  {
> @@ -164,6 +177,38 @@ int __of_add_property_sysfs(struct device_node *np, 
> struct property *pp)
>   return rc;
>  }
>  
> +/*
> + * In the imported device tree (fdt), phandle is a property.  In the
> + * internal data structure it is instead stored in the struct device_node.
> + * Make phandle visible in sysfs as if it was a property.
> + */
> +static int __of_add_phandle_sysfs(struct device_node *np)
> +{
> + int rc;
> +
> + if (IS_ENABLED(CONFIG_PPC_PSERIES))
> + return 0;
> +
> + if (!IS_ENABLED(CONFIG_SYSFS))
> + return 0;
> +
> + if (!of_kset || !of_node_is_attached(np))
> + return 0;
> +
> + if (!np->phandle || np->phandle == 0xffffffff)
> + return 0;
> +
> + sysfs_bin_attr_init(&np->attr_phandle);
> + np->attr_phandle.attr.name = "phandle";
> + np->attr_phandle.attr.mode = 0444;
> + np->attr_phandle.size = sizeof(np->phandle);
> + np->attr_phandle.read = of_node_phandle_read;
> +
> + rc = sysfs_create_bin_file(&np->kobj, &np->attr_phandle);
> + WARN(rc, "error adding attribute phandle to node %s\n", np->full_name);
> + return rc;
> +}
> +
>  int __of_attach_node_sysfs(struct device_node *np)
>  {
>   const char *name;
> @@ -193,6 +238,8 @@ int __of_attach_node_sysfs(struct device_node *np)
>   if (rc)
>   return rc;
>  
> + __of_add_phandle_sysfs(np);
> +
>   for_each_property_of_node(np, pp)
>   __of_add_property_sysfs(np, pp);
>  
> @@ -2097,9 +2144,7 @@ void of_alias_scan(void * (*dt_alloc)(u64 size, u64 
> align))
>   int id, len;
>  
>   /* Skip those we do not want to proceed */
> - if (!strcmp(pp->name, "name") ||
> - !strcmp(pp->name, "phandle") ||
> - !strcmp(pp->name, "linux,phandle"))
> + if (!strcmp(pp->name, "name"))
>   continue;
>  
>   np = of_find_node_by_path(pp->value);
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index 888fdbc09992..c6fd3f32bfcb 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -218,19 +218,6 @@ int of_property_notify(int action, struct device_node 
> *np,
>  
>  void __of_attach_node(struct device_node *np)
>  {
> - const __be32 *phandle;
> - int sz;
> -
> - np->name = __of_get_property(np, "name", NULL) ? : "";
> - np->type = __of_get_property(np, "device_type", NULL) ? : "";
> -
> - phandle = __of_get_property(np, "phandle", &sz);
> - if (!phandle)
> - 

Re: [PATCH 0/4] of: remove *phandle properties from expanded device tree

2017-04-14 Thread Frank Rowand
Hi Stephen,

I left you off the distribution list, sorry...

On 04/14/17 20:55, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> Remove "phandle" and "linux,phandle" properties from the internal
> device tree.  The phandle will still be in the struct device_node
> phandle field.
> 
> This is to resolve the issue found by Stephen Boyd [1] when he changed
> the type of struct property.value from void * to const void *.  As
> a result of the type change, the overlay code had compile errors
> where the resolver updates phandle values.
> 
>   [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html
> 
> Patch 1 is the phandle related changes.
> 
> Patches 2 - 4 are minor fixups for issues that became visible
> while implementing patch 1.
> 
> Frank Rowand (4):
>   of: remove *phandle properties from expanded device tree
>   of: make __of_attach_node() static
>   of: be consistent in form of file mode
>   of: detect invalid phandle in overlay
> 
>  drivers/of/base.c   | 53 
> +
>  drivers/of/dynamic.c| 31 -
>  drivers/of/fdt.c| 40 ++---
>  drivers/of/of_private.h |  1 -
>  drivers/of/overlay.c|  8 +---
>  drivers/of/resolver.c   | 23 +
>  include/linux/of.h  |  1 +
>  7 files changed, 97 insertions(+), 60 deletions(-)
> 




[PATCH 1/4] of: remove *phandle properties from expanded device tree

2017-04-14 Thread frowand . list
From: Frank Rowand 

Remove "phandle" and "linux,phandle" properties from the internal
device tree.  The phandle will still be in the struct device_node
phandle field.

This is to resolve the issue found by Stephen Boyd [1] when he changed
the type of struct property.value from void * to const void *.  As
a result of the type change, the overlay code had compile errors
where the resolver updates phandle values.

  [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html

- Add sysfs infrastructure to report np->phandle, as if it was a property.
- Do not create "phandle", "ibm,phandle", and "linux,phandle" properties
  in the expanded device tree.
- Remove no longer needed checks to exclude "phandle" and "linux,phandle"
  properties in several locations.
- A side effect of these changes is that the obsolete "linux,phandle"
  properties will no longer appear in /proc/device-tree

Signed-off-by: Frank Rowand 
---
 drivers/of/base.c | 51 ---
 drivers/of/dynamic.c  | 29 -
 drivers/of/fdt.c  | 40 
 drivers/of/overlay.c  |  4 +---
 drivers/of/resolver.c | 23 +--
 include/linux/of.h|  1 +
 6 files changed, 91 insertions(+), 57 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index d7c4629a3a2d..197946615503 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -116,6 +116,19 @@ static ssize_t of_node_property_read(struct file *filp, 
struct kobject *kobj,
	return memory_read_from_buffer(buf, count, &offset, pp->value, pp->length);
 }
 
+static ssize_t of_node_phandle_read(struct file *filp, struct kobject *kobj,
+   struct bin_attribute *bin_attr, char *buf,
+   loff_t offset, size_t count)
+{
+   phandle phandle;
+   struct device_node *np;
+
+   np = container_of(bin_attr, struct device_node, attr_phandle);
+   phandle = cpu_to_be32(np->phandle);
+   return memory_read_from_buffer(buf, count, &offset, &phandle, sizeof(phandle));
+}
+
 /* always return newly allocated name, caller must free after use */
 static const char *safe_name(struct kobject *kobj, const char *orig_name)
 {
@@ -164,6 +177,38 @@ int __of_add_property_sysfs(struct device_node *np, struct 
property *pp)
return rc;
 }
 
+/*
+ * In the imported device tree (fdt), phandle is a property.  In the
+ * internal data structure it is instead stored in the struct device_node.
+ * Make phandle visible in sysfs as if it was a property.
+ */
+static int __of_add_phandle_sysfs(struct device_node *np)
+{
+   int rc;
+
+   if (IS_ENABLED(CONFIG_PPC_PSERIES))
+   return 0;
+
+   if (!IS_ENABLED(CONFIG_SYSFS))
+   return 0;
+
+   if (!of_kset || !of_node_is_attached(np))
+   return 0;
+
+   if (!np->phandle || np->phandle == 0xffffffff)
+   return 0;
+
+   sysfs_bin_attr_init(&np->attr_phandle);
+   np->attr_phandle.attr.name = "phandle";
+   np->attr_phandle.attr.mode = 0444;
+   np->attr_phandle.size = sizeof(np->phandle);
+   np->attr_phandle.read = of_node_phandle_read;
+
+   rc = sysfs_create_bin_file(&np->kobj, &np->attr_phandle);
+   WARN(rc, "error adding attribute phandle to node %s\n", np->full_name);
+   return rc;
+}
+
 int __of_attach_node_sysfs(struct device_node *np)
 {
const char *name;
@@ -193,6 +238,8 @@ int __of_attach_node_sysfs(struct device_node *np)
if (rc)
return rc;
 
+   __of_add_phandle_sysfs(np);
+
for_each_property_of_node(np, pp)
__of_add_property_sysfs(np, pp);
 
@@ -2097,9 +2144,7 @@ void of_alias_scan(void * (*dt_alloc)(u64 size, u64 
align))
int id, len;
 
/* Skip those we do not want to proceed */
-   if (!strcmp(pp->name, "name") ||
-   !strcmp(pp->name, "phandle") ||
-   !strcmp(pp->name, "linux,phandle"))
+   if (!strcmp(pp->name, "name"))
continue;
 
np = of_find_node_by_path(pp->value);
diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index 888fdbc09992..c6fd3f32bfcb 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -218,19 +218,6 @@ int of_property_notify(int action, struct device_node *np,
 
 void __of_attach_node(struct device_node *np)
 {
-   const __be32 *phandle;
-   int sz;
-
-   np->name = __of_get_property(np, "name", NULL) ? : "";
-   np->type = __of_get_property(np, "device_type", NULL) ? : "";
-
-   phandle = __of_get_property(np, "phandle", &sz);
-   if (!phandle)
-   phandle = __of_get_property(np, "linux,phandle", &sz);
-   if (IS_ENABLED(CONFIG_PPC_PSERIES) && !phandle)
-   phandle = __of_get_property(np, "ibm,phandle", &sz);
-   np->phandle = (phandle && (sz >= 4)) ? be32_to_cpup(phandle) : 0;
-
 

[PATCH 0/4] of: remove *phandle properties from expanded device tree

2017-04-14 Thread frowand . list
From: Frank Rowand 

Remove "phandle" and "linux,phandle" properties from the internal
device tree.  The phandle will still be in the struct device_node
phandle field.

This is to resolve the issue found by Stephen Boyd [1] when he changed
the type of struct property.value from void * to const void *.  As
a result of the type change, the overlay code had compile errors
where the resolver updates phandle values.

  [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html

Patch 1 is the phandle related changes.

Patches 2 - 4 are minor fixups for issues that became visible
while implementing patch 1.

Frank Rowand (4):
  of: remove *phandle properties from expanded device tree
  of: make __of_attach_node() static
  of: be consistent in form of file mode
  of: detect invalid phandle in overlay

 drivers/of/base.c   | 53 +
 drivers/of/dynamic.c| 31 -
 drivers/of/fdt.c| 40 ++---
 drivers/of/of_private.h |  1 -
 drivers/of/overlay.c|  8 +---
 drivers/of/resolver.c   | 23 +
 include/linux/of.h  |  1 +
 7 files changed, 97 insertions(+), 60 deletions(-)

-- 
Frank Rowand 



[PATCH 2/4] of: make __of_attach_node() static

2017-04-14 Thread frowand . list
From: Frank Rowand 

__of_attach_node() is not used outside of drivers/of/dynamic.c.  Make
it static and remove it from drivers/of/of_private.h.

Signed-off-by: Frank Rowand 
---
 drivers/of/dynamic.c| 2 +-
 drivers/of/of_private.h | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index c6fd3f32bfcb..74aafe594ad5 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -216,7 +216,7 @@ int of_property_notify(int action, struct device_node *np,
	return of_reconfig_notify(action, &rd);
 }
 
-void __of_attach_node(struct device_node *np)
+static void __of_attach_node(struct device_node *np)
 {
np->child = NULL;
np->sibling = np->parent->child;
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index 18bbb4517e25..efcedcff7dba 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -78,7 +78,6 @@ extern int __of_update_property(struct device_node *np,
 extern void __of_update_property_sysfs(struct device_node *np,
struct property *newprop, struct property *oldprop);
 
-extern void __of_attach_node(struct device_node *np);
 extern int __of_attach_node_sysfs(struct device_node *np);
 extern void __of_detach_node(struct device_node *np);
 extern void __of_detach_node_sysfs(struct device_node *np);
-- 
Frank Rowand 



[PATCH 4/4] of: detect invalid phandle in overlay

2017-04-14 Thread frowand . list
From: Frank Rowand 

Overlays are not allowed to modify phandle values of previously existing
nodes because there is no information available to allow fixing up
properties that use the previously existing phandle.

Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index ca0b85f5deb1..20ab49d2f7a4 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -130,6 +130,10 @@ static int of_overlay_apply_single_device_node(struct 
of_overlay *ov,
/* NOTE: Multiple mods of created nodes not supported */
tchild = of_get_child_by_name(target, cname);
if (tchild != NULL) {
+   /* new overlay phandle value conflicts with existing value */
+   if (child->phandle)
+   return -EINVAL;
+
/* apply overlay recursively */
ret = of_overlay_apply_one(ov, tchild, child);
of_node_put(tchild);
-- 
Frank Rowand 




[PATCH 3/4] of: be consistent in form of file mode

2017-04-14 Thread frowand . list
From: Frank Rowand 

checkpatch whined about using S_IRUGO instead of octal equivalent
when adding phandle sysfs code, so used octal in that patch.
Change other instances of the S_* constants in the same file to
the octal form.

Signed-off-by: Frank Rowand 
---
 drivers/of/base.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 197946615503..4a8bd9623140 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -168,7 +168,7 @@ int __of_add_property_sysfs(struct device_node *np, struct 
property *pp)
 
	sysfs_bin_attr_init(&pp->attr);
	pp->attr.attr.name = safe_name(&np->kobj, pp->name);
-   pp->attr.attr.mode = secure ? S_IRUSR : S_IRUGO;
+   pp->attr.attr.mode = secure ? 0400 : 0444;
pp->attr.size = secure ? 0 : pp->length;
pp->attr.read = of_node_property_read;
 
-- 
Frank Rowand 




Re: [PATCH v4 2/2] thermal: core: Add a back up thermal shutdown mechanism

2017-04-14 Thread Keerthy


On Friday 14 April 2017 11:48 PM, Eduardo Valentin wrote:
> Hey,
> 
> On Fri, Apr 14, 2017 at 08:42:20AM -0700, Eduardo Valentin wrote:
>> Hello again,
>>
>> On Fri, Apr 14, 2017 at 08:38:40AM -0700, Eduardo Valentin wrote:
>>> Hey,
>>>
>>> On Fri, Apr 14, 2017 at 02:22:13PM +0530, Keerthy wrote:
 orderly_poweroff is triggered when a graceful shutdown
 of the system is desired. This may be used in many critical states of the
 kernel, such as when subsystems detect critical
 temperature conditions. However, in certain conditions in system
 boot up sequences, like those in the middle of driver probes being
 initiated, userspace will be unable to power off the system in a clean
 manner and leaves the system in a critical state. In cases like these,
 /sbin/poweroff will return success (having forked off to attempt
 powering off the system). However, the system overall will fail to
 completely power off (since other modules will be probed) and the system
 is still functional with no userspace (since that would have shut itself
 off).

 However, there is no clean way of detecting such failure of userspace
 powering off the system. In such scenarios, it is necessary for a backup
 workqueue to be able to force a shutdown of the system when orderly
 shutdown is not successful after a configurable time period.

 Reported-by: Nishanth Menon 
 Signed-off-by: Keerthy 
 ---

 Changes in v4:

   * Updated documentation
   * changed emergency_poweroff_func to thermal_emergency_poweroff_func

 Changes in v3:

   * Removed unnecessary mutex init.
   * Added WARN messages instead of a simple warning message.
   * Added Documentation.

  Documentation/thermal/sysfs-api.txt | 19 +++
  drivers/thermal/Kconfig | 13 +++
  drivers/thermal/thermal_core.c  | 46 
 +
  3 files changed, 78 insertions(+)

 diff --git a/Documentation/thermal/sysfs-api.txt 
 b/Documentation/thermal/sysfs-api.txt
 index ef473dc..e73cc12 100644
 --- a/Documentation/thermal/sysfs-api.txt
 +++ b/Documentation/thermal/sysfs-api.txt
 @@ -582,3 +582,22 @@ platform data is provided, this uses the step_wise 
 throttling policy.
  This function serves as an arbitrator to set the state of a cooling
  device. It sets the cooling device to the deepest cooling state if
  possible.
 +
 +6. thermal_emergency_poweroff:
 +
 +On an event of critical trip temperature crossing, the thermal framework
 +allows the system to shut down gracefully by calling orderly_poweroff().
 +In the event of a failure of orderly_poweroff() to shut down the system
 +we are in danger of keeping the system alive at undesirably high
 +temperatures. To mitigate this high risk scenario we program a work
 +queue to fire after a pre-determined number of seconds to start
 +an emergency shutdown of the device using the kernel_power_off()
 +function. In case kernel_power_off() fails then finally
 +emergency_restart() is called in the worst case.
 +
 +The delay should be carefully profiled so as to give adequate time for
 +orderly_poweroff(). In case of failure of an orderly_poweroff() the
 +emergency poweroff kicks in after the delay has elapsed and shuts down
 +the system.
 +
 +If set to 0 emergency poweroff will happen immediately.
 diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
 index 9347401..0dd5b85 100644
 --- a/drivers/thermal/Kconfig
 +++ b/drivers/thermal/Kconfig
 @@ -15,6 +15,19 @@ menuconfig THERMAL
  
  if THERMAL
  
 +config THERMAL_EMERGENCY_POWEROFF_DELAY_MS
 +  int "Emergency poweroff delay in milli-seconds"
 +  depends on THERMAL
 +  default 0
>>>
>>> Only now I realized that merging this may break the working
>>> orderly_poweroff() out there, because you are defaulting this to 0, no
>>> delay, therefore giving no time for orderly_poweroff() to finish. This
>>> is not good.
>>>
>>> I think using 0 delay as immediate power off is not good as we give no
>>> time for graceful shutdown, and by default. My suggestion here
>>> is to use 0 delay as no forced shutdown. Meaning, by default, this
>>> feature is disabled, and all other systems out there, despite DRA7 with
>>> arago over NFS, work as before.
> 
> A better solution could be to have bool Kconfig, say
> THERMAL_EMERGENCY_POWEROFF, which would default to false. If one selects
> that option, you get the DELAY_MS configurable, and then you could get
> the 0 ms still as a valid entry, with the same semantics of immediate
> power off, no orderly_poweroff.
> 
> I just want to avoid breaking everybody (or changing userland
> expectation) in honor of this change.

Sure. I have now used the default value (0) to mean no emergency shutdown.

[PATCH v5 2/2] thermal: core: Add a back up thermal shutdown mechanism

2017-04-14 Thread Keerthy
orderly_poweroff is triggered when a graceful shutdown
of system is desired. This may be used in many critical states of the
kernel such as when subsystems detect conditions such as critical
temperature conditions. However, in certain conditions in system
boot up sequences like those in the middle of driver probes being
initiated, userspace will be unable to power off the system in a clean
manner and leaves the system in a critical state. In cases like these,
the /sbin/poweroff will return success (having forked off to attempt
powering off the system). However, the system overall will fail to
completely poweroff (since other modules will be probed) and the system
is still functional with no userspace (since that would have shut itself
off).

However, there is no clean way of detecting such failure of userspace
powering off the system. In such scenarios, it is necessary for a backup
workqueue to be able to force a shutdown of the system when orderly
shutdown is not successful after a configurable time period.

Reported-by: Nishanth Menon 
Signed-off-by: Keerthy 
---

Changes in v5:

  * Mandated delay for thermal emergency poweroff to be a non-zero value.

Changes in v4:

  * Updated documentation
  * changed emergency_poweroff_func to thermal_emergency_poweroff_func

Changes in v3:

  * Removed unnecessary mutex init.
  * Added WARN messages instead of a simple warning message.
  * Added Documentation.

 Documentation/thermal/sysfs-api.txt | 21 +++
 drivers/thermal/Kconfig | 15 +++
 drivers/thermal/thermal_core.c  | 53 +
 3 files changed, 89 insertions(+)

diff --git a/Documentation/thermal/sysfs-api.txt b/Documentation/thermal/sysfs-api.txt
index ef473dc..98dc04f 100644
--- a/Documentation/thermal/sysfs-api.txt
+++ b/Documentation/thermal/sysfs-api.txt
@@ -582,3 +582,24 @@ platform data is provided, this uses the step_wise throttling policy.
 This function serves as an arbitrator to set the state of a cooling
 device. It sets the cooling device to the deepest cooling state if
 possible.
+
+6. thermal_emergency_poweroff:
+
+On an event of critical trip temperature crossing, the thermal framework
+allows the system to shutdown gracefully by calling orderly_poweroff().
+In the event of a failure of orderly_poweroff() to shut down the system
+we are in danger of keeping the system alive at undesirably high
+temperatures. To mitigate this high risk scenario we program a work
+queue to fire after a pre-determined number of seconds to start
+an emergency shutdown of the device using the kernel_power_off()
+function. In case kernel_power_off() fails then finally
+emergency_restart() is called in the worst case.
+
+The delay should be carefully profiled so as to give adequate time for
+orderly_poweroff(). In case of failure of an orderly_poweroff() the
+emergency poweroff kicks in after the delay has elapsed and shuts down
+the system.
+
+If set to 0 emergency poweroff will not be supported. So a carefully
+profiled non-zero positive value is a must for emergency poweroff to be
+triggered.
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 9347401..2a748a6 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -15,6 +15,21 @@ menuconfig THERMAL
 
 if THERMAL
 
+config THERMAL_EMERGENCY_POWEROFF_DELAY_MS
+   int "Emergency poweroff delay in milli-seconds"
+   depends on THERMAL
+   default 0
+   help
+ The number of milliseconds to delay before emergency
+ poweroff kicks in. The delay should be carefully profiled
+ so as to give adequate time for orderly_poweroff(). In case
+ of failure of an orderly_poweroff() the emergency poweroff
+ kicks in after the delay has elapsed and shuts down the system.
+
+ If set to 0 emergency poweroff will not be supported. So a carefully
+ profiled non-zero positive value is a must for emergency poweroff to be
+ triggered.
+
 config THERMAL_HWMON
bool
prompt "Expose thermal sensors as hwmon device"
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 8337c27..de1f7be 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -324,6 +324,54 @@ static void handle_non_critical_trips(struct thermal_zone_device *tz,
   def_governor->throttle(tz, trip);
 }
 
+/**
+ * thermal_emergency_poweroff_func - emergency poweroff work after a known delay
+ * @work: work_struct associated with the emergency poweroff function
+ *
+ * This function is called in very critical situations to force
+ * a kernel poweroff after a configurable timeout value.
+ */
+static void thermal_emergency_poweroff_func(struct work_struct *work)
+{
+   /*
+* We have reached here after the emergency thermal shutdown
+* Waiting period has expired. This means orderly_poweroff has
+* not been able to shut off the system for some reason.

[PATCH v5 1/2] thermal: core: Allow orderly_poweroff to be called only once

2017-04-14 Thread Keerthy
thermal_zone_device_check --> thermal_zone_device_update -->
handle_thermal_trip --> handle_critical_trips --> orderly_poweroff

The above sequence happens every 250/500 mS based on the configuration.
The orderly_poweroff function is getting called every 250/500 mS.
With a full-fledged file system it takes at least 5-10 seconds to
power off gracefully.

In that period, due to thermal_zone_device_check triggering
periodically, the thermal work queues bombard the system with
orderly_poweroff calls multiple times, eventually leading to
failures in gracefully powering off the system.

Make sure that orderly_poweroff is called only once.

Signed-off-by: Keerthy 
Acked-by: Eduardo Valentin 
---

Changes in v5:

  * Added Eduardo's Ack.

Changes in v4:

  * power_off_triggered declaration together with mutex definition.

Changes in v3:

  * Changed the place where mutex was locked and unlocked.

Changes in v2:

  * Added a global mutex to serialize poweroff code sequence.

 drivers/thermal/thermal_core.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 11f0675..8337c27 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -45,8 +45,10 @@
 
 static DEFINE_MUTEX(thermal_list_lock);
 static DEFINE_MUTEX(thermal_governor_lock);
+static DEFINE_MUTEX(poweroff_lock);
 
 static atomic_t in_suspend;
+static bool power_off_triggered;
 
 static struct thermal_governor *def_governor;
 
@@ -342,7 +344,12 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
	dev_emerg(&tz->device,
  "critical temperature reached(%d C),shutting down\n",
  tz->temperature / 1000);
-   orderly_poweroff(true);
+   mutex_lock(&poweroff_lock);
+   if (!power_off_triggered) {
+   orderly_poweroff(true);
+   power_off_triggered = true;
+   }
+   mutex_unlock(&poweroff_lock);
}
 }
 
@@ -1463,6 +1470,7 @@ static int __init thermal_init(void)
 {
int result;
 
+   mutex_init(&poweroff_lock);
result = thermal_register_governors();
if (result)
goto error;
@@ -1497,6 +1505,7 @@ static int __init thermal_init(void)
	ida_destroy(&thermal_cdev_ida);
	mutex_destroy(&thermal_list_lock);
	mutex_destroy(&thermal_governor_lock);
+	mutex_destroy(&poweroff_lock);
return result;
 }
 
-- 
1.9.1



[PATCH v2 02/33] dax: refactor dax-fs into a generic provider of 'struct dax_device' instances

2017-04-14 Thread Dan Williams
We want dax capable drivers to be able to publish a set of dax
operations [1]. However, we do not want to further abuse block_devices
to advertise these operations. Instead we will attach these operations
to a dax device and add a lookup mechanism to go from block device path
to a dax device. A dax capable driver like pmem or brd is responsible
for registering a dax device, alongside a block device, and then a dax
capable filesystem is responsible for retrieving the dax device by path
name if it wants to call dax_operations.

For now, we refactor the dax pseudo-fs to be a generic facility, rather
than an implementation detail, of the device-dax use case. Where a "dax
device" is just an inode + dax infrastructure, and "Device DAX" is a
mapping service layered on top of that base 'struct dax_device'.
"Filesystem DAX" is then a mapping service that layers a filesystem on
top of that same base device. Filesystem DAX is associated with a
block_device for now, but perhaps directly to a dax device in the
future, or for new pmem-only filesystems.

[1]: https://lkml.org/lkml/2017/1/19/880

Suggested-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 drivers/Makefile|2 
 drivers/dax/Kconfig |   10 +
 drivers/dax/Makefile|5 +
 drivers/dax/dax.h   |   20 +--
 drivers/dax/device-dax.h|   25 
 drivers/dax/device.c|  241 ++
 drivers/dax/pmem.c  |2 
 drivers/dax/super.c |  303 +++
 include/linux/dax.h |3 
 tools/testing/nvdimm/Kbuild |   10 +
 10 files changed, 404 insertions(+), 217 deletions(-)
 create mode 100644 drivers/dax/device-dax.h
 rename drivers/dax/{dax.c => device.c} (77%)
 create mode 100644 drivers/dax/super.c

diff --git a/drivers/Makefile b/drivers/Makefile
index 2eced9afba53..0442e982cf35 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -71,7 +71,7 @@ obj-$(CONFIG_PARPORT) += parport/
 obj-$(CONFIG_NVM)  += lightnvm/
 obj-y  += base/ block/ misc/ mfd/ nfc/
 obj-$(CONFIG_LIBNVDIMM)+= nvdimm/
-obj-$(CONFIG_DEV_DAX)  += dax/
+obj-$(CONFIG_DAX)  += dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
 obj-$(CONFIG_NUBUS)+= nubus/
 obj-y  += macintosh/
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 9e95bf94eb13..b7053eafd88e 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,8 +1,13 @@
-menuconfig DEV_DAX
+menuconfig DAX
tristate "DAX: direct access to differentiated memory"
+   select SRCU
default m if NVDIMM_DAX
+
+if DAX
+
+config DEV_DAX
+   tristate "Device DAX: direct access mapping device"
depends on TRANSPARENT_HUGEPAGE
-   select SRCU
help
  Support raw access to differentiated (persistence, bandwidth,
  latency...) memory via an mmap(2) capable character
@@ -11,7 +16,6 @@ menuconfig DEV_DAX
  baseline memory pool.  Mappings of a /dev/daxX.Y device impose
  restrictions that make the mapping behavior deterministic.
 
-if DEV_DAX
 
 config DEV_DAX_PMEM
tristate "PMEM DAX: direct access to persistent memory"
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 27c54e38478a..dc7422530462 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -1,4 +1,7 @@
-obj-$(CONFIG_DEV_DAX) += dax.o
+obj-$(CONFIG_DAX) += dax.o
+obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
 
+dax-y := super.o
 dax_pmem-y := pmem.o
+device_dax-y := device.o
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index ea176d875d60..2472d9da96db 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ * Copyright(c) 2016 - 2017 Intel Corporation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of version 2 of the GNU General Public License as
@@ -12,14 +12,12 @@
  */
 #ifndef __DAX_H__
 #define __DAX_H__
-struct device;
-struct dev_dax;
-struct resource;
-struct dax_region;
-void dax_region_put(struct dax_region *dax_region);
-struct dax_region *alloc_dax_region(struct device *parent,
-   int region_id, struct resource *res, unsigned int align,
-   void *addr, unsigned long flags);
-struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
-   struct resource *res, int count);
+struct dax_device;
+struct dax_device *alloc_dax(void *private);
+void put_dax(struct dax_device *dax_dev);
+bool dax_alive(struct dax_device *dax_dev);
+void kill_dax(struct dax_device *dax_dev);
+struct dax_device *inode_dax(struct inode *inode);
+struct inode *dax_inode(struct dax_device *dax_dev);
+void *dax_get_private(struct dax_device *dax_dev);
 #endif /* __DAX_H__ */
diff --git a/drivers/dax/device-dax.h b/drivers/dax/device-dax.h

[PATCH v2 11/33] dm: add dax_device and dax_operations support

2017-04-14 Thread Dan Williams
Allocate a dax_device to represent the capacity of a device-mapper
instance. Provide a ->direct_access() method via the new dax_operations
indirection that mirrors the functionality of the current direct_access
support via block_device_operations.  Once fs/dax.c has been converted
to use dax_operations the old dm_blk_direct_access() will be removed.

A new helper dm_dax_get_live_target() is introduced to separate some of
the dm-specifics from the direct_access implementation.

This enabling is only for the top-level dm representation to upper
layers. Converting target direct_access implementations is deferred to a
separate patch.

Cc: Toshi Kani 
Cc: Mike Snitzer 
Signed-off-by: Dan Williams 
---
 drivers/md/Kconfig|1 
 drivers/md/dm-core.h  |1 
 drivers/md/dm.c   |   84 ++---
 include/linux/device-mapper.h |1 
 4 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b7767da50c26..1de8372d9459 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
 config BLK_DEV_DM
tristate "Device mapper support"
select BLK_DEV_DM_BUILTIN
+   select DAX
---help---
  Device-mapper is a low level volume manager.  It works by allowing
  people to specify mappings for ranges of logical sectors.  Various
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 136fda3ff9e5..538630190f66 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -58,6 +58,7 @@ struct mapped_device {
struct target_type *immutable_target_type;
 
struct gendisk *disk;
+   struct dax_device *dax_dev;
char name[16];
 
void *interface_ptr;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..bd56dfe43a99 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -908,31 +909,68 @@ int dm_set_target_max_io_len(struct dm_target *ti, sector_t len)
 }
 EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
 
-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
-void **kaddr, pfn_t *pfn, long size)
+static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
+   sector_t sector, int *srcu_idx)
 {
-   struct mapped_device *md = bdev->bd_disk->private_data;
struct dm_table *map;
struct dm_target *ti;
-   int srcu_idx;
-   long len, ret = -EIO;
 
-   map = dm_get_live_table(md, &srcu_idx);
+   map = dm_get_live_table(md, srcu_idx);
if (!map)
-   goto out;
+   return NULL;
 
ti = dm_table_find_target(map, sector);
if (!dm_target_is_valid(ti))
-   goto out;
+   return NULL;
 
-   len = max_io_len(sector, ti) << SECTOR_SHIFT;
-   size = min(len, size);
+   return ti;
+}
 
-   if (ti->type->direct_access)
-   ret = ti->type->direct_access(ti, sector, kaddr, pfn, size);
-out:
+static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   struct mapped_device *md = dax_get_private(dax_dev);
+   sector_t sector = pgoff * PAGE_SECTORS;
+   struct dm_target *ti;
+   long len, ret = -EIO;
+   int srcu_idx;
+
+   ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+   if (!ti)
+   goto out;
+   if (!ti->type->direct_access)
+   goto out;
+   len = max_io_len(sector, ti) / PAGE_SECTORS;
+   if (len < 1)
+   goto out;
+   nr_pages = min(len, nr_pages);
+   if (ti->type->direct_access) {
+   ret = ti->type->direct_access(ti, sector, kaddr, pfn,
+   nr_pages * PAGE_SIZE);
+   /*
+* FIXME: convert ti->type->direct_access to return
+* nr_pages directly.
+*/
+   if (ret >= 0)
+   ret /= PAGE_SIZE;
+   }
+ out:
dm_put_live_table(md, srcu_idx);
-   return min(ret, size);
+
+   return ret;
+}
+
+static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct mapped_device *md = bdev->bd_disk->private_data;
+   struct dax_device *dax_dev = md->dax_dev;
+   long nr_pages = size / PAGE_SIZE;
+
+   nr_pages = dm_dax_direct_access(dax_dev, sector / PAGE_SECTORS,
+   nr_pages, kaddr, pfn);
+   return nr_pages < 0 ? nr_pages : nr_pages * PAGE_SIZE;
 }
 
 /*
@@ -1437,6 +1475,7 @@ static int next_free_minor(int *minor)
 }
 
 static const struct block_device_operations dm_blk_dops;
+static const struct dax_operations dm_dax_ops;
 
 static void dm_wq_work(struct work_struct *work);

[PATCH v2 11/33] dm: add dax_device and dax_operations support

2017-04-14 Thread Dan Williams
Allocate a dax_device to represent the capacity of a device-mapper
instance. Provide a ->direct_access() method via the new dax_operations
indirection that mirrors the functionality of the current direct_access
support via block_device_operations.  Once fs/dax.c has been converted
to use dax_operations the old dm_blk_direct_access() will be removed.

A new helper dm_dax_get_live_target() is introduced to separate some of
the dm-specifics from the direct_access implementation.

This enabling is only for the top-level dm representation to upper
layers. Converting target direct_access implementations is deferred to a
separate patch.

Cc: Toshi Kani 
Cc: Mike Snitzer 
Signed-off-by: Dan Williams 
---
 drivers/md/Kconfig|1 
 drivers/md/dm-core.h  |1 
 drivers/md/dm.c   |   84 ++---
 include/linux/device-mapper.h |1 
 4 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b7767da50c26..1de8372d9459 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
 config BLK_DEV_DM
tristate "Device mapper support"
select BLK_DEV_DM_BUILTIN
+   select DAX
---help---
  Device-mapper is a low level volume manager.  It works by allowing
  people to specify mappings for ranges of logical sectors.  Various
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 136fda3ff9e5..538630190f66 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -58,6 +58,7 @@ struct mapped_device {
struct target_type *immutable_target_type;
 
struct gendisk *disk;
+   struct dax_device *dax_dev;
char name[16];
 
void *interface_ptr;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..bd56dfe43a99 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -908,31 +909,68 @@ int dm_set_target_max_io_len(struct dm_target *ti, 
sector_t len)
 }
 EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
 
-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
-void **kaddr, pfn_t *pfn, long size)
+static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
+   sector_t sector, int *srcu_idx)
 {
-   struct mapped_device *md = bdev->bd_disk->private_data;
struct dm_table *map;
struct dm_target *ti;
-   int srcu_idx;
-   long len, ret = -EIO;
 
-   map = dm_get_live_table(md, _idx);
+   map = dm_get_live_table(md, srcu_idx);
if (!map)
-   goto out;
+   return NULL;
 
ti = dm_table_find_target(map, sector);
if (!dm_target_is_valid(ti))
-   goto out;
+   return NULL;
 
-   len = max_io_len(sector, ti) << SECTOR_SHIFT;
-   size = min(len, size);
+   return ti;
+}
 
-   if (ti->type->direct_access)
-   ret = ti->type->direct_access(ti, sector, kaddr, pfn, size);
-out:
+static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   struct mapped_device *md = dax_get_private(dax_dev);
+   sector_t sector = pgoff * PAGE_SECTORS;
+   struct dm_target *ti;
+   long len, ret = -EIO;
+   int srcu_idx;
+
+   ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+   if (!ti)
+   goto out;
+   if (!ti->type->direct_access)
+   goto out;
+   len = max_io_len(sector, ti) / PAGE_SECTORS;
+   if (len < 1)
+   goto out;
+   nr_pages = min(len, nr_pages);
+   if (ti->type->direct_access) {
+   ret = ti->type->direct_access(ti, sector, kaddr, pfn,
+   nr_pages * PAGE_SIZE);
+   /*
+* FIXME: convert ti->type->direct_access to return
+* nr_pages directly.
+*/
+   if (ret >= 0)
+   ret /= PAGE_SIZE;
+   }
+ out:
dm_put_live_table(md, srcu_idx);
-   return min(ret, size);
+
+   return ret;
+}
+
+static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct mapped_device *md = bdev->bd_disk->private_data;
+   struct dax_device *dax_dev = md->dax_dev;
+   long nr_pages = size / PAGE_SIZE;
+
+   nr_pages = dm_dax_direct_access(dax_dev, sector / PAGE_SECTORS,
+   nr_pages, kaddr, pfn);
+   return nr_pages < 0 ? nr_pages : nr_pages * PAGE_SIZE;
 }
 
 /*
@@ -1437,6 +1475,7 @@ static int next_free_minor(int *minor)
 }
 
 static const struct block_device_operations dm_blk_dops;
+static const struct dax_operations dm_dax_ops;
 
 static void dm_wq_work(struct work_struct *work);
 
@@ 

[PATCH v2 07/33] brd: add dax_operations support

2017-04-14 Thread Dan Williams
Set up a dax_inode to have the same lifetime as the brd block device and
add a ->direct_access() method that is equivalent to
brd_direct_access(). Once fs/dax.c has been converted to use
dax_operations, the old brd_direct_access() will be removed.

Signed-off-by: Dan Williams 
---
 drivers/block/Kconfig |1 +
 drivers/block/brd.c   |   65 +
 2 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index f744de7a0f9b..e66956fc2c88 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -339,6 +339,7 @@ config BLK_DEV_SX8
 
 config BLK_DEV_RAM
tristate "RAM block device support"
+   select DAX if BLK_DEV_RAM_DAX
---help---
  Saying Y here will allow you to use a portion of your RAM memory as
  a block device, so that you can make file systems on it, read and
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..60f3193c9ce2 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -21,6 +21,7 @@
 #include 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 #include 
+#include 
 #endif
 
 #include 
@@ -41,6 +42,9 @@ struct brd_device {
 
struct request_queue*brd_queue;
struct gendisk  *brd_disk;
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   struct dax_device   *dax_dev;
+#endif
struct list_headbrd_list;
 
/*
@@ -375,30 +379,53 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
+static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
 {
-   struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
 
if (!brd)
return -ENODEV;
-   page = brd_insert_page(brd, sector);
+   page = brd_insert_page(brd, PFN_PHYS(pgoff) / 512);
if (!page)
return -ENOSPC;
*kaddr = page_address(page);
*pfn = page_to_pfn_t(page);
 
-   return PAGE_SIZE;
+   return 1;
+}
+
+static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct brd_device *brd = bdev->bd_disk->private_data;
+   long nr_pages = __brd_direct_access(brd, PHYS_PFN(sector * 512),
+   PHYS_PFN(size), kaddr, pfn);
+
+   if (nr_pages < 0)
+   return nr_pages;
+   return nr_pages * PAGE_SIZE;
+}
+
+static long brd_dax_direct_access(struct dax_device *dax_dev,
+   pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   struct brd_device *brd = dax_get_private(dax_dev);
+
+   return __brd_direct_access(brd, pgoff, nr_pages, kaddr, pfn);
 }
+
+static const struct dax_operations brd_dax_ops = {
+   .direct_access = brd_dax_direct_access,
+};
 #else
-#define brd_direct_access NULL
+#define brd_blk_direct_access NULL
 #endif
 
 static const struct block_device_operations brd_fops = {
.owner =THIS_MODULE,
.rw_page =  brd_rw_page,
-   .direct_access =brd_direct_access,
+   .direct_access =brd_blk_direct_access,
 };
 
 /*
@@ -441,7 +468,9 @@ static struct brd_device *brd_alloc(int i)
 {
struct brd_device *brd;
struct gendisk *disk;
-
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   struct dax_device *dax_dev;
+#endif
brd = kzalloc(sizeof(*brd), GFP_KERNEL);
if (!brd)
goto out;
@@ -469,9 +498,6 @@ static struct brd_device *brd_alloc(int i)
blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
brd->brd_queue->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-   queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
-#endif
disk = brd->brd_disk = alloc_disk(max_part);
if (!disk)
goto out_free_queue;
@@ -484,8 +510,21 @@ static struct brd_device *brd_alloc(int i)
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);
 
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
+   dax_dev = alloc_dax(brd, disk->disk_name, &brd_dax_ops);
+   if (!dax_dev)
+   goto out_free_inode;
+#endif
+
+
return brd;
 
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+out_free_inode:
+   kill_dax(dax_dev);
+   put_dax(dax_dev);
+#endif
 out_free_queue:
blk_cleanup_queue(brd->brd_queue);
 out_free_dev:
@@ -525,6 +564,10 @@ static struct brd_device *brd_init_one(int i, bool *new)
 static void brd_del_one(struct brd_device *brd)
 {
list_del(&brd->brd_list);
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+   kill_dax(brd->dax_dev);
+   

[PATCH v2 12/33] dm: teach dm-targets to use a dax_device + dax_operations

2017-04-14 Thread Dan Williams
Arrange for dm to look up the dax services available from member devices.
Update the dax-capable targets, linear and stripe, to route dax
operations to the underlying device. Change the target-internal
->direct_access() method to more closely align with the dax_operations
->direct_access() calling convention.

Cc: Toshi Kani 
Cc: Mike Snitzer 
Signed-off-by: Dan Williams 
---
 drivers/md/dm-linear.c|   27 +--
 drivers/md/dm-snap.c  |6 +++---
 drivers/md/dm-stripe.c|   29 ++---
 drivers/md/dm-target.c|6 +++---
 drivers/md/dm.c   |   16 ++--
 include/linux/device-mapper.h |7 ---
 6 files changed, 43 insertions(+), 48 deletions(-)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b989a9..c5a52f4dae81 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -141,22 +142,20 @@ static int linear_iterate_devices(struct dm_target *ti,
return fn(ti, lc->dev, lc->start, ti->len, data);
 }
 
-static long linear_direct_access(struct dm_target *ti, sector_t sector,
-void **kaddr, pfn_t *pfn, long size)
+static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
 {
+   long ret;
struct linear_c *lc = ti->private;
struct block_device *bdev = lc->dev->bdev;
-   struct blk_dax_ctl dax = {
-   .sector = linear_map_sector(ti, sector),
-   .size = size,
-   };
-   long ret;
-
-   ret = bdev_direct_access(bdev, &dax);
-   *kaddr = dax.addr;
-   *pfn = dax.pfn;
-
-   return ret;
+   struct dax_device *dax_dev = lc->dev->dax_dev;
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+   dev_sector = linear_map_sector(ti, sector);
+   ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages * PAGE_SIZE, &pgoff);
+   if (ret)
+   return ret;
+   return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
 }
 
 static struct target_type linear_target = {
@@ -169,7 +168,7 @@ static struct target_type linear_target = {
.status = linear_status,
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
-   .direct_access = linear_direct_access,
+   .direct_access = linear_dax_direct_access,
 };
 
 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index c65feeada864..e152d9817c81 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2302,8 +2302,8 @@ static int origin_map(struct dm_target *ti, struct bio *bio)
return do_origin(o->dev, bio);
 }
 
-static long origin_direct_access(struct dm_target *ti, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
+static long origin_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
 {
DMWARN("device does not support dax.");
return -EIO;
@@ -2368,7 +2368,7 @@ static struct target_type origin_target = {
.postsuspend = origin_postsuspend,
.status  = origin_status,
.iterate_devices = origin_iterate_devices,
-   .direct_access = origin_direct_access,
+   .direct_access = origin_dax_direct_access,
 };
 
 static struct target_type snapshot_target = {
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 28193a57bf47..cb4b1e9e16ab 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -308,27 +309,25 @@ static int stripe_map(struct dm_target *ti, struct bio *bio)
return DM_MAPIO_REMAPPED;
 }
 
-static long stripe_direct_access(struct dm_target *ti, sector_t sector,
-void **kaddr, pfn_t *pfn, long size)
+static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
 {
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
struct stripe_c *sc = ti->private;
-   uint32_t stripe;
+   struct dax_device *dax_dev;
struct block_device *bdev;
-   struct blk_dax_ctl dax = {
-   .size = size,
-   };
+   uint32_t stripe;
long ret;
 
-   stripe_map_sector(sc, sector, &stripe, &dax.sector);
-
-   dax.sector += sc->stripe[stripe].physical_start;
+   stripe_map_sector(sc, sector, &stripe, &dev_sector);
+   dev_sector += sc->stripe[stripe].physical_start;
+   dax_dev = sc->stripe[stripe].dev->dax_dev;
bdev = sc->stripe[stripe].dev->bdev;
 
-   ret = bdev_direct_access(bdev, &dax);
-   *kaddr = dax.addr;
-   *pfn = dax.pfn;
-
-   return ret;
+   ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages * PAGE_SIZE, &pgoff);
+   if (ret)
+   return ret;

[PATCH v2 21/33] filesystem-dax: convert to dax_copy_from_iter()

2017-04-14 Thread Dan Williams
Now that all possible providers of the dax_operations copy_from_iter
method are implemented, switch filesystem-dax to call the driver rather
than copy_from_iter_pmem.

Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/pmem.h |   50 ---
 fs/dax.c|3 ++-
 include/linux/pmem.h|   24 -
 3 files changed, 2 insertions(+), 75 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d5a22bac9988..60e8edbe0205 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -66,56 +66,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
 }
 
 /**
- * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
- * @addr:  PMEM destination address
- * @bytes: number of bytes to copy
- * @i: iterator with source data
- *
- * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
- */
-static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
-   struct iov_iter *i)
-{
-   size_t len;
-
-   /* TODO: skip the write-back by always using non-temporal stores */
-   len = copy_from_iter_nocache(addr, bytes, i);
-
-   /*
-* In the iovec case on x86_64 copy_from_iter_nocache() uses
-* non-temporal stores for the bulk of the transfer, but we need
-* to manually flush if the transfer is unaligned. A cached
-* memory copy is used when destination or size is not naturally
-* aligned. That is:
-*   - Require 8-byte alignment when size is 8 bytes or larger.
-*   - Require 4-byte alignment when size is 4 bytes.
-*
-* In the non-iovec case the entire destination needs to be
-* flushed.
-*/
-   if (iter_is_iovec(i)) {
-   unsigned long flushed, dest = (unsigned long) addr;
-
-   if (bytes < 8) {
-   if (!IS_ALIGNED(dest, 4) || (bytes != 4))
-   arch_wb_cache_pmem(addr, 1);
-   } else {
-   if (!IS_ALIGNED(dest, 8)) {
-   dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
-   arch_wb_cache_pmem(addr, 1);
-   }
-
-   flushed = dest - (unsigned long) addr;
-   if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
-   arch_wb_cache_pmem(addr + bytes - 1, 1);
-   }
-   } else
-   arch_wb_cache_pmem(addr, bytes);
-
-   return len;
-}
-
-/**
  * arch_clear_pmem - zero a PMEM memory range
  * @addr:  virtual start address
  * @size:  number of bytes to zero
diff --git a/fs/dax.c b/fs/dax.c
index ce9dc9c3e829..11b9909c91df 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1061,7 +1061,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
map_len = end - pos;
 
if (iov_iter_rw(iter) == WRITE)
-   map_len = copy_from_iter_pmem(kaddr, map_len, iter);
+   map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr,
+   map_len, iter);
else
map_len = copy_to_iter(kaddr, map_len, iter);
if (map_len <= 0) {
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 71ecf3d46aac..9d542a5600e4 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,13 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
 }
 
-static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
-   struct iov_iter *i)
-{
-   BUG();
-   return 0;
-}
-
 static inline void arch_clear_pmem(void *addr, size_t size)
 {
BUG();
@@ -80,23 +73,6 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
 }
 
 /**
- * copy_from_iter_pmem - copy data from an iterator to PMEM
- * @addr:  PMEM destination address
- * @bytes: number of bytes to copy
- * @i: iterator with source data
- *
- * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline size_t copy_from_iter_pmem(void *addr, size_t bytes,
-   struct iov_iter *i)
-{
-   if (arch_has_pmem_api())
-   return arch_copy_from_iter_pmem(addr, bytes, i);
-   return copy_from_iter_nocache(addr, bytes, i);
-}
-
-/**
  * clear_pmem - zero a PMEM memory range
  * @addr:  virtual start address
  * @size:  number of bytes to zero



[PATCH v2 17/33] block: remove block_device_operations ->direct_access()

2017-04-14 Thread Dan Williams
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams 
---
 arch/powerpc/sysdev/axonram.c |   23 -
 drivers/block/brd.c   |   15 --
 drivers/md/dm.c   |   13 
 drivers/nvdimm/pmem.c |   10 -
 drivers/s390/block/dcssblk.c  |   16 ---
 fs/block_dev.c|   45 -
 include/linux/blkdev.h|   17 ---
 7 files changed, 4 insertions(+), 135 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ad857d5e81b1..83eb56ff1d2c 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,6 +139,10 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
return BLK_QC_T_NONE;
 }
 
+static const struct block_device_operations axon_ram_devops = {
+   .owner  = THIS_MODULE,
+};
+
 static long
 __axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long nr_pages,
   void **kaddr, pfn_t *pfn)
@@ -150,25 +154,6 @@ __axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long nr_page
return (bank->size - offset) / PAGE_SIZE;
 }
 
-/**
- * axon_ram_direct_access - direct_access() method for block device
- * @device, @sector, @data: see block_device_operations method
- */
-static long
-axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
-  void **kaddr, pfn_t *pfn, long size)
-{
-   struct axon_ram_bank *bank = device->bd_disk->private_data;
-
-   return __axon_ram_direct_access(bank, (sector * 512) / PAGE_SIZE,
-   size / PAGE_SIZE, kaddr, pfn) * PAGE_SIZE;
-}
-
-static const struct block_device_operations axon_ram_devops = {
-   .owner  = THIS_MODULE,
-   .direct_access  = axon_ram_blk_direct_access
-};
-
 static long
 axon_ram_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
   void **kaddr, pfn_t *pfn)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 60f3193c9ce2..bfa4ed2c75ef 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -395,18 +395,6 @@ static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff,
return 1;
 }
 
-static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
-{
-   struct brd_device *brd = bdev->bd_disk->private_data;
-   long nr_pages = __brd_direct_access(brd, PHYS_PFN(sector * 512),
-   PHYS_PFN(size), kaddr, pfn);
-
-   if (nr_pages < 0)
-   return nr_pages;
-   return nr_pages * PAGE_SIZE;
-}
-
 static long brd_dax_direct_access(struct dax_device *dax_dev,
pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
 {
@@ -418,14 +406,11 @@ static long brd_dax_direct_access(struct dax_device *dax_dev,
 static const struct dax_operations brd_dax_ops = {
.direct_access = brd_dax_direct_access,
 };
-#else
-#define brd_blk_direct_access NULL
 #endif
 
 static const struct block_device_operations brd_fops = {
.owner =THIS_MODULE,
.rw_page =  brd_rw_page,
-   .direct_access =brd_blk_direct_access,
 };
 
 /*
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ef4c6f8cad47..79d5f5fd823e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -957,18 +957,6 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
return ret;
 }
 
-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
-{
-   struct mapped_device *md = bdev->bd_disk->private_data;
-   struct dax_device *dax_dev = md->dax_dev;
-   long nr_pages = size / PAGE_SIZE;
-
-   nr_pages = dm_dax_direct_access(dax_dev, sector / PAGE_SECTORS,
-   nr_pages, kaddr, pfn);
-   return nr_pages < 0 ? nr_pages : nr_pages * PAGE_SIZE;
-}
-
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_PREFLUSH.
@@ -2823,7 +2811,6 @@ static const struct block_device_operations dm_blk_dops = {
.open = dm_blk_open,
.release = dm_blk_close,
.ioctl = dm_blk_ioctl,
-   .direct_access = dm_blk_direct_access,
.getgeo = dm_blk_getgeo,
.pr_ops = _pr_ops,
.owner = THIS_MODULE
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fbbcf8154eec..85b85633d674 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -220,19 +220,9 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
return PHYS_PFN(pmem->size - pmem->pfn_pad 

[PATCH v2 22/33] dax, pmem: introduce an optional 'flush' dax_operation

2017-04-14 Thread Dan Williams
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so add a dax operation
to allow pmem to take this extra action, but skip it for other dax
capable devices that do not provide a flush routine.

An example for this differentiation might be a volatile ram disk where
there is no expectation of persistence. In fact the pmem driver itself might
front such an address range specified by the NFIT. So, this "no flush"
property might be something passed down by the bus / libnvdimm.
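As a standalone sketch (not the kernel's actual API; the struct layouts and the `dax_flush()` helper here are simplified stand-ins), this is the dispatch pattern an optional `->flush` operation enables — devices that need no cache writeback simply leave the hook NULL:

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel's dax_device / dax_operations. */
struct dax_device;

struct dax_operations {
	/* flush is optional: NULL means the device needs no cache writeback */
	void (*flush)(struct dax_device *dax_dev, unsigned long pgoff,
		      void *addr, size_t size);
};

struct dax_device {
	const struct dax_operations *ops;
};

/* Dispatch helper: a volatile ram disk leaves ->flush NULL and this
 * call degenerates to a no-op; pmem supplies a real writeback routine. */
static void dax_flush(struct dax_device *dax_dev, unsigned long pgoff,
		      void *addr, size_t size)
{
	if (dax_dev->ops->flush)
		dax_dev->ops->flush(dax_dev, pgoff, addr, size);
}
```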

Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pmem.c |   11 +++
 include/linux/dax.h   |2 ++
 2 files changed, 13 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e501df4ab4b4..822b85fb3365 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -276,9 +276,20 @@ static long pmem_dax_direct_access(struct dax_device 
*dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
+   void *addr, size_t size)
+{
+   /*
+* TODO: move arch specific cache management into the driver
+* directly.
+*/
+   wb_cache_pmem(addr, size);
+}
+
 static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
.copy_from_iter = pmem_copy_from_iter,
+   .flush = pmem_dax_flush,
 };
 
 static void pmem_release_queue(void *q)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index cd8561bb21f3..c88bbcba26d9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -19,6 +19,8 @@ struct dax_operations {
/* copy_from_iter: dax-driver override for default copy_from_iter */
size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
struct iov_iter *);
+   /* flush: optional driver-specific cache management after writes */
+   void (*flush)(struct dax_device *, pgoff_t, void *, size_t);
 };
 
 int dax_read_lock(void);



[PATCH v2 17/33] block: remove block_device_operations ->direct_access()

2017-04-14 Thread Dan Williams
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.
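The removed `->direct_access()` wrappers all performed the same unit conversion: 512-byte sectors in, pages out, bytes back to the caller. A hedged sketch of that arithmetic (constants and helper names are illustrative, not kernel symbols):

```c
#include <stddef.h>

#define SECTOR_SIZE  512UL
#define PAGE_SIZE_   4096UL
#define PAGE_SECTORS (PAGE_SIZE_ / SECTOR_SIZE)	/* 8 sectors per page */

/* block layer speaks sectors; dax_operations speak page offsets */
static long sector_to_pgoff(unsigned long sector)
{
	return (long)(sector / PAGE_SECTORS);
}

/* dax returns a page count (or negative error); the old block-layer
 * wrappers converted that back to bytes, propagating errors as-is */
static long bytes_from_pages(long nr_pages)
{
	return nr_pages < 0 ? nr_pages : nr_pages * (long)PAGE_SIZE_;
}
```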

Signed-off-by: Dan Williams 
---
 arch/powerpc/sysdev/axonram.c |   23 -
 drivers/block/brd.c   |   15 --
 drivers/md/dm.c   |   13 
 drivers/nvdimm/pmem.c |   10 -
 drivers/s390/block/dcssblk.c  |   16 ---
 fs/block_dev.c|   45 -
 include/linux/blkdev.h|   17 ---
 7 files changed, 4 insertions(+), 135 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ad857d5e81b1..83eb56ff1d2c 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,6 +139,10 @@ axon_ram_make_request(struct request_queue *queue, struct 
bio *bio)
return BLK_QC_T_NONE;
 }
 
+static const struct block_device_operations axon_ram_devops = {
+   .owner  = THIS_MODULE,
+};
+
 static long
 __axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long 
nr_pages,
   void **kaddr, pfn_t *pfn)
@@ -150,25 +154,6 @@ __axon_ram_direct_access(struct axon_ram_bank *bank, 
pgoff_t pgoff, long nr_page
return (bank->size - offset) / PAGE_SIZE;
 }
 
-/**
- * axon_ram_direct_access - direct_access() method for block device
- * @device, @sector, @data: see block_device_operations method
- */
-static long
-axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
-  void **kaddr, pfn_t *pfn, long size)
-{
-   struct axon_ram_bank *bank = device->bd_disk->private_data;
-
-   return __axon_ram_direct_access(bank, (sector * 512) / PAGE_SIZE,
-   size / PAGE_SIZE, kaddr, pfn) * PAGE_SIZE;
-}
-
-static const struct block_device_operations axon_ram_devops = {
-   .owner  = THIS_MODULE,
-   .direct_access  = axon_ram_blk_direct_access
-};
-
 static long
 axon_ram_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long 
nr_pages,
   void **kaddr, pfn_t *pfn)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 60f3193c9ce2..bfa4ed2c75ef 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -395,18 +395,6 @@ static long __brd_direct_access(struct brd_device *brd, 
pgoff_t pgoff,
return 1;
 }
 
-static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
-{
-   struct brd_device *brd = bdev->bd_disk->private_data;
-   long nr_pages = __brd_direct_access(brd, PHYS_PFN(sector * 512),
-   PHYS_PFN(size), kaddr, pfn);
-
-   if (nr_pages < 0)
-   return nr_pages;
-   return nr_pages * PAGE_SIZE;
-}
-
 static long brd_dax_direct_access(struct dax_device *dax_dev,
pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
 {
@@ -418,14 +406,11 @@ static long brd_dax_direct_access(struct dax_device 
*dax_dev,
 static const struct dax_operations brd_dax_ops = {
.direct_access = brd_dax_direct_access,
 };
-#else
-#define brd_blk_direct_access NULL
 #endif
 
 static const struct block_device_operations brd_fops = {
.owner =THIS_MODULE,
.rw_page =  brd_rw_page,
-   .direct_access =brd_blk_direct_access,
 };
 
 /*
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ef4c6f8cad47..79d5f5fd823e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -957,18 +957,6 @@ static long dm_dax_direct_access(struct dax_device 
*dax_dev, pgoff_t pgoff,
return ret;
 }
 
-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
-   void **kaddr, pfn_t *pfn, long size)
-{
-   struct mapped_device *md = bdev->bd_disk->private_data;
-   struct dax_device *dax_dev = md->dax_dev;
-   long nr_pages = size / PAGE_SIZE;
-
-   nr_pages = dm_dax_direct_access(dax_dev, sector / PAGE_SECTORS,
-   nr_pages, kaddr, pfn);
-   return nr_pages < 0 ? nr_pages : nr_pages * PAGE_SIZE;
-}
-
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_PREFLUSH.
@@ -2823,7 +2811,6 @@ static const struct block_device_operations dm_blk_dops = 
{
.open = dm_blk_open,
.release = dm_blk_close,
.ioctl = dm_blk_ioctl,
-   .direct_access = dm_blk_direct_access,
.getgeo = dm_blk_getgeo,
.pr_ops = &dm_pr_ops,
.owner = THIS_MODULE
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fbbcf8154eec..85b85633d674 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -220,19 +220,9 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
 }
 
-static 



[PATCH v2 18/33] x86, dax, pmem: remove indirection around memcpy_from_pmem()

2017-04-14 Thread Dan Williams
memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
serves no real benefit aside from affording a more generic function name
than the x86-specific 'mcsafe'. However this would not be the first time
that x86 terminology leaked into the global namespace. For lack of a
better name, just use memcpy_mcsafe() directly.

This conversion also catches a place where we should have been using
plain memcpy, acpi_nfit_blk_single_io().
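The resulting call pattern is the same everywhere: a machine-check-safe copy reports failure with a nonzero return, and the caller maps that to -EIO. A hedged userspace sketch follows; `copy_safe()` is a mock standing in for memcpy_mcsafe(), with "poison" simulated by a sentinel address:

```c
#include <string.h>
#include <stddef.h>

#define EIO 5

/* Mock of memcpy_mcsafe(): pretend reads from 'poison' take a
 * machine check and report failure; everything else copies cleanly. */
static int copy_safe(void *dst, const void *src, size_t n,
		     const void *poison)
{
	if (src == poison)
		return -1;		/* copy aborted mid-way */
	memcpy(dst, src, n);
	return 0;
}

/* The read_pmem()/nsio_rw_bytes() pattern after the wrapper removal */
static int read_pmem(void *dst, const void *src, size_t n,
		     const void *poison)
{
	if (copy_safe(dst, src, n, poison))
		return -EIO;
	return 0;
}
```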

Cc: 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: Tony Luck 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/pmem.h  |5 -
 arch/x86/include/asm/string_64.h |1 +
 drivers/acpi/nfit/core.c |3 +--
 drivers/nvdimm/claim.c   |2 +-
 drivers/nvdimm/pmem.c|2 +-
 include/linux/pmem.h |   23 ---
 include/linux/string.h   |8 
 7 files changed, 12 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 529bb4a6487a..d5a22bac9988 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,11 +44,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void 
*src, size_t n)
BUG();
 }
 
-static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
-{
-   return memcpy_mcsafe(dst, src, n);
-}
-
 /**
  * arch_wb_cache_pmem - write back a cache range with CLWB
  * @vaddr: virtual start address
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index a164862d77e3..733bae07fb29 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -79,6 +79,7 @@ int strcmp(const char *cs, const char *ct);
 #define memset(s, c, n) __memset(s, c, n)
 #endif
 
+#define __HAVE_ARCH_MEMCPY_MCSAFE 1
 __must_check int memcpy_mcsafe_unrolled(void *dst, const void *src, size_t 
cnt);
 DECLARE_STATIC_KEY_FALSE(mcsafe_key);
 
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index c8ea9d698cd0..d0c07b2344e4 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1783,8 +1783,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk 
*nfit_blk,
mmio_flush_range((void __force *)
mmio->addr.aperture + offset, c);
 
-   memcpy_from_pmem(iobuf + copied,
-   mmio->addr.aperture + offset, c);
+   memcpy(iobuf + copied, mmio->addr.aperture + offset, c);
}
 
copied += c;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index ca6d572c48fc..3a35e8028b9c 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -239,7 +239,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
if (rw == READ) {
if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align)))
return -EIO;
-   return memcpy_from_pmem(buf, nsio->addr + offset, size);
+   return memcpy_mcsafe(buf, nsio->addr + offset, size);
}
 
if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align))) {
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 85b85633d674..3b3dab73d741 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -89,7 +89,7 @@ static int read_pmem(struct page *page, unsigned int off,
int rc;
void *mem = kmap_atomic(page);
 
-   rc = memcpy_from_pmem(mem + off, pmem_addr, len);
+   rc = memcpy_mcsafe(mem + off, pmem_addr, len);
kunmap_atomic(mem);
if (rc)
return -EIO;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index e856c2cb0fe8..71ecf3d46aac 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,12 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void 
*src, size_t n)
BUG();
 }
 
-static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
-{
-   BUG();
-   return -EFAULT;
-}
-
 static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
struct iov_iter *i)
 {
@@ -65,23 +59,6 @@ static inline bool arch_has_pmem_api(void)
return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API);
 }
 
-/*
- * memcpy_from_pmem - read from persistent memory with error handling
- * @dst: destination buffer
- * @src: source buffer
- * @size: transfer length
- *
- * Returns 0 on success negative error code on failure.
- */
-static inline int memcpy_from_pmem(void *dst, void const *src, size_t size)
-{
-   if (arch_has_pmem_api())
-   return arch_memcpy_from_pmem(dst, src, size);
-   else
-   memcpy(dst, src, size);
-   return 0;
-}
-
 /**
  * memcpy_to_pmem - copy data to persistent memory
  * @dst: destination buffer for the copy
diff --git a/include/linux/string.h 

[PATCH v2 28/33] x86, libnvdimm, dax: stop abusing __copy_user_nocache

2017-04-14 Thread Dan Williams
The pmem and nd_blk drivers both have need to copy data through the cpu
cache to persistent memory. To date they have been abusing
__copy_user_nocache through the memcpy_to_pmem abstraction, but this has
several problems:

* __copy_user_nocache does not guarantee that it will always avoid the
  cache. While we have fixed the cases where the pmem usage might
  trigger that behavior it's a fragile assumption and burdens the
  uaccess.h implementation with worrying about the distinction between
  'nocache' and the stricter write-through semantic needed by pmem.
  Quoting Linus: "Quite frankly, the whole "memcpy_nocache()" idea or
  (ab-)using copy_user_nocache() just needs to die. ... If some driver
  ends up using "movnt" by hand, that is up to that *driver*."

* It implements SMAP (supervisor mode access protection) which is only
  meant for user copies.

* It expects faults. For in-kernel copies, faults are fatal and we
  should not be coding for exception handling in that case.

__arch_memcpy_to_pmem() is effectively a copy of __copy_user_nocache()
minus SMAP, unaligned support, and exception handling. The configuration
symbol ARCH_HAS_PMEM_API is also moved local to libnvdimm to be next to
the implementation.
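The shape of such a copy routine can be sketched portably. This is only an illustration of the structure (unaligned head, "non-temporal" aligned bulk, tail); `nt_store4()` is a plain-store stand-in for a real movnti instruction, and no actual cache bypassing happens here:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

static void nt_store4(void *dst, uint32_t v)
{
	memcpy(dst, &v, sizeof(v));	/* stand-in for a movnti store */
}

/* Unaligned head and tail go through plain stores (which a real
 * implementation must then flush); the aligned bulk uses the
 * "non-temporal" primitive; a real version ends with sfence to
 * drain posted writes. */
static void memcpy_to_pmem_sketch(void *dst, const void *src, size_t n)
{
	char *d = dst;
	const char *s = src;

	while (n && ((uintptr_t)d & 3)) {	/* unaligned head */
		*d++ = *s++;
		n--;
	}
	while (n >= 4) {			/* aligned bulk */
		uint32_t v;
		memcpy(&v, s, sizeof(v));
		nt_store4(d, v);
		d += 4; s += 4; n -= 4;
	}
	while (n--)				/* tail */
		*d++ = *s++;
}
```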

Cc: 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: Toshi Kani 
Cc: Tony Luck 
Cc: "H. Peter Anvin" 
Cc: Al Viro 
Cc: Thomas Gleixner 
Cc: Oliver O'Halloran 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Cc: Linus Torvalds 
Signed-off-by: Dan Williams 
---
 MAINTAINERS |2 -
 arch/x86/Kconfig|1 -
 arch/x86/include/asm/pmem.h |   48 -
 drivers/acpi/nfit/core.c|3 +-
 drivers/nvdimm/Kconfig  |4 ++
 drivers/nvdimm/claim.c  |4 +-
 drivers/nvdimm/namespace_devs.c |1 -
 drivers/nvdimm/pmem.c   |4 +-
 drivers/nvdimm/region_devs.c|1 -
 drivers/nvdimm/x86.c|   65 +++
 fs/dax.c|1 -
 include/linux/libnvdimm.h   |9 +
 include/linux/pmem.h|   59 ---
 lib/Kconfig |3 --
 14 files changed, 83 insertions(+), 122 deletions(-)
 delete mode 100644 arch/x86/include/asm/pmem.h
 delete mode 100644 include/linux/pmem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 819d5e8b668a..1c4da1bebd7c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7458,8 +7458,6 @@ L:linux-nvd...@lists.01.org
 Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
 S: Supported
 F: drivers/nvdimm/pmem.c
-F: include/linux/pmem.h
-F: arch/*/include/asm/pmem.h
 
 LIGHTNVM PLATFORM SUPPORT
 M: Matias Bjorling 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..d377da696903 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -53,7 +53,6 @@ config X86
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOVif X86_64
select ARCH_HAS_MMIO_FLUSH
-   select ARCH_HAS_PMEM_APIif X86_64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
deleted file mode 100644
index ded2541a7ba9..
--- a/arch/x86/include/asm/pmem.h
+++ /dev/null
@@ -1,48 +0,0 @@
-/*
- * Copyright(c) 2015 Intel Corporation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- */
-#ifndef __ASM_X86_PMEM_H__
-#define __ASM_X86_PMEM_H__
-
-#include 
-#include 
-#include 
-#include 
-
-#ifdef CONFIG_ARCH_HAS_PMEM_API
-/**
- * arch_memcpy_to_pmem - copy data to persistent memory
- * @dst: destination buffer for the copy
- * @src: source buffer for the copy
- * @n: length of the copy in bytes
- *
- * Copy data to persistent memory media via non-temporal stores so that
- * a subsequent pmem driver flush operation will drain posted write queues.
- */
-static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
-{
-   int rem;
-
-   /*
-* We are copying between two kernel buffers, if
-* __copy_from_user_inatomic_nocache() returns an error (page
-* fault) we 

[PATCH v2 25/33] x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush

2017-04-14 Thread Dan Williams
The clear_pmem() helper simply combines a memset() plus a cache flush.
Now that the flush routine is optionally provided by the dax device
driver we can avoid unnecessary cache management on dax devices fronting
volatile memory.

With clear_pmem() gone we can follow on with a patch to make pmem cache
management completely defined within the pmem driver.
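The open-coded replacement is just a memset() followed by the device's optional flush. A minimal sketch, with `flush_fn` as a hypothetical stand-in for the dax_ops->flush hook:

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical hook type standing in for dax_ops->flush. */
typedef void (*flush_fn)(void *addr, size_t size);

/* What __dax_zero_page_range() now does in place of clear_pmem():
 * zero the range, then flush only when the device needs persistence. */
static void zero_range(void *addr, size_t size, flush_fn flush)
{
	memset(addr, 0, size);
	if (flush)
		flush(addr, size);
}
```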

Cc: 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/pmem.h |   13 -
 fs/dax.c|3 ++-
 include/linux/pmem.h|   21 -
 3 files changed, 2 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 60e8edbe0205..f4c119d253f3 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -65,19 +65,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t 
size)
clwb(p);
 }
 
-/**
- * arch_clear_pmem - zero a PMEM memory range
- * @addr:  virtual start address
- * @size:  number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- */
-static inline void arch_clear_pmem(void *addr, size_t size)
-{
-   memset(addr, 0, size);
-   arch_wb_cache_pmem(addr, size);
-}
-
 static inline void arch_invalidate_pmem(void *addr, size_t size)
 {
clflush_cache_range(addr, size);
diff --git a/fs/dax.c b/fs/dax.c
index edbf988de86c..edee7e8298bc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -982,7 +982,8 @@ int __dax_zero_page_range(struct block_device *bdev,
dax_read_unlock(id);
return rc;
}
-   clear_pmem(kaddr + offset, size);
+   memset(kaddr + offset, 0, size);
+   dax_flush(dax_dev, pgoff, kaddr + offset, size);
dax_read_unlock(id);
}
return 0;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 9d542a5600e4..772bd02a5b52 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,11 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void 
*src, size_t n)
BUG();
 }
 
-static inline void arch_clear_pmem(void *addr, size_t size)
-{
-   BUG();
-}
-
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
BUG();
@@ -73,22 +68,6 @@ static inline void memcpy_to_pmem(void *dst, const void 
*src, size_t n)
 }
 
 /**
- * clear_pmem - zero a PMEM memory range
- * @addr:  virtual start address
- * @size:  number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline void clear_pmem(void *addr, size_t size)
-{
-   if (arch_has_pmem_api())
-   arch_clear_pmem(addr, size);
-   else
-   memset(addr, 0, size);
-}
-
-/**
  * invalidate_pmem - flush a pmem range from the cache hierarchy
  * @addr:  virtual start address
  * @size:  bytes to invalidate (internally aligned to cache line size)



[PATCH v2 26/33] x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm

2017-04-14 Thread Dan Williams
With all calls to this routine re-directed through the pmem driver, we
can kill the pmem api indirection. arch_wb_cache_pmem() is now
optionally supplied by an arch specific extension to libnvdimm.  Same as
before, pmem flushing is only defined for x86_64, but it is
straightforward to add other archs in the future.
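The cache-line walk that arch_wb_cache_pmem() performs (round the start address down to a line boundary, step line by line to the end) can be checked in isolation. In this sketch the CLWB instruction is replaced by a counter, so only the iteration arithmetic is being demonstrated:

```c
#include <stdint.h>
#include <stddef.h>

/* Count how many cache lines a CLWB-style loop would touch for the
 * range [addr, addr + size). 'line' is the cache line size. */
static size_t count_cache_lines(uintptr_t addr, size_t size, size_t line)
{
	uintptr_t start = addr & ~((uintptr_t)line - 1);	/* round down */
	uintptr_t end = addr + size;
	uintptr_t p;
	size_t n = 0;

	for (p = start; p < end; p += line)
		n++;		/* a real loop issues clwb(p) here */
	return n;
}
```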

Cc: 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Cc: Oliver O'Halloran 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/pmem.h |   21 -
 drivers/nvdimm/Makefile |1 +
 drivers/nvdimm/pmem.c   |   14 +-
 drivers/nvdimm/pmem.h   |8 
 drivers/nvdimm/x86.c|   36 
 include/linux/pmem.h|   19 ---
 tools/testing/nvdimm/Kbuild |1 +
 7 files changed, 51 insertions(+), 49 deletions(-)
 create mode 100644 drivers/nvdimm/x86.c

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index f4c119d253f3..4759a179aa52 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,27 +44,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void 
*src, size_t n)
BUG();
 }
 
-/**
- * arch_wb_cache_pmem - write back a cache range with CLWB
- * @vaddr: virtual start address
- * @size:  number of bytes to write back
- *
- * Write back a cache range using the CLWB (cache line write back)
- * instruction. Note that @size is internally rounded up to be cache
- * line size aligned.
- */
-static inline void arch_wb_cache_pmem(void *addr, size_t size)
-{
-   u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
-   unsigned long clflush_mask = x86_clflush_size - 1;
-   void *vend = addr + size;
-   void *p;
-
-   for (p = (void *)((unsigned long)addr & ~clflush_mask);
-p < vend; p += x86_clflush_size)
-   clwb(p);
-}
-
 static inline void arch_invalidate_pmem(void *addr, size_t size)
 {
clflush_cache_range(addr, size);
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 909554c3f955..9eafb1dd2876 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -24,3 +24,4 @@ libnvdimm-$(CONFIG_ND_CLAIM) += claim.o
 libnvdimm-$(CONFIG_BTT) += btt_devs.o
 libnvdimm-$(CONFIG_NVDIMM_PFN) += pfn_devs.o
 libnvdimm-$(CONFIG_NVDIMM_DAX) += dax_devs.o
+libnvdimm-$(CONFIG_X86_64) += x86.o
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 822b85fb3365..c77a3a757729 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -245,19 +245,19 @@ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 
if (bytes < 8) {
if (!IS_ALIGNED(dest, 4) || (bytes != 4))
-   wb_cache_pmem(addr, 1);
+   arch_wb_cache_pmem(addr, 1);
} else {
if (!IS_ALIGNED(dest, 8)) {
dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
-   wb_cache_pmem(addr, 1);
+   arch_wb_cache_pmem(addr, 1);
}
 
flushed = dest - (unsigned long) addr;
if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
-   wb_cache_pmem(addr + bytes - 1, 1);
+   arch_wb_cache_pmem(addr + bytes - 1, 1);
}
} else
-   wb_cache_pmem(addr, bytes);
+   arch_wb_cache_pmem(addr, bytes);
 
return len;
 }
@@ -279,11 +279,7 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t size)
 {
-   /*
-* TODO: move arch specific cache management into the driver
-* directly.
-*/
-   wb_cache_pmem(addr, size);
+   arch_wb_cache_pmem(addr, size);
 }
 
 static const struct dax_operations pmem_dax_ops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 7f4dbd72a90a..c4b3371c7f88 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -5,6 +5,14 @@
 #include 
 #include 
 
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+void arch_wb_cache_pmem(void *addr, size_t size);
+#else
+static inline void arch_wb_cache_pmem(void *addr, size_t size)
+{
+}
+#endif
+
 /* this definition is in its own header for tools/testing/nvdimm to consume */
 struct pmem_device {
/* One contiguous memory region per device */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
new file mode 100644
index ..79d7267da4d2
--- /dev/null
+++ b/drivers/nvdimm/x86.c
@@ -0,0 +1,36 @@
+/*
+ * Copyright(c) 2015 - 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it 
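The clwb loop being moved into drivers/nvdimm/x86.c rounds the start address down to a cache-line boundary and steps one line at a time. A user-space sketch of that arithmetic (clwb modeled as a counter; names are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* count how many clwb() operations the arch_wb_cache_pmem() loop would
 * issue for a given range and cache-line size */
static size_t count_lines(uintptr_t addr, size_t size, size_t line)
{
	uintptr_t mask = line - 1;
	uintptr_t p, end = addr + size;
	size_t n = 0;

	for (p = addr & ~mask; p < end; p += line)
		n++;	/* the kernel issues clwb((void *)p) here */
	return n;
}
```

Note how a one-byte write that straddles no boundary still costs one full line, which is why the docbook comment said @size is "internally rounded up to be cache line size aligned".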

[PATCH v2 25/33] x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush

2017-04-14 Thread Dan Williams
The clear_pmem() helper simply combines a memset() plus a cache flush.
Now that the flush routine is optionally provided by the dax device
driver we can avoid unnecessary cache management on dax devices fronting
volatile memory.

With clear_pmem() gone we can follow on with a patch to make pmem cache
management completely defined within the pmem driver.

Cc: 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/pmem.h |   13 -
 fs/dax.c|3 ++-
 include/linux/pmem.h|   21 -
 3 files changed, 2 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 60e8edbe0205..f4c119d253f3 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -65,19 +65,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
clwb(p);
 }
 
-/**
- * arch_clear_pmem - zero a PMEM memory range
- * @addr:  virtual start address
- * @size:  number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- */
-static inline void arch_clear_pmem(void *addr, size_t size)
-{
-   memset(addr, 0, size);
-   arch_wb_cache_pmem(addr, size);
-}
-
 static inline void arch_invalidate_pmem(void *addr, size_t size)
 {
clflush_cache_range(addr, size);
diff --git a/fs/dax.c b/fs/dax.c
index edbf988de86c..edee7e8298bc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -982,7 +982,8 @@ int __dax_zero_page_range(struct block_device *bdev,
dax_read_unlock(id);
return rc;
}
-   clear_pmem(kaddr + offset, size);
+   memset(kaddr + offset, 0, size);
+   dax_flush(dax_dev, pgoff, kaddr + offset, size);
dax_read_unlock(id);
}
return 0;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 9d542a5600e4..772bd02a5b52 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,11 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
 }
 
-static inline void arch_clear_pmem(void *addr, size_t size)
-{
-   BUG();
-}
-
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
BUG();
@@ -73,22 +68,6 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
 }
 
 /**
- * clear_pmem - zero a PMEM memory range
- * @addr:  virtual start address
- * @size:  number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline void clear_pmem(void *addr, size_t size)
-{
-   if (arch_has_pmem_api())
-   arch_clear_pmem(addr, size);
-   else
-   memset(addr, 0, size);
-}
-
-/**
  * invalidate_pmem - flush a pmem range from the cache hierarchy
  * @addr:  virtual start address
  * @size:  bytes to invalidate (internally aligned to cache line size)
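The replacement pattern this patch introduces — an open-coded memset() followed by an explicit dax_flush() — can be sketched in user space as follows (fake_dax_flush() is an illustrative stand-in, not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static int flush_calls;

/* stand-in for dax_flush(); just records that a flush happened */
static void fake_dax_flush(void *addr, size_t size)
{
	(void)addr; (void)size;
	flush_calls++;
}

/* mirrors the new __dax_zero_page_range() body: plain memset(), then an
 * explicit flush, instead of the removed clear_pmem() helper */
static void zero_range(char *kaddr, size_t offset, size_t size, bool write_cache)
{
	memset(kaddr + offset, 0, size);
	if (write_cache)	/* volatile dax devices can skip this */
		fake_dax_flush(kaddr + offset, size);
}
```

Splitting the helper this way is what lets later patches in the series drop the flush entirely when the backing memory is volatile.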



[PATCH v2 28/33] x86, libnvdimm, dax: stop abusing __copy_user_nocache

2017-04-14 Thread Dan Williams
The pmem and nd_blk drivers both have need to copy data through the cpu
cache to persistent memory. To date they have been abusing
__copy_user_nocache through the memcpy_to_pmem abstraction, but this has
several problems:

* __copy_user_nocache does not guarantee that it will always avoid the
  cache. While we have fixed the cases where the pmem usage might
  trigger that behavior it's a fragile assumption and burdens the
  uaccess.h implementation with worrying about the distinction between
  'nocache' and the stricter write-through semantic needed by pmem.
  Quoting Linus: "Quite frankly, the whole "memcpy_nocache()" idea or
  (ab-)using copy_user_nocache() just needs to die. ... If some driver
  ends up using "movnt" by hand, that is up to that *driver*."

* It implements SMAP (supervisor mode access protection) which is only
  meant for user copies.

* It expects faults. For in-kernel copies, faults are fatal and we
  should not be coding for exception handling in that case.

__arch_memcpy_to_pmem() is effectively a copy of __copy_user_nocache()
minus SMAP, unaligned support, and exception handling. The configuration
symbol ARCH_HAS_PMEM_API is also moved local to libnvdimm to be next to
the implementation.

Cc: 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: Toshi Kani 
Cc: Tony Luck 
Cc: "H. Peter Anvin" 
Cc: Al Viro 
Cc: Thomas Gleixner 
Cc: Oliver O'Halloran 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Cc: Linus Torvalds 
Signed-off-by: Dan Williams 
---
 MAINTAINERS |2 -
 arch/x86/Kconfig|1 -
 arch/x86/include/asm/pmem.h |   48 -
 drivers/acpi/nfit/core.c|3 +-
 drivers/nvdimm/Kconfig  |4 ++
 drivers/nvdimm/claim.c  |4 +-
 drivers/nvdimm/namespace_devs.c |1 -
 drivers/nvdimm/pmem.c   |4 +-
 drivers/nvdimm/region_devs.c|1 -
 drivers/nvdimm/x86.c|   65 +++
 fs/dax.c|1 -
 include/linux/libnvdimm.h   |9 +
 include/linux/pmem.h|   59 ---
 lib/Kconfig |3 --
 14 files changed, 83 insertions(+), 122 deletions(-)
 delete mode 100644 arch/x86/include/asm/pmem.h
 delete mode 100644 include/linux/pmem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 819d5e8b668a..1c4da1bebd7c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7458,8 +7458,6 @@ L:linux-nvd...@lists.01.org
 Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
 S: Supported
 F: drivers/nvdimm/pmem.c
-F: include/linux/pmem.h
-F: arch/*/include/asm/pmem.h
 
 LIGHTNVM PLATFORM SUPPORT
 M: Matias Bjorling 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..d377da696903 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -53,7 +53,6 @@ config X86
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOVif X86_64
select ARCH_HAS_MMIO_FLUSH
-   select ARCH_HAS_PMEM_APIif X86_64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
deleted file mode 100644
index ded2541a7ba9..
--- a/arch/x86/include/asm/pmem.h
+++ /dev/null
@@ -1,48 +0,0 @@
-/*
- * Copyright(c) 2015 Intel Corporation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- */
-#ifndef __ASM_X86_PMEM_H__
-#define __ASM_X86_PMEM_H__
-
-#include 
-#include 
-#include 
-#include 
-
-#ifdef CONFIG_ARCH_HAS_PMEM_API
-/**
- * arch_memcpy_to_pmem - copy data to persistent memory
- * @dst: destination buffer for the copy
- * @src: source buffer for the copy
- * @n: length of the copy in bytes
- *
- * Copy data to persistent memory media via non-temporal stores so that
- * a subsequent pmem driver flush operation will drain posted write queues.
- */
-static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
-{
-   int rem;
-
-   /*
-* We are copying between two kernel buffers, if
-* __copy_from_user_inatomic_nocache() returns an error (page
-* fault) we would have already reported a general protection fault
-* before the WARN+BUG.
-*/
-   rem = __copy_from_user_inatomic_nocache(dst, (void __user *) src, n);
-   if (WARN(rem, "%s: fault copying %p <- %p unwritten: %d\n",
-   __func__, dst, src, rem))
-   BUG();
-}
-
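The contract the deleted arch_memcpy_to_pmem() enforced — an in-kernel copy must not fault, so any uncopied remainder is fatal — can be modeled in user space like this (copy_nocache() is an illustrative stand-in for __copy_from_user_inatomic_nocache(); the non-temporal store semantics are omitted):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* stand-in: returns the number of bytes NOT copied, like the uaccess
 * helper; plain memcpy here, no movnt in portable C */
static size_t copy_nocache(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

/* analog of the deleted wrapper: for kernel-to-kernel copies a nonzero
 * remainder means something is badly wrong (the kernel did WARN + BUG) */
static void memcpy_to_pmem_sketch(void *dst, const void *src, size_t n)
{
	size_t rem = copy_nocache(dst, src, n);
	assert(rem == 0);
}
```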

[PATCH v2 27/33] x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm

2017-04-14 Thread Dan Williams
Kill this globally defined wrapper and move to libnvdimm so that we can
ultimately remove the public pmem api.

Cc: 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/pmem.h |4 
 drivers/nvdimm/claim.c  |3 ++-
 drivers/nvdimm/pmem.c   |2 +-
 drivers/nvdimm/pmem.h   |4 
 drivers/nvdimm/x86.c|6 ++
 include/linux/pmem.h|   19 ---
 6 files changed, 13 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 4759a179aa52..ded2541a7ba9 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,9 +44,5 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
 }
 
-static inline void arch_invalidate_pmem(void *addr, size_t size)
-{
-   clflush_cache_range(addr, size);
-}
 #endif /* CONFIG_ARCH_HAS_PMEM_API */
 #endif /* __ASM_X86_PMEM_H__ */
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..1e13a196ce4b 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include "nd-core.h"
+#include "pmem.h"
 #include "pfn.h"
 #include "btt.h"
 #include "nd.h"
@@ -261,7 +262,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
cleared /= 512;
badblocks_clear(&nsio->bb, sector, cleared);
}
-   invalidate_pmem(nsio->addr + offset, size);
+   arch_invalidate_pmem(nsio->addr + offset, size);
} else
rc = -EIO;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c77a3a757729..769a510c20e8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -69,7 +69,7 @@ static int pmem_clear_poison(struct pmem_device *pmem, phys_addr_t offset,
badblocks_clear(&pmem->bb, sector, cleared);
}
 
-   invalidate_pmem(pmem->virt_addr + offset, len);
+   arch_invalidate_pmem(pmem->virt_addr + offset, len);
 
return rc;
 }
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index c4b3371c7f88..5900c1b7 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -7,10 +7,14 @@
 
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
+void arch_invalidate_pmem(void *addr, size_t size);
 #else
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
 }
+static inline void arch_invalidate_pmem(void *addr, size_t size)
+{
+}
 #endif
 
 /* this definition is in its own header for tools/testing/nvdimm to consume */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
index 79d7267da4d2..07478ed7ce97 100644
--- a/drivers/nvdimm/x86.c
+++ b/drivers/nvdimm/x86.c
@@ -34,3 +34,9 @@ void arch_wb_cache_pmem(void *addr, size_t size)
clwb(p);
 }
 EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);
+
+void arch_invalidate_pmem(void *addr, size_t size)
+{
+   clflush_cache_range(addr, size);
+}
+EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 33ae761f010a..559c00848583 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -30,11 +30,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
 {
BUG();
 }
-
-static inline void arch_invalidate_pmem(void *addr, size_t size)
-{
-   BUG();
-}
 #endif
 
 static inline bool arch_has_pmem_api(void)
@@ -61,18 +56,4 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
else
memcpy(dst, src, n);
 }
-
-/**
- * invalidate_pmem - flush a pmem range from the cache hierarchy
- * @addr:  virtual start address
- * @size:  bytes to invalidate (internally aligned to cache line size)
- *
- * For platforms that support clearing poison this flushes any poisoned
- * ranges out of the cache
- */
-static inline void invalidate_pmem(void *addr, size_t size)
-{
-   if (arch_has_pmem_api())
-   arch_invalidate_pmem(addr, size);
-}
 #endif /* __PMEM_H__ */
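The pmem.h hunk above follows a common kernel pattern: declare the real function when the architecture opts in, and provide an empty inline stub otherwise, so callers need no #ifdefs. A minimal sketch of that pattern (ARCH_HAS_PMEM_API is an illustrative macro standing in for the Kconfig symbol):

```c
#include <assert.h>
#include <stddef.h>

/* flip to 0 to get the no-op stub that non-pmem architectures see */
#define ARCH_HAS_PMEM_API 1

static int invalidated;

#if ARCH_HAS_PMEM_API
static inline void arch_invalidate_pmem(void *addr, size_t size)
{
	(void)addr; (void)size;
	invalidated = 1;	/* kernel: clflush_cache_range(addr, size) */
}
#else
static inline void arch_invalidate_pmem(void *addr, size_t size)
{
	(void)addr; (void)size;	/* nothing to do without the pmem API */
}
#endif
```

Either way, nsio_rw_bytes() and pmem_clear_poison() call arch_invalidate_pmem() unconditionally and the compiler discards the stub.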




[PATCH v2 30/33] libnvdimm, pmem: fix persistence warning

2017-04-14 Thread Dan Williams
The pmem driver assumes if platform firmware describes the memory
devices associated with a persistent memory range and
CONFIG_ARCH_HAS_PMEM_API=y that it has all the mechanism necessary to
flush data to a power-fail safe zone. We warn if the firmware does not
describe memory devices, but we also need to warn if the architecture
does not claim pmem support.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/region_devs.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 307a48060aa3..5976f6c0407f 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -970,8 +970,9 @@ int nvdimm_has_flush(struct nd_region *nd_region)
struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
int i;
 
-   /* no nvdimm == flushing capability unknown */
-   if (nd_region->ndr_mappings == 0)
+   /* no nvdimm or pmem api == flushing capability unknown */
+   if (nd_region->ndr_mappings == 0
+   || !IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API))
return -ENXIO;
 
for (i = 0; i < nd_region->ndr_mappings; i++)
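The return convention this hunk relies on — a negative value from nvdimm_has_flush() means "capability unknown", while 0/1 answer the question directly — is consumed later by pmem_attach_disk(). A stand-in for that consumer logic (names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* has_flush mimics nvdimm_has_flush(): <0 (e.g. -ENXIO) = unknown,
 * 0/1 map directly onto the FUA flag */
static int pick_fua(int has_flush, bool *warned)
{
	int fua = 0;

	if (has_flush < 0)
		*warned = true;	/* "unable to guarantee persistence of writes" */
	else
		fua = has_flush;
	return fua;
}
```

With this patch the unknown case now also covers architectures that never selected ARCH_HAS_PMEM_API, not just regions with no DIMM mappings.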



[PATCH v2 31/33] libnvdimm, nfit: enable support for volatile ranges

2017-04-14 Thread Dan Williams
Allow volatile nfit ranges to participate in all the same infrastructure
provided for persistent memory regions. A resulting namespace
device will still be called "pmem", but the parent region type will be
"nd_volatile". This is in preparation for disabling the dax ->flush()
operation in the pmem driver when it is hosted on a volatile range.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/acpi/nfit/core.c|9 -
 drivers/nvdimm/bus.c|   10 +-
 drivers/nvdimm/core.c   |2 +-
 drivers/nvdimm/dax_devs.c   |2 +-
 drivers/nvdimm/dimm_devs.c  |2 +-
 drivers/nvdimm/namespace_devs.c |8 
 drivers/nvdimm/nd-core.h|9 +
 drivers/nvdimm/pfn_devs.c   |4 ++--
 drivers/nvdimm/region_devs.c|   27 ++-
 9 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 8b4c6212737c..6ac31846c4df 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2162,6 +2162,13 @@ static bool nfit_spa_is_virtual(struct acpi_nfit_system_address *spa)
nfit_spa_type(spa) == NFIT_SPA_PCD);
 }
 
+static bool nfit_spa_is_volatile(struct acpi_nfit_system_address *spa)
+{
+   return (nfit_spa_type(spa) == NFIT_SPA_VDISK ||
+   nfit_spa_type(spa) == NFIT_SPA_VCD   ||
+   nfit_spa_type(spa) == NFIT_SPA_VOLATILE);
+}
+
 static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
struct nfit_spa *nfit_spa)
 {
@@ -2236,7 +2243,7 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
ndr_desc);
if (!nfit_spa->nd_region)
rc = -ENOMEM;
-   } else if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE) {
+   } else if (nfit_spa_is_volatile(spa)) {
nfit_spa->nd_region = nvdimm_volatile_region_create(nvdimm_bus,
ndr_desc);
if (!nfit_spa->nd_region)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 351bac8f6503..d4173fbdba28 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -37,13 +37,13 @@ static int to_nd_device_type(struct device *dev)
 {
if (is_nvdimm(dev))
return ND_DEVICE_DIMM;
-   else if (is_nd_pmem(dev))
+   else if (is_memory(dev))
return ND_DEVICE_REGION_PMEM;
else if (is_nd_blk(dev))
return ND_DEVICE_REGION_BLK;
else if (is_nd_dax(dev))
return ND_DEVICE_DAX_PMEM;
-   else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
+   else if (is_nd_region(dev->parent))
return nd_region_to_nstype(to_nd_region(dev->parent));
 
return 0;
@@ -55,7 +55,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
 * Ensure that region devices always have their numa node set as
 * early as possible.
 */
-   if (is_nd_pmem(dev) || is_nd_blk(dev))
+   if (is_nd_region(dev))
set_dev_node(dev, to_nd_region(dev)->numa_node);
return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT,
to_nd_device_type(dev));
@@ -64,7 +64,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
 static struct module *to_bus_provider(struct device *dev)
 {
/* pin bus providers while regions are enabled */
-   if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+   if (is_nd_region(dev)) {
struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
 
return nvdimm_bus->nd_desc->module;
@@ -771,7 +771,7 @@ void wait_nvdimm_bus_probe_idle(struct device *dev)
 
 static int pmem_active(struct device *dev, void *data)
 {
-   if (is_nd_pmem(dev) && dev->driver)
+   if (is_memory(dev) && dev->driver)
return -EBUSY;
return 0;
 }
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 9303cfeb8bee..875ef4cecb35 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -504,7 +504,7 @@ void nvdimm_badblocks_populate(struct nd_region *nd_region,
struct nvdimm_bus *nvdimm_bus;
struct list_head *poison_list;
 
-   if (!is_nd_pmem(&nd_region->dev)) {
+   if (!is_memory(&nd_region->dev)) {
dev_WARN_ONCE(&nd_region->dev, 1,
"%s only valid for pmem regions\n", __func__);
return;
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
index 45fa82cae87c..6a92b84c8072 100644
--- a/drivers/nvdimm/dax_devs.c
+++ b/drivers/nvdimm/dax_devs.c
@@ -89,7 +89,7 @@ struct device *nd_dax_create(struct nd_region 
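The new nfit_spa_is_volatile() predicate above groups the three SPA range types that should be driven as volatile regions. A self-contained sketch (the enum values here are illustrative; the real constants live in drivers/acpi/nfit/nfit.h):

```c
#include <assert.h>
#include <stdbool.h>

/* illustrative subset of the NFIT SPA range types */
enum nfit_spa {
	NFIT_SPA_VOLATILE, NFIT_SPA_PM, NFIT_SPA_DCR,
	NFIT_SPA_VDISK, NFIT_SPA_VCD,
};

/* virtual disk/CD ranges and plain volatile ranges all get an
 * nd_volatile parent region instead of an nd_pmem one */
static bool nfit_spa_is_volatile(enum nfit_spa type)
{
	return type == NFIT_SPA_VDISK || type == NFIT_SPA_VCD ||
	       type == NFIT_SPA_VOLATILE;
}
```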

[PATCH v2 32/33] filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC

2017-04-14 Thread Dan Williams
Some platforms arrange for cpu caches to be flushed on power-fail. On
those platforms there is no requirement that the kernel track and flush
potentially dirty cache lines. Given that we still insert entries into
the radix for locking purposes this patch only disables the cache flush
loop, not the dirty tracking.

Userspace can override the default cache setting via the block device
queue "write_cache" attribute in sysfs.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 fs/dax.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f37ed21e4093..5b7ee1bc74d0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -797,7 +797,8 @@ static int dax_writeback_one(struct block_device *bdev,
}
 
dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-   dax_flush(dax_dev, pgoff, kaddr, size);
+   if (test_bit(QUEUE_FLAG_WC, >bd_queue->queue_flags))
+   dax_flush(dax_dev, pgoff, kaddr, size);
/*
 * After we have flushed the cache, we can clear the dirty tag. There
 * cannot be new dirty data in the pfn after the flush has completed as
@@ -982,7 +983,8 @@ int __dax_zero_page_range(struct block_device *bdev,
return rc;
}
memset(kaddr + offset, 0, size);
-   dax_flush(dax_dev, pgoff, kaddr + offset, size);
+   if (test_bit(QUEUE_FLAG_WC, >bd_queue->queue_flags))
+   dax_flush(dax_dev, pgoff, kaddr + offset, size);
dax_read_unlock(id);
}
return 0;
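The gating added in both hunks — skip the flush, keep everything else — reduces to a single bit test. A user-space sketch (test_bit() and dax_flush() are modeled with plain C; the QUEUE_FLAG_WC bit position is illustrative):

```c
#include <assert.h>

#define QUEUE_FLAG_WC 0	/* illustrative bit position */

static int flushes;

static void dax_flush_stub(void)
{
	flushes++;
}

/* only the flush is gated on the queue's write-cache flag; the dirty-tag
 * clearing that follows it in dax_writeback_one() stays unconditional */
static void writeback(unsigned long queue_flags)
{
	if (queue_flags & (1UL << QUEUE_FLAG_WC))
		dax_flush_stub();
}
```

Since userspace can toggle the queue's "write_cache" attribute in sysfs, this one bit is also the administrative override described in the changelog.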



[PATCH v2 33/33] libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region

2017-04-14 Thread Dan Williams
The pmem driver attaches to both persistent and volatile memory ranges
advertised by the ACPI NFIT. When the region is volatile it is redundant
to spend cycles flushing caches at fsync(). Check if the hosting region
is volatile and do not set QUEUE_FLAG_WC if it is.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pmem.c|9 +++--
 drivers/nvdimm/region_devs.c |6 ++
 include/linux/libnvdimm.h|1 +
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b000c6db5731..42876a75dab8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -275,6 +275,7 @@ static int pmem_attach_disk(struct device *dev,
struct vmem_altmap __altmap, *altmap = NULL;
struct resource *res = &nsio->res;
struct nd_pfn *nd_pfn = NULL;
+   int has_flush, fua = 0, wbc;
struct dax_device *dax_dev;
int nid = dev_to_node(dev);
struct nd_pfn_sb *pfn_sb;
@@ -302,8 +303,12 @@ static int pmem_attach_disk(struct device *dev,
dev_set_drvdata(dev, pmem);
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
-   if (nvdimm_has_flush(nd_region) < 0)
+   has_flush = nvdimm_has_flush(nd_region);
+   if (has_flush < 0)
dev_warn(dev, "unable to guarantee persistence of writes\n");
+   else
+   fua = has_flush;
+   wbc = nvdimm_has_cache(nd_region);
 
if (!devm_request_mem_region(dev, res->start, resource_size(res),
dev_name(&ndns->dev))) {
@@ -344,7 +349,7 @@ static int pmem_attach_disk(struct device *dev,
return PTR_ERR(addr);
pmem->virt_addr = addr;
 
-   blk_queue_write_cache(q, true, true);
+   blk_queue_write_cache(q, wbc, fua);
blk_queue_make_request(q, pmem_make_request);
blk_queue_physical_block_size(q, PAGE_SIZE);
blk_queue_max_hw_sectors(q, UINT_MAX);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 2df259010720..a085f7094b76 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -989,6 +989,12 @@ int nvdimm_has_flush(struct nd_region *nd_region)
 }
 EXPORT_SYMBOL_GPL(nvdimm_has_flush);
 
+int nvdimm_has_cache(struct nd_region *nd_region)
+{
return is_nd_pmem(&nd_region->dev);
+}
+EXPORT_SYMBOL_GPL(nvdimm_has_cache);
+
 void __exit nd_region_devs_exit(void)
 {
ida_destroy(_ida);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index a98004745768..b733030107bb 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -162,6 +162,7 @@ void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 u64 nd_fletcher64(void *addr, size_t len, bool le);
 void nvdimm_flush(struct nd_region *nd_region);
 int nvdimm_has_flush(struct nd_region *nd_region);
+int nvdimm_has_cache(struct nd_region *nd_region);
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_memcpy_to_pmem(void *dst, void *src, unsigned size);
 #define ARCH_MEMREMAP_PMEM MEMREMAP_WB
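After this patch pmem_attach_disk() derives both arguments to blk_queue_write_cache() from the region: fua from nvdimm_has_flush() (negative means unknown, treated as off) and wbc from nvdimm_has_cache(), which simply asks whether the hosting region is persistent. A stand-in for that decision (names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* stand-in for nvdimm_has_cache(): true only for pmem (persistent)
 * regions, false for nd_volatile regions */
static bool nvdimm_has_cache_sketch(bool persistent)
{
	return persistent;
}

/* derive the blk_queue_write_cache(q, wbc, fua) arguments */
static void pick_cache_flags(bool persistent, int has_flush,
			     bool *wbc, bool *fua)
{
	*fua = has_flush > 0;	/* unknown (<0) warns and leaves FUA off */
	*wbc = nvdimm_has_cache_sketch(persistent);
}
```

A volatile region thus gets neither flag, so fsync() on a file backed by it spends no cycles on cache flushes.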



[PATCH v2 30/33] libnvdimm, pmem: fix persistence warning

2017-04-14 Thread Dan Williams
The pmem driver assumes if platform firmware describes the memory
devices associated with a persistent memory range and
CONFIG_ARCH_HAS_PMEM_API=y that it has all the mechanism necessary to
flush data to a power-fail safe zone. We warn if the firmware does not
describe memory devices, but we also need to warn if the architecture
does not claim pmem support.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/region_devs.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 307a48060aa3..5976f6c0407f 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -970,8 +970,9 @@ int nvdimm_has_flush(struct nd_region *nd_region)
struct nd_region_data *ndrd = dev_get_drvdata(_region->dev);
int i;
 
-   /* no nvdimm == flushing capability unknown */
-   if (nd_region->ndr_mappings == 0)
+   /* no nvdimm or pmem api == flushing capability unknown */
+   if (nd_region->ndr_mappings == 0
+   || !IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API))
return -ENXIO;
 
for (i = 0; i < nd_region->ndr_mappings; i++)



[PATCH v2 31/33] libnvdimm, nfit: enable support for volatile ranges

2017-04-14 Thread Dan Williams
Allow volatile nfit ranges to participate in all the same infrastructure
provided for persistent memory regions. A resulting resulting namespace
device will still be called "pmem", but the parent region type will be
"nd_volatile". This is in preparation for disabling the dax ->flush()
operation in the pmem driver when it is hosted on a volatile range.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/acpi/nfit/core.c|9 -
 drivers/nvdimm/bus.c|   10 +-
 drivers/nvdimm/core.c   |2 +-
 drivers/nvdimm/dax_devs.c   |2 +-
 drivers/nvdimm/dimm_devs.c  |2 +-
 drivers/nvdimm/namespace_devs.c |8 
 drivers/nvdimm/nd-core.h|9 +
 drivers/nvdimm/pfn_devs.c   |4 ++--
 drivers/nvdimm/region_devs.c|   27 ++-
 9 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 8b4c6212737c..6ac31846c4df 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2162,6 +2162,13 @@ static bool nfit_spa_is_virtual(struct 
acpi_nfit_system_address *spa)
nfit_spa_type(spa) == NFIT_SPA_PCD);
 }
 
+static bool nfit_spa_is_volatile(struct acpi_nfit_system_address *spa)
+{
+   return (nfit_spa_type(spa) == NFIT_SPA_VDISK ||
+   nfit_spa_type(spa) == NFIT_SPA_VCD   ||
+   nfit_spa_type(spa) == NFIT_SPA_VOLATILE);
+}
+
 static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
struct nfit_spa *nfit_spa)
 {
@@ -2236,7 +2243,7 @@ static int acpi_nfit_register_region(struct 
acpi_nfit_desc *acpi_desc,
ndr_desc);
if (!nfit_spa->nd_region)
rc = -ENOMEM;
-   } else if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE) {
+   } else if (nfit_spa_is_volatile(spa)) {
nfit_spa->nd_region = nvdimm_volatile_region_create(nvdimm_bus,
ndr_desc);
if (!nfit_spa->nd_region)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 351bac8f6503..d4173fbdba28 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -37,13 +37,13 @@ static int to_nd_device_type(struct device *dev)
 {
if (is_nvdimm(dev))
return ND_DEVICE_DIMM;
-   else if (is_nd_pmem(dev))
+   else if (is_memory(dev))
return ND_DEVICE_REGION_PMEM;
else if (is_nd_blk(dev))
return ND_DEVICE_REGION_BLK;
else if (is_nd_dax(dev))
return ND_DEVICE_DAX_PMEM;
-   else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
+   else if (is_nd_region(dev->parent))
return nd_region_to_nstype(to_nd_region(dev->parent));
 
return 0;
@@ -55,7 +55,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct 
kobj_uevent_env *env)
 * Ensure that region devices always have their numa node set as
 * early as possible.
 */
-   if (is_nd_pmem(dev) || is_nd_blk(dev))
+   if (is_nd_region(dev))
set_dev_node(dev, to_nd_region(dev)->numa_node);
return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT,
to_nd_device_type(dev));
@@ -64,7 +64,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct 
kobj_uevent_env *env)
 static struct module *to_bus_provider(struct device *dev)
 {
/* pin bus providers while regions are enabled */
-   if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+   if (is_nd_region(dev)) {
struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
 
return nvdimm_bus->nd_desc->module;
@@ -771,7 +771,7 @@ void wait_nvdimm_bus_probe_idle(struct device *dev)
 
 static int pmem_active(struct device *dev, void *data)
 {
-   if (is_nd_pmem(dev) && dev->driver)
+   if (is_memory(dev) && dev->driver)
return -EBUSY;
return 0;
 }
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 9303cfeb8bee..875ef4cecb35 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -504,7 +504,7 @@ void nvdimm_badblocks_populate(struct nd_region *nd_region,
struct nvdimm_bus *nvdimm_bus;
struct list_head *poison_list;
 
-   if (!is_nd_pmem(&nd_region->dev)) {
+   if (!is_memory(&nd_region->dev)) {
dev_WARN_ONCE(&nd_region->dev, 1,
"%s only valid for pmem regions\n", __func__);
return;
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
index 45fa82cae87c..6a92b84c8072 100644
--- a/drivers/nvdimm/dax_devs.c
+++ b/drivers/nvdimm/dax_devs.c
@@ -89,7 +89,7 @@ struct device *nd_dax_create(struct nd_region *nd_region)
struct device *dev = NULL;
struct nd_dax *nd_dax;
 
-   if (!is_nd_pmem(&nd_region->dev))
+   if (!is_memory(&nd_region->dev))

[PATCH v2 32/33] filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC

2017-04-14 Thread Dan Williams
Some platforms arrange for cpu caches to be flushed on power-fail. On
those platforms there is no requirement that the kernel track and flush
potentially dirty cache lines. Given that we still insert entries into
the radix for locking purposes this patch only disables the cache flush
loop, not the dirty tracking.

Userspace can override the default cache setting via the block device
queue "write_cache" attribute in sysfs.
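
As a rough userspace sketch (hypothetical names, not the kernel API), the gate this patch adds amounts to: always do the radix-entry bookkeeping, but only call dax_flush() when the queue advertises a volatile write cache.

```python
QUEUE_FLAG_WC = 1 << 0  # stand-in bit for the kernel's write-cache queue flag

def writeback_one(queue_flags, flush_calls):
    """Model of dax_writeback_one() after this patch: the dirty entry is
    always cleaned, but the cache-flush loop runs only when QUEUE_FLAG_WC
    is set on the backing queue."""
    if queue_flags & QUEUE_FLAG_WC:
        flush_calls.append("dax_flush")
    return "entry-cleaned"

calls = []
assert writeback_one(QUEUE_FLAG_WC, calls) == "entry-cleaned"
assert calls == ["dax_flush"]

calls = []
assert writeback_one(0, calls) == "entry-cleaned"  # cache-less media: no flush
assert calls == []
```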

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 fs/dax.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f37ed21e4093..5b7ee1bc74d0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -797,7 +797,8 @@ static int dax_writeback_one(struct block_device *bdev,
}
 
dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-   dax_flush(dax_dev, pgoff, kaddr, size);
+   if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
+   dax_flush(dax_dev, pgoff, kaddr, size);
/*
 * After we have flushed the cache, we can clear the dirty tag. There
 * cannot be new dirty data in the pfn after the flush has completed as
@@ -982,7 +983,8 @@ int __dax_zero_page_range(struct block_device *bdev,
return rc;
}
memset(kaddr + offset, 0, size);
-   dax_flush(dax_dev, pgoff, kaddr + offset, size);
+   if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
+   dax_flush(dax_dev, pgoff, kaddr + offset, size);
dax_read_unlock(id);
}
return 0;



[PATCH v2 33/33] libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region

2017-04-14 Thread Dan Williams
The pmem driver attaches to both persistent and volatile memory ranges
advertised by the ACPI NFIT. When the region is volatile it is redundant
to spend cycles flushing caches at fsync(). Check if the hosting region
is volatile and do not set QUEUE_FLAG_WC if it is.
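
A small sketch of the decision the diff below encodes in pmem_attach_disk() (names are illustrative): FUA is only enabled when nvdimm_has_flush() reports support, and the write-cache flag follows nvdimm_has_cache(), which is true only for persistent (pmem) regions.

```python
def pmem_queue_flags(has_flush, has_cache):
    """Model of the wbc/fua selection in this patch:
    has_flush  - nvdimm_has_flush() result (negative means unsupported)
    has_cache  - nvdimm_has_cache() result (0 for volatile regions)
    Returns (wbc, fua) as passed to blk_queue_write_cache()."""
    fua = has_flush if has_flush >= 0 else 0
    wbc = has_cache
    return wbc, fua

assert pmem_queue_flags(has_flush=1, has_cache=1) == (1, 1)   # persistent + flush
assert pmem_queue_flags(has_flush=-1, has_cache=1) == (1, 0)  # no flush guarantee
assert pmem_queue_flags(has_flush=0, has_cache=0) == (0, 0)   # volatile region
```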

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pmem.c|9 +++--
 drivers/nvdimm/region_devs.c |6 ++
 include/linux/libnvdimm.h|1 +
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b000c6db5731..42876a75dab8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -275,6 +275,7 @@ static int pmem_attach_disk(struct device *dev,
struct vmem_altmap __altmap, *altmap = NULL;
struct resource *res = &nsio->res;
struct nd_pfn *nd_pfn = NULL;
+   int has_flush, fua = 0, wbc;
struct dax_device *dax_dev;
int nid = dev_to_node(dev);
struct nd_pfn_sb *pfn_sb;
@@ -302,8 +303,12 @@ static int pmem_attach_disk(struct device *dev,
dev_set_drvdata(dev, pmem);
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
-   if (nvdimm_has_flush(nd_region) < 0)
+   has_flush = nvdimm_has_flush(nd_region);
+   if (has_flush < 0)
dev_warn(dev, "unable to guarantee persistence of writes\n");
+   else
+   fua = has_flush;
+   wbc = nvdimm_has_cache(nd_region);
 
if (!devm_request_mem_region(dev, res->start, resource_size(res),
dev_name(&ndns->dev))) {
@@ -344,7 +349,7 @@ static int pmem_attach_disk(struct device *dev,
return PTR_ERR(addr);
pmem->virt_addr = addr;
 
-   blk_queue_write_cache(q, true, true);
+   blk_queue_write_cache(q, wbc, fua);
blk_queue_make_request(q, pmem_make_request);
blk_queue_physical_block_size(q, PAGE_SIZE);
blk_queue_max_hw_sectors(q, UINT_MAX);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 2df259010720..a085f7094b76 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -989,6 +989,12 @@ int nvdimm_has_flush(struct nd_region *nd_region)
 }
 EXPORT_SYMBOL_GPL(nvdimm_has_flush);
 
+int nvdimm_has_cache(struct nd_region *nd_region)
+{
+   return is_nd_pmem(&nd_region->dev);
+}
+EXPORT_SYMBOL_GPL(nvdimm_has_cache);
+
 void __exit nd_region_devs_exit(void)
 {
ida_destroy(&region_ida);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index a98004745768..b733030107bb 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -162,6 +162,7 @@ void nd_region_release_lane(struct nd_region *nd_region, 
unsigned int lane);
 u64 nd_fletcher64(void *addr, size_t len, bool le);
 void nvdimm_flush(struct nd_region *nd_region);
 int nvdimm_has_flush(struct nd_region *nd_region);
+int nvdimm_has_cache(struct nd_region *nd_region);
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_memcpy_to_pmem(void *dst, void *src, unsigned size);
 #define ARCH_MEMREMAP_PMEM MEMREMAP_WB



[PATCH v2 29/33] uio, libnvdimm, pmem: implement cache bypass for all copy_from_iter() operations

2017-04-14 Thread Dan Williams
Introduce copy_from_iter_ops() to enable passing custom sub-routines to
iterate_and_advance(). Define pmem operations that guarantee cache
bypass to supplement the existing usage of __copy_from_iter_nocache()
backed by arch_wb_cache_pmem().
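
The alignment rules spelled out in the comment being moved below can be modeled in a few lines of userspace Python (hypothetical helper, assumed 64-byte cache lines): naturally aligned 4-byte copies and 8-byte-aligned multiples of 8 go out via non-temporal stores and need no write-back, while misaligned heads and odd tails fall back to cached stores that must be flushed manually.

```python
CLFLUSH = 64  # assumed cache-line size (boot_cpu_data.x86_clflush_size)

def needs_manual_flush(dest, nbytes):
    """Model of the iovec flush logic: return the (addr, len) points that
    need an explicit cache write-back after a nocache copy to `dest`."""
    out = []
    if nbytes < 8:
        # only an aligned 4-byte store is guaranteed non-temporal
        if dest % 4 or nbytes != 4:
            out.append((dest, 1))
    else:
        start = dest
        if dest % 8:
            # misaligned head: flush it, then continue from the next line
            start = -(-dest // CLFLUSH) * CLFLUSH  # ALIGN(dest, CLFLUSH)
            out.append((dest, 1))
        flushed = start - dest
        if nbytes > flushed and (nbytes - flushed) % 8:
            out.append((dest + nbytes - 1, 1))    # odd tail
    return out

assert needs_manual_flush(0x1000, 4096) == []          # aligned, multiple of 8
assert needs_manual_flush(0x1000, 4) == []             # aligned 4-byte store
assert needs_manual_flush(0x1001, 4) == [(0x1001, 1)]  # misaligned head
assert needs_manual_flush(0x1000, 7) == [(0x1000, 1)]  # short odd copy
```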

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Toshi Kani 
Cc: Al Viro 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Cc: Linus Torvalds 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/Kconfig |1 +
 drivers/nvdimm/pmem.c  |   38 +-
 drivers/nvdimm/pmem.h  |7 +++
 drivers/nvdimm/x86.c   |   48 
 include/linux/uio.h|4 
 lib/Kconfig|3 +++
 lib/iov_iter.c |   25 +
 7 files changed, 89 insertions(+), 37 deletions(-)

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 4d45196d6f94..28002298cdc8 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -38,6 +38,7 @@ config BLK_DEV_PMEM
 
 config ARCH_HAS_PMEM_API
depends on X86_64
+   select COPY_FROM_ITER_OPS
def_bool y
 
 config ND_BLK
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 329895ca88e1..b000c6db5731 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -223,43 +223,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, 
pgoff_t pgoff,
 static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
 {
-   size_t len;
-
-   /* TODO: skip the write-back by always using non-temporal stores */
-   len = copy_from_iter_nocache(addr, bytes, i);
-
-   /*
-* In the iovec case on x86_64 copy_from_iter_nocache() uses
-* non-temporal stores for the bulk of the transfer, but we need
-* to manually flush if the transfer is unaligned. A cached
-* memory copy is used when destination or size is not naturally
-* aligned. That is:
-*   - Require 8-byte alignment when size is 8 bytes or larger.
-*   - Require 4-byte alignment when size is 4 bytes.
-*
-* In the non-iovec case the entire destination needs to be
-* flushed.
-*/
-   if (iter_is_iovec(i)) {
-   unsigned long flushed, dest = (unsigned long) addr;
-
-   if (bytes < 8) {
-   if (!IS_ALIGNED(dest, 4) || (bytes != 4))
-   arch_wb_cache_pmem(addr, 1);
-   } else {
-   if (!IS_ALIGNED(dest, 8)) {
-   dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
-   arch_wb_cache_pmem(addr, 1);
-   }
-
-   flushed = dest - (unsigned long) addr;
-   if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
-   arch_wb_cache_pmem(addr + bytes - 1, 1);
-   }
-   } else
-   arch_wb_cache_pmem(addr, bytes);
-
-   return len;
+   return arch_copy_from_iter_pmem(addr, bytes, i);
 }
 
 static const struct block_device_operations pmem_fops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 5900c1b7..574b63fb5376 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -3,11 +3,13 @@
 #include 
 #include 
 #include 
+#include <linux/uio.h>
 #include 
 
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void arch_invalidate_pmem(void *addr, size_t size);
+size_t arch_copy_from_iter_pmem(void *addr, size_t bytes, struct iov_iter *i);
 #else
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
@@ -15,6 +17,11 @@ static inline void arch_wb_cache_pmem(void *addr, size_t 
size)
 static inline void arch_invalidate_pmem(void *addr, size_t size)
 {
 }
+static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
+   struct iov_iter *i)
+{
+   return copy_from_iter_nocache(addr, bytes, i);
+}
 #endif
 
 /* this definition is in it's own header for tools/testing/nvdimm to consume */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
index d99b452332a9..bc145d760d43 100644
--- a/drivers/nvdimm/x86.c
+++ b/drivers/nvdimm/x86.c
@@ -10,6 +10,9 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -105,3 +108,48 @@ void arch_memcpy_to_pmem(void *_dst, void *_src, unsigned 
size)
}
 }
 EXPORT_SYMBOL_GPL(arch_memcpy_to_pmem);
+
+static int pmem_from_user(void *dst, const void __user *src, unsigned size)
+{
+   unsigned long flushed, dest = (unsigned long) dst;
+   int rc = __copy_from_user_nocache(dst, src, size);
+
+   /*
+* On x86_64 __copy_from_user_nocache() uses non-temporal stores
+* for the bulk of the transfer, but we need to manually flush

[PATCH v2 24/33] filesystem-dax: convert to dax_flush()

2017-04-14 Thread Dan Williams
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so the dax_flush()
helper skips cache management work when the underlying driver does not
specify a flush method.

We still do all the dirty tracking since the radix entry will already be
there for locking purposes. However, the work to clean the entry will be
a nop for some dax drivers.
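
The dispatch described above can be sketched as an optional-op lookup (illustrative names only, not the kernel structures): drivers that supply a flush method get called, everyone else gets a nop.

```python
class DaxDevice:
    """Toy model of a dax_device whose ops table may omit ->flush."""
    def __init__(self, flush_op=None):
        self.flush_op = flush_op
        self.flushes = []

def dax_flush(dev, pgoff, addr, size):
    # model of the patched dax_flush(): nop when no flush op is registered
    if dev.flush_op is not None:
        dev.flush_op(dev, pgoff, addr, size)

pmem = DaxDevice(flush_op=lambda d, p, a, s: d.flushes.append((a, s)))
brd = DaxDevice()  # e.g. a ram-backed dax device with no flush method

dax_flush(pmem, 0, 0x1000, 64)
dax_flush(brd, 0, 0x1000, 64)
assert pmem.flushes == [(0x1000, 64)]
assert brd.flushes == []  # cleaning the entry is a nop for this driver
```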

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 fs/dax.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b9909c91df..edbf988de86c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -798,7 +798,7 @@ static int dax_writeback_one(struct block_device *bdev,
}
 
dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-   wb_cache_pmem(kaddr, size);
+   dax_flush(dax_dev, pgoff, kaddr, size);
/*
 * After we have flushed the cache, we can clear the dirty tag. There
 * cannot be new dirty data in the pfn after the flush has completed as




[PATCH v2 20/33] dm: add ->copy_from_iter() dax operation support

2017-04-14 Thread Dan Williams
Allow device-mapper to route copy_from_iter operations to the
per-target implementation. In order for the device stacking to work we
need a dax_dev and a pgoff relative to that device. This gives each
layer of the stack the information it needs to look up the operation
pointer for the next level.

This conceptually allows for an array of mixed device drivers with
varying copy_from_iter implementations.
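
The per-layer translation described above boils down to sector/pgoff arithmetic; a userspace sketch (assuming 4KiB pages and 512-byte sectors, mirroring but not reproducing bdev_dax_pgoff()):

```python
PAGE_SIZE = 4096
PAGE_SECTORS = PAGE_SIZE // 512  # 8 sectors per page

def bdev_dax_pgoff(part_start_sector, dev_sector):
    """Model of the pgoff each stacked layer re-derives: convert a sector on
    the whole disk back to a page offset, failing (None) when the partition
    start leaves the address page-misaligned."""
    phys = (part_start_sector + dev_sector) * 512
    if phys % PAGE_SIZE:
        return None  # unaligned: the dm helpers return 0 bytes copied
    return phys // PAGE_SIZE

# dm-linear style remap: add the target's offset, then re-derive pgoff
pgoff = 10
dev_sector = pgoff * PAGE_SECTORS + 16   # assume linear_map_sector() adds +16
assert bdev_dax_pgoff(0, dev_sector) == 12
assert bdev_dax_pgoff(1, dev_sector) is None  # 512-byte shift breaks alignment
```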

Cc: Toshi Kani 
Cc: Mike Snitzer 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c   |   13 +
 drivers/md/dm-linear.c|   15 +++
 drivers/md/dm-stripe.c|   20 
 drivers/md/dm.c   |   26 ++
 include/linux/dax.h   |2 ++
 include/linux/device-mapper.h |3 +++
 6 files changed, 79 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 23ce3ab49f10..73f0da8e5d27 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/uio.h>
 #include 
 #include 
 
@@ -104,6 +105,18 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t 
pgoff, long nr_pages,
 }
 EXPORT_SYMBOL_GPL(dax_direct_access);
 
+size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+   size_t bytes, struct iov_iter *i)
+{
+   if (!dax_alive(dax_dev))
+   return 0;
+
+   if (!dax_dev->ops->copy_from_iter)
+   return copy_from_iter(addr, bytes, i);
+   return dax_dev->ops->copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+EXPORT_SYMBOL_GPL(dax_copy_from_iter);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index c5a52f4dae81..5fe44a0ddfab 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -158,6 +158,20 @@ static long linear_dax_direct_access(struct dm_target *ti, 
pgoff_t pgoff,
return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
 }
 
+static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
+   void *addr, size_t bytes, struct iov_iter *i)
+{
+   struct linear_c *lc = ti->private;
+   struct block_device *bdev = lc->dev->bdev;
+   struct dax_device *dax_dev = lc->dev->dax_dev;
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+   dev_sector = linear_map_sector(ti, sector);
+   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
+   return 0;
+   return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+
 static struct target_type linear_target = {
.name   = "linear",
.version = {1, 3, 0},
@@ -169,6 +183,7 @@ static struct target_type linear_target = {
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
+   .dax_copy_from_iter = linear_dax_copy_from_iter,
 };
 
 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index cb4b1e9e16ab..4f45d23249b2 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -330,6 +330,25 @@ static long stripe_dax_direct_access(struct dm_target *ti, 
pgoff_t pgoff,
return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
 }
 
+static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
+   void *addr, size_t bytes, struct iov_iter *i)
+{
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+   struct stripe_c *sc = ti->private;
+   struct dax_device *dax_dev;
+   struct block_device *bdev;
+   uint32_t stripe;
+
+   stripe_map_sector(sc, sector, &stripe, &dev_sector);
+   dev_sector += sc->stripe[stripe].physical_start;
+   dax_dev = sc->stripe[stripe].dev->dax_dev;
+   bdev = sc->stripe[stripe].dev->bdev;
+
+   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
+   return 0;
+   return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+
 /*
  * Stripe status:
  *
@@ -448,6 +467,7 @@ static struct target_type stripe_target = {
.iterate_devices = stripe_iterate_devices,
.io_hints = stripe_io_hints,
.direct_access = stripe_dax_direct_access,
+   .dax_copy_from_iter = stripe_dax_copy_from_iter,
 };
 
 int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 79d5f5fd823e..8c8579efcba2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -957,6 +958,30 @@ static long dm_dax_direct_access(struct dax_device 
*dax_dev, pgoff_t pgoff,
return ret;
 }
 
+static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+   void *addr, size_t bytes, struct iov_iter *i)
+{
+   struct mapped_device *md = dax_get_private(dax_dev);
+   sector_t sector = pgoff * PAGE_SECTORS;

[PATCH v2 23/33] dm: add ->flush() dax operation support

2017-04-14 Thread Dan Williams
Allow device-mapper to route flush operations to the
per-target implementation. In order for the device stacking to work we
need a dax_dev and a pgoff relative to that device. This gives each
layer of the stack the information it needs to look up the operation
pointer for the next level.

This conceptually allows for an array of mixed device drivers with
varying flush implementations.
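
For the dm-stripe target below, the routing step is stripe_map_sector(); a simplified round-robin model of it (illustrative only — the real code also handles non-power-of-two chunk sizes):

```python
def stripe_map_sector(sector, chunk_sectors, nr_stripes):
    """Pick the backing device index and the sector within that device
    for a given logical sector, with fixed-size chunks striped
    round-robin across nr_stripes devices."""
    chunk = sector // chunk_sectors
    offset = sector % chunk_sectors
    stripe = chunk % nr_stripes
    dev_sector = (chunk // nr_stripes) * chunk_sectors + offset
    return stripe, dev_sector

# 2 stripes, 8-sector (4KiB) chunks: chunks alternate between devices
assert stripe_map_sector(0, 8, 2) == (0, 0)
assert stripe_map_sector(8, 8, 2) == (1, 0)
assert stripe_map_sector(16, 8, 2) == (0, 8)
assert stripe_map_sector(19, 8, 2) == (0, 11)
```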

Cc: Toshi Kani 
Cc: Mike Snitzer 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c   |   11 +++
 drivers/md/dm-linear.c|   15 +++
 drivers/md/dm-stripe.c|   20 
 drivers/md/dm.c   |   19 +++
 include/linux/dax.h   |2 ++
 include/linux/device-mapper.h |3 +++
 6 files changed, 70 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 73f0da8e5d27..1253c05a2e53 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -117,6 +117,17 @@ size_t dax_copy_from_iter(struct dax_device *dax_dev, 
pgoff_t pgoff, void *addr,
 }
 EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 
+void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   if (!dax_alive(dax_dev))
+   return;
+
+   if (dax_dev->ops->flush)
+   dax_dev->ops->flush(dax_dev, pgoff, addr, size);
+}
+EXPORT_SYMBOL_GPL(dax_flush);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 5fe44a0ddfab..70d8439a1b63 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -172,6 +172,20 @@ static size_t linear_dax_copy_from_iter(struct dm_target 
*ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
 }
 
+static void linear_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   struct linear_c *lc = ti->private;
+   struct block_device *bdev = lc->dev->bdev;
+   struct dax_device *dax_dev = lc->dev->dax_dev;
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+   dev_sector = linear_map_sector(ti, sector);
+   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+   return;
+   dax_flush(dax_dev, pgoff, addr, size);
+}
+
 static struct target_type linear_target = {
.name   = "linear",
.version = {1, 3, 0},
@@ -184,6 +198,7 @@ static struct target_type linear_target = {
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
.dax_copy_from_iter = linear_dax_copy_from_iter,
+   .dax_flush = linear_dax_flush,
 };
 
 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 4f45d23249b2..829fd438318d 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -349,6 +349,25 @@ static size_t stripe_dax_copy_from_iter(struct dm_target 
*ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
 }
 
+static void stripe_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+   struct stripe_c *sc = ti->private;
+   struct dax_device *dax_dev;
+   struct block_device *bdev;
+   uint32_t stripe;
+
+   stripe_map_sector(sc, sector, &stripe, &dev_sector);
+   dev_sector += sc->stripe[stripe].physical_start;
+   dax_dev = sc->stripe[stripe].dev->dax_dev;
+   bdev = sc->stripe[stripe].dev->bdev;
+
+   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+   return;
+   dax_flush(dax_dev, pgoff, addr, size);
+}
+
 /*
  * Stripe status:
  *
@@ -468,6 +487,7 @@ static struct target_type stripe_target = {
.io_hints = stripe_io_hints,
.direct_access = stripe_dax_direct_access,
.dax_copy_from_iter = stripe_dax_copy_from_iter,
+   .dax_flush = stripe_dax_flush,
 };
 
 int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8c8579efcba2..6a97711cdbdf 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -982,6 +982,24 @@ static size_t dm_dax_copy_from_iter(struct dax_device 
*dax_dev, pgoff_t pgoff,
return ret;
 }
 
+static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   struct mapped_device *md = dax_get_private(dax_dev);
+   sector_t sector = pgoff * PAGE_SECTORS;
+   struct dm_target *ti;
+   int srcu_idx;
+
+   ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+   if (!ti)
+   goto out;
+   if (ti->type->dax_flush)
+   ti->type->dax_flush(ti, pgoff, addr, size);
+ out:
+   dm_put_live_table(md, srcu_idx);
+}
+
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except 

[PATCH v2 19/33] dax, pmem: introduce 'copy_from_iter' dax operation

2017-04-14 Thread Dan Williams
The direct-I/O write path for a pmem device must ensure that data is
flushed to a power-fail safe zone when the operation is complete.
However, other dax capable block devices, like brd, do not have this
requirement.  Introduce a 'copy_from_iter' dax operation so that pmem
can inject cache management without imposing this overhead on other dax
capable block_device drivers.

This is also a first step of moving all architecture-specific
pmem-operations to the pmem driver.
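
The op-with-fallback dispatch this patch introduces can be sketched as follows (illustrative names; the real signatures take a dax_device and pgoff): drivers that define copy_from_iter get their cache-managing variant, while others fall back to a plain copy.

```python
def copy_from_iter_default(addr, data, sink):
    """Stand-in for the plain (cached) copy_from_iter()."""
    sink[addr] = bytes(data)
    return len(data)

def dax_copy_from_iter(ops, addr, data, sink):
    """Model of the dispatch: use the driver's op when present,
    otherwise the generic copy."""
    op = ops.get("copy_from_iter", copy_from_iter_default)
    return op(addr, data, sink)

flushed = []
def pmem_copy(addr, data, sink):
    # pmem variant: copy, then record the cache write-back it would do
    n = copy_from_iter_default(addr, data, sink)
    flushed.append((addr, n))
    return n

sink = {}
assert dax_copy_from_iter({"copy_from_iter": pmem_copy}, 0, b"abc", sink) == 3
assert dax_copy_from_iter({}, 8, b"de", sink) == 2
assert flushed == [(0, 3)]  # only the pmem path managed caches
```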

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Al Viro 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pmem.c |   43 +++
 include/linux/dax.h   |3 +++
 2 files changed, 46 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..e501df4ab4b4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -220,6 +220,48 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
 }
 
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+   void *addr, size_t bytes, struct iov_iter *i)
+{
+   size_t len;
+
+   /* TODO: skip the write-back by always using non-temporal stores */
+   len = copy_from_iter_nocache(addr, bytes, i);
+
+   /*
+* In the iovec case on x86_64 copy_from_iter_nocache() uses
+* non-temporal stores for the bulk of the transfer, but we need
+* to manually flush if the transfer is unaligned. A cached
+* memory copy is used when destination or size is not naturally
+* aligned. That is:
+*   - Require 8-byte alignment when size is 8 bytes or larger.
+*   - Require 4-byte alignment when size is 4 bytes.
+*
+* In the non-iovec case the entire destination needs to be
+* flushed.
+*/
+   if (iter_is_iovec(i)) {
+   unsigned long flushed, dest = (unsigned long) addr;
+
+   if (bytes < 8) {
+   if (!IS_ALIGNED(dest, 4) || (bytes != 4))
+   wb_cache_pmem(addr, 1);
+   } else {
+   if (!IS_ALIGNED(dest, 8)) {
+   dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+   wb_cache_pmem(addr, 1);
+   }
+
+   flushed = dest - (unsigned long) addr;
+   if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
+   wb_cache_pmem(addr + bytes - 1, 1);
+   }
+   } else
+   wb_cache_pmem(addr, bytes);
+
+   return len;
+}
+
 static const struct block_device_operations pmem_fops = {
	.owner =		THIS_MODULE,
	.rw_page =		pmem_rw_page,
@@ -236,6 +278,7 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 
 static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
+   .copy_from_iter = pmem_copy_from_iter,
 };
 
 static void pmem_release_queue(void *q)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
 */
long (*direct_access)(struct dax_device *, pgoff_t, long,
void **, pfn_t *);
+   /* copy_from_iter: dax-driver override for default copy_from_iter */
+   size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+   struct iov_iter *);
 };
 
 int dax_read_lock(void);



[PATCH v2 23/33] dm: add ->flush() dax operation support

2017-04-14 Thread Dan Williams
Allow device-mapper to route flush operations to the
per-target implementation. In order for the device stacking to work we
need a dax_dev and a pgoff relative to that device. This gives each
layer of the stack the information it needs to look up the operation
pointer for the next level.

This conceptually allows for an array of mixed device drivers with
varying flush implementations.
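The stacked dispatch described above can be modeled with per-device operation tables: a generic dax_flush() helper calls the device's ->flush only if the driver provides one, so drivers such as brd simply leave the slot NULL. This is a simplified user-space sketch (assumptions: no locking, no pgoff translation, a counter instead of a real cache write-back):

```c
#include <stddef.h>

/* Per-driver operation table: ->flush is optional. */
struct dax_operations {
	void (*flush)(void *addr, size_t size);
};

struct dax_device {
	const struct dax_operations *ops;
};

static int flush_calls;	/* observable side effect for this sketch */

static void pmem_flush(void *addr, size_t size)
{
	(void)addr;
	(void)size;
	flush_calls++;	/* stand-in for a real cache write-back */
}

/* pmem provides ->flush; a driver like brd simply leaves it NULL. */
static const struct dax_operations pmem_ops = { .flush = pmem_flush };
static const struct dax_operations brd_ops  = { .flush = NULL };

/* Generic helper: dispatch only if the driver provided the op. */
static void dax_flush(struct dax_device *dev, void *addr, size_t size)
{
	if (dev->ops->flush)
		dev->ops->flush(addr, size);
}
```

Device-mapper targets add one more hop of the same shape: dm's ->flush looks up the live target for the sector and forwards to the target's dax_flush, which in turn calls the helper above on the underlying device.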

Cc: Toshi Kani 
Cc: Mike Snitzer 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c   |   11 +++
 drivers/md/dm-linear.c|   15 +++
 drivers/md/dm-stripe.c|   20 
 drivers/md/dm.c   |   19 +++
 include/linux/dax.h   |2 ++
 include/linux/device-mapper.h |3 +++
 6 files changed, 70 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 73f0da8e5d27..1253c05a2e53 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -117,6 +117,17 @@ size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 }
 EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 
+void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   if (!dax_alive(dax_dev))
+   return;
+
+   if (dax_dev->ops->flush)
+   dax_dev->ops->flush(dax_dev, pgoff, addr, size);
+}
+EXPORT_SYMBOL_GPL(dax_flush);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
	lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 5fe44a0ddfab..70d8439a1b63 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -172,6 +172,20 @@ static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
 }
 
+static void linear_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   struct linear_c *lc = ti->private;
+   struct block_device *bdev = lc->dev->bdev;
+   struct dax_device *dax_dev = lc->dev->dax_dev;
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+   dev_sector = linear_map_sector(ti, sector);
+   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+   return;
+   dax_flush(dax_dev, pgoff, addr, size);
+}
+
 static struct target_type linear_target = {
.name   = "linear",
.version = {1, 3, 0},
@@ -184,6 +198,7 @@ static struct target_type linear_target = {
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
.dax_copy_from_iter = linear_dax_copy_from_iter,
+   .dax_flush = linear_dax_flush,
 };
 
 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 4f45d23249b2..829fd438318d 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -349,6 +349,25 @@ static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
 }
 
+static void stripe_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+   struct stripe_c *sc = ti->private;
+   struct dax_device *dax_dev;
+   struct block_device *bdev;
+   uint32_t stripe;
+
+   stripe_map_sector(sc, sector, &stripe, &dev_sector);
+   dev_sector += sc->stripe[stripe].physical_start;
+   dax_dev = sc->stripe[stripe].dev->dax_dev;
+   bdev = sc->stripe[stripe].dev->bdev;
+
+   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+   return;
+   dax_flush(dax_dev, pgoff, addr, size);
+}
+
 /*
  * Stripe status:
  *
@@ -468,6 +487,7 @@ static struct target_type stripe_target = {
.io_hints = stripe_io_hints,
.direct_access = stripe_dax_direct_access,
.dax_copy_from_iter = stripe_dax_copy_from_iter,
+   .dax_flush = stripe_dax_flush,
 };
 
 int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8c8579efcba2..6a97711cdbdf 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -982,6 +982,24 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
return ret;
 }
 
+static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+   size_t size)
+{
+   struct mapped_device *md = dax_get_private(dax_dev);
+   sector_t sector = pgoff * PAGE_SECTORS;
+   struct dm_target *ti;
+   int srcu_idx;
+
+   ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+   if (!ti)
+   goto out;
+   if (ti->type->dax_flush)
+   ti->type->dax_flush(ti, pgoff, addr, size);
+ out:
+   dm_put_live_table(md, srcu_idx);
+}
+
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_PREFLUSH.
@@ -2844,6 +2862,7 @@ static const struct 

[PATCH v2 16/33] block, dax: convert bdev_dax_supported() to dax_direct_access()

2017-04-14 Thread Dan Williams
Kill off the final user of bdev_direct_access() and struct blk_dax_ctl.

Signed-off-by: Dan Williams 
---
 fs/block_dev.c |   48 
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2f7885712575..ecbdc8f9f718 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -788,35 +788,43 @@ EXPORT_SYMBOL(bdev_dax_pgoff);
  */
 int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
-   struct blk_dax_ctl dax = {
-   .sector = 0,
-   .size = PAGE_SIZE,
-   };
-   int err;
+   struct block_device *bdev = sb->s_bdev;
+   struct dax_device *dax_dev;
+   pgoff_t pgoff;
+   int err, id;
+   void *kaddr;
+   pfn_t pfn;
+   long len;
 
if (blocksize != PAGE_SIZE) {
vfs_msg(sb, KERN_ERR, "error: unsupported blocksize for dax");
return -EINVAL;
}
 
-   err = bdev_direct_access(sb->s_bdev, &dax);
-   if (err < 0) {
-   switch (err) {
-   case -EOPNOTSUPP:
-   vfs_msg(sb, KERN_ERR,
-   "error: device does not support dax");
-   break;
-   case -EINVAL:
-   vfs_msg(sb, KERN_ERR,
-   "error: unaligned partition for dax");
-   break;
-   default:
-   vfs_msg(sb, KERN_ERR,
-   "error: dax access failed (%d)", err);
-   }
+   err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
+   if (err) {
+   vfs_msg(sb, KERN_ERR, "error: unaligned partition for dax");
return err;
}
 
+   dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+   if (!dax_dev) {
+   vfs_msg(sb, KERN_ERR, "error: device does not support dax");
+   return -EOPNOTSUPP;
+   }
+
+   id = dax_read_lock();
+   len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
+   dax_read_unlock(id);
+
+   put_dax(dax_dev);
+
+   if (len < 1) {
+   vfs_msg(sb, KERN_ERR,
+   "error: dax access failed (%d)", len);
+   return len < 0 ? len : -EIO;
+   }
+
return 0;
 }
 EXPORT_SYMBOL_GPL(bdev_dax_supported);



[PATCH v2 15/33] filesystem-dax: convert to dax_direct_access()

2017-04-14 Thread Dan Williams
Now that a dax_device is plumbed through all dax-capable drivers we can
switch from block_device_operations to dax_operations for invoking
->direct_access.

This also lets us kill off some usages of struct blk_dax_ctl on the way
to its eventual removal.
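Every converted call site in fs/dax.c follows the same pattern: translate the sector to a pgoff with bdev_dax_pgoff(), take dax_read_lock(), call dax_direct_access(), then drop the lock. A user-space sketch of that calling convention under assumed simplifications (a fixed 8-page device, 4096-byte pages, and a plain counter in place of SRCU):

```c
#include <stddef.h>

#define PAGE_SIZE 4096
#define NR_PAGES  8

static int lock_depth;	/* stands in for the SRCU read lock */

static int dax_read_lock(void)      { return ++lock_depth; }
static void dax_read_unlock(int id) { (void)id; --lock_depth; }

static char pmem[NR_PAGES * PAGE_SIZE];	/* pretend persistent memory */

/*
 * Like dax_direct_access(): return how many pages are addressable
 * starting at 'pgoff' (capped at 'nr_pages') and hand back the
 * virtual address, or a negative errno when out of range.
 */
static long dax_direct_access(size_t pgoff, long nr_pages, void **kaddr)
{
	long avail = NR_PAGES - (long)pgoff;

	if (avail <= 0)
		return -34;	/* -ERANGE */
	*kaddr = pmem + pgoff * PAGE_SIZE;
	return avail < nr_pages ? avail : nr_pages;
}
```

The "may return fewer pages than asked for" contract is why the converted dax_writeback_one() below checks the return value against size / PAGE_SIZE instead of assuming the whole request was mapped.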

Suggested-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 fs/dax.c|  277 +--
 fs/iomap.c  |3 -
 include/linux/dax.h |6 +
 3 files changed, 162 insertions(+), 124 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b78a6947c4f5..ce9dc9c3e829 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -55,32 +55,6 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
-{
-   struct request_queue *q = bdev->bd_queue;
-   long rc = -EIO;
-
-   dax->addr = ERR_PTR(-EIO);
-   if (blk_queue_enter(q, true) != 0)
-   return rc;
-
-   rc = bdev_direct_access(bdev, dax);
-   if (rc < 0) {
-   dax->addr = ERR_PTR(rc);
-   blk_queue_exit(q);
-   return rc;
-   }
-   return rc;
-}
-
-static void dax_unmap_atomic(struct block_device *bdev,
-   const struct blk_dax_ctl *dax)
-{
-   if (IS_ERR(dax->addr))
-   return;
-   blk_queue_exit(bdev->bd_queue);
-}
-
 static int dax_is_pmd_entry(void *entry)
 {
return (unsigned long)entry & RADIX_DAX_PMD;
@@ -553,21 +527,30 @@ static int dax_load_hole(struct address_space *mapping, void **entry,
return ret;
 }
 
-static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
-   struct page *to, unsigned long vaddr)
+static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
+   sector_t sector, size_t size, struct page *to,
+   unsigned long vaddr)
 {
-   struct blk_dax_ctl dax = {
-   .sector = sector,
-   .size = size,
-   };
-   void *vto;
-
-   if (dax_map_atomic(bdev, &dax) < 0)
-   return PTR_ERR(dax.addr);
+   void *vto, *kaddr;
+   pgoff_t pgoff;
+   pfn_t pfn;
+   long rc;
+   int id;
+
+   rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+   if (rc)
+   return rc;
+
+   id = dax_read_lock();
+   rc = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), &kaddr, &pfn);
+   if (rc < 0) {
+   dax_read_unlock(id);
+   return rc;
+   }
vto = kmap_atomic(to);
-   copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
+   copy_user_page(vto, (void __force *)kaddr, vaddr, to);
kunmap_atomic(vto);
-   dax_unmap_atomic(bdev, &dax);
+   dax_read_unlock(id);
return 0;
 }
 
@@ -735,12 +718,16 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 }
 
 static int dax_writeback_one(struct block_device *bdev,
-   struct address_space *mapping, pgoff_t index, void *entry)
+   struct dax_device *dax_dev, struct address_space *mapping,
+   pgoff_t index, void *entry)
 {
	struct radix_tree_root *page_tree = &mapping->page_tree;
-   struct blk_dax_ctl dax;
-   void *entry2, **slot;
-   int ret = 0;
+   void *entry2, **slot, *kaddr;
+   long ret = 0, id;
+   sector_t sector;
+   pgoff_t pgoff;
+   size_t size;
+   pfn_t pfn;
 
/*
 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -789,26 +776,29 @@ static int dax_writeback_one(struct block_device *bdev,
 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
 * worry about partial PMD writebacks.
 */
-   dax.sector = dax_radix_sector(entry);
-   dax.size = PAGE_SIZE << dax_radix_order(entry);
+   sector = dax_radix_sector(entry);
+   size = PAGE_SIZE << dax_radix_order(entry);
+
+   id = dax_read_lock();
+   ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+   if (ret)
+   goto dax_unlock;
 
/*
-* We cannot hold tree_lock while calling dax_map_atomic() because it
-* eventually calls cond_resched().
+* dax_direct_access() may sleep, so cannot hold tree_lock over
+* its invocation.
 */
-   ret = dax_map_atomic(bdev, &dax);
-   if (ret < 0) {
-   put_locked_mapping_entry(mapping, index, entry);
-   return ret;
-   }
+   ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
+   if (ret < 0)
+   goto dax_unlock;
 
-   if (WARN_ON_ONCE(ret < dax.size)) {
+   if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
ret = -EIO;
-   goto unmap;
+   goto dax_unlock;
}
 
-   dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
-   wb_cache_pmem(dax.addr, dax.size);
+   

[PATCH v2 14/33] Revert "block: use DAX for partition table reads"

2017-04-14 Thread Dan Williams
commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.

Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.

Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
223757016837 ("block_dev: remove DAX leftovers").

Cc: Jeff Moyer 
Signed-off-by: Dan Williams 
---
 block/partition-generic.c |   17 ++---
 fs/dax.c  |   20 
 include/linux/dax.h   |6 --
 3 files changed, 2 insertions(+), 41 deletions(-)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 7afb9907821f..5dfac337b0f2 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -16,7 +16,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include "partitions/check.h"
@@ -631,24 +630,12 @@ int invalidate_partitions(struct gendisk *disk, struct block_device *bdev)
return 0;
 }
 
-static struct page *read_pagecache_sector(struct block_device *bdev, sector_t n)
-{
-   struct address_space *mapping = bdev->bd_inode->i_mapping;
-
-   return read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)),
-NULL);
-}
-
unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
 {
+   struct address_space *mapping = bdev->bd_inode->i_mapping;
struct page *page;
 
-   /* don't populate page cache for dax capable devices */
-   if (IS_DAX(bdev->bd_inode))
-   page = read_dax_sector(bdev, n);
-   else
-   page = read_pagecache_sector(bdev, n);
-
+   page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)), NULL);
if (!IS_ERR(page)) {
if (PageError(page))
goto fail;
diff --git a/fs/dax.c b/fs/dax.c
index de622d4282a6..b78a6947c4f5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -101,26 +101,6 @@ static int dax_is_empty_entry(void *entry)
return (unsigned long)entry & RADIX_DAX_EMPTY;
 }
 
-struct page *read_dax_sector(struct block_device *bdev, sector_t n)
-{
-   struct page *page = alloc_pages(GFP_KERNEL, 0);
-   struct blk_dax_ctl dax = {
-   .size = PAGE_SIZE,
-   .sector = n & ~((((int) PAGE_SIZE) / 512) - 1),
-   };
-   long rc;
-
-   if (!page)
-   return ERR_PTR(-ENOMEM);
-
-   rc = dax_map_atomic(bdev, &dax);
-   if (rc < 0)
-   return ERR_PTR(rc);
-   memcpy_from_pmem(page_address(page), dax.addr, PAGE_SIZE);
-   dax_unmap_atomic(bdev, &dax);
-   return page;
-}
-
 /*
  * DAX radix tree locking
  */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7e62e280c11f..0d0d890f9186 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -70,15 +70,9 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
pgoff_t index, void *entry, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
-struct page *read_dax_sector(struct block_device *bdev, sector_t n);
 int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
unsigned int offset, unsigned int length);
 #else
-static inline struct page *read_dax_sector(struct block_device *bdev,
-   sector_t n)
-{
-   return ERR_PTR(-ENXIO);
-}
 static inline int __dax_zero_page_range(struct block_device *bdev,
sector_t sector, unsigned int offset, unsigned int length)
 {



[PATCH v2 13/33] ext2, ext4, xfs: retrieve dax_device for iomap operations

2017-04-14 Thread Dan Williams
In preparation for converting fs/dax.c to use dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_device associated with a given block_device.
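Each filesystem gains the same guarded lookup in ->iomap_begin and an unconditional put_dax() in ->iomap_end. The reference pairing can be sketched in user space (assumption: a bare refcount stands in for the real dax_get_by_host()/put_dax() pair; put of NULL is a no-op, which is what keeps the non-DAX path safe):

```c
#include <stddef.h>

struct dax_device {
	int refs;
};

static struct dax_device *dax_get(struct dax_device *d)
{
	d->refs++;
	return d;
}

static void put_dax(struct dax_device *d)
{
	if (d)		/* NULL means the queue was not DAX-capable */
		d->refs--;
}

/* Model of ->iomap_begin: take a reference only on DAX-capable queues. */
static struct dax_device *iomap_begin(struct dax_device *host, int queue_dax)
{
	return queue_dax ? dax_get(host) : NULL;
}
```

Because ->iomap_end always calls put_dax(), the reference taken in ->iomap_begin is balanced on every path, DAX-capable or not.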

Signed-off-by: Dan Williams 
---
 fs/ext2/inode.c   |9 -
 fs/ext4/inode.c   |9 -
 fs/xfs/xfs_iomap.c|   10 ++
 include/linux/iomap.h |1 +
 4 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 128cce540645..4c9d2d44e879 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -799,6 +799,7 @@ int ext2_get_block(struct inode *inode, sector_t iblock,
 static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
 {
+   struct block_device *bdev;
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block = offset >> blkbits;
unsigned long max_blocks = (length + (1 << blkbits) - 1) >> blkbits;
@@ -812,8 +813,13 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
return ret;
 
iomap->flags = 0;
-   iomap->bdev = inode->i_sb->s_bdev;
+   bdev = inode->i_sb->s_bdev;
+   iomap->bdev = bdev;
iomap->offset = (u64)first_block << blkbits;
+   if (blk_queue_dax(bdev->bd_queue))
+   iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+   else
+   iomap->dax_dev = NULL;
 
if (ret == 0) {
iomap->type = IOMAP_HOLE;
@@ -835,6 +841,7 @@ static int
 ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
 {
+   put_dax(iomap->dax_dev);
if (iomap->type == IOMAP_MAPPED &&
written < length &&
(flags & IOMAP_WRITE))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4247d8d25687..2cb2634daa99 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3305,6 +3305,7 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
 static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
 {
+   struct block_device *bdev;
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block = offset >> blkbits;
unsigned long last_block = (offset + length - 1) >> blkbits;
@@ -3373,7 +3374,12 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
 
iomap->flags = 0;
-   iomap->bdev = inode->i_sb->s_bdev;
+   bdev = inode->i_sb->s_bdev;
+   iomap->bdev = bdev;
+   if (blk_queue_dax(bdev->bd_queue))
+   iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+   else
+   iomap->dax_dev = NULL;
iomap->offset = first_block << blkbits;
 
if (ret == 0) {
@@ -3406,6 +3412,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t 
offset, loff_t length,
int blkbits = inode->i_blkbits;
bool truncate = false;
 
+   put_dax(iomap->dax_dev);
if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
return 0;
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 288ee5b840d7..4b47403f8089 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -976,6 +976,7 @@ xfs_file_iomap_begin(
int nimaps = 1, error = 0;
	bool		shared = false, trimmed = false;
	unsigned	lockmode;
+   struct block_device *bdev;
 
if (XFS_FORCED_SHUTDOWN(mp))
return -EIO;
@@ -1063,6 +1064,14 @@ xfs_file_iomap_begin(
}
 
	xfs_bmbt_to_iomap(ip, iomap, &imap);
+
+   /* optionally associate a dax device with the iomap bdev */
+   bdev = iomap->bdev;
+   if (blk_queue_dax(bdev->bd_queue))
+   iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+   else
+   iomap->dax_dev = NULL;
+
if (shared)
iomap->flags |= IOMAP_F_SHARED;
return 0;
@@ -1140,6 +1149,7 @@ xfs_file_iomap_end(
	unsigned	flags,
	struct iomap	*iomap)
 {
+   put_dax(iomap->dax_dev);
if ((flags & IOMAP_WRITE) && iomap->type == IOMAP_DELALLOC)
return xfs_file_iomap_end_delalloc(XFS_I(inode), offset,
length, written, iomap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7291810067eb..f753e788da31 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -41,6 +41,7 @@ struct iomap {
u16 type;   /* type of mapping */
u16 flags;  /* flags for mapping */
struct block_device *bdev;  /* block device for I/O */
+   struct dax_device   *dax_dev; /* dax_dev for dax operations */
 };
 
 /*



[PATCH v2 15/33] filesystem-dax: convert to dax_direct_access()

2017-04-14 Thread Dan Williams
Now that a dax_device is plumbed through all dax-capable drivers we can
switch from block_device_operations to dax_operations for invoking
->direct_access.

This also lets us kill off some usages of struct blk_dax_ctl on the way
to its eventual removal.

Suggested-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 fs/dax.c|  277 +--
 fs/iomap.c  |3 -
 include/linux/dax.h |6 +
 3 files changed, 162 insertions(+), 124 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b78a6947c4f5..ce9dc9c3e829 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -55,32 +55,6 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
-{
-   struct request_queue *q = bdev->bd_queue;
-   long rc = -EIO;
-
-   dax->addr = ERR_PTR(-EIO);
-   if (blk_queue_enter(q, true) != 0)
-   return rc;
-
-   rc = bdev_direct_access(bdev, dax);
-   if (rc < 0) {
-   dax->addr = ERR_PTR(rc);
-   blk_queue_exit(q);
-   return rc;
-   }
-   return rc;
-}
-
-static void dax_unmap_atomic(struct block_device *bdev,
-   const struct blk_dax_ctl *dax)
-{
-   if (IS_ERR(dax->addr))
-   return;
-   blk_queue_exit(bdev->bd_queue);
-}
-
 static int dax_is_pmd_entry(void *entry)
 {
return (unsigned long)entry & RADIX_DAX_PMD;
@@ -553,21 +527,30 @@ static int dax_load_hole(struct address_space *mapping, 
void **entry,
return ret;
 }
 
-static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t 
size,
-   struct page *to, unsigned long vaddr)
+static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
+   sector_t sector, size_t size, struct page *to,
+   unsigned long vaddr)
 {
-   struct blk_dax_ctl dax = {
-   .sector = sector,
-   .size = size,
-   };
-   void *vto;
-
-   if (dax_map_atomic(bdev, ) < 0)
-   return PTR_ERR(dax.addr);
+   void *vto, *kaddr;
+   pgoff_t pgoff;
+   pfn_t pfn;
+   long rc;
+   int id;
+
+   rc = bdev_dax_pgoff(bdev, sector, size, );
+   if (rc)
+   return rc;
+
+   id = dax_read_lock();
+   rc = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), , );
+   if (rc < 0) {
+   dax_read_unlock(id);
+   return rc;
+   }
vto = kmap_atomic(to);
-   copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
+   copy_user_page(vto, (void __force *)kaddr, vaddr, to);
kunmap_atomic(vto);
-   dax_unmap_atomic(bdev, );
+   dax_read_unlock(id);
return 0;
 }
 
@@ -735,12 +718,16 @@ static void dax_mapping_entry_mkclean(struct 
address_space *mapping,
 }
 
 static int dax_writeback_one(struct block_device *bdev,
-   struct address_space *mapping, pgoff_t index, void *entry)
+   struct dax_device *dax_dev, struct address_space *mapping,
+   pgoff_t index, void *entry)
 {
struct radix_tree_root *page_tree = >page_tree;
-   struct blk_dax_ctl dax;
-   void *entry2, **slot;
-   int ret = 0;
+   void *entry2, **slot, *kaddr;
+   long ret = 0, id;
+   sector_t sector;
+   pgoff_t pgoff;
+   size_t size;
+   pfn_t pfn;
 
/*
 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -789,26 +776,29 @@ static int dax_writeback_one(struct block_device *bdev,
 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
 * worry about partial PMD writebacks.
 */
-   dax.sector = dax_radix_sector(entry);
-   dax.size = PAGE_SIZE << dax_radix_order(entry);
+   sector = dax_radix_sector(entry);
+   size = PAGE_SIZE << dax_radix_order(entry);
+
+   id = dax_read_lock();
+   ret = bdev_dax_pgoff(bdev, sector, size, );
+   if (ret)
+   goto dax_unlock;
 
/*
-* We cannot hold tree_lock while calling dax_map_atomic() because it
-* eventually calls cond_resched().
+* dax_direct_access() may sleep, so cannot hold tree_lock over
+* its invocation.
 */
-   ret = dax_map_atomic(bdev, );
-   if (ret < 0) {
-   put_locked_mapping_entry(mapping, index, entry);
-   return ret;
-   }
+   ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, , );
+   if (ret < 0)
+   goto dax_unlock;
 
-   if (WARN_ON_ONCE(ret < dax.size)) {
+   if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
ret = -EIO;
-   goto unmap;
+   goto dax_unlock;
}
 
-   dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
-   wb_cache_pmem(dax.addr, dax.size);
+   dax_mapping_entry_mkclean(mapping, index, 

[PATCH v2 14/33] Revert "block: use DAX for partition table reads"

2017-04-14 Thread Dan Williams
commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.

Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.

Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
223757016837 ("block_dev: remove DAX leftovers").

Cc: Jeff Moyer 
Signed-off-by: Dan Williams 
---
 block/partition-generic.c |   17 ++---
 fs/dax.c  |   20 
 include/linux/dax.h   |6 --
 3 files changed, 2 insertions(+), 41 deletions(-)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 7afb9907821f..5dfac337b0f2 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -16,7 +16,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include "partitions/check.h"
@@ -631,24 +630,12 @@ int invalidate_partitions(struct gendisk *disk, struct block_device *bdev)
return 0;
 }
 
-static struct page *read_pagecache_sector(struct block_device *bdev, sector_t n)
-{
-   struct address_space *mapping = bdev->bd_inode->i_mapping;
-
-   return read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)),
-NULL);
-}
-
 unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
 {
+   struct address_space *mapping = bdev->bd_inode->i_mapping;
struct page *page;
 
-   /* don't populate page cache for dax capable devices */
-   if (IS_DAX(bdev->bd_inode))
-   page = read_dax_sector(bdev, n);
-   else
-   page = read_pagecache_sector(bdev, n);
-
+   page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)), NULL);
if (!IS_ERR(page)) {
if (PageError(page))
goto fail;
diff --git a/fs/dax.c b/fs/dax.c
index de622d4282a6..b78a6947c4f5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -101,26 +101,6 @@ static int dax_is_empty_entry(void *entry)
return (unsigned long)entry & RADIX_DAX_EMPTY;
 }
 
-struct page *read_dax_sector(struct block_device *bdev, sector_t n)
-{
-   struct page *page = alloc_pages(GFP_KERNEL, 0);
-   struct blk_dax_ctl dax = {
-   .size = PAGE_SIZE,
-   .sector = n & ~((((int) PAGE_SIZE) / 512) - 1),
-   };
-   long rc;
-
-   if (!page)
-   return ERR_PTR(-ENOMEM);
-
-   rc = dax_map_atomic(bdev, &dax);
-   if (rc < 0)
-   return ERR_PTR(rc);
-   memcpy_from_pmem(page_address(page), dax.addr, PAGE_SIZE);
-   dax_unmap_atomic(bdev, &dax);
-   return page;
-}
-
 /*
  * DAX radix tree locking
  */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7e62e280c11f..0d0d890f9186 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -70,15 +70,9 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
pgoff_t index, void *entry, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
-struct page *read_dax_sector(struct block_device *bdev, sector_t n);
 int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
unsigned int offset, unsigned int length);
 #else
-static inline struct page *read_dax_sector(struct block_device *bdev,
-   sector_t n)
-{
-   return ERR_PTR(-ENXIO);
-}
 static inline int __dax_zero_page_range(struct block_device *bdev,
sector_t sector, unsigned int offset, unsigned int length)
 {



[PATCH v2 13/33] ext2, ext4, xfs: retrieve dax_device for iomap operations

2017-04-14 Thread Dan Williams
In preparation for converting fs/dax.c to use dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_device associated with a given block_device.

Signed-off-by: Dan Williams 
---
 fs/ext2/inode.c   |9 -
 fs/ext4/inode.c   |9 -
 fs/xfs/xfs_iomap.c|   10 ++
 include/linux/iomap.h |1 +
 4 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 128cce540645..4c9d2d44e879 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -799,6 +799,7 @@ int ext2_get_block(struct inode *inode, sector_t iblock,
 static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
 {
+   struct block_device *bdev;
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block = offset >> blkbits;
unsigned long max_blocks = (length + (1 << blkbits) - 1) >> blkbits;
@@ -812,8 +813,13 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
return ret;
 
iomap->flags = 0;
-   iomap->bdev = inode->i_sb->s_bdev;
+   bdev = inode->i_sb->s_bdev;
+   iomap->bdev = bdev;
iomap->offset = (u64)first_block << blkbits;
+   if (blk_queue_dax(bdev->bd_queue))
+   iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+   else
+   iomap->dax_dev = NULL;
 
if (ret == 0) {
iomap->type = IOMAP_HOLE;
@@ -835,6 +841,7 @@ static int
 ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
 {
+   put_dax(iomap->dax_dev);
if (iomap->type == IOMAP_MAPPED &&
written < length &&
(flags & IOMAP_WRITE))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4247d8d25687..2cb2634daa99 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3305,6 +3305,7 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
 static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
 {
+   struct block_device *bdev;
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block = offset >> blkbits;
unsigned long last_block = (offset + length - 1) >> blkbits;
@@ -3373,7 +3374,12 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
 
iomap->flags = 0;
-   iomap->bdev = inode->i_sb->s_bdev;
+   bdev = inode->i_sb->s_bdev;
+   iomap->bdev = bdev;
+   if (blk_queue_dax(bdev->bd_queue))
+   iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+   else
+   iomap->dax_dev = NULL;
iomap->offset = first_block << blkbits;
 
if (ret == 0) {
@@ -3406,6 +3412,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
int blkbits = inode->i_blkbits;
bool truncate = false;
 
+   put_dax(iomap->dax_dev);
if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
return 0;
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 288ee5b840d7..4b47403f8089 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -976,6 +976,7 @@ xfs_file_iomap_begin(
int nimaps = 1, error = 0;
boolshared = false, trimmed = false;
unsignedlockmode;
+   struct block_device *bdev;
 
if (XFS_FORCED_SHUTDOWN(mp))
return -EIO;
@@ -1063,6 +1064,14 @@ xfs_file_iomap_begin(
}
 
	xfs_bmbt_to_iomap(ip, iomap, &imap);
+
+   /* optionally associate a dax device with the iomap bdev */
+   bdev = iomap->bdev;
+   if (blk_queue_dax(bdev->bd_queue))
+   iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+   else
+   iomap->dax_dev = NULL;
+
if (shared)
iomap->flags |= IOMAP_F_SHARED;
return 0;
@@ -1140,6 +1149,7 @@ xfs_file_iomap_end(
unsignedflags,
struct iomap*iomap)
 {
+   put_dax(iomap->dax_dev);
if ((flags & IOMAP_WRITE) && iomap->type == IOMAP_DELALLOC)
return xfs_file_iomap_end_delalloc(XFS_I(inode), offset,
length, written, iomap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7291810067eb..f753e788da31 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -41,6 +41,7 @@ struct iomap {
u16 type;   /* type of mapping */
u16 flags;  /* flags for mapping */
struct block_device *bdev;  /* block device for I/O */
+   struct dax_device   *dax_dev; /* dax_dev for dax operations */
 };
 
 /*



[PATCH v2 09/33] block: kill bdev_dax_capable()

2017-04-14 Thread Dan Williams
This is leftover dead code that has since been replaced by
bdev_dax_supported().

Signed-off-by: Dan Williams 
---
 fs/block_dev.c |   24 
 include/linux/blkdev.h |1 -
 2 files changed, 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..7f40ea2f0875 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -807,30 +807,6 @@ int bdev_dax_supported(struct super_block *sb, int blocksize)
 }
 EXPORT_SYMBOL_GPL(bdev_dax_supported);
 
-/**
- * bdev_dax_capable() - Return if the raw device is capable for dax
- * @bdev: The device for raw block device access
- */
-bool bdev_dax_capable(struct block_device *bdev)
-{
-   struct blk_dax_ctl dax = {
-   .size = PAGE_SIZE,
-   };
-
-   if (!IS_ENABLED(CONFIG_FS_DAX))
-   return false;
-
-   dax.sector = 0;
-   if (bdev_direct_access(bdev, &dax) < 0)
-   return false;
-
-   dax.sector = bdev->bd_part->nr_sects - (PAGE_SIZE / 512);
-   if (bdev_direct_access(bdev, &dax) < 0)
-   return false;
-
-   return true;
-}
-
 /*
  * pseudo-fs
  */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da607ca04..f72708399b83 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1958,7 +1958,6 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 extern int bdev_dax_supported(struct super_block *, int);
-extern bool bdev_dax_capable(struct block_device *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;



[PATCH v2 10/33] dax: introduce dax_direct_access()

2017-04-14 Thread Dan Williams
Replace bdev_direct_access() with dax_direct_access() that uses
dax_device and dax_operations instead of a block_device and
block_device_operations for dax. Once all consumers of the old api have
been converted bdev_direct_access() will be deleted.

Given that block device partitioning decisions can cause dax page
alignment constraints to be violated this also introduces the
bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
to the dax_device and also checks for page alignment.

Signed-off-by: Dan Williams 
---
 block/Kconfig  |1 +
 drivers/dax/super.c|   39 +++
 fs/block_dev.c |   14 ++
 include/linux/blkdev.h |1 +
 include/linux/dax.h|2 ++
 5 files changed, 57 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index e9f780f815f5..93da7fc3f254 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -6,6 +6,7 @@ menuconfig BLOCK
default y
select SBITMAP
select SRCU
+   select DAX
help
 Provide block layer support for the kernel.
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 45ccfc043da8..23ce3ab49f10 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -65,6 +65,45 @@ struct dax_device {
const struct dax_operations *ops;
 };
 
+/**
+ * dax_direct_access() - translate a device pgoff to an absolute pfn
+ * @dax_dev: a dax_device instance representing the logical memory range
+ * @pgoff: offset in pages from the start of the device to translate
+ * @nr_pages: number of consecutive pages caller can handle relative to @pfn
+ * @kaddr: output parameter that returns a virtual address mapping of pfn
+ * @pfn: output parameter that returns an absolute pfn translation of @pgoff
+ *
+ * Return: negative errno if an error occurs, otherwise the number of
+ * pages accessible at the device relative @pgoff.
+ */
+long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
+   void **kaddr, pfn_t *pfn)
+{
+   long avail;
+
+   /*
+* The device driver is allowed to sleep, in order to make the
+* memory directly accessible.
+*/
+   might_sleep();
+
+   if (!dax_dev)
+   return -EOPNOTSUPP;
+
+   if (!dax_alive(dax_dev))
+   return -ENXIO;
+
+   if (nr_pages < 0)
+   return nr_pages;
+
+   avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
+   kaddr, pfn);
+   if (!avail)
+   return -ERANGE;
+   return min(avail, nr_pages);
+}
+EXPORT_SYMBOL_GPL(dax_direct_access);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
	lockdep_assert_held(&dax_srcu);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7f40ea2f0875..2f7885712575 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -762,6 +763,19 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
 }
 EXPORT_SYMBOL_GPL(bdev_direct_access);
 
+int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
+   pgoff_t *pgoff)
+{
+   phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+
+   if (pgoff)
+   *pgoff = PHYS_PFN(phys_off);
+   if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
+   return -EINVAL;
+   return 0;
+}
+EXPORT_SYMBOL(bdev_dax_pgoff);
+
 /**
  * bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f72708399b83..612c497d1461 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1958,6 +1958,7 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 extern int bdev_dax_supported(struct super_block *, int);
+int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 39a0312c45c3..7e62e280c11f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -27,6 +27,8 @@ void put_dax(struct dax_device *dax_dev);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
+long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
+   void **kaddr, pfn_t *pfn);
 
 /*
  * We use lowest available bit in exceptional entry for locking, one bit for



[PATCH v2 08/33] dcssblk: add dax_operations support

2017-04-14 Thread Dan Williams
Set up a dax_dev to have the same lifetime as the dcssblk block device
and add a ->direct_access() method that is equivalent to
dcssblk_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old dcssblk_direct_access() will be removed.

Cc: Gerald Schaefer 
Signed-off-by: Dan Williams 
---
 drivers/s390/block/Kconfig   |1 +
 drivers/s390/block/dcssblk.c |   54 +++---
 2 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 4a3b62326183..0acb8c2f9475 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
 
 config DCSSBLK
def_tristate m
+   select DAX
prompt "DCSSBLK support"
depends on S390 && BLOCK
help
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 415d10a67b7a..682a9eb4934d 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -30,8 +31,10 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static blk_qc_t dcssblk_make_request(struct request_queue *q,
struct bio *bio);
-static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+static long dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
 void **kaddr, pfn_t *pfn, long size);
+static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -40,7 +43,11 @@ static const struct block_device_operations dcssblk_devops = {
.owner  = THIS_MODULE,
.open   = dcssblk_open,
.release= dcssblk_release,
-   .direct_access  = dcssblk_direct_access,
+   .direct_access  = dcssblk_blk_direct_access,
+};
+
+static const struct dax_operations dcssblk_dax_ops = {
+   .direct_access = dcssblk_dax_direct_access,
 };
 
 struct dcssblk_dev_info {
@@ -57,6 +64,7 @@ struct dcssblk_dev_info {
struct request_queue *dcssblk_queue;
int num_of_segments;
struct list_head seg_list;
+   struct dax_device *dax_dev;
 };
 
 struct segment_info {
@@ -389,6 +397,8 @@ dcssblk_shared_store(struct device *dev, struct device_attribute *attr, const ch
}
	list_del(&dev_info->lh);
 
+   kill_dax(dev_info->dax_dev);
+   put_dax(dev_info->dax_dev);
del_gendisk(dev_info->gd);
blk_cleanup_queue(dev_info->dcssblk_queue);
dev_info->gd->queue = NULL;
@@ -525,6 +535,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+   struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -654,6 +665,11 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
if (rc)
goto put_dev;
 
+   dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name,
+   &dcssblk_dax_ops);
+   if (!dax_dev)
+   goto put_dev;
+
	get_device(&dev_info->dev);
	device_add_disk(&dev_info->dev, dev_info->gd);
 
@@ -752,6 +768,8 @@ dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch
}
 
	list_del(&dev_info->lh);
+   kill_dax(dev_info->dax_dev);
+   put_dax(dev_info->dax_dev);
del_gendisk(dev_info->gd);
blk_cleanup_queue(dev_info->dcssblk_queue);
dev_info->gd->queue = NULL;
@@ -883,21 +901,39 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
 }
 
 static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
+__dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   resource_size_t offset = pgoff * PAGE_SIZE;
+   unsigned long dev_sz;
+
+   dev_sz = dev_info->end - dev_info->start + 1;
+   *kaddr = (void *) dev_info->start + offset;
+   *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+
+   return (dev_sz - offset) / PAGE_SIZE;
+}
+
+static long
+dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
void **kaddr, pfn_t *pfn, long size)
 {
struct dcssblk_dev_info *dev_info;
-   unsigned long offset, dev_sz;
 
dev_info = bdev->bd_disk->private_data;
if (!dev_info)
return -ENODEV;
-   dev_sz = dev_info->end - dev_info->start + 1;
-   offset = secnum * 512;
-   *kaddr = (void *) dev_info->start + offset;
-   *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);

[PATCH v2 06/33] axon_ram: add dax_operations support

2017-04-14 Thread Dan Williams
Set up a dax_device to have the same lifetime as the axon_ram block
device and add a ->direct_access() method that is equivalent to
axon_ram_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old axon_ram_direct_access() will be removed.

Signed-off-by: Dan Williams 
---
 arch/powerpc/platforms/Kconfig |1 +
 arch/powerpc/sysdev/axonram.c  |   48 +++-
 2 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 7e3a2ebba29b..33244e3d9375 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -284,6 +284,7 @@ config CPM2
 config AXON_RAM
tristate "Axon DDR2 memory device driver"
depends on PPC_IBM_CELL_BLADE && BLOCK
+   select DAX
default m
help
  It registers one block device per Axon's DDR2 memory bank found
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index f523ac883150..ad857d5e81b1 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -25,6 +25,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -62,6 +63,7 @@ static int azfs_major, azfs_minor;
 struct axon_ram_bank {
struct platform_device  *device;
struct gendisk  *disk;
+   struct dax_device   *dax_dev;
unsigned intirq_id;
unsigned long   ph_addr;
unsigned long   io_addr;
@@ -137,25 +139,47 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
return BLK_QC_T_NONE;
 }
 
+static long
+__axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long nr_pages,
+  void **kaddr, pfn_t *pfn)
+{
+   resource_size_t offset = pgoff * PAGE_SIZE;
+
+   *kaddr = (void *) bank->io_addr + offset;
+   *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+   return (bank->size - offset) / PAGE_SIZE;
+}
+
 /**
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
 static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
+axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
   void **kaddr, pfn_t *pfn, long size)
 {
struct axon_ram_bank *bank = device->bd_disk->private_data;
-   loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
 
-   *kaddr = (void *) bank->io_addr + offset;
-   *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
-   return bank->size - offset;
+   return __axon_ram_direct_access(bank, (sector * 512) / PAGE_SIZE,
+   size / PAGE_SIZE, kaddr, pfn) * PAGE_SIZE;
 }
 
 static const struct block_device_operations axon_ram_devops = {
.owner  = THIS_MODULE,
-   .direct_access  = axon_ram_direct_access
+   .direct_access  = axon_ram_blk_direct_access
+};
+
+static long
+axon_ram_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
+  void **kaddr, pfn_t *pfn)
+{
+   struct axon_ram_bank *bank = dax_get_private(dax_dev);
+
+   return __axon_ram_direct_access(bank, pgoff, nr_pages, kaddr, pfn);
+}
+
+static const struct dax_operations axon_ram_dax_ops = {
+   .direct_access = axon_ram_dax_direct_access,
 };
 
 /**
@@ -219,6 +243,7 @@ static int axon_ram_probe(struct platform_device *device)
goto failed;
}
 
+
bank->disk->major = azfs_major;
bank->disk->first_minor = azfs_minor;
	bank->disk->fops = &axon_ram_devops;
@@ -227,6 +252,11 @@ static int axon_ram_probe(struct platform_device *device)
sprintf(bank->disk->disk_name, "%s%d",
AXON_RAM_DEVICE_NAME, axon_ram_bank_id);
 
+   bank->dax_dev = alloc_dax(bank, bank->disk->disk_name,
+   &axon_ram_dax_ops);
+   if (!bank->dax_dev)
+   goto failed;
+
bank->disk->queue = blk_alloc_queue(GFP_KERNEL);
if (bank->disk->queue == NULL) {
		dev_err(&device->dev, "Cannot register disk queue\n");
@@ -278,6 +308,10 @@ static int axon_ram_probe(struct platform_device *device)
del_gendisk(bank->disk);
put_disk(bank->disk);
}
+   if (bank->dax_dev) {
+   kill_dax(bank->dax_dev);
+   put_dax(bank->dax_dev);
+   }
device->dev.platform_data = NULL;
if (bank->io_addr != 0)
iounmap((void __iomem *) bank->io_addr);
@@ -300,6 +334,8 @@ axon_ram_remove(struct platform_device *device)
 
	device_remove_file(&device->dev, &dev_attr_ecc);
free_irq(bank->irq_id, device);
+   kill_dax(bank->dax_dev);
+   put_dax(bank->dax_dev);
del_gendisk(bank->disk);
 

[PATCH v2 05/33] pmem: add dax_operations support

2017-04-14 Thread Dan Williams
Setup a dax_device to have the same lifetime as the pmem block device
and add a ->direct_access() method that is equivalent to
pmem_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old pmem_direct_access() will be removed.

Signed-off-by: Dan Williams 
---
 drivers/dax/dax.h   |7 
 drivers/nvdimm/Kconfig  |1 +
 drivers/nvdimm/pmem.c   |   61 +++
 drivers/nvdimm/pmem.h   |7 +++-
 include/linux/dax.h |6 
 tools/testing/nvdimm/pmem-dax.c |   21 ++---
 6 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index 617bbc24be2b..f9e5feea742c 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,13 +13,6 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_device;
-struct dax_operations;
-struct dax_device *alloc_dax(void *private, const char *host,
-   const struct dax_operations *ops);
-void put_dax(struct dax_device *dax_dev);
-bool dax_alive(struct dax_device *dax_dev);
-void kill_dax(struct dax_device *dax_dev);
 struct dax_device *inode_dax(struct inode *inode);
 struct inode *dax_inode(struct dax_device *dax_dev);
-void *dax_get_private(struct dax_device *dax_dev);
 #endif /* __DAX_H__ */
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 59e750183b7f..5bdd499b5f4f 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -20,6 +20,7 @@ if LIBNVDIMM
 config BLK_DEV_PMEM
tristate "PMEM: Persistent memory block device support"
default LIBNVDIMM
+   select DAX
select ND_BTT if BTT
select ND_PFN if NVDIMM_PFN
help
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5b536be5a12e..fbbcf8154eec 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "pmem.h"
 #include "pfn.h"
@@ -199,13 +200,13 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
-__weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, pfn_t *pfn, long size)
+__weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
 {
-   struct pmem_device *pmem = bdev->bd_queue->queuedata;
-   resource_size_t offset = sector * 512 + pmem->data_offset;
+   resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
 
-	if (unlikely(is_bad_pmem(&pmem->bb, sector, size)))
+	if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
+					PFN_PHYS(nr_pages))))
return -EIO;
*kaddr = pmem->virt_addr + offset;
*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
@@ -215,26 +216,51 @@ __weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
 * requested range.
 */
if (unlikely(pmem->bb.count))
-   return size;
-   return pmem->size - pmem->pfn_pad - offset;
+   return nr_pages;
+   return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
+}
+
+static long pmem_blk_direct_access(struct block_device *bdev, sector_t sector,
+   void **kaddr, pfn_t *pfn, long size)
+{
+   struct pmem_device *pmem = bdev->bd_queue->queuedata;
+
+   return __pmem_direct_access(pmem, PHYS_PFN(sector * 512),
+   PHYS_PFN(size), kaddr, pfn);
 }
 
 static const struct block_device_operations pmem_fops = {
.owner =THIS_MODULE,
.rw_page =  pmem_rw_page,
-   .direct_access =pmem_direct_access,
+   .direct_access =pmem_blk_direct_access,
.revalidate_disk =  nvdimm_revalidate_disk,
 };
 
+static long pmem_dax_direct_access(struct dax_device *dax_dev,
+   pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   struct pmem_device *pmem = dax_get_private(dax_dev);
+
+   return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
+}
+
+static const struct dax_operations pmem_dax_ops = {
+   .direct_access = pmem_dax_direct_access,
+};
+
 static void pmem_release_queue(void *q)
 {
blk_cleanup_queue(q);
 }
 
-static void pmem_release_disk(void *disk)
+static void pmem_release_disk(void *__pmem)
 {
-   del_gendisk(disk);
-   put_disk(disk);
+   struct pmem_device *pmem = __pmem;
+
+   kill_dax(pmem->dax_dev);
+   put_dax(pmem->dax_dev);
+   del_gendisk(pmem->disk);
+   put_disk(pmem->disk);
 }
 
 static int pmem_attach_disk(struct device *dev,
@@ -245,6 +271,7 @@ static int pmem_attach_disk(struct device *dev,
struct vmem_altmap __altmap, *altmap = NULL;
	struct resource *res = &nsio->res;
struct nd_pfn *nd_pfn = NULL;
+ 
