Re: [PATCH 06/38] Annotate hardware config module parameters in drivers/clocksource/
Thomas Gleixner wrote:

> > Btw, is it possible to use IRQ grants to prevent a device that has limited
> > IRQ options from being drivable?
>
> What do you mean with 'IRQ grants' ?

request_irq().

David
Re: [PATCH 06/38] Annotate hardware config module parameters in drivers/clocksource/
On Fri, 14 Apr 2017, David Howells wrote:
> Thomas Gleixner wrote:
>
> > > -module_param_named(irq, timer_irq, int, 0644);
> > > +module_param_hw_named(irq, timer_irq, int, irq, 0644);
> > >  MODULE_PARM_DESC(irq, "Which IRQ to use for the clock source MFGPT
> > >  ticks.");
> >
> > I'm not sure about this. AFAIR the parameter is required to work on
> > anything else than some arbitrary hardware which has it mapped to 0.
>
> Should it then be set through in-kernel platform initialisation since the
> AMD Geode is an embedded chip?

I think so.

> Btw, is it possible to use IRQ grants to prevent a device that has limited IRQ
> options from being drivable?

What do you mean with 'IRQ grants' ?

Thanks

	tglx
Re: [tip:x86/cpu 8/12] arch/x86/kernel/cpu/intel_rdt.c:63: error: unknown field 'cache' specified in initializer
On Sat, 15 Apr 2017, kbuild test robot wrote:

> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/cpu
> head:   64e8ed3d4a6dcd6139a869a3e760e625cb0d3022
> commit: 05b93417ce5b924c6652de19fdcc27439ab37c90 [8/12] x86/intel_rdt/mba:
> Add primary support for Memory Bandwidth Allocation (MBA)
> config: x86_64-randconfig-s0-04150438 (attached as .config)
> compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
> reproduce:
>         git checkout 05b93417ce5b924c6652de19fdcc27439ab37c90
>         # save the attached .config to linux build tree
>         make ARCH=x86_64

That's weird.

> c1c7c3f9 Fenghua Yu      2016-10-22  57  {
> c1c7c3f9 Fenghua Yu      2016-10-22  58  	.name = "L3",
> c1c7c3f9 Fenghua Yu      2016-10-22  59  	.domains = domain_init(RDT_RESOURCE_L3),
> c1c7c3f9 Fenghua Yu      2016-10-22  60  	.msr_base = IA32_L3_CBM_BASE,
> 0921c547 Thomas Gleixner 2017-04-14  61  	.msr_update = cat_wrmsr,
> c1c7c3f9 Fenghua Yu      2016-10-22  62  	.cache_level = 3,
> d3e11b4d Thomas Gleixner 2017-04-14 @63  	.cache = {
> d3e11b4d Thomas Gleixner 2017-04-14 @64  		.min_cbm_bits = 1,
> d3e11b4d Thomas Gleixner 2017-04-14 @65  		.cbm_idx_mult = 1,
> d3e11b4d Thomas Gleixner 2017-04-14  66  		.cbm_idx_offset = 0,
> d3e11b4d Thomas Gleixner 2017-04-14  67  	},
> c1c7c3f9 Fenghua Yu      2016-10-22  68  },
>
> :: The code at line 63 was first introduced by commit
> :: d3e11b4d6ffd363747ac6e6b5522baa9ca5a20c0 x86/intel_rdt: Move CBM
> specific data into a struct
>

So the compiler fails to handle the anon union, which was introduced in
05b93417ce5b924. No idea why, but that concept is not new and widely used in
the kernel already.

Thanks,

	tglx
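The shape of the failing initializer can be reproduced outside the kernel with a small sketch. The names below (`demo_resource`, `demo_cache`, `demo_membw`, `demo_l3`) are hypothetical stand-ins and not the definitions from `intel_rdt.c`; only the pattern is carried over: a designated initializer that names a member of an anonymous union. Current compilers accept this, while GCC releases as old as the 4.4 used by the robot are generally understood to reject such initializers, which would match the reported error.

```c
#include <string.h>

/* Hypothetical stand-ins for the cache and bandwidth specific data. */
struct demo_cache {
	unsigned int min_cbm_bits;
	unsigned int cbm_idx_mult;
	unsigned int cbm_idx_offset;
};

struct demo_membw {
	unsigned int max_delay;
};

struct demo_resource {
	const char *name;
	int cache_level;
	/* Anonymous union: members are addressed as res.cache / res.membw. */
	union {
		struct demo_cache cache;
		struct demo_membw membw;
	};
};

/* The designated initializer names a member of the anonymous union. */
static const struct demo_resource demo_l3 = {
	.name		= "L3",
	.cache_level	= 3,
	.cache = {
		.min_cbm_bits	= 1,
		.cbm_idx_mult	= 1,
		.cbm_idx_offset	= 0,
	},
};

/* Returns 1 when the initializer filled the fields as written above. */
static int demo_l3_is_initialized(void)
{
	return strcmp(demo_l3.name, "L3") == 0 &&
	       demo_l3.cache_level == 3 &&
	       demo_l3.cache.min_cbm_bits == 1 &&
	       demo_l3.cache.cbm_idx_mult == 1 &&
	       demo_l3.cache.cbm_idx_offset == 0;
}
```

Giving the union a name, or assigning the fields at run time, are the usual ways to sidestep the limitation; whether either fits the real code is a separate question.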
Re: [PATCH] remove return statement
On Sat, 2017-04-15 at 10:35 +0530, surenderpolsani wrote:
> staging : rtl8188e : remove return in void function

Your patch subject isn't correct.

It should be something like:

Subject: [PATCH] staging: rtl8188e: Remove void function return

> kernel coding style doesn't allow the return statement
> in void function.
[]
> diff --git a/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c
> b/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c
[]
> @@ -165,7 +165,6 @@ void rtw_hal_dm_watchdog(struct adapter *Adapter)
>  skip_dm:
>  	/* Check GPIO to determine current RF on/off and Pbc status. */
>  	/* Check Hardware Radio ON/OFF or not */
> -	return;

And the comments?

Are those supposed to be reminders of code to write?
[PATCH] remove return statement
staging : rtl8188e : remove return in void function

kernel coding style doesn't allow the return statement
in void function.

Signed-off-by: surenderpolsani
---
 drivers/staging/rtl8188eu/hal/rtl8188e_dm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c b/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c
index d04b7fb..6db0e19 100644
--- a/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c
+++ b/drivers/staging/rtl8188eu/hal/rtl8188e_dm.c
@@ -165,7 +165,6 @@ void rtw_hal_dm_watchdog(struct adapter *Adapter)
 skip_dm:
 	/* Check GPIO to determine current RF on/off and Pbc status. */
 	/* Check Hardware Radio ON/OFF or not */
-	return;
 }

 void rtw_hal_dm_init(struct adapter *Adapter)
--
1.9.1
Re: [PATCH 08/22] crypto: chcr: Make use of the new sg_map helper function
On Fri, Apr 14, 2017 at 3:35 AM, Logan Gunthorpe wrote:
> The get_page in this area looks *highly* suspect due to there being no
> corresponding put_page. However, I've left that as is to avoid breaking
> things.

The chcr driver posts the request to the LLD driver cxgb4, and put_page is
implemented there, so it will do no harm. Anyhow, we have removed the code
below from the driver:

http://www.mail-archive.com/linux-crypto@vger.kernel.org/msg24561.html

After this merge we can ignore your patch. Thanks

>
> I've also removed the KMAP_ATOMIC_ARGS check as it appears to be dead
> code that dates back to when it was first committed...
>
> Signed-off-by: Logan Gunthorpe
> ---
>  drivers/crypto/chelsio/chcr_algo.c | 28 +++-
>  1 file changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/crypto/chelsio/chcr_algo.c
> b/drivers/crypto/chelsio/chcr_algo.c
> index 41bc7f4..a993d1d 100644
> --- a/drivers/crypto/chelsio/chcr_algo.c
> +++ b/drivers/crypto/chelsio/chcr_algo.c
> @@ -1489,22 +1489,21 @@ static struct sk_buff *create_authenc_wr(struct aead_request *req,
>  	return ERR_PTR(-EINVAL);
>  }
>
> -static void aes_gcm_empty_pld_pad(struct scatterlist *sg,
> -				  unsigned short offset)
> +static int aes_gcm_empty_pld_pad(struct scatterlist *sg,
> +				 unsigned short offset)
>  {
> -	struct page *spage;
>  	unsigned char *addr;
>
> -	spage = sg_page(sg);
> -	get_page(spage); /* so that it is not freed by NIC */
> -#ifdef KMAP_ATOMIC_ARGS
> -	addr = kmap_atomic(spage, KM_SOFTIRQ0);
> -#else
> -	addr = kmap_atomic(spage);
> -#endif
> -	memset(addr + sg->offset, 0, offset + 1);
> +	get_page(sg_page(sg)); /* so that it is not freed by NIC */
> +
> +	addr = sg_map(sg, SG_KMAP_ATOMIC);
> +	if (IS_ERR(addr))
> +		return PTR_ERR(addr);
> +
> +	memset(addr, 0, offset + 1);
> +	sg_unmap(sg, addr, SG_KMAP_ATOMIC);
>
> -	kunmap_atomic(addr);
> +	return 0;
>  }
>
>  static int set_msg_len(u8 *block, unsigned int msglen, int csize)
> @@ -1940,7 +1939,10 @@ static struct sk_buff *create_gcm_wr(struct aead_request *req,
>  	if (req->cryptlen) {
>  		write_sg_to_skb(skb, , src, req->cryptlen);
>  	} else {
> -		aes_gcm_empty_pld_pad(req->dst, authsize - 1);
> +		err = aes_gcm_empty_pld_pad(req->dst, authsize - 1);
> +		if (err)
> +			goto dstmap_fail;
> +
>  		write_sg_to_skb(skb, , reqctx->dst, crypt_len);
>
>  	}
> --
> 2.1.4
>
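The quoted patch turns `aes_gcm_empty_pld_pad()` from `void` into `int` so that a mapping failure, reported as an error-encoded pointer, can be handed back to the caller. A minimal user-space sketch of that idiom follows. Everything prefixed `demo_` is hypothetical: the three small helpers only imitate the kernel's `ERR_PTR()`, `IS_ERR()` and `PTR_ERR()`, and `demo_map()` merely stands in for the proposed `sg_map()` without claiming its real signature.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, in the spirit of ERR_PTR(). */
static void *demo_err_ptr(intptr_t error)
{
	return (void *)error;
}

/* Recover the errno value from an error pointer, in the spirit of PTR_ERR(). */
static intptr_t demo_ptr_err(const void *ptr)
{
	return (intptr_t)ptr;
}

/* True if the pointer lies in the top DEMO_MAX_ERRNO addresses. */
static int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Hypothetical mapping helper: returns the buffer, or an error pointer. */
static unsigned char *demo_map(unsigned char *buf, int simulate_failure)
{
	if (simulate_failure)
		return demo_err_ptr(-EINVAL);
	return buf;
}

/* Zero offset + 1 bytes, propagating a mapping failure to the caller. */
static int demo_empty_pld_pad(unsigned char *buf, unsigned short offset,
			      int simulate_failure)
{
	unsigned char *addr = demo_map(buf, simulate_failure);

	if (demo_is_err(addr))
		return (int)demo_ptr_err(addr);

	memset(addr, 0, offset + 1);
	return 0;
}
```

The caller then checks the integer result and bails out on a non-zero value, much like the `goto dstmap_fail` in the quoted hunk.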
[PATCH] dt-bindings: input: rotary-encoder: fix typo
s/rollove/rollover/

Signed-off-by: Rahul Bedarkar
---
 Documentation/devicetree/bindings/input/rotary-encoder.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/input/rotary-encoder.txt b/Documentation/devicetree/bindings/input/rotary-encoder.txt
index e85ce3d..f99fe5c 100644
--- a/Documentation/devicetree/bindings/input/rotary-encoder.txt
+++ b/Documentation/devicetree/bindings/input/rotary-encoder.txt
@@ -12,7 +12,7 @@ Optional properties:
 - rotary-encoder,relative-axis: register a relative axis rather than an
   absolute one. Relative axis will only generate +1/-1 events on the input
   device, hence no steps need to be passed.
-- rotary-encoder,rollover: Automatic rollove when the rotary value becomes
+- rotary-encoder,rollover: Automatic rollover when the rotary value becomes
   greater than the specified steps or smaller than 0. For absolute axis only.
 - rotary-encoder,steps-per-period: Number of steps (stable states) per period.
   The values have the following meaning:
--
2.7.4
Re: [PATCH] clocksource: Use GENMASK_ULL in definition of CLOCKSOURCE_MASK
Hi Matthias,

On Tue, Apr 11, 2017 at 12:17 PM, Matthias Kaehlcke wrote:
> Besides reusing existing code this removes the special case handling
> for 64-bit masks, which causes clang to raise a shift count overflow
> warning due to https://bugs.llvm.org//show_bug.cgi?id=10030.
>
> Suggested-by: Dmitry Torokhov
> Signed-off-by: Matthias Kaehlcke
> ---
>  include/linux/clocksource.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index cfc75848a35d..06e604b9e9dc 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -120,7 +120,7 @@ struct clocksource {
>  #define CLOCK_SOURCE_RESELECT 0x100
>
>  /* simplify initialization of mask field */
> -#define CLOCKSOURCE_MASK(bits) (u64)((bits) < 64 ? ((1ULL<<(bits))-1) : -1)
> +#define CLOCKSOURCE_MASK(bits) (u64)GENMASK_ULL((bits) - 1, 0)

I do not think cast to u64 is needed for GENMASK_ULL.

>
>  static inline u32 clocksource_freq2mult(u32 freq, u32 shift_constant, u64 from)
>  {
> --
> 2.12.2.715.g7642488e1d-goog
>

Thanks,
Dmitry
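To see why the old macro needed its special case and what the replacement computes, a user-space sketch may help. `demo_genmask_ull()` is a hypothetical re-implementation written for illustration, modelled on the documented meaning of `GENMASK_ULL(h, l)` (all bits from `l` up to and including `h` set); it is not copied from the kernel headers. `demo_mask_old()` mirrors the removed macro body.

```c
#include <stdint.h>

/* Set bits l..h inclusive; valid for 0 <= l <= h <= 63. */
static uint64_t demo_genmask_ull(unsigned int h, unsigned int l)
{
	return (~0ULL << l) & (~0ULL >> (63 - h));
}

/* Old form: needs the bits < 64 test because 1ULL << 64 is undefined. */
static uint64_t demo_mask_old(unsigned int bits)
{
	return bits < 64 ? ((1ULL << bits) - 1) : ~0ULL;
}

/* New form: no special case; valid for 1 <= bits <= 64. */
static uint64_t demo_mask_new(unsigned int bits)
{
	return demo_genmask_ull(bits - 1, 0);
}

/* Returns 1 if both forms agree for every width from 1 to 64. */
static int demo_masks_agree(void)
{
	unsigned int bits;

	for (bits = 1; bits <= 64; bits++)
		if (demo_mask_old(bits) != demo_mask_new(bits))
			return 0;
	return 1;
}
```

Note that the new form has no defined result for a width of 0, which the old ternary handled; the sketch therefore only compares widths 1 through 64. Since the mask expression is already an unsigned 64-bit value, an additional cast would not change it, which appears to be the point of the review comment.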
Re: [PATCH 3/4] of: be consistent in form of file mode
Adding Stephen.

On 04/14/17 20:55, frowand.l...@gmail.com wrote:
> From: Frank Rowand
>
> checkpatch whined about using S_IRUGO instead of octal equivalent
> when adding phandle sysfs code, so used octal in that patch.
> Change other instances of the S_* constants in the same file to
> the octal form.
>
> Signed-off-by: Frank Rowand
> ---
>  drivers/of/base.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 197946615503..4a8bd9623140 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -168,7 +168,7 @@ int __of_add_property_sysfs(struct device_node *np, struct property *pp)
>
>  	sysfs_bin_attr_init(>attr);
>  	pp->attr.attr.name = safe_name(>kobj, pp->name);
> -	pp->attr.attr.mode = secure ? S_IRUSR : S_IRUGO;
> +	pp->attr.attr.mode = secure ? 0400 : 0444;
>  	pp->attr.size = secure ? 0 : pp->length;
>  	pp->attr.read = of_node_property_read;
>
>
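For readers less familiar with the symbolic names, the sketch below checks the octal equivalents in user space. `DEMO_S_IRUGO` is defined here on the assumption that the kernel-only `S_IRUGO` is the OR of the three read bits; the two `demo_property_mode_*()` helpers are hypothetical and simply mirror the ternary before and after the patch.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Assumed equivalent of the kernel-only S_IRUGO: read for user, group, other. */
#define DEMO_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)

/* Symbolic form, as before the patch. */
static mode_t demo_property_mode_symbolic(int secure)
{
	return secure ? S_IRUSR : DEMO_S_IRUGO;
}

/* Octal form, as after the patch. */
static mode_t demo_property_mode_octal(int secure)
{
	return secure ? 0400 : 0444;
}
```

Both helpers yield the same mode for either value of `secure`, so the change is purely one of notation.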
Re: [PATCH 4/4] of: detect invalid phandle in overlay
Adding Stephen.

On 04/14/17 20:55, frowand.l...@gmail.com wrote:
> From: Frank Rowand
>
> Overlays are not allowed to modify phandle values of previously existing
> nodes because there is no information available to allow fixup up
> properties that use the previously existing phandle.
>
> Signed-off-by: Frank Rowand
> ---
>  drivers/of/overlay.c | 4
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> index ca0b85f5deb1..20ab49d2f7a4 100644
> --- a/drivers/of/overlay.c
> +++ b/drivers/of/overlay.c
> @@ -130,6 +130,10 @@ static int of_overlay_apply_single_device_node(struct of_overlay *ov,
>  	/* NOTE: Multiple mods of created nodes not supported */
>  	tchild = of_get_child_by_name(target, cname);
>  	if (tchild != NULL) {
> +		/* new overlay phandle value conflicts with existing value */
> +		if (child->phandle)
> +			return -EINVAL;
> +
>  		/* apply overlay recursively */
>  		ret = of_overlay_apply_one(ov, tchild, child);
>  		of_node_put(tchild);
>
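The rule added by this hunk can be sketched in isolation. The `demo_node` type and both helpers below are hypothetical and heavily simplified; they are not the kernel's `struct device_node` or its lookup API, and reference counting is left out entirely. The sketch only shows the decision: an overlay child that carries a phandle is rejected when a node of the same name already exists in the target.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified node; not the kernel's struct device_node. */
struct demo_node {
	const char *name;
	unsigned int phandle;		/* 0 means "no phandle" */
	struct demo_node *child;	/* first child */
	struct demo_node *sibling;	/* next sibling */
};

/* Linear search of the parent's children by name. */
static struct demo_node *demo_get_child_by_name(struct demo_node *parent,
						const char *name)
{
	struct demo_node *n;

	for (n = parent->child; n != NULL; n = n->sibling)
		if (strcmp(n->name, name) == 0)
			return n;
	return NULL;
}

/*
 * Returns -EINVAL when the overlay child carries a phandle but a node of the
 * same name already exists in the target; returns 0 otherwise.
 */
static int demo_check_overlay_child(struct demo_node *target,
				    const struct demo_node *child)
{
	struct demo_node *tchild = demo_get_child_by_name(target, child->name);

	if (tchild != NULL && child->phandle)
		return -EINVAL;
	return 0;
}
```

An overlay child without a phandle, or one that creates a new node, passes the check.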
Re: [PATCH 2/4] of: make __of_attach_node() static
Adding Stephen.

On 04/14/17 20:55, frowand.l...@gmail.com wrote:
> From: Frank Rowand
>
> __of_attach_node() is not used outside of drivers/of/dynamic.c. Make
> it static and remove it from drivers/of/of_private.h.
>
> Signed-off-by: Frank Rowand
> ---
>  drivers/of/dynamic.c    | 2 +-
>  drivers/of/of_private.h | 1 -
>  2 files changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index c6fd3f32bfcb..74aafe594ad5 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -216,7 +216,7 @@ int of_property_notify(int action, struct device_node *np,
>  	return of_reconfig_notify(action, );
>  }
>
> -void __of_attach_node(struct device_node *np)
> +static void __of_attach_node(struct device_node *np)
>  {
>  	np->child = NULL;
>  	np->sibling = np->parent->child;
> diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
> index 18bbb4517e25..efcedcff7dba 100644
> --- a/drivers/of/of_private.h
> +++ b/drivers/of/of_private.h
> @@ -78,7 +78,6 @@ extern int __of_update_property(struct device_node *np,
>  extern void __of_update_property_sysfs(struct device_node *np,
>  		struct property *newprop, struct property *oldprop);
>
> -extern void __of_attach_node(struct device_node *np);
>  extern int __of_attach_node_sysfs(struct device_node *np);
>  extern void __of_detach_node(struct device_node *np);
>  extern void __of_detach_node_sysfs(struct device_node *np);
>
Re: [PATCH 1/4] of: remove *phandle properties from expanded device tree
Adding Stephen.

On 04/14/17 20:55, frowand.l...@gmail.com wrote:
> From: Frank Rowand
>
> Remove "phandle" and "linux,phandle" properties from the internal
> device tree. The phandle will still be in the struct device_node
> phandle field.
>
> This is to resolve the issue found by Stephen Boyd [1] when he changed
> the type of struct property.value from void * to const void *. As
> a result of the type change, the overlay code had compile errors
> where the resolver updates phandle values.
>
> [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html
>
> - Add sysfs infrastructure to report np->phandle, as if it was a property.
> - Do not create "phandle" "ibm,phandle", and "linux,phandle" properties
>   in the expanded device tree.
> - Remove no longer needed checks to exclude "phandle" and "linux,phandle"
>   properties in several locations.
> - A side effect of these changes is that the obsolete "linux,phandle"
>   properties will no longer appear in /proc/device-tree
>
> Signed-off-by: Frank Rowand
> ---
>  drivers/of/base.c     | 51 ---
>  drivers/of/dynamic.c  | 29 -
>  drivers/of/fdt.c      | 40
>  drivers/of/overlay.c  |  4 +---
>  drivers/of/resolver.c | 23 +--
>  include/linux/of.h    |  1 +
>  6 files changed, 91 insertions(+), 57 deletions(-)
>
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index d7c4629a3a2d..197946615503 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -116,6 +116,19 @@ static ssize_t of_node_property_read(struct file *filp, struct kobject *kobj,
>  	return memory_read_from_buffer(buf, count, , pp->value, pp->length);
>  }
>
> +static ssize_t of_node_phandle_read(struct file *filp, struct kobject *kobj,
> +				    struct bin_attribute *bin_attr, char *buf,
> +				    loff_t offset, size_t count)
> +{
> +	phandle phandle;
> +	struct device_node *np;
> +
> +	np = container_of(bin_attr, struct device_node, attr_phandle);
> +	phandle = cpu_to_be32(np->phandle);
> +	return memory_read_from_buffer(buf, count, , ,
> +				       sizeof(phandle));
> +}
> +
>  /* always return newly allocated name, caller must free after use */
>  static const char *safe_name(struct kobject *kobj, const char *orig_name)
>  {
> @@ -164,6 +177,38 @@ int __of_add_property_sysfs(struct device_node *np, struct property *pp)
>  	return rc;
>  }
>
> +/*
> + * In the imported device tree (fdt), phandle is a property. In the
> + * internal data structure it is instead stored in the struct device_node.
> + * Make phandle visible in sysfs as if it was a property.
> + */
> +static int __of_add_phandle_sysfs(struct device_node *np)
> +{
> +	int rc;
> +
> +	if (IS_ENABLED(CONFIG_PPC_PSERIES))
> +		return 0;
> +
> +	if (!IS_ENABLED(CONFIG_SYSFS))
> +		return 0;
> +
> +	if (!of_kset || !of_node_is_attached(np))
> +		return 0;
> +
> +	if (!np->phandle || np->phandle == 0x)
> +		return 0;
> +
> +	sysfs_bin_attr_init(>attr);
> +	np->attr_phandle.attr.name = "phandle";
> +	np->attr_phandle.attr.mode = 0444;
> +	np->attr_phandle.size = sizeof(np->phandle);
> +	np->attr_phandle.read = of_node_phandle_read;
> +
> +	rc = sysfs_create_bin_file(>kobj, >attr_phandle);
> +	WARN(rc, "error adding attribute phandle to node %s\n", np->full_name);
> +	return rc;
> +}
> +
>  int __of_attach_node_sysfs(struct device_node *np)
>  {
>  	const char *name;
> @@ -193,6 +238,8 @@ int __of_attach_node_sysfs(struct device_node *np)
>  	if (rc)
>  		return rc;
>
> +	__of_add_phandle_sysfs(np);
> +
>  	for_each_property_of_node(np, pp)
>  		__of_add_property_sysfs(np, pp);
>
> @@ -2097,9 +2144,7 @@ void of_alias_scan(void * (*dt_alloc)(u64 size, u64 align))
>  		int id, len;
>
>  		/* Skip those we do not want to proceed */
> -		if (!strcmp(pp->name, "name") ||
> -		    !strcmp(pp->name, "phandle") ||
> -		    !strcmp(pp->name, "linux,phandle"))
> +		if (!strcmp(pp->name, "name"))
>  			continue;
>
>  		np = of_find_node_by_path(pp->value);
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index 888fdbc09992..c6fd3f32bfcb 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -218,19 +218,6 @@ int of_property_notify(int action, struct device_node *np,
>
>  void __of_attach_node(struct device_node *np)
>  {
> -	const __be32 *phandle;
> -	int sz;
> -
> -	np->name = __of_get_property(np, "name", NULL) ? : "";
> -	np->type = __of_get_property(np, "device_type", NULL) ? : "";
> -
> -	phandle = __of_get_property(np, "phandle", );
> -	if (!phandle)
> -
Re: [PATCH 0/4] of: remove *phandle properties from expanded device tree
Hi Stephen,

I left you off the distribution list, sorry...

On 04/14/17 20:55, frowand.l...@gmail.com wrote:
> From: Frank Rowand
>
> Remove "phandle" and "linux,phandle" properties from the internal
> device tree.  The phandle will still be in the struct device_node
> phandle field.
>
> This is to resolve the issue found by Stephen Boyd [1] when he changed
> the type of struct property.value from void * to const void *.  As
> a result of the type change, the overlay code had compile errors
> where the resolver updates phandle values.
>
>   [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html
>
> Patch 1 is the phandle related changes.
>
> Patches 2 - 4 are minor fixups for issues that became visible
> while implementing patch 1.
>
> Frank Rowand (4):
>   of: remove *phandle properties from expanded device tree
>   of: make __of_attach_node() static
>   of: be consistent in form of file mode
>   of: detect invalid phandle in overlay
>
>  drivers/of/base.c       | 53
>  drivers/of/dynamic.c    | 31
>  drivers/of/fdt.c        | 40
>  drivers/of/of_private.h |  1
>  drivers/of/overlay.c    |  8
>  drivers/of/resolver.c   | 23
>  include/linux/of.h      |  1
>  7 files changed, 97 insertions(+), 60 deletions(-)
>
[PATCH 1/4] of: remove *phandle properties from expanded device tree
From: Frank Rowand

Remove "phandle" and "linux,phandle" properties from the internal
device tree.  The phandle will still be in the struct device_node
phandle field.

This is to resolve the issue found by Stephen Boyd [1] when he changed
the type of struct property.value from void * to const void *.  As
a result of the type change, the overlay code had compile errors
where the resolver updates phandle values.

  [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html

- Add sysfs infrastructure to report np->phandle, as if it was a property.
- Do not create "phandle", "ibm,phandle", and "linux,phandle" properties
  in the expanded device tree.
- Remove no longer needed checks to exclude "phandle" and "linux,phandle"
  properties in several locations.
- A side effect of these changes is that the obsolete "linux,phandle"
  properties will no longer appear in /proc/device-tree

Signed-off-by: Frank Rowand
---
 drivers/of/base.c     | 51
 drivers/of/dynamic.c  | 29
 drivers/of/fdt.c      | 40
 drivers/of/overlay.c  |  4
 drivers/of/resolver.c | 23
 include/linux/of.h    |  1
 6 files changed, 91 insertions(+), 57 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index d7c4629a3a2d..197946615503 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -116,6 +116,19 @@ static ssize_t of_node_property_read(struct file *filp, struct kobject *kobj,
 	return memory_read_from_buffer(buf, count, &offset, pp->value, pp->length);
 }
 
+static ssize_t of_node_phandle_read(struct file *filp, struct kobject *kobj,
+				struct bin_attribute *bin_attr, char *buf,
+				loff_t offset, size_t count)
+{
+	phandle phandle;
+	struct device_node *np;
+
+	np = container_of(bin_attr, struct device_node, attr_phandle);
+	phandle = cpu_to_be32(np->phandle);
+	return memory_read_from_buffer(buf, count, &offset, &phandle,
+				       sizeof(phandle));
+}
+
 /* always return newly allocated name, caller must free after use */
 static const char *safe_name(struct kobject *kobj, const char *orig_name)
 {
@@ -164,6 +177,38 @@ int __of_add_property_sysfs(struct device_node *np, struct property *pp)
 	return rc;
 }
 
+/*
+ * In the imported device tree (fdt), phandle is a property.  In the
+ * internal data structure it is instead stored in the struct device_node.
+ * Make phandle visible in sysfs as if it was a property.
+ */
+static int __of_add_phandle_sysfs(struct device_node *np)
+{
+	int rc;
+
+	if (IS_ENABLED(CONFIG_PPC_PSERIES))
+		return 0;
+
+	if (!IS_ENABLED(CONFIG_SYSFS))
+		return 0;
+
+	if (!of_kset || !of_node_is_attached(np))
+		return 0;
+
+	if (!np->phandle || np->phandle == 0x)
+		return 0;
+
+	sysfs_bin_attr_init(>attr);
+	np->attr_phandle.attr.name = "phandle";
+	np->attr_phandle.attr.mode = 0444;
+	np->attr_phandle.size = sizeof(np->phandle);
+	np->attr_phandle.read = of_node_phandle_read;
+
+	rc = sysfs_create_bin_file(&np->kobj, &np->attr_phandle);
+	WARN(rc, "error adding attribute phandle to node %s\n", np->full_name);
+	return rc;
+}
+
 int __of_attach_node_sysfs(struct device_node *np)
 {
 	const char *name;
@@ -193,6 +238,8 @@ int __of_attach_node_sysfs(struct device_node *np)
 	if (rc)
 		return rc;
 
+	__of_add_phandle_sysfs(np);
+
 	for_each_property_of_node(np, pp)
 		__of_add_property_sysfs(np, pp);
 
@@ -2097,9 +2144,7 @@ void of_alias_scan(void * (*dt_alloc)(u64 size, u64 align))
 		int id, len;
 
 		/* Skip those we do not want to proceed */
-		if (!strcmp(pp->name, "name") ||
-		    !strcmp(pp->name, "phandle") ||
-		    !strcmp(pp->name, "linux,phandle"))
+		if (!strcmp(pp->name, "name"))
 			continue;
 
 		np = of_find_node_by_path(pp->value);
diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index 888fdbc09992..c6fd3f32bfcb 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -218,19 +218,6 @@ int of_property_notify(int action, struct device_node *np,
 
 void __of_attach_node(struct device_node *np)
 {
-	const __be32 *phandle;
-	int sz;
-
-	np->name = __of_get_property(np, "name", NULL) ? : "";
-	np->type = __of_get_property(np, "device_type", NULL) ? : "";
-
-	phandle = __of_get_property(np, "phandle", &sz);
-	if (!phandle)
-		phandle = __of_get_property(np, "linux,phandle", &sz);
-	if (IS_ENABLED(CONFIG_PPC_PSERIES) && !phandle)
-		phandle = __of_get_property(np, "ibm,phandle", &sz);
-	np->phandle = (phandle && (sz >= 4)) ? be32_to_cpup(phandle) : 0;
-
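The new of_node_phandle_read() passes np->phandle through cpu_to_be32() before copying it out, so the "phandle" file carries the same big-endian 32-bit cell that an FDT property would. A minimal sketch of how a userspace reader might decode those four bytes regardless of host endianness; the helper name decode_phandle_cell() is hypothetical and is not part of the patch:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace helper (not part of the patch): decode the four
 * bytes read from a node's "phandle" file.  The bytes are big-endian, so
 * assemble them explicitly instead of casting the buffer to an integer.
 * Returns 0 on success, -1 if fewer than four bytes are available.
 */
static int decode_phandle_cell(const unsigned char *buf, size_t len,
			       uint32_t *out)
{
	if (len < 4)
		return -1;

	*out = ((uint32_t)buf[0] << 24) |
	       ((uint32_t)buf[1] << 16) |
	       ((uint32_t)buf[2] << 8) |
	       (uint32_t)buf[3];
	return 0;
}
```

Assembling the value byte by byte avoids any dependence on the endianness of the machine doing the read.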
[PATCH 0/4] of: remove *phandle properties from expanded device tree
From: Frank Rowand

Remove "phandle" and "linux,phandle" properties from the internal
device tree.  The phandle will still be in the struct device_node
phandle field.

This is to resolve the issue found by Stephen Boyd [1] when he changed
the type of struct property.value from void * to const void *.  As
a result of the type change, the overlay code had compile errors
where the resolver updates phandle values.

  [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html

Patch 1 is the phandle related changes.

Patches 2 - 4 are minor fixups for issues that became visible
while implementing patch 1.

Frank Rowand (4):
  of: remove *phandle properties from expanded device tree
  of: make __of_attach_node() static
  of: be consistent in form of file mode
  of: detect invalid phandle in overlay

 drivers/of/base.c       | 53
 drivers/of/dynamic.c    | 31
 drivers/of/fdt.c        | 40
 drivers/of/of_private.h |  1
 drivers/of/overlay.c    |  8
 drivers/of/resolver.c   | 23
 include/linux/of.h      |  1
 7 files changed, 97 insertions(+), 60 deletions(-)

--
Frank Rowand
[PATCH 2/4] of: make __of_attach_node() static
From: Frank Rowand

__of_attach_node() is not used outside of drivers/of/dynamic.c.  Make
it static and remove it from drivers/of/of_private.h.

Signed-off-by: Frank Rowand
---
 drivers/of/dynamic.c    | 2 +-
 drivers/of/of_private.h | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index c6fd3f32bfcb..74aafe594ad5 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -216,7 +216,7 @@ int of_property_notify(int action, struct device_node *np,
 	return of_reconfig_notify(action, &pr);
 }
 
-void __of_attach_node(struct device_node *np)
+static void __of_attach_node(struct device_node *np)
 {
 	np->child = NULL;
 	np->sibling = np->parent->child;
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index 18bbb4517e25..efcedcff7dba 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -78,7 +78,6 @@ extern int __of_update_property(struct device_node *np,
 extern void __of_update_property_sysfs(struct device_node *np,
 		struct property *newprop, struct property *oldprop);
 
-extern void __of_attach_node(struct device_node *np);
 extern int __of_attach_node_sysfs(struct device_node *np);
 extern void __of_detach_node(struct device_node *np);
 extern void __of_detach_node_sysfs(struct device_node *np);
--
Frank Rowand
[PATCH 4/4] of: detect invalid phandle in overlay
From: Frank Rowand

Overlays are not allowed to modify phandle values of previously existing
nodes because there is no information available to allow fixing up
properties that use the previously existing phandle.

Signed-off-by: Frank Rowand
---
 drivers/of/overlay.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index ca0b85f5deb1..20ab49d2f7a4 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -130,6 +130,10 @@ static int of_overlay_apply_single_device_node(struct of_overlay *ov,
 	/* NOTE: Multiple mods of created nodes not supported */
 	tchild = of_get_child_by_name(target, cname);
 	if (tchild != NULL) {
+		/* new overlay phandle value conflicts with existing value */
+		if (child->phandle)
+			return -EINVAL;
+
 		/* apply overlay recursively */
 		ret = of_overlay_apply_one(ov, tchild, child);
 		of_node_put(tchild);
--
Frank Rowand
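The rule in this patch can be modelled outside the kernel. The following is only a sketch under stated assumptions: struct toy_node and toy_check_overlay_child() are hypothetical stand-ins, not kernel types or functions, and they mirror just the shape of the added check (an overlay child that lands on an already existing node must not carry its own phandle).

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct device_node; only the field the check needs. */
struct toy_node {
	uint32_t phandle;	/* 0 means "no phandle assigned" */
};

/*
 * Mirrors the shape of the added check: when the overlay child maps onto
 * a node that already exists in the live tree (existing != NULL), an
 * overlay child that carries its own phandle is rejected with -EINVAL.
 * A child that creates a new node may carry a phandle.
 */
static int toy_check_overlay_child(const struct toy_node *existing,
				   const struct toy_node *overlay_child)
{
	if (existing != NULL && overlay_child->phandle)
		return -EINVAL;
	return 0;
}
```

Rejecting the overlay outright, rather than overwriting the value, matches the commit message: nothing records which properties already refer to the old phandle, so they could not be fixed up afterwards.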
[PATCH 3/4] of: be consistent in form of file mode
From: Frank Rowand

checkpatch whined about using S_IRUGO instead of octal equivalent
when adding phandle sysfs code, so used octal in that patch.
Change other instances of the S_* constants in the same file to
the octal form.

Signed-off-by: Frank Rowand
---
 drivers/of/base.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 197946615503..4a8bd9623140 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -168,7 +168,7 @@ int __of_add_property_sysfs(struct device_node *np, struct property *pp)
 
 	sysfs_bin_attr_init(&pp->attr);
 	pp->attr.attr.name = safe_name(&np->kobj, pp->name);
-	pp->attr.attr.mode = secure ? S_IRUSR : S_IRUGO;
+	pp->attr.attr.mode = secure ? 0400 : 0444;
 	pp->attr.size = secure ? 0 : pp->length;
 	pp->attr.read = of_node_property_read;
 
--
Frank Rowand
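For readers who want to confirm that the octal literals are a pure spelling change, here is a small userspace sketch. S_IRUGO is a kernel-internal macro that userspace headers do not provide, so it is spelled out locally as TOY_S_IRUGO; that name and toy_attr_mode() are hypothetical and exist only for this illustration.

```c
#include <sys/stat.h>

/*
 * S_IRUGO is kernel-internal; in the kernel it is the OR of the three
 * read bits, reproduced here from the POSIX names for illustration.
 */
#define TOY_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)

/* Same selection as the pre-patch line, written with symbolic names. */
static unsigned int toy_attr_mode(int secure)
{
	return secure ? S_IRUSR : TOY_S_IRUGO;
}
```

On a POSIX system toy_attr_mode(1) evaluates to 0400 and toy_attr_mode(0) to 0444, the two literals the patch substitutes.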
Re: [PATCH v4 2/2] thermal: core: Add a back up thermal shutdown mechanism
On Friday 14 April 2017 11:48 PM, Eduardo Valentin wrote:
> Hey,
>
> On Fri, Apr 14, 2017 at 08:42:20AM -0700, Eduardo Valentin wrote:
>> Hello again,
>>
>> On Fri, Apr 14, 2017 at 08:38:40AM -0700, Eduardo Valentin wrote:
>>> Hey,
>>>
>>> On Fri, Apr 14, 2017 at 02:22:13PM +0530, Keerthy wrote:
>>>> orderly_poweroff is triggered when a graceful shutdown
>>>> of system is desired. This may be used in many critical states of the
>>>> kernel such as when subsystems detects conditions such as critical
>>>> temperature conditions. However, in certain conditions in system
>>>> boot up sequences like those in the middle of driver probes being
>>>> initiated, userspace will be unable to power off the system in a clean
>>>> manner and leaves the system in a critical state. In cases like these,
>>>> the /sbin/poweroff will return success (having forked off to attempt
>>>> powering off the system. However, the system overall will fail to
>>>> completely poweroff (since other modules will be probed) and the system
>>>> is still functional with no userspace (since that would have shut itself
>>>> off).
>>>>
>>>> However, there is no clean way of detecting such failure of userspace
>>>> powering off the system. In such scenarios, it is necessary for a backup
>>>> workqueue to be able to force a shutdown of the system when orderly
>>>> shutdown is not successful after a configurable time period.
>>>>
>>>> Reported-by: Nishanth Menon
>>>> Signed-off-by: Keerthy
>>>> ---
>>>>
>>>> Changes in v4:
>>>>
>>>>   * Updated documentation
>>>>   * changed emergency_poweroff_func to thermal_emergency_poweroff_func
>>>>
>>>> Changes in v3:
>>>>
>>>>   * Removed unnecessary mutex init.
>>>>   * Added WARN messages instead of a simple warning message.
>>>>   * Added Documentation.
>>>>
>>>>  Documentation/thermal/sysfs-api.txt | 19
>>>>  drivers/thermal/Kconfig             | 13
>>>>  drivers/thermal/thermal_core.c      | 46
>>>>  3 files changed, 78 insertions(+)
>>>>
>>>> diff --git a/Documentation/thermal/sysfs-api.txt b/Documentation/thermal/sysfs-api.txt
>>>> index ef473dc..e73cc12 100644
>>>> --- a/Documentation/thermal/sysfs-api.txt
>>>> +++ b/Documentation/thermal/sysfs-api.txt
>>>> @@ -582,3 +582,22 @@ platform data is provided, this uses the step_wise throttling policy.
>>>>  This function serves as an arbitrator to set the state of a cooling
>>>>  device. It sets the cooling device to the deepest cooling state if
>>>>  possible.
>>>> +
>>>> +6. thermal_emergency_poweroff:
>>>> +
>>>> +On an event of critical trip temperature crossing. Thermal framework
>>>> +allows the system to shutdown gracefully by calling orderly_poweroff().
>>>> +In the event of a failure of orderly_poweroff() to shut down the system
>>>> +we are in danger of keeping the system alive at undesirably high
>>>> +temperatures. To mitigate this high risk scenario we program a work
>>>> +queue to fire after a pre-determined number of seconds to start
>>>> +an emergency shutdown of the device using the kernel_power_off()
>>>> +function. In case kernel_power_off() fails then finally
>>>> +emergency_restart() is called in the worst case.
>>>> +
>>>> +The delay should be carefully profiled so as to give adequate time for
>>>> +orderly_poweroff(). In case of failure of an orderly_poweroff() the
>>>> +emergency poweroff kicks in after the delay has elapsed and shuts down
>>>> +the system.
>>>> +
>>>> +If set to 0 emergency poweroff will happen immediately.
>>>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>>>> index 9347401..0dd5b85 100644
>>>> --- a/drivers/thermal/Kconfig
>>>> +++ b/drivers/thermal/Kconfig
>>>> @@ -15,6 +15,19 @@ menuconfig THERMAL
>>>>
>>>>  if THERMAL
>>>>
>>>> +config THERMAL_EMERGENCY_POWEROFF_DELAY_MS
>>>> +	int "Emergency poweroff delay in milli-seconds"
>>>> +	depends on THERMAL
>>>> +	default 0
>>>
>>> Only now I realized that merging this may break the working
>>> orderly_poweroff() out there, because you are defaulting this to 0, no
>>> delay, therefore giving no time for orderly_poweroff() to finish. This
>>> is not good.
>>>
>>> I think using 0 delay as immediate power off is not good as we give no
>>> time for graceful shutdown, and by default. My suggestion here
>>> is to use 0 delay as no forced shutdown. Meaning, by default, this
>>> feature is disabled, and all other systems out there, despite DRA7 with
>>> arago over NFS, work as before.
>
> A better solution could be to have bool Kconfig, say
> THERMAL_EMERGENCY_POWEROFF, which would default to false. If one selects
> that option, you get the DELAY_MS configurable, and then you could get
> the 0 ms still as a valid entry, with the same semantics of immediate
> power off, no orderly_poweroff.
>
> I just want to avoid breaking everybody (or changing userland
> expectation) in honor of this change.

Sure. I have now used default value as no emergency shutdown. Any
[PATCH v5 2/2] thermal: core: Add a back up thermal shutdown mechanism
orderly_poweroff is triggered when a graceful shutdown of the system is
desired. It may be used in many critical states of the kernel, such as
when a subsystem detects a critical temperature condition.

However, in certain conditions during system boot up, such as in the
middle of driver probes being initiated, userspace will be unable to
power off the system cleanly, leaving the system in a critical state.
In cases like these, /sbin/poweroff will return success (having forked
off to attempt powering off the system). However, the system overall
will fail to power off completely (since other modules will be probed)
and the system remains functional with no userspace (since that would
have shut itself off).

There is no clean way of detecting such a failure of userspace to power
off the system. In such scenarios, a backup workqueue must be able to
force a shutdown of the system when the orderly shutdown has not
succeeded within a configurable time period.

Reported-by: Nishanth Menon
Signed-off-by: Keerthy
---
Changes in v5:

  * Mandated delay for thermal emergency poweroff to be a non-zero value.

Changes in v4:

  * Updated documentation
  * changed emergency_poweroff_func to thermal_emergency_poweroff_func

Changes in v3:

  * Removed unnecessary mutex init.
  * Added WARN messages instead of a simple warning message.
  * Added Documentation.

 Documentation/thermal/sysfs-api.txt | 21 +++
 drivers/thermal/Kconfig             | 15 +++
 drivers/thermal/thermal_core.c      | 53 +
 3 files changed, 89 insertions(+)

diff --git a/Documentation/thermal/sysfs-api.txt b/Documentation/thermal/sysfs-api.txt
index ef473dc..98dc04f 100644
--- a/Documentation/thermal/sysfs-api.txt
+++ b/Documentation/thermal/sysfs-api.txt
@@ -582,3 +582,24 @@ platform data is provided, this uses the step_wise throttling policy.
     This function serves as an arbitrator to set the state of a cooling
     device. It sets the cooling device to the deepest cooling state if
     possible.
+
+6. thermal_emergency_poweroff:
+
+On an event of critical trip temperature crossing. Thermal framework
+allows the system to shutdown gracefully by calling orderly_poweroff().
+In the event of a failure of orderly_poweroff() to shut down the system
+we are in danger of keeping the system alive at undesirably high
+temperatures. To mitigate this high risk scenario we program a work
+queue to fire after a pre-determined number of seconds to start
+an emergency shutdown of the device using the kernel_power_off()
+function. In case kernel_power_off() fails then finally
+emergency_restart() is called in the worst case.
+
+The delay should be carefully profiled so as to give adequate time for
+orderly_poweroff(). In case of failure of an orderly_poweroff() the
+emergency poweroff kicks in after the delay has elapsed and shuts down
+the system.
+
+If set to 0 emergency poweroff will not be supported. So a carefully
+profiled non-zero positive value is a must for emergency poweroff to be
+triggered.
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 9347401..2a748a6 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -15,6 +15,21 @@ menuconfig THERMAL
 
 if THERMAL
 
+config THERMAL_EMERGENCY_POWEROFF_DELAY_MS
+	int "Emergency poweroff delay in milli-seconds"
+	depends on THERMAL
+	default 0
+	help
+	  The number of milliseconds to delay before emergency
+	  poweroff kicks in. The delay should be carefully profiled
+	  so as to give adequate time for orderly_poweroff(). In case
+	  of failure of an orderly_poweroff() the emergency poweroff
+	  kicks in after the delay has elapsed and shuts down the system.
+
+	  If set to 0 emergency poweroff will not be supported. So a carefully
+	  profiled non-zero positive value is a must for emergency poweroff to be
+	  triggered.
+
 config THERMAL_HWMON
 	bool
 	prompt "Expose thermal sensors as hwmon device"
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 8337c27..de1f7be 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -324,6 +324,54 @@ static void handle_non_critical_trips(struct thermal_zone_device *tz,
 		def_governor->throttle(tz, trip);
 }
 
+/**
+ * thermal_emergency_poweroff_func - emergency poweroff work after a known delay
+ * @work: work_struct associated with the emergency poweroff function
+ *
+ * This function is called in very critical situations to force
+ * a kernel poweroff after a configurable timeout value.
+ */
+static void thermal_emergency_poweroff_func(struct work_struct *work)
+{
+	/*
+	 * We have reached here after the emergency thermal shutdown
+	 * Waiting period has expired. This means orderly_poweroff has
+	 * not been able to shut off the system for
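The fallback order described in the documentation hunk above can be modelled in plain userspace C. This is only a sketch of the decision logic, not the kernel implementation: the enum, the function name and its parameters are hypothetical, and nothing here touches a workqueue or real power management.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of the backup path; these names are not kernel APIs. */
enum fallback_action {
	FALLBACK_NONE,      /* delay is 0, or the orderly poweroff finished in time */
	FALLBACK_POWER_OFF, /* the forced power-off succeeded */
	FALLBACK_RESTART    /* the forced power-off failed, restart as a last resort */
};

/*
 * delay_ms:        configured delay, the role played by
 *                  THERMAL_EMERGENCY_POWEROFF_DELAY_MS in the patch.
 * orderly_ms:      time the orderly poweroff would need, or a negative
 *                  value if it never completes.
 * power_off_works: whether the forced power-off actually turns the
 *                  machine off.
 */
static enum fallback_action emergency_fallback(int delay_ms, int orderly_ms,
					       bool power_off_works)
{
	/* A delay of 0 means the emergency poweroff is not supported. */
	if (delay_ms <= 0)
		return FALLBACK_NONE;

	/* The orderly poweroff completed before the delayed work fired. */
	if (orderly_ms >= 0 && orderly_ms <= delay_ms)
		return FALLBACK_NONE;

	/* Delay elapsed: force a power-off, then fall back to a restart. */
	return power_off_works ? FALLBACK_POWER_OFF : FALLBACK_RESTART;
}
```

With a 10 second delay, an orderly poweroff that needs 5 seconds never reaches the backup path, while one that hangs ends in a forced power-off or, failing that, a restart.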
[PATCH v5 1/2] thermal: core: Allow orderly_poweroff to be called only once
thermal_zone_device_check --> thermal_zone_device_update -->
handle_thermal_trip --> handle_critical_trips --> orderly_poweroff

The above sequence happens every 250/500 ms depending on the
configuration, so orderly_poweroff gets called every 250/500 ms.
With a full-fledged file system it takes at least 5-10 seconds to
power off gracefully.

During that period thermal_zone_device_check keeps triggering
periodically, so the thermal work queues issue orderly_poweroff calls
many times over, eventually leading to failures in gracefully powering
off the system.

Make sure that orderly_poweroff is called only once.

Signed-off-by: Keerthy
Acked-by: Eduardo Valentin
---
Changes in v5:

  * Added Eduardo's Ack.

Changes in v4:

  * power_off_triggered declaration together with mutex definition.

Changes in v3:

  * Changed the place where mutex was locked and unlocked.

Changes in v2:

  * Added a global mutex to serialize poweroff code sequence.

 drivers/thermal/thermal_core.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 11f0675..8337c27 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -45,8 +45,10 @@ static DEFINE_MUTEX(thermal_list_lock);
 
 static DEFINE_MUTEX(thermal_governor_lock);
+static DEFINE_MUTEX(poweroff_lock);
 
 static atomic_t in_suspend;
+static bool power_off_triggered;
 
 static struct thermal_governor *def_governor;
 
@@ -342,7 +344,12 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
 		dev_emerg(&tz->device,
 			  "critical temperature reached(%d C),shutting down\n",
 			  tz->temperature / 1000);
-		orderly_poweroff(true);
+		mutex_lock(&poweroff_lock);
+		if (!power_off_triggered) {
+			orderly_poweroff(true);
+			power_off_triggered = true;
+		}
+		mutex_unlock(&poweroff_lock);
 	}
 }
 
@@ -1463,6 +1470,7 @@ static int __init thermal_init(void)
 {
 	int result;
 
+	mutex_init(&poweroff_lock);
 	result = thermal_register_governors();
 	if (result)
 		goto error;
@@ -1497,6 +1505,7 @@ static int __init thermal_init(void)
 	ida_destroy(&thermal_cdev_ida);
 	mutex_destroy(&thermal_list_lock);
 	mutex_destroy(&thermal_governor_lock);
+	mutex_destroy(&poweroff_lock);
 	return result;
 }
 
-- 
1.9.1
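The guarded section this patch adds to handle_critical_trips() can be tried out in userspace. The following is a minimal sketch that assumes a pthread mutex in place of the kernel mutex and a counter in place of the real orderly_poweroff(); fake_orderly_poweroff(), on_critical_trip() and simulate_polls() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-ins for the two statics added by the patch. */
static pthread_mutex_t poweroff_lock = PTHREAD_MUTEX_INITIALIZER;
static bool power_off_triggered;

/* Counts how many times the "poweroff" was actually requested. */
static int poweroff_requests;

/* Hypothetical stand-in for orderly_poweroff(true): it only counts calls. */
static void fake_orderly_poweroff(void)
{
	poweroff_requests++;
}

/* Mirrors the locked check-and-set from the patch. */
static void on_critical_trip(void)
{
	pthread_mutex_lock(&poweroff_lock);
	if (!power_off_triggered) {
		fake_orderly_poweroff();
		power_off_triggered = true;
	}
	pthread_mutex_unlock(&poweroff_lock);
}

/*
 * Simulates the periodic thermal poll hitting the critical trip `polls`
 * times and returns how many poweroff requests were issued in total.
 */
static int simulate_polls(int polls)
{
	assert(polls >= 0);
	for (int i = 0; i < polls; i++)
		on_critical_trip();
	return poweroff_requests;
}
```

Twenty polls at 250 ms cover the 5 seconds a graceful shutdown may take, and the request counter still ends at one.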
[PATCH v2 02/33] dax: refactor dax-fs into a generic provider of 'struct dax_device' instances
We want dax capable drivers to be able to publish a set of dax
operations [1]. However, we do not want to further abuse block_devices
to advertise these operations. Instead we will attach these operations
to a dax device and add a lookup mechanism to go from block device path
to a dax device. A dax capable driver like pmem or brd is responsible
for registering a dax device, alongside a block device, and then a dax
capable filesystem is responsible for retrieving the dax device by path
name if it wants to call dax_operations.

For now, we refactor the dax pseudo-fs to be a generic facility, rather
than an implementation detail, of the device-dax use case. Where a "dax
device" is just an inode + dax infrastructure, and "Device DAX" is a
mapping service layered on top of that base 'struct dax_device'.
"Filesystem DAX" is then a mapping service that layers a filesystem on
top of that same base device. Filesystem DAX is associated with a
block_device for now, but perhaps directly to a dax device in the
future, or for new pmem-only filesystems.

[1]: https://lkml.org/lkml/2017/1/19/880

Suggested-by: Christoph Hellwig
Signed-off-by: Dan Williams
---
 drivers/Makefile            |    2
 drivers/dax/Kconfig         |   10 +
 drivers/dax/Makefile        |    5 +
 drivers/dax/dax.h           |   20 +--
 drivers/dax/device-dax.h    |   25
 drivers/dax/device.c        |  241 ++
 drivers/dax/pmem.c          |    2
 drivers/dax/super.c         |  303 +++
 include/linux/dax.h         |    3
 tools/testing/nvdimm/Kbuild |   10 +
 10 files changed, 404 insertions(+), 217 deletions(-)
 create mode 100644 drivers/dax/device-dax.h
 rename drivers/dax/{dax.c => device.c} (77%)
 create mode 100644 drivers/dax/super.c

diff --git a/drivers/Makefile b/drivers/Makefile
index 2eced9afba53..0442e982cf35 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -71,7 +71,7 @@ obj-$(CONFIG_PARPORT)		+= parport/
 obj-$(CONFIG_NVM)		+= lightnvm/
 obj-y				+= base/ block/ misc/ mfd/ nfc/
 obj-$(CONFIG_LIBNVDIMM)		+= nvdimm/
-obj-$(CONFIG_DEV_DAX)		+= dax/
+obj-$(CONFIG_DAX)		+= dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
 obj-$(CONFIG_NUBUS)		+= nubus/
 obj-y				+= macintosh/
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 9e95bf94eb13..b7053eafd88e 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,8 +1,13 @@
-menuconfig DEV_DAX
+menuconfig DAX
 	tristate "DAX: direct access to differentiated memory"
+	select SRCU
 	default m if NVDIMM_DAX
+
+if DAX
+
+config DEV_DAX
+	tristate "Device DAX: direct access mapping device"
 	depends on TRANSPARENT_HUGEPAGE
-	select SRCU
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
 	  latency...) memory via an mmap(2) capable character
@@ -11,7 +16,6 @@ menuconfig DEV_DAX
 	  baseline memory pool. Mappings of a /dev/daxX.Y device impose
 	  restrictions that make the mapping behavior deterministic.
 
-if DEV_DAX
 
 config DEV_DAX_PMEM
 	tristate "PMEM DAX: direct access to persistent memory"
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 27c54e38478a..dc7422530462 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -1,4 +1,7 @@
-obj-$(CONFIG_DEV_DAX) += dax.o
+obj-$(CONFIG_DAX) += dax.o
+obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
 
+dax-y := super.o
 dax_pmem-y := pmem.o
+device_dax-y := device.o
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index ea176d875d60..2472d9da96db 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ * Copyright(c) 2016 - 2017 Intel Corporation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of version 2 of the GNU General Public License as
@@ -12,14 +12,12 @@
  */
 #ifndef __DAX_H__
 #define __DAX_H__
-struct device;
-struct dev_dax;
-struct resource;
-struct dax_region;
-void dax_region_put(struct dax_region *dax_region);
-struct dax_region *alloc_dax_region(struct device *parent,
-		int region_id, struct resource *res, unsigned int align,
-		void *addr, unsigned long flags);
-struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
-		struct resource *res, int count);
+struct dax_device;
+struct dax_device *alloc_dax(void *private);
+void put_dax(struct dax_device *dax_dev);
+bool dax_alive(struct dax_device *dax_dev);
+void kill_dax(struct dax_device *dax_dev);
+struct dax_device *inode_dax(struct inode *inode);
+struct inode *dax_inode(struct dax_device *dax_dev);
+void *dax_get_private(struct dax_device *dax_dev);
 #endif /* __DAX_H__ */
diff --git a/drivers/dax/device-dax.h
[PATCH v2 11/33] dm: add dax_device and dax_operations support
Allocate a dax_device to represent the capacity of a device-mapper
instance. Provide a ->direct_access() method via the new dax_operations
indirection that mirrors the functionality of the current direct_access
support via block_device_operations. Once fs/dax.c has been converted
to use dax_operations the old dm_blk_direct_access() will be removed.

A new helper dm_dax_get_live_target() is introduced to separate some of
the dm-specifics from the direct_access implementation.

This enabling is only for the top-level dm representation to upper
layers. Converting target direct_access implementations is deferred to a
separate patch.

Cc: Toshi Kani
Cc: Mike Snitzer
Signed-off-by: Dan Williams
---
 drivers/md/Kconfig            |    1
 drivers/md/dm-core.h          |    1
 drivers/md/dm.c               |   84 ++---
 include/linux/device-mapper.h |    1
 4 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b7767da50c26..1de8372d9459 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -200,6 +200,7 @@ config BLK_DEV_DM_BUILTIN
 config BLK_DEV_DM
 	tristate "Device mapper support"
 	select BLK_DEV_DM_BUILTIN
+	select DAX
 	---help---
 	  Device-mapper is a low level volume manager. It works by allowing
 	  people to specify mappings for ranges of logical sectors. Various
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 136fda3ff9e5..538630190f66 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -58,6 +58,7 @@ struct mapped_device {
 	struct target_type *immutable_target_type;
 
 	struct gendisk *disk;
+	struct dax_device *dax_dev;
 	char name[16];
 
 	void *interface_ptr;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..bd56dfe43a99 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -908,31 +909,68 @@ int dm_set_target_max_io_len(struct dm_target *ti, sector_t len)
 }
 EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
 
-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
-				 void **kaddr, pfn_t *pfn, long size)
+static struct dm_target *dm_dax_get_live_target(struct mapped_device *md,
+		sector_t sector, int *srcu_idx)
 {
-	struct mapped_device *md = bdev->bd_disk->private_data;
 	struct dm_table *map;
 	struct dm_target *ti;
-	int srcu_idx;
-	long len, ret = -EIO;
 
-	map = dm_get_live_table(md, &srcu_idx);
+	map = dm_get_live_table(md, srcu_idx);
 	if (!map)
-		goto out;
+		return NULL;
 
 	ti = dm_table_find_target(map, sector);
 	if (!dm_target_is_valid(ti))
-		goto out;
+		return NULL;
 
-	len = max_io_len(sector, ti) << SECTOR_SHIFT;
-	size = min(len, size);
+	return ti;
+}
 
-	if (ti->type->direct_access)
-		ret = ti->type->direct_access(ti, sector, kaddr, pfn, size);
-out:
+static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+		long nr_pages, void **kaddr, pfn_t *pfn)
+{
+	struct mapped_device *md = dax_get_private(dax_dev);
+	sector_t sector = pgoff * PAGE_SECTORS;
+	struct dm_target *ti;
+	long len, ret = -EIO;
+	int srcu_idx;
+
+	ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+	if (!ti)
+		goto out;
+	if (!ti->type->direct_access)
+		goto out;
+	len = max_io_len(sector, ti) / PAGE_SECTORS;
+	if (len < 1)
+		goto out;
+	nr_pages = min(len, nr_pages);
+	if (ti->type->direct_access) {
+		ret = ti->type->direct_access(ti, sector, kaddr, pfn,
+				nr_pages * PAGE_SIZE);
+		/*
+		 * FIXME: convert ti->type->direct_access to return
+		 * nr_pages directly.
+		 */
+		if (ret >= 0)
+			ret /= PAGE_SIZE;
+	}
+ out:
 	dm_put_live_table(md, srcu_idx);
-	return min(ret, size);
+
+	return ret;
+}
+
+static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
+		void **kaddr, pfn_t *pfn, long size)
+{
+	struct mapped_device *md = bdev->bd_disk->private_data;
+	struct dax_device *dax_dev = md->dax_dev;
+	long nr_pages = size / PAGE_SIZE;
+
+	nr_pages = dm_dax_direct_access(dax_dev, sector / PAGE_SECTORS,
+			nr_pages, kaddr, pfn);
+	return nr_pages < 0 ? nr_pages : nr_pages * PAGE_SIZE;
 }
 
 /*
@@ -1437,6 +1475,7 @@ static int next_free_minor(int *minor)
 }
 
 static const struct block_device_operations dm_blk_dops;
+static const struct dax_operations dm_dax_ops;
 
 static void dm_wq_work(struct work_struct *work);
 
@@
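The unit handling in this patch (sectors and bytes on the block_device side, page offsets and page counts on the dax_operations side) is easy to get wrong, so here is a small userspace sketch of just the arithmetic. It assumes 4096-byte pages and 512-byte sectors, assumes the target can map every page it is asked for, and uses hypothetical sketch_* names rather than any kernel function.

```c
#include <assert.h>

/* Assumed sizes for this sketch; the kernel uses PAGE_SIZE and PAGE_SECTORS. */
#define SKETCH_PAGE_SIZE    4096L
#define SKETCH_SECTOR_SIZE  512L
#define SKETCH_PAGE_SECTORS (SKETCH_PAGE_SIZE / SKETCH_SECTOR_SIZE)
#define SKETCH_EIO          5

/*
 * Page-based entry point: returns how many pages can be mapped starting
 * at page offset pgoff, capped by the target's I/O length limit given in
 * sectors, or a negative error code.
 */
static long sketch_dax_direct_access(long pgoff, long nr_pages,
				     long max_io_sectors)
{
	long len = max_io_sectors / SKETCH_PAGE_SECTORS;

	(void)pgoff; /* the offset does not change the arithmetic here */
	if (len < 1)
		return -SKETCH_EIO;
	return len < nr_pages ? len : nr_pages;
}

/*
 * Byte-based legacy wrapper: converts sector and size to pages, calls the
 * page-based function, then converts a successful result back to bytes.
 */
static long sketch_blk_direct_access(long sector, long size,
				     long max_io_sectors)
{
	long nr_pages = sketch_dax_direct_access(sector / SKETCH_PAGE_SECTORS,
						 size / SKETCH_PAGE_SIZE,
						 max_io_sectors);

	return nr_pages < 0 ? nr_pages : nr_pages * SKETCH_PAGE_SIZE;
}
```

Negative error codes pass through the wrapper unscaled, which is the same reason the patch checks `nr_pages < 0` before multiplying by PAGE_SIZE.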
[PATCH v2 07/33] brd: add dax_operations support
Set up a dax_inode to have the same lifetime as the brd block device and
add a ->direct_access() method that is equivalent to
brd_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old brd_direct_access() will be removed.

Signed-off-by: Dan Williams
---
 drivers/block/Kconfig |    1 +
 drivers/block/brd.c   |   65 +
 2 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index f744de7a0f9b..e66956fc2c88 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -339,6 +339,7 @@ config BLK_DEV_SX8

 config BLK_DEV_RAM
 	tristate "RAM block device support"
+	select DAX if BLK_DEV_RAM_DAX
 	---help---
 	  Saying Y here will allow you to use a portion of your RAM memory as
 	  a block device, so that you can make file systems on it, read and
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..60f3193c9ce2 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -21,6 +21,7 @@
 #include
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 #include
+#include
 #endif

 #include
@@ -41,6 +42,9 @@ struct brd_device {

 	struct request_queue	*brd_queue;
 	struct gendisk		*brd_disk;
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+	struct dax_device	*dax_dev;
+#endif
 	struct list_head	brd_list;

 	/*
@@ -375,30 +379,53 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 }

 #ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
-		void **kaddr, pfn_t *pfn, long size)
+static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff,
+		long nr_pages, void **kaddr, pfn_t *pfn)
 {
-	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;

 	if (!brd)
 		return -ENODEV;
-	page = brd_insert_page(brd, sector);
+	page = brd_insert_page(brd, PFN_PHYS(pgoff) / 512);
 	if (!page)
 		return -ENOSPC;
 	*kaddr = page_address(page);
 	*pfn = page_to_pfn_t(page);

-	return PAGE_SIZE;
+	return 1;
+}
+
+static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
+		void **kaddr, pfn_t *pfn, long size)
+{
+	struct brd_device *brd = bdev->bd_disk->private_data;
+	long nr_pages = __brd_direct_access(brd, PHYS_PFN(sector * 512),
+			PHYS_PFN(size), kaddr, pfn);
+
+	if (nr_pages < 0)
+		return nr_pages;
+	return nr_pages * PAGE_SIZE;
+}
+
+static long brd_dax_direct_access(struct dax_device *dax_dev,
+		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+{
+	struct brd_device *brd = dax_get_private(dax_dev);
+
+	return __brd_direct_access(brd, pgoff, nr_pages, kaddr, pfn);
 }
+
+static const struct dax_operations brd_dax_ops = {
+	.direct_access = brd_dax_direct_access,
+};
 #else
-#define brd_direct_access NULL
+#define brd_blk_direct_access NULL
 #endif

 static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		brd_rw_page,
-	.direct_access =	brd_direct_access,
+	.direct_access =	brd_blk_direct_access,
 };

 /*
@@ -441,7 +468,9 @@ static struct brd_device *brd_alloc(int i)
 {
 	struct brd_device *brd;
 	struct gendisk *disk;
-
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+	struct dax_device *dax_dev;
+#endif
 	brd = kzalloc(sizeof(*brd), GFP_KERNEL);
 	if (!brd)
 		goto out;
@@ -469,9 +498,6 @@ static struct brd_device *brd_alloc(int i)
 	blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
 	brd->brd_queue->limits.discard_zeroes_data = 1;
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-	queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
-#endif
 	disk = brd->brd_disk = alloc_disk(max_part);
 	if (!disk)
 		goto out_free_queue;
@@ -484,8 +510,21 @@ static struct brd_device *brd_alloc(int i)
 	sprintf(disk->disk_name, "ram%d", i);
 	set_capacity(disk, rd_size * 2);

+#ifdef CONFIG_BLK_DEV_RAM_DAX
+	queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
+	dax_dev = alloc_dax(brd, disk->disk_name, &brd_dax_ops);
+	if (!dax_dev)
+		goto out_free_inode;
+#endif
+
+
 	return brd;

+#ifdef CONFIG_BLK_DEV_RAM_DAX
+out_free_inode:
+	kill_dax(dax_dev);
+	put_dax(dax_dev);
+#endif
 out_free_queue:
 	blk_cleanup_queue(brd->brd_queue);
 out_free_dev:
@@ -525,6 +564,10 @@ static struct brd_device *brd_init_one(int i, bool *new)
 static void brd_del_one(struct brd_device *brd)
 {
 	list_del(&brd->brd_list);
+#ifdef CONFIG_BLK_DEV_RAM_DAX
+	kill_dax(brd->dax_dev);
+
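The unit handling in brd_blk_direct_access() (sectors and bytes in, pages inside, bytes or a negative errno out) can be tried in user space. What follows is only a minimal sketch under stated assumptions: 512-byte sectors, a 4096-byte page, and a made-up 8-page device. Every toy_* and TOY_* name is hypothetical and not a kernel identifier, and the -EINVAL check on nr_pages is an addition that the patch itself does not make.

```c
#include <errno.h>

#define TOY_SECTOR_SIZE 512L
#define TOY_PAGE_SIZE   4096L	/* assumption: the kernel uses the arch PAGE_SIZE */
#define TOY_NR_PAGES    8L	/* hypothetical device capacity, in pages */

/* Page-based core: returns the number of pages made available, or -errno. */
static long toy_direct_access(long pgoff, long nr_pages)
{
	if (pgoff < 0 || pgoff >= TOY_NR_PAGES)
		return -ENOSPC;
	if (nr_pages < 1)
		return -EINVAL;	/* hypothetical check, not in the patch */
	return 1;		/* one page per call, as in __brd_direct_access() */
}

/* Sector/byte-based wrapper: convert units on the way in and on the way out. */
static long toy_blk_direct_access(long sector, long size)
{
	long nr_pages = toy_direct_access(sector * TOY_SECTOR_SIZE / TOY_PAGE_SIZE,
					  size / TOY_PAGE_SIZE);

	if (nr_pages < 0)
		return nr_pages;	/* errors pass through unscaled */
	return nr_pages * TOY_PAGE_SIZE;
}
```

The point of the early return is that an errno must not be multiplied by the page size; only a non-negative page count is scaled back to bytes for the legacy block-layer caller.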
[PATCH v2 12/33] dm: teach dm-targets to use a dax_device + dax_operations
Arrange for dm to look up the dax services available from member
devices. Update the dax-capable targets, linear and stripe, to route dax
operations to the underlying device. Change the target-internal
->direct_access() method to more closely align with the dax_operations
->direct_access() calling convention.

Cc: Toshi Kani
Cc: Mike Snitzer
Signed-off-by: Dan Williams
---
 drivers/md/dm-linear.c        |   27 +--
 drivers/md/dm-snap.c          |    6 +++---
 drivers/md/dm-stripe.c        |   29 ++---
 drivers/md/dm-target.c        |    6 +++---
 drivers/md/dm.c               |   16 ++--
 include/linux/device-mapper.h |    7 ---
 6 files changed, 43 insertions(+), 48 deletions(-)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b989a9..c5a52f4dae81 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -141,22 +142,20 @@ static int linear_iterate_devices(struct dm_target *ti,
 	return fn(ti, lc->dev, lc->start, ti->len, data);
 }

-static long linear_direct_access(struct dm_target *ti, sector_t sector,
-				 void **kaddr, pfn_t *pfn, long size)
+static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+		long nr_pages, void **kaddr, pfn_t *pfn)
 {
+	long ret;
 	struct linear_c *lc = ti->private;
 	struct block_device *bdev = lc->dev->bdev;
-	struct blk_dax_ctl dax = {
-		.sector = linear_map_sector(ti, sector),
-		.size = size,
-	};
-	long ret;
-
-	ret = bdev_direct_access(bdev, &dax);
-	*kaddr = dax.addr;
-	*pfn = dax.pfn;
-
-	return ret;
+	struct dax_device *dax_dev = lc->dev->dax_dev;
+	sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+	dev_sector = linear_map_sector(ti, sector);
+	ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages * PAGE_SIZE, &pgoff);
+	if (ret)
+		return ret;
+	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
 }

 static struct target_type linear_target = {
@@ -169,7 +168,7 @@ static struct target_type linear_target = {
 	.status = linear_status,
 	.prepare_ioctl = linear_prepare_ioctl,
 	.iterate_devices = linear_iterate_devices,
-	.direct_access = linear_direct_access,
+	.direct_access = linear_dax_direct_access,
 };

 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index c65feeada864..e152d9817c81 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2302,8 +2302,8 @@ static int origin_map(struct dm_target *ti, struct bio *bio)
 	return do_origin(o->dev, bio);
 }

-static long origin_direct_access(struct dm_target *ti, sector_t sector,
-		void **kaddr, pfn_t *pfn, long size)
+static long origin_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+		long nr_pages, void **kaddr, pfn_t *pfn)
 {
 	DMWARN("device does not support dax.");
 	return -EIO;
@@ -2368,7 +2368,7 @@ static struct target_type origin_target = {
 	.postsuspend = origin_postsuspend,
 	.status = origin_status,
 	.iterate_devices = origin_iterate_devices,
-	.direct_access = origin_direct_access,
+	.direct_access = origin_dax_direct_access,
 };

 static struct target_type snapshot_target = {
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 28193a57bf47..cb4b1e9e16ab 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -308,27 +309,25 @@ static int stripe_map(struct dm_target *ti, struct bio *bio)
 	return DM_MAPIO_REMAPPED;
 }

-static long stripe_direct_access(struct dm_target *ti, sector_t sector,
-				 void **kaddr, pfn_t *pfn, long size)
+static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
+		long nr_pages, void **kaddr, pfn_t *pfn)
 {
+	sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
 	struct stripe_c *sc = ti->private;
-	uint32_t stripe;
+	struct dax_device *dax_dev;
 	struct block_device *bdev;
-	struct blk_dax_ctl dax = {
-		.size = size,
-	};
+	uint32_t stripe;
 	long ret;

-	stripe_map_sector(sc, sector, &stripe, &dax.sector);
-
-	dax.sector += sc->stripe[stripe].physical_start;
+	stripe_map_sector(sc, sector, &stripe, &dev_sector);
+	dev_sector += sc->stripe[stripe].physical_start;
+	dax_dev = sc->stripe[stripe].dev->dax_dev;
 	bdev = sc->stripe[stripe].dev->bdev;

-	ret = bdev_direct_access(bdev, &dax);
-	*kaddr = dax.addr;
-	*pfn = dax.pfn;
-
-	return ret;
+	ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages * PAGE_SIZE, &pgoff);
+	if (ret)
+		return ret;
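The linear target's translation (page offset in the target, to a sector, to a sector on the member device, back to a page offset on that device) can be sketched in user space. This is a simplified model under loud assumptions: a 4096-byte page, a target that starts at sector 0 so that mapping is just "start + sector", and a page-alignment check standing in for whatever bdev_dax_pgoff() actually enforces. None of the toy_* names exist in the kernel.

```c
#include <errno.h>

#define TOY_PAGE_SIZE    4096UL	/* assumption: the kernel uses the arch PAGE_SIZE */
#define TOY_SECTOR_SIZE  512UL
#define TOY_PAGE_SECTORS (TOY_PAGE_SIZE / TOY_SECTOR_SIZE)

/* Hypothetical linear target: target sector 0 maps to device sector 'start'. */
struct toy_linear {
	unsigned long start;
};

/*
 * Translate a page offset within the target into a page offset on the
 * underlying device. Returns 0 and stores the result in *dev_pgoff, or
 * -EINVAL when the mapped sector is not page aligned (an assumed rule).
 */
static int toy_linear_map_pgoff(const struct toy_linear *lc,
				unsigned long pgoff, unsigned long *dev_pgoff)
{
	unsigned long sector = pgoff * TOY_PAGE_SECTORS;
	unsigned long dev_sector = lc->start + sector;

	if (dev_sector % TOY_PAGE_SECTORS)
		return -EINVAL;
	*dev_pgoff = dev_sector / TOY_PAGE_SECTORS;
	return 0;
}
```

The alignment check illustrates why a member device whose start sector is not a multiple of the page size cannot be reached through a page-granular interface.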
[PATCH v2 21/33] filesystem-dax: convert to dax_copy_from_iter()
Now that all possible providers of the dax_operations copy_from_iter
method are implemented, switch filesystem-dax to call the driver rather
than copy_from_iter_pmem().

Signed-off-by: Dan Williams
---
 arch/x86/include/asm/pmem.h |   50 ---
 fs/dax.c                    |    3 ++-
 include/linux/pmem.h        |   24 -
 3 files changed, 2 insertions(+), 75 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d5a22bac9988..60e8edbe0205 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -66,56 +66,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
 }

 /**
- * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
- * @addr:	PMEM destination address
- * @bytes:	number of bytes to copy
- * @i:		iterator with source data
- *
- * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
- */
-static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
-		struct iov_iter *i)
-{
-	size_t len;
-
-	/* TODO: skip the write-back by always using non-temporal stores */
-	len = copy_from_iter_nocache(addr, bytes, i);
-
-	/*
-	 * In the iovec case on x86_64 copy_from_iter_nocache() uses
-	 * non-temporal stores for the bulk of the transfer, but we need
-	 * to manually flush if the transfer is unaligned. A cached
-	 * memory copy is used when destination or size is not naturally
-	 * aligned. That is:
-	 *   - Require 8-byte alignment when size is 8 bytes or larger.
-	 *   - Require 4-byte alignment when size is 4 bytes.
-	 *
-	 * In the non-iovec case the entire destination needs to be
-	 * flushed.
-	 */
-	if (iter_is_iovec(i)) {
-		unsigned long flushed, dest = (unsigned long) addr;
-
-		if (bytes < 8) {
-			if (!IS_ALIGNED(dest, 4) || (bytes != 4))
-				arch_wb_cache_pmem(addr, 1);
-		} else {
-			if (!IS_ALIGNED(dest, 8)) {
-				dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
-				arch_wb_cache_pmem(addr, 1);
-			}
-
-			flushed = dest - (unsigned long) addr;
-			if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
-				arch_wb_cache_pmem(addr + bytes - 1, 1);
-		}
-	} else
-		arch_wb_cache_pmem(addr, bytes);
-
-	return len;
-}
-
-/**
  * arch_clear_pmem - zero a PMEM memory range
  * @addr:	virtual start address
  * @size:	number of bytes to zero
diff --git a/fs/dax.c b/fs/dax.c
index ce9dc9c3e829..11b9909c91df 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1061,7 +1061,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			map_len = end - pos;

 		if (iov_iter_rw(iter) == WRITE)
-			map_len = copy_from_iter_pmem(kaddr, map_len, iter);
+			map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr,
+					map_len, iter);
 		else
 			map_len = copy_to_iter(kaddr, map_len, iter);
 		if (map_len <= 0) {
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 71ecf3d46aac..9d542a5600e4 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,13 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
 	BUG();
 }

-static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
-		struct iov_iter *i)
-{
-	BUG();
-	return 0;
-}
-
 static inline void arch_clear_pmem(void *addr, size_t size)
 {
 	BUG();
@@ -80,23 +73,6 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
 }

 /**
- * copy_from_iter_pmem - copy data from an iterator to PMEM
- * @addr:	PMEM destination address
- * @bytes:	number of bytes to copy
- * @i:		iterator with source data
- *
- * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline size_t copy_from_iter_pmem(void *addr, size_t bytes,
-		struct iov_iter *i)
-{
-	if (arch_has_pmem_api())
-		return arch_copy_from_iter_pmem(addr, bytes, i);
-	return copy_from_iter_nocache(addr, bytes, i);
-}
-
-/**
  * clear_pmem - zero a PMEM memory range
  * @addr: virtual start address
  * @size: number of bytes to zero
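The shape of this conversion, where the filesystem calls one core entry point that hands the copy to the driver, can be illustrated with a small user-space model. It is a sketch only: it assumes the driver hook is optional and that a plain copy is the fallback, and it uses memcpy() in place of an iov_iter. The toy_* names are hypothetical; the real dax_copy_from_iter() body is not shown in this patch.

```c
#include <stddef.h>
#include <string.h>

struct toy_dax_device;

struct toy_dax_ops {
	/* optional: driver-specific copy; NULL selects the generic path */
	size_t (*copy_from)(struct toy_dax_device *dev, void *dst,
			    const void *src, size_t n);
};

struct toy_dax_device {
	const struct toy_dax_ops *ops;
	unsigned long driver_copies;	/* counts calls that reached the driver */
};

/* Core entry point used by the "filesystem": dispatch or fall back. */
static size_t toy_dax_copy_from(struct toy_dax_device *dev, void *dst,
				const void *src, size_t n)
{
	if (dev->ops && dev->ops->copy_from)
		return dev->ops->copy_from(dev, dst, src, n);
	memcpy(dst, src, n);
	return n;
}

/* Example driver hook: copy, then record that driver-specific work ran. */
static size_t toy_pmem_copy_from(struct toy_dax_device *dev, void *dst,
				 const void *src, size_t n)
{
	memcpy(dst, src, n);
	dev->driver_copies++;
	return n;
}

static const struct toy_dax_ops toy_pmem_ops = {
	.copy_from = toy_pmem_copy_from,
};
```

With this layout the caller no longer needs to know whether the backing device is persistent memory; that decision moves behind the ops table, which is what allows the pmem-specific helpers to be deleted from the generic headers.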
[PATCH v2 17/33] block: remove block_device_operations ->direct_access()
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams
---
 arch/powerpc/sysdev/axonram.c |   23 -
 drivers/block/brd.c           |   15 --
 drivers/md/dm.c               |   13
 drivers/nvdimm/pmem.c         |   10 -
 drivers/s390/block/dcssblk.c  |   16 ---
 fs/block_dev.c                |   45 -
 include/linux/blkdev.h        |   17 ---
 7 files changed, 4 insertions(+), 135 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ad857d5e81b1..83eb56ff1d2c 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,6 +139,10 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
 	return BLK_QC_T_NONE;
 }

+static const struct block_device_operations axon_ram_devops = {
+	.owner		= THIS_MODULE,
+};
+
 static long
 __axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long nr_pages,
 		       void **kaddr, pfn_t *pfn)
@@ -150,25 +154,6 @@ __axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long nr_page
 	return (bank->size - offset) / PAGE_SIZE;
 }

-/**
- * axon_ram_direct_access - direct_access() method for block device
- * @device, @sector, @data: see block_device_operations method
- */
-static long
-axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
-		       void **kaddr, pfn_t *pfn, long size)
-{
-	struct axon_ram_bank *bank = device->bd_disk->private_data;
-
-	return __axon_ram_direct_access(bank, (sector * 512) / PAGE_SIZE,
-			size / PAGE_SIZE, kaddr, pfn) * PAGE_SIZE;
-}
-
-static const struct block_device_operations axon_ram_devops = {
-	.owner		= THIS_MODULE,
-	.direct_access	= axon_ram_blk_direct_access
-};
-
 static long
 axon_ram_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
 		       void **kaddr, pfn_t *pfn)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 60f3193c9ce2..bfa4ed2c75ef 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -395,18 +395,6 @@ static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff,
 	return 1;
 }

-static long brd_blk_direct_access(struct block_device *bdev, sector_t sector,
-		void **kaddr, pfn_t *pfn, long size)
-{
-	struct brd_device *brd = bdev->bd_disk->private_data;
-	long nr_pages = __brd_direct_access(brd, PHYS_PFN(sector * 512),
-			PHYS_PFN(size), kaddr, pfn);
-
-	if (nr_pages < 0)
-		return nr_pages;
-	return nr_pages * PAGE_SIZE;
-}
-
 static long brd_dax_direct_access(struct dax_device *dax_dev,
 		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
 {
@@ -418,14 +406,11 @@ static long brd_dax_direct_access(struct dax_device *dax_dev,
 static const struct dax_operations brd_dax_ops = {
 	.direct_access = brd_dax_direct_access,
 };
-#else
-#define brd_blk_direct_access NULL
 #endif

 static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		brd_rw_page,
-	.direct_access =	brd_blk_direct_access,
 };

 /*
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ef4c6f8cad47..79d5f5fd823e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -957,18 +957,6 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	return ret;
 }

-static long dm_blk_direct_access(struct block_device *bdev, sector_t sector,
-		void **kaddr, pfn_t *pfn, long size)
-{
-	struct mapped_device *md = bdev->bd_disk->private_data;
-	struct dax_device *dax_dev = md->dax_dev;
-	long nr_pages = size / PAGE_SIZE;
-
-	nr_pages = dm_dax_direct_access(dax_dev, sector / PAGE_SECTORS,
-			nr_pages, kaddr, pfn);
-	return nr_pages < 0 ? nr_pages : nr_pages * PAGE_SIZE;
-}
-
 /*
  * A target may call dm_accept_partial_bio only from the map routine. It is
  * allowed for all bio types except REQ_PREFLUSH.
@@ -2823,7 +2811,6 @@ static const struct block_device_operations dm_blk_dops = {
 	.open = dm_blk_open,
 	.release = dm_blk_close,
 	.ioctl = dm_blk_ioctl,
-	.direct_access = dm_blk_direct_access,
 	.getgeo = dm_blk_getgeo,
 	.pr_ops = &dm_pr_ops,
 	.owner = THIS_MODULE
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fbbcf8154eec..85b85633d674 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -220,19 +220,9 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 	return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
 }

-static
[PATCH v2 22/33] dax, pmem: introduce an optional 'flush' dax_operation
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so add a dax operation
to allow pmem to take this extra action, but skip it for other dax
capable devices that do not provide a flush routine.

An example for this differentiation might be a volatile ram disk where
there is no expectation of persistence. In fact the pmem driver itself
might front such an address range specified by the NFIT. So, this "no
flush" property might be something passed down by the bus / libnvdimm.

Cc: Christoph Hellwig
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 drivers/nvdimm/pmem.c | 11 +++
 include/linux/dax.h   |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e501df4ab4b4..822b85fb3365 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -276,9 +276,20 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t size)
+{
+	/*
+	 * TODO: move arch specific cache management into the driver
+	 * directly.
+	 */
+	wb_cache_pmem(addr, size);
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
 	.copy_from_iter = pmem_copy_from_iter,
+	.flush = pmem_dax_flush,
 };
 
 static void pmem_release_queue(void *q)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index cd8561bb21f3..c88bbcba26d9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -19,6 +19,8 @@ struct dax_operations {
 	/* copy_from_iter: dax-driver override for default copy_from_iter */
 	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
 			struct iov_iter *);
+	/* flush: optional driver-specific cache management after writes */
+	void (*flush)(struct dax_device *, pgoff_t, void *, size_t);
 };
 
 int dax_read_lock(void);
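For readers skimming the archive, here is a minimal userspace sketch of the "optional operation" pattern this patch describes: the core calls the driver's flush routine only when one is provided. Every name prefixed with demo_ is a hypothetical stand-in, pgoff is simplified to unsigned long, and the flush body only records the request rather than touching caches; none of this is the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's definitions. */
struct dax_device;

struct dax_operations {
	/* flush: optional, may be NULL for devices fronting volatile memory */
	void (*flush)(struct dax_device *dev, unsigned long pgoff,
			void *addr, size_t size);
};

struct dax_device {
	const struct dax_operations *ops;
	unsigned long flushed_bytes;	/* bookkeeping for the demo only */
};

/* Stand-in for a pmem flush routine: records the request, no cache ops. */
static void demo_pmem_flush(struct dax_device *dev, unsigned long pgoff,
		void *addr, size_t size)
{
	(void)pgoff;
	(void)addr;
	dev->flushed_bytes += size;
}

/* Core-side helper: call the driver's flush routine only if it has one. */
static void demo_dax_flush(struct dax_device *dev, unsigned long pgoff,
		void *addr, size_t size)
{
	assert(dev && dev->ops);
	if (dev->ops->flush)
		dev->ops->flush(dev, pgoff, addr, size);
}

/* One device type that needs flushing, one that does not. */
static const struct dax_operations demo_pmem_ops = { .flush = demo_pmem_flush };
static const struct dax_operations demo_ramdisk_ops = { .flush = NULL };
```

A device built with demo_ramdisk_ops pays only for the NULL check, which is the differentiation the commit message argues for in the volatile ram disk case.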
[PATCH v2 18/33] x86, dax, pmem: remove indirection around memcpy_from_pmem()
memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
serves no real benefit aside from affording a more generic function name
than the x86-specific 'mcsafe'. However this would not be the first time
that x86 terminology leaked into the global namespace. For lack of
better name, just use memcpy_mcsafe() directly.

This conversion also catches a place where we should have been using
plain memcpy, acpi_nfit_blk_single_io().

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Ingo Molnar
Cc: Christoph Hellwig
Cc: Tony Luck
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 arch/x86/include/asm/pmem.h      |  5 -
 arch/x86/include/asm/string_64.h |  1 +
 drivers/acpi/nfit/core.c         |  3 +--
 drivers/nvdimm/claim.c           |  2 +-
 drivers/nvdimm/pmem.c            |  2 +-
 include/linux/pmem.h             | 23 ---
 include/linux/string.h           |  8
 7 files changed, 12 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 529bb4a6487a..d5a22bac9988 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,11 +44,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
 		BUG();
 }
 
-static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
-{
-	return memcpy_mcsafe(dst, src, n);
-}
-
 /**
  * arch_wb_cache_pmem - write back a cache range with CLWB
  * @vaddr: virtual start address
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index a164862d77e3..733bae07fb29 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -79,6 +79,7 @@ int strcmp(const char *cs, const char *ct);
 #define memset(s, c, n) __memset(s, c, n)
 #endif
 
+#define __HAVE_ARCH_MEMCPY_MCSAFE 1
 __must_check int memcpy_mcsafe_unrolled(void *dst, const void *src, size_t cnt);
 DECLARE_STATIC_KEY_FALSE(mcsafe_key);
 
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index c8ea9d698cd0..d0c07b2344e4 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1783,8 +1783,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 				mmio_flush_range((void __force *)
 					mmio->addr.aperture + offset, c);
 
-			memcpy_from_pmem(iobuf + copied,
-					mmio->addr.aperture + offset, c);
+			memcpy(iobuf + copied, mmio->addr.aperture + offset, c);
 		}
 
 		copied += c;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index ca6d572c48fc..3a35e8028b9c 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -239,7 +239,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 	if (rw == READ) {
 		if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align)))
 			return -EIO;
-		return memcpy_from_pmem(buf, nsio->addr + offset, size);
+		return memcpy_mcsafe(buf, nsio->addr + offset, size);
 	}
 
 	if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align))) {
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 85b85633d674..3b3dab73d741 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -89,7 +89,7 @@ static int read_pmem(struct page *page, unsigned int off,
 	int rc;
 	void *mem = kmap_atomic(page);
 
-	rc = memcpy_from_pmem(mem + off, pmem_addr, len);
+	rc = memcpy_mcsafe(mem + off, pmem_addr, len);
 	kunmap_atomic(mem);
 	if (rc)
 		return -EIO;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index e856c2cb0fe8..71ecf3d46aac 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,12 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
 	BUG();
 }
 
-static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
-{
-	BUG();
-	return -EFAULT;
-}
-
 static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
 		struct iov_iter *i)
 {
@@ -65,23 +59,6 @@ static inline bool arch_has_pmem_api(void)
 	return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API);
 }
 
-/*
- * memcpy_from_pmem - read from persistent memory with error handling
- * @dst: destination buffer
- * @src: source buffer
- * @size: transfer length
- *
- * Returns 0 on success negative error code on failure.
- */
-static inline int memcpy_from_pmem(void *dst, void const *src, size_t size)
-{
-	if (arch_has_pmem_api())
-		return arch_memcpy_from_pmem(dst, src, size);
-	else
-		memcpy(dst, src, size);
-	return 0;
-}
-
 /**
  * memcpy_to_pmem - copy data to persistent memory
  * @dst: destination buffer for the copy
diff --git a/include/linux/string.h
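The callers converted by this patch all follow the same shape: attempt the copy, and turn any non-zero result into -EIO. A rough userspace sketch of that shape follows. demo_memcpy_mcsafe() is a hypothetical stand-in that simulates a failed read by checking against a made-up "poisoned" range; the real memcpy_mcsafe() recovers from a hardware exception, which cannot be reproduced here.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical poisoned range used to simulate a failing read. */
static const char *demo_poison_start;
static size_t demo_poison_len;

/*
 * Userspace stand-in for memcpy_mcsafe(): returns 0 on success and a
 * negative value if the source range overlaps the simulated poison.
 */
static int demo_memcpy_mcsafe(void *dst, const void *src, size_t cnt)
{
	const char *s = src;

	if (demo_poison_len && s < demo_poison_start + demo_poison_len &&
	    demo_poison_start < s + cnt)
		return -EFAULT;
	memcpy(dst, src, cnt);
	return 0;
}

/* Mirrors the shape of read_pmem() in the patch: any failure becomes -EIO. */
static int demo_read_pmem(void *dst, const void *pmem_addr, size_t len)
{
	int rc = demo_memcpy_mcsafe(dst, pmem_addr, len);

	if (rc)
		return -EIO;
	return 0;
}
```

A caller that has no use for the error result, as the commit message says of acpi_nfit_blk_single_io(), can call plain memcpy() instead.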
[PATCH v2 25/33] x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush
The clear_pmem() helper simply combines a memset() plus a cache flush.
Now that the flush routine is optionally provided by the dax device
driver we can avoid unnecessary cache management on dax devices fronting
volatile memory.

With clear_pmem() gone we can follow on with a patch to make pmem cache
management completely defined within the pmem driver.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Ingo Molnar
Cc: Christoph Hellwig
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 arch/x86/include/asm/pmem.h | 13 -
 fs/dax.c                    |  3 ++-
 include/linux/pmem.h        | 21 -
 3 files changed, 2 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 60e8edbe0205..f4c119d253f3 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -65,19 +65,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
 		clwb(p);
 }
 
-/**
- * arch_clear_pmem - zero a PMEM memory range
- * @addr: virtual start address
- * @size: number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- */
-static inline void arch_clear_pmem(void *addr, size_t size)
-{
-	memset(addr, 0, size);
-	arch_wb_cache_pmem(addr, size);
-}
-
 static inline void arch_invalidate_pmem(void *addr, size_t size)
 {
 	clflush_cache_range(addr, size);
diff --git a/fs/dax.c b/fs/dax.c
index edbf988de86c..edee7e8298bc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -982,7 +982,8 @@ int __dax_zero_page_range(struct block_device *bdev,
 			dax_read_unlock(id);
 			return rc;
 		}
-		clear_pmem(kaddr + offset, size);
+		memset(kaddr + offset, 0, size);
+		dax_flush(dax_dev, pgoff, kaddr + offset, size);
 		dax_read_unlock(id);
 	}
 	return 0;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 9d542a5600e4..772bd02a5b52 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,11 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
 	BUG();
 }
 
-static inline void arch_clear_pmem(void *addr, size_t size)
-{
-	BUG();
-}
-
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
 	BUG();
@@ -73,22 +68,6 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
 }
 
 /**
- * clear_pmem - zero a PMEM memory range
- * @addr: virtual start address
- * @size: number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline void clear_pmem(void *addr, size_t size)
-{
-	if (arch_has_pmem_api())
-		arch_clear_pmem(addr, size);
-	else
-		memset(addr, 0, size);
-}
-
-/**
  * invalidate_pmem - flush a pmem range from the cache hierarchy
  * @addr: virtual start address
  * @size: bytes to invalidate (internally aligned to cache line size)
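The fs/dax.c hunk above open-codes what clear_pmem() used to hide: zero the range, then hand the same range to the device's flush routine. A small userspace sketch of that sequence, assuming a hypothetical callback type where NULL means the device needs no cache management (the demo_ names and the call counter are illustrative only):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flush callback type; NULL means no cache management needed. */
typedef void (*demo_flush_fn)(void *addr, size_t size);

static size_t demo_flush_calls;	/* demo bookkeeping only */

/* Stand-in flush routine that just counts how often it is invoked. */
static void demo_count_flush(void *addr, size_t size)
{
	(void)addr;
	(void)size;
	demo_flush_calls++;
}

/*
 * Open-coded replacement for a clear_pmem()-style helper: zero the range,
 * then write it back only when the device supplied a flush routine.
 */
static void demo_zero_range(char *kaddr, size_t offset, size_t size,
		demo_flush_fn flush)
{
	memset(kaddr + offset, 0, size);
	if (flush)
		flush(kaddr + offset, size);
}
```

The zeroing is unconditional in both cases; only the write-back step depends on the device.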
[PATCH v2 26/33] x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm
With all calls to this routine re-directed through the pmem driver, we
can kill the pmem api indirection. arch_wb_cache_pmem() is now
optionally supplied by an arch specific extension to libnvdimm. Same as
before, pmem flushing is only defined for x86_64, but it is
straightforward to add other archs in the future.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Ingo Molnar
Cc: Christoph Hellwig
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: Oliver O'Halloran
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 arch/x86/include/asm/pmem.h | 21 -
 drivers/nvdimm/Makefile     |  1 +
 drivers/nvdimm/pmem.c       | 14 +-
 drivers/nvdimm/pmem.h       |  8
 drivers/nvdimm/x86.c        | 36
 include/linux/pmem.h        | 19 ---
 tools/testing/nvdimm/Kbuild |  1 +
 7 files changed, 51 insertions(+), 49 deletions(-)
 create mode 100644 drivers/nvdimm/x86.c

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index f4c119d253f3..4759a179aa52 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,27 +44,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
 		BUG();
 }
 
-/**
- * arch_wb_cache_pmem - write back a cache range with CLWB
- * @vaddr: virtual start address
- * @size: number of bytes to write back
- *
- * Write back a cache range using the CLWB (cache line write back)
- * instruction. Note that @size is internally rounded up to be cache
- * line size aligned.
- */
-static inline void arch_wb_cache_pmem(void *addr, size_t size)
-{
-	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
-	unsigned long clflush_mask = x86_clflush_size - 1;
-	void *vend = addr + size;
-	void *p;
-
-	for (p = (void *)((unsigned long)addr & ~clflush_mask);
-	     p < vend; p += x86_clflush_size)
-		clwb(p);
-}
-
 static inline void arch_invalidate_pmem(void *addr, size_t size)
 {
 	clflush_cache_range(addr, size);
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 909554c3f955..9eafb1dd2876 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -24,3 +24,4 @@ libnvdimm-$(CONFIG_ND_CLAIM) += claim.o
 libnvdimm-$(CONFIG_BTT) += btt_devs.o
 libnvdimm-$(CONFIG_NVDIMM_PFN) += pfn_devs.o
 libnvdimm-$(CONFIG_NVDIMM_DAX) += dax_devs.o
+libnvdimm-$(CONFIG_X86_64) += x86.o
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 822b85fb3365..c77a3a757729 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -245,19 +245,19 @@ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 
 		if (bytes < 8) {
 			if (!IS_ALIGNED(dest, 4) || (bytes != 4))
-				wb_cache_pmem(addr, 1);
+				arch_wb_cache_pmem(addr, 1);
 		} else {
 			if (!IS_ALIGNED(dest, 8)) {
 				dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
-				wb_cache_pmem(addr, 1);
+				arch_wb_cache_pmem(addr, 1);
 			}
 
 			flushed = dest - (unsigned long) addr;
 			if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
-				wb_cache_pmem(addr + bytes - 1, 1);
+				arch_wb_cache_pmem(addr + bytes - 1, 1);
 		}
 	} else
-		wb_cache_pmem(addr, bytes);
+		arch_wb_cache_pmem(addr, bytes);
 
 	return len;
 }
@@ -279,11 +279,7 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
 		void *addr, size_t size)
 {
-	/*
-	 * TODO: move arch specific cache management into the driver
-	 * directly.
-	 */
-	wb_cache_pmem(addr, size);
+	arch_wb_cache_pmem(addr, size);
 }
 
 static const struct dax_operations pmem_dax_ops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 7f4dbd72a90a..c4b3371c7f88 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -5,6 +5,14 @@
 #include
 #include
 
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+void arch_wb_cache_pmem(void *addr, size_t size);
+#else
+static inline void arch_wb_cache_pmem(void *addr, size_t size)
+{
+}
+#endif
+
 /* this definition is in it's own header for tools/testing/nvdimm to consume */
 struct pmem_device {
 	/* One contiguous memory region per device */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
new file mode 100644
index ..79d7267da4d2
--- /dev/null
+++ b/drivers/nvdimm/x86.c
@@ -0,0 +1,36 @@
+/*
+ * Copyright(c) 2015 - 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it
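The arch_wb_cache_pmem() loop being moved by this patch rounds the start address down to a cache-line boundary and then issues one CLWB per line until it passes the end of the range. The address arithmetic can be checked in userspace with a sketch like the one below; demo_lines_written_back() is a hypothetical helper that counts the iterations instead of executing clwb(), and it assumes line_size is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count the cache lines a write-back loop would touch for [addr, addr + size).
 * line_size must be a power of two; in the kernel loop it comes from
 * boot_cpu_data.x86_clflush_size and each iteration issues clwb(p).
 */
static size_t demo_lines_written_back(uintptr_t addr, size_t size,
		uintptr_t line_size)
{
	uintptr_t mask = line_size - 1;
	uintptr_t end = addr + size;
	uintptr_t p;
	size_t lines = 0;

	/* Round the start down to a line boundary, then step line by line. */
	for (p = addr & ~mask; p < end; p += line_size)
		lines++;
	return lines;
}
```

With 64-byte lines, a 2-byte range starting at 0x103f straddles a boundary and costs two write-backs, which is why the callers in pmem_copy_from_iter() can pass a size of 1 to flush just the line containing a given byte.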
[PATCH v2 28/33] x86, libnvdimm, dax: stop abusing __copy_user_nocache
The pmem and nd_blk drivers both have need to copy data through the cpu
cache to persistent memory. To date they have been abusing
__copy_user_nocache through the memcpy_to_pmem abstraction, but this has
several problems:

* __copy_user_nocache does not guarantee that it will always avoid the
  cache. While we have fixed the cases where the pmem usage might
  trigger that behavior it's a fragile assumption and burdens the
  uaccess.h implementation with worrying about the distinction between
  'nocache' and the stricter write-through semantic needed by pmem.
  Quoting Linus: "Quite frankly, the whole "memcpy_nocache()" idea or
  (ab-)using copy_user_nocache() just needs to die. ... If some driver
  ends up using "movnt" by hand, that is up to that *driver*."

* It implements SMAP (supervisor mode access protection) which is only
  meant for user copies.

* It expects faults. For in-kernel copies, faults are fatal and we
  should not be coding for exception handling in that case.

__arch_memcpy_to_pmem() is effectively a copy of __copy_user_nocache()
minus SMAP, unaligned support, and exception handling. The configuration
symbol ARCH_HAS_PMEM_API is also moved local to libnvdimm to be next to
the implementation.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Ingo Molnar
Cc: Christoph Hellwig
Cc: Toshi Kani
Cc: Tony Luck
Cc: "H. Peter Anvin"
Cc: Al Viro
Cc: Thomas Gleixner
Cc: Oliver O'Halloran
Cc: Matthew Wilcox
Cc: Ross Zwisler
Cc: Linus Torvalds
Signed-off-by: Dan Williams
---
 MAINTAINERS                     |  2 -
 arch/x86/Kconfig                |  1 -
 arch/x86/include/asm/pmem.h     | 48 -
 drivers/acpi/nfit/core.c        |  3 +-
 drivers/nvdimm/Kconfig          |  4 ++
 drivers/nvdimm/claim.c          |  4 +-
 drivers/nvdimm/namespace_devs.c |  1 -
 drivers/nvdimm/pmem.c           |  4 +-
 drivers/nvdimm/region_devs.c    |  1 -
 drivers/nvdimm/x86.c            | 65 +++
 fs/dax.c                        |  1 -
 include/linux/libnvdimm.h       |  9 +
 include/linux/pmem.h            | 59 ---
 lib/Kconfig                     |  3 --
 14 files changed, 83 insertions(+), 122 deletions(-)
 delete mode 100644 arch/x86/include/asm/pmem.h
 delete mode 100644 include/linux/pmem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 819d5e8b668a..1c4da1bebd7c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7458,8 +7458,6 @@ L:linux-nvd...@lists.01.org
 Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
 S:	Supported
 F:	drivers/nvdimm/pmem.c
-F:	include/linux/pmem.h
-F:	arch/*/include/asm/pmem.h
 
 LIGHTNVM PLATFORM SUPPORT
 M:	Matias Bjorling
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..d377da696903 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -53,7 +53,6 @@ config X86
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MMIO_FLUSH
-	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
deleted file mode 100644
index ded2541a7ba9..
--- a/arch/x86/include/asm/pmem.h
+++ /dev/null
@@ -1,48 +0,0 @@
-/*
- * Copyright(c) 2015 Intel Corporation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- */
-#ifndef __ASM_X86_PMEM_H__
-#define __ASM_X86_PMEM_H__
-
-#include
-#include
-#include
-#include
-
-#ifdef CONFIG_ARCH_HAS_PMEM_API
-/**
- * arch_memcpy_to_pmem - copy data to persistent memory
- * @dst: destination buffer for the copy
- * @src: source buffer for the copy
- * @n: length of the copy in bytes
- *
- * Copy data to persistent memory media via non-temporal stores so that
- * a subsequent pmem driver flush operation will drain posted write queues.
- */
-static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
-{
-	int rem;
-
-	/*
-	 * We are copying between two kernel buffers, if
-	 * __copy_from_user_inatomic_nocache() returns an error (page
-	 * fault) we would have already reported a general protection fault
-	 * before the WARN+BUG.
-	 */
-	rem = __copy_from_user_inatomic_nocache(dst, (void __user *) src, n);
-	if (WARN(rem, "%s: fault copying %p <- %p unwritten: %d\n",
-				__func__, dst, src, rem))
-		BUG();
-}
-
[PATCH v2 27/33] x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm
Kill this globally defined wrapper and move to libnvdimm so that we can
ultimately remove the public pmem api.

Cc:
Cc: Jan Kara
Cc: Jeff Moyer
Cc: Ingo Molnar
Cc: Christoph Hellwig
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 arch/x86/include/asm/pmem.h |    4
 drivers/nvdimm/claim.c      |    3 ++-
 drivers/nvdimm/pmem.c       |    2 +-
 drivers/nvdimm/pmem.h       |    4
 drivers/nvdimm/x86.c        |    6 ++
 include/linux/pmem.h        |   19 ---
 6 files changed, 13 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 4759a179aa52..ded2541a7ba9 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,9 +44,5 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
 		BUG();
 }

-static inline void arch_invalidate_pmem(void *addr, size_t size)
-{
-	clflush_cache_range(addr, size);
-}
 #endif /* CONFIG_ARCH_HAS_PMEM_API */
 #endif /* __ASM_X86_PMEM_H__ */
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..1e13a196ce4b 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -14,6 +14,7 @@
 #include
 #include
 #include "nd-core.h"
+#include "pmem.h"
 #include "pfn.h"
 #include "btt.h"
 #include "nd.h"
@@ -261,7 +262,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 				cleared /= 512;
 				badblocks_clear(&nsio->bb, sector, cleared);
 			}
-			invalidate_pmem(nsio->addr + offset, size);
+			arch_invalidate_pmem(nsio->addr + offset, size);
 		} else
 			rc = -EIO;
 	}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c77a3a757729..769a510c20e8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -69,7 +69,7 @@ static int pmem_clear_poison(struct pmem_device *pmem, phys_addr_t offset,
 		badblocks_clear(&pmem->bb, sector, cleared);
 	}

-	invalidate_pmem(pmem->virt_addr + offset, len);
+	arch_invalidate_pmem(pmem->virt_addr + offset, len);

 	return rc;
 }
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index c4b3371c7f88..5900c1b7 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -7,10 +7,14 @@

 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
+void arch_invalidate_pmem(void *addr, size_t size);
 #else
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
 }
+static inline void arch_invalidate_pmem(void *addr, size_t size)
+{
+}
 #endif

 /* this definition is in it's own header for tools/testing/nvdimm to consume */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
index 79d7267da4d2..07478ed7ce97 100644
--- a/drivers/nvdimm/x86.c
+++ b/drivers/nvdimm/x86.c
@@ -34,3 +34,9 @@ void arch_wb_cache_pmem(void *addr, size_t size)
 		clwb(p);
 }
 EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);
+
+void arch_invalidate_pmem(void *addr, size_t size)
+{
+	clflush_cache_range(addr, size);
+}
+EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 33ae761f010a..559c00848583 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -30,11 +30,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
 {
 	BUG();
 }
-
-static inline void arch_invalidate_pmem(void *addr, size_t size)
-{
-	BUG();
-}
 #endif

 static inline bool arch_has_pmem_api(void)
@@ -61,18 +56,4 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
 	else
 		memcpy(dst, src, n);
 }
-
-/**
- * invalidate_pmem - flush a pmem range from the cache hierarchy
- * @addr: virtual start address
- * @size: bytes to invalidate (internally aligned to cache line size)
- *
- * For platforms that support clearing poison this flushes any poisoned
- * ranges out of the cache
- */
-static inline void invalidate_pmem(void *addr, size_t size)
-{
-	if (arch_has_pmem_api())
-		arch_invalidate_pmem(addr, size);
-}
 #endif /* __PMEM_H__ */
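
For readers following along outside the kernel tree, the header pattern this patch relies on (a real implementation when the config symbol is set, an empty static inline stub otherwise) can be modelled in user space. This is only a sketch under stated assumptions: DEMO_ARCH_HAS_PMEM_API, demo_arch_invalidate_pmem(), demo_clear_poison() and demo_invalidate_calls are hypothetical names invented for illustration, and the "real" branch just counts calls instead of executing clflush.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for CONFIG_ARCH_HAS_PMEM_API. Uncomment to
 * select the "real" implementation instead of the no-op stub.
 */
/* #define DEMO_ARCH_HAS_PMEM_API 1 */

/* Counts how many times the real invalidate path ran (demo only). */
static size_t demo_invalidate_calls;

#ifdef DEMO_ARCH_HAS_PMEM_API
/* Stand-in for the x86 implementation that would call clflush_cache_range(). */
static void demo_arch_invalidate_pmem(void *addr, size_t size)
{
	(void)addr;
	(void)size;
	demo_invalidate_calls++;
}
#else
/* No-op stub: callers compile unchanged on architectures without the API. */
static inline void demo_arch_invalidate_pmem(void *addr, size_t size)
{
	(void)addr;
	(void)size;
}
#endif

/* The caller needs no #ifdef and no runtime "has api" check. */
static int demo_clear_poison(char *base, size_t offset, size_t len)
{
	demo_arch_invalidate_pmem(base + offset, len);
	return 0;
}
```

The design point is that the old invalidate_pmem() wrapper tested arch_has_pmem_api() at run time, while the stub moves that decision to compile time, so the call sites in claim.c and pmem.c can call arch_invalidate_pmem() directly.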
[PATCH v2 30/33] libnvdimm, pmem: fix persistence warning
The pmem driver assumes if platform firmware describes the memory
devices associated with a persistent memory range and
CONFIG_ARCH_HAS_PMEM_API=y that it has all the mechanism necessary to
flush data to a power-fail safe zone. We warn if the firmware does not
describe memory devices, but we also need to warn if the architecture
does not claim pmem support.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 drivers/nvdimm/region_devs.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 307a48060aa3..5976f6c0407f 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -970,8 +970,9 @@ int nvdimm_has_flush(struct nd_region *nd_region)
 	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
 	int i;

-	/* no nvdimm == flushing capability unknown */
-	if (nd_region->ndr_mappings == 0)
+	/* no nvdimm or pmem api == flushing capability unknown */
+	if (nd_region->ndr_mappings == 0
+			|| !IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API))
 		return -ENXIO;

 	for (i = 0; i < nd_region->ndr_mappings; i++)
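
The early-exit rule added here can be sketched in user-space C. The names below (struct demo_region, demo_has_flush(), the flush_hints field) are hypothetical, and the per-mapping loop that follows the check in the real function is not shown in the hunk, so it is collapsed into a single assumed counter rather than reproduced.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct nd_region. */
struct demo_region {
	int ndr_mappings;	/* number of memory devices firmware described */
	int flush_hints;	/* assumption: models the result of the per-mapping loop */
};

/*
 * Returns -ENXIO when flushing capability is unknown, i.e. when firmware
 * described no memory devices OR the architecture lacks the pmem API.
 * Otherwise returns 1 if any flush hint exists and 0 if none do.
 */
static int demo_has_flush(const struct demo_region *r, bool arch_has_pmem_api)
{
	if (r->ndr_mappings == 0 || !arch_has_pmem_api)
		return -ENXIO;

	return r->flush_hints > 0;
}
```

A caller such as the pmem driver would treat a negative return as "unable to guarantee persistence of writes" and warn, which is the behaviour the commit message asks for on architectures that do not claim pmem support.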
[PATCH v2 31/33] libnvdimm, nfit: enable support for volatile ranges
Allow volatile nfit ranges to participate in all the same infrastructure
provided for persistent memory regions. A resulting namespace device
will still be called "pmem", but the parent region type will be
"nd_volatile". This is in preparation for disabling the dax ->flush()
operation in the pmem driver when it is hosted on a volatile range.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 drivers/acpi/nfit/core.c        |    9 -
 drivers/nvdimm/bus.c            |   10 +-
 drivers/nvdimm/core.c           |    2 +-
 drivers/nvdimm/dax_devs.c       |    2 +-
 drivers/nvdimm/dimm_devs.c      |    2 +-
 drivers/nvdimm/namespace_devs.c |    8
 drivers/nvdimm/nd-core.h        |    9 +
 drivers/nvdimm/pfn_devs.c       |    4 ++--
 drivers/nvdimm/region_devs.c    |   27 ++-
 9 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 8b4c6212737c..6ac31846c4df 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2162,6 +2162,13 @@ static bool nfit_spa_is_virtual(struct acpi_nfit_system_address *spa)
 		nfit_spa_type(spa) == NFIT_SPA_PCD);
 }

+static bool nfit_spa_is_volatile(struct acpi_nfit_system_address *spa)
+{
+	return (nfit_spa_type(spa) == NFIT_SPA_VDISK ||
+		nfit_spa_type(spa) == NFIT_SPA_VCD ||
+		nfit_spa_type(spa) == NFIT_SPA_VOLATILE);
+}
+
 static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 		struct nfit_spa *nfit_spa)
 {
@@ -2236,7 +2243,7 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 				ndr_desc);
 		if (!nfit_spa->nd_region)
 			rc = -ENOMEM;
-	} else if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE) {
+	} else if (nfit_spa_is_volatile(spa)) {
 		nfit_spa->nd_region = nvdimm_volatile_region_create(nvdimm_bus,
 				ndr_desc);
 		if (!nfit_spa->nd_region)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 351bac8f6503..d4173fbdba28 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -37,13 +37,13 @@ static int to_nd_device_type(struct device *dev)
 {
 	if (is_nvdimm(dev))
 		return ND_DEVICE_DIMM;
-	else if (is_nd_pmem(dev))
+	else if (is_memory(dev))
 		return ND_DEVICE_REGION_PMEM;
 	else if (is_nd_blk(dev))
 		return ND_DEVICE_REGION_BLK;
 	else if (is_nd_dax(dev))
 		return ND_DEVICE_DAX_PMEM;
-	else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
+	else if (is_nd_region(dev->parent))
 		return nd_region_to_nstype(to_nd_region(dev->parent));

 	return 0;
@@ -55,7 +55,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
 	 * Ensure that region devices always have their numa node set as
 	 * early as possible.
 	 */
-	if (is_nd_pmem(dev) || is_nd_blk(dev))
+	if (is_nd_region(dev))
 		set_dev_node(dev, to_nd_region(dev)->numa_node);
 	return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT,
 			to_nd_device_type(dev));
@@ -64,7 +64,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
 static struct module *to_bus_provider(struct device *dev)
 {
 	/* pin bus providers while regions are enabled */
-	if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+	if (is_nd_region(dev)) {
 		struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);

 		return nvdimm_bus->nd_desc->module;
@@ -771,7 +771,7 @@ void wait_nvdimm_bus_probe_idle(struct device *dev)

 static int pmem_active(struct device *dev, void *data)
 {
-	if (is_nd_pmem(dev) && dev->driver)
+	if (is_memory(dev) && dev->driver)
 		return -EBUSY;
 	return 0;
 }
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 9303cfeb8bee..875ef4cecb35 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -504,7 +504,7 @@ void nvdimm_badblocks_populate(struct nd_region *nd_region,
 	struct nvdimm_bus *nvdimm_bus;
 	struct list_head *poison_list;

-	if (!is_nd_pmem(&nd_region->dev)) {
+	if (!is_memory(&nd_region->dev)) {
 		dev_WARN_ONCE(&nd_region->dev, 1,
 				"%s only valid for pmem regions\n", __func__);
 		return;
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
index 45fa82cae87c..6a92b84c8072 100644
--- a/drivers/nvdimm/dax_devs.c
+++ b/drivers/nvdimm/dax_devs.c
@@ -89,7 +89,7 @@ struct device *nd_dax_create(struct nd_region *nd_region)
 	struct device *dev = NULL;
 	struct nd_dax *nd_dax;

-	if (!is_nd_pmem(&nd_region->dev))
+	if
[PATCH v2 32/33] filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC
Some platforms arrange for cpu caches to be flushed on power-fail. On
those platforms there is no requirement that the kernel track and flush
potentially dirty cache lines. Given that we still insert entries into
the radix for locking purposes this patch only disables the cache flush
loop, not the dirty tracking.

Userspace can override the default cache setting via the block device
queue "write_cache" attribute in sysfs.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 fs/dax.c |    6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f37ed21e4093..5b7ee1bc74d0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -797,7 +797,8 @@ static int dax_writeback_one(struct block_device *bdev,
 	}

 	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-	dax_flush(dax_dev, pgoff, kaddr, size);
+	if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
+		dax_flush(dax_dev, pgoff, kaddr, size);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
@@ -982,7 +983,8 @@ int __dax_zero_page_range(struct block_device *bdev,
 			return rc;
 		}
 		memset(kaddr + offset, 0, size);
-		dax_flush(dax_dev, pgoff, kaddr + offset, size);
+		if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
+			dax_flush(dax_dev, pgoff, kaddr + offset, size);
 		dax_read_unlock(id);
 	}
 	return 0;
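
As a rough user-space illustration of the gating above: the flush is skipped when the write-cache flag is clear, while the rest of the writeback path still runs. Everything named demo_* here is hypothetical, the bit number chosen for DEMO_QUEUE_FLAG_WC is arbitrary, and counters stand in for dax_flush() and for the dirty-tracking work.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit number; the kernel's QUEUE_FLAG_WC value is not assumed. */
#define DEMO_QUEUE_FLAG_WC 0

/* Minimal stand-in for the request queue state consulted by fs/dax.c. */
struct demo_queue {
	unsigned long queue_flags;
	size_t flushes;		/* counts simulated dax_flush() calls */
	size_t cleaned;		/* counts simulated dirty-tag bookkeeping */
};

/* Simplified, non-atomic equivalent of test_bit() for this demo. */
static bool demo_test_bit(unsigned int nr, const unsigned long *flags)
{
	return (*flags >> nr) & 1UL;
}

/* Dirty tracking always runs; only the cache flush is conditional. */
static void demo_writeback_one(struct demo_queue *q)
{
	q->cleaned++;
	if (demo_test_bit(DEMO_QUEUE_FLAG_WC, &q->queue_flags))
		q->flushes++;
}
```

Toggling the flag in this model corresponds to what the commit message describes for userspace: writing the queue "write_cache" attribute in sysfs changes whether later writebacks flush.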
[PATCH v2 33/33] libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
The pmem driver attaches to both persistent and volatile memory ranges
advertised by the ACPI NFIT. When the region is volatile it is redundant
to spend cycles flushing caches at fsync(). Check if the hosting region
is volatile and do not set QUEUE_FLAG_WC if it is.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 drivers/nvdimm/pmem.c        |    9 +++--
 drivers/nvdimm/region_devs.c |    6 ++
 include/linux/libnvdimm.h    |    1 +
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b000c6db5731..42876a75dab8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -275,6 +275,7 @@ static int pmem_attach_disk(struct device *dev,
 	struct vmem_altmap __altmap, *altmap = NULL;
 	struct resource *res = &nsio->res;
 	struct nd_pfn *nd_pfn = NULL;
+	int has_flush, fua = 0, wbc;
 	struct dax_device *dax_dev;
 	int nid = dev_to_node(dev);
 	struct nd_pfn_sb *pfn_sb;
@@ -302,8 +303,12 @@ static int pmem_attach_disk(struct device *dev,
 	dev_set_drvdata(dev, pmem);
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
-	if (nvdimm_has_flush(nd_region) < 0)
+	has_flush = nvdimm_has_flush(nd_region);
+	if (has_flush < 0)
 		dev_warn(dev, "unable to guarantee persistence of writes\n");
+	else
+		fua = has_flush;
+	wbc = nvdimm_has_cache(nd_region);

 	if (!devm_request_mem_region(dev, res->start, resource_size(res),
 				dev_name(&ndns->dev))) {
@@ -344,7 +349,7 @@ static int pmem_attach_disk(struct device *dev,
 		return PTR_ERR(addr);
 	pmem->virt_addr = addr;

-	blk_queue_write_cache(q, true, true);
+	blk_queue_write_cache(q, wbc, fua);
 	blk_queue_make_request(q, pmem_make_request);
 	blk_queue_physical_block_size(q, PAGE_SIZE);
 	blk_queue_max_hw_sectors(q, UINT_MAX);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 2df259010720..a085f7094b76 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -989,6 +989,12 @@ int nvdimm_has_flush(struct nd_region *nd_region)
 }
 EXPORT_SYMBOL_GPL(nvdimm_has_flush);

+int nvdimm_has_cache(struct nd_region *nd_region)
+{
+	return is_nd_pmem(&nd_region->dev);
+}
+EXPORT_SYMBOL_GPL(nvdimm_has_cache);
+
 void __exit nd_region_devs_exit(void)
 {
 	ida_destroy(&region_ida);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index a98004745768..b733030107bb 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -162,6 +162,7 @@ void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 u64 nd_fletcher64(void *addr, size_t len, bool le);
 void nvdimm_flush(struct nd_region *nd_region);
 int nvdimm_has_flush(struct nd_region *nd_region);
+int nvdimm_has_cache(struct nd_region *nd_region);
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_memcpy_to_pmem(void *dst, void *src, unsigned size);
 #define ARCH_MEMREMAP_PMEM MEMREMAP_WB
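
The way pmem_attach_disk() now derives the two arguments to blk_queue_write_cache() can be summarised in a small user-space model. This is a sketch, not the driver code: struct demo_cache_cfg and demo_pick_cache_cfg() are hypothetical, and the "warned" field merely records where the driver would call dev_warn().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical result bundle for the demo. */
struct demo_cache_cfg {
	bool wbc;	/* write-back cache flag handed to the block layer */
	bool fua;	/* forced-unit-access flag handed to the block layer */
	bool warned;	/* true where the driver would warn about persistence */
};

/*
 * has_flush:      models nvdimm_has_flush(), negative when unknown.
 * region_is_pmem: models nvdimm_has_cache(), false for volatile regions.
 */
static struct demo_cache_cfg demo_pick_cache_cfg(int has_flush,
						 bool region_is_pmem)
{
	struct demo_cache_cfg cfg = {
		.wbc = region_is_pmem,
		.fua = false,
		.warned = false,
	};

	if (has_flush < 0)
		cfg.warned = true;	/* persistence cannot be guaranteed */
	else
		cfg.fua = has_flush != 0;

	return cfg;
}
```

For a volatile region wbc comes out false, so the write-cache flag is left clear and the fs/dax.c gating from the previous patch skips the flush at fsync() time.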
[PATCH v2 29/33] uio, libnvdimm, pmem: implement cache bypass for all copy_from_iter() operations
Introduce copy_from_iter_ops() to enable passing custom sub-routines to
iterate_and_advance(). Define pmem operations that guarantee cache
bypass to supplement the existing usage of __copy_from_iter_nocache()
backed by arch_wb_cache_pmem().

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Toshi Kani
Cc: Al Viro
Cc: Matthew Wilcox
Cc: Ross Zwisler
Cc: Linus Torvalds
Signed-off-by: Dan Williams
---
 drivers/nvdimm/Kconfig |    1 +
 drivers/nvdimm/pmem.c  |   38 +-
 drivers/nvdimm/pmem.h  |    7 +++
 drivers/nvdimm/x86.c   |   48
 include/linux/uio.h    |    4
 lib/Kconfig            |    3 +++
 lib/iov_iter.c         |   25 +
 7 files changed, 89 insertions(+), 37 deletions(-)

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 4d45196d6f94..28002298cdc8 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -38,6 +38,7 @@ config BLK_DEV_PMEM

 config ARCH_HAS_PMEM_API
 	depends on X86_64
+	select COPY_FROM_ITER_OPS
 	def_bool y

 config ND_BLK
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 329895ca88e1..b000c6db5731 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -223,43 +223,7 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 		void *addr, size_t bytes, struct iov_iter *i)
 {
-	size_t len;
-
-	/* TODO: skip the write-back by always using non-temporal stores */
-	len = copy_from_iter_nocache(addr, bytes, i);
-
-	/*
-	 * In the iovec case on x86_64 copy_from_iter_nocache() uses
-	 * non-temporal stores for the bulk of the transfer, but we need
-	 * to manually flush if the transfer is unaligned. A cached
-	 * memory copy is used when destination or size is not naturally
-	 * aligned. That is:
-	 *   - Require 8-byte alignment when size is 8 bytes or larger.
-	 *   - Require 4-byte alignment when size is 4 bytes.
-	 *
-	 * In the non-iovec case the entire destination needs to be
-	 * flushed.
-	 */
-	if (iter_is_iovec(i)) {
-		unsigned long flushed, dest = (unsigned long) addr;
-
-		if (bytes < 8) {
-			if (!IS_ALIGNED(dest, 4) || (bytes != 4))
-				arch_wb_cache_pmem(addr, 1);
-		} else {
-			if (!IS_ALIGNED(dest, 8)) {
-				dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
-				arch_wb_cache_pmem(addr, 1);
-			}
-
-			flushed = dest - (unsigned long) addr;
-			if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
-				arch_wb_cache_pmem(addr + bytes - 1, 1);
-		}
-	} else
-		arch_wb_cache_pmem(addr, bytes);
-
-	return len;
+	return arch_copy_from_iter_pmem(addr, bytes, i);
 }

 static const struct block_device_operations pmem_fops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 5900c1b7..574b63fb5376 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -3,11 +3,13 @@
 #include
 #include
 #include
+#include
 #include

 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void arch_invalidate_pmem(void *addr, size_t size);
+size_t arch_copy_from_iter_pmem(void *addr, size_t bytes, struct iov_iter *i);
 #else
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
@@ -15,6 +17,11 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
 static inline void arch_invalidate_pmem(void *addr, size_t size)
 {
 }
+static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
+		struct iov_iter *i)
+{
+	return copy_from_iter_nocache(addr, bytes, i);
+}
 #endif

 /* this definition is in it's own header for tools/testing/nvdimm to consume */
diff --git a/drivers/nvdimm/x86.c b/drivers/nvdimm/x86.c
index d99b452332a9..bc145d760d43 100644
--- a/drivers/nvdimm/x86.c
+++ b/drivers/nvdimm/x86.c
@@ -10,6 +10,9 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include
+#include
+#include
 #include
 #include
 #include
@@ -105,3 +108,48 @@ void arch_memcpy_to_pmem(void *_dst, void *_src, unsigned size)
 	}
 }
 EXPORT_SYMBOL_GPL(arch_memcpy_to_pmem);
+
+static int pmem_from_user(void *dst, const void __user *src, unsigned size)
+{
+	unsigned long flushed, dest = (unsigned long) dst;
+	int rc = __copy_from_user_nocache(dst, src, size);
+
+	/*
+	 * On x86_64 __copy_from_user_nocache() uses non-temporal stores
+	 * for the bulk of the transfer, but we need to manually flush
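As a rough userspace illustration of the idea in this patch (handing the iterator a caller-supplied copy routine so a driver can choose how each segment is written), the sketch below walks a list of source segments and calls a callback per segment. Every name here (demo_seg, demo_copy_fn, demo_copy_from_segs, demo_copy_and_flush) is hypothetical and is not the kernel's copy_from_iter_ops() interface; the "flush" is only a counter standing in for a cache write-back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical source segment, loosely modelled on an iovec entry. */
struct demo_seg {
	const void *base;
	size_t len;
};

/* Caller-supplied copy routine; returns 0 on success, non-zero to stop. */
typedef int (*demo_copy_fn)(void *dst, const void *src, size_t len);

/*
 * Copy up to 'bytes' bytes from the segments into dst, calling 'copy'
 * once per segment. Returns the number of bytes actually copied.
 */
static size_t demo_copy_from_segs(void *dst, size_t bytes,
				  const struct demo_seg *segs, size_t nsegs,
				  demo_copy_fn copy)
{
	size_t done = 0;

	assert(dst && segs && copy);
	for (size_t s = 0; s < nsegs && done < bytes; s++) {
		size_t n = segs[s].len;

		if (n > bytes - done)
			n = bytes - done;
		if (copy((char *)dst + done, segs[s].base, n))
			break;
		done += n;
	}
	return done;
}

/* Demo-only counter standing in for a per-segment cache write-back. */
static size_t demo_flush_calls;

/* A "pmem-like" routine: plain copy followed by the stand-in flush. */
static int demo_copy_and_flush(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	demo_flush_calls++;
	return 0;
}
```

A driver without durability requirements would pass a routine that only calls memcpy(), which is the kind of per-driver choice the series is trying to make possible.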
[PATCH v2 24/33] filesystem-dax: convert to dax_flush()
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so the dax_flush()
helper skips cache management work when the underlying driver does not
specify a flush method.

We still do all the dirty tracking since the radix entry will already be
there for locking purposes. However, the work to clean the entry will be
a nop for some dax drivers.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 fs/dax.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b9909c91df..edbf988de86c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -798,7 +798,7 @@ static int dax_writeback_one(struct block_device *bdev,
 	}

 	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-	wb_cache_pmem(kaddr, size);
+	dax_flush(dax_dev, pgoff, kaddr, size);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
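The "skip the work when the driver has no flush method" behaviour described above can be modelled in userspace roughly as follows. This is only a sketch: the demo_ types and functions are hypothetical stand-ins, not the kernel's struct dax_device or struct dax_operations, and the pmem-like flush merely counts bytes instead of writing back cache lines.

```c
#include <assert.h>
#include <stddef.h>

struct demo_dax_device;

/* Hypothetical operations table; 'flush' is optional and may be NULL. */
struct demo_dax_ops {
	void (*flush)(struct demo_dax_device *dev, unsigned long pgoff,
		      void *addr, size_t size);
};

struct demo_dax_device {
	const struct demo_dax_ops *ops;
	int alive;		/* models the dax_alive() check */
	size_t flushed_bytes;	/* demo-only bookkeeping */
};

/* Stand-in for a pmem driver's flush: just record how much was flushed. */
static void demo_pmem_flush(struct demo_dax_device *dev, unsigned long pgoff,
			    void *addr, size_t size)
{
	(void)pgoff;
	(void)addr;
	dev->flushed_bytes += size;
}

/* Dispatch helper: a no-op for dead devices and for drivers without flush. */
static void demo_dax_flush(struct demo_dax_device *dev, unsigned long pgoff,
			   void *addr, size_t size)
{
	assert(dev && dev->ops);
	if (!dev->alive)
		return;
	if (dev->ops->flush)
		dev->ops->flush(dev, pgoff, addr, size);
}

/* A pmem-like driver provides flush; a brd-like driver leaves it NULL. */
static const struct demo_dax_ops demo_pmem_ops = { .flush = demo_pmem_flush };
static const struct demo_dax_ops demo_brd_ops = { .flush = NULL };
```

Calling demo_dax_flush() on a device that uses demo_brd_ops returns without touching anything, which mirrors the "nop for some dax drivers" remark in the changelog.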
[PATCH v2 20/33] dm: add ->copy_from_iter() dax operation support
Allow device-mapper to route copy_from_iter operations to the
per-target implementation. In order for the device stacking to work we
need a dax_dev and a pgoff relative to that device. This gives each
layer of the stack the information it needs to look up the operation
pointer for the next level.

This conceptually allows for an array of mixed device drivers with
varying copy_from_iter implementations.

Cc: Toshi Kani
Cc: Mike Snitzer
Signed-off-by: Dan Williams
---
 drivers/dax/super.c           |   13 +
 drivers/md/dm-linear.c        |   15 +++
 drivers/md/dm-stripe.c        |   20
 drivers/md/dm.c               |   26 ++
 include/linux/dax.h           |    2 ++
 include/linux/device-mapper.h |    3 +++
 6 files changed, 79 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 23ce3ab49f10..73f0da8e5d27 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -104,6 +105,18 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
 }
 EXPORT_SYMBOL_GPL(dax_direct_access);

+size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+		size_t bytes, struct iov_iter *i)
+{
+	if (!dax_alive(dax_dev))
+		return 0;
+
+	if (!dax_dev->ops->copy_from_iter)
+		return copy_from_iter(addr, bytes, i);
+	return dax_dev->ops->copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+EXPORT_SYMBOL_GPL(dax_copy_from_iter);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
 	lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index c5a52f4dae81..5fe44a0ddfab 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -158,6 +158,20 @@ static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
 	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
 }

+static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	struct linear_c *lc = ti->private;
+	struct block_device *bdev = lc->dev->bdev;
+	struct dax_device *dax_dev = lc->dev->dax_dev;
+	sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+	dev_sector = linear_map_sector(ti, sector);
+	if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
+		return 0;
+	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+
 static struct target_type linear_target = {
 	.name   = "linear",
 	.version = {1, 3, 0},
@@ -169,6 +183,7 @@ static struct target_type linear_target = {
 	.prepare_ioctl = linear_prepare_ioctl,
 	.iterate_devices = linear_iterate_devices,
 	.direct_access = linear_dax_direct_access,
+	.dax_copy_from_iter = linear_dax_copy_from_iter,
 };

 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index cb4b1e9e16ab..4f45d23249b2 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -330,6 +330,25 @@ static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
 	return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
 }

+static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+	struct stripe_c *sc = ti->private;
+	struct dax_device *dax_dev;
+	struct block_device *bdev;
+	uint32_t stripe;
+
+	stripe_map_sector(sc, sector, &stripe, &dev_sector);
+	dev_sector += sc->stripe[stripe].physical_start;
+	dax_dev = sc->stripe[stripe].dev->dax_dev;
+	bdev = sc->stripe[stripe].dev->bdev;
+
+	if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
+		return 0;
+	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+
 /*
  * Stripe status:
  *
@@ -448,6 +467,7 @@ static struct target_type stripe_target = {
 	.iterate_devices = stripe_iterate_devices,
 	.io_hints = stripe_io_hints,
 	.direct_access = stripe_dax_direct_access,
+	.dax_copy_from_iter = stripe_dax_copy_from_iter,
 };

 int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 79d5f5fd823e..8c8579efcba2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -957,6 +958,30 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	return ret;
 }

+static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	struct mapped_device *md = dax_get_private(dax_dev);
+	sector_t sector = pgoff *
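The "pgoff relative to that device" translation that each layer performs can be sketched in userspace for the linear case. This is a simplified model under stated assumptions: 512-byte sectors, 4096-byte pages, and hypothetical demo_ names; it is not the kernel's linear_map_sector() or bdev_dax_pgoff(), and it omits the separate check on the transfer size.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SECTOR_SIZE  512u	/* assumed sector size */
#define DEMO_PAGE_SIZE    4096u	/* assumed page size */
#define DEMO_PAGE_SECTORS (DEMO_PAGE_SIZE / DEMO_SECTOR_SIZE)

/* Hypothetical description of one linear target. */
struct demo_linear {
	uint64_t target_begin;	/* first sector of the target in the mapped device */
	uint64_t dev_start;	/* sector on the lower device where the target's data begins */
	uint64_t part_start;	/* start sector of the lower partition on its disk */
};

/*
 * Translate a page offset in the stacked device into a page offset in
 * the lower device. Returns 0 on success, -1 if the resulting byte
 * offset is not page aligned (the case where the kernel helper fails).
 */
static int demo_linear_pgoff(const struct demo_linear *lc, uint64_t pgoff,
			     uint64_t *dev_pgoff)
{
	uint64_t sector, dev_sector, phys_off;

	assert(lc && dev_pgoff);
	sector = pgoff * DEMO_PAGE_SECTORS;
	dev_sector = lc->dev_start + (sector - lc->target_begin);
	phys_off = (lc->part_start + dev_sector) * DEMO_SECTOR_SIZE;

	if (phys_off % DEMO_PAGE_SIZE)
		return -1;
	*dev_pgoff = phys_off / DEMO_PAGE_SIZE;
	return 0;
}
```

With dev_start at sector 2048, page 3 of the mapped device lands on sector 2072 of the lower device, that is page 259; moving dev_start to 2049 makes the offset unaligned and the lookup fails, which is why the dm helpers above return early when bdev_dax_pgoff() reports an error.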
[PATCH v2 19/33] dax, pmem: introduce 'copy_from_iter' dax operation
The direct-I/O write path for a pmem device must ensure that data is
flushed to a power-fail safe zone when the operation is complete.
However, other dax capable block devices, like brd, do not have this
requirement. Introduce a 'copy_from_iter' dax operation so that pmem can
inject cache management without imposing this overhead on other dax
capable block_device drivers.

This is also a first step of moving all architecture-specific
pmem-operations to the pmem driver.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Matthew Wilcox
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 drivers/nvdimm/pmem.c |   43 +++
 include/linux/dax.h   |    3 +++
 2 files changed, 46 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..e501df4ab4b4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -220,6 +220,48 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 	return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
 }

+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	size_t len;
+
+	/* TODO: skip the write-back by always using non-temporal stores */
+	len = copy_from_iter_nocache(addr, bytes, i);
+
+	/*
+	 * In the iovec case on x86_64 copy_from_iter_nocache() uses
+	 * non-temporal stores for the bulk of the transfer, but we need
+	 * to manually flush if the transfer is unaligned. A cached
+	 * memory copy is used when destination or size is not naturally
+	 * aligned. That is:
+	 *   - Require 8-byte alignment when size is 8 bytes or larger.
+	 *   - Require 4-byte alignment when size is 4 bytes.
+	 *
+	 * In the non-iovec case the entire destination needs to be
+	 * flushed.
+	 */
+	if (iter_is_iovec(i)) {
+		unsigned long flushed, dest = (unsigned long) addr;
+
+		if (bytes < 8) {
+			if (!IS_ALIGNED(dest, 4) || (bytes != 4))
+				wb_cache_pmem(addr, 1);
+		} else {
+			if (!IS_ALIGNED(dest, 8)) {
+				dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+				wb_cache_pmem(addr, 1);
+			}
+
+			flushed = dest - (unsigned long) addr;
+			if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
+				wb_cache_pmem(addr + bytes - 1, 1);
+		}
+	} else
+		wb_cache_pmem(addr, bytes);
+
+	return len;
+}
+
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		pmem_rw_page,
@@ -236,6 +278,7 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,

 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
+	.copy_from_iter = pmem_copy_from_iter,
 };

 static void pmem_release_queue(void *q)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
 	 */
 	long (*direct_access)(struct dax_device *, pgoff_t, long,
 			void **, pfn_t *);
+	/* copy_from_iter: dax-driver override for default copy_from_iter */
+	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+			struct iov_iter *);
 };

 int dax_read_lock(void);
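To make the alignment rules in the iovec branch easier to follow, the sketch below restates them as a pure function that reports which ends of a transfer would need a manual cache-line write-back. It is a userspace model only: the demo_ names are hypothetical, the cache-line size is passed in rather than read from boot_cpu_data, and nothing is actually flushed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FLUSH_HEAD 0x1u	/* write back the line holding the first byte */
#define DEMO_FLUSH_TAIL 0x2u	/* write back the line holding the last byte */

static int demo_is_aligned(uintptr_t x, uintptr_t a)
{
	return (x & (a - 1)) == 0;
}

static uintptr_t demo_align_up(uintptr_t x, uintptr_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Mirror of the iovec-case decision logic from pmem_copy_from_iter():
 * returns a mask of DEMO_FLUSH_* bits for a copy of 'bytes' bytes to
 * address 'dest', given a power-of-two cache-line size 'line'.
 */
static unsigned demo_flush_plan(uintptr_t dest, size_t bytes, uintptr_t line)
{
	uintptr_t start = dest;
	unsigned plan = 0;

	assert(line && (line & (line - 1)) == 0);
	if (bytes < 8) {
		/* Only an aligned 4-byte store bypasses the cache. */
		if (!demo_is_aligned(dest, 4) || bytes != 4)
			plan |= DEMO_FLUSH_HEAD;
	} else {
		uintptr_t flushed;

		if (!demo_is_aligned(dest, 8)) {
			dest = demo_align_up(dest, line);
			plan |= DEMO_FLUSH_HEAD;
		}
		flushed = dest - start;
		if (bytes > flushed && !demo_is_aligned(bytes - flushed, 8))
			plan |= DEMO_FLUSH_TAIL;
	}
	return plan;
}
```

For example, a 4096-byte copy to a page-aligned address yields an empty plan, while a 13-byte copy to an 8-byte-aligned address needs only the tail write-back.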
[PATCH v2 23/33] dm: add ->flush() dax operation support
Allow device-mapper to route flush operations to the per-target
implementation. In order for the device stacking to work we need a
dax_dev and a pgoff relative to that device. This gives each layer of
the stack the information it needs to look up the operation pointer for
the next level.

This conceptually allows for an array of mixed device drivers with
varying flush implementations.

Cc: Toshi Kani
Cc: Mike Snitzer
Signed-off-by: Dan Williams
---
 drivers/dax/super.c           |   11 +++
 drivers/md/dm-linear.c        |   15 +++
 drivers/md/dm-stripe.c        |   20
 drivers/md/dm.c               |   19 +++
 include/linux/dax.h           |    2 ++
 include/linux/device-mapper.h |    3 +++
 6 files changed, 70 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 73f0da8e5d27..1253c05a2e53 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -117,6 +117,17 @@ size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 }
 EXPORT_SYMBOL_GPL(dax_copy_from_iter);

+void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+		size_t size)
+{
+	if (!dax_alive(dax_dev))
+		return;
+
+	if (dax_dev->ops->flush)
+		dax_dev->ops->flush(dax_dev, pgoff, addr, size);
+}
+EXPORT_SYMBOL_GPL(dax_flush);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
 	lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 5fe44a0ddfab..70d8439a1b63 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -172,6 +172,20 @@ static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
 	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
 }

+static void linear_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+		size_t size)
+{
+	struct linear_c *lc = ti->private;
+	struct block_device *bdev = lc->dev->bdev;
+	struct dax_device *dax_dev = lc->dev->dax_dev;
+	sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+	dev_sector = linear_map_sector(ti, sector);
+	if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+		return;
+	dax_flush(dax_dev, pgoff, addr, size);
+}
+
 static struct target_type linear_target = {
 	.name   = "linear",
 	.version = {1, 3, 0},
@@ -184,6 +198,7 @@ static struct target_type linear_target = {
 	.iterate_devices = linear_iterate_devices,
 	.direct_access = linear_dax_direct_access,
 	.dax_copy_from_iter = linear_dax_copy_from_iter,
+	.dax_flush = linear_dax_flush,
 };

 int __init dm_linear_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 4f45d23249b2..829fd438318d 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -349,6 +349,25 @@ static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
 	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
 }

+static void stripe_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+		size_t size)
+{
+	sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+	struct stripe_c *sc = ti->private;
+	struct dax_device *dax_dev;
+	struct block_device *bdev;
+	uint32_t stripe;
+
+	stripe_map_sector(sc, sector, &stripe, &dev_sector);
+	dev_sector += sc->stripe[stripe].physical_start;
+	dax_dev = sc->stripe[stripe].dev->dax_dev;
+	bdev = sc->stripe[stripe].dev->bdev;
+
+	if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+		return;
+	dax_flush(dax_dev, pgoff, addr, size);
+}
+
 /*
  * Stripe status:
  *
@@ -468,6 +487,7 @@ static struct target_type stripe_target = {
 	.io_hints = stripe_io_hints,
 	.direct_access = stripe_dax_direct_access,
 	.dax_copy_from_iter = stripe_dax_copy_from_iter,
+	.dax_flush = stripe_dax_flush,
 };

 int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8c8579efcba2..6a97711cdbdf 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -982,6 +982,24 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 	return ret;
 }

+static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+		size_t size)
+{
+	struct mapped_device *md = dax_get_private(dax_dev);
+	sector_t sector = pgoff * PAGE_SECTORS;
+	struct dm_target *ti;
+	int srcu_idx;
+
+	ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+	if (!ti)
+		goto out;
+	if (ti->type->dax_flush)
+		ti->type->dax_flush(ti, pgoff, addr, size);
+ out:
+	dm_put_live_table(md, srcu_idx);
+}
+
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_PREFLUSH.
@@ -2844,6 +2862,7 @@ static const struct
[PATCH v2 16/33] block, dax: convert bdev_dax_supported() to dax_direct_access()
Kill off the final user of bdev_direct_access() and struct blk_dax_ctl.

Signed-off-by: Dan Williams
---
 fs/block_dev.c |   48
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2f7885712575..ecbdc8f9f718 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -788,35 +788,43 @@ EXPORT_SYMBOL(bdev_dax_pgoff);
  */
 int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
-	struct blk_dax_ctl dax = {
-		.sector = 0,
-		.size = PAGE_SIZE,
-	};
-	int err;
+	struct block_device *bdev = sb->s_bdev;
+	struct dax_device *dax_dev;
+	pgoff_t pgoff;
+	int err, id;
+	void *kaddr;
+	pfn_t pfn;
+	long len;

 	if (blocksize != PAGE_SIZE) {
 		vfs_msg(sb, KERN_ERR, "error: unsupported blocksize for dax");
 		return -EINVAL;
 	}

-	err = bdev_direct_access(sb->s_bdev, &dax);
-	if (err < 0) {
-		switch (err) {
-		case -EOPNOTSUPP:
-			vfs_msg(sb, KERN_ERR,
-				"error: device does not support dax");
-			break;
-		case -EINVAL:
-			vfs_msg(sb, KERN_ERR,
-				"error: unaligned partition for dax");
-			break;
-		default:
-			vfs_msg(sb, KERN_ERR,
-				"error: dax access failed (%d)", err);
-		}
+	err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
+	if (err) {
+		vfs_msg(sb, KERN_ERR, "error: unaligned partition for dax");
 		return err;
 	}

+	dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+	if (!dax_dev) {
+		vfs_msg(sb, KERN_ERR, "error: device does not support dax");
+		return -EOPNOTSUPP;
+	}
+
+	id = dax_read_lock();
+	len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
+	dax_read_unlock(id);
+
+	put_dax(dax_dev);
+
+	if (len < 1) {
+		vfs_msg(sb, KERN_ERR,
+			"error: dax access failed (%ld)", len);
+		return len < 0 ? len : -EIO;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(bdev_dax_supported);
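The final check in the new bdev_dax_supported() folds two failure modes into one return value: a negative result from the access call is passed through as the error, while a result of zero pages becomes -EIO. A minimal userspace restatement of that convention, with a hypothetical function name, might look like this:

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: 'len' is the number of pages an access call
 * reported as available, or a negative errno value. Returns 0 when at
 * least one page is usable, otherwise a negative errno value.
 */
static int demo_check_access(long len)
{
	if (len < 1)
		return len < 0 ? (int)len : -EIO;
	return 0;
}
```

The zero case needs its own error code because "no pages available" is not an error value in itself, yet the caller still cannot use the device for DAX.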
[PATCH v2 15/33] filesystem-dax: convert to dax_direct_access()
Now that a dax_device is plumbed through all dax-capable drivers we can
switch from block_device_operations to dax_operations for invoking
->direct_access.

This also lets us kill off some usages of struct blk_dax_ctl on the way
to its eventual removal.

Suggested-by: Christoph Hellwig
Signed-off-by: Dan Williams
---
 fs/dax.c            | 277 +--
 fs/iomap.c          |   3 -
 include/linux/dax.h |   6 +
 3 files changed, 162 insertions(+), 124 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b78a6947c4f5..ce9dc9c3e829 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -55,32 +55,6 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
-{
-	struct request_queue *q = bdev->bd_queue;
-	long rc = -EIO;
-
-	dax->addr = ERR_PTR(-EIO);
-	if (blk_queue_enter(q, true) != 0)
-		return rc;
-
-	rc = bdev_direct_access(bdev, dax);
-	if (rc < 0) {
-		dax->addr = ERR_PTR(rc);
-		blk_queue_exit(q);
-		return rc;
-	}
-	return rc;
-}
-
-static void dax_unmap_atomic(struct block_device *bdev,
-		const struct blk_dax_ctl *dax)
-{
-	if (IS_ERR(dax->addr))
-		return;
-	blk_queue_exit(bdev->bd_queue);
-}
-
 static int dax_is_pmd_entry(void *entry)
 {
 	return (unsigned long)entry & RADIX_DAX_PMD;
@@ -553,21 +527,30 @@ static int dax_load_hole(struct address_space *mapping, void **entry,
 	return ret;
 }
 
-static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
-		struct page *to, unsigned long vaddr)
+static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
+		sector_t sector, size_t size, struct page *to,
+		unsigned long vaddr)
 {
-	struct blk_dax_ctl dax = {
-		.sector = sector,
-		.size = size,
-	};
-	void *vto;
-
-	if (dax_map_atomic(bdev, &dax) < 0)
-		return PTR_ERR(dax.addr);
+	void *vto, *kaddr;
+	pgoff_t pgoff;
+	pfn_t pfn;
+	long rc;
+	int id;
+
+	rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+	if (rc)
+		return rc;
+
+	id = dax_read_lock();
+	rc = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), &kaddr, &pfn);
+	if (rc < 0) {
+		dax_read_unlock(id);
+		return rc;
+	}
 	vto = kmap_atomic(to);
-	copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
+	copy_user_page(vto, (void __force *)kaddr, vaddr, to);
 	kunmap_atomic(vto);
-	dax_unmap_atomic(bdev, &dax);
+	dax_read_unlock(id);
 	return 0;
 }
 
@@ -735,12 +718,16 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 }
 
 static int dax_writeback_one(struct block_device *bdev,
-		struct address_space *mapping, pgoff_t index, void *entry)
+		struct dax_device *dax_dev, struct address_space *mapping,
+		pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	struct blk_dax_ctl dax;
-	void *entry2, **slot;
-	int ret = 0;
+	void *entry2, **slot, *kaddr;
+	long ret = 0, id;
+	sector_t sector;
+	pgoff_t pgoff;
+	size_t size;
+	pfn_t pfn;
 
 	/*
 	 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -789,26 +776,29 @@ static int dax_writeback_one(struct block_device *bdev,
 	 * 'entry'. This allows us to flush for PMD_SIZE and not have to
 	 * worry about partial PMD writebacks.
 	 */
-	dax.sector = dax_radix_sector(entry);
-	dax.size = PAGE_SIZE << dax_radix_order(entry);
+	sector = dax_radix_sector(entry);
+	size = PAGE_SIZE << dax_radix_order(entry);
+
+	id = dax_read_lock();
+	ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+	if (ret)
+		goto dax_unlock;
 
 	/*
-	 * We cannot hold tree_lock while calling dax_map_atomic() because it
-	 * eventually calls cond_resched().
+	 * dax_direct_access() may sleep, so cannot hold tree_lock over
+	 * its invocation.
 	 */
-	ret = dax_map_atomic(bdev, &dax);
-	if (ret < 0) {
-		put_locked_mapping_entry(mapping, index, entry);
-		return ret;
-	}
+	ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
+	if (ret < 0)
+		goto dax_unlock;
 
-	if (WARN_ON_ONCE(ret < dax.size)) {
+	if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
 		ret = -EIO;
-		goto unmap;
+		goto dax_unlock;
 	}
 
-	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
-	wb_cache_pmem(dax.addr, dax.size);
+	dax_mapping_entry_mkclean(mapping, index,
[PATCH v2 14/33] Revert "block: use DAX for partition table reads"
commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.

Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.

Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
223757016837 ("block_dev: remove DAX leftovers").

Cc: Jeff Moyer
Signed-off-by: Dan Williams
---
 block/partition-generic.c | 17 ++---
 fs/dax.c                  | 20
 include/linux/dax.h       |  6 --
 3 files changed, 2 insertions(+), 41 deletions(-)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 7afb9907821f..5dfac337b0f2 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -16,7 +16,6 @@
 #include
 #include
 #include
-#include
 #include
 
 #include "partitions/check.h"
@@ -631,24 +630,12 @@ int invalidate_partitions(struct gendisk *disk, struct block_device *bdev)
 	return 0;
 }
 
-static struct page *read_pagecache_sector(struct block_device *bdev, sector_t n)
-{
-	struct address_space *mapping = bdev->bd_inode->i_mapping;
-
-	return read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)),
-				 NULL);
-}
-
 unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
 {
+	struct address_space *mapping = bdev->bd_inode->i_mapping;
 	struct page *page;
 
-	/* don't populate page cache for dax capable devices */
-	if (IS_DAX(bdev->bd_inode))
-		page = read_dax_sector(bdev, n);
-	else
-		page = read_pagecache_sector(bdev, n);
-
+	page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT-9)), NULL);
 	if (!IS_ERR(page)) {
 		if (PageError(page))
 			goto fail;
diff --git a/fs/dax.c b/fs/dax.c
index de622d4282a6..b78a6947c4f5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -101,26 +101,6 @@ static int dax_is_empty_entry(void *entry)
 	return (unsigned long)entry & RADIX_DAX_EMPTY;
 }
 
-struct page *read_dax_sector(struct block_device *bdev, sector_t n)
-{
-	struct page *page = alloc_pages(GFP_KERNEL, 0);
-	struct blk_dax_ctl dax = {
-		.size = PAGE_SIZE,
-		.sector = n & ~((((int) PAGE_SIZE) / 512) - 1),
-	};
-	long rc;
-
-	if (!page)
-		return ERR_PTR(-ENOMEM);
-
-	rc = dax_map_atomic(bdev, &dax);
-	if (rc < 0)
-		return ERR_PTR(rc);
-	memcpy_from_pmem(page_address(page), dax.addr, PAGE_SIZE);
-	dax_unmap_atomic(bdev, &dax);
-	return page;
-}
-
 /*
  * DAX radix tree locking
  */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7e62e280c11f..0d0d890f9186 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -70,15 +70,9 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 		pgoff_t index, void *entry, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
-struct page *read_dax_sector(struct block_device *bdev, sector_t n);
 int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
 		unsigned int offset, unsigned int length);
 #else
-static inline struct page *read_dax_sector(struct block_device *bdev,
-		sector_t n)
-{
-	return ERR_PTR(-ENXIO);
-}
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		sector_t sector, unsigned int offset, unsigned int length)
 {
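The two sector computations touched by this revert can be illustrated in user space. This is only a sketch: it assumes 4 KiB pages (the SKETCH_* constants stand in for the kernel's PAGE_SHIFT and the 512-byte sector size), and the function names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT   12u	/* assumption: 4 KiB pages */
#define SKETCH_SECTOR_SHIFT  9u	/* 512-byte sectors, as in the patch */

/*
 * Page-cache index of the page that holds sector n, i.e. the
 * "n >> (PAGE_SHIFT-9)" expression read_dev_sector() keeps using.
 */
static inline uint64_t sketch_sector_to_page_index(uint64_t n)
{
	return n >> (SKETCH_PAGE_SHIFT - SKETCH_SECTOR_SHIFT);
}

/*
 * First sector of the page that holds sector n, i.e. the rounding the
 * removed read_dax_sector() applied before mapping a whole page.
 */
static inline uint64_t sketch_sector_page_start(uint64_t n)
{
	uint64_t sectors_per_page =
		1ull << (SKETCH_PAGE_SHIFT - SKETCH_SECTOR_SHIFT);

	return n & ~(sectors_per_page - 1);
}
```

With 4 KiB pages there are eight sectors per page, so sector 17 lives in page index 2, which starts at sector 16.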
[PATCH v2 13/33] ext2, ext4, xfs: retrieve dax_device for iomap operations
In preparation for converting fs/dax.c to use dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_device associated with a given block_device.

Signed-off-by: Dan Williams
---
 fs/ext2/inode.c       |  9 -
 fs/ext4/inode.c       |  9 -
 fs/xfs/xfs_iomap.c    | 10 ++
 include/linux/iomap.h |  1 +
 4 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 128cce540645..4c9d2d44e879 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -799,6 +799,7 @@ int ext2_get_block(struct inode *inode, sector_t iblock,
 static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		unsigned flags, struct iomap *iomap)
 {
+	struct block_device *bdev;
 	unsigned int blkbits = inode->i_blkbits;
 	unsigned long first_block = offset >> blkbits;
 	unsigned long max_blocks = (length + (1 << blkbits) - 1) >> blkbits;
@@ -812,8 +813,13 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		return ret;
 
 	iomap->flags = 0;
-	iomap->bdev = inode->i_sb->s_bdev;
+	bdev = inode->i_sb->s_bdev;
+	iomap->bdev = bdev;
 	iomap->offset = (u64)first_block << blkbits;
+	if (blk_queue_dax(bdev->bd_queue))
+		iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+	else
+		iomap->dax_dev = NULL;
 
 	if (ret == 0) {
 		iomap->type = IOMAP_HOLE;
@@ -835,6 +841,7 @@ static int
 ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 		ssize_t written, unsigned flags, struct iomap *iomap)
 {
+	put_dax(iomap->dax_dev);
 	if (iomap->type == IOMAP_MAPPED &&
 	    written < length &&
 	    (flags & IOMAP_WRITE))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4247d8d25687..2cb2634daa99 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3305,6 +3305,7 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
 static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 			    unsigned flags, struct iomap *iomap)
 {
+	struct block_device *bdev;
 	unsigned int blkbits = inode->i_blkbits;
 	unsigned long first_block = offset >> blkbits;
 	unsigned long last_block = (offset + length - 1) >> blkbits;
@@ -3373,7 +3374,12 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	}
 
 	iomap->flags = 0;
-	iomap->bdev = inode->i_sb->s_bdev;
+	bdev = inode->i_sb->s_bdev;
+	iomap->bdev = bdev;
+	if (blk_queue_dax(bdev->bd_queue))
+		iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+	else
+		iomap->dax_dev = NULL;
 	iomap->offset = first_block << blkbits;
 
 	if (ret == 0) {
@@ -3406,6 +3412,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 	int blkbits = inode->i_blkbits;
 	bool truncate = false;
 
+	put_dax(iomap->dax_dev);
 	if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
 		return 0;
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 288ee5b840d7..4b47403f8089 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -976,6 +976,7 @@ xfs_file_iomap_begin(
 	int			nimaps = 1, error = 0;
 	bool			shared = false, trimmed = false;
 	unsigned		lockmode;
+	struct block_device	*bdev;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
@@ -1063,6 +1064,14 @@ xfs_file_iomap_begin(
 	}
 
 	xfs_bmbt_to_iomap(ip, iomap, &imap);
+
+	/* optionally associate a dax device with the iomap bdev */
+	bdev = iomap->bdev;
+	if (blk_queue_dax(bdev->bd_queue))
+		iomap->dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+	else
+		iomap->dax_dev = NULL;
+
 	if (shared)
 		iomap->flags |= IOMAP_F_SHARED;
 	return 0;
@@ -1140,6 +1149,7 @@ xfs_file_iomap_end(
 	unsigned		flags,
 	struct iomap		*iomap)
 {
+	put_dax(iomap->dax_dev);
 	if ((flags & IOMAP_WRITE) && iomap->type == IOMAP_DELALLOC)
 		return xfs_file_iomap_end_delalloc(XFS_I(inode), offset,
 				length, written, iomap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7291810067eb..f753e788da31 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -41,6 +41,7 @@ struct iomap {
 	u16			type;	/* type of mapping */
 	u16			flags;	/* flags for mapping */
 	struct block_device	*bdev;	/* block device for I/O */
+	struct dax_device	*dax_dev; /* dax_dev for dax operations */
 };
 
 /*
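The patch above pairs a lookup in each ->iomap_begin() with an unconditional put_dax() in the matching ->iomap_end(). A toy user-space model of that pairing might look like the following; every sketch_* name is hypothetical, the plain integer counter merely stands in for whatever reference the real dax_get_by_host() takes, and the NULL tolerance in sketch_put() is inferred from the patch calling put_dax() even when no dax device was found.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct dax_device and struct iomap. */
struct sketch_dax_dev { int refs; };
struct sketch_iomap   { struct sketch_dax_dev *dax_dev; };

/* Take a reference; models the lookup done in ->iomap_begin(). */
static inline struct sketch_dax_dev *sketch_get(struct sketch_dax_dev *dev)
{
	if (dev)
		dev->refs++;
	return dev;
}

/* Drop a reference; NULL is accepted so callers need no branch. */
static inline void sketch_put(struct sketch_dax_dev *dev)
{
	if (dev)
		dev->refs--;
}

/* Only dax-capable queues get a dax_dev attached to the mapping. */
static inline void sketch_iomap_begin(struct sketch_iomap *map,
		struct sketch_dax_dev *dev, int queue_is_dax)
{
	map->dax_dev = queue_is_dax ? sketch_get(dev) : NULL;
}

/* Always balanced against begin, whether or not a device was found. */
static inline void sketch_iomap_end(struct sketch_iomap *map)
{
	sketch_put(map->dax_dev);
	map->dax_dev = NULL;
}
```

Each begin/end pair leaves the count where it started, both for dax-capable and for ordinary block devices.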
[PATCH v2 09/33] block: kill bdev_dax_capable()
This is leftover dead code that has since been replaced by
bdev_dax_supported().

Signed-off-by: Dan Williams
---
 fs/block_dev.c         | 24
 include/linux/blkdev.h |  1 -
 2 files changed, 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..7f40ea2f0875 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -807,30 +807,6 @@ int bdev_dax_supported(struct super_block *sb, int blocksize)
 }
 EXPORT_SYMBOL_GPL(bdev_dax_supported);
 
-/**
- * bdev_dax_capable() - Return if the raw device is capable for dax
- * @bdev: The device for raw block device access
- */
-bool bdev_dax_capable(struct block_device *bdev)
-{
-	struct blk_dax_ctl dax = {
-		.size = PAGE_SIZE,
-	};
-
-	if (!IS_ENABLED(CONFIG_FS_DAX))
-		return false;
-
-	dax.sector = 0;
-	if (bdev_direct_access(bdev, &dax) < 0)
-		return false;
-
-	dax.sector = bdev->bd_part->nr_sects - (PAGE_SIZE / 512);
-	if (bdev_direct_access(bdev, &dax) < 0)
-		return false;
-
-	return true;
-}
-
 /*
  * pseudo-fs
  */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da607ca04..f72708399b83 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1958,7 +1958,6 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 extern int bdev_dax_supported(struct super_block *, int);
-extern bool bdev_dax_capable(struct block_device *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
[PATCH v2 10/33] dax: introduce dax_direct_access()
Replace bdev_direct_access() with dax_direct_access() that uses
dax_device and dax_operations instead of a block_device and
block_device_operations for dax. Once all consumers of the old api have
been converted bdev_direct_access() will be deleted.

Given that block device partitioning decisions can cause dax page
alignment constraints to be violated this also introduces the
bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
to the dax_device and also checks for page alignment.

Signed-off-by: Dan Williams
---
 block/Kconfig          |  1 +
 drivers/dax/super.c    | 39 +++
 fs/block_dev.c         | 14 ++
 include/linux/blkdev.h |  1 +
 include/linux/dax.h    |  2 ++
 5 files changed, 57 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index e9f780f815f5..93da7fc3f254 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -6,6 +6,7 @@ menuconfig BLOCK
        default y
        select SBITMAP
        select SRCU
+       select DAX
        help
 	 Provide block layer support for the kernel.
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 45ccfc043da8..23ce3ab49f10 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -65,6 +65,45 @@ struct dax_device {
 	const struct dax_operations *ops;
 };
 
+/**
+ * dax_direct_access() - translate a device pgoff to an absolute pfn
+ * @dax_dev: a dax_device instance representing the logical memory range
+ * @pgoff: offset in pages from the start of the device to translate
+ * @nr_pages: number of consecutive pages caller can handle relative to @pfn
+ * @kaddr: output parameter that returns a virtual address mapping of pfn
+ * @pfn: output parameter that returns an absolute pfn translation of @pgoff
+ *
+ * Return: negative errno if an error occurs, otherwise the number of
+ * pages accessible at the device relative @pgoff.
+ */
+long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
+		void **kaddr, pfn_t *pfn)
+{
+	long avail;
+
+	/*
+	 * The device driver is allowed to sleep, in order to make the
+	 * memory directly accessible.
+	 */
+	might_sleep();
+
+	if (!dax_dev)
+		return -EOPNOTSUPP;
+
+	if (!dax_alive(dax_dev))
+		return -ENXIO;
+
+	if (nr_pages < 0)
+		return nr_pages;
+
+	avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
+			kaddr, pfn);
+	if (!avail)
+		return -ERANGE;
+	return min(avail, nr_pages);
+}
+EXPORT_SYMBOL_GPL(dax_direct_access);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
 	lockdep_assert_held(&dax_srcu);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7f40ea2f0875..2f7885712575 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -762,6 +763,19 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
 }
 EXPORT_SYMBOL_GPL(bdev_direct_access);
 
+int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
+		pgoff_t *pgoff)
+{
+	phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+
+	if (pgoff)
+		*pgoff = PHYS_PFN(phys_off);
+	if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
+		return -EINVAL;
+	return 0;
+}
+EXPORT_SYMBOL(bdev_dax_pgoff);
+
 /**
  * bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f72708399b83..612c497d1461 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1958,6 +1958,7 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 extern int bdev_dax_supported(struct super_block *, int);
+int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 39a0312c45c3..7e62e280c11f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -27,6 +27,8 @@ void put_dax(struct dax_device *dax_dev);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
+long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
+		void **kaddr, pfn_t *pfn);
 
 /*
  * We use lowest available bit in exceptional entry for locking, one bit for
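To make the alignment rule and the return-value contract concrete, here is a small user-space sketch of the arithmetic in bdev_dax_pgoff() and of the way dax_direct_access() post-processes the driver's answer. It assumes 4 KiB pages; the sketch_* names are hypothetical, and the partition start is passed in directly rather than read with get_start_sect().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ull	/* assumption: 4 KiB pages */

/*
 * Same arithmetic as bdev_dax_pgoff(): turn (partition start + sector)
 * into a byte offset, report the page offset, and reject offsets or
 * sizes that are not page aligned.
 */
static inline int sketch_dax_pgoff(uint64_t part_start_sect, uint64_t sector,
		size_t size, uint64_t *pgoff)
{
	uint64_t phys_off = (part_start_sect + sector) * 512;

	if (pgoff)
		*pgoff = phys_off / SKETCH_PAGE_SIZE;
	if (phys_off % SKETCH_PAGE_SIZE || size % SKETCH_PAGE_SIZE)
		return -EINVAL;
	return 0;
}

/*
 * Same post-processing as dax_direct_access() applies to the value the
 * driver returns: a negative request is passed back as-is, zero
 * available pages becomes -ERANGE, and the result never exceeds what
 * the caller asked for. A negative driver errno falls through min().
 */
static inline long sketch_clamp_avail(long avail, long nr_pages)
{
	if (nr_pages < 0)
		return nr_pages;
	if (!avail)
		return -ERANGE;
	return avail < nr_pages ? avail : nr_pages;
}
```

A partition that starts at sector 2048 (1 MiB) maps to page offset 256 and passes the check, while the legacy start sector 63 lands mid-page and is rejected with -EINVAL.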
[PATCH v2 08/33] dcssblk: add dax_operations support
Setup a dax_dev to have the same lifetime as the dcssblk block device
and add a ->direct_access() method that is equivalent to
dcssblk_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old dcssblk_direct_access() will be removed.

Cc: Gerald Schaefer
Signed-off-by: Dan Williams
---
 drivers/s390/block/Kconfig   |  1 +
 drivers/s390/block/dcssblk.c | 54 +++---
 2 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 4a3b62326183..0acb8c2f9475 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -14,6 +14,7 @@ config BLK_DEV_XPRAM
 
 config DCSSBLK
 	def_tristate m
+	select DAX
 	prompt "DCSSBLK support"
 	depends on S390 && BLOCK
 	help
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 415d10a67b7a..682a9eb4934d 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -30,8 +31,10 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static blk_qc_t dcssblk_make_request(struct request_queue *q,
 						struct bio *bio);
-static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+static long dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
 			 void **kaddr, pfn_t *pfn, long size);
+static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+		long nr_pages, void **kaddr, pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -40,7 +43,11 @@ static const struct block_device_operations dcssblk_devops = {
 	.owner		= THIS_MODULE,
 	.open		= dcssblk_open,
 	.release	= dcssblk_release,
-	.direct_access	= dcssblk_direct_access,
+	.direct_access	= dcssblk_blk_direct_access,
+};
+
+static const struct dax_operations dcssblk_dax_ops = {
+	.direct_access = dcssblk_dax_direct_access,
 };
 
 struct dcssblk_dev_info {
@@ -57,6 +64,7 @@ struct dcssblk_dev_info {
 	struct request_queue *dcssblk_queue;
 	int num_of_segments;
 	struct list_head seg_list;
+	struct dax_device *dax_dev;
 };
 
 struct segment_info {
@@ -389,6 +397,8 @@ dcssblk_shared_store(struct device *dev, struct device_attribute *attr, const ch
 	}
 
 	list_del(&dev_info->lh);
+	kill_dax(dev_info->dax_dev);
+	put_dax(dev_info->dax_dev);
 	del_gendisk(dev_info->gd);
 	blk_cleanup_queue(dev_info->dcssblk_queue);
 	dev_info->gd->queue = NULL;
@@ -525,6 +535,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
 	int rc, i, j, num_of_segments;
 	struct dcssblk_dev_info *dev_info;
 	struct segment_info *seg_info, *temp;
+	struct dax_device *dax_dev;
 	char *local_buf;
 	unsigned long seg_byte_size;
 
@@ -654,6 +665,11 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
 	if (rc)
 		goto put_dev;
 
+	dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name,
+			&dcssblk_dax_ops);
+	if (!dax_dev)
+		goto put_dev;
+
 	get_device(&dev_info->dev);
 	device_add_disk(&dev_info->dev, dev_info->gd);
 
@@ -752,6 +768,8 @@ dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch
 	}
 
 	list_del(&dev_info->lh);
+	kill_dax(dev_info->dax_dev);
+	put_dax(dev_info->dax_dev);
 	del_gendisk(dev_info->gd);
 	blk_cleanup_queue(dev_info->dcssblk_queue);
 	dev_info->gd->queue = NULL;
@@ -883,21 +901,39 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
 }
 
 static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
+__dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
+		long nr_pages, void **kaddr, pfn_t *pfn)
+{
+	resource_size_t offset = pgoff * PAGE_SIZE;
+	unsigned long dev_sz;
+
+	dev_sz = dev_info->end - dev_info->start + 1;
+	*kaddr = (void *) dev_info->start + offset;
+	*pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+
+	return (dev_sz - offset) / PAGE_SIZE;
+}
+
+static long
+dcssblk_blk_direct_access(struct block_device *bdev, sector_t secnum,
 			void **kaddr, pfn_t *pfn, long size)
 {
 	struct dcssblk_dev_info *dev_info;
-	unsigned long offset, dev_sz;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
-	dev_sz = dev_info->end - dev_info->start + 1;
-	offset = secnum * 512;
-	*kaddr =
[PATCH v2 09/33] block: kill bdev_dax_capable()
This is leftover dead code that has since been replaced by
bdev_dax_supported().

Signed-off-by: Dan Williams
---
 fs/block_dev.c         | 24
 include/linux/blkdev.h |  1 -
 2 files changed, 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..7f40ea2f0875 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -807,30 +807,6 @@ int bdev_dax_supported(struct super_block *sb, int blocksize)
 }
 EXPORT_SYMBOL_GPL(bdev_dax_supported);

-/**
- * bdev_dax_capable() - Return if the raw device is capable for dax
- * @bdev: The device for raw block device access
- */
-bool bdev_dax_capable(struct block_device *bdev)
-{
-	struct blk_dax_ctl dax = {
-		.size = PAGE_SIZE,
-	};
-
-	if (!IS_ENABLED(CONFIG_FS_DAX))
-		return false;
-
-	dax.sector = 0;
-	if (bdev_direct_access(bdev, &dax) < 0)
-		return false;
-
-	dax.sector = bdev->bd_part->nr_sects - (PAGE_SIZE / 512);
-	if (bdev_direct_access(bdev, &dax) < 0)
-		return false;
-
-	return true;
-}
-
 /*
  * pseudo-fs
  */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da607ca04..f72708399b83 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1958,7 +1958,6 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 extern int bdev_dax_supported(struct super_block *, int);
-extern bool bdev_dax_capable(struct block_device *);
 #else /* CONFIG_BLOCK */

 struct block_device;
[PATCH v2 10/33] dax: introduce dax_direct_access()
Replace bdev_direct_access() with dax_direct_access() that uses
dax_device and dax_operations instead of a block_device and
block_device_operations for dax. Once all consumers of the old api have
been converted bdev_direct_access() will be deleted.

Given that block device partitioning decisions can cause dax page
alignment constraints to be violated this also introduces the
bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
to the dax_device and also checks for page alignment.

Signed-off-by: Dan Williams
---
 block/Kconfig          |  1 +
 drivers/dax/super.c    | 39 +++
 fs/block_dev.c         | 14 ++
 include/linux/blkdev.h |  1 +
 include/linux/dax.h    |  2 ++
 5 files changed, 57 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index e9f780f815f5..93da7fc3f254 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -6,6 +6,7 @@ menuconfig BLOCK
 	default y
 	select SBITMAP
 	select SRCU
+	select DAX
 	help
 	 Provide block layer support for the kernel.

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 45ccfc043da8..23ce3ab49f10 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -65,6 +65,45 @@ struct dax_device {
 	const struct dax_operations *ops;
 };

+/**
+ * dax_direct_access() - translate a device pgoff to an absolute pfn
+ * @dax_dev: a dax_device instance representing the logical memory range
+ * @pgoff: offset in pages from the start of the device to translate
+ * @nr_pages: number of consecutive pages caller can handle relative to @pfn
+ * @kaddr: output parameter that returns a virtual address mapping of pfn
+ * @pfn: output parameter that returns an absolute pfn translation of @pgoff
+ *
+ * Return: negative errno if an error occurs, otherwise the number of
+ * pages accessible at the device relative @pgoff.
+ */
+long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
+		void **kaddr, pfn_t *pfn)
+{
+	long avail;
+
+	/*
+	 * The device driver is allowed to sleep, in order to make the
+	 * memory directly accessible.
+	 */
+	might_sleep();
+
+	if (!dax_dev)
+		return -EOPNOTSUPP;
+
+	if (!dax_alive(dax_dev))
+		return -ENXIO;
+
+	if (nr_pages < 0)
+		return nr_pages;
+
+	avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages,
+			kaddr, pfn);
+	if (!avail)
+		return -ERANGE;
+	return min(avail, nr_pages);
+}
+EXPORT_SYMBOL_GPL(dax_direct_access);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
 	lockdep_assert_held(&dax_srcu);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7f40ea2f0875..2f7885712575 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -762,6 +763,19 @@ long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
 }
 EXPORT_SYMBOL_GPL(bdev_direct_access);

+int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
+		pgoff_t *pgoff)
+{
+	phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+
+	if (pgoff)
+		*pgoff = PHYS_PFN(phys_off);
+	if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
+		return -EINVAL;
+	return 0;
+}
+EXPORT_SYMBOL(bdev_dax_pgoff);
+
 /**
  * bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f72708399b83..612c497d1461 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1958,6 +1958,7 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 extern int bdev_dax_supported(struct super_block *, int);
+int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
 #else /* CONFIG_BLOCK */

 struct block_device;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 39a0312c45c3..7e62e280c11f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -27,6 +27,8 @@ void put_dax(struct dax_device *dax_dev);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
+long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
+		void **kaddr, pfn_t *pfn);

 /*
  * We use lowest available bit in exceptional entry for locking, one bit for
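As an aside for readers without a kernel tree at hand, the alignment rule that bdev_dax_pgoff() enforces can be tried out in userspace. What follows is only a rough sketch under stated assumptions, not the kernel code: dax_pgoff_sketch(), SKETCH_PAGE_SIZE and SKETCH_SECTOR_SIZE are made-up names, the part_start argument stands in for get_start_sect(bdev), and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions for this sketch: 4 KiB pages, 512-byte sectors. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_SECTOR_SIZE	512u

/*
 * Hypothetical userspace analogue of bdev_dax_pgoff().
 * part_start: first sector of the partition (stand-in for get_start_sect()).
 * sector:     sector offset within the partition.
 * size:       length of the request in bytes.
 * pgoff:      optional output, page offset relative to the whole device.
 *
 * Returns 0 if both the byte offset and the size are page aligned,
 * -EINVAL otherwise. As in the patch, *pgoff is written (rounded down)
 * even when the alignment check fails.
 */
static int dax_pgoff_sketch(uint64_t part_start, uint64_t sector, size_t size,
			    uint64_t *pgoff)
{
	uint64_t phys_off = (part_start + sector) * SKETCH_SECTOR_SIZE;

	if (pgoff)
		*pgoff = phys_off / SKETCH_PAGE_SIZE;
	if (phys_off % SKETCH_PAGE_SIZE || size % SKETCH_PAGE_SIZE)
		return -EINVAL;
	return 0;
}
```

With these assumptions, a partition starting at sector 2048 (1 MiB) passes the check for page-sized requests at sector offsets that are multiples of 8, while a partition starting at sector 63 fails it for every such request, which is the partitioning problem the commit message refers to.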
[PATCH v2 06/33] axon_ram: add dax_operations support
Setup a dax_device to have the same lifetime as the axon_ram block
device and add a ->direct_access() method that is equivalent to
axon_ram_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old axon_ram_direct_access() will be removed.

Signed-off-by: Dan Williams
---
 arch/powerpc/platforms/Kconfig |  1 +
 arch/powerpc/sysdev/axonram.c  | 48 +++-
 2 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 7e3a2ebba29b..33244e3d9375 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -284,6 +284,7 @@ config CPM2
 config AXON_RAM
 	tristate "Axon DDR2 memory device driver"
 	depends on PPC_IBM_CELL_BLADE && BLOCK
+	select DAX
 	default m
 	help
 	  It registers one block device per Axon's DDR2 memory bank found
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index f523ac883150..ad857d5e81b1 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -25,6 +25,7 @@

 #include
 #include
+#include
 #include
 #include
 #include
@@ -62,6 +63,7 @@ static int azfs_major, azfs_minor;
 struct axon_ram_bank {
 	struct platform_device	*device;
 	struct gendisk		*disk;
+	struct dax_device	*dax_dev;
 	unsigned int		irq_id;
 	unsigned long		ph_addr;
 	unsigned long		io_addr;
@@ -137,25 +139,47 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
 	return BLK_QC_T_NONE;
 }

+static long
+__axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long nr_pages,
+		       void **kaddr, pfn_t *pfn)
+{
+	resource_size_t offset = pgoff * PAGE_SIZE;
+
+	*kaddr = (void *) bank->io_addr + offset;
+	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+	return (bank->size - offset) / PAGE_SIZE;
+}
+
 /**
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
 static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
+axon_ram_blk_direct_access(struct block_device *device, sector_t sector,
 		       void **kaddr, pfn_t *pfn, long size)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
-	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;

-	*kaddr = (void *) bank->io_addr + offset;
-	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
-	return bank->size - offset;
+	return __axon_ram_direct_access(bank, (sector * 512) / PAGE_SIZE,
+			size / PAGE_SIZE, kaddr, pfn) * PAGE_SIZE;
 }

 static const struct block_device_operations axon_ram_devops = {
 	.owner		= THIS_MODULE,
-	.direct_access	= axon_ram_direct_access
+	.direct_access	= axon_ram_blk_direct_access
+};
+
+static long
+axon_ram_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
+		       void **kaddr, pfn_t *pfn)
+{
+	struct axon_ram_bank *bank = dax_get_private(dax_dev);
+
+	return __axon_ram_direct_access(bank, pgoff, nr_pages, kaddr, pfn);
+}
+
+static const struct dax_operations axon_ram_dax_ops = {
+	.direct_access = axon_ram_dax_direct_access,
 };

 /**
@@ -219,6 +243,7 @@ static int axon_ram_probe(struct platform_device *device)
 		goto failed;
 	}

+
 	bank->disk->major = azfs_major;
 	bank->disk->first_minor = azfs_minor;
 	bank->disk->fops = &axon_ram_devops;
@@ -227,6 +252,11 @@ static int axon_ram_probe(struct platform_device *device)
 	sprintf(bank->disk->disk_name, "%s%d",
 			AXON_RAM_DEVICE_NAME, axon_ram_bank_id);

+	bank->dax_dev = alloc_dax(bank, bank->disk->disk_name,
+			&axon_ram_dax_ops);
+	if (!bank->dax_dev)
+		goto failed;
+
 	bank->disk->queue = blk_alloc_queue(GFP_KERNEL);
 	if (bank->disk->queue == NULL) {
 		dev_err(&device->dev, "Cannot register disk queue\n");
@@ -278,6 +308,10 @@ static int axon_ram_probe(struct platform_device *device)
 				del_gendisk(bank->disk);
 			put_disk(bank->disk);
 		}
+		if (bank->dax_dev) {
+			kill_dax(bank->dax_dev);
+			put_dax(bank->dax_dev);
+		}
 		device->dev.platform_data = NULL;
 		if (bank->io_addr != 0)
 			iounmap((void __iomem *) bank->io_addr);
@@ -300,6 +334,8 @@ axon_ram_remove(struct platform_device *device)
 	device_remove_file(&device->dev, &dev_attr_ecc);
 	free_irq(bank->irq_id, device);
+	kill_dax(bank->dax_dev);
+	put_dax(bank->dax_dev);
 	del_gendisk(bank->disk);
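The unit handling in the axon_ram conversion (a page-based core, a block wrapper that takes sectors and returns bytes, and a dax wrapper that works in pages throughout) may be easier to see in isolation. The userspace sketch below is an illustration under stated assumptions, not the driver code: struct bank_sketch and the sketch_*() functions are made-up names, a 4096-byte page is assumed, and the kaddr/pfn outputs are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096L

/* Hypothetical stand-in for struct axon_ram_bank; only the size is modelled. */
struct bank_sketch {
	long size;	/* device size in bytes */
};

/* Page-based core: pages available from pgoff to the end of the bank. */
static long sketch_core_access(const struct bank_sketch *bank, long pgoff)
{
	long offset = pgoff * SKETCH_PAGE_SIZE;

	return (bank->size - offset) / SKETCH_PAGE_SIZE;
}

/*
 * Block-style wrapper: takes a 512-byte sector, converts it to a page
 * offset (rounding down), and converts the page count back to bytes.
 */
static long sketch_blk_access(const struct bank_sketch *bank, uint64_t sector)
{
	long pgoff = (long)((sector * 512) / SKETCH_PAGE_SIZE);

	return sketch_core_access(bank, pgoff) * SKETCH_PAGE_SIZE;
}

/* DAX-style wrapper: already in pages, so it passes straight through. */
static long sketch_dax_access(const struct bank_sketch *bank, long pgoff)
{
	return sketch_core_access(bank, pgoff);
}
```

Note that the block-style wrapper silently rounds a sector that is not page aligned down to the containing page, so sectors 16 and 17 yield the same result in this sketch. That rounding is one reason a separate alignment check on the caller side is useful.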
[PATCH v2 05/33] pmem: add dax_operations support
Setup a dax_device to have the same lifetime as the pmem block device
and add a ->direct_access() method that is equivalent to
pmem_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old pmem_direct_access() will be removed.

Signed-off-by: Dan Williams
---
 drivers/dax/dax.h               |  7
 drivers/nvdimm/Kconfig          |  1 +
 drivers/nvdimm/pmem.c           | 61 +++
 drivers/nvdimm/pmem.h           |  7 +++-
 include/linux/dax.h             |  6
 tools/testing/nvdimm/pmem-dax.c | 21 ++---
 6 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
index 617bbc24be2b..f9e5feea742c 100644
--- a/drivers/dax/dax.h
+++ b/drivers/dax/dax.h
@@ -13,13 +13,6 @@
 #ifndef __DAX_H__
 #define __DAX_H__
 struct dax_device;
-struct dax_operations;
-struct dax_device *alloc_dax(void *private, const char *host,
-		const struct dax_operations *ops);
-void put_dax(struct dax_device *dax_dev);
-bool dax_alive(struct dax_device *dax_dev);
-void kill_dax(struct dax_device *dax_dev);
 struct dax_device *inode_dax(struct inode *inode);
 struct inode *dax_inode(struct dax_device *dax_dev);
-void *dax_get_private(struct dax_device *dax_dev);
 #endif /* __DAX_H__ */
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 59e750183b7f..5bdd499b5f4f 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -20,6 +20,7 @@ if LIBNVDIMM
 config BLK_DEV_PMEM
 	tristate "PMEM: Persistent memory block device support"
 	default LIBNVDIMM
+	select DAX
 	select ND_BTT if BTT
 	select ND_PFN if NVDIMM_PFN
 	help
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5b536be5a12e..fbbcf8154eec 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "pmem.h"
 #include "pfn.h"
@@ -199,13 +200,13 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 }

 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
-__weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
-		      void **kaddr, pfn_t *pfn, long size)
+__weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
+		long nr_pages, void **kaddr, pfn_t *pfn)
 {
-	struct pmem_device *pmem = bdev->bd_queue->queuedata;
-	resource_size_t offset = sector * 512 + pmem->data_offset;
+	resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;

-	if (unlikely(is_bad_pmem(&pmem->bb, sector, size)))
+	if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
+					PFN_PHYS(nr_pages))))
 		return -EIO;
 	*kaddr = pmem->virt_addr + offset;
 	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
@@ -215,26 +216,51 @@ __weak long pmem_direct_access(struct block_device *bdev, sector_t sector,
 	 * requested range.
 	 */
 	if (unlikely(pmem->bb.count))
-		return size;
-	return pmem->size - pmem->pfn_pad - offset;
+		return nr_pages;
+	return PHYS_PFN(pmem->size - pmem->pfn_pad - offset);
+}
+
+static long pmem_blk_direct_access(struct block_device *bdev, sector_t sector,
+		void **kaddr, pfn_t *pfn, long size)
+{
+	struct pmem_device *pmem = bdev->bd_queue->queuedata;
+
+	return __pmem_direct_access(pmem, PHYS_PFN(sector * 512),
+			PHYS_PFN(size), kaddr, pfn);
 }

 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		pmem_rw_page,
-	.direct_access =	pmem_direct_access,
+	.direct_access =	pmem_blk_direct_access,
 	.revalidate_disk =	nvdimm_revalidate_disk,
 };

+static long pmem_dax_direct_access(struct dax_device *dax_dev,
+		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
+{
+	struct pmem_device *pmem = dax_get_private(dax_dev);
+
+	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
+}
+
+static const struct dax_operations pmem_dax_ops = {
+	.direct_access = pmem_dax_direct_access,
+};
+
 static void pmem_release_queue(void *q)
 {
 	blk_cleanup_queue(q);
 }

-static void pmem_release_disk(void *disk)
+static void pmem_release_disk(void *__pmem)
 {
-	del_gendisk(disk);
-	put_disk(disk);
+	struct pmem_device *pmem = __pmem;
+
+	kill_dax(pmem->dax_dev);
+	put_dax(pmem->dax_dev);
+	del_gendisk(pmem->disk);
+	put_disk(pmem->disk);
 }

 static int pmem_attach_disk(struct device *dev,
@@ -245,6 +271,7 @@ static int pmem_attach_disk(struct device *dev,
 	struct vmem_altmap __altmap, *altmap = NULL;
 	struct resource *res = &nsio->res;
 	struct nd_pfn *nd_pfn = NULL;
+