Re: [PATCH 1/1] device-dax: avoid an unnecessary check in alloc_dev_dax_range()

2020-12-17 Thread Leizhen (ThunderTown)



On 2020/11/20 17:22, Zhen Lei wrote:
> Swap the calling sequence of krealloc() and __request_region(), call the
> latter first. In this way, the value of dev_dax->nr_range does not need to
> be considered when __request_region() failed.
> 
> Signed-off-by: Zhen Lei 
> ---
>  drivers/dax/bus.c | 29 -
>  1 file changed, 12 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 27513d311242..1efae11d947a 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -763,23 +763,15 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
>   return 0;
>   }
>  
> - ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
> - * (dev_dax->nr_range + 1), GFP_KERNEL);
> - if (!ranges)
> - return -ENOMEM;
> -
>   alloc = __request_region(res, start, size, dev_name(dev), 0);
> - if (!alloc) {
> - /*
> -  * If this was an empty set of ranges nothing else
> -  * will release @ranges, so do it now.
> -  */
> - if (!dev_dax->nr_range) {
> - kfree(ranges);
> - ranges = NULL;
> - }
> - dev_dax->ranges = ranges;
> + if (!alloc)
>   return -ENOMEM;
> +
> + ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
> + * (dev_dax->nr_range + 1), GFP_KERNEL);
> + if (!ranges) {
> + rc = -ENOMEM;
> + goto err;

Hi, Dan Williams:
In fact, after adding the new helper dev_dax_trim_range(), we can call
__release_region() directly and return the error code here, replacing the goto.

>   }
>  
>   for (i = 0; i < dev_dax->nr_range; i++)
> @@ -808,11 +800,14 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
>   dev_dbg(dev, "delete range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
>   &range->start, &range->end);
>   dev_dax->nr_range--;
> - __release_region(res, alloc->start, resource_size(alloc));
> - return rc;
> + goto err;
>   }
>  
>   return 0;
> +
> +err:
> + __release_region(res, alloc->start, resource_size(alloc));
> + return rc;
>  }
>  
>  static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource 
> *res, resource_size_t size)
> 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 1/1] device-dax: avoid an unnecessary check in alloc_dev_dax_range()

2020-12-17 Thread Leizhen (ThunderTown)



On 2020/12/18 11:10, Dan Williams wrote:
> On Fri, Nov 20, 2020 at 1:23 AM Zhen Lei  wrote:
>>
>> Swap the calling sequence of krealloc() and __request_region(), call the
>> latter first. In this way, the value of dev_dax->nr_range does not need to
>> be considered when __request_region() failed.
> 
> This looks ok, but I think I want to see another cleanup go in first
> before this to add a helper for trimming the last range off the set of
> ranges:
> 
> static void dev_dax_trim_range(struct dev_dax *dev_dax)
> {
> int i = dev_dax->nr_range - 1;
> struct range *range = &dev_dax->ranges[i].range;
> struct dax_region *dax_region = dev_dax->region;
> 
> dev_dbg(dev, "delete range[%d]: %#llx:%#llx\n", i,
> (unsigned long long)range->start,
> (unsigned long long)range->end);
> 
> __release_region(&dax_region->res, range->start, range_len(range));
> if (--dev_dax->nr_range == 0) {
> kfree(dev_dax->ranges);
> dev_dax->ranges = NULL;
> }
> }
> 
> Care to do a lead in patch with that cleanup, then do this one?

I don't mind! You can add the above helper first. After that, I'll update
and resend this patch.
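
For reference, a rough sketch (untested) of how alloc_dev_dax_range()'s
error path could look once that helper exists; the krealloc() failure then
simply releases the region it just requested:

	alloc = __request_region(res, start, size, dev_name(dev), 0);
	if (!alloc)
		return -ENOMEM;

	ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
			* (dev_dax->nr_range + 1), GFP_KERNEL);
	if (!ranges) {
		/* nothing was added to dev_dax->ranges yet, just undo the request */
		__release_region(res, alloc->start, resource_size(alloc));
		return -ENOMEM;
	}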

> 
> I think that might also clean up a memory leak report from Jane in
> addition to not needing the "goto" as well.
> 
> http://lore.kernel.org/r/c8a8a260-34c6-dbfc-1f19-25c23d01c...@oracle.com
> 
> .
> 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [NEEDS-REVIEW] [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-17 Thread Ira Weiny
On Thu, Dec 17, 2020 at 12:41:50PM -0800, Dave Hansen wrote:
> On 11/6/20 3:29 PM, ira.we...@intel.com wrote:
> >  void disable_TSC(void)
> > @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
> >  
> > if ((tifp ^ tifn) & _TIF_SLD)
> > switch_to_sld(tifn);
> > +
> > +   pks_sched_in();
> >  }
> 
> Does the selftest for this ever actually schedule()?

At this point I'm not sure.  This code has been in since the beginning, so
it's seen a lot of soak time.

> 
> I see it talking about context switching, but I don't immediately see
> how it would.

We were trying to force parent and child to run on the same CPU.  I suspect
something is wrong in the timing of that test.

Ira
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-17 Thread Ira Weiny
On Thu, Dec 17, 2020 at 03:50:55PM +0100, Thomas Gleixner wrote:
> On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -43,6 +43,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include "process.h"
> >  
> > @@ -187,6 +188,27 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
> > return ret;
> >  }
> >  
> > +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > +DECLARE_PER_CPU(u32, pkrs_cache);
> > +static inline void pks_init_task(struct task_struct *tsk)
> 
> First of all. I asked several times now not to glue stuff onto a
> function without a newline inbetween. It's unreadable.

Fixed.

> 
> But what's worse is that the declaration of pkrs_cache which is global
> is in a C file and not in a header. And pkrs_cache is not even used in
> this file. So what?

OK, this was just a complete rebase/refactor mess-up on my part.  The
globalness is not required until we need a global update of the pkrs, which
was not part of this series.

I've removed it from this patch, and cleaned it up in patch 6/10 as well as
in the global pkrs patch which you found in my git tree.

> 
> > +{
> > +   /* New tasks get the most restrictive PKRS value */
> > +   tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
> > +}
> > +static inline void pks_sched_in(void)
> 
> Newline between functions. It's fine for stubs, but not for a real 
> implementation.

Again my apologies.

Fixed.

> 
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index d1dfe743e79f..76a62419c446 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -231,3 +231,34 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> >  
> > return pk_reg;
> >  }
> > +
> > +DEFINE_PER_CPU(u32, pkrs_cache);
> 
> Again, why is this global?

In this patch it does not need to be.  I've changed it to static.

> 
> > +void write_pkrs(u32 new_pkrs)
> > +{
> > +   u32 *pkrs;
> > +
> > +   if (!static_cpu_has(X86_FEATURE_PKS))
> > +   return;
> > +
> > +   pkrs = get_cpu_ptr(&pkrs_cache);
> 
> So this is called from various places including schedule and also from
> the low level entry/exit code. Why do we need to have an extra
> preempt_disable/enable() there via get/put_cpu_ptr()?
> 
> Just because performance in those code paths does not matter?

Honestly I don't recall the full history at this point.  The
preempt_disable/enable() is required when this is called from
pks_update_protection(), AKA when a user is trying to update the protections
of their key.  What I do remember is that this was originally not preempt
safe and we had a comment to that effect in the early patches.[1]

Somewhere along the line the preempt discussion led us to make write_pkrs()
'self-contained' with the preemption protection here.  I just did not think
about any performance issues.  It is safe to call preempt_disable() from a
preempt disabled region, correct?  I seem to recall asking that and the
answer was 'yes'.

I will audit the calls again and adjust the preemption disable as needed.

[1] https://lore.kernel.org/lkml/20200717072056.73134-5-ira.we...@intel.com/#t
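
As a sketch of one possible shape (not tested; the __write_pkrs() name is
hypothetical): callers that already run with preemption disabled, such as
the context-switch path, could use a raw inner helper based on this_cpu
ops, keeping the preempt toggle only in the outer wrapper:

	/* Sketch: assumes pkrs_cache stays a per-cpu u32 as in this patch. */
	static void __write_pkrs(u32 new_pkrs)
	{
		if (!static_cpu_has(X86_FEATURE_PKS))
			return;

		/* preemption must already be disabled here */
		if (this_cpu_read(pkrs_cache) != new_pkrs) {
			this_cpu_write(pkrs_cache, new_pkrs);
			wrmsrl(MSR_IA32_PKRS, new_pkrs);
		}
	}

	void write_pkrs(u32 new_pkrs)
	{
		preempt_disable();
		__write_pkrs(new_pkrs);
		preempt_enable();
	}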

> 
> > +   if (*pkrs != new_pkrs) {
> > +   *pkrs = new_pkrs;
> > +   wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > +   }
> > +   put_cpu_ptr(pkrs);
> 
> Now back to the context switch:
> 
> > @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
> >
> >  if ((tifp ^ tifn) & _TIF_SLD)
> >  switch_to_sld(tifn);
> > +
> > +   pks_sched_in();
> >  }
> 
> How is this supposed to work? 
> 
> switch_to() {
>
>switch_to_extra() {
>   
>   if (unlikely(next_tif & _TIF_WORK_CTXSW_NEXT ||
>  prev_tif & _TIF_WORK_CTXSW_PREV))
>  __switch_to_xtra(prev, next);
> 
> I.e. __switch_to_xtra() is only invoked when the above condition is
> true, which is not guaranteed at all.

I did not know that.  I completely misunderstood what __switch_to_xtra()
meant.  I thought it was arch-specific 'extra' stuff so it seemed reasonable
to me.

Also, our test seemed to work.  I'm still investigating what may be wrong.

> 
> While I have to admit that I dropped the ball on the update for the
> entry patch, I'm not too sorry about it anymore when looking at this.
> 
> Are you still sure that this is ready for merging?

Nope...

Thanks for the review,
Ira

> 
> Thanks,
> 
> tglx
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH V3 10/10] x86/pks: Add PKS test code

2020-12-17 Thread Ira Weiny
On Thu, Dec 17, 2020 at 12:55:39PM -0800, Dave Hansen wrote:
> On 11/6/20 3:29 PM, ira.we...@intel.com wrote:
> > +   /* Arm for context switch test */
> > +   write(fd, "1", 1);
> > +
> > +   /* Context switch out... */
> > +   sleep(4);
> > +
> > +   /* Check msr restored */
> > +   write(fd, "2", 1);
> 
> These are always tricky.  What you ideally want here is:
> 
> 1. Switch away from this task to a non-PKS task, or
> 2. Switch from this task to a PKS-using task, but one which has a
>different PKS value

Or both...

> 
> then, switch back to this task and make sure PKS maintained its value.
> 
> *But*, there's no absolute guarantee that another task will run.  It
> would not be totally unreasonable to have the kernel just sit in a loop
> without context switching here if no other tasks can run.
> 
> The only way you *know* there is a context switch is by having two tasks
> bound to the same logical CPU and make sure they run one after another.

Ah...  We do that.

...
+   CPU_ZERO(&cpuset);
+   CPU_SET(0, &cpuset);
+   /* Two processes run on CPU 0 so that they go through context switch. */
+   sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
...

I think this should be ensuring that both the parent and the child are
running on CPU 0.  At least according to the man page they should be.


A child created via fork(2) inherits its parent's CPU affinity mask.


Perhaps a better method would be to synchronize the 2 threads more to ensure
that we are really running at the 'same time' and forcing the context switch.
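
A sketch (untested) of that kind of synchronization: bind both tasks to
CPU 0 and ping-pong over a pair of pipes, so every step is separated by a
forced context switch.  Names and iteration counts here are illustrative,
not taken from the actual selftest:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <unistd.h>

	static void bind_cpu0(void)
	{
		cpu_set_t cpuset;

		CPU_ZERO(&cpuset);
		CPU_SET(0, &cpuset);
		sched_setaffinity(0, sizeof(cpuset), &cpuset);
	}

	int main(void)
	{
		int ping[2], pong[2];
		char c = 0;
		int i;

		pipe(ping);
		pipe(pong);
		bind_cpu0();

		if (fork() == 0) {		/* child */
			for (i = 0; i < 100; i++) {
				read(ping[0], &c, 1);	/* wait for parent */
				write(pong[1], &c, 1);	/* hand CPU 0 back */
			}
			_exit(0);
		}
		for (i = 0; i < 100; i++) {
			write(ping[1], &c, 1);	/* wake child... */
			read(pong[0], &c, 1);	/* ...and block: forced switch */
		}
		return 0;
	}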

>  This just gets itself into a state where it *CAN* context switch and
> prays that one will happen.

I'm not sure what you mean by 'This'.  Do you mean that running on the same
CPU will sometimes not force a context switch?  Or do you mean that the
sleeps could be badly timed and the 2 threads could run one after the other
on the same CPU?  The latter is, AFAICT, the most likely case.

> 
> You can also run a bunch of these in parallel bound to a single CPU.
> That would also give you higher levels of assurance that *some* context
> switch happens at sleep().

I think more cycles is a good idea for sure.  But I'm more comfortable with
forcing the test to be more synchronized so that it is actually running in the
order we think/want it to be.

> 
> One critical thing with these tests is to sabotage the kernel and then
> run them and make *sure* they fail.  Basically, if you screw up, do they
> actually work to catch it?

I'll try and come up with a more stressful test.

Ira
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC PATCH v3 0/9] fsdax: introduce fs query to support reflink

2020-12-17 Thread Darrick J. Wong
On Fri, Dec 18, 2020 at 10:44:26AM +0800, Ruan Shiyang wrote:
> 
> 
> On 2020/12/17 4:55 AM, Jane Chu wrote:
> > Hi, Shiyang,
> > 
> > On 12/15/2020 4:14 AM, Shiyang Ruan wrote:
> > > The call trace is like this:
> > > memory_failure()
> > >   pgmap->ops->memory_failure()  => pmem_pgmap_memory_failure()
> > >    gendisk->fops->corrupted_range() => - pmem_corrupted_range()
> > >    - md_blk_corrupted_range()
> > >     sb->s_op->corrupted_range()    => xfs_fs_corrupted_range()
> > >  xfs_rmap_query_range()
> > >   xfs_corrupt_helper()
> > >    * corrupted on metadata
> > >    try to recover data, call xfs_force_shutdown()
> > >    * corrupted on file data
> > >    try to recover data, call mf_dax_mapping_kill_procs()
> > > 
> > > The fsdax & reflink support for XFS is not contained in this patchset.
> > > 
> > > (Rebased on v5.10)
> > 
> > So I tried the patchset with pmem error injection, the SIGBUS payload
> > does not look right -
> > 
> > ** SIGBUS(7): **
> > ** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **
> > 
> > I expect the payload looks like
> > 
> > ** si_addr(0x7f3672e0), si_lsb(0x15), si_code(0x4, BUS_MCEERR_AR) **
> 
> Thanks for testing.  I tested the SIGBUS by writing a program which calls
> madvise(..., MADV_HWPOISON) to inject a memory failure.  It just shows that
> the program is killed by SIGBUS; I cannot get any detail from it.  So,
> could you please show me the right way (test tools) to test it?

I'm assuming that Jane is using a program that calls sigaction to
install a SIGBUS handler, and dumps the entire siginfo_t structure
whenever it receives one...
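
Something along these lines would do (a minimal sketch; it assumes a Linux
glibc siginfo_t exposing si_addr_lsb, and printf in a signal handler is only
acceptable because this is a throwaway test):

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stdio.h>
	#include <stdlib.h>

	static void bus_handler(int sig, siginfo_t *si, void *ctx)
	{
		printf("** SIGBUS(%d): **\n", sig);
		printf("** si_addr(%p), si_lsb(0x%x), si_code(0x%x) **\n",
		       si->si_addr, (unsigned)si->si_addr_lsb,
		       (unsigned)si->si_code);
		exit(1);
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction = bus_handler,
			.sa_flags = SA_SIGINFO,
		};

		sigaction(SIGBUS, &sa, NULL);
		/* ... mmap the dax file, inject the error, touch the page ... */
		return 0;
	}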

--D

> 
> --
> Thanks,
> Ruan Shiyang.
> 
> > 
> > thanks,
> > -jane
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC PATCH] badblocks: Improvement badblocks_set() for handling multiple ranges

2020-12-17 Thread Dan Williams
[ add Neil, the original good guy who wrote badblocks ]


On Thu, Dec 3, 2020 at 9:16 AM Coly Li  wrote:
>
> Recently I received a bug report that current badblocks code does not
> properly handle multiple ranges. For example,
> badblocks_set(bb, 32, 1, true);
> badblocks_set(bb, 34, 1, true);
> badblocks_set(bb, 36, 1, true);
> badblocks_set(bb, 32, 12, true);
> Then indeed badblocks_show() reports,
> 32 3
> 36 1
> But the expected bad blocks table should be,
> 32 12
> Obviously only the first 2 ranges are merged and badblocks_set() returns
> and ignores the rest setting range.
>
> This behavior is improper, if the caller of badblocks_set() wants to set
> a range of blocks into bad blocks table, all of the blocks in the range
> should be handled even the previous part encountering failure.
>
> The desired way to set bad blocks range by badblocks_set() is,
> - Set as many as blocks in the setting range into bad blocks table.
> - Merge the bad blocks ranges and occupy as less as slots in the bad
>   blocks table.
> - Fast.
>
> Indeed the above proposal is complicated, especially with the following
> restrictions,
> - The setting bad blocks range can be ackknowledged or not acknowledged.

s/ackknowledged/acknowledged/

I'd run checkpatch --codespell for future versions...

> - The bad blocks table size is limited.
> - Memory allocation should be avoided.
>
> This patch is an initial effort to improve badblocks_set() for setting
> bad blocks range when it covers multiple already set bad ranges in the
> bad blocks table, and to do it as fast as possible.
>
> The basic idea of the patch is to categorize all possible bad blocks
> range setting combinations into much fewer, simpler and less special
> conditions. Inside badblocks_set() there is an implicit loop
> composed by jumping between labels 're_insert' and 'update_sectors'. No
> matter how large the setting bad blocks range is, in every loop just a
> minimized range from the head is handled by a pre-defined behavior from
> one of the categorized conditions. The logic is simple and code flow is
> manageable.
>
> This patch is unfinished yet; it only improves badblocks_set() and does not
> touch badblocks_clear() and badblocks_show() yet. I post it earlier because
> this patch will be large (more than 1000 lines of change), and I want more
> people to give me comments earlier before I go too far away.
>

I wonder if this isn't an indication that the base data structure should
be replaced... but I have not had a chance to devote deeper thought to
this.


> The code logic is tested as a user-space program; this patch passes
> compiling but is not tested in kernel mode yet. Right now it is only for
> RFC purposes. I will post a tested patch in future versions.
>
> Thank you in advance for any review or comments on this patch.
>
> Signed-off-by: Coly Li 
> ---
>  block/badblocks.c | 1041 ++---
>  include/linux/badblocks.h |   33 ++
>  2 files changed, 881 insertions(+), 193 deletions(-)
>
> diff --git a/block/badblocks.c b/block/badblocks.c
> index d39056630d9c..04ccae95777d 100644
> --- a/block/badblocks.c
> +++ b/block/badblocks.c
> @@ -5,6 +5,8 @@
>   * - Heavily based on MD badblocks code from Neil Brown
>   *
>   * Copyright (c) 2015, Intel Corporation.
> + *
> + * Improvement for handling multiple ranges by Coly Li 
>   */
>
>  #include 
> @@ -16,114 +18,612 @@
>  #include 
>  #include 
>
> -/**
> - * badblocks_check() - check a given range for bad sectors
> - * @bb: the badblocks structure that holds all badblock information
> - * @s: sector (start) at which to check for badblocks
> - * @sectors:   number of sectors to check for badblocks
> - * @first_bad: pointer to store location of the first badblock
> - * @bad_sectors: pointer to store number of badblocks after @first_bad
> +/*
> + * The purpose of badblocks set/clear is to manage bad blocks ranges
> + * which are identified by LBA addresses.
>   *
> - * We can record which blocks on each device are 'bad' and so just
> - * fail those blocks, or that stripe, rather than the whole device.
> - * Entries in the bad-block table are 64bits wide.  This comprises:
> - * Length of bad-range, in sectors: 0-511 for lengths 1-512
> - * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
> - *  A 'shift' can be set so that larger blocks are tracked and
> - *  consequently larger devices can be covered.
> - * 'Acknowledged' flag - 1 bit. - the most significant bit.
> + * When the caller of badblocks_set() wants to set a range of bad blocks, the
> + * setting range can be acked or unacked. And the setting range may merge,
> + * overwrite, or skip the overlapped already set range, depending on whether
> + * they are overlapped or adjacent, and the acknowledgment type of the ranges.
> + * It can be more complicated when the setting range covers multiple already
> + * set bad block ranges, with 

Re: [PATCH 1/1] device-dax: avoid an unnecessary check in alloc_dev_dax_range()

2020-12-17 Thread Dan Williams
On Fri, Nov 20, 2020 at 1:23 AM Zhen Lei  wrote:
>
> Swap the calling sequence of krealloc() and __request_region(), call the
> latter first. In this way, the value of dev_dax->nr_range does not need to
> be considered when __request_region() failed.

This looks ok, but I think I want to see another cleanup go in first
before this to add a helper for trimming the last range off the set of
ranges:

static void dev_dax_trim_range(struct dev_dax *dev_dax)
{
int i = dev_dax->nr_range - 1;
struct range *range = &dev_dax->ranges[i].range;
struct dax_region *dax_region = dev_dax->region;

dev_dbg(dev, "delete range[%d]: %#llx:%#llx\n", i,
(unsigned long long)range->start,
(unsigned long long)range->end);

__release_region(&dax_region->res, range->start, range_len(range));
if (--dev_dax->nr_range == 0) {
kfree(dev_dax->ranges);
dev_dax->ranges = NULL;
}
}

Care to do a lead in patch with that cleanup, then do this one?

I think that might also clean up a memory leak report from Jane in
addition to not needing the "goto" as well.

http://lore.kernel.org/r/c8a8a260-34c6-dbfc-1f19-25c23d01c...@oracle.com
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC PATCH v3 0/9] fsdax: introduce fs query to support reflink

2020-12-17 Thread Ruan Shiyang



On 2020/12/17 4:55 AM, Jane Chu wrote:

Hi, Shiyang,

On 12/15/2020 4:14 AM, Shiyang Ruan wrote:

The call trace is like this:
memory_failure()
  pgmap->ops->memory_failure()  => pmem_pgmap_memory_failure()
   gendisk->fops->corrupted_range() => - pmem_corrupted_range()
   - md_blk_corrupted_range()
    sb->s_op->corrupted_range()    => xfs_fs_corrupted_range()
 xfs_rmap_query_range()
  xfs_corrupt_helper()
   * corrupted on metadata
   try to recover data, call xfs_force_shutdown()
   * corrupted on file data
   try to recover data, call mf_dax_mapping_kill_procs()

The fsdax & reflink support for XFS is not contained in this patchset.

(Rebased on v5.10)


So I tried the patchset with pmem error injection, the SIGBUS payload
does not look right -

** SIGBUS(7): **
** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **

I expect the payload looks like

** si_addr(0x7f3672e0), si_lsb(0x15), si_code(0x4, BUS_MCEERR_AR) **


Thanks for testing.  I tested the SIGBUS by writing a program which calls
madvise(..., MADV_HWPOISON) to inject a memory failure.  It just shows that
the program is killed by SIGBUS; I cannot get any detail from it.  So,
could you please show me the right way (test tools) to test it?



--
Thanks,
Ruan Shiyang.



thanks,
-jane

___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC PATCH v3 9/9] xfs: Implement ->corrupted_range() for XFS

2020-12-17 Thread Ruan Shiyang



On 2020/12/16 4:40 AM, Darrick J. Wong wrote:

On Tue, Dec 15, 2020 at 08:14:14PM +0800, Shiyang Ruan wrote:

This function is used to handle errors which may cause data loss in the
filesystem, such as a memory failure in fsdax mode.

In XFS, it requires the "rmapbt" feature in order to query for files or
metadata which are associated with the corrupted data.  Then we could call
fs recovery functions to try to repair the corrupted data. (Not implemented
in this patchset.)

After that, the memory failure also needs to notify the processes who
are using those files.

Only the data device is supported.  The realtime device is not supported for now.

Signed-off-by: Shiyang Ruan 
---
  fs/xfs/xfs_fsops.c | 10 +
  fs/xfs/xfs_mount.h |  2 +
  fs/xfs/xfs_super.c | 93 ++
  3 files changed, 105 insertions(+)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index ef1d5bb88b93..0ec1b44bfe88 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -501,6 +501,16 @@ xfs_do_force_shutdown(
  "Corruption of in-memory data detected.  Shutting down filesystem");
if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
xfs_stack_trace();
+   } else if (flags & SHUTDOWN_CORRUPT_META) {
+   xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_CORRUPT,
+"Corruption of on-disk metadata detected.  Shutting down filesystem");
+   if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
+   xfs_stack_trace();
+   } else if (flags & SHUTDOWN_CORRUPT_DATA) {
+   xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_CORRUPT,
+"Corruption of on-disk file data detected.  Shutting down filesystem");
+   if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
+   xfs_stack_trace();
} else if (logerror) {
xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_LOGERROR,
"Log I/O Error Detected. Shutting down filesystem");
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index dfa429b77ee2..e36c07553486 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -274,6 +274,8 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
  #define SHUTDOWN_LOG_IO_ERROR 0x0002  /* write attempt to the log failed */
  #define SHUTDOWN_FORCE_UMOUNT 0x0004  /* shutdown from a forced unmount */
  #define SHUTDOWN_CORRUPT_INCORE   0x0008  /* corrupt in-memory data structures */
+#define SHUTDOWN_CORRUPT_META  0x0010  /* corrupt metadata on device */
+#define SHUTDOWN_CORRUPT_DATA  0x0020  /* corrupt file data on device */


This symbol isn't used anywhere.  I don't know why we'd shut down the fs
for data loss, as we don't do that anywhere else in xfs.


I prepared this flag for the later use if possible.  But since it seems 
unnecessary, I will remove it in the next version.




  
  /*

   * Flags for xfs_mountfs
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e3e229e52512..30202de7e89d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -35,6 +35,11 @@
  #include "xfs_refcount_item.h"
  #include "xfs_bmap_item.h"
  #include "xfs_reflink.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_rtalloc.h"
+#include "xfs_bit.h"
  
  #include 

  #include 
@@ -1103,6 +1108,93 @@ xfs_fs_free_cached_objects(
return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
  }
  
+static int
+xfs_corrupt_helper(
+   struct xfs_btree_cur    *cur,
+   struct xfs_rmap_irec    *rec,
+   void                    *data)
+{
+   struct xfs_inode        *ip;
+   int                     rc = 0;


Note: we usually use the name "error", not "rc".


OK.




+   int *flags = data;
+
+   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner)) {


There are a few more things to check here to detect if metadata has been
lost.  The first is that any loss in the extended attribute information
is considered filesystem metadata; and the second is that loss of an
extent btree block is also metadata.

IOWs, this check should be:

if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
// TODO check and try to fix metadata
return -EFSCORRUPTED;
}


Thanks for pointing out.




+   // TODO check and try to fix metadata
+   rc = -EFSCORRUPTED;
+   } else {
+   /*
+* Get files that incore, filter out others that are not in use.
+*/
+   rc = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner,
+ XFS_IGET_INCORE, 0, &ip);
+   if (rc || !ip)
+   return rc;
+   if (!VFS_I(ip)->i_mapping)
+   goto out;
+
+   if (IS_DAX(VFS_I(ip)))
+   rc = mf_dax_mapping_kill_procs(VFS_I(ip)->i_mapping,
+   

[PATCH daxctl v2 3/5] libdaxctl: add daxctl_dev_set_mapping()

2020-12-17 Thread Joao Martins
This API adds the ability to manually pick a range within the region
device.  Such a routine allows an admin to restore the mappings of a
device after kexec.  This is especially useful for dynamic hmem devdax,
which does not persist range allocations through e.g. a namespace
label storage area.  It also allows a userspace application to pick
its own ranges, should it want to avoid relying on the kernel's policy.
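
For illustration, restoring a device's mappings with this API could look
roughly like the sketch below (untested; the saved_mapping array stands in
for whatever the caller persisted before kexec and is not part of the
library):

	/* Sketch: replay previously saved mappings onto a device. */
	struct saved_mapping {
		unsigned long long start, end;
	};

	static int restore_mappings(struct daxctl_dev *dev,
			struct saved_mapping *saved, int nr_saved)
	{
		int i, rc;

		for (i = 0; i < nr_saved; i++) {
			rc = daxctl_dev_set_mapping(dev, saved[i].start,
					saved[i].end);
			if (rc)
				return rc;
		}
		return 0;
	}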

Signed-off-by: Joao Martins 
---
 daxctl/lib/libdaxctl.c   | 27 +++
 daxctl/lib/libdaxctl.sym |  1 +
 daxctl/libdaxctl.h   |  2 ++
 3 files changed, 30 insertions(+)

diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
index 3781e9ed27d5..59fe84fee409 100644
--- a/daxctl/lib/libdaxctl.c
+++ b/daxctl/lib/libdaxctl.c
@@ -1123,6 +1123,33 @@ DAXCTL_EXPORT int daxctl_dev_set_align(struct daxctl_dev *dev, unsigned long align)
return 0;
 }
 
+DAXCTL_EXPORT int daxctl_dev_set_mapping(struct daxctl_dev *dev,
+   unsigned long long start,
+   unsigned long long end)
+{
+   struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
+   unsigned long long size = end - start + 1;
+   char buf[SYSFS_ATTR_SIZE];
+   char *path = dev->dev_buf;
+   int len = dev->buf_len;
+
+   if (snprintf(path, len, "%s/mapping", dev->dev_path) >= len) {
+   err(ctx, "%s: buffer too small!\n",
+   daxctl_dev_get_devname(dev));
+   return -ENXIO;
+   }
+
+   sprintf(buf, "%#llx-%#llx\n", start, end);
+   if (sysfs_write_attr(ctx, path, buf) < 0) {
+   err(ctx, "%s: failed to set mapping\n",
+   daxctl_dev_get_devname(dev));
+   return -ENXIO;
+   }
+   dev->size += size;
+
+   return 0;
+}
+
 DAXCTL_EXPORT int daxctl_dev_get_target_node(struct daxctl_dev *dev)
 {
return dev->target_node;
diff --git a/daxctl/lib/libdaxctl.sym b/daxctl/lib/libdaxctl.sym
index 08362b683678..a4e16848494b 100644
--- a/daxctl/lib/libdaxctl.sym
+++ b/daxctl/lib/libdaxctl.sym
@@ -89,4 +89,5 @@ global:
daxctl_mapping_get_end;
daxctl_mapping_get_offset;
daxctl_mapping_get_size;
+   daxctl_dev_set_mapping;
 } LIBDAXCTL_7;
diff --git a/daxctl/libdaxctl.h b/daxctl/libdaxctl.h
index f94a72fed85b..09439c16d6df 100644
--- a/daxctl/libdaxctl.h
+++ b/daxctl/libdaxctl.h
@@ -73,6 +73,8 @@ unsigned long long daxctl_dev_get_size(struct daxctl_dev 
*dev);
 int daxctl_dev_set_size(struct daxctl_dev *dev, unsigned long long size);
 unsigned long daxctl_dev_get_align(struct daxctl_dev *dev);
 int daxctl_dev_set_align(struct daxctl_dev *dev, unsigned long align);
+int daxctl_dev_set_mapping(struct daxctl_dev *dev, unsigned long long start,
+   unsigned long long end);
 struct daxctl_ctx *daxctl_dev_get_ctx(struct daxctl_dev *dev);
 int daxctl_dev_is_enabled(struct daxctl_dev *dev);
 int daxctl_dev_disable(struct daxctl_dev *dev);
-- 
1.8.3.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH daxctl v2 5/5] daxctl/test: add a test for daxctl-create with input file

2020-12-17 Thread Joao Martins
The test creates a multi-range device (4 mappings) using the same setup
as one of the existing tests. Afterwards we validate that the
size/nr-mappings are the same as in the original test.

Signed-off-by: Joao Martins 
---
 test/daxctl-create.sh | 31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/test/daxctl-create.sh b/test/daxctl-create.sh
index 41d5ba5888f7..e1d733916851 100755
--- a/test/daxctl-create.sh
+++ b/test/daxctl-create.sh
@@ -199,6 +199,7 @@ daxctl_test_multi()
 daxctl_test_multi_reconfig()
 {
local ncfgs=$1
+   local dump=$2
local daxdev
 
size=$((available / ncfgs))
@@ -226,6 +227,10 @@ daxctl_test_multi_reconfig()
test "$(daxctl_get_nr_mappings "$testdev")" -eq $((ncfgs / 2))
test "$(daxctl_get_nr_mappings "$daxdev_1")" -eq $((ncfgs / 2))
 
+   if [[ $dump ]]; then
+   "$DAXCTL" list -M -d "$daxdev_1" | jq -er '.[]' > $dump
+   fi
+
fail_if_available
 
"$DAXCTL" disable-device "$daxdev_1" && "$DAXCTL" destroy-device 
"$daxdev_1"
@@ -328,7 +333,7 @@ daxctl_test3()
 # pick at the end of the region
 daxctl_test4()
 {
-   daxctl_test_multi_reconfig 8
+   daxctl_test_multi_reconfig 8 ""
clear_dev
test_pass
 }
@@ -371,6 +376,29 @@ daxctl_test6()
test_pass
 }
 
+# Test 7: input device
+# Successfully creates a device with an input file from the multi-range
+# device test, and checks that we have the same number of mappings/size.
+daxctl_test7()
+{
+   daxctl_test_multi_reconfig 8 "input.json"
+
+   # The parameter should parse the region_id from the chardev entry
+   # therefore using the same region_id as test4
+   daxdev_1=$("$DAXCTL" create-device --input input.json | jq -er '.[].chardev')
+
+   # Validate if it's the same mappings as done by test4
+   # It also validates the size computed from the mappings
+   # A zero value means it failed, and four mappings is what's
+   # created by daxctl_test4
+   test "$(daxctl_get_nr_mappings "$daxdev_1")" -eq 4
+
+   "$DAXCTL" disable-device "$daxdev_1" && "$DAXCTL" destroy-device 
"$daxdev_1"
+
+   clear_dev
+   test_pass
+}
+
 find_testdev
 rc=1
 setup_dev
@@ -381,5 +409,6 @@ daxctl_test3
 daxctl_test4
 daxctl_test5
 daxctl_test6
+daxctl_test7
 reset_dev
 exit 0
-- 
1.8.3.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH daxctl v2 4/5] daxctl: allow creating devices from input json

2020-12-17 Thread Joao Martins
Add an option, namely --input, which takes a JSON file path as its
parameter. The JSON file contains the data usually returned by:

$ daxctl list -d dax0.1 | jq -er '.[]' > device.json
{
  "chardev":"dax0.1",
  "size":34359738368,
  "target_node":0,
  "align":1073741824,
  "mode":"devdax",
  "mappings":[
{
  "page_offset":4194304,
  "start":25769803776,
  "end":42949672959,
  "size":17179869184
},
{
  "page_offset":0,
  "start":8589934592,
  "end":25769803775,
  "size":17179869184
}
  ]
}

The input values in the mappings JSON are decimal for now. A device can
then be re-created by specifying this same data, e.g.

$ daxctl create-device -u --input device.json
{
  "chardev":"dax0.1",
  "size":"32.00 GiB (34.36 GB)",
  "target_node":0,
  "align":"1024.00 MiB (1073.74 MB)",
  "mode":"devdax",
}

$ daxctl list -d dax0.1
{
  "chardev":"dax0.1",
  "size":34359738368,
  "target_node":0,
  "align":1073741824,
  "mode":"devdax",
  "mappings":[
{
  "page_offset":4194304,
  "start":25769803776,
  "end":42949672959,
  "size":17179869184
},
{
  "page_offset":0,
  "start":8589934592,
  "end":25769803775,
  "size":17179869184
}
  ]
}

created 1 device

This means we can restore/recreate previously established mappings.

Signed-off-by: Joao Martins 
---
 Documentation/daxctl/daxctl-create-device.txt |  13 +++
 daxctl/device.c   | 128 +-
 2 files changed, 136 insertions(+), 5 deletions(-)

diff --git a/Documentation/daxctl/daxctl-create-device.txt b/Documentation/daxctl/daxctl-create-device.txt
index 7f64719d16f2..05f4dbd9d61c 100644
--- a/Documentation/daxctl/daxctl-create-device.txt
+++ b/Documentation/daxctl/daxctl-create-device.txt
@@ -90,6 +90,19 @@ include::region-option.txt[]
to 2M. Note that "devdax" mode enforces all mappings to be
aligned to this value, i.e. it fails unaligned mapping attempts.
 
+--input::
+   Applications that want to select ranges assigned to a device-dax
+   instance, or wanting to establish previously created devices, can
+   pass an input JSON file. The file option lets a user pass a JSON
+   object similar to the one listed with "daxctl list".
+
+   The device name is not re-created, but if a "chardev" is passed in
+   the JSON file, it will use that to get the region id.
+
+   Note that the JSON content in the file cannot be an array of
+   JSON objects but rather a single JSON object i.e. without the
+   array enclosing brackets.
+
 include::human-option.txt[]
 
 include::verbose-option.txt[]
diff --git a/daxctl/device.c b/daxctl/device.c
index 3c2d4e3d8b48..fe4291199312 100644
--- a/daxctl/device.c
+++ b/daxctl/device.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -23,6 +24,7 @@ static struct {
const char *region;
const char *size;
const char *align;
+   const char *input;
bool no_online;
bool no_movable;
bool force;
@@ -36,10 +38,16 @@ enum dev_mode {
DAXCTL_DEV_MODE_RAM,
 };
 
+struct mapping {
+   unsigned long long start, end, pgoff;
+};
+
 static enum dev_mode reconfig_mode = DAXCTL_DEV_MODE_UNKNOWN;
 static long long align = -1;
 static long long size = -1;
 static unsigned long flags;
+static struct mapping *maps = NULL;
+static long long nmaps = -1;
 
 enum memory_zone {
MEM_ZONE_MOVABLE,
@@ -71,7 +79,8 @@ OPT_BOOLEAN('f', "force", , \
 
 #define CREATE_OPTIONS() \
 OPT_STRING('s', "size", , "size", "size to switch the device to"), \
-OPT_STRING('a', "align", , "align", "alignment to switch the 
device to")
+OPT_STRING('a', "align", , "align", "alignment to switch the 
device to"), \
+OPT_STRING('\0', "input", , "input", "input device JSON file")
 
 #define DESTROY_OPTIONS() \
 OPT_BOOLEAN('f', "force", , \
@@ -124,6 +133,94 @@ static const struct option destroy_options[] = {
OPT_END(),
 };
 
+static int sort_mappings(const void *a, const void *b)
+{
+   json_object **jsoa, **jsob;
+   struct json_object *va, *vb;
+   unsigned long long pga, pgb;
+
+   jsoa = (json_object **)a;
+   jsob = (json_object **)b;
+   if (!*jsoa && !*jsob)
+   return 0;
+
+   if (!json_object_object_get_ex(*jsoa, "page_offset", &va) ||
+   !json_object_object_get_ex(*jsob, "page_offset", &vb))
+   return 0;
+
+   pga = json_object_get_int64(va);
+   pgb = json_object_get_int64(vb);
+

[PATCH daxctl v2 2/5] daxctl: include mappings when listing

2020-12-17 Thread Joao Martins
Iterate over the device mappings and print @page_offset, @start, @end
and a computed size, but only if the user passes -M|--mappings to
daxctl list, as the output can get verbose with a lot of mapping
entries.

Signed-off-by: Joao Martins 
---
 Documentation/daxctl/daxctl-list.txt |  4 +++
 daxctl/list.c|  4 +++
 util/json.c  | 57 +++-
 util/json.h  |  4 +++
 4 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/Documentation/daxctl/daxctl-list.txt b/Documentation/daxctl/daxctl-list.txt
index cb82c3cc6fb2..5a6a98e73849 100644
--- a/Documentation/daxctl/daxctl-list.txt
+++ b/Documentation/daxctl/daxctl-list.txt
@@ -67,6 +67,10 @@ OPTIONS
 --devices::
Include device-dax instance info in the listing (default)
 
+-M::
+--mappings::
+   Include device-dax instance mappings info in the listing
+
 -R::
 --regions::
Include region info in the listing
diff --git a/daxctl/list.c b/daxctl/list.c
index 6c6251b4de37..6860a460e4c0 100644
--- a/daxctl/list.c
+++ b/daxctl/list.c
@@ -25,6 +25,7 @@
 static struct {
bool devs;
bool regions;
+   bool mappings;
bool idle;
bool human;
 } list;
@@ -35,6 +36,8 @@ static unsigned long listopts_to_flags(void)
 
if (list.devs)
flags |= UTIL_JSON_DAX_DEVS;
+   if (list.mappings)
+   flags |= UTIL_JSON_DAX_MAPPINGS;
if (list.idle)
flags |= UTIL_JSON_IDLE;
if (list.human)
@@ -70,6 +73,7 @@ int cmd_list(int argc, const char **argv, struct daxctl_ctx *ctx)
"filter by dax device instance name"),
OPT_BOOLEAN('D', "devices", &list.devs, "include dax device info"),
OPT_BOOLEAN('R', "regions", &list.regions, "include dax region info"),
+   OPT_BOOLEAN('M', "mappings", &list.mappings, "include dax mappings info"),
OPT_BOOLEAN('i', "idle", &list.idle, "include idle devices"),
OPT_BOOLEAN('u', "human", &list.human,
"use human friendly number formats "),
diff --git a/util/json.c b/util/json.c
index 357dff20d6be..dcc927294e4f 100644
--- a/util/json.c
+++ b/util/json.c
@@ -454,7 +454,8 @@ struct json_object *util_daxctl_dev_to_json(struct daxctl_dev *dev,
 {
struct daxctl_memory *mem = daxctl_dev_get_memory(dev);
const char *devname = daxctl_dev_get_devname(dev);
-   struct json_object *jdev, *jobj;
+   struct json_object *jdev, *jobj, *jmappings = NULL;
+   struct daxctl_mapping *mapping = NULL;
int node, movable, align;
 
jdev = json_object_new_object();
@@ -508,6 +509,25 @@ struct json_object *util_daxctl_dev_to_json(struct daxctl_dev *dev,
json_object_object_add(jdev, "state", jobj);
}
 
+   if (!(flags & UTIL_JSON_DAX_MAPPINGS))
+   return jdev;
+
+   daxctl_mapping_foreach(dev, mapping) {
+   struct json_object *jmapping;
+
+   if (!jmappings) {
+   jmappings = json_object_new_array();
+   if (!jmappings)
+   continue;
+
+   json_object_object_add(jdev, "mappings", jmappings);
+   }
+
+   jmapping = util_daxctl_mapping_to_json(mapping, flags);
+   if (!jmapping)
+   continue;
+   json_object_array_add(jmappings, jmapping);
+   }
return jdev;
 }
 
@@ -1357,6 +1377,41 @@ struct json_object *util_mapping_to_json(struct ndctl_mapping *mapping,
return NULL;
 }
 
+struct json_object *util_daxctl_mapping_to_json(struct daxctl_mapping *mapping,
+   unsigned long flags)
+{
+   struct json_object *jmapping = json_object_new_object();
+   struct json_object *jobj;
+
+   if (!jmapping)
+   return NULL;
+
+   jobj = util_json_object_hex(daxctl_mapping_get_offset(mapping), flags);
+   if (!jobj)
+   goto err;
+   json_object_object_add(jmapping, "page_offset", jobj);
+
+   jobj = util_json_object_hex(daxctl_mapping_get_start(mapping), flags);
+   if (!jobj)
+   goto err;
+   json_object_object_add(jmapping, "start", jobj);
+
+   jobj = util_json_object_hex(daxctl_mapping_get_end(mapping), flags);
+   if (!jobj)
+   goto err;
+   json_object_object_add(jmapping, "end", jobj);
+
+   jobj = util_json_object_size(daxctl_mapping_get_size(mapping), flags);
+   if (!jobj)
+   goto err;
+   json_object_object_add(jmapping, "size", jobj);
+
+   return jmapping;
+ err:
+   json_object_put(jmapping);
+   return NULL;
+}
+
 struct json_object *util_badblock_rec_to_json(u64 block, u64 count,
unsigned long flags)
 {
diff --git a/util/json.h b/util/json.h
index 39a33789bac9..e26875a5ecd8 100644
--- a/util/json.h
+++ b/util/json.h
@@ -15,6 +15,7 @@
 

[PATCH daxctl v2 1/5] libdaxctl: add mapping iterator APIs

2020-12-17 Thread Joao Martins
Add the following APIs to be able to iterate over a dynamic device-dax
mapping list, as well as to fetch each of the mapping attributes.
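
For illustration, walking a device's mappings with these APIs could look
like the sketch below (assuming daxctl_mapping_foreach() is the usual
first/next convenience macro, as used elsewhere in this series):

	struct daxctl_mapping *mapping;

	daxctl_mapping_foreach(dev, mapping)
		printf("%#llx-%#llx (pgoff %#llx, %llu bytes)\n",
				daxctl_mapping_get_start(mapping),
				daxctl_mapping_get_end(mapping),
				daxctl_mapping_get_offset(mapping),
				daxctl_mapping_get_size(mapping));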

Signed-off-by: Joao Martins 
---
 daxctl/lib/libdaxctl-private.h |  8 
 daxctl/lib/libdaxctl.c | 91 +-
 daxctl/lib/libdaxctl.sym   |  6 +++
 daxctl/libdaxctl.h | 12 ++
 4 files changed, 116 insertions(+), 1 deletion(-)

diff --git a/daxctl/lib/libdaxctl-private.h b/daxctl/lib/libdaxctl-private.h
index b307a8bc9438..6d05aefbeda0 100644
--- a/daxctl/lib/libdaxctl-private.h
+++ b/daxctl/lib/libdaxctl-private.h
@@ -91,6 +91,12 @@ struct daxctl_region {
struct list_head devices;
 };
 
+struct daxctl_mapping {
+   struct daxctl_dev *dev;
+   unsigned long long pgoff, start, end;
+   struct list_node list;
+};
+
 struct daxctl_dev {
int id, major, minor;
void *dev_buf;
@@ -104,6 +110,8 @@ struct daxctl_dev {
struct daxctl_region *region;
struct daxctl_memory *mem;
int target_node;
+   int num_mappings;
+   struct list_head mappings;
 };
 
 struct daxctl_memory {
diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
index 39871427c799..3781e9ed27d5 100644
--- a/daxctl/lib/libdaxctl.c
+++ b/daxctl/lib/libdaxctl.c
@@ -526,7 +526,8 @@ static void *add_dax_dev(void *parent, int id, const char *daxdev_base)
free(path);
return dev_dup;
}
-
+   dev->num_mappings = -1;
+   list_head_init(>mappings);
list_add(>devices, >list);
free(path);
return dev;
@@ -1150,6 +1151,94 @@ DAXCTL_EXPORT unsigned long daxctl_memory_get_block_size(struct daxctl_memory *mem)
return mem->block_size;
 }
 
+static void mappings_init(struct daxctl_dev *dev)
+{
+   struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
+   char buf[SYSFS_ATTR_SIZE];
+   char *path = dev->dev_buf;
+   int i;
+
+   if (dev->num_mappings != -1)
+   return;
+
+   dev->num_mappings = 0;
+   for (;;) {
+   struct daxctl_mapping *mapping;
+   unsigned long long pgoff, start, end;
+
+   i = dev->num_mappings;
+   mapping = calloc(1, sizeof(*mapping));
+   if (!mapping) {
+   err(ctx, "%s: mapping%u allocation failure\n",
+   daxctl_dev_get_devname(dev), i);
+   continue;
+   }
+
+   sprintf(path, "%s/mapping%d/start", dev->dev_path, i);
+   if (sysfs_read_attr(ctx, path, buf) < 0) {
+   free(mapping);
+   break;
+   }
+   start = strtoull(buf, NULL, 0);
+
+   sprintf(path, "%s/mapping%d/end", dev->dev_path, i);
+   if (sysfs_read_attr(ctx, path, buf) < 0) {
+   free(mapping);
+   break;
+   }
+   end = strtoull(buf, NULL, 0);
+
+   sprintf(path, "%s/mapping%d/page_offset", dev->dev_path, i);
+   if (sysfs_read_attr(ctx, path, buf) < 0) {
+   free(mapping);
+   break;
+   }
+   pgoff = strtoull(buf, NULL, 0);
+
+   mapping->dev = dev;
+   mapping->start = start;
+   mapping->end = end;
+   mapping->pgoff = pgoff;
+
+   dev->num_mappings++;
+   list_add(>mappings, >list);
+   }
+}
+
+DAXCTL_EXPORT struct daxctl_mapping *daxctl_mapping_get_first(struct daxctl_dev *dev)
+{
+   mappings_init(dev);
+
+   return list_top(>mappings, struct daxctl_mapping, list);
+}
+
+DAXCTL_EXPORT struct daxctl_mapping *daxctl_mapping_get_next(struct daxctl_mapping *mapping)
+{
+   struct daxctl_dev *dev = mapping->dev;
+
+   return list_next(>mappings, mapping, list);
+}
+
+DAXCTL_EXPORT unsigned long long daxctl_mapping_get_start(struct daxctl_mapping *mapping)
+{
+   return mapping->start;
+}
+
+DAXCTL_EXPORT unsigned long long daxctl_mapping_get_end(struct daxctl_mapping *mapping)
+{
+   return mapping->end;
+}
+
+DAXCTL_EXPORT unsigned long long daxctl_mapping_get_offset(struct daxctl_mapping *mapping)
+{
+   return mapping->pgoff;
+}
+
+DAXCTL_EXPORT unsigned long long daxctl_mapping_get_size(struct daxctl_mapping *mapping)
+{
+   return mapping->end - mapping->start + 1;
+}
+
 static int memblock_is_online(struct daxctl_memory *mem, char *memblock)
 {
struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
diff --git a/daxctl/lib/libdaxctl.sym b/daxctl/lib/libdaxctl.sym
index c3d08179c9fd..08362b683678 100644
--- a/daxctl/lib/libdaxctl.sym
+++ b/daxctl/lib/libdaxctl.sym
@@ -83,4 +83,10 @@ global:
daxctl_region_destroy_dev;
daxctl_dev_get_align;
daxctl_dev_set_align;
+   daxctl_mapping_get_first;
+   daxctl_mapping_get_next;
+   

[PATCH daxctl v2 0/5] daxctl: range mapping allocation

2020-12-17 Thread Joao Martins
Hey,

This series adds support for:

 1) Listing mappings when passing -M to 'daxctl list'. These are omitted
 by default.

 2) Iteration APIs for the mappings.

 3) Allow passing an input JSON file with the manually selected ranges
 to be used when creating the device-dax instance.

This applies on top of 'jm/devdax_subdiv' branch in github.com:pmem/ndctl.git

Testing requires a 5.10+ kernel.

v1 -> v2:
  * List mappings only with -M|--mappings option
  * Adds a unit test for --input file (while testing with -M listing too)
  * Rename --restore to --input
  * Add Documentation for -M and for --input

Joao Martins (5):
  libdaxctl: add mapping iterator APIs
  daxctl: include mappings when listing
  libdaxctl: add daxctl_dev_set_mapping()
  daxctl: allow creating devices from input json
  daxctl/test: add a test for daxctl-create with input file

 Documentation/daxctl/daxctl-create-device.txt |  13 +++
 Documentation/daxctl/daxctl-list.txt  |   4 +
 daxctl/device.c   | 128 +-
 daxctl/lib/libdaxctl-private.h|   8 ++
 daxctl/lib/libdaxctl.c| 118 +++-
 daxctl/lib/libdaxctl.sym  |   7 ++
 daxctl/libdaxctl.h|  14 +++
 daxctl/list.c |   4 +
 test/daxctl-create.sh |  31 ++-
 util/json.c   |  57 +++-
 util/json.h   |   4 +
 11 files changed, 380 insertions(+), 8 deletions(-)

-- 
1.8.3.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()

2020-12-17 Thread Ruan Shiyang



On 2020/12/16 4:51 AM, Darrick J. Wong wrote:

On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:

With the support of ->rmap(), it is possible to obtain the superblock on
a mapped device.

If a pmem device is used as one target of a mapped device, we cannot
obtain its superblock directly.  With the help of SYSFS, the mapped
device can be found on the target devices.  So, we iterate over
bdev->bd_holder_disks to obtain its mapped device.

Signed-off-by: Shiyang Ruan 
---
  drivers/md/dm.c   | 66 +++
  drivers/nvdimm/pmem.c |  9 --
  fs/block_dev.c| 21 ++
  include/linux/genhd.h |  7 +
  4 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4e0cbfe3f14d..9da1f9322735 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
  #define dm_blk_report_zones   NULL
  #endif /* CONFIG_BLK_DEV_ZONED */
  
+struct dm_blk_corrupt {
+   struct block_device *bdev;
+   sector_t offset;
+};
+
+static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev *dev,
+   sector_t start, sector_t len, void *data)
+{
+   struct dm_blk_corrupt *bc = data;
+
+   return bc->bdev == (void *)dev->bdev &&
+   (start <= bc->offset && bc->offset < start + len);
+}
+
+static int dm_blk_corrupted_range(struct gendisk *disk,
+ struct block_device *target_bdev,
+ loff_t target_offset, size_t len, void *data)
+{
+   struct mapped_device *md = disk->private_data;
+   struct block_device *md_bdev = md->bdev;
+   struct dm_table *map;
+   struct dm_target *ti;
+   struct super_block *sb;
+   int srcu_idx, i, rc = 0;
+   bool found = false;
+   sector_t disk_sec, target_sec = to_sector(target_offset);
+
+   map = dm_get_live_table(md, &srcu_idx);
+   if (!map)
+   return -ENODEV;
+
+   for (i = 0; i < dm_table_get_num_targets(map); i++) {
+   ti = dm_table_get_target(map, i);
+   if (ti->type->iterate_devices && ti->type->rmap) {
+   struct dm_blk_corrupt bc = {target_bdev, target_sec};
+
+   found = ti->type->iterate_devices(ti, dm_blk_corrupt_fn, &bc);
+   if (!found)
+   continue;
+   disk_sec = ti->type->rmap(ti, target_sec);


What happens if the dm device has multiple reverse mappings because the
physical storage is being shared at multiple LBAs?  (e.g. a
deduplication target)


I thought that the dm device knows the mapping relationship, and it can
be done by the implementation of ->rmap() in each target.  Did I
understand it wrong?





+   break;
+   }
+   }
+
+   if (!found) {
+   rc = -ENODEV;
+   goto out;
+   }
+
+   sb = get_super(md_bdev);
+   if (!sb) {
+   rc = bd_disk_holder_corrupted_range(md_bdev, to_bytes(disk_sec), len, data);
+   goto out;
+   } else if (sb->s_op->corrupted_range) {
+   loff_t off = to_bytes(disk_sec - get_start_sect(md_bdev));
+
+   rc = sb->s_op->corrupted_range(sb, md_bdev, off, len, data);


This "call bd_disk_holder_corrupted_range or sb->s_op->corrupted_range"
logic appears twice; should it be refactored into a common helper?

Or, should the superblock dispatch part move to
bd_disk_holder_corrupted_range?


bd_disk_holder_corrupted_range() requires SYSFS configuration.  I
introduced it to handle those block devices whose superblock cannot be
obtained via get_super().

Usually, if we create a filesystem directly on a pmem device, or make
some partitions first, we can use get_super() to get the superblock.  In
other cases, such as creating an LVM on a pmem device, get_super() does
not work.


So, I think refactoring it into a common helper looks better.
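
Something like this, perhaps (an untested sketch that just mirrors the
logic currently duplicated in dm and pmem; the bd_corrupted_range() name
is hypothetical):

	static int bd_corrupted_range(struct block_device *bdev, loff_t off,
			size_t len, void *data)
	{
		struct super_block *sb = get_super(bdev);
		int rc = 0;

		/* No superblock: fall back to walking the holders. */
		if (!sb)
			return bd_disk_holder_corrupted_range(bdev, off, len, data);

		if (sb->s_op->corrupted_range)
			rc = sb->s_op->corrupted_range(sb, bdev, off, len, data);
		drop_super(sb);
		return rc;
	}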


--
Thanks,
Ruan Shiyang.




+   }
+   drop_super(sb);
+
+out:
+   dm_put_live_table(md, srcu_idx);
+   return rc;
+}
+
  static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
struct block_device **bdev)
  {
@@ -3084,6 +3149,7 @@ static const struct block_device_operations dm_blk_dops = {
.getgeo = dm_blk_getgeo,
.report_zones = dm_blk_report_zones,
.pr_ops = _pr_ops,
+   .corrupted_range = dm_blk_corrupted_range,
.owner = THIS_MODULE
  };
  
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c

index 4688bff19c20..e8cfaf860149 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk *disk, struct block_device *bdev,
  
  	bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;

sb = get_super(bdev);

Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()

2020-12-17 Thread Ruan Shiyang



On 2020/12/16 1:43 PM, Jane Chu wrote:

On 12/15/2020 4:14 AM, Shiyang Ruan wrote:

  #ifdef CONFIG_SYSFS
+int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,
+   size_t len, void *data);
  int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk);
  void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk);

  #else
+int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t off,


Did you mean
   static inline int bd_disk_holder_corrupted_range(..
?


Yes, it's my fault.  Thanks a lot.
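
For clarity, the corrected stub for the !CONFIG_SYSFS case would read:

	/* The stub Jane pointed out, with the missing 'static inline' added: */
	static inline int bd_disk_holder_corrupted_range(struct block_device *bdev,
			loff_t off, size_t len, void *data)
	{
		return 0;
	}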


--
Thanks,
Ruan Shiyang.



thanks,
-jane


+   size_t len, void *data)
+{
+    return 0;
+}
  static inline int bd_link_disk_holder(struct block_device *bdev,
    struct gendisk *disk)





___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC PATCH v3 4/9] mm, fsdax: Refactor memory-failure handler for dax mapping

2020-12-17 Thread Ruan Shiyang



On 2020/12/17 5:26 AM, Dave Chinner wrote:

On Tue, Dec 15, 2020 at 08:14:09PM +0800, Shiyang Ruan wrote:

The current memory_failure_dev_pagemap() can only handle single-mapped
dax pages in fsdax mode.  A dax page could be mapped by multiple files
and offsets if we let the reflink feature and fsdax mode work together.
So, we refactor the current implementation to support handling memory
failure on each file and offset.

Signed-off-by: Shiyang Ruan 
---

.

  static const char *action_name[] = {
@@ -1147,6 +1148,60 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
return 0;
  }
  
+int mf_dax_mapping_kill_procs(struct address_space *mapping, pgoff_t index, int flags)

+{
+   const bool unmap_success = true;
+   unsigned long pfn, size = 0;
+   struct to_kill *tk;
+   LIST_HEAD(to_kill);
+   int rc = -EBUSY;
+   loff_t start;
+   dax_entry_t cookie;
+
+   /*
+* Prevent the inode from being freed while we are interrogating
+* the address_space, typically this would be handled by
+* lock_page(), but dax pages do not use the page lock. This
+* also prevents changes to the mapping of this pfn until
+* poison signaling is complete.
+*/
+   cookie = dax_lock(mapping, index, &pfn);
+   if (!cookie)
+   goto unlock;


Why do we need to prevent the inode from going away here? This
function gets called by XFS after doing an xfs_iget() call to grab
the inode that owns the block. Hence the inode (and the mapping)
are guaranteed to be referenced and can't go away. Hence for the
filesystem based callers, this whole "dax_lock()" thing can go away.

So, AFAICT, the dax_lock() stuff is only necessary when the
filesystem can't be used to resolve the owner of the physical page that
went bad.


Yes, you are right.  I made a mistake in the calling sequence here.
Thanks for pointing that out.



--
Thanks,
Ruan Shiyang.



Cheers,

Dave.



___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-17 Thread Thomas Gleixner
On Thu, Dec 17 2020 at 15:50, Thomas Gleixner wrote:
> On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
>
>> +void write_pkrs(u32 new_pkrs)
>> +{
>> +u32 *pkrs;
>> +
>> +if (!static_cpu_has(X86_FEATURE_PKS))
>> +return;
>> +
>> +pkrs = get_cpu_ptr(&pkrs_cache);
>
> So this is called from various places including schedule and also from
> the low level entry/exit code. Why do we need to have an extra
> preempt_disable/enable() there via get/put_cpu_ptr()?
>
> Just because performance in those code paths does not matter?
>
>> +if (*pkrs != new_pkrs) {
>> +*pkrs = new_pkrs;
>> +wrmsrl(MSR_IA32_PKRS, new_pkrs);
>> +}
>> +put_cpu_ptr(pkrs);

Which made me look at the other branch of your git repo just because I
wanted to know about the 'other' storage requirements and I found this
gem:

> update_global_pkrs()
> ...
>   /*
>* If we are preventing access from the old value.  Force the
>* update on all running threads.
>*/
>   if (((old_val == 0) && protection) ||
>   ((old_val & PKR_WD_BIT) && (protection & PKEY_DISABLE_ACCESS))) {
>   int cpu;
>
>   for_each_online_cpu(cpu) {
>   u32 *ptr = per_cpu_ptr(&pkrs_cache, cpu);
>
>   *ptr = update_pkey_val(*ptr, pkey, protection);
>   wrmsrl_on_cpu(cpu, MSR_IA32_PKRS, *ptr);
>   put_cpu_ptr(ptr);

1) per_cpu_ptr() -> put_cpu_ptr() is broken as per_cpu_ptr() is not
   disabling preemption while put_cpu_ptr() enables it, which wrecks
   the preemption count.

   How was that ever tested at all with any debug option enabled?

   Answer: Not at all

2) How is that sequence:

ptr = per_cpu_ptr(&pkrs_cache, cpu);
*ptr = update_pkey_val(*ptr, pkey, protection);
wrmsrl_on_cpu(cpu, MSR_IA32_PKRS, *ptr);

   supposed to be correct vs. a concurrent modification of the
   pkrs_cache of the remote CPU?

   Answer: Not at all

Also doing a wrmsrl_on_cpu() on _each_ online CPU is insane at best.

  A smp function call on a remote CPU takes ~3-5us when the remote CPU
  is not idle and can immediately respond. If the remote CPU is deep in
  idle it can take up to 100us depending on C-State it is in.

  Even if the remote CPU is not idle and just has interrupts
  disabled for a few dozen microseconds, this adds up.

  So on a 256 CPU system depending on the state of the remote CPUs this
  stalls the CPU doing the update for anything between 1 and 25ms worst
  case.

  Of course that also violates _all_ CPU isolation mechanisms.

  What for?

  Just for the theoretical chance that _all_ remote CPUs have
  seen that global permission and have it still active?

  You're not serious about that, right?

The only use case for this in your tree is: kmap() and the possible
usage of that mapping outside of the thread context which sets it up.

The only hint for doing this at all is:

Some users, such as kmap(), sometimes requires PKS to be global.

'sometime requires' is really _not_ a technical explanation.

Where is the explanation why kmap() usage 'sometimes' requires this
global trainwreck in the first place and where is the analysis why this
can't be solved differently?

Detailed use case analysis please.

Thanks,

tglx


___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH RFC 7/9] mm/gup: Decrement head page once for group of subpages

2020-12-17 Thread Joao Martins
On 12/17/20 8:05 PM, Jason Gunthorpe wrote:
> On Thu, Dec 17, 2020 at 07:05:37PM +, Joao Martins wrote:
>>> No reason not to fix set_page_dirty_lock() too while you are here.
>>
>> The whack of atomics you mentioned earlier, I suppose, ends up being
>> account_page_dirtied(). See partial diff at the end.
> 
> Well, even just eliminating the lock_page, page_mapping, PageDirty,
> etc is already a big win.
> 
> If mapping->a_ops->set_page_dirty() needs to be called multiple times
> on the head page I'd probably just suggest:
> 
>   while (ntails--)
> rc |= (*spd)(head);
> 
> At least as a start.
> 
/me nods

> If you have workloads that have page_mapping != NULL then look at
> another series to optimize that. Looks a bit large though given the
> number of places implementing set_page_dirty
> 
Yes. I don't have a particular workload, I was just wondering what you had in
mind. At a glance, changing all the places without messing up filesystems
looks like the subject of a separate series.

> I think the current reality is calling set_page_dirty on an actual
> file system is busted anyhow, so I think mapping is generally going to
> be NULL here?

Perhaps -- I'll have to check.

Joao


Re: [PATCH 1/1] device-dax: delete a redundancy check in dev_dax_validate_align()

2020-12-17 Thread Dan Williams
On Fri, Nov 20, 2020 at 1:27 AM Leizhen (ThunderTown)
 wrote:
>
>
>
> On 2020/11/20 17:20, Zhen Lei wrote:
> > After we have done the alignment check for the length of each range, the
> > alignment check for dev_dax_size(dev_dax) is no longer needed, because it
> > gets the sum of the length of each range.
>
> For example:
> x/M + y/M = (x + y)/M
> If x/M is an integer and y/M is also an integer, say x = a*M and y = b*M,
> then x + y = (a + b)*M, so (x + y)/M must be an integer.
>

True... I was going to say that the different error messages might be
useful, but those are debug statements anyways, so I'll apply this.


Re: [PATCH V3 10/10] x86/pks: Add PKS test code

2020-12-17 Thread Dave Hansen
On 11/6/20 3:29 PM, ira.we...@intel.com wrote:
> + /* Arm for context switch test */
> + write(fd, "1", 1);
> +
> + /* Context switch out... */
> + sleep(4);
> +
> + /* Check msr restored */
> + write(fd, "2", 1);

These are always tricky.  What you ideally want here is:

1. Switch away from this task to a non-PKS task, or
2. Switch from this task to a PKS-using task, but one which has a
   different PKS value

then, switch back to this task and make sure PKS maintained its value.

*But*, there's no absolute guarantee that another task will run.  It
would not be totally unreasonable to have the kernel just sit in a loop
without context switching here if no other tasks can run.

The only way you *know* there is a context switch is by having two tasks
bound to the same logical CPU and making sure they run one after another.
This just gets itself into a state where it *CAN* context switch and
prays that one will happen.

You can also run a bunch of these in parallel bound to a single CPU.
That would also give you higher levels of assurance that *some* context
switch happens at sleep().
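
E.g. a sketch of forcing that situation (illustrative only, not the
actual selftest):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <sys/wait.h>

/* Pin parent and child to the same logical CPU so that when one sleeps
 * the other *must* be switched in. */
int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);

	if (fork() == 0) {
		/* child: run the PKS test sequence here */
		_exit(0);
	}

	/* parent: run its own PKS test sequence here */
	wait(NULL);
	return 0;
}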

One critical thing with these tests is to sabotage the kernel and then
run them and make *sure* they fail.  Basically, if you screw up, do they
actually work to catch it?


Re: [NEEDS-REVIEW] [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-17 Thread Dave Hansen
On 11/6/20 3:29 PM, ira.we...@intel.com wrote:
>  void disable_TSC(void)
> @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
>  
>   if ((tifp ^ tifn) & _TIF_SLD)
>   switch_to_sld(tifn);
> +
> + pks_sched_in();
>  }

Does the selftest for this ever actually schedule()?

I see it talking about context switching, but I don't immediately see
how it would.


Re: [PATCH ndctl v1 0/8] daxctl: Add device align and range mapping allocation

2020-12-17 Thread Verma, Vishal L
On Thu, 2020-12-17 at 11:23 +, Joao Martins wrote:
> 
> The provisioning flow additions have some open questions about the daxctl
> user interface. To summarize:
> 
> 1) Should we always list mappings, or only list them with a -v option? Or
> maybe, instead of -v, use a new -M option which enables the
> listing of mappings?
> 
> The reason being that it can get quite verbose when a device picks a lot
> of mappings, hence I would imagine this info isn't necessary for the casual
> daxctl listing.

I think hiding them behind a new option is probably best. And then we
can have different verbosity levels turn on or off. The verbosity levels
stuff is implemented in ndctl, but I don't think it is in daxctl yet, so
we can just do the specific option to display mappings for now, and then
revisit verbosity levels in the future if we feel like the number of
options is getting out of hand.
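
Something like this, say (illustrative only, none of it implemented yet):

	# default listing, mappings hidden
	daxctl list -d dax0.1

	# opt in to the potentially large mappings array
	daxctl list -M -d dax0.1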

> 
> 2) Should the '--restore ' option instead be called '--device'? I feel
> the name '--restore' is too tied to one specific way of using it when the
> feature can be used by a tool which wants to manage

Hm, I looked at other commands that take an input file - write labels
just calls it --input, so there might be value in staying consistent
with that. But write-infoblock just uses stdin - so that could be
another option. I'd be fine with either of those.

> its own range mappings. Additionally, I was thinking that if one wants to
> manually add/fixup ranges more easily, we could add
> a --mapping :- sort of syntax. But I suppose this could
> be added later if its really desired.

Agreed with adding this later if needed.

> 
> And with these clarified, I could respin it. Oh and I'm missing a
> mappings test as well.

Sounds good, I'll wait to get these in.

> 
> It's worth mentioning that kexec will need fixing, as dax_hmem regions
> created with HMAT or manually assigned with efi_fake_mem= get lost on
> kexec because we do not pass the EFI_MEMMAP conventional memory ranges
> to the second kernel (only runtime code/boot services). I have an
> RFC patch for x86 efi handling, but should get that conversation started
> after holidays.
> 
>   Joao



Re: [PATCH RFC 7/9] mm/gup: Decrement head page once for group of subpages

2020-12-17 Thread Joao Martins
On 12/8/20 7:34 PM, Jason Gunthorpe wrote:
>> @@ -274,6 +291,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>   bool make_dirty)
>>  {
>>  unsigned long index;
>> +int refs = 1;
>>  
>>  /*
>>   * TODO: this can be optimized for huge pages: if a series of pages is
>> @@ -286,8 +304,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>  return;
>>  }
>>  
>> -for (index = 0; index < npages; index++) {
>> +for (index = 0; index < npages; index += refs) {
>>  struct page *page = compound_head(pages[index]);
>> +
> 
> I think this is really hard to read, it should end up as some:
> 
> for_each_compound_head(page_list, page_list_len, &head, &ntails) {
>   if (!PageDirty(head))
>   set_page_dirty_lock(head, ntails);
>   unpin_user_page(head, ntails);
> }
> 
> And maybe you open code that iteration, but that basic idea to find a
> compound_head and ntails should be computational work performed.
> 
> No reason not to fix set_page_dirty_lock() too while you are here.
> 

The whack of atomics you mentioned earlier, I suppose, ends up being
account_page_dirtied(). See partial diff at the end.

I was looking at the latter part and renaming all the fs that supply
set_page_dirty()... But now my concern is whether it's really safe to
assume that filesystems that supply it indeed have the ability to dirty
@ntails pages. Functionally, fixing set_page_dirty_lock() means we don't call
set_page_dirty(head) @ntails times as happens today, we would only call it
once with ntails as an argument.

Perhaps the safest thing to do is still to iterate over @ntails and call
.set_page_dirty(page), and instead introduce a set_page_range_dirty() which
individual filesystems can separately supply, giving precedence to
->set_page_range_dirty() over ->set_page_dirty()?

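A rough sketch of what I mean, with ->set_page_range_dirty() as the new,
hypothetical address_space operation:

static void set_page_range_dirty_lock(struct page *head, unsigned int ntails)
{
	struct address_space *mapping = page_mapping(head);
	unsigned int i;

	/* Range-aware filesystems get a single call for the whole range */
	if (mapping && mapping->a_ops->set_page_range_dirty) {
		mapping->a_ops->set_page_range_dirty(head, ntails);
		return;
	}

	/* Fallback: dirty the head @ntails times, as happens today */
	for (i = 0; i < ntails; i++)
		set_page_dirty_lock(head);
}
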
Joao

->8--

diff --git a/mm/gup.c b/mm/gup.c
index 41ab3d48e1bb..5f8a0f16ab62 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -295,7 +295,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 * next writeback cycle. This is harmless.
 */
if (!PageDirty(head))
-   set_page_dirty_lock(head);
+   set_page_range_dirty_lock(head, ntails);
put_compound_head(head, ntails, FOLL_PIN);
}
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 088729ea80b2..4642d037f657 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2417,7 +2417,8 @@ int __set_page_dirty_no_writeback(struct page *page, unsigned int ntails)
  *
  * NOTE: This relies on being atomic wrt interrupts.
  */
-void account_page_dirtied(struct page *page, struct address_space *mapping)
+void account_page_dirtied(struct page *page, struct address_space *mapping,
+ unsigned int ntails)
 {
struct inode *inode = mapping->host;

@@ -2425,17 +2426,18 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)

if (mapping_can_writeback(mapping)) {
struct bdi_writeback *wb;
+   int nr = ntails + 1;

inode_attach_wb(inode, page);
wb = inode_to_wb(inode);

-   __inc_lruvec_page_state(page, NR_FILE_DIRTY);
-   __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-   __inc_node_page_state(page, NR_DIRTIED);
-   inc_wb_stat(wb, WB_RECLAIMABLE);
-   inc_wb_stat(wb, WB_DIRTIED);
-   task_io_account_write(PAGE_SIZE);
-   current->nr_dirtied++;
+   mod_lruvec_page_state(page, NR_FILE_DIRTY, nr);
+   mod_zone_page_state(page_zone(page), NR_ZONE_WRITE_PENDING, nr);
+   mod_node_page_state(page_pgdat(page), NR_DIRTIED, nr);
+   __add_wb_stat(wb, WB_RECLAIMABLE, nr);
+   __add_wb_stat(wb, WB_DIRTIED, nr);
+   task_io_account_write(nr * PAGE_SIZE);
+   current->nr_dirtied += nr;
this_cpu_inc(bdp_ratelimits);

mem_cgroup_track_foreign_dirty(page, wb);
@@ -2485,7 +2487,7 @@ int __set_page_dirty_nobuffers(struct page *page, unsigned int ntails)
       xa_lock_irqsave(&mapping->i_pages, flags);
BUG_ON(page_mapping(page) != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
-   account_page_dirtied(page, mapping);
+   account_page_dirtied(page, mapping, ntails);
       __xa_set_mark(&mapping->i_pages, page_index(page),
          PAGECACHE_TAG_DIRTY);
       xa_unlock_irqrestore(&mapping->i_pages, flags);
@@ -2624,6 +2626,27 @@ int set_page_dirty_lock(struct page *page)
 }
 EXPORT_SYMBOL(set_page_dirty_lock);

+/*
+ * 

Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference

2020-12-17 Thread Thomas Gleixner
On Mon, Nov 23 2020 at 22:09, ira weiny wrote:
> From: Ira Weiny 
>
> Currently struct irqentry_state_t only contains a single bool value
> which makes passing it by value reasonable.  However, future patches
> add information to this struct.  This includes the PKRS thread state,
> included in this series, as well as information to store kmap reference
> tracking and PKS global state outside this series. In total, we
> anticipate 2 new 32 bit fields and an integer field to be added to the
> struct beyond the existing bool value.

Well yes, but why can't you provide at least in the comment section
below the '---' a pointer to the latest version of this reference muck
and PKS global state if you can't explain at least the concept of the
two things here?

It's one thing that you anticipate something but a different thing
whether it's the right thing to do.

Thanks,

tglx


Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference

2020-12-17 Thread Andy Lutomirski

> On Dec 17, 2020, at 5:19 AM, Peter Zijlstra  wrote:
> 
> On Thu, Dec 17, 2020 at 02:07:01PM +0100, Thomas Gleixner wrote:
>>> On Fri, Dec 11 2020 at 14:14, Andy Lutomirski wrote:
>>>> On Mon, Nov 23, 2020 at 10:10 PM  wrote:
>>> After contemplating this for a bit, I think this isn't really the
>>> right approach.  It *works*, but we've mostly just created a bit of an
>>> unfortunate situation.  Our stack, on a (possibly nested) entry looks
>>> like:
>>> 
>>> previous frame (or empty if we came from usermode)
>>> ---
>>> SS
>>> RSP
>>> FLAGS
>>> CS
>>> RIP
>>> rest of pt_regs
>>> 
>>> C frame
>>> 
>>> irqentry_state_t (maybe -- the compiler is within its rights to play
>>> almost arbitrary games here)
>>> 
>>> more C stuff
>>> 
>>> So what we've accomplished is having two distinct arch register
>>> regions, one called pt_regs and the other stuck in irqentry_state_t.
>>> This is annoying because it means that, if we want to access this
>>> thing without passing a pointer around or access it at all from outer
>>> frames, we need to do something terrible with the unwinder, and we
>>> don't want to go there.
>>> 
>>> So I propose a somewhat different solution: lay out the stack like this.
>>> 
>>> SS
>>> RSP
>>> FLAGS
>>> CS
>>> RIP
>>> rest of pt_regs
>>> PKS
>>>  extended_pt_regs points here
>>> 
>>> C frame
>>> more C stuff
>>> ...
>>> 
>>> IOW we have:
>>> 
>>> struct extended_pt_regs {
>>>  bool rcu_whatever;
>>>  other generic fields here;
>>>  struct arch_extended_pt_regs arch_regs;
>>>  struct pt_regs regs;
>>> };
>>> 
>>> and arch_extended_pt_regs has unsigned long pks;
>>> 
>>> and instead of passing a pointer to irqentry_state_t to the generic
>>> entry/exit code, we just pass a pt_regs pointer.
>> 
>> While I agree vs. PKS which is architecture specific state and needed in
>> other places e.g. #PF, I'm not convinced that sticking the existing
>> state into the same area buys us anything more than an indirect access.
>> 
>> Peter?
> 
> Agreed; that immediately solves the confusion Ira had as well. While
> extending pt_regs sounds scary, I think we've isolated our pt_regs
> implementation from actual ABI pretty well, but of course, that would
> need an audit. We don't want to leak this into signals for example.
> 

I’m okay with this.

My suggestion for having an extended pt_regs that contains pt_regs is to keep 
extensions like this invisible to unsuspecting parts of the kernel. In 
particular, BPF seems to pass around struct pt_regs *, and I don’t know what 
the implications of effectively offsetting all the registers relative to the 
pointer would be.

Anything that actually broke the signal regs ABI should be noticed by the x86 
selftests — the tests read and write registers through ucontext.
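
For concreteness, the accessor such a layout implies is just container_of()
(sketch, all names illustrative):

struct arch_extended_pt_regs {
	unsigned long pks;
};

struct extended_pt_regs {
	bool rcu_whatever;
	/* other generic fields here */
	struct arch_extended_pt_regs arch_regs;
	struct pt_regs regs;		/* must stay the last member */
};

static inline struct extended_pt_regs *extended_regs(struct pt_regs *regs)
{
	return container_of(regs, struct extended_pt_regs, regs);
}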

> 


Re: [PATCH V3 06/10] x86/entry: Preserve PKRS MSR across exceptions

2020-12-17 Thread Thomas Gleixner
On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
> +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> +/*
> + * PKRS is a per-logical-processor MSR which overlays additional protection for
> + * pages which have been mapped with a protection key.
> + *
> + * The register is not maintained with XSAVE so we have to maintain the MSR
> + * value in software during context switch and exception handling.
> + *
> + * Context switches save the MSR in the task struct thus taking that value to
> + * other processors if necessary.
> + *
> + * To protect against exceptions having access to this memory we save the
> + * current running value and set the PKRS value for the duration of the
> + * exception.  Thus preventing exception handlers from having the elevated
> + * access of the interrupted task.
> + */
> +noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_PKS))
> + return;
> +
> + irq_state->thread_pkrs = current->thread.saved_pkrs;
> + write_pkrs(INIT_PKRS_VALUE);

Why is this noinstr? Just because it's called from a noinstr function?

Of course the function itself violates the noinstr constraints:

  vmlinux.o: warning: objtool: write_pkrs()+0x36: call to do_trace_write_msr() leaves .noinstr.text section

There is absolutely no reason to have this marked noinstr.

Thanks,

tglx


Re: [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-17 Thread Thomas Gleixner
On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -43,6 +43,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "process.h"
>  
> @@ -187,6 +188,27 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>   return ret;
>  }
>  
> +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> +DECLARE_PER_CPU(u32, pkrs_cache);
> +static inline void pks_init_task(struct task_struct *tsk)

First of all, I asked several times now not to glue stuff onto a
function without a newline in between. It's unreadable.

But what's worse is that the declaration of pkrs_cache, which is global,
is in a C file and not in a header. And pkrs_cache is not even used in
this file. So what?

> +{
> + /* New tasks get the most restrictive PKRS value */
> + tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
> +}
> +static inline void pks_sched_in(void)

Newline between functions. It's fine for stubs, but not for a real 
implementation.

> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index d1dfe743e79f..76a62419c446 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -231,3 +231,34 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
>  
>   return pk_reg;
>  }
> +
> +DEFINE_PER_CPU(u32, pkrs_cache);

Again, why is this global?

> +void write_pkrs(u32 new_pkrs)
> +{
> + u32 *pkrs;
> +
> + if (!static_cpu_has(X86_FEATURE_PKS))
> + return;
> +
> + pkrs = get_cpu_ptr(&pkrs_cache);

So this is called from various places including schedule and also from
the low level entry/exit code. Why do we need to have an extra
preempt_disable/enable() there via get/put_cpu_ptr()?

Just because performance in those code paths does not matter?

> + if (*pkrs != new_pkrs) {
> + *pkrs = new_pkrs;
> + wrmsrl(MSR_IA32_PKRS, new_pkrs);
> + }
> + put_cpu_ptr(pkrs);
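
If they are guaranteed to run with preemption disabled anyway, then the
whole dance reduces to this (untested sketch):

void write_pkrs(u32 new_pkrs)
{
	u32 *pkrs;

	if (!static_cpu_has(X86_FEATURE_PKS))
		return;

	pkrs = this_cpu_ptr(&pkrs_cache);
	if (*pkrs != new_pkrs) {
		*pkrs = new_pkrs;
		wrmsrl(MSR_IA32_PKRS, new_pkrs);
	}
}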

Now back to the context switch:

> @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
>
>if ((tifp ^ tifn) & _TIF_SLD)
>switch_to_sld(tifn);
> +
> + pks_sched_in();
>  }

How is this supposed to work? 

switch_to() {
   
   switch_to_extra() {
  
  if (unlikely(next_tif & _TIF_WORK_CTXSW_NEXT ||
   prev_tif & _TIF_WORK_CTXSW_PREV))
   __switch_to_xtra(prev, next);

I.e. __switch_to_xtra() is only invoked when the above condition is
true, which is not guaranteed at all.

While I have to admit that I dropped the ball on the update for the
entry patch, I'm not too sorry about it anymore when looking at this.

Are you still sure that this is ready for merging?

Thanks,

tglx


Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference

2020-12-17 Thread Peter Zijlstra
On Thu, Dec 17, 2020 at 02:07:01PM +0100, Thomas Gleixner wrote:
> On Fri, Dec 11 2020 at 14:14, Andy Lutomirski wrote:
> > On Mon, Nov 23, 2020 at 10:10 PM  wrote:
> > After contemplating this for a bit, I think this isn't really the
> > right approach.  It *works*, but we've mostly just created a bit of an
> > unfortunate situation.  Our stack, on a (possibly nested) entry looks
> > like:
> >
> > previous frame (or empty if we came from usermode)
> > ---
> > SS
> > RSP
> > FLAGS
> > CS
> > RIP
> > rest of pt_regs
> >
> > C frame
> >
> > irqentry_state_t (maybe -- the compiler is within its rights to play
> > almost arbitrary games here)
> >
> > more C stuff
> >
> > So what we've accomplished is having two distinct arch register
> > regions, one called pt_regs and the other stuck in irqentry_state_t.
> > This is annoying because it means that, if we want to access this
> > thing without passing a pointer around or access it at all from outer
> > frames, we need to do something terrible with the unwinder, and we
> > don't want to go there.
> >
> > So I propose a somewhat different solution: lay out the stack like this.
> >
> > SS
> > RSP
> > FLAGS
> > CS
> > RIP
> > rest of pt_regs
> > PKS
> >  extended_pt_regs points here
> >
> > C frame
> > more C stuff
> > ...
> >
> > IOW we have:
> >
> > struct extended_pt_regs {
> >   bool rcu_whatever;
> >   other generic fields here;
> >   struct arch_extended_pt_regs arch_regs;
> >   struct pt_regs regs;
> > };
> >
> > and arch_extended_pt_regs has unsigned long pks;
> >
> > and instead of passing a pointer to irqentry_state_t to the generic
> > entry/exit code, we just pass a pt_regs pointer.
> 
> While I agree vs. PKS which is architecture specific state and needed in
> other places e.g. #PF, I'm not convinced that sticking the existing
> state into the same area buys us anything more than an indirect access.
> 
> Peter?

Agreed; that immediately solves the confusion Ira had as well. While
extending pt_regs sounds scary, I think we've isolated our pt_regs
implementation from actual ABI pretty well, but of course, that would
need an audit. We don't want to leak this into signals for example.



Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference

2020-12-17 Thread Thomas Gleixner
On Fri, Dec 11 2020 at 14:14, Andy Lutomirski wrote:
> On Mon, Nov 23, 2020 at 10:10 PM  wrote:
> After contemplating this for a bit, I think this isn't really the
> right approach.  It *works*, but we've mostly just created a bit of an
> unfortunate situation.  Our stack, on a (possibly nested) entry looks
> like:
>
> previous frame (or empty if we came from usermode)
> ---
> SS
> RSP
> FLAGS
> CS
> RIP
> rest of pt_regs
>
> C frame
>
> irqentry_state_t (maybe -- the compiler is within its rights to play
> almost arbitrary games here)
>
> more C stuff
>
> So what we've accomplished is having two distinct arch register
> regions, one called pt_regs and the other stuck in irqentry_state_t.
> This is annoying because it means that, if we want to access this
> thing without passing a pointer around or access it at all from outer
> frames, we need to do something terrible with the unwinder, and we
> don't want to go there.
>
> So I propose a somewhat different solution: lay out the stack like this.
>
> SS
> RSP
> FLAGS
> CS
> RIP
> rest of pt_regs
> PKS
>  extended_pt_regs points here
>
> C frame
> more C stuff
> ...
>
> IOW we have:
>
> struct extended_pt_regs {
>   bool rcu_whatever;
>   other generic fields here;
>   struct arch_extended_pt_regs arch_regs;
>   struct pt_regs regs;
> };
>
> and arch_extended_pt_regs has unsigned long pks;
>
> and instead of passing a pointer to irqentry_state_t to the generic
> entry/exit code, we just pass a pt_regs pointer.

While I agree vs. PKS which is architecture specific state and needed in
other places e.g. #PF, I'm not convinced that sticking the existing
state into the same area buys us anything more than an indirect access.

Peter?

Thanks,

tglx


Re: [PATCH ndctl v1 0/8] daxctl: Add device align and range mapping allocation

2020-12-17 Thread Joao Martins
On 12/16/20 11:42 PM, Dan Williams wrote:
> On Wed, Dec 16, 2020 at 2:54 PM Joao Martins  
> wrote:
>> On 12/16/20 10:31 PM, Dan Williams wrote:
>>> On Wed, Dec 16, 2020 at 1:51 PM Joao Martins  
>>> wrote:
 On 12/16/20 6:42 PM, Verma, Vishal L wrote:
> On Wed, 2020-12-16 at 11:39 +, Joao Martins wrote:
>> On 7/16/20 7:46 PM, Joao Martins wrote:
>>> Hey,
>>>
>>> This series builds on top of this one[0] and makes the following
>>> improvements
>>> to the Soft-Reserved subdivision:
>>>
>>>  1) Support for {create,reconfigure}-device for selecting @align 
>>> (hugepage size).
>>>  Here we add a '-a|--align 4K|2M|1G' option to the existing commands;
>>>
>>>  2) Listing improvements for device alignment and mappings;
>>>  Note: Perhaps it is better to hide the mappings by default, and only
>>>print with -v|--verbose. This would align with ndctl, as the mappings
>>>info can be quite large.
>>>
>>>  3) Allow creating devices from selecting ranges. This allows to keep 
>>> the
>>>same GPA->HPA mapping as before we kexec the hypervisor with running 
>>> guests:
>>>
>>>daxctl list -d dax0.1 > /var/log/dax0.1.json
>>>kexec -d -l bzImage
>>>systemctl kexec
>>>daxctl create -u --restore /var/log/dax0.1.json
>>>
>>>The JSON was what I thought would be easiest for a user, given that it is
>>>the data format daxctl outputs. Alternatives could be adding multiple:
>>> --mapping :-
>>>
>>>But that could end up as a gigantic line and be a little more
>>>unmanageable, I think.
>>>
>>> This series requires this series[0] on top of Dan's patches[1]:
>>>
>>>  [0] 
>>> https://lore.kernel.org/linux-nvdimm/20200716172913.19658-1-joao.m.mart...@oracle.com/
>>>  [1] 
>>> https://lore.kernel.org/linux-nvdimm/159457116473.754248.7879464730875147365.st...@dwillia2-desk3.amr.corp.intel.com/
>>>
>>> The only TODO here is docs and improving tests to validate mappings,
>>> and to test the restore path.
>>>
>>> Suggestions/comments are welcome.
>>>
>> There's a couple of issues in this series regarding daxctl-reconfigure
>> options and breakage of ndctl with kernels (<5.10) that do not supply a
>> device @align upon testing with NVDIMMs. Plus it is missing a
>> daxctl-create.sh unit test for @align.
>>
>> I will fix those and respin, and probably take out the last patch as
>> it's more RFC-ish and in need of feedback.
>
> Sounds good. Any objections to releasing v70 with the initial support,
> and then adding this series on for the next one? I'm thinking I'll do a
> much quicker v72 release say in early January with this and anything
> else that missed v71.

 If we're able to wait until tomorrow, I could respin these first four
 patches with the fixes and include the align support in the initial set.
 Otherwise, I am also good if you prefer deferring it to v72.
>>>
>>> Does this change the JSON output? Might be nice to keep that all in
>>> one update rather than invalidate some testing with the old format
>>> between v71 and v72.
>>>
>> Ugh, sent the v2 too early before seeing this.
>>
>> The only change to the output is on daxctl when listing devices for 5.10+.
>> It starts displaying an "align" key/value.
> 
> No worries. Vishal and I chatted and it looks to me like your
> improvements to the provisioning flow are worthwhile to sneak into the
> v71 release if you want to include those changes as well.

The provisioning flow additions have some open questions about the daxctl
user interface. To summarize:

1) Should we always list mappings, or only list them with a -v option? Or
maybe, instead of -v, use a new -M option which enables the
listing of mappings?

The reason being that it can get quite verbose when a device picks a lot
of mappings, hence I would imagine this info isn't necessary for the casual
daxctl listing.

2) Should the '--restore ' option instead be called '--device'? I feel
the name '--restore' is too tied to one specific way of using it when the
feature can be used by a tool which wants to manage
its own range mappings. Additionally, I was thinking that if one wants to
manually add/fixup ranges more easily, we could add
a --mapping :- sort of syntax. But I suppose this could
be added later if its really desired.

And with these clarified, I could respin it. Oh and I'm missing a
mappings test as well.

It's worth mentioning that kexec will need fixing, as dax_hmem regions
created with HMAT or manually assigned with efi_fake_mem= get lost on
kexec because we do not pass the EFI_MEMMAP conventional memory ranges
to the second kernel (only runtime code/boot services). I have an
RFC patch for x86 efi handling, but should get that conversation started
after holidays.

Re: [ndctl PATCH V2 0/8] fix serverl issues reported by Coverity

2020-12-17 Thread Dan Williams
On Tue, Dec 8, 2020 at 4:20 PM Jane Chu  wrote:
>
> Hi,
>
> I actually just ran into the NULL deref issue that is fixed here.
>
> Bu I have a question for the experts:
> what might cause libndctl to run into the NULL deref like below ?
>
> Program terminated with signal 11, Segmentation fault.
> #0  ndctl_pfn_get_bus (pfn=pfn@entry=0x0) at libndctl.c:5540
> 5540return pfn->region->bus;
>
> (gdb) print pfn
> $1 = (struct ndctl_pfn *) 0x0
> (gdb) frame 4
> #4  0x0040ca70 in setup_namespace (region=region@entry=0x109d910,
>  ndns=ndns@entry=0x10a7d40, p=p@entry=0x7ffd8ff73b90) at namespace.c:570
> 570 try(ndctl_dax, set_uuid, dax, uuid);
> (gdb) info locals
> __rc = 
> dax = 0x0
>
> What I did was to let 2 threads run "create-namespace all" in a tight
> loop, and 2 other threads run "destroy-namespace all" in a tight loop,

This will definitely cause libndctl to get confused; the expectation is
that only one ndctl instance is operating on one region at a time.
What likely happened is TOCTOU in ndctl_region_get_pfn_seed() where
the seed device name is destroyed by the time it tries to convert it
to an ndctl object. The reason libndctl does not perform its own
locking is to keep the library stateless and allow locking to be imposed
from a higher level. Two ndctl instances must not be allowed to
operate in the same region, otherwise the two libndctl instances will get
out of sync.

This is no different than 2 fdisk processes running at the same time,
they are going to invalidate each other's view of the cached partition
state. The fix is not for fdisk to implement locking internally;
instead it requires the admin to arrange for only one fdisk to run
against one disk at a time.
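
Until then, for a test like yours, something as simple as wrapping each
invocation with flock(1) keeps the instances serialized (illustrative):

	flock /tmp/region1.lock ndctl create-namespace ...
	flock /tmp/region1.lock ndctl destroy-namespace all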

> while chasing a year-old issue that randomly resurfaces -
> "nd_region region1: allocation underrun: 0x0 of 0x4000 bytes"

It would be interesting to get more data on the sequence of
allocations that lead up to this event, and a dump of the resource
tree when this happens:

for (i = 0; i < nd_region->ndr_mappings; i++) {
        ndd = to_ndd(&nd_region->mapping[i]);
        for_each_dpa_resource(...)
                nd_dbg_dpa(...)
}


>
> In addition, there are kmemleaks,
> # cat /sys/kernel/debug/kmemleak
> [..]
> unreferenced object 0x976bd46f6240 (size 64):
>comm "ndctl", pid 23556, jiffies 4299514316 (age 5406.733s)
>hex dump (first 32 bytes):
>  00 00 00 00 00 00 00 00 00 00 20 c3 37 00 00 00  .. .7...
>  ff ff ff 7f 38 00 00 00 00 00 00 00 00 00 00 00  8...
>backtrace:
>  [<064003cf>] __kmalloc_track_caller+0x136/0x379
>  [] krealloc+0x67/0x92
>  [] __alloc_dev_dax_range+0x73/0x25c
>  [<27d58626>] devm_create_dev_dax+0x27d/0x416
>  [<434abd43>] __dax_pmem_probe+0x1c9/0x1000 [dax_pmem_core]
>  [<83726c1c>] dax_pmem_probe+0x10/0x1f [dax_pmem]
>  [] nvdimm_bus_probe+0x9d/0x340 [libnvdimm]
>  [] really_probe+0x230/0x48d
>  [<6cabd38e>] driver_probe_device+0x122/0x13b
>  [<29c7b95a>] device_driver_attach+0x5b/0x60
>  [<53e5659b>] bind_store+0xb7/0xc3
>  [] drv_attr_store+0x27/0x31
>  [<949069c5>] sysfs_kf_write+0x4a/0x57
>  [<4a8b5adf>] kernfs_fop_write+0x150/0x1e5
>  [] __vfs_write+0x1b/0x34
>  [] vfs_write+0xd8/0x1d1

Hmm... maybe this?

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 27513d311242..506549235e03 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1393,6 +1393,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 err_pgmap:
free_dev_dax_ranges(dev_dax);
 err_range:
+   kfree(dev_dax->ranges);
free_dev_dax_id(dev_dax);
 err_id:
kfree(dev_dax);