Re: [PATCH 17/17] mm: make range-to-target_node lookup facility a part of numa_memblks

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:46 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> The x86 implementation of range-to-target_node lookup (i.e.
> phys_to_target_node() and memory_add_physaddr_to_nid()) relies on
> numa_memblks.
> 
> Since numa_memblks are now part of the generic code, move these
> functions from x86 to mm/numa_memblks.c and select
> CONFIG_NUMA_KEEP_MEMINFO when CONFIG_NUMA_MEMBLKS=y for dax and cxl.
> 
> Signed-off-by: Mike Rapoport (Microsoft) 

Reviewed-by: Jonathan Cameron 

Thanks. I'll poke around more next week.  Have a good weekend.

Jonathan



Re: [PATCH 12/17] mm: introduce numa_memblks

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:41 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> Move code dealing with numa_memblks from arch/x86 to mm/ and add Kconfig
> options to let x86 select it in its Kconfig.
> 
> This code will be later reused by arch_numa.
> 
> No functional changes.
> 
> Signed-off-by: Mike Rapoport (Microsoft) 
Hi Mike,

My only real concern in here is there are a few places where
the lifted code makes changes to memblocks that are x86 only today.
I need to do some more digging to work out if those are safe
in all cases.

Jonathan



> +/**
> + * numa_cleanup_meminfo - Cleanup a numa_meminfo
> + * @mi: numa_meminfo to clean up
> + *
> + * Sanitize @mi by merging and removing unnecessary memblks.  Also check for
> + * conflicts and clear unused memblks.
> + *
> + * RETURNS:
> + * 0 on success, -errno on failure.
> + */
> +int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
> +{
> + const u64 low = 0;

Given always zero, why not just use that value inline?

> + const u64 high = PFN_PHYS(max_pfn);
> + int i, j, k;
> +
> + /* first, trim all entries */
> + for (i = 0; i < mi->nr_blks; i++) {
> + struct numa_memblk *bi = &mi->blk[i];
> +
> + /* move / save reserved memory ranges */
> + if (!memblock_overlaps_region(&memblock.memory,
> + bi->start, bi->end - bi->start)) {
> + numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi);
> + continue;
> + }
> +
> + /* make sure all non-reserved blocks are inside the limits */
> + bi->start = max(bi->start, low);
> +
> + /* preserve info for non-RAM areas above 'max_pfn': */
> + if (bi->end > high) {
> + numa_add_memblk_to(bi->nid, high, bi->end,
> +&numa_reserved_meminfo);
> + bi->end = high;
> + }
> +
> + /* and there's no empty block */
> + if (bi->start >= bi->end)
> + numa_remove_memblk_from(i--, mi);
> + }
> +
> + /* merge neighboring / overlapping entries */
> + for (i = 0; i < mi->nr_blks; i++) {
> + struct numa_memblk *bi = &mi->blk[i];
> +
> + for (j = i + 1; j < mi->nr_blks; j++) {
> + struct numa_memblk *bj = &mi->blk[j];
> + u64 start, end;
> +
> + /*
> +  * See whether there are overlapping blocks.  Whine
> +  * about but allow overlaps of the same nid.  They
> +  * will be merged below.
> +  */
> + if (bi->end > bj->start && bi->start < bj->end) {
> + if (bi->nid != bj->nid) {
> + pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n",
> +bi->nid, bi->start, bi->end - 1,
> +bj->nid, bj->start, bj->end - 1);
> + return -EINVAL;
> + }
> + pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n",
> + bi->nid, bi->start, bi->end - 1,
> + bj->start, bj->end - 1);
> + }
> +
> + /*
> +  * Join together blocks on the same node, holes
> +  * between which don't overlap with memory on other
> +  * nodes.
> +  */
> + if (bi->nid != bj->nid)
> + continue;
> + start = min(bi->start, bj->start);
> + end = max(bi->end, bj->end);
> + for (k = 0; k < mi->nr_blks; k++) {
> + struct numa_memblk *bk = &mi->blk[k];
> +
> + if (bi->nid == bk->nid)
> + continue;
> + if (start < bk->end && end > bk->start)
> + break;
> + }
> + if (k < mi->nr_blks)
> + continue;
> + pr_info("NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n",
> +bi->nid, bi->start, bi->end - 1, bj->start,
> +bj->end - 1, start, end - 1);
> + bi->start = start;
> + bi->end = end;
> + numa_remove_memblk_from(j--, mi);
> + }
> + }
> +
> + /* clear unused ones */
> + for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) {
> + mi->blk[i].start = mi->blk[i].end = 0;
> +  

Re: [PATCH 16/17] arch_numa: switch over to numa_memblks

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:45 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> Until now arch_numa was directly translating firmware NUMA information
> to memblock.
> 
> Using numa_memblks as an intermediate step has a few advantages:
> * alignment with more battle tested x86 implementation
> * availability of NUMA emulation
> * maintaining node information for not yet populated memory
> 
> Replace current functionality related to numa_add_memblk() and
> __node_distance() with the implementation based on numa_memblks and add
> functions required by numa_emulation.
> 
> Signed-off-by: Mike Rapoport (Microsoft) 

One trivial comment inline,

Jonathan
>  /*
>   * Initialize NODE_DATA for a node on the local memory
>   */
> @@ -226,116 +204,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>   NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn;
>  }

>  
> @@ -454,3 +321,54 @@ void __init arch_numa_init(void)
>  
>   numa_init(dummy_numa_init);
>  }
> +
> +#ifdef CONFIG_NUMA_EMU
> +void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
> + unsigned int nr_emu_nids)
> +{
> + int i, j;
> +
> + /*
> +  * Transform __apicid_to_node table to use emulated nids by

Comment needs an update seeing as there is no __apicid_to_node table
here.

> +  * reverse-mapping phys_nid.  The maps should always exist but fall
> +  * back to zero just in case.
> +  */
> + for (i = 0; i < ARRAY_SIZE(cpu_to_node_map); i++) {
> + if (cpu_to_node_map[i] == NUMA_NO_NODE)
> + continue;
> + for (j = 0; j < nr_emu_nids; j++)
> + if (cpu_to_node_map[i] == emu_nid_to_phys[j])
> + break;
> + cpu_to_node_map[i] = j < nr_emu_nids ? j : 0;
> + }
> +}
> +
> +u64 __init numa_emu_dma_end(void)
> +{
> + return PFN_PHYS(memblock_start_of_DRAM() + SZ_4G);
> +}
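
For readers following the comment discussion above, here is a minimal userspace sketch of the reverse-mapping loop in numa_emu_update_cpu_to_node(). It is only an illustration: remap_to_emulated() and NO_NODE are hypothetical names standing in for the kernel function and NUMA_NO_NODE, and the table is passed in rather than being the global cpu_to_node_map.

```c
#include <stddef.h>

#define NO_NODE (-1)	/* hypothetical stand-in for NUMA_NO_NODE */

/*
 * Rewrite each entry of map[] (a physical node id) to the index of the
 * first emulated node backed by that physical node.  Entries without a
 * match fall back to 0, mirroring the quoted loop.
 */
static void remap_to_emulated(int *map, size_t nr_cpus,
			      const int *emu_nid_to_phys,
			      unsigned int nr_emu_nids)
{
	size_t i;
	unsigned int j;

	for (i = 0; i < nr_cpus; i++) {
		if (map[i] == NO_NODE)
			continue;	/* CPU has no node assigned, leave it */
		for (j = 0; j < nr_emu_nids; j++)
			if (map[i] == emu_nid_to_phys[j])
				break;
		map[i] = j < nr_emu_nids ? (int)j : 0;
	}
}
```

Nothing in the loop depends on the x86 __apicid_to_node table, which is why the comment wording is the only thing that needs to change.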



Re: [PATCH 15/17] mm: make numa_memblks more self-contained

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:44 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> Introduce numa_memblks_init() and move some code around to avoid several
> global variables in numa_memblks.

Hi Mike,

Adding the effectively always-on memblock_force_top_down
deserves a comment on why. I assume it is because you are going to do
something with it later?

There also seems to be more going on in here, such as the change to
get_pfn_range_for_nid(). Perhaps break this up so each
change can have an explanation.


> 
> Signed-off-by: Mike Rapoport (Microsoft) 
> ---
>  arch/x86/mm/numa.c   | 53 -
>  include/linux/numa_memblks.h |  9 +
>  mm/numa_memblks.c| 77 +++-
>  3 files changed, 68 insertions(+), 71 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 3848e68d771a..16bc703c9272 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -115,30 +115,19 @@ void __init setup_node_to_cpumask_map(void)
>   pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
>  }
>  
> -static int __init numa_register_memblks(struct numa_meminfo *mi)
> +static int __init numa_register_nodes(void)
>  {
> - int i, nid, err;
> -
> - err = numa_register_meminfo(mi);
> - if (err)
> - return err;
> + int nid;
>  
>   if (!memblock_validate_numa_coverage(SZ_1M))
>   return -EINVAL;
>  
>   /* Finally register nodes. */
>   for_each_node_mask(nid, node_possible_map) {
> - u64 start = PFN_PHYS(max_pfn);
> - u64 end = 0;
> -
> - for (i = 0; i < mi->nr_blks; i++) {
> - if (nid != mi->blk[i].nid)
> - continue;
> - start = min(mi->blk[i].start, start);
> - end = max(mi->blk[i].end, end);
> - }
> + unsigned long start_pfn, end_pfn;
>  
> - if (start >= end)
> + get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);

It's not immediately obvious to me that this code is equivalent so I'd
prefer it in a separate patch with some description of why
it is a valid change.
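
To make the equivalence question concrete, here is a rough userspace sketch of what the removed loop computes: the span of a node taken directly from the numa_meminfo blocks. struct blk and node_span() are hypothetical stand-ins, and limit plays the role of PFN_PHYS(max_pfn). The replacement, get_pfn_range_for_nid(), derives the range from memblock instead, so the two agree only if memblock's node assignments already reflect the same blocks.

```c
#include <stdint.h>
#include <stddef.h>

struct blk {			/* simplified stand-in for struct numa_memblk */
	uint64_t start;
	uint64_t end;
	int nid;
};

/*
 * Compute the [start, end) span covered by all blocks of @nid, the way
 * the removed loop did.  Returns 1 if the node has memory, 0 if the
 * resulting span is empty (start >= end).
 */
static int node_span(const struct blk *blks, size_t nr, int nid,
		     uint64_t limit, uint64_t *start, uint64_t *end)
{
	uint64_t s = limit, e = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (blks[i].nid != nid)
			continue;
		if (blks[i].start < s)
			s = blks[i].start;
		if (blks[i].end > e)
			e = blks[i].end;
	}
	*start = s;
	*end = e;
	return s < e;
}
```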

> + if (start_pfn >= end_pfn)
>   continue;
>  
>   alloc_node_data(nid);
> @@ -178,39 +167,11 @@ static int __init numa_init(int (*init_func)(void))
>   for (i = 0; i < MAX_LOCAL_APIC; i++)
>   set_apicid_to_node(i, NUMA_NO_NODE);
>  
> - nodes_clear(numa_nodes_parsed);
> - nodes_clear(node_possible_map);
> - nodes_clear(node_online_map);
> - memset(&numa_meminfo, 0, sizeof(numa_meminfo));
> - WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
> -   NUMA_NO_NODE));
> - WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
> -   NUMA_NO_NODE));
> - /* In case that parsing SRAT failed. */
> - WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
> - numa_reset_distance();
> -
> - ret = init_func();
> - if (ret < 0)
> - return ret;
> -
> - /*
> -  * We reset memblock back to the top-down direction
> -  * here because if we configured ACPI_NUMA, we have
> -  * parsed SRAT in init_func(). It is ok to have the
> -  * reset here even if we did't configure ACPI_NUMA
> -  * or acpi numa init fails and fallbacks to dummy
> -  * numa init.
> -  */
> - memblock_set_bottom_up(false);
> -
> - ret = numa_cleanup_meminfo(&numa_meminfo);
> + ret = numa_memblks_init(init_func, /* memblock_force_top_down */ true);
The comment in the parameter list seems unnecessary.
Maybe add a comment above the call instead if you need to call that out?

>   if (ret < 0)
>   return ret;
>  
> - numa_emulation(&numa_meminfo, numa_distance_cnt);
> -
> - ret = numa_register_memblks(&numa_meminfo);
> + ret = numa_register_nodes();
>   if (ret < 0)
>   return ret;
>  

> diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
> index e0039549aaac..640f3a3ce0ee 100644
> --- a/mm/numa_memblks.c
> +++ b/mm/numa_memblks.c
> @@ -7,13 +7,27 @@
>  #include 
>  #include 
>  

> +/*
> + * Set nodes, which have memory in @mi, in *@nodemask.
> + */
> +static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
> +   const struct numa_meminfo *mi)
> +{
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
> + if (mi->blk[i].start != mi->blk[i].end &&
> + mi->blk[i].nid != NUMA_NO_NODE)
> + node_set(mi->blk[i].nid, *nodemask);
> +}

The code move doesn't have an obvious purpose. Maybe call that
out in the patch description if it is needed for a future patch.
Or do it in two steps, so the first just adds the static and the second
shuffles the code.

>  
>  /**
>   * numa_reset_distance - Reset NUMA distance table
> @@ -287,20 +301,6 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
>   

Re: [PATCH 13/17] mm: move numa_distance and related code from x86 to numa_memblks

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:42 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> Move code dealing with numa_distance array from arch/x86 to
> mm/numa_memblks.c

It's not really numa memblock related. Is this the best place
to put it?

> 
> This code will be later reused by arch_numa.
> 
> No functional changes.
> 
> Signed-off-by: Mike Rapoport (Microsoft) 



Re: [PATCH 10/17] x86/numa_emu: use a helper function to get MAX_DMA32_PFN

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:39 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> This is required to make numa emulation code architecture independent so
> that it can be moved to generic code in following commits.
> 
> Signed-off-by: Mike Rapoport (Microsoft) 
Reviewed-by: Jonathan Cameron 


Re: [PATCH 09/17] x86/numa_emu: split __apicid_to_node update to a helper function

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:38 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> This is required to make numa emulation code architecture independent so
> that it can be moved to generic code in following commits.
> 
> Signed-off-by: Mike Rapoport (Microsoft) 

Not the most intuitive of function names but I can't immediately
think of a better one.

Reviewed-by: Jonathan Cameron 



Re: [PATCH 08/17] x86/numa_emu: simplify allocation of phys_dist

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:37 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> By the time numa_emulation() is called, all physical memory is already
> mapped in the direct map and there is no need to define limits for
> memblock allocation.
> 
> Replace memblock_phys_alloc_range() with memblock_alloc().
> 
> Signed-off-by: Mike Rapoport (Microsoft) 
Indeed seems to be after mapping physical memory, so this looks fine.
Reviewed-by: Jonathan Cameron 


Re: [PATCH 07/17] x86/numa: move FAKE_NODE_* defines to numa_emu

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:36 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> The definitions of FAKE_NODE_MIN_SIZE and FAKE_NODE_MIN_HASH_MASK are
> only used by numa emulation code, make them local to
> arch/x86/mm/numa_emulation.c
> 
> Signed-off-by: Mike Rapoport (Microsoft) 

Reviewed-by: Jonathan Cameron 


Re: [PATCH 06/17] x86/numa: simplify numa_distance allocation

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:35 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> Allocation of numa_distance uses memblock_phys_alloc_range() to limit
> allocation to be below the last mapped page.
> 
> But NUMA initializaition runs after the direct map is populated and

initialization (one too many 'i's)

> there is also code in setup_arch() that adjusts memblock limit to
> reflect how much memory is already mapped in the direct map.
> 
> Simplify the allocation of numa_distance and use plain memblock_alloc().
> This makes the code clearer and ensures that when numa_distance is not
> allocated it is always NULL.
Doesn't this break the comment in numa_set_distance() kernel-doc?
"
 * If such table cannot be allocated, a warning is printed and further
 * calls are ignored until the distance table is reset with
 * numa_reset_distance().
"

Superficially that looks to be to avoid repeatedly hitting the
singleton bit at the top of numa_set_distance() as SRAT or similar
parsing occurs.
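
A small userspace sketch of the behavioural difference being asked about, under simplifying assumptions: table stands in for numa_distance, try_alloc() is a hypothetical allocator that can be forced to fail, and the two set_distance_*() helpers model only the "allocate on first use" check at the top of numa_set_distance(), not the real function.

```c
#include <stdlib.h>

static unsigned char *table;		/* stand-in for numa_distance */
static int alloc_attempts;		/* how often the allocator was called */
static int alloc_should_fail = 1;	/* knob: simulate allocation failure */

#define ALLOC_FAILED ((unsigned char *)1)	/* "don't retry" marker */

static unsigned char *try_alloc(size_t size)
{
	alloc_attempts++;
	return alloc_should_fail ? NULL : calloc(1, size);
}

/* Old scheme: a failed allocation is remembered until the table is reset. */
static int set_distance_with_sentinel(void)
{
	if (!table) {
		table = try_alloc(16);
		if (!table) {
			table = ALLOC_FAILED;
			return -1;
		}
	}
	if (table == ALLOC_FAILED)
		return -1;	/* further calls are ignored */
	return 0;
}

/* New scheme: the pointer stays NULL, so every call retries the allocation. */
static int set_distance_plain(void)
{
	if (!table) {
		table = try_alloc(16);
		if (!table)
			return -1;
	}
	return 0;
}

static void reset_table(void)
{
	if (table && table != ALLOC_FAILED)
		free(table);
	table = NULL;
	alloc_attempts = 0;
}
```

With a failing allocator, three calls under the sentinel scheme reach the allocator once, while three calls under the plain NULL scheme reach it three times. If the kernel code behaves like this model, the quoted kernel-doc sentence would no longer describe what happens.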

> 
> Signed-off-by: Mike Rapoport (Microsoft) 
> ---
>  arch/x86/mm/numa.c | 12 +++-
>  1 file changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 5e1dde26674b..ab2d4ecef786 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -319,8 +319,7 @@ void __init numa_reset_distance(void)
>  {
>   size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]);
>  
> - /* numa_distance could be 1LU marking allocation failure, test cnt */
> - if (numa_distance_cnt)
> + if (numa_distance)
>   memblock_free(numa_distance, size);
>   numa_distance_cnt = 0;
>   numa_distance = NULL;   /* enable table creation */
> @@ -331,7 +330,6 @@ static int __init numa_alloc_distance(void)
>   nodemask_t nodes_parsed;
>   size_t size;
>   int i, j, cnt = 0;
> - u64 phys;
>  
>   /* size the new table and allocate it */
>   nodes_parsed = numa_nodes_parsed;
> @@ -342,16 +340,12 @@ static int __init numa_alloc_distance(void)
>   cnt++;
>   size = cnt * cnt * sizeof(numa_distance[0]);
>  
> - phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0,
> -  PFN_PHYS(max_pfn_mapped));
> - if (!phys) {
> + numa_distance = memblock_alloc(size, PAGE_SIZE);
> + if (!numa_distance) {
>   pr_warn("Warning: can't allocate distance table!\n");
> - /* don't retry until explicitly reset */
> - numa_distance = (void *)1LU;
>   return -ENOMEM;
>   }
>  
> - numa_distance = __va(phys);
>   numa_distance_cnt = cnt;
>  
>   /* fill with the default distances */



Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:34 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> Architectures that support NUMA duplicate the code that allocates
> NODE_DATA on the node-local memory with slight variations in reporting
> of the addresses where the memory was allocated.
> 
> Use x86 version as the basis for the generic alloc_node_data() function
> and call this function in architecture specific numa initialization.
> 
> Signed-off-by: Mike Rapoport (Microsoft) 


I've no idea what the rules are for the sparc prom_printf() calls, but given
that file already has a mix and match of those and normal prints in
single functions, I assume this change is fine and we'll just
see the prints a bit later.

Reviewed-by: Jonathan Cameron 



Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code

2024-07-19 Thread Jonathan Cameron
On Fri, 19 Jul 2024 17:07:35 +0200
David Hildenbrand  wrote:

> >>> -  * Allocate node data.  Try node-local memory and then any node.
> >>> -  * Never allocate in DMA zone.
> >>> -  */
> >>> - nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> >>> - if (!nd_pa) {
> >>> - pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
> >>> -nd_size, nid);
> >>> - return;
> >>> - }
> >>> - nd = __va(nd_pa);
> >>> -
> >>> - /* report and initialize */
> >>> - printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
> >>> -nd_pa, nd_pa + nd_size - 1);
> >>> - tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> >>> - if (tnid != nid)
> >>> - printk(KERN_INFO "NODE_DATA(%d) on node %d\n", nid, tnid);
> >>> -
> >>> - node_data[nid] = nd;
> >>> - memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> >>> -
> >>> - node_set_online(nid);
> >>> -}
> >>> -
> >>>/**
> >>> * numa_cleanup_meminfo - Cleanup a numa_meminfo
> >>> * @mi: numa_meminfo to clean up
> >>> @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> >>>   continue;
> >>>   alloc_node_data(nid);
> >>> + node_set_online(nid);
> >>>   }  
> >>
> >> I can spot that we only remove a single node_set_online() call from x86.
> >>
> >> What about all the other architectures? Will there be any change in behavior
> >> for them? Or do we simply set the nodes online later once more?
> > 
> > On x86 node_set_online() was a part of alloc_node_data() and I moved it
> > outside so it's called right after alloc_node_data(). On other
> > architectures the allocation didn't include that call, so there should be
> > no difference there.  
> 
> But won't their arch code try setting the nodes online at a later stage?
> 
> And I think, some architectures only set nodes online conditionally
> (see most other node_set_online() calls).
> 
> Sorry if I'm confused here, but with now unconditional node_set_online(), won't
> we change the behavior of other architectures?
This is moving x86 code to x86 code, not to a generic location,
so how would that affect anyone else? Their onlining should be the same as
before.

The node onlining differences are a pain (I recall that fun from adding
generic initiators), as the ordering differs between x86 and arm64 at least.

Jonathan

> 



Re: [PATCH 04/17] arch, mm: move definition of node_data to generic code

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:33 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> Every architecture that supports NUMA defines node_data in the same way:
> 
>   struct pglist_data *node_data[MAX_NUMNODES];
> 
> No reason to keep multiple copies of this definition and its forward
> declarations, especially when such forward declaration is the only thing
> in include/asm/mmzone.h for many architectures.
> 
> Add definition and declaration of node_data to generic code and drop
> architecture-specific versions.
> 
> Signed-off-by: Mike Rapoport (Microsoft) 
Reviewed-by: Jonathan Cameron 



Re: [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:32 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> Make definition of node_data match other architectures.
> This will allow pulling declaration of node_data to the generic mm code in
> the following commit.
> 
> Signed-off-by: Mike Rapoport (Microsoft) 
FWIW rename looks fine
Reviewed-by: Jonathan Cameron 


Re: [PATCH 02/17] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures

2024-07-19 Thread Jonathan Cameron
On Wed, 17 Jul 2024 16:32:59 +0200
David Hildenbrand  wrote:

> On 16.07.24 13:13, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" 
> > 
> > sgi-ip27 is the only system that defines NODE_DATA() differently than
> > the rest of NUMA machines.
> > 
> > Add node_data array of struct pglist pointers that will point to
> > __node_data[node]->pglist and redefine NODE_DATA() to use node_data
> > array.
> > 
> > This will allow pulling declaration of node_data to the generic mm code
> > in the next commit.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) 
> > ---
> >   arch/mips/include/asm/mach-ip27/mmzone.h | 5 -
> >   arch/mips/sgi-ip27/ip27-memory.c | 5 -
> >   2 files changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/mips/include/asm/mach-ip27/mmzone.h b/arch/mips/include/asm/mach-ip27/mmzone.h
> > index 08c36e50a860..629c3f290203 100644
> > --- a/arch/mips/include/asm/mach-ip27/mmzone.h
> > +++ b/arch/mips/include/asm/mach-ip27/mmzone.h
> > @@ -22,7 +22,10 @@ struct node_data {
> >   
> >   extern struct node_data *__node_data[];
> >   
> > -#define NODE_DATA(n)   (&__node_data[(n)]->pglist)
> >   #define hub_data(n)   (&__node_data[(n)]->hub)
> >   
> > +extern struct pglist_data *node_data[];
> > +
> > +#define NODE_DATA(nid) (node_data[nid])
> > +
> >   #endif /* _ASM_MACH_MMZONE_H */
> > diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
> > index b8ca94cfb4fe..c30ef6958b97 100644
> > --- a/arch/mips/sgi-ip27/ip27-memory.c
> > +++ b/arch/mips/sgi-ip27/ip27-memory.c
> > @@ -34,8 +34,10 @@
> >   #define SLOT_PFNSHIFT (SLOT_SHIFT - PAGE_SHIFT)
> >   #define PFN_NASIDSHFT (NASID_SHFT - PAGE_SHIFT)
> >   
> > -struct node_data *__node_data[MAX_NUMNODES];
> > +struct pglist_data *node_data[MAX_NUMNODES];
> > +EXPORT_SYMBOL(node_data);
> >   
> > +struct node_data *__node_data[MAX_NUMNODES];
> >   EXPORT_SYMBOL(__node_data);
> >   
> >   static u64 gen_region_mask(void)
> > @@ -361,6 +363,7 @@ static void __init node_mem_init(nasid_t node)
> >  */
> > __node_data[node] = __va(slot_freepfn << PAGE_SHIFT);
> > memset(__node_data[node], 0, PAGE_SIZE);
> > +   node_data[node] = &__node_data[node]->pglist;
> >   
> > NODE_DATA(node)->node_start_pfn = start_pfn;
> > NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;  
> 
> I was assuming we could get rid of __node_data->pglist.
> 
> But now I am confused where that is actually set.

It looks nasty... arch_refresh_nodedata() takes the
incoming pg_data_t * and casts it to the local version of
struct node_data *, which I think is this one:

struct node_data {
struct pglist_data pglist; (which is pg_data_t pglist)
struct hub_data hub;
};

https://elixir.bootlin.com/linux/v6.10/source/arch/mips/sgi-ip27/ip27-memory.c#L432

Now, that pg_data_t is allocated by arch_alloc_nodedata(), which might be
fine (though the types could be handled in a more readable fashion via some
container_of() magic).
https://elixir.bootlin.com/linux/v6.10/source/arch/mips/sgi-ip27/ip27-memory.c#L427

However that call is:
pg_data_t * __init arch_alloc_nodedata(int nid)
{
return memblock_alloc(sizeof(pg_data_t), SMP_CACHE_BYTES);
}

So that doesn't seem to allocate enough space to me, as it should be
sizeof(struct node_data).

Worth cleaning up whilst here?  Proper handling of types would definitely
help.
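
A userspace sketch of the kind of type handling meant here, under loud assumptions: the struct bodies are placeholders rather than the real pglist_data and hub_data, and alloc_nodedata() and to_node_data() are hypothetical names, not existing ip27 functions. The point is only that the allocation covers the whole wrapper and the conversion back goes through container_of() rather than a bare cast.

```c
#include <stddef.h>
#include <stdlib.h>

/* Placeholder layouts; the real kernel structures are far larger. */
struct pglist_data { long node_start_pfn; long node_spanned_pages; };
struct hub_data { long kern_vars[4]; };

struct node_data {
	struct pglist_data pglist;
	struct hub_data hub;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Allocate the whole wrapper, but hand out the embedded pglist_data. */
static struct pglist_data *alloc_nodedata(void)
{
	struct node_data *nd = calloc(1, sizeof(*nd));

	return nd ? &nd->pglist : NULL;
}

/* Recover the wrapper from the embedded member. */
static struct node_data *to_node_data(struct pglist_data *pgdat)
{
	return container_of(pgdat, struct node_data, pglist);
}
```

Because the allocation is sized by the wrapper, accessing the hub member through the recovered pointer stays within the allocated object, which is exactly what sizeof(pg_data_t) would not guarantee.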

Jonathan


> 
> Anyhow
> 
> Reviewed-by: David Hildenbrand 
> 



Re: [PATCH 01/17] mm: move kernel/numa.c to mm/

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:30 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> The stub functions in kernel/numa.c belong to mm/ rather than to kernel/
> 
> Signed-off-by: Mike Rapoport (Microsoft) 

Makes sense + all arch specific implementations are in arch/*/mm not
arch/*/kernel so this makes it more consistent with that.

Reviewed-by: Jonathan Cameron 



Re: [PATCH 00/17] mm: introduce numa_memblks

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:29 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> Hi,
> 
> Following the discussion about handling of CXL fixed memory windows on
> arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to
> the generic code so they will be available on arm64/riscv and maybe on
> loongarch sometime later.
> 
> While it could be possible to use memblock to describe CXL memory windows,
> it currently lacks notion of unpopulated memory ranges and numa_memblks
> does implement this.
> 
> Another reason to make numa_memblks generic is that both arch_numa (arm64
> and riscv) and loongarch use trimmed copy of x86 code although there is no
> fundamental reason why the same code cannot be used on all these platforms.
> Having numa_memblks in mm/ will make it's interaction with ACPI and FDT
> more consistent and I believe will reduce maintenance burden.
> 
> And with generic numa_memblks it is (almost) straightforward to enable NUMA
> emulation on arm64 and riscv.
> 
> The first 5 commits in this series are cleanups that are not strictly
> related to numa_memblks.
> 
> Commits 6-11 slightly reorder code in x86 to allow extracting numa_memblks
> and NUMA emulation to the generic code.
> 
> Commits 12-14 actually move the code from arch/x86/ to mm/ and commit 15
> does some aftermath cleanups.
> 
> Commit 16 switches arch_numa to numa_memblks.
> 
> Commit 17 enables usage of phys_to_target_node() and
> memory_add_physaddr_to_nid() with numa_memblks.

Hi Mike,

I've lightly tested with emulated CXL + Generic Ports and Generic
Initiators as well as more normal cpus and memory via qemu on arm64 and it's
looking good.

From my earlier series, patch 4 is probably still needed to avoid
presenting nodes with nothing in them at boot (but not if we hotplug
memory then remove it again in which case they disappear)
https://lore.kernel.org/all/20240529171236.32002-5-jonathan.came...@huawei.com/
However that was broken/inconsistent before your rework so I can send that
patch separately. 

Thanks for getting this sorted!  I should get time to do more extensive
testing and review in next week or so.

Jonathan

> 
> [1] https://lore.kernel.org/all/20240529171236.32002-1-jonathan.came...@huawei.com/
> 
> Mike Rapoport (Microsoft) (17):
>   mm: move kernel/numa.c to mm/
>   MIPS: sgi-ip27: make NODE_DATA() the same as on all other
> architectures
>   MIPS: loongson64: rename __node_data to node_data
>   arch, mm: move definition of node_data to generic code
>   arch, mm: pull out allocation of NODE_DATA to generic code
>   x86/numa: simplify numa_distance allocation
>   x86/numa: move FAKE_NODE_* defines to numa_emu
>   x86/numa_emu: simplify allocation of phys_dist
>   x86/numa_emu: split __apicid_to_node update to a helper function
>   x86/numa_emu: use a helper function to get MAX_DMA32_PFN
>   x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
>   mm: introduce numa_memblks
>   mm: move numa_distance and related code from x86 to numa_memblks
>   mm: introduce numa_emulation
>   mm: make numa_memblks more self-contained
>   arch_numa: switch over to numa_memblks
>   mm: make range-to-target_node lookup facility a part of numa_memblks
> 
>  arch/arm64/include/asm/Kbuild |   1 +
>  arch/arm64/include/asm/mmzone.h   |  13 -
>  arch/arm64/include/asm/topology.h |   1 +
>  arch/loongarch/include/asm/Kbuild |   1 +
>  arch/loongarch/include/asm/mmzone.h   |  16 -
>  arch/loongarch/include/asm/topology.h |   1 +
>  arch/loongarch/kernel/numa.c  |  21 -
>  arch/mips/include/asm/mach-ip27/mmzone.h  |   1 -
>  .../mips/include/asm/mach-loongson64/mmzone.h |   4 -
>  arch/mips/loongson64/numa.c   |  20 +-
>  arch/mips/sgi-ip27/ip27-memory.c  |   2 +-
>  arch/powerpc/include/asm/mmzone.h |   6 -
>  arch/powerpc/mm/numa.c|  26 +-
>  arch/riscv/include/asm/Kbuild |   1 +
>  arch/riscv/include/asm/mmzone.h   |  13 -
>  arch/riscv/include/asm/topology.h |   4 +
>  arch/s390/include/asm/Kbuild  |   1 +
>  arch/s390/include/asm/mmzone.h|  17 -
>  arch/s390/kernel/numa.c   |   3 -
>  arch/sh/include/asm/mmzone.h  |   3 -
>  arch/sh/mm/init.c |   7 +-
>  arch/sh/mm/numa.c |   3 -
>  arch/sparc/include/asm/mmzone.h   |   4 -
>  arch/sparc/mm/init_64.c   |  11 +-
>  arch/x86/Kconfig  |   9 +-
>  arch/x86/include/asm/Kbuild   |   1 +
>  arch/x86/include/asm/mmzone.h |   6 -
>  arch/x86/include/asm/mmzone_32.h  |  17 -
>  arch/x86/include/asm/mmzone_64.h  |  18 -
>  arch/x86/include/asm/numa.h   |  24 +-
>  arch/x86/include/asm/sparsemem.h 

Re: [PATCH 12/20] iio: adc: ti_am335x_adc: convert to of_property_for_each_u32_new()

2024-07-03 Thread Jonathan Cameron
On Wed, 03 Jul 2024 12:36:56 +0200
Luca Ceresoli  wrote:

> Simplify code using of_property_for_each_u32_new() as the two additional
> parameters in of_property_for_each_u32() are not used here.
> 
> Signed-off-by: Luca Ceresoli 
Acked-by: Jonathan Cameron 


Re: [PATCH v4 2/3] PCI/AER: Print UNCOR_STATUS bits that might be ANFE

2024-06-06 Thread Jonathan Cameron
On Thu,  9 May 2024 16:48:32 +0800
Zhenzhong Duan  wrote:

> When an Advisory Non-Fatal error(ANFE) triggers, both correctable error(CE)
> status and ANFE related uncorrectable error(UE) status will be printed:
> 
>   AER: Correctable error message received from :b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=2000/
>  [13] NonFatalErr
> Uncorrectable errors that may cause Advisory Non-Fatal:
>  [18] TLP
> 
> Tested-by: Yudong Wang 
> Co-developed-by: "Wang, Qingshun" 
> Signed-off-by: "Wang, Qingshun" 
> Signed-off-by: Zhenzhong Duan 
Reviewed-by: Jonathan Cameron 


Re: [PATCH v4 1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info

2024-06-06 Thread Jonathan Cameron
On Thu,  9 May 2024 16:48:31 +0800
Zhenzhong Duan  wrote:

> In some cases the detector of a Non-Fatal Error(NFE) is not the most
> appropriate agent to determine the type of the error. For example,
> when software performs a configuration read from a non-existent
> device or Function, completer will send an ERR_NONFATAL Message.
> On some platforms, ERR_NONFATAL results in a System Error, which
> breaks normal software probing.
> 
> Advisory Non-Fatal Error(ANFE) is a special case that can be used
> in above scenario. It is predominantly determined by the role of the
> detecting agent (Requester, Completer, or Receiver) and the specific
> error. In such cases, an agent with AER signals the NFE (if enabled)
> by sending an ERR_COR Message as an advisory to software, instead of
> sending ERR_NONFATAL.
> 
> When processing an ANFE, ideally both correctable error(CE) status and
> uncorrectable error(UE) status should be cleared. However, there is no
> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
> NFE will reproduce above mentioned issue, i.e., breaking software probing;
> treating NFE as ANFE will make us ignore some UEs which need active
> recovery operation. To avoid clearing UEs that are not ANFE by accident,
> the most conservative route is taken here: If any of the NFE Detected
> bits is set in Device Status, do not touch UE status, they should be
> cleared later by the UE handler. Otherwise, a specific set of UEs that
> may be raised as ANFE according to the PCIe specification will be cleared
> if their corresponding severity is Non-Fatal.
> 
> To achieve above purpose, store UNCOR_STATUS bits that might be ANFE
> in aer_err_info.anfe_status. So that those bits could be printed and
> processed later.
> 
> Tested-by: Yudong Wang 
> Co-developed-by: "Wang, Qingshun" 
> Signed-off-by: "Wang, Qingshun" 
> Signed-off-by: Zhenzhong Duan 

Not my most confident review ever as this is nasty and gives
me a headache but your description is good and I think the
implementation looks reasonable.

Reviewed-by: Jonathan Cameron 




Re: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE

2024-06-06 Thread Jonathan Cameron
On Thu,  9 May 2024 16:48:33 +0800
Zhenzhong Duan  wrote:

> When processing an ANFE, ideally both correctable error(CE) status and
> uncorrectable error(UE) status should be cleared. However, there is no
> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
> NFE will bring some issues, i.e., breaking software probing; treating
> NFE as ANFE will make us ignore some UEs which need active recovery
> operation. To avoid clearing UEs that are not ANFE by accident, the
> most conservative route is taken here: If any of the NFE Detected bits
> is set in Device Status, do not touch UE status, they should be cleared
> later by the UE handler. Otherwise, a specific set of UEs that may be
> raised as ANFE according to the PCIe specification will be cleared if
> their corresponding severity is Non-Fatal.
> 
> For instance, previously when kernel receives an ANFE with Poisoned TLP
> in OS native AER mode, only status of CE will be reported and cleared:
> 
>   AER: Correctable error message received from :b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=2000/
>  [13] NonFatalErr
> 
> If the kernel receives a Malformed TLP after that, two UEs will be
> reported, which is unexpected. Malformed TLP Header is lost since
> the previous ANFE gated the TLP header logs:
> 
>   PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00041000/00180020
>  [12] TLP(First)
>  [18] MalfTLP
> 
> Now, for the same scenario, both CE status and related UE status will be
> reported and cleared after ANFE:
> 
>   AER: Correctable error message received from :b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=2000/
>  [13] NonFatalErr
> Uncorrectable errors that may cause Advisory Non-Fatal:
>  [18] TLP
> 
> Tested-by: Yudong Wang 
> Co-developed-by: "Wang, Qingshun" 
> Signed-off-by: "Wang, Qingshun" 
> Signed-off-by: Zhenzhong Duan 

Reviewed-by: Jonathan Cameron 

This is nasty enough though that it would benefit from more review
if possible.  

Thanks for all the detailed explanations in the patch descriptions,
that made it less painful than it might have been.

Jonathan




Re: [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info

2024-05-01 Thread Jonathan Cameron
On Sun, 28 Apr 2024 03:31:11 +
"Duan, Zhenzhong"  wrote:

> Hi Jonathan,
> 
> >-Original Message-----
> >From: Jonathan Cameron 
> >Subject: Re: [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that might
> >be ANFE in aer_err_info
> >
> >On Tue, 23 Apr 2024 02:25:05 +
> >"Duan, Zhenzhong"  wrote:
> >  
> >> >-Original Message-
> >> >From: Jonathan Cameron 
> >> >Subject: Re: [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that  
> >might  
> >> >be ANFE in aer_err_info
> >> >
> >> >On Wed, 17 Apr 2024 14:14:05 +0800
> >> >Zhenzhong Duan  wrote:
> >> >  
> >> >> In some cases the detector of a Non-Fatal Error(NFE) is not the most
> >> >> appropriate agent to determine the type of the error. For example,
> >> >> when software performs a configuration read from a non-existent
> >> >> device or Function, completer will send an ERR_NONFATAL Message.
> >> >> On some platforms, ERR_NONFATAL results in a System Error, which
> >> >> breaks normal software probing.
> >> >>
> >> >> Advisory Non-Fatal Error(ANFE) is a special case that can be used
> >> >> in above scenario. It is predominantly determined by the role of the
> >> >> detecting agent (Requester, Completer, or Receiver) and the specific
> >> >> error. In such cases, an agent with AER signals the NFE (if enabled)
> >> >> by sending an ERR_COR Message as an advisory to software, instead of
> >> >> sending ERR_NONFATAL.
> >> >>
> >> >> When processing an ANFE, ideally both correctable error(CE) status and
> >> >> uncorrectable error(UE) status should be cleared. However, there is no
> >> >> way to fully identify the UE associated with ANFE. Even worse, a Fatal
> >> >> Error(FE) or Non-Fatal Error(NFE) may set the same UE status bit as
> >> >> ANFE. Treating an ANFE as NFE will reproduce above mentioned issue,
> >> >> i.e., breaking software probing; treating NFE as ANFE will make us
> >> >> ignoring some UEs which need active recover operation. To avoid  
> >clearing  
> >> >> UEs that are not ANFE by accident, the most conservative route is taken
> >> >> here: If any of the FE/NFE Detected bits is set in Device Status, do not
> >> >> touch UE status, they should be cleared later by the UE handler.  
> >Otherwise,  
> >> >> a specific set of UEs that may be raised as ANFE according to the PCIe
> >> >> specification will be cleared if their corresponding severity is 
> >> >> Non-Fatal.
> >> >>
> >> >> To achieve above purpose, store UNCOR_STATUS bits that might be  
> >ANFE  
> >> >> in aer_err_info.anfe_status. So that those bits could be printed and
> >> >> processed later.
> >> >>
> >> >> Tested-by: Yudong Wang 
> >> >> Co-developed-by: "Wang, Qingshun" 
> >> >> Signed-off-by: "Wang, Qingshun" 
> >> >> Signed-off-by: Zhenzhong Duan 
> >> >> ---
> >> >>  drivers/pci/pci.h  |  1 +
> >> >>  drivers/pci/pcie/aer.c | 45  
> >> >++  
> >> >>  2 files changed, 46 insertions(+)
> >> >>
> >> >> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> >> >> index 17fed1846847..3f9eb807f9fd 100644
> >> >> --- a/drivers/pci/pci.h
> >> >> +++ b/drivers/pci/pci.h
> >> >> @@ -412,6 +412,7 @@ struct aer_err_info {
> >> >>
> >> >> unsigned int status;/* COR/UNCOR Error Status */
> >> >> unsigned int mask;  /* COR/UNCOR Error Mask */
> >> >> +   unsigned int anfe_status;   /* UNCOR Error Status for ANFE 
> >> >> */
> >> >> struct pcie_tlp_log tlp;/* TLP Header */
> >> >>  };
> >> >>
> >> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> >> index ac6293c24976..27364ab4b148 100644
> >> >> --- a/drivers/pci/pcie/aer.c
> >> >> +++ b/drivers/pci/pcie/aer.c
> >> >> @@ -107,6 +107,12 @@ struct aer_stats {
> >> >> PCI_ERR_ROOT_MULTI_COR_RCV |  
> >> >  \  

Re: [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info

2024-04-26 Thread Jonathan Cameron
On Tue, 23 Apr 2024 02:25:05 +
"Duan, Zhenzhong"  wrote:

> >-Original Message-
> >From: Jonathan Cameron 
> >Subject: Re: [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that might
> >be ANFE in aer_err_info
> >
> >On Wed, 17 Apr 2024 14:14:05 +0800
> >Zhenzhong Duan  wrote:
> >  
> >> In some cases the detector of a Non-Fatal Error(NFE) is not the most
> >> appropriate agent to determine the type of the error. For example,
> >> when software performs a configuration read from a non-existent
> >> device or Function, completer will send an ERR_NONFATAL Message.
> >> On some platforms, ERR_NONFATAL results in a System Error, which
> >> breaks normal software probing.
> >>
> >> Advisory Non-Fatal Error(ANFE) is a special case that can be used
> >> in above scenario. It is predominantly determined by the role of the
> >> detecting agent (Requester, Completer, or Receiver) and the specific
> >> error. In such cases, an agent with AER signals the NFE (if enabled)
> >> by sending an ERR_COR Message as an advisory to software, instead of
> >> sending ERR_NONFATAL.
> >>
> >> When processing an ANFE, ideally both correctable error(CE) status and
> >> uncorrectable error(UE) status should be cleared. However, there is no
> >> way to fully identify the UE associated with ANFE. Even worse, a Fatal
> >> Error(FE) or Non-Fatal Error(NFE) may set the same UE status bit as
> >> ANFE. Treating an ANFE as NFE will reproduce above mentioned issue,
> >> i.e., breaking software probing; treating NFE as ANFE will make us
> >> ignoring some UEs which need active recover operation. To avoid clearing
> >> UEs that are not ANFE by accident, the most conservative route is taken
> >> here: If any of the FE/NFE Detected bits is set in Device Status, do not
> >> touch UE status, they should be cleared later by the UE handler. Otherwise,
> >> a specific set of UEs that may be raised as ANFE according to the PCIe
> >> specification will be cleared if their corresponding severity is Non-Fatal.
> >>
> >> To achieve above purpose, store UNCOR_STATUS bits that might be ANFE
> >> in aer_err_info.anfe_status. So that those bits could be printed and
> >> processed later.
> >>
> >> Tested-by: Yudong Wang 
> >> Co-developed-by: "Wang, Qingshun" 
> >> Signed-off-by: "Wang, Qingshun" 
> >> Signed-off-by: Zhenzhong Duan 
> >> ---
> >>  drivers/pci/pci.h  |  1 +
> >>  drivers/pci/pcie/aer.c | 45  
> >++  
> >>  2 files changed, 46 insertions(+)
> >>
> >> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> >> index 17fed1846847..3f9eb807f9fd 100644
> >> --- a/drivers/pci/pci.h
> >> +++ b/drivers/pci/pci.h
> >> @@ -412,6 +412,7 @@ struct aer_err_info {
> >>
> >>unsigned int status;/* COR/UNCOR Error Status */
> >>unsigned int mask;  /* COR/UNCOR Error Mask */
> >> +  unsigned int anfe_status;   /* UNCOR Error Status for ANFE */
> >>struct pcie_tlp_log tlp;/* TLP Header */
> >>  };
> >>
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index ac6293c24976..27364ab4b148 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -107,6 +107,12 @@ struct aer_stats {
> >>PCI_ERR_ROOT_MULTI_COR_RCV |  
> > \  
> >>PCI_ERR_ROOT_MULTI_UNCOR_RCV)
> >>
> >> +#define AER_ERR_ANFE_UNC_MASK  
> > (PCI_ERR_UNC_POISON_TLP |   \  
> >> +  PCI_ERR_UNC_COMP_TIME |  
> > \  
> >> +  PCI_ERR_UNC_COMP_ABORT |  
> > \  
> >> +  PCI_ERR_UNC_UNX_COMP |  
> > \  
> >> +  PCI_ERR_UNC_UNSUP)
> >> +
> >>  static int pcie_aer_disable;
> >>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
> >>
> >> @@ -1196,6 +1202,41 @@ void aer_recover_queue(int domain, unsigned  
> >int bus, unsigned int devfn,  
> >>  EXPORT_SYMBOL_GPL(aer_recover_queue);
> >>  #endif
> >>
> >> +static void anfe_get_uc_status(struct pci_dev *dev, struct aer_err_info  
> >*info)  
> >> +{
> >> +  u32 uncor_mask, uncor_status;
> 

Re: [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info

2024-04-22 Thread Jonathan Cameron
On Wed, 17 Apr 2024 14:14:05 +0800
Zhenzhong Duan  wrote:

> In some cases the detector of a Non-Fatal Error(NFE) is not the most
> appropriate agent to determine the type of the error. For example,
> when software performs a configuration read from a non-existent
> device or Function, completer will send an ERR_NONFATAL Message.
> On some platforms, ERR_NONFATAL results in a System Error, which
> breaks normal software probing.
> 
> Advisory Non-Fatal Error(ANFE) is a special case that can be used
> in above scenario. It is predominantly determined by the role of the
> detecting agent (Requester, Completer, or Receiver) and the specific
> error. In such cases, an agent with AER signals the NFE (if enabled)
> by sending an ERR_COR Message as an advisory to software, instead of
> sending ERR_NONFATAL.
> 
> When processing an ANFE, ideally both correctable error(CE) status and
> uncorrectable error(UE) status should be cleared. However, there is no
> way to fully identify the UE associated with ANFE. Even worse, a Fatal
> Error(FE) or Non-Fatal Error(NFE) may set the same UE status bit as
> ANFE. Treating an ANFE as NFE will reproduce above mentioned issue,
> i.e., breaking software probing; treating NFE as ANFE will make us
> ignoring some UEs which need active recover operation. To avoid clearing
> UEs that are not ANFE by accident, the most conservative route is taken
> here: If any of the FE/NFE Detected bits is set in Device Status, do not
> touch UE status, they should be cleared later by the UE handler. Otherwise,
> a specific set of UEs that may be raised as ANFE according to the PCIe
> specification will be cleared if their corresponding severity is Non-Fatal.
> 
> To achieve above purpose, store UNCOR_STATUS bits that might be ANFE
> in aer_err_info.anfe_status. So that those bits could be printed and
> processed later.
> 
> Tested-by: Yudong Wang 
> Co-developed-by: "Wang, Qingshun" 
> Signed-off-by: "Wang, Qingshun" 
> Signed-off-by: Zhenzhong Duan 
> ---
>  drivers/pci/pci.h  |  1 +
>  drivers/pci/pcie/aer.c | 45 ++
>  2 files changed, 46 insertions(+)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 17fed1846847..3f9eb807f9fd 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -412,6 +412,7 @@ struct aer_err_info {
>  
>   unsigned int status;/* COR/UNCOR Error Status */
>   unsigned int mask;  /* COR/UNCOR Error Mask */
> + unsigned int anfe_status;   /* UNCOR Error Status for ANFE */
>   struct pcie_tlp_log tlp;/* TLP Header */
>  };
>  
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ac6293c24976..27364ab4b148 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -107,6 +107,12 @@ struct aer_stats {
>   PCI_ERR_ROOT_MULTI_COR_RCV |\
>   PCI_ERR_ROOT_MULTI_UNCOR_RCV)
>  
> +#define AER_ERR_ANFE_UNC_MASK (PCI_ERR_UNC_POISON_TLP | \
> + PCI_ERR_UNC_COMP_TIME | \
> + PCI_ERR_UNC_COMP_ABORT |\
> + PCI_ERR_UNC_UNX_COMP |  \
> + PCI_ERR_UNC_UNSUP)
> +
>  static int pcie_aer_disable;
>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
>  
> @@ -1196,6 +1202,41 @@ void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>  EXPORT_SYMBOL_GPL(aer_recover_queue);
>  #endif
>  
> +static void anfe_get_uc_status(struct pci_dev *dev, struct aer_err_info *info)
> +{
> + u32 uncor_mask, uncor_status;
> + u16 device_status;
> + int aer = dev->aer_cap;
> +
> + if (pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &device_status))
> + return;
> + /*
> +  * Take the most conservative route here. If there are
> +  * Non-Fatal/Fatal errors detected, do not assume any
> +  * bit in uncor_status is set by ANFE.
> +  */
> + if (device_status & (PCI_EXP_DEVSTA_NFED | PCI_EXP_DEVSTA_FED))
> + return;
> +

Is there not a race here?  If we happen to get either an NFED or FED 
between the read of device_status above and here we might pick up a status
that corresponds to that (and hence clear something we should not).

Or am I missing that race being closed somewhere?

> + pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &uncor_status);
> + pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &uncor_mask);
> + /*
> +  * According to PCIe Base Specification Revision 6.1,
> +  * Section 6.2.3.2.4, if an UNCOR error is raised as
> +  * Advisory Non-Fatal error, it will match the following
> +  * conditions:
> +  *  a. The severity of the error is Non-Fatal.
> +  *  b. The error is one of the following:
> +  *  1. Poisoned TLP   (Section 6.2.3.2.4.3)
> + 

Re: [PATCH 1/3] PCI/AER: Use 'Correctable' and 'Uncorrectable' spec terms for errors

2023-12-08 Thread Jonathan Cameron
On Wed,  6 Dec 2023 16:42:29 -0600
Bjorn Helgaas  wrote:

> From: Bjorn Helgaas 
> 
> The PCIe spec classifies errors as either "Correctable" or "Uncorrectable".
> Previously we printed these as "Corrected" or "Uncorrected".  To avoid
> confusion, use the same terms as the spec.
> 
> One confusing situation is when one agent detects an error, but another
> agent is responsible for recovery, e.g., by re-attempting the operation.
> The first agent may log a "correctable" error but it has not yet been
> corrected.  The recovery agent must report an uncorrectable error if it is
> unable to recover.  If we print the first agent's error as "Corrected", it
> gives the false impression that it has already been resolved.
> 
> Sample message change:
> 
>   - pcieport :00:1c.5: AER: Corrected error received: :00:1c.5
>   + pcieport :00:1c.5: AER: Correctable error received: :00:1c.5
> 
> Signed-off-by: Bjorn Helgaas 


Re: [PATCH 3/3] PCI/AER: Use explicit register sizes for struct members

2023-12-08 Thread Jonathan Cameron
On Wed,  6 Dec 2023 16:42:31 -0600
Bjorn Helgaas  wrote:

> From: Bjorn Helgaas 
> 
> aer_irq() reads the AER Root Error Status and Error Source Identification
> (PCI_ERR_ROOT_STATUS and PCI_ERR_ROOT_ERR_SRC) registers directly into
> struct aer_err_source.  Both registers are 32 bits, so declare the members
> explicitly as "u32" instead of "unsigned int".
> 
> Similarly, aer_get_device_error_info() reads the AER Header Log
> (PCI_ERR_HEADER_LOG) registers, which are also 32 bits, into struct
> aer_header_log_regs.  Declare those members as "u32" as well.
> 
> No functional changes intended.
> 
> Signed-off-by: Bjorn Helgaas 

Another sensible cleanup. FWIW on such simple patches
Reviewed-by: Jonathan Cameron 


Re: [PATCH 2/3] PCI/AER: Decode Requester ID when no error info found

2023-12-08 Thread Jonathan Cameron
On Wed,  6 Dec 2023 16:42:30 -0600
Bjorn Helgaas  wrote:

> From: Bjorn Helgaas 
> 
> When a device with AER detects an error, it logs error information in its
> own AER Error Status registers.  It may send an Error Message to the Root
> Port (RCEC in the case of an RCiEP), which logs the fact that an Error
> Message was received (Root Error Status) and the Requester ID of the
> message source (Error Source Identification).
> 
> aer_print_port_info() prints the Requester ID from the Root Port Error
> Source in the usual Linux "bb:dd.f" format, but when find_source_device()
> finds no error details in the hierarchy below the Root Port, it printed the
> raw Requester ID without decoding it.
> 
> Decode the Requester ID in the usual Linux format so it matches other
> messages.
> 
> Sample message changes:
> 
>   - pcieport :00:1c.5: AER: Correctable error received: :00:1c.5
>   - pcieport :00:1c.5: AER: can't find device of ID00e5
>   + pcieport :00:1c.5: AER: Correctable error message received from :00:1c.5
>   + pcieport :00:1c.5: AER: found no error details for 0000:00:1c.5
> 
> Signed-off-by: Bjorn Helgaas 
LGTM
Reviewed-by: Jonathan Cameron 



Re: [PATCH 0/8] devm_led_classdev_register() usage problem

2023-11-25 Thread Jonathan Cameron
On Sat, 25 Nov 2023 03:47:41 +0300
George Stark  wrote:

> Hello Andy
> 
> Thanks for the review.
> 
> On 11/24/23 18:28, Andy Shevchenko wrote:
> > On Wed, Oct 25, 2023 at 04:07:29PM +0300, George Stark wrote:  
> >> Lots of drivers use devm_led_classdev_register() to register their led
> >> objects and let the kernel free those leds at the driver's remove stage.
> >> It can lead to a problem due to led_classdev_unregister()
> >> implementation calls led_set_brightness() to turn off the led.
> >> led_set_brightness() may call one of the module's brightness_set callbacks.
> >> If that callback uses module's resources allocated without using devm
> >> funcs() then those resources will be already freed at module's remove()
> >> callback and we may have use-after-free situation.
> >>
> >> Here is an example:
> >>
> >> module_probe()
> >> {
> >>  devm_led_classdev_register(module_brightness_set_cb);
> >>  mutex_init();
> >> }
> >>
> >> module_brightness_set_cb()
> >> {
> >>  mutex_lock();
> >>  do_set_brightness();
> >>  mutex_unlock();
> >> }
> >>
> >> module_remove()
> >> {
> >>  mutex_destroy();
> >> }
> >>
> >> at rmmod:
> >> module_remove()  
> >>  ->mutex_destroy();  
> >> devres_release_all()  
> >>  ->led_classdev_unregister();
> >>  ->led_set_brightness();
> >>  ->module_brightness_set_cb();
> >>   ->mutex_lock();  /* use-after-free */  
> >>
> >> I think it's an architectural issue and should be discussed thoroughly.
> >> Some thoughts about fixing it as a start:
> >> 1) drivers can use devm_led_classdev_unregister() to explicitly free leds
> >> before dependent resources are freed. devm_led_classdev_register() remains
> >> useful to simplify probe implementation.
> >> As a proof of concept I examined all drivers from drivers/leds and prepared
> >> patches where it's needed. Sometimes it was not as clean as just calling
> >> devm_led_classdev_unregister() because several drivers do not track
> >> their leds object at all - they can call devm_led_classdev_register() and
> >> drop the returned pointer. In that case I used devres group API.
> >>
> >> Drivers outside drivers/leds should be checked too after discussion.
> >>
> >> 2) remove led_set_brightness from led_classdev_unregister() and force the
> >> drivers to turn leds off at shutdown. Maybe add a check that led's
> >> brightness is 0 at led_classdev_unregister() and put a warning to dmesg if
> >> it's not. Actually in many cases there is no need to turn off the leds
> >> manually one-by-one if the driver shuts down the whole led controller. For
> >> the last case, to disable the warning a new flag can be brought in, e.g.
> >> LED_AUTO_OFF_AT_SHUTDOWN (similar to LED_RETAIN_AT_SHUTDOWN).
> > 
> > NAK.
> > 
> > Just fix the drivers by wrapping mutex_destroy() into devm, There are many
> > doing so. You may be brave enough to introduce devm_mutex_init() somewhere
> > in include/linux/device*
> >   
> 
> Just one thing about mutex_destroy(). It seems like there's no single 
> opinion on should it be called in 100% cases e.g. in remove() paths.
> For example in iio subsystem Jonathan suggests it can be dropped in 
> simple cases: https://www.spinics.net/lists/linux-iio/msg73423.html
> 
> So the question is can we just drop mutex_destroy() in module's remove() 
> callback here if that mutex is needed for devm subsequent callbacks?

I've never considered it remotely critical. The way IIO works means that things
have gone pretty horribly wrong in the core if you managed to access a mutex
after the unwind of devm_iio_device_register() has completed. But sure, add a
devm_mutex_init() and I'd happily see that adopted in IIO for consistency
and to avoid answering questions on whether it is necessary to call
mutex_destroy().

My argument has always been that if the line after(ish) a mutex_destroy() is
going to either free the memory it's in, or make it otherwise inaccessible
(IIO proxies accesses via chardevs if any are open, so should ensure they never
hit the driver), then it's pointless and messy to call mutex_destroy().
devm_mutex_init() gets rid of that mess.

Jonathan


> 



Re: [PATCH v4 22/23] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-06-01 Thread Jonathan Cameron
On Tue, 23 May 2023 18:22:13 -0500
Terry Bowman  wrote:

> From: Robert Richter 
> 
> In Restricted CXL Device (RCD) mode a CXL device is exposed as an
> RCiEP, but CXL downstream and upstream ports are not enumerated and
> not visible in the PCIe hierarchy. Protocol and link errors are sent
> to an RCEC.
> 
> Restricted CXL host (RCH) downstream port-detected errors are signaled
> as internal AER errors, either Uncorrectable Internal Error (UIE) or
> Corrected Internal Errors (CIE). The error source is the id of the
> RCEC. A CXL handler must then inspect the error status in various CXL
> registers residing in the dport's component register space (CXL RAS
> capability) or the dport's RCRB (PCIe AER extended capability). [1]
> 
> Errors showing up in the RCEC's error handler must be handled and
> connected to the CXL subsystem. Implement this by forwarding the error
> to all CXL devices below the RCEC. Since the entire CXL device is
> controlled only using PCIe Configuration Space of device 0, function
> 0, only pass it there [2]. The error handling is limited to currently
> supported devices with the Memory Device class code set
> (PCI_CLASS_MEMORY_CXL, 502h), where the handler can be implemented in
> the existing cxl_pci driver. Support of CXL devices (e.g. a CXL.cache
> device) can be enabled later.
> 
> In addition to errors directed to the CXL endpoint device, a handler
> must also inspect the CXL RAS and PCIe AER capabilities of the CXL
> downstream port that is connected to the device.
> 
> Since CXL downstream port errors are signaled using internal errors,
> the handler requires those errors to be unmasked. This is subject of a
> follow-on patch.
> 
> The reason for choosing this implementation is that a CXL RCEC device
> is bound to the AER port driver, but the driver does not allow it to
> register a custom specific handler to support CXL. Connecting the RCEC
> hard-wired with a CXL handler does not work, as the CXL subsystem
> might not be present all the time. The alternative to add an
> implementation to the portdrv to allow the registration of a custom
> RCEC error handler isn't worth doing it as CXL would be its only user.
> Instead, just check for an CXL RCEC and pass it down to the connected
> CXL device's error handler. With this approach the code can entirely
> be implemented in the PCIe AER driver and is independent of the CXL
> subsystem. The CXL driver only provides the handler.
> 
> [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
> [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
> 
> Co-developed-by: Terry Bowman 
> Signed-off-by: Terry Bowman 
> Signed-off-by: Robert Richter 
> Cc: "Oliver O'Halloran" 
> Cc: Bjorn Helgaas 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-...@vger.kernel.org
> ---
Reviewed-by: Jonathan Cameron 



Re: [PATCH v4 1/3] PCI/AER: Factor out interrupt toggling into helpers

2023-05-05 Thread Jonathan Cameron
On Mon, 24 Apr 2023 13:52:47 +0800
Kai-Heng Feng  wrote:

> There are many places that enable and disable AER interrput, so move

interrupt

> them into helpers.

Otherwise looks like a good clean up to me.
FWIW
Reviewed-by: Jonathan Cameron 

> 
> Reviewed-by: Mika Westerberg 
> Reviewed-by: Kuppuswamy Sathyanarayanan 
> 
> Signed-off-by: Kai-Heng Feng 
> ---
>  drivers/pci/pcie/aer.c | 45 +-
>  1 file changed, 27 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index f6c24ded134c..1420e1f27105 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1227,6 +1227,28 @@ static irqreturn_t aer_irq(int irq, void *context)
>   return IRQ_WAKE_THREAD;
>  }
>  
> +static void aer_enable_irq(struct pci_dev *pdev)
> +{
> + int aer = pdev->aer_cap;
> + u32 reg32;
> +
> + /* Enable Root Port's interrupt in response to error messages */
> + pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, &reg32);
> + reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> + pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> +}
> +
> +static void aer_disable_irq(struct pci_dev *pdev)
> +{
> + int aer = pdev->aer_cap;
> + u32 reg32;
> +
> + /* Disable Root's interrupt in response to error messages */
> + pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, &reg32);
> + reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
> + pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> +}
> +
>  /**
>   * aer_enable_rootport - enable Root Port's interrupts when receiving messages
>   * @rpc: pointer to a Root Port data structure
> @@ -1256,10 +1278,7 @@ static void aer_enable_rootport(struct aer_rpc *rpc)
>   pci_read_config_dword(pdev, aer + PCI_ERR_UNCOR_STATUS, &reg32);
>   pci_write_config_dword(pdev, aer + PCI_ERR_UNCOR_STATUS, reg32);
>  
> - /* Enable Root Port's interrupt in response to error messages */
> - pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, &reg32);
> - reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> - pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> + aer_enable_irq(pdev);
>  }
>  
>  /**
> @@ -1274,10 +1293,7 @@ static void aer_disable_rootport(struct aer_rpc *rpc)
>   int aer = pdev->aer_cap;
>   u32 reg32;
>  
> - /* Disable Root's interrupt in response to error messages */
> - pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, &reg32);
> - reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
> - pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> + aer_disable_irq(pdev);
>  
>   /* Clear Root's error status reg */
>   pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_STATUS, &reg32);
> @@ -1372,12 +1388,8 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>*/
>   aer = root ? root->aer_cap : 0;
>  
> - if ((host->native_aer || pcie_ports_native) && aer) {
> - /* Disable Root's interrupt in response to error messages */
> - pci_read_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, &reg32);
> - reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
> - pci_write_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, reg32);
> - }
> + if ((host->native_aer || pcie_ports_native) && aer)
> + aer_disable_irq(root);
>  
>   if (type == PCI_EXP_TYPE_RC_EC || type == PCI_EXP_TYPE_RC_END) {
>   rc = pcie_reset_flr(dev, PCI_RESET_DO_RESET);
> @@ -1396,10 +1408,7 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>   pci_read_config_dword(root, aer + PCI_ERR_ROOT_STATUS, &reg32);
>   pci_write_config_dword(root, aer + PCI_ERR_ROOT_STATUS, reg32);
>  
> - /* Enable Root Port's interrupt in response to error messages */
> - pci_read_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, &reg32);
> - reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> - pci_write_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, reg32);
> + aer_enable_irq(root);
>   }
>  
>   return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;



Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-04-17 Thread Jonathan Cameron
On Fri, 14 Apr 2023 16:35:05 +0200
Robert Richter  wrote:

> On 14.04.23 13:19:50, Jonathan Cameron wrote:
> > On Tue, 11 Apr 2023 13:03:01 -0500
> > Terry Bowman  wrote:
> >   
> > > From: Robert Richter 
> > > 
> > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an
> > > RCiEP, but CXL downstream and upstream ports are not enumerated and
> > > not visible in the PCIe hierarchy. Protocol and link errors are sent
> > > to an RCEC.
> > > 
> > > Restricted CXL host (RCH) downstream port-detected errors are signaled
> > > as internal AER errors, either Uncorrectable Internal Error (UIE) or
> > > Corrected Internal Errors (CIE). The error source is the id of the
> > > RCEC. A CXL handler must then inspect the error status in various CXL
> > > registers residing in the dport's component register space (CXL RAS
> > > cap) or the dport's RCRB (AER ext cap). [1]
> > > 
> > > Errors showing up in the RCEC's error handler must be handled and
> > > connected to the CXL subsystem. Implement this by forwarding the error
> > > to all CXL devices below the RCEC. Since the entire CXL device is
> > > controlled only using PCIe Configuration Space of device 0, Function
> > > 0, only pass it there [2]. These devices have the Memory Device class
> > > code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver
> > > can implement the handler.  
> > 
> > This comment implies only class code compliant drivers.  Sure we don't
> > have drivers for anything else yet, but we should try to avoid saying
> > there won't be any (which I think above implies).
> > 
> > You have a comment in the code, but maybe relaxing the description above
> > to "currently support devices have..."  
> 
> It is used here to identify CXL memory devices and limit the
> enablement to those. The spec requires this to be set for CXL mem devs
> (see cxl 3.0, 8.1.12.2).
> 
> There could be other CXL devices (e.g. cache), but other drivers are
> not yet implemented. That is what I am referring to. The check makes
> sure there is actually a driver with a handler for it (cxl_pci).

Understood on intent. My worry is that the above can be read as a
statement on hardware restrictions, rather than on what software currently
implements.  Meh. Minor point so I don't care that much!
Unlikely anyone will read the patch description after it merges anyway ;)

> 
> >   
> > > In addition to errors directed to the CXL
> > > endpoint device, the handler must also inspect the CXL downstream
> > > port's CXL RAS and PCIe AER external capabilities that is connected to
> > > the device.
> > > 
> > > Since CXL downstream port errors are signaled using internal errors,
> > > the handler requires those errors to be unmasked. This is subject of a
> > > follow-on patch.
> > > 
> > > The reason for choosing this implementation is that a CXL RCEC device
> > > is bound to the AER port driver, but the driver does not allow it to
> > > register a custom specific handler to support CXL. Connecting the RCEC
> > > hard-wired with a CXL handler does not work, as the CXL subsystem
> > > might not be present all the time. The alternative to add an
> > > implementation to the portdrv to allow the registration of a custom
> > > RCEC error handler isn't worth doing as CXL would be its only user.
> > > Instead, just check for a CXL RCEC and pass it down to the connected
> > > CXL device's error handler. With this approach the code can entirely
> > > be implemented in the PCIe AER driver and is independent of the CXL
> > > subsystem. The CXL driver only provides the handler.
> > > 
> > > [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
> > > [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
> > > 
> > > Co-developed-by: Terry Bowman 
> > > Signed-off-by: Robert Richter 
> > > Signed-off-by: Terry Bowman 
> > > Cc: "Oliver O'Halloran" 
> > > Cc: Bjorn Helgaas 
> > > Cc: Mahesh J Salgaonkar 
> > > Cc: linuxppc-dev@lists.ozlabs.org
> > > Cc: linux-...@vger.kernel.org  
> > 
> > Generally looks good to me.  A few trivial comments inline.
> >   
> > > ---
> > >  drivers/pci/pcie/Kconfig |  8 ++
> > >  drivers/pci/pcie/aer.c   | 61 
> > >  2 files changed, 69 insertions(+)
> > > 
> > > diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
> > > 

Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-04-14 Thread Jonathan Cameron
On Tue, 11 Apr 2023 13:03:01 -0500
Terry Bowman  wrote:

> From: Robert Richter 
> 
> In Restricted CXL Device (RCD) mode a CXL device is exposed as an
> RCiEP, but CXL downstream and upstream ports are not enumerated and
> not visible in the PCIe hierarchy. Protocol and link errors are sent
> to an RCEC.
> 
> Restricted CXL host (RCH) downstream port-detected errors are signaled
> as internal AER errors, either Uncorrectable Internal Error (UIE) or
> Corrected Internal Errors (CIE). The error source is the id of the
> RCEC. A CXL handler must then inspect the error status in various CXL
> registers residing in the dport's component register space (CXL RAS
> cap) or the dport's RCRB (AER ext cap). [1]
> 
> Errors showing up in the RCEC's error handler must be handled and
> connected to the CXL subsystem. Implement this by forwarding the error
> to all CXL devices below the RCEC. Since the entire CXL device is
> controlled only using PCIe Configuration Space of device 0, Function
> 0, only pass it there [2]. These devices have the Memory Device class
> code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver
> can implement the handler.

This comment implies only class code compliant drivers.  Sure we don't
have drivers for anything else yet, but we should try to avoid saying
there won't be any (which I think above implies).

You have a comment in the code, but maybe relaxing the description above
to "currently supported devices have..."

> In addition to errors directed to the CXL
> endpoint device, the handler must also inspect the CXL downstream
> port's CXL RAS and PCIe AER external capabilities that is connected to
> the device.
> 
> Since CXL downstream port errors are signaled using internal errors,
> the handler requires those errors to be unmasked. This is subject of a
> follow-on patch.
> 
> The reason for choosing this implementation is that a CXL RCEC device
> is bound to the AER port driver, but the driver does not allow it to
> register a custom specific handler to support CXL. Connecting the RCEC
> hard-wired with a CXL handler does not work, as the CXL subsystem
> might not be present all the time. The alternative to add an
> implementation to the portdrv to allow the registration of a custom
> RCEC error handler isn't worth doing as CXL would be its only user.
> Instead, just check for a CXL RCEC and pass it down to the connected
> CXL device's error handler. With this approach the code can entirely
> be implemented in the PCIe AER driver and is independent of the CXL
> subsystem. The CXL driver only provides the handler.
> 
> [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
> [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
> 
> Co-developed-by: Terry Bowman 
> Signed-off-by: Robert Richter 
> Signed-off-by: Terry Bowman 
> Cc: "Oliver O'Halloran" 
> Cc: Bjorn Helgaas 
> Cc: Mahesh J Salgaonkar 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-...@vger.kernel.org

Generally looks good to me.  A few trivial comments inline.

> ---
>  drivers/pci/pcie/Kconfig |  8 ++
>  drivers/pci/pcie/aer.c   | 61 
>  2 files changed, 69 insertions(+)
> 
> diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
> index 228652a59f27..b0dbd864d3a3 100644
> --- a/drivers/pci/pcie/Kconfig
> +++ b/drivers/pci/pcie/Kconfig
> @@ -49,6 +49,14 @@ config PCIEAER_INJECT
> gotten from:
>
> https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
>  
> +config PCIEAER_CXL
> + bool "PCI Express CXL RAS support"

Description makes this sound too general. I'd mention restricted
hosts even in the menu option title.


> + default y
> + depends on PCIEAER && CXL_PCI
> + help
> +   This enables CXL error handling for Restricted CXL Hosts
> +   (RCHs).

Spec term is probably fine in the title, but in the help I'd 
expand it as per the CXL 3.0 glossary to include
"CXL Host that is operating in RCD mode."
It might otherwise surprise people that this matters on their shiny
new CXL X.0 host (because they found an old CXL 1.1 card in a box
and decided to plug it in)

Do we actually need this protection at all?  It's a tiny amount of code
and I can't see anything immediately that requires the CXL_PCI dependency
other than it's a bit pointless if that isn't here.

> +
>  #
>  # PCI Express ECRC
>  #
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 7a25b62d9e01..171a08fd8ebd 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -946,6 +946,65 @@ static bool find_source_device(struct pci_dev *parent,
>   return true;
>  }
>  
> +#ifdef CONFIG_PCIEAER_CXL
> +
> +static bool is_cxl_mem_dev(struct pci_dev *dev)
> +{
> + /*
> +  * A CXL device is controlled only using PCIe Configuration
> +  * Space of device 0, Function 0.

That's not true in general.   Definitely true that CXL protocol
error reporting is controlled only using this Devfn, 

Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling

2023-04-14 Thread Jonathan Cameron
On Fri, 14 Apr 2023 13:21:37 +0200
Robert Richter  wrote:

> On 13.04.23 15:52:36, Ira Weiny wrote:
> > Jonathan Cameron wrote:  
> > > On Wed, 12 Apr 2023 16:29:01 -0500
> > > Bjorn Helgaas  wrote:
> > >   
> > > > On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote:  
> > > > > From: Robert Richter 
> > > > >   
> 
> > > > > +static int __cxl_unmask_internal_errors(struct pci_dev *rcec)
> > > > > +{
> > > > > + int aer, rc;
> > > > > + u32 mask;
> > > > > +
> > > > > + /*
> > > > > +  * Internal errors are masked by default, unmask RCEC's here
> > > > > +  * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
> > > > > +  * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
> > > > > +  */
> > > > 
> > > > Unmasking internal errors doesn't have anything specific to do with
> > > > CXL, so I don't think it should have "cxl" in the function name.
> > > > Maybe something like "pci_aer_unmask_internal_errors()".  
> > > 
> > > This reminds me.  Not sure we resolved earlier discussion on changing
> > > the system wide policy to turn these on 
> > > https://lore.kernel.org/linux-cxl/20221229172731.GA611562@bhelgaas/
> > > which needs pretty much the same thing.
> > > 
> > > Ira, I think you were picking this one up?
> > > https://lore.kernel.org/linux-cxl/63e5fb533f304_13244829412@iweiny-mobl.notmuch/
> > >   
> > 
> > After this discussion I posted an RFC to enable those errors.
> > 
> > https://lore.kernel.org/all/20230209-cxl-pci-aer-v1-1-f9a817fa4...@intel.com/
> > 

Ah. I'd forgotten that thread. Thanks!

> > Unfortunately the prevailing opinion was that this was unsafe.  And no one
> > piped up with a reason to pursue the alternative of a pci core call to enable
> > them as needed.
> > 
> > So I abandoned the work.
> > 
> > I think the direction things were headed was to have a call like:
> > 
> > int pci_enable_pci_internal_errors(struct pci_dev *dev)
> > {
> > 	int pos_cap_err;
> > 	u32 reg;
> > 
> > 	if (!pcie_aer_is_native(dev))
> > 		return -EIO;
> > 
> > 	pos_cap_err = dev->aer_cap;
> > 
> > 	/* Unmask correctable and uncorrectable (non-fatal) internal errors */
> > 	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, &reg);
> > 	reg &= ~PCI_ERR_COR_INTERNAL;
> > 	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, reg);
> > 
> > 	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, &reg);
> > 	reg &= ~PCI_ERR_UNC_INTN;
> > 	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, reg);
> > 
> > 	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, &reg);
> > 	reg &= ~PCI_ERR_UNC_INTN;
> > 	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, reg);
> > 
> > 	return 0;
> > }
> > 
> > ... and call this from the cxl code where it is needed.  
> 
> The version I have ready after addressing Bjorn's comments is pretty
> much the same, apart from error checking of the read/writes.
> 
> From your patch proposed you will need it in aer.c too and we do not
> need to export it.

I think for the other components we'll want to call it from cxl_pci_ras_unmask(),
so an export is needed.

I also wonder if a more generic function would be better, as it seems likely
that similar code will be needed for errors other than this pair.
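
To make the "more generic function" idea concrete, here is a minimal userspace sketch, not kernel code. It models config space as a byte array; struct fake_cfg, cfg_read32(), cfg_write32(), aer_unmask_bits() and aer_unmask_internal_errors_sketch() are all hypothetical names. The register offsets (08h, 14h) come from the patch comment; the bit positions are assumed to match the kernel's PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offsets inside the AER capability, per the patch comment (08h, 14h). */
#define ERR_UNCOR_MASK   0x08
#define ERR_COR_MASK     0x14
/* Bit positions assumed to mirror the kernel's pci_regs.h. */
#define ERR_UNC_INTN     0x00400000u /* Uncorrectable Internal Error */
#define ERR_COR_INTERNAL 0x00004000u /* Corrected Internal Error */

/* Hypothetical stand-in for a device's 4K config space. */
struct fake_cfg {
	uint8_t bytes[4096];
};

/* Aligned dword read; returns 0 on success, -1 on a bad offset. */
static int cfg_read32(const struct fake_cfg *c, unsigned int off, uint32_t *val)
{
	if ((off & 3) || off + 4 > sizeof(c->bytes))
		return -1;
	memcpy(val, &c->bytes[off], sizeof(*val));
	return 0;
}

/* Aligned dword write; returns 0 on success, -1 on a bad offset. */
static int cfg_write32(struct fake_cfg *c, unsigned int off, uint32_t val)
{
	if ((off & 3) || off + 4 > sizeof(c->bytes))
		return -1;
	memcpy(&c->bytes[off], &val, sizeof(val));
	return 0;
}

/* Generic helper: clear @bits in the mask register at @aer + @reg. */
static int aer_unmask_bits(struct fake_cfg *c, unsigned int aer,
			   unsigned int reg, uint32_t bits)
{
	uint32_t mask;
	int rc;

	if (!aer)
		return -1; /* no AER capability */

	rc = cfg_read32(c, aer + reg, &mask);
	if (rc)
		return rc;

	mask &= ~bits;
	return cfg_write32(c, aer + reg, mask);
}

/* The internal error pair, expressed through the generic helper. */
static int aer_unmask_internal_errors_sketch(struct fake_cfg *c, unsigned int aer)
{
	int rc = aer_unmask_bits(c, aer, ERR_UNCOR_MASK, ERR_UNC_INTN);

	if (rc)
		return rc;
	return aer_unmask_bits(c, aer, ERR_COR_MASK, ERR_COR_INTERNAL);
}
```

With the read-modify-write in one place, other error bits only need a new register/bit pair rather than another copy of the error checking.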


> 
> This patch only enables it for (CXL) RCECs. You might want to extend
> this for CXL endpoints (and ports?) then.

Definitely.  We have the same limitation you are seeing.  No errors
without turning this on.

Jonathan



> 
> > 
> > Is this an acceptable direction?  Terry is welcome to steal the above from my
> > patch and throw it into the PCI core.
> > 
> > Looking at the current state of things I think cxl_pci_ras_unmask() may
> > actually be broken now without calling something like the above.  For that I
> > dropped the ball.  
> 
> Thanks,
> 
> -Robert
> 
> > 
> > Ira  



Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling

2023-04-13 Thread Jonathan Cameron
On Wed, 12 Apr 2023 16:29:01 -0500
Bjorn Helgaas  wrote:

> On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote:
> > From: Robert Richter 
> > 
> > RCEC AER corrected and uncorrectable internal errors (CIE/UIE) are
> > disabled by default.  
> 
> "Disabled by default" just means "the power-up state of CIE/UIE is
> that they are masked", right?  It doesn't mean that Linux normally
> masks them.
> 
> > [1][2] Enable them to receive CXL downstream port
> > errors of a Restricted CXL Host (RCH).
> > 
> > [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors
> > [2] PCIe Base Spec 6.0, 7.8.4.3 Uncorrectable Error Mask Register,
> > 7.8.4.6 Correctable Error Mask Register
> > 
> > Co-developed-by: Terry Bowman 
> > Signed-off-by: Robert Richter 
> > Signed-off-by: Terry Bowman 
> > Cc: "Oliver O'Halloran" 
> > Cc: Bjorn Helgaas 
> > Cc: Mahesh J Salgaonkar 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux-...@vger.kernel.org
> > ---
> >  drivers/pci/pcie/aer.c | 73 ++
> >  1 file changed, 73 insertions(+)
> > 
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 171a08fd8ebd..3973c731e11d 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -1000,7 +1000,79 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> > pcie_walk_rcec(dev, cxl_handle_error_iter, info);
> >  }
> >  
> > +static bool cxl_error_is_native(struct pci_dev *dev)
> > +{
> > +   struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> > +
> > +   if (pcie_ports_native)
> > +   return true;
> > +
> > +   return host->native_aer && host->native_cxl_error;
> > +}
> > +
> > +static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> > +{
> > +   int *handles_cxl = data;
> > +
> > +   *handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
> > +
> > +   return *handles_cxl;
> > +}
> > +
> > +static bool handles_cxl_errors(struct pci_dev *rcec)
> > +{
> > +   int handles_cxl = 0;
> > +
> > +   if (!rcec->aer_cap)
> > +   return false;
> > +
> > +   if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC)
> > +   pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
> > +
> > +   return !!handles_cxl;
> > +}
> > +
> > +static int __cxl_unmask_internal_errors(struct pci_dev *rcec)
> > +{
> > +   int aer, rc;
> > +   u32 mask;
> > +
> > +   /*
> > +* Internal errors are masked by default, unmask RCEC's here
> > +* PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
> > +* PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
> > +*/  
> 
> Unmasking internal errors doesn't have anything specific to do with
> CXL, so I don't think it should have "cxl" in the function name.
> Maybe something like "pci_aer_unmask_internal_errors()".

This reminds me.  Not sure we resolved earlier discussion on changing
the system wide policy to turn these on 
https://lore.kernel.org/linux-cxl/20221229172731.GA611562@bhelgaas/
which needs pretty much the same thing.

Ira, I think you were picking this one up?
https://lore.kernel.org/linux-cxl/63e5fb533f304_13244829412@iweiny-mobl.notmuch/

Thanks,

Jonathan


> 
> This also has nothing special to do with RCECs, so I think we should
> refer to the device as "dev" as is typical in this file.
> 
> I think this needs to check pcie_aer_is_native() as is done by
> pci_aer_clear_nonfatal_status() and other functions that write the AER
> Capability.
> 
> With the exception of this function, this patch looks like all CXL
> code that maybe could be with other CXL code.  Would require making
> pcie_walk_rcec() available outside drivers/pci, I guess.
> 
> > +   aer = rcec->aer_cap;
> > +   rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask);
> > +   if (rc)
> > +   return rc;
> > +   mask &= ~PCI_ERR_UNC_INTN;
> > +   rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask);
> > +   if (rc)
> > +   return rc;
> > +
> > +   rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask);
> > +   if (rc)
> > +   return rc;
> > +   mask &= ~PCI_ERR_COR_INTERNAL;
> > +   rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask);
> > +
> > +   return rc;
> > +}
> > +
> > +static void cxl_unmask_internal_errors(struct pci_dev *rcec)
> > +{
> > +   if (!handles_cxl_errors(rcec))
> > +   return;
> > +
> > +   if (__cxl_unmask_internal_errors(rcec))
> > +   dev_err(&rcec->dev, "cxl: Failed to unmask internal errors");
> > +   else
> > +   dev_dbg(&rcec->dev, "cxl: Internal errors unmasked");
> > +}
> > +
> >  #else
> > +static inline void cxl_unmask_internal_errors(struct pci_dev *dev) { }
> >  static inline void cxl_handle_error(struct pci_dev *dev,
> > struct aer_err_info *info) { }
> >  #endif
> > @@ -1397,6 +1469,7 @@ static int aer_probe(struct pcie_device *dev)
> > return status;
> > }
> >  
> > +   

Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling

2023-04-13 Thread Jonathan Cameron
On Thu, 13 Apr 2023 15:38:07 +0200
Robert Richter  wrote:

> On 12.04.23 16:29:01, Bjorn Helgaas wrote:
> > On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote:  
> > > From: Robert Richter 
> > > 
> > > RCEC AER corrected and uncorrectable internal errors (CIE/UIE) are
> > > disabled by default.  
> > 
> > "Disabled by default" just means "the power-up state of CIE/UIE is
> > that they are masked", right?  It doesn't mean that Linux normally
> > masks them.  
> 
> Yes, will change the wording here.
> 
> > > [1][2] Enable them to receive CXL downstream port
> > > errors of a Restricted CXL Host (RCH).
> > > 
> > > [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors
> > > [2] PCIe Base Spec 6.0, 7.8.4.3 Uncorrectable Error Mask Register,
> > > 7.8.4.6 Correctable Error Mask Register
> > > 
> > > Co-developed-by: Terry Bowman 
> > > Signed-off-by: Robert Richter 
> > > Signed-off-by: Terry Bowman 
> > > Cc: "Oliver O'Halloran" 
> > > Cc: Bjorn Helgaas 
> > > Cc: Mahesh J Salgaonkar 
> > > Cc: linuxppc-dev@lists.ozlabs.org
> > > Cc: linux-...@vger.kernel.org
> > > ---
> > >  drivers/pci/pcie/aer.c | 73 ++
> > >  1 file changed, 73 insertions(+)
> > > 
> > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > > index 171a08fd8ebd..3973c731e11d 100644
> > > --- a/drivers/pci/pcie/aer.c
> > > +++ b/drivers/pci/pcie/aer.c
> > > @@ -1000,7 +1000,79 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> > >   pcie_walk_rcec(dev, cxl_handle_error_iter, info);
> > >  }
> > >  
> > > +static bool cxl_error_is_native(struct pci_dev *dev)
> > > +{
> > > + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> > > +
> > > + if (pcie_ports_native)
> > > + return true;
> > > +
> > > + return host->native_aer && host->native_cxl_error;
> > > +}
> > > +
> > > +static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> > > +{
> > > + int *handles_cxl = data;
> > > +
> > > + *handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
> > > +
> > > + return *handles_cxl;
> > > +}
> > > +
> > > +static bool handles_cxl_errors(struct pci_dev *rcec)
> > > +{
> > > + int handles_cxl = 0;
> > > +
> > > + if (!rcec->aer_cap)
> > > + return false;
> > > +
> > > + if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC)
> > > + pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
> > > +
> > > + return !!handles_cxl;
> > > +}
> > > +
> > > +static int __cxl_unmask_internal_errors(struct pci_dev *rcec)
> > > +{
> > > + int aer, rc;
> > > + u32 mask;
> > > +
> > > + /*
> > > +  * Internal errors are masked by default, unmask RCEC's here
> > > +  * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
> > > +  * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
> > > +  */  
> > 
> > Unmasking internal errors doesn't have anything specific to do with
> > CXL, so I don't think it should have "cxl" in the function name.
> > Maybe something like "pci_aer_unmask_internal_errors()".  
> 
> Since it is static I renamed it to aer_unmask_internal_errors() and
> also moved it to the beginning of the #ifdef block for easier later
> reuse.
> 
> > 
> > This also has nothing special to do with RCECs, so I think we should
> > refer to the device as "dev" as is typical in this file.  
> 
> Changed.
> 
> > 
> > I think this needs to check pcie_aer_is_native() as is done by
> > pci_aer_clear_nonfatal_status() and other functions that write the AER
> > Capability.  
> 
> Also added the check to aer_unmask_internal_errors(). There was a
> check for native_* in handles_cxl_errors() already, but only for the
> pci devs of the RCEC. I added a check of the RCEC there too.
> 
> > 
> > With the exception of this function, this patch looks like all CXL
> > code that maybe could be with other CXL code.  Would require making
> > pcie_walk_rcec() available outside drivers/pci, I guess.  
> 
> Even this is CXL code, it implements AER support and fits better here
> around AER code. Export of pcie_walk_rcec() (and others?) is not the
> main issue here. CXL drivers can come as modules and would need to
> register a hook at the aer handler.  This would add even more
> complexity here. In contrast, current solution just adds two functions
> for enablement and handling which are empty stubs if code is disabled.
> 
> I could move that code to aer_cxl.c similar to aer_inject.c. Since the
> CXL part is small compared to the remaining aer code I left it in
> aer.c. Also, it is guarded by #ifdef which additionally encapsulates
> it.
> 

To throw another option in there (what Bjorn suggested IIRC for the more
general case..) 

Just enable internal errors always.  No need to know if they are CXL
or something else.

There will/might be fallout and it will be fun.

Jonathan

> >   
> > > + aer = rcec->aer_cap;
> > > + rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask);
> > > + if (rc)
> > > +

Re: [PATCH 000/606] i2c: Complete conversion to i2c_probe_new

2022-11-22 Thread Jonathan Cameron


Queued all of the below:
with one tweaked as per your suggestion and the highlighted one dropped on basis
I was already carrying the equivalent - as you pointed out.

I was already carrying the required dependency.

Includes the IIO ones in staging.

Thanks,

Jonathan

p.s. I perhaps foolishly did this in a highly manual way so as to
also pick up Andy's RB.  So might have dropped one...

Definitely would have been better as one patch per subsystem with
a cover letter suitable for replies like Andy's to be picked up
by b4.


>   iio: accel: adxl372_i2c: Convert to i2c's .probe_new()
>   iio: accel: bma180: Convert to i2c's .probe_new()
>   iio: accel: bma400: Convert to i2c's .probe_new()
>   iio: accel: bmc150: Convert to i2c's .probe_new()
>   iio: accel: da280: Convert to i2c's .probe_new()
>   iio: accel: kxcjk-1013: Convert to i2c's .probe_new()
>   iio: accel: mma7455_i2c: Convert to i2c's .probe_new()
>   iio: accel: mma8452: Convert to i2c's .probe_new()
>   iio: accel: mma9551: Convert to i2c's .probe_new()
>   iio: accel: mma9553: Convert to i2c's .probe_new()
>   iio: adc: ad7091r5: Convert to i2c's .probe_new()
>   iio: adc: ad7291: Convert to i2c's .probe_new()
>   iio: adc: ad799x: Convert to i2c's .probe_new()
>   iio: adc: ina2xx-adc: Convert to i2c's .probe_new()
>   iio: adc: ltc2471: Convert to i2c's .probe_new()
>   iio: adc: ltc2485: Convert to i2c's .probe_new()
>   iio: adc: ltc2497: Convert to i2c's .probe_new()
>   iio: adc: max1363: Convert to i2c's .probe_new()
>   iio: adc: max9611: Convert to i2c's .probe_new()
>   iio: adc: mcp3422: Convert to i2c's .probe_new()
>   iio: adc: ti-adc081c: Convert to i2c's .probe_new()
>   iio: adc: ti-ads1015: Convert to i2c's .probe_new()
>   iio: cdc: ad7150: Convert to i2c's .probe_new()
>   iio: cdc: ad7746: Convert to i2c's .probe_new()
>   iio: chemical: ams-iaq-core: Convert to i2c's .probe_new()
>   iio: chemical: atlas-ezo-sensor: Convert to i2c's .probe_new()
>   iio: chemical: atlas-sensor: Convert to i2c's .probe_new()
>   iio: chemical: bme680_i2c: Convert to i2c's .probe_new()
>   iio: chemical: ccs811: Convert to i2c's .probe_new()
>   iio: chemical: scd4x: Convert to i2c's .probe_new()
>   iio: chemical: sgp30: Convert to i2c's .probe_new()
>   iio: chemical: sgp40: Convert to i2c's .probe_new()
>   iio: chemical: vz89x: Convert to i2c's .probe_new()
>   iio: dac: ad5064: Convert to i2c's .probe_new()
>   iio: dac: ad5380: Convert to i2c's .probe_new()
>   iio: dac: ad5446: Convert to i2c's .probe_new()
>   iio: dac: ad5593r: Convert to i2c's .probe_new()
>   iio: dac: ad5696-i2c: Convert to i2c's .probe_new()
>   iio: dac: ds4424: Convert to i2c's .probe_new()
>   iio: dac: m62332: Convert to i2c's .probe_new()
>   iio: dac: max517: Convert to i2c's .probe_new()
>   iio: dac: max5821: Convert to i2c's .probe_new()
>   iio: dac: mcp4725: Convert to i2c's .probe_new()
>   iio: dac: ti-dac5571: Convert to i2c's .probe_new()
>   iio: gyro: bmg160_i2c: Convert to i2c's .probe_new()
>   iio: gyro: itg3200_core: Convert to i2c's .probe_new()
>   iio: gyro: mpu3050-i2c: Convert to i2c's .probe_new()
>   iio: gyro: st_gyro_i2c: Convert to i2c's .probe_new()
>   iio: health: afe4404: Convert to i2c's .probe_new()
>   iio: health: max30100: Convert to i2c's .probe_new()
>   iio: health: max30102: Convert to i2c's .probe_new()
>   iio: humidity: am2315: Convert to i2c's .probe_new()
>   iio: humidity: hdc100x: Convert to i2c's .probe_new()
>   iio: humidity: hdc2010: Convert to i2c's .probe_new()
>   iio: humidity: hts221_i2c: Convert to i2c's .probe_new()
>   iio: humidity: htu21: Convert to i2c's .probe_new()
>   iio: humidity: si7005: Convert to i2c's .probe_new()
>   iio: humidity: si7020: Convert to i2c's .probe_new()
>   iio: imu: bmi160/bmi160_i2c: Convert to i2c's .probe_new()
>   iio: imu: fxos8700_i2c: Convert to i2c's .probe_new()
>   iio: imu: inv_mpu6050: Convert to i2c's .probe_new()
>   iio: imu: kmx61: Convert to i2c's .probe_new()
>   iio: imu: st_lsm6dsx: Convert to i2c's .probe_new()
>   iio: light: adjd_s311: Convert to i2c's .probe_new()
>   iio: light: adux1020: Convert to i2c's .probe_new()
>   iio: light: al3010: Convert to i2c's .probe_new()
>   iio: light: al3320a: Convert to i2c's .probe_new()
>   iio: light: apds9300: Convert to i2c's .probe_new()
>   iio: light: apds9960: Convert to i2c's .probe_new()
>   iio: light: bh1750: Convert to i2c's .probe_new()
>   iio: light: bh1780: Convert to i2c's .probe_new()
>   iio: light: cm3232: Convert to i2c's .probe_new()
>   iio: light: cm3323: Convert to i2c's .probe_new()
>   iio: light: cm36651: Convert to i2c's .probe_new()
>   iio: light: gp2ap002: Convert to i2c's .probe_new()
>   iio: light: gp2ap020a00f: Convert to i2c's .probe_new()
>   iio: light: isl29018: Convert to i2c's .probe_new()
>   iio: light: isl29028: Convert to i2c's .probe_new()
>   iio: light: isl29125: Convert to i2c's .probe_new()
>   iio: light: jsa1212: Convert to i2c's .probe_new()
>   iio: 

Re: [PATCH v2 6/9] PCI: Add pci_find_dvsec_capability to find designated VSEC

2021-10-01 Thread Jonathan Cameron
On Thu, 23 Sep 2021 10:26:44 -0700
Ben Widawsky  wrote:

> Add pci_find_dvsec_capability to locate a Designated Vendor-Specific
> Extended Capability with the specified DVSEC ID.
> 
> The Designated Vendor-Specific Extended Capability (DVSEC) allows one or
> more vendor specific capabilities that aren't tied to the vendor ID of
> the PCI component.
> 
> DVSEC is critical for both the Compute Express Link (CXL) driver as well
> as the driver for OpenCAPI coherent accelerator (OCXL).
> 
> Cc: David E. Box 
> Cc: Jonathan Cameron 
> Cc: Bjorn Helgaas 
> Cc: Dan Williams 
> Cc: linux-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Andrew Donnellan 
> Cc: Lu Baolu 
> Reviewed-by: Frederic Barrat 
> Signed-off-by: Ben Widawsky 

Great to see this cleaned up.

Reviewed-by: Jonathan Cameron 

> ---
>  drivers/pci/pci.c   | 32 
>  include/linux/pci.h |  1 +
>  2 files changed, 33 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index ce2ab62b64cf..94ac86ff28b0 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -732,6 +732,38 @@ u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap)
>  }
>  EXPORT_SYMBOL_GPL(pci_find_vsec_capability);
>  
> +/**
> + * pci_find_dvsec_capability - Find DVSEC for vendor
> + * @dev: PCI device to query
> + * @vendor: Vendor ID to match for the DVSEC
> + * @dvsec: Designated Vendor-specific capability ID
> + *
> + * If DVSEC has Vendor ID @vendor and DVSEC ID @dvsec return the capability
> + * offset in config space; otherwise return 0.
> + */
> +u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec)
> +{
> +	int pos;
> +
> +	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);
> +	if (!pos)
> +		return 0;
> +
> +	while (pos) {
> +		u16 v, id;
> +
> +		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &v);
> +		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &id);
> +		if (vendor == v && dvsec == id)
> +			return pos;
> +
> +		pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC);
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_find_dvsec_capability);
> +
>  /**
>   * pci_find_parent_resource - return resource region of parent bus of given
>   * region
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index cd8aa6fce204..c93ccfa4571b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1130,6 +1130,7 @@ u16 pci_find_ext_capability(struct pci_dev *dev, int cap);
>  u16 pci_find_next_ext_capability(struct pci_dev *dev, u16 pos, int cap);
>  struct pci_bus *pci_find_next_bus(const struct pci_bus *from);
>  u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap);
> +u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec);
>  
>  u64 pci_get_dsn(struct pci_dev *dev);
>  
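
As an illustration of the lookup the patch adds, here is a minimal userspace sketch that walks a simulated extended capability list. It is not the kernel implementation: struct fake_cfg, rd16(), rd32(), wr32(), put_dvsec(), find_next_ext_cap() and find_dvsec_sketch() are hypothetical names. It assumes the standard extended capability header layout (ID in bits 15:0, next pointer in bits 31:20), DVSEC capability ID 23h, and the vendor ID and DVSEC ID in the low 16 bits of the dwords at offsets 4 and 8.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CFG_SIZE         4096
#define EXT_CAP_START    0x100 /* extended capabilities begin here */
#define EXT_CAP_ID_DVSEC 0x23
#define DVSEC_HEADER1    0x04  /* low 16 bits: DVSEC vendor ID */
#define DVSEC_HEADER2    0x08  /* low 16 bits: DVSEC ID */

/* Hypothetical stand-in for a device's config space (little-endian). */
struct fake_cfg {
	uint8_t bytes[CFG_SIZE];
};

static uint16_t rd16(const struct fake_cfg *c, unsigned int off)
{
	return (uint16_t)(c->bytes[off] | (c->bytes[off + 1] << 8));
}

static uint32_t rd32(const struct fake_cfg *c, unsigned int off)
{
	return rd16(c, off) | ((uint32_t)rd16(c, off + 2) << 16);
}

static void wr32(struct fake_cfg *c, unsigned int off, uint32_t val)
{
	for (int i = 0; i < 4; i++)
		c->bytes[off + i] = (uint8_t)(val >> (8 * i));
}

/* Test helper: place a DVSEC at @pos that links to @next (0 ends the list). */
static void put_dvsec(struct fake_cfg *c, uint16_t pos, uint16_t next,
		      uint16_t vendor, uint16_t id)
{
	wr32(c, pos, EXT_CAP_ID_DVSEC | (1u << 16) | ((uint32_t)next << 20));
	wr32(c, pos + DVSEC_HEADER1, vendor);
	wr32(c, pos + DVSEC_HEADER2, id);
}

/* Find the next capability with @id after @start (0 = from the beginning). */
static uint16_t find_next_ext_cap(const struct fake_cfg *c, uint16_t start,
				  uint16_t id)
{
	uint16_t pos = start ? start : EXT_CAP_START;
	int ttl = (CFG_SIZE - EXT_CAP_START) / 8; /* bound malformed lists */
	uint32_t hdr = rd32(c, pos);

	if (hdr == 0)
		return 0;

	while (ttl-- > 0) {
		if ((hdr & 0xffff) == id && pos != start)
			return pos;
		pos = (hdr >> 20) & 0xffc;
		if (pos < EXT_CAP_START)
			break;
		hdr = rd32(c, pos);
	}
	return 0;
}

/* Return the offset of the DVSEC matching @vendor and @dvsec, or 0. */
static uint16_t find_dvsec_sketch(const struct fake_cfg *c, uint16_t vendor,
				  uint16_t dvsec)
{
	uint16_t pos = find_next_ext_cap(c, 0, EXT_CAP_ID_DVSEC);

	while (pos) {
		if (rd16(c, pos + DVSEC_HEADER1) == vendor &&
		    rd16(c, pos + DVSEC_HEADER2) == dvsec)
			return pos;
		pos = find_next_ext_cap(c, pos, EXT_CAP_ID_DVSEC);
	}
	return 0;
}
```

A device can carry several DVSEC capabilities, so matching on the capability ID alone is not enough; the loop has to compare both the vendor ID and the DVSEC ID before returning an offset.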



Re: Duplicated ABI entries - Was: Re: [PATCH v2 20/39] docs: ABI: testing: make the files compatible with ReST output

2020-11-15 Thread Jonathan Cameron
On Tue, 10 Nov 2020 08:26:58 +0100
Mauro Carvalho Chehab  wrote:

> Hi Jonathan,
> 
> Em Sun, 8 Nov 2020 16:56:21 +
> Jonathan Cameron  escreveu:
> 
> > > PS.: the IIO subsystem is the one that currently has more duplicated
> > > ABI entries:  
> > > $ ./scripts/get_abi.pl validate 2>&1|grep iio
> > > Warning: /sys/bus/iio/devices/iio:deviceX/in_accel_x_calibbias is defined 
> > > 2 times:  Documentation/ABI/testing/sysfs-bus-iio-icm42600:0  
> > > Documentation/ABI/testing/sysfs-bus-iio:394
> > > Warning: /sys/bus/iio/devices/iio:deviceX/in_accel_y_calibbias is defined 
> > > 2 times:  Documentation/ABI/testing/sysfs-bus-iio-icm42600:1  
> > > Documentation/ABI/testing/sysfs-bus-iio:395
> > > Warning: /sys/bus/iio/devices/iio:deviceX/in_accel_z_calibbias is defined 
> > > 2 times:  Documentation/ABI/testing/sysfs-bus-iio-icm42600:2  
> > > Documentation/ABI/testing/sysfs-bus-iio:396
> > > Warning: /sys/bus/iio/devices/iio:deviceX/in_anglvel_x_calibbias is 
> > > defined 2 times:  Documentation/ABI/testing/sysfs-bus-iio-icm42600:3  
> > > Documentation/ABI/testing/sysfs-bus-iio:397
> > > Warning: /sys/bus/iio/devices/iio:deviceX/in_anglvel_y_calibbias is 
> > > defined 2 times:  Documentation/ABI/testing/sysfs-bus-iio-icm42600:4  
> > > Documentation/ABI/testing/sysfs-bus-iio:398
> > > Warning: /sys/bus/iio/devices/iio:deviceX/in_anglvel_z_calibbias is 
> > > defined 2 times:  Documentation/ABI/testing/sysfs-bus-iio-icm42600:5  
> > > Documentation/ABI/testing/sysfs-bus-iio:399
> > > Warning: /sys/bus/iio/devices/iio:deviceX/in_count0_preset is defined 2 
> > > times:  Documentation/ABI/testing/sysfs-bus-iio-timer-stm32:100  
> > > Documentation/ABI/testing/sysfs-bus-iio-lptimer-stm32:0
> > > Warning: /sys/bus/iio/devices/iio:deviceX/in_count0_quadrature_mode is 
> > > defined 2 times:  Documentation/ABI/testing/sysfs-bus-iio-timer-stm32:117 
> > >  Documentation/ABI/testing/sysfs-bus-iio-lptimer-stm32:14
> > > Warning: 
> > > /sys/bus/iio/devices/iio:deviceX/in_count_quadrature_mode_available is 
> > > defined 3 times:  
> > > Documentation/ABI/testing/sysfs-bus-iio-counter-104-quad-8:2  
> > > Documentation/ABI/testing/sysfs-bus-iio-timer-stm32:111  
> > > Documentation/ABI/testing/sysfs-bus-iio-lptimer-stm32:8
> > > Warning: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_frequency is 
> > > defined 2 times:  
> > > Documentation/ABI/testing/sysfs-bus-iio-frequency-adf4371:0  
> > > Documentation/ABI/testing/sysfs-bus-iio:599
> > > Warning: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_powerdown is 
> > > defined 2 times:  
> > > Documentation/ABI/testing/sysfs-bus-iio-frequency-adf4371:36  
> > > Documentation/ABI/testing/sysfs-bus-iio:588
> > > Warning: /sys/bus/iio/devices/iio:deviceX/out_currentY_raw is defined 2 
> > > times:  Documentation/ABI/testing/sysfs-bus-iio-light-lm3533-als:43  
> > > Documentation/ABI/testing/sysfs-bus-iio-health-afe440x:38
> > > Warning: /sys/bus/iio/devices/iio:deviceX/out_current_heater_raw is 
> > > defined 2 times:  
> > > Documentation/ABI/testing/sysfs-bus-iio-humidity-hdc2010:0  
> > > Documentation/ABI/testing/sysfs-bus-iio-humidity-hdc100x:0
> > > Warning: 
> > > /sys/bus/iio/devices/iio:deviceX/out_current_heater_raw_available is 
> > > defined 2 times:  
> > > Documentation/ABI/testing/sysfs-bus-iio-humidity-hdc2010:1  
> > > Documentation/ABI/testing/sysfs-bus-iio-humidity-hdc100x:1
> > > Warning: /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity is defined 2 
> > > times:  Documentation/ABI/testing/sysfs-bus-iio-distance-srf08:0  
> > > Documentation/ABI/testing/sysfs-bus-iio-proximity-as3935:8
> > > Warning: /sys/bus/iio/devices/triggerX/sampling_frequency is defined 2 
> > > times:  Documentation/ABI/testing/sysfs-bus-iio-timer-stm32:92  
> > > Documentation/ABI/testing/sysfs-bus-iio:45
> 
> > 
> > That was intentional.  Often these provide more information on the
> > ABI for a particular device than is present in the base ABI doc.  
> 
> FYI, right now, there are 20 duplicated entries, 16 of them
> from IIO, in these files:
> 
>   $ ./scripts/get_abi.pl validate 2>&1|perl -ne 'if (m,(Documentation/\S+)\:,g) { print "$1\n" }'|sort|uniq
>   Documentation/ABI/stable/sysfs-driver-w1_ds28e04
>   Documentation/ABI/testing/sysfs-bus-iio-counter-104-quad-8
>   Documentation/ABI/testing/sysfs-bus-iio-distance-srf08
>

Re: [PATCH v2 20/39] docs: ABI: testing: make the files compatible with ReST output

2020-11-08 Thread Jonathan Cameron
On Mon, 2 Nov 2020 15:42:50 +0100
Mauro Carvalho Chehab  wrote:

> Em Mon, 2 Nov 2020 13:46:41 +0100
> Greg Kroah-Hartman  escreveu:
> 
> > On Mon, Nov 02, 2020 at 12:04:36PM +0100, Fabrice Gasnier wrote:  
> > > On 10/30/20 11:09 AM, Mauro Carvalho Chehab wrote:
> > > > Em Fri, 30 Oct 2020 10:19:12 +0100
> > > > Fabrice Gasnier  escreveu:
> > > > 
> > > >> Hi Mauro,
> > > >>
> > > >> [...]
> > > >>
> > > >>>  
> > > >>> +What:
> > > >>> /sys/bus/iio/devices/iio:deviceX/in_count_quadrature_mode_available
> > > >>> +KernelVersion:   4.12
> > > >>> +Contact: benjamin.gaign...@st.com
> > > >>> +Description:
> > > >>> + Reading returns the list possible quadrature modes.
> > > >>> +
> > > >>> +What:
> > > >>> /sys/bus/iio/devices/iio:deviceX/in_count0_quadrature_mode
> > > >>> +KernelVersion:   4.12
> > > >>> +Contact: benjamin.gaign...@st.com
> > > >>> +Description:
> > > >>> + Configure the device counter quadrature modes:
> > > >>> +
> > > >>> + channel_A:
> > > >>> + Encoder A input serves as the count input and B as
> > > >>> + the UP/DOWN direction control input.
> > > >>> +
> > > >>> + channel_B:
> > > >>> + Encoder B input serves as the count input and A as
> > > >>> + the UP/DOWN direction control input.
> > > >>> +
> > > >>> + quadrature:
> > > >>> + Encoder A and B inputs are mixed to get direction
> > > >>> + and count with a scale of 0.25.
> > > >>> +  
> > > >>
> > > > 
> > > > Hi Fabrice,
> > > > 
> > > >> I just noticed that since Jonathan question in v1.
> > > >>
> > > >> Above ABI has been moved in the past as discussed in [1]. You can take a
> > > >> look at:
> > > >> b299d00 IIO: stm32: Remove quadrature related functions from trigger driver
> > > >>
> > > >> Could you please remove the above chunk ?
> > > >>
> > > >> With that, for the stm32 part:
> > > >> Acked-by: Fabrice Gasnier 
> > > > 
> > > > 
> > > > Hmm... probably those were re-introduced due to a rebase. This
> > > > series were originally written about 1,5 years ago.
> > > > 
> > > > I'll drop those hunks.
> > > 
> > > Hi Mauro, Greg,
> > > 
> > > I just figured out this patch has been applied with above hunk.
> > > 
> > > This should be dropped: is there a fix on its way already ?
> > > (I may have missed it)
> > 
> > Can you send a fix for just this hunk?  
> 
> Hmm...
> 
>   $ git grep /sys/bus/iio/devices/iio:deviceX/in_count_quadrature_mode_available
>   Documentation/ABI/testing/sysfs-bus-iio-counter-104-quad-8:What: /sys/bus/iio/devices/iio:deviceX/in_count_quadrature_mode_available
>   Documentation/ABI/testing/sysfs-bus-iio-lptimer-stm32:What: /sys/bus/iio/devices/iio:deviceX/in_count_quadrature_mode_available
>   Documentation/ABI/testing/sysfs-bus-iio-timer-stm32:What: /sys/bus/iio/devices/iio:deviceX/in_count_quadrature_mode_available
> 
> Even re-doing the changes from changeset b299d00420e2 ("IIO: stm32: Remove
> quadrature related functions from trigger driver")
> at Documentation/ABI/testing/sysfs-bus-iio-timer-stm32, there's still
> a third duplicate of some of those, as reported by the script:
> 
>   $ ./scripts/get_abi.pl validate 2>&1|grep quadra
>   Warning: /sys/bus/iio/devices/iio:deviceX/in_count0_quadrature_mode is 
> defined 2 times:  Documentation/ABI/testing/sysfs-bus-iio-timer-stm32:117  
> Documentation/ABI/testing/sysfs-bus-iio-lptimer-stm32:14
>   Warning: 
> /sys/bus/iio/devices/iio:deviceX/in_count_quadrature_mode_available is 
> defined 3 times:  
> Documentation/ABI/testing/sysfs-bus-iio-counter-104-quad-8:2  
> Documentation/ABI/testing/sysfs-bus-iio-timer-stm32:111  
> Documentation/ABI/testing/sysfs-bus-iio-lptimer-stm32:8
> 
> As in_count_quadrature_mode_available is also defined at:
>   Documentation/ABI/testing/sysfs-bus-iio-counter-104-quad-8:2
> 
> The best option here seems to be a patch that also drops the other
> duplicate of this, probably by moving in_count_quadrature_mode_available
> to a generic node inside Documentation/ABI/testing/sysfs-bus-iio.

In this particular case it may be valid to do that, but in general it
can't be done without losing information - see below.

> 
> Comments?
> 
> Thanks,
> Mauro
> 
> PS.: the IIO subsystem is the one that currently has more duplicated
> ABI entries:

That was intentional.  Often these provide more information on the
ABI for a particular device than is present in the base ABI doc.

A bit like when we have additional description for dt binding properties
for a particular device, even though they are standard properties.

Often a standard property allows for more values than the specific
one for a particular device.  There 

Re: [PATCH 20/33] docs: ABI: testing: make the files compatible with ReST output

2020-10-29 Thread Jonathan Cameron
On Wed, 28 Oct 2020 15:23:18 +0100
Mauro Carvalho Chehab  wrote:

> From: Mauro Carvalho Chehab 
> 
> Some files over there won't parse well by Sphinx.
> 
> Fix them.
> 
> Signed-off-by: Mauro Carvalho Chehab 
> Signed-off-by: Mauro Carvalho Chehab 

Query below...  I'm going to guess a rebase issue?

Other than that
Acked-by: Jonathan Cameron  # for IIO


> diff --git a/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32 
> b/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
> index b7259234ad70..a10a4de3e5fe 100644
> --- a/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
> +++ b/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
> @@ -3,67 +3,85 @@ KernelVersion:  4.11
>  Contact: benjamin.gaign...@st.com
>  Description:
>   Reading returns the list possible master modes which are:
> - - "reset" : The UG bit from the TIMx_EGR register is
> +
> +
> + - "reset"
> + The UG bit from the TIMx_EGR register is
>   used as trigger output (TRGO).
> - - "enable": The Counter Enable signal CNT_EN is used
> + - "enable"
> + The Counter Enable signal CNT_EN is used
>   as trigger output.
> - - "update": The update event is selected as trigger output.
> + - "update"
> + The update event is selected as trigger output.
>   For instance a master timer can then be used
>   as a prescaler for a slave timer.
> - - "compare_pulse" : The trigger output send a positive pulse
> - when the CC1IF flag is to be set.
> - - "OC1REF": OC1REF signal is used as trigger output.
> - - "OC2REF": OC2REF signal is used as trigger output.
> - - "OC3REF": OC3REF signal is used as trigger output.
> - - "OC4REF": OC4REF signal is used as trigger output.
> + - "compare_pulse"
> + The trigger output send a positive pulse
> + when the CC1IF flag is to be set.
> + - "OC1REF"
> + OC1REF signal is used as trigger output.
> + - "OC2REF"
> + OC2REF signal is used as trigger output.
> + - "OC3REF"
> + OC3REF signal is used as trigger output.
> + - "OC4REF"
> + OC4REF signal is used as trigger output.
> +
>   Additional modes (on TRGO2 only):
> - - "OC5REF": OC5REF signal is used as trigger output.
> - - "OC6REF": OC6REF signal is used as trigger output.
> +
> + - "OC5REF"
> + OC5REF signal is used as trigger output.
> + - "OC6REF"
> + OC6REF signal is used as trigger output.
>   - "compare_pulse_OC4REF":
> -   OC4REF rising or falling edges generate pulses.
> + OC4REF rising or falling edges generate pulses.
>   - "compare_pulse_OC6REF":
> -   OC6REF rising or falling edges generate pulses.
> + OC6REF rising or falling edges generate pulses.
>   - "compare_pulse_OC4REF_r_or_OC6REF_r":
> -   OC4REF or OC6REF rising edges generate pulses.
> + OC4REF or OC6REF rising edges generate pulses.
>   - "compare_pulse_OC4REF_r_or_OC6REF_f":
> -   OC4REF rising or OC6REF falling edges generate pulses.
> + OC4REF rising or OC6REF falling edges generate
> + pulses.
>   - "compare_pulse_OC5REF_r_or_OC6REF_r":
> -   OC5REF or OC6REF rising edges generate pulses.
> + OC5REF or OC6REF rising edges generate pulses.
>   - "compare_pulse_OC5REF_r_or_OC6REF_f":
> -   OC5REF rising or OC6REF falling edges generate pulses.
> + OC5REF rising or OC6REF falling edges generate
> + pulses.
>  
> -  

Re: [PATCH v3 8/8] mm/vmalloc: Hugepage vmalloc mappings

2020-08-12 Thread Jonathan Cameron
On Wed, 12 Aug 2020 13:25:24 +0100
Jonathan Cameron  wrote:

> On Mon, 10 Aug 2020 12:27:32 +1000
> Nicholas Piggin  wrote:
> 
> > On platforms that define HAVE_ARCH_HUGE_VMAP and support PMD vmaps,
> > vmalloc will attempt to allocate PMD-sized pages first, before falling
> > back to small pages.
> > 
> > Allocations which use something other than PAGE_KERNEL protections are
> > not permitted to use huge pages yet, not all callers expect this (e.g.,
> > module allocations vs strict module rwx).
> > 
> > This reduces TLB misses by nearly 30x on a `git diff` workload on a
> > 2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
> > 
> > This can result in more internal fragmentation and memory overhead for a
> > given allocation, an option nohugevmap is added to disable at boot.
> > 
> > Signed-off-by: Nicholas Piggin   
> Hi Nicholas,
> 
> Busy afternoon, but a possible point of interest in line in the meantime.
> 

I did manage to get back to this.

The issue, I think, is that ARM64 defines THREAD_ALIGN with CONFIG_VMAP_STACK
to be 2 * THREAD_SIZE.  There is a comment in arch/arm64/include/asm/memory.h
saying that this is to allow cheap checking of overflow.

A quick grep suggests ARM64 is the only architecture to do this...
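
To illustrate why that alignment makes the check cheap, here is a rough
user-space sketch, not the kernel code.  The THREAD_SHIFT value and the
helper name are assumptions for illustration only.  With the stack base
aligned to 2 * THREAD_SIZE, a single bit of the stack pointer says whether
we have run off the bottom of the stack.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration: 16K stacks. */
#define THREAD_SHIFT	14
#define THREAD_SIZE	(UINT64_C(1) << THREAD_SHIFT)
#define THREAD_ALIGN	(2 * THREAD_SIZE)

/*
 * Hypothetical helper.  If the stack base is aligned to THREAD_ALIGN,
 * every stack pointer in [base, base + THREAD_SIZE) has the THREAD_SHIFT
 * bit clear.  Dropping below base lands in the preceding THREAD_SIZE
 * window, where that bit is set, so one bit test detects the overflow.
 */
static bool sp_overflowed(uint64_t sp)
{
	return (sp & THREAD_SIZE) != 0;
}
```

For a base of 0x20000 (32K aligned), 0x23ff8 is in range and passes,
while 0x1fff8 (just below the base) is flagged.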

Jonathan



> 
> ...
> 
> > @@ -2701,22 +2760,45 @@ void *__vmalloc_node_range(unsigned long size, 
> > unsigned long align,
> > pgprot_t prot, unsigned long vm_flags, int node,
> > const void *caller)
> >  {
> > -   struct vm_struct *area;
> > +   struct vm_struct *area = NULL;
> > void *addr;
> > unsigned long real_size = size;
> > +   unsigned long real_align = align;
> > +   unsigned int shift = PAGE_SHIFT;
> >  
> > size = PAGE_ALIGN(size);
> > if (!size || (size >> PAGE_SHIFT) > totalram_pages())
> > goto fail;
> >  
> > -   area = __get_vm_area_node(real_size, align, VM_ALLOC | VM_UNINITIALIZED 
> > |
> > +   if (vmap_allow_huge && (pgprot_val(prot) == pgprot_val(PAGE_KERNEL))) {
> > +   unsigned long size_per_node;
> > +
> > +   /*
> > +* Try huge pages. Only try for PAGE_KERNEL allocations,
> > +* others like modules don't yet expect huge pages in
> > +* their allocations due to apply_to_page_range not
> > +* supporting them.
> > +*/
> > +
> > +   size_per_node = size;
> > +   if (node == NUMA_NO_NODE)
> > +   size_per_node /= num_online_nodes();
> > +   if (size_per_node >= PMD_SIZE)
> > +   shift = PMD_SHIFT;
> > +   }
> > +
> > +again:
> > +   align = max(real_align, 1UL << shift);
> > +   size = ALIGN(real_size, align);  
> 
> So my suspicion is that the issue on arm64 is related to this.
> In the relevant call path, align is 32K whilst the size is 16K
> 
> Previously I don't think we forced size to be a multiple of align.
> 
> I think this results in nr_pages being double what it was before.
> 
> 
> > +
> > +   area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
> > vm_flags, start, end, node, gfp_mask, caller);
> > if (!area)
> > goto fail;
> >  
> > -   addr = __vmalloc_area_node(area, gfp_mask, prot, node);
> > +   addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
> > if (!addr)
> > -   return NULL;
> > +   goto fail;
> >  
> > /*
> >  * In this function, newly allocated vm_struct has VM_UNINITIALIZED
> > @@ -2730,8 +2812,16 @@ void *__vmalloc_node_range(unsigned long size, 
> > unsigned long align,
> > return addr;
> >  
> >  fail:
> > -   warn_alloc(gfp_mask, NULL,
> > +   if (shift > PAGE_SHIFT) {
> > +   shift = PAGE_SHIFT;
> > +   goto again;
> > +   }
> > +
> > +   if (!area) {
> > +   /* Warn for area allocation, page allocations already warn */
> > +   warn_alloc(gfp_mask, NULL,
> >   "vmalloc: allocation failure: %lu bytes", real_size);
> > +   }
> > return NULL;
> >  }
> >
> 
> 
> 




Re: [PATCH v3 8/8] mm/vmalloc: Hugepage vmalloc mappings

2020-08-12 Thread Jonathan Cameron
On Mon, 10 Aug 2020 12:27:32 +1000
Nicholas Piggin  wrote:

> On platforms that define HAVE_ARCH_HUGE_VMAP and support PMD vmaps,
> vmalloc will attempt to allocate PMD-sized pages first, before falling
> back to small pages.
> 
> Allocations which use something other than PAGE_KERNEL protections are
> not permitted to use huge pages yet, not all callers expect this (e.g.,
> module allocations vs strict module rwx).
> 
> This reduces TLB misses by nearly 30x on a `git diff` workload on a
> 2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
> 
> This can result in more internal fragmentation and memory overhead for a
> given allocation, an option nohugevmap is added to disable at boot.
> 
> Signed-off-by: Nicholas Piggin 
Hi Nicholas,

Busy afternoon, but a possible point of interest in line in the meantime.


...

> @@ -2701,22 +2760,45 @@ void *__vmalloc_node_range(unsigned long size, 
> unsigned long align,
>   pgprot_t prot, unsigned long vm_flags, int node,
>   const void *caller)
>  {
> - struct vm_struct *area;
> + struct vm_struct *area = NULL;
>   void *addr;
>   unsigned long real_size = size;
> + unsigned long real_align = align;
> + unsigned int shift = PAGE_SHIFT;
>  
>   size = PAGE_ALIGN(size);
>   if (!size || (size >> PAGE_SHIFT) > totalram_pages())
>   goto fail;
>  
> - area = __get_vm_area_node(real_size, align, VM_ALLOC | VM_UNINITIALIZED 
> |
> + if (vmap_allow_huge && (pgprot_val(prot) == pgprot_val(PAGE_KERNEL))) {
> + unsigned long size_per_node;
> +
> + /*
> +  * Try huge pages. Only try for PAGE_KERNEL allocations,
> +  * others like modules don't yet expect huge pages in
> +  * their allocations due to apply_to_page_range not
> +  * supporting them.
> +  */
> +
> + size_per_node = size;
> + if (node == NUMA_NO_NODE)
> + size_per_node /= num_online_nodes();
> + if (size_per_node >= PMD_SIZE)
> + shift = PMD_SHIFT;
> + }
> +
> +again:
> + align = max(real_align, 1UL << shift);
> + size = ALIGN(real_size, align);

So my suspicion is that the issue on arm64 is related to this.
In the relevant call path, align is 32K whilst the size is 16K.

Previously I don't think we forced size to be a multiple of align.

I think this results in nr_pages being double what it was before.
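
A quick user-space sketch of the arithmetic I have in mind follows.  It
assumes 4K pages and mirrors the two new lines after the again: label;
the helper names are made up for illustration and ALIGN_UP() is a local
stand-in for the kernel's ALIGN().

```c
#define PAGE_SHIFT	12	/* assumption: 4K pages */

/* Round x up to a multiple of a, where a is a power of two. */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/* Hypothetical helper mirroring the align/size computation in the patch. */
static unsigned long aligned_size(unsigned long real_size,
				  unsigned long real_align,
				  unsigned int shift)
{
	unsigned long page_align = 1UL << shift;
	unsigned long align = real_align > page_align ? real_align : page_align;

	return ALIGN_UP(real_size, align);
}

/* Number of small pages backing an allocation of the given size. */
static unsigned long nr_pages_for(unsigned long size)
{
	return size >> PAGE_SHIFT;
}
```

With real_size = 16K and real_align = 32K, aligned_size() gives 32K, so
nr_pages_for() goes from 4 to 8 pages for the same request.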


> +
> + area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
>   vm_flags, start, end, node, gfp_mask, caller);
>   if (!area)
>   goto fail;
>  
> - addr = __vmalloc_area_node(area, gfp_mask, prot, node);
> + addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
>   if (!addr)
> - return NULL;
> + goto fail;
>  
>   /*
>* In this function, newly allocated vm_struct has VM_UNINITIALIZED
> @@ -2730,8 +2812,16 @@ void *__vmalloc_node_range(unsigned long size, 
> unsigned long align,
>   return addr;
>  
>  fail:
> - warn_alloc(gfp_mask, NULL,
> + if (shift > PAGE_SHIFT) {
> + shift = PAGE_SHIFT;
> + goto again;
> + }
> +
> + if (!area) {
> + /* Warn for area allocation, page allocations already warn */
> + warn_alloc(gfp_mask, NULL,
> "vmalloc: allocation failure: %lu bytes", real_size);
> + }
>   return NULL;
>  }
>  




Re: [PATCH v3 0/8] huge vmalloc mappings

2020-08-11 Thread Jonathan Cameron
On Mon, 10 Aug 2020 12:27:24 +1000
Nicholas Piggin  wrote:

> Not tested on x86 or arm64, would appreciate a quick test there so I can
> ask Andrew to put it in -mm. Other option is I can disable huge vmallocs
> for them for the time being.

Hi Nicholas,

For arm64 testing with a Kunpeng920.

I ran a quick sanity test with this series on top of mainline (yes, mid merge
window, so who knows what state it is in...).  Could I be missing some dependency?

Without them it boots, with them it doesn't.  Any immediate guesses?

[0.069507] Dentry cache hash table entries: 33554432 (order: 16, 268435456 bytes, vmalloc)
[0.087134] Inode-cache hash table entries: 16777216 (order: 15, 134217728 bytes, vmalloc)
[0.097044] Mount-cache hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc)
[0.106534] Mountpoint-cache hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc)
[0.116349] [ cut here ]   
[0.121465] kernel BUG at kernel/fork.c:402!
[0.126194] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[0.132273] Modules linked in:
[0.135653] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.8.0-12307-g2b197e00c338 #637
[0.144240] pstate: 2009 (nzCv daif -PAN -UAO BTYPE=--)
[0.150420] pc : copy_process+0x10c0/0x1690
[0.155049] lr : copy_process+0x2e0/0x1690
[0.159584] sp : d96c55773d60
[0.163250] x29: d96c55773d70 x28: 20bf8706
[0.169134] x27: 00800300 x26: 
[0.175018] x25: 8000108a8000 x24: d96c55a32708
[0.180901] x23: 20bf87043800 x22: 
[0.186787] x21:  x20: d96c55773ef0
[0.192672] x19: d96c55783bc0 x18: 0010
[0.198557] x17: 855c858e x16: a8256fca
[0.204441] x15:  x14: 
[0.210327] x13: 800010901000 x12: 8000108b1000
[0.216212] x11: 0001 x10: d96c55a6d000
[0.222096] x9 : d96c53bf7594 x8 : 0041
[0.227980] x7 : 004fa6b0 x6 : 800010aa8000
[0.233864] x5 : fffd x4 : 
[0.239748] x3 : d96c55a63598 x2 : 0001
[0.245632] x1 : d96c55783bc0 x0 : 0008
[0.251519] Call trace:
[0.254221]  copy_process+0x10c0/0x1690
[0.258466]  _do_fork+0x98/0x488
[0.262036]  kernel_thread+0x6c/0x90
[0.265997]  rest_init+0x38/0xf0
[0.269568]  arch_call_rest_init+0x18/0x24
[0.274105]  start_kernel+0x60c/0x644
[0.278159] Code: f000a441 f943f421 cb01 17e1 (d421)
[0.284961] ---[ end trace 985361e2cb97a0d9 ]---
[0.290073] Kernel panic - not syncing: Attempted to kill the idle task!
[0.297532] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Thanks,

Jonathan


> 
> Since v2:
> - Rebased on vmalloc cleanups, split series into simpler pieces.
> - Fixed several compile errors and warnings
> - Keep the page array and accounting in small page units because
>   struct vm_struct is an interface (this should fix x86 vmap stack debug
>   assert). [Thanks Zefan]
> 
> Nicholas Piggin (8):
>   mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
>   mm: apply_to_pte_range warn and fail if a large pte is encountered
>   mm/vmalloc: rename vmap_*_range vmap_pages_*_range
>   lib/ioremap: rename ioremap_*_range to vmap_*_range
>   mm: HUGE_VMAP arch support cleanup
>   mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
>   mm/vmalloc: add vmap_range_noflush variant
>   mm/vmalloc: Hugepage vmalloc mappings
> 
>  .../admin-guide/kernel-parameters.txt |   2 +
>  arch/arm64/mm/mmu.c   |  10 +-
>  arch/powerpc/mm/book3s64/radix_pgtable.c  |   8 +-
>  arch/x86/mm/ioremap.c |  10 +-
>  include/linux/io.h|   9 -
>  include/linux/vmalloc.h   |  13 +
>  init/main.c   |   1 -
>  mm/ioremap.c  | 231 +
>  mm/memory.c   |  60 ++-
>  mm/vmalloc.c  | 442 +++---
>  10 files changed, 453 insertions(+), 333 deletions(-)
> 



Re: [PATCH 04/15] arm64: numa: simplify dummy_numa_init()

2020-07-29 Thread Jonathan Cameron
On Tue, 28 Jul 2020 08:11:42 +0300
Mike Rapoport  wrote:

> From: Mike Rapoport 
> 
> dummy_numa_init() loops over memblock.memory and passes nid=0 to
> numa_add_memblk() which essentially wraps memblock_set_node(). However,
> memblock_set_node() can cope with entire memory span itself, so the loop
> over memblock.memory regions is redundant.
> 
> Replace the loop with a single call to memblock_set_node() to the entire
> memory.

Hi Mike,

I had a similar patch I was going to post shortly so can add a bit more
on the advantages of this one.

Beyond cleaning up, it also fixes an issue with buggy ACPI firmware in which
the SRAT table covers some but not all of the memory in the EFI memory map.
Stealing bits from the draft cover letter I had for that...

> This issue can be easily triggered by having an SRAT table which fails
> to cover all elements of the EFI memory map.
> 
> This firmware error is detected and a warning printed. e.g.
> "NUMA: Warning: invalid memblk node 64 [mem 0x24000-0x27fff]"
> At that point we fall back to dummy_numa_init().
> 
> However, the failed ACPI init has left us with our memblocks all broken
> up as we split them when trying to assign them to NUMA nodes.
> 
> We then iterate over the memblocks and add them to node 0.
> 
> for_each_memblock(memory, mblk) {
>   ret = numa_add_memblk(0, mblk->base, mblk->base + mblk->size);
>   if (!ret)
>   continue;
>   pr_err("NUMA init failed\n");
>   return ret;
> }
> 
> numa_add_memblk() calls memblock_set_node() which merges regions that
> were previously split up during the earlier attempt to add them to different
> nodes during parsing of SRAT.
> 
> This means elements are moved in the memblock array and we can end up
> in a different memblock after the call to numa_add_memblk().
> Result is:
> 
> Unable to handle kernel paging request at virtual address 3a40
> Mem abort info:
>   ESR = 0x9604
>   EC = 0x25: DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
> Data abort info:
>   ISV = 0, ISS = 0x0004
>   CM = 0, WnR = 0
> [3a40] user address but active_mm is swapper
> Internal error: Oops: 9604 [#1] PREEMPT SMP
> 
> ...
> 
> Call trace:
>   sparse_init_nid+0x5c/0x2b0
>   sparse_init+0x138/0x170
>   bootmem_init+0x80/0xe0
>   setup_arch+0x2a0/0x5fc
>   start_kernel+0x8c/0x648
> 
> As an illustrative example:
> EFI table has one block of memory.
> memblks[0] = [0...0x2f]  so we start with a single memblock.
> 
> SRAT has
> [0x00...0x0f] in node 0
> [0x10...0x1f] in node 1
> but no entry covering 
> [0x20...0x2f].
> 
> Whilst parsing SRAT the single memblock is broken into 3.
> memblks[0] = [0x00...0x0f] in node 0
> memblks[1] = [0x10...0x1f] in node 1
> memblks[2] = [0x20...0x2f] in node MAX_NUM_NODES (invalid value)
> 
> A sanity check parse then detects the invalid section and acpi_numa_init
> fails.  We then fall back to the dummy path.
> 
> That iterates over the memblocks.  We'll use i as an index into the array of
> memblocks.
> 
> i = 0;
> memblks[0] = [0x00...0x0f] set to node0.
>merge doesn't do anything because the neighbouring memblock is still in node1.
> 
> i = 1
> memblks[1] = [0x10...0x1f] set to node 0.
>merge combines memblock 0 and 1 to give a new set of memblocks.
> 
> memblks[0] = [0x00..0x1f] in node 0
> memblks[1] = [0x20..0x2f] in node MAX_NUM_NODES.
> 
> i = 2 off the end of the now reduced array of memblocks, so exit the loop.
> (if we restart the loop here everything will be fine).
> 
> Later sparse_init_nid tries to use the node of the second memblock to index
> something and boom.
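
For anyone who wants to poke at the walk-through above, here is a small
user-space model of it.  It is only a sketch: struct region, INVALID_NODE
and all three helpers are stand-ins I made up, not the memblock API.  It
compares the old per-region loop with setting the node over the whole span
in one go.

```c
#include <stddef.h>

#define INVALID_NODE	64	/* stand-in for the "no node" value */

struct region {
	unsigned int base;
	unsigned int size;
	int nid;
};

/* Merge adjacent regions that share a node id; returns the new count. */
static size_t merge_regions(struct region *r, size_t cnt)
{
	size_t i = 0;

	while (i + 1 < cnt) {
		if (r[i].nid == r[i + 1].nid &&
		    r[i].base + r[i].size == r[i + 1].base) {
			r[i].size += r[i + 1].size;
			for (size_t j = i + 1; j + 1 < cnt; j++)
				r[j] = r[j + 1];
			cnt--;
		} else {
			i++;
		}
	}
	return cnt;
}

/* Old behaviour: set node 0 one region at a time, merging after each. */
static size_t set_node_per_region(struct region *r, size_t cnt)
{
	for (size_t i = 0; i < cnt; i++) {
		r[i].nid = 0;
		/* The array may shrink under the iterator here. */
		cnt = merge_regions(r, cnt);
	}
	return cnt;
}

/* New behaviour: set node 0 across the whole span, then merge once. */
static size_t set_node_whole_span(struct region *r, size_t cnt)
{
	for (size_t i = 0; i < cnt; i++)
		r[i].nid = 0;
	return merge_regions(r, cnt);
}
```

Starting from the three regions in the example, the per-region variant
stops with two regions and the second one still on INVALID_NODE, whereas
the whole-span variant ends with a single region on node 0.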


> 
> Signed-off-by: Mike Rapoport 

Acked-by: Jonathan Cameron 

> ---
>  arch/arm64/mm/numa.c | 13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index aafcee3e3f7e..0cbdbcc885fb 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -423,19 +423,16 @@ static int __init numa_init(int (*init_func)(void))
>   */
>  static int __init dummy_numa_init(void)
>  {
> + phys_addr_t start = memblock_start_of_DRAM();
> + phys_addr_t end = memblock_end_of_DRAM();
>   int ret;
> - struct memblock_region *mblk;
>  
>   if (numa_off)
>   pr_info("NUMA disabled\n"); /* Forced off on command line. */
> - pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n",
> - memblock_start_of_DRAM(), memblock_end_of_DRAM() - 1);
> -
> - for_each_memblock(memory, mblk) {
> - ret = numa_add_memblk(0, mblk->base, mblk->base + mblk->size);
> - if (!ret)
> - continue;
> + pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n", start, end - 1);
>  
> + ret = numa_add_memblk(0, start, end);
> + if (ret) {
>   pr_err("NUMA init failed\n");
>   return ret;
>   }




Re: [PATCH v2 22/27] nvdimm/ocxl: Implement the heartbeat command

2020-02-03 Thread Jonathan Cameron
On Tue, 3 Dec 2019 14:46:50 +1100
Alastair D'Silva  wrote:

> From: Alastair D'Silva 
> 
> The heartbeat admin command is a simple admin command that exercises
> the communication mechanisms within the controller.
> 
> This patch issues a heartbeat command to the card during init to ensure
> we can communicate with the card's crontroller.

controller

> 
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/ocxl/scm.c | 43 +++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/drivers/nvdimm/ocxl/scm.c b/drivers/nvdimm/ocxl/scm.c
> index 8a30c887b5ed..e8b34262f397 100644
> --- a/drivers/nvdimm/ocxl/scm.c
> +++ b/drivers/nvdimm/ocxl/scm.c
> @@ -353,6 +353,44 @@ static bool scm_is_usable(const struct scm_data 
> *scm_data)
>   return true;
>  }
>  
> +/**
> + * scm_heartbeat() - Issue a heartbeat command to the controller
> + * @scm_data: a pointer to the SCM device data
> + * Return: 0 if the controller responded correctly, negative on error
> + */
> +static int scm_heartbeat(struct scm_data *scm_data)
> +{
> + int rc;
> +
> + mutex_lock(&scm_data->admin_command.lock);
> +
> + rc = scm_admin_command_request(scm_data, ADMIN_COMMAND_HEARTBEAT);
> + if (rc)
> + goto out;
> +
> + rc = scm_admin_command_execute(scm_data);
> + if (rc)
> + goto out;
> +
> + rc = scm_admin_command_complete_timeout(scm_data, 
> ADMIN_COMMAND_HEARTBEAT);
> + if (rc < 0) {
> + dev_err(&scm_data->dev, "Heartbeat timeout\n");
> + goto out;
> + }
> +
> + rc = scm_admin_response(scm_data);
> + if (rc < 0)
> + goto out;
> + if (rc != STATUS_SUCCESS)
> + scm_warn_status(scm_data, "Unexpected status from heartbeat", 
> rc);
> +
> + rc = scm_admin_response_handled(scm_data);
> +
> +out:
> + mutex_unlock(&scm_data->admin_command.lock);
> + return rc;
> +}
> +
>  /**
>   * allocate_scm_minor() - Allocate a minor number to use for an SCM device
>   * @scm_data: The SCM device to associate the minor with
> @@ -1508,6 +1546,11 @@ static int scm_probe(struct pci_dev *pdev, const 
> struct pci_device_id *ent)
>   goto err;
>   }
>  
> + if (scm_heartbeat(scm_data)) {
> + dev_err(&pdev->dev, "SCM Heartbeat failed\n");
> + goto err;
> + }
> +
>   elapsed = 0;
>   timeout = scm_data->readiness_timeout + 
> scm_data->memory_available_timeout;
>   while (!scm_is_usable(scm_data)) {




Re: [PATCH v2 24/27] nvdimm/ocxl: Implement Overwrite

2020-02-03 Thread Jonathan Cameron
On Tue, 3 Dec 2019 14:46:52 +1100
Alastair D'Silva  wrote:

> From: Alastair D'Silva 
> 
> The near storage command 'Secure Erase' overwrites all data on the
> media.
> 
> This patch hooks it up to the security function 'overwrite'.
> 
> Signed-off-by: Alastair D'Silva 

A few things to tidy up in here.

Thanks,

Jonathan


> ---
>  drivers/nvdimm/ocxl/scm.c  | 164 -
>  drivers/nvdimm/ocxl/scm_internal.c |   1 +
>  drivers/nvdimm/ocxl/scm_internal.h |  17 +++
>  3 files changed, 180 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvdimm/ocxl/scm.c b/drivers/nvdimm/ocxl/scm.c
> index a81eb5916eb3..8deb7862793c 100644
> --- a/drivers/nvdimm/ocxl/scm.c
> +++ b/drivers/nvdimm/ocxl/scm.c
> @@ -169,6 +169,86 @@ static int scm_reserve_metadata(struct scm_data 
> *scm_data,
>   return 0;
>  }
>  
> +/**
> + * scm_overwrite() - Overwrite all data on the card
> + * @scm_data: The SCM device data

I would mention in here that this exits with the lock held on success,
and say where it is unlocked again.

> + * Return: 0 on success
> + */
> +int scm_overwrite(struct scm_data *scm_data)
> +{
> + int rc;
> +
> + mutex_lock(&scm_data->ns_command.lock);
> +
> + rc = scm_ns_command_request(scm_data, NS_COMMAND_SECURE_ERASE);
> + if (rc)

Perhaps change that goto label to reflect it is the error path rather
than a shared exit route.

> + goto out;
> +
> + rc = scm_ns_command_execute(scm_data);
> + if (rc)
> + goto out;
> +
> + scm_data->overwrite_state = SCM_OVERWRITE_BUSY;
> +
> + return 0;
> +
> +out:
> + mutex_unlock(&scm_data->ns_command.lock);
> + return rc;
> +}
> +
> +/**
> + * scm_secop_overwrite() - Overwrite all data on the card
> + * @nvdimm: The nvdimm representation of the SCM device to start the 
> overwrite on
> + * @key_data: Unused (no security key implementation)
> + * Return: 0 on success
> + */
> +static int scm_secop_overwrite(struct nvdimm *nvdimm,
> +const struct nvdimm_key_data *key_data)
> +{
> + struct scm_data *scm_data = nvdimm_provider_data(nvdimm);
> +
> + return scm_overwrite(scm_data);
> +}
> +
> +/**
> + * scm_secop_query_overwrite() - Get the current overwrite state
> + * @nvdimm: The nvdimm representation of the SCM device to start the 
> overwrite on
> + * Return: 0 if successful or idle, -EBUSY if busy, -EFAULT if failed
> + */
> +static int scm_secop_query_overwrite(struct nvdimm *nvdimm)
> +{
> + struct scm_data *scm_data = nvdimm_provider_data(nvdimm);
> +
> + if (scm_data->overwrite_state == SCM_OVERWRITE_BUSY)
> + return -EBUSY;
> +
> + if (scm_data->overwrite_state == SCM_OVERWRITE_FAILED)
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +/**
> + * scm_secop_get_flags() - return the security flags for the SCM device

All params need to be documented in kernel-doc comments.
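
Something along these lines is what I mean.  This is only a sketch: the
parameter descriptions are my guesses, and the forward declaration, the
enum and the _sketch function are stand-ins so it compiles outside the
kernel tree.

```c
/* Stand-in declarations so the sketch builds on its own. */
struct nvdimm;
enum nvdimm_passphrase_type { NVDIMM_USER, NVDIMM_MASTER };

/**
 * scm_secop_get_flags_sketch() - return the security flags for the SCM device
 * @nvdimm: The nvdimm representation of the SCM device
 * @ptype: The passphrase type being queried (unused in this sketch)
 *
 * Return: a bitmask describing the current security state
 */
static unsigned long scm_secop_get_flags_sketch(struct nvdimm *nvdimm,
						enum nvdimm_passphrase_type ptype)
{
	(void)nvdimm;
	(void)ptype;

	return 0;	/* placeholder: the real function reports overwrite state */
}
```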

> + */
> +static unsigned long scm_secop_get_flags(struct nvdimm *nvdimm,
> + enum nvdimm_passphrase_type ptype)
> +{
> + struct scm_data *scm_data = nvdimm_provider_data(nvdimm);
> +
> + if (scm_data->overwrite_state == SCM_OVERWRITE_BUSY)
> + return BIT(NVDIMM_SECURITY_OVERWRITE);
> +
> + return BIT(NVDIMM_SECURITY_DISABLED);
> +}
> +
> +static const struct nvdimm_security_ops sec_ops  = {
> + .get_flags = scm_secop_get_flags,
> + .overwrite = scm_secop_overwrite,
> + .query_overwrite = scm_secop_query_overwrite,
> +};
> +
>  /**
>   * scm_register_lpc_mem() - Discover persistent memory on a device and 
> register it with the NVDIMM subsystem
>   * @scm_data: The SCM device data
> @@ -224,10 +304,10 @@ static int scm_register_lpc_mem(struct scm_data 
> *scm_data)
>   set_bit(NDD_ALIASING, &nvdimm_flags);
>  
>   snprintf(serial, sizeof(serial), "%llx", fn_config->serial);
> - nd_mapping_desc.nvdimm = nvdimm_create(scm_data->nvdimm_bus, scm_data,
> + nd_mapping_desc.nvdimm = __nvdimm_create(scm_data->nvdimm_bus, scm_data,
>scm_dimm_attribute_groups,
>nvdimm_flags, nvdimm_cmd_mask,
> -  0, NULL);
> +  0, NULL, serial, &sec_ops);
>   if (!nd_mapping_desc.nvdimm)
>   return -ENOMEM;
>  
> @@ -1530,6 +1610,83 @@ static void scm_dump_error_log(struct scm_data 
> *scm_data)
>   kfree(buf);
>  }
>  
> +static void scm_handle_nscra_doorbell(struct scm_data *scm_data)
> +{
> + int rc;
> +
> + if (scm_data->ns_command.op_code == NS_COMMAND_SECURE_ERASE) {

Feels likely that we are going to end up with quite a few blocks like this as
the driver is extended. Perhaps just start out with a switch statement and
separate functions that it calls?
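
Roughly the shape I have in mind is below.  It is a stand-alone sketch
with made-up names throughout (the struct, the op code values and the
handlers are all hypothetical), just to show the dispatch structure.

```c
#include <stdint.h>

/* Hypothetical op codes; the values are invented for this sketch. */
enum sketch_ns_op {
	SKETCH_OP_SECURE_ERASE = 1,
	SKETCH_OP_SOMETHING_ELSE = 2,
};

/* Minimal stand-in for the driver state. */
struct scm_sketch {
	uint8_t op_code;
	int erase_handled;
};

/* One small handler per command keeps the doorbell function short. */
static int sketch_handle_secure_erase(struct scm_sketch *scm)
{
	scm->erase_handled = 1;
	return 0;
}

static int sketch_handle_nscra_doorbell(struct scm_sketch *scm)
{
	switch (scm->op_code) {
	case SKETCH_OP_SECURE_ERASE:
		return sketch_handle_secure_erase(scm);
	default:
		/* Commands without a handler yet are reported to the caller. */
		return -1;
	}
}
```

New commands then only need a new case and a new handler, rather than
another nested if block in the doorbell function.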

> + u64 success, attempted;
> +

One is enough here.

> +
> + rc = scm_ns_response(scm_data);
> + if (rc < 0) {
> + scm_data->overwrite_state = SCM_OVERWRITE_FAILED;

If 

Re: [PATCH v2 14/27] nvdimm/ocxl: Add support for near storage commands

2020-02-03 Thread Jonathan Cameron
On Tue, 3 Dec 2019 14:46:42 +1100
Alastair D'Silva  wrote:

> From: Alastair D'Silva 
> 
> Similar to the previous patch, this adds support for near storage commands.
> 
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/ocxl/scm.c  |  6 +
>  drivers/nvdimm/ocxl/scm_internal.c | 41 ++
>  drivers/nvdimm/ocxl/scm_internal.h | 38 +++
>  3 files changed, 85 insertions(+)
> 
> diff --git a/drivers/nvdimm/ocxl/scm.c b/drivers/nvdimm/ocxl/scm.c
> index 1e175f3c3cf2..6c16ca7fabfa 100644
> --- a/drivers/nvdimm/ocxl/scm.c
> +++ b/drivers/nvdimm/ocxl/scm.c
> @@ -310,12 +310,18 @@ static int scm_setup_command_metadata(struct scm_data 
> *scm_data)
>   int rc;
>  
>   mutex_init(&scm_data->admin_command.lock);
> + mutex_init(&scm_data->ns_command.lock);
>  
>   rc = scm_extract_command_metadata(scm_data, GLOBAL_MMIO_ACMA_CREQO,
> &scm_data->admin_command);
>   if (rc)
>   return rc;
>  
> + rc = scm_extract_command_metadata(scm_data, GLOBAL_MMIO_NSCMA_CREQO,
> +   &scm_data->ns_command);
> + if (rc)
> + return rc;
> +

Ah. So much for my comment in previous patch.  Ignore that...

>   return 0;
>  }
>  
> diff --git a/drivers/nvdimm/ocxl/scm_internal.c 
> b/drivers/nvdimm/ocxl/scm_internal.c
> index 7b11b56863fb..c405f1d8afb8 100644
> --- a/drivers/nvdimm/ocxl/scm_internal.c
> +++ b/drivers/nvdimm/ocxl/scm_internal.c
> @@ -132,6 +132,47 @@ int scm_admin_response_handled(const struct scm_data 
> *scm_data)
> OCXL_LITTLE_ENDIAN, GLOBAL_MMIO_CHI_ACRA);
>  }
>  
> +int scm_ns_command_request(struct scm_data *scm_data, u8 op_code)
> +{
> + u64 val;
> + int rc = ocxl_global_mmio_read64(scm_data->ocxl_afu, GLOBAL_MMIO_CHI,
> +  OCXL_LITTLE_ENDIAN, &val);
> + if (rc)
> + return rc;
> +
> + if (!(val & GLOBAL_MMIO_CHI_NSCRA))
> + return -EBUSY;
> +
> + return scm_command_request(scm_data, &scm_data->ns_command, op_code);
> +}
> +
> +int scm_ns_response(const struct scm_data *scm_data)
> +{
> + return scm_command_response(scm_data, &scm_data->ns_command);
> +}
> +
> +int scm_ns_command_execute(const struct scm_data *scm_data)
> +{
> + return ocxl_global_mmio_set64(scm_data->ocxl_afu, GLOBAL_MMIO_HCI,
> +   OCXL_LITTLE_ENDIAN, 
> GLOBAL_MMIO_HCI_NSCRW);
> +}
> +
> +bool scm_ns_command_complete(const struct scm_data *scm_data)
> +{
> + u64 val = 0;
> + int rc = scm_chi(scm_data, &val);
> +
> + WARN_ON(rc);
> +
> + return (val & GLOBAL_MMIO_CHI_NSCRA) != 0;
> +}
> +
> +int scm_ns_response_handled(const struct scm_data *scm_data)
> +{
> + return ocxl_global_mmio_set64(scm_data->ocxl_afu, GLOBAL_MMIO_CHIC,
> +   OCXL_LITTLE_ENDIAN, 
> GLOBAL_MMIO_CHI_NSCRA);
> +}
> +
>  void scm_warn_status(const struct scm_data *scm_data, const char *message,
>u8 status)
>  {
> diff --git a/drivers/nvdimm/ocxl/scm_internal.h 
> b/drivers/nvdimm/ocxl/scm_internal.h
> index 9bff684cd069..9575996a89e7 100644
> --- a/drivers/nvdimm/ocxl/scm_internal.h
> +++ b/drivers/nvdimm/ocxl/scm_internal.h
> @@ -108,6 +108,7 @@ struct scm_data {
>   struct ocxl_context *ocxl_context;
>   void *metadata_addr;
>   struct command_metadata admin_command;
> + struct command_metadata ns_command;
>   struct resource scm_res;
>   struct nd_region *nd_region;
>   char fw_version[8+1];
> @@ -176,6 +177,42 @@ int scm_admin_command_complete_timeout(const struct 
> scm_data *scm_data,
>   */
>  int scm_admin_response_handled(const struct scm_data *scm_data);
>  
> +/**
> + * scm_ns_command_request() - Issue a near storage command request
> + * @scm_data: a pointer to the SCM device data
> + * @op_code: The op-code for the command
> + * Returns an identifier for the command, or negative on error
> + */
> +int scm_ns_command_request(struct scm_data *scm_data, u8 op_code);
> +
> +/**
> + * scm_ns_response() - Validate a near storage response
> + * @scm_data: a pointer to the SCM device data
> + * Returns the status code of the command, or negative on error
> + */
> +int scm_ns_response(const struct scm_data *scm_data);
> +
> +/**
> + * scm_ns_command_execute() - Notify the controller to start processing a 
> pending near storage command
> + * @scm_data: a pointer to the SCM device data
> + * Returns 0 on success, negative on error
> + */
> +int scm_ns_command_execute(const struct scm_data *scm_data);
> +
> +/**
> + * scm_ns_command_complete() - Is a near storage command executing
> + * scm_data: a pointer to the SCM device data
> + * Returns true if the previous admin command has completed
> + */
> +bool scm_ns_command_complete(const struct scm_data *scm_data);
> +
> +/**
> + * scm_ns_response_handled() - Notify the controller that the near storage 
> response 

Re: [PATCH v2 13/27] nvdimm/ocxl: Add support for Admin commands

2020-02-03 Thread Jonathan Cameron
On Tue, 3 Dec 2019 14:46:41 +1100
Alastair D'Silva  wrote:

> From: Alastair D'Silva 
> 
> This patch requests the metadata required to issue admin commands, as well
> as some helper functions to construct and check the completion of the
> commands.
> 
> Signed-off-by: Alastair D'Silva 

A few trivial bits inline.

Jonathan

> ---
>  drivers/nvdimm/ocxl/scm.c  |  67 +
>  drivers/nvdimm/ocxl/scm_internal.c | 152 +
>  drivers/nvdimm/ocxl/scm_internal.h |  62 
>  3 files changed, 281 insertions(+)
> 
> diff --git a/drivers/nvdimm/ocxl/scm.c b/drivers/nvdimm/ocxl/scm.c
> index 8088f65c289e..1e175f3c3cf2 100644
> --- a/drivers/nvdimm/ocxl/scm.c
> +++ b/drivers/nvdimm/ocxl/scm.c
> @@ -267,6 +267,58 @@ static int scm_register_lpc_mem(struct scm_data 
> *scm_data)
>   return 0;
>  }
>  
> +/**
> + * scm_extract_command_metadata() - Extract command data from MMIO & save it 
> for further use
> + * @scm_data: a pointer to the SCM device data
> + * @offset: The base address of the command data structures (address of 
> CREQO)
> + * @command_metadata: A pointer to the command metadata to populate
> + * Return: 0 on success, negative on failure
> + */
> +static int scm_extract_command_metadata(struct scm_data *scm_data, u32 
> offset,
> + struct command_metadata 
> *command_metadata)
> +{
> + int rc;
> + u64 tmp;
> +
> + rc = ocxl_global_mmio_read64(scm_data->ocxl_afu, offset, 
> OCXL_LITTLE_ENDIAN,
> +  &tmp);
> + if (rc)
> + return rc;
> +
> + command_metadata->request_offset = tmp >> 32;
> + command_metadata->response_offset = tmp & 0x;
> +
> + rc = ocxl_global_mmio_read64(scm_data->ocxl_afu, offset + 8, 
> OCXL_LITTLE_ENDIAN,
> +  &tmp);
> + if (rc)
> + return rc;
> +
> + command_metadata->data_offset = tmp >> 32;
> + command_metadata->data_size = tmp & 0x;
> +
> + command_metadata->id = 0;
> +
> + return 0;
> +}
> +
> +/**
> + * scm_setup_command_metadata() - Set up the command metadata
> + * @scm_data: a pointer to the SCM device data
> + */
> +static int scm_setup_command_metadata(struct scm_data *scm_data)
> +{
> + int rc;
> +
> + mutex_init(&scm_data->admin_command.lock);
> +
> + rc = scm_extract_command_metadata(scm_data, GLOBAL_MMIO_ACMA_CREQO,
> +   &scm_data->admin_command);
> + if (rc)
> + return rc;

Unless you are adding to this later in the series.

return scm_extract_command_metadata(scm_data,...)

> +
> + return 0;
> +}
> +
>  /**
>   * scm_is_usable() - Is a controller usable?
>   * @scm_data: a pointer to the SCM device data
> @@ -276,6 +328,8 @@ static bool scm_is_usable(const struct scm_data *scm_data)
>  {
>   u64 chi = 0;
>   int rc = scm_chi(scm_data, &chi);
> + if (rc)
> + return false;
>  
>   if (!(chi & GLOBAL_MMIO_CHI_CRDY)) {
>   dev_err(&scm_data->dev, "SCM controller is not ready.\n");
> @@ -502,6 +556,14 @@ static int scm_probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
>   }
>   scm_data->pdev = pdev;
>  
> + scm_data->timeouts[ADMIN_COMMAND_ERRLOG] = 2000; // ms
> + scm_data->timeouts[ADMIN_COMMAND_HEARTBEAT] = 100; // ms
> + scm_data->timeouts[ADMIN_COMMAND_SMART] = 100; // ms
> + scm_data->timeouts[ADMIN_COMMAND_CONTROLLER_DUMP] = 1000; // ms
> + scm_data->timeouts[ADMIN_COMMAND_CONTROLLER_STATS] = 100; // ms
> + scm_data->timeouts[ADMIN_COMMAND_SHUTDOWN] = 1000; // ms
> + scm_data->timeouts[ADMIN_COMMAND_FW_UPDATE] = 16000; // ms
> +
>   pci_set_drvdata(pdev, scm_data);
>  
>   scm_data->ocxl_fn = ocxl_function_open(pdev);
> @@ -543,6 +605,11 @@ static int scm_probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
>   goto err;
>   }
>  
> + if (scm_setup_command_metadata(scm_data)) {
> + dev_err(&pdev->dev, "Could not read OCXL command metadata\n");
> + goto err;
> + }
> +
>   elapsed = 0;
>   timeout = scm_data->readiness_timeout + 
> scm_data->memory_available_timeout;
>   while (!scm_is_usable(scm_data)) {
> diff --git a/drivers/nvdimm/ocxl/scm_internal.c 
> b/drivers/nvdimm/ocxl/scm_internal.c
> index 72d3c0e7d846..7b11b56863fb 100644
> --- a/drivers/nvdimm/ocxl/scm_internal.c
> +++ b/drivers/nvdimm/ocxl/scm_internal.c
> @@ -17,3 +17,155 @@ int scm_chi(const struct scm_data *scm_data, u64 *chi)
>  
>   return 0;
>  }
> +
> +static int scm_command_request(const struct scm_data *scm_data,
> +struct command_metadata *cmd, u8 op_code)
> +{
> + u64 val = op_code;
> + int rc;
> + u8 i;
> +
> + cmd->op_code = op_code;
> + cmd->id++;
> +
> + val |= ((u64)cmd->id) << 16;
> +
> + rc = ocxl_global_mmio_write64(scm_data->ocxl_afu, cmd->request_offset,
> +  

Re: [PATCH v2 12/27] nvdimm/ocxl: Read the capability registers & wait for device ready

2020-02-03 Thread Jonathan Cameron
On Tue, 3 Dec 2019 14:46:40 +1100
Alastair D'Silva  wrote:

> From: Alastair D'Silva 
> 
> This patch reads timeouts & firmware version from the controller, and
> uses those timeouts to wait for the controller to report that it is ready
> before handing the memory over to libnvdimm.
> 
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/nvdimm/ocxl/Makefile   |  2 +-
>  drivers/nvdimm/ocxl/scm.c  | 84 ++
>  drivers/nvdimm/ocxl/scm_internal.c | 19 +++
>  drivers/nvdimm/ocxl/scm_internal.h | 24 +
>  4 files changed, 128 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/nvdimm/ocxl/scm_internal.c
> 
> diff --git a/drivers/nvdimm/ocxl/Makefile b/drivers/nvdimm/ocxl/Makefile
> index 74a1bd98848e..9b6e31f0eb3e 100644
> --- a/drivers/nvdimm/ocxl/Makefile
> +++ b/drivers/nvdimm/ocxl/Makefile
> @@ -4,4 +4,4 @@ ccflags-$(CONFIG_PPC_WERROR)  += -Werror
>  
>  obj-$(CONFIG_OCXL_SCM) += ocxlscm.o
>  
> -ocxlscm-y := scm.o
> +ocxlscm-y := scm.o scm_internal.o
> diff --git a/drivers/nvdimm/ocxl/scm.c b/drivers/nvdimm/ocxl/scm.c
> index 571058a9e7b8..8088f65c289e 100644
> --- a/drivers/nvdimm/ocxl/scm.c
> +++ b/drivers/nvdimm/ocxl/scm.c
> @@ -7,6 +7,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -266,6 +267,30 @@ static int scm_register_lpc_mem(struct scm_data 
> *scm_data)
>   return 0;
>  }
>  
> +/**
> + * scm_is_usable() - Is a controller usable?
> + * @scm_data: a pointer to the SCM device data
> + * Return: true if the controller is usable
> + */
> +static bool scm_is_usable(const struct scm_data *scm_data)
> +{
> + u64 chi = 0;
> + int rc = scm_chi(scm_data, &chi);
> +
> + if (!(chi & GLOBAL_MMIO_CHI_CRDY)) {
> + dev_err(&scm_data->dev, "SCM controller is not ready.\n");
> + return false;
> + }
> +
> + if (!(chi & GLOBAL_MMIO_CHI_MA)) {
> + dev_err(&scm_data->dev,
> + "SCM controller does not have memory available.\n");
> + return false;
> + }
> +
> + return true;
> +}
> +
>  /**
>   * allocate_scm_minor() - Allocate a minor number to use for an SCM device
>   * @scm_data: The SCM device to associate the minor with
> @@ -380,6 +405,48 @@ static void scm_remove(struct pci_dev *pdev)
>   }
>  }
>  
> +/**
> + * read_device_metadata() - Retrieve config information from the AFU and 
> save it for future use
> + * @scm_data: the SCM metadata
> + * Return: 0 on success, negative on failure
> + */
> +static int read_device_metadata(struct scm_data *scm_data)
> +{
> + u64 val;
> + int rc;
> +
> + rc = ocxl_global_mmio_read64(scm_data->ocxl_afu, GLOBAL_MMIO_CCAP0,
> +  OCXL_LITTLE_ENDIAN, &val);
> + if (rc)
> + return rc;
> +
> + scm_data->scm_revision = val & 0x;
> + scm_data->read_latency = (val >> 32) & 0xFF;
> + scm_data->readiness_timeout = (val >> 48) & 0xff;
> + scm_data->memory_available_timeout = val >> 52;

This overlaps with the masked region for readiness_timeout.  I'll guess the mask
on that should be 0xF.
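
To make the overlap concrete, a small standalone sketch. The shifts and the 0xff mask are copied from the quoted hunk; the helper names are made up for the example, and the 0xf variant is only the guess above, not something confirmed against the hardware specification.

```c
#include <stdint.h>

/* As posted: covers bits 48..55 of the register value. */
static uint64_t readiness_timeout_as_posted(uint64_t val)
{
	return (val >> 48) & 0xff;
}

/* Guessed fix: covers bits 48..51 only. */
static uint64_t readiness_timeout_guess(uint64_t val)
{
	return (val >> 48) & 0xf;
}

/* As posted: covers bits 52..63. */
static uint64_t memory_available_timeout(uint64_t val)
{
	return val >> 52;
}
```

With only bit 52 set, the as-posted readiness helper returns 0x10 while memory_available_timeout() returns 1, so the same bit is counted in both fields. The 0xf variant returns 0 for that input.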

> +
> + rc = ocxl_global_mmio_read64(scm_data->ocxl_afu, GLOBAL_MMIO_CCAP1,
> +  OCXL_LITTLE_ENDIAN, &val);
> + if (rc)
> + return rc;
> +
> + scm_data->max_controller_dump_size = val & 0x;
> +
> + // Extract firmware version text
> + rc = ocxl_global_mmio_read64(scm_data->ocxl_afu, GLOBAL_MMIO_FWVER,
> +  OCXL_HOST_ENDIAN, (u64 
> *)scm_data->fw_version);
> + if (rc)
> + return rc;
> +
> + scm_data->fw_version[8] = '\0';
> +
> + dev_info(&scm_data->dev,
> +  "Firmware version '%s' SCM revision %d:%d\n", 
> scm_data->fw_version,
> +  scm_data->scm_revision >> 4, scm_data->scm_revision & 0x0F);
> +
> + return 0;
> +}
> +
>  /**
>   * scm_probe_function_0 - Set up function 0 for an OpenCAPI Storage Class 
> Memory device
>   * This is important as it enables templates higher than 0 across all other 
> functions,
> @@ -420,6 +487,8 @@ static int scm_probe_function_0(struct pci_dev *pdev)
>  static int scm_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  {
>   struct scm_data *scm_data = NULL;
> + int elapsed;
> + u16 timeout;
>  
>   if (PCI_FUNC(pdev->devfn) == 0)
>   return scm_probe_function_0(pdev);
> @@ -469,6 +538,21 @@ static int scm_probe(struct pci_dev *pdev, const struct 
> pci_device_id *ent)
>   goto err;
>   }
>  
> + if (read_device_metadata(scm_data)) {
> + dev_err(&pdev->dev, "Could not read SCM device metadata\n");
> + goto err;
> + }
> +
> + elapsed = 0;
> + timeout = scm_data->readiness_timeout + 
> scm_data->memory_available_timeout;
> + while (!scm_is_usable(scm_data)) {
> + if (elapsed++ > timeout) {
> + dev_warn(&scm_data->dev, "SCM ready 

Re: [PATCH v2 10/27] nvdimm: Add driver for OpenCAPI Storage Class Memory

2020-02-03 Thread Jonathan Cameron
On Tue, 3 Dec 2019 14:46:38 +1100
Alastair D'Silva  wrote:

> From: Alastair D'Silva 
> 
> This driver exposes LPC memory on OpenCAPI SCM cards
> as an NVDIMM, allowing the existing nvram infrastructure
> to be used.
> 
> Namespace metadata is stored on the media itself, so
> scm_reserve_metadata() maps 1 section's worth of PMEM storage
> at the start to hold this. The rest of the PMEM range is registered
> with libnvdimm as an nvdimm. scm_ndctl_config_read/write/size() provide
> callbacks to libnvdimm to access the metadata.
> 
> Signed-off-by: Alastair D'Silva 
Hi Alastair,

A few bits and bobs inline.

Thanks,

Jonathan

> ---
>  drivers/nvdimm/Kconfig |   2 +
>  drivers/nvdimm/Makefile|   2 +-
>  drivers/nvdimm/ocxl/Kconfig|  15 +
>  drivers/nvdimm/ocxl/Makefile   |   7 +
>  drivers/nvdimm/ocxl/scm.c  | 519 +
>  drivers/nvdimm/ocxl/scm_internal.h |  28 ++
>  6 files changed, 572 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/nvdimm/ocxl/Kconfig
>  create mode 100644 drivers/nvdimm/ocxl/Makefile
>  create mode 100644 drivers/nvdimm/ocxl/scm.c
>  create mode 100644 drivers/nvdimm/ocxl/scm_internal.h
> 
> diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
> index 36af7af6b7cf..d1bab36da61c 100644
> --- a/drivers/nvdimm/Kconfig
> +++ b/drivers/nvdimm/Kconfig
> @@ -130,4 +130,6 @@ config NVDIMM_TEST_BUILD
> core devm_memremap_pages() implementation and other
> infrastructure.
>  
> +source "drivers/nvdimm/ocxl/Kconfig"
> +
>  endif
> diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
> index 29203f3d3069..e33492128042 100644
> --- a/drivers/nvdimm/Makefile
> +++ b/drivers/nvdimm/Makefile
> @@ -1,5 +1,5 @@
>  # SPDX-License-Identifier: GPL-2.0
> -obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
> +obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o ocxl/
>  obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
>  obj-$(CONFIG_ND_BTT) += nd_btt.o
>  obj-$(CONFIG_ND_BLK) += nd_blk.o
> diff --git a/drivers/nvdimm/ocxl/Kconfig b/drivers/nvdimm/ocxl/Kconfig
> new file mode 100644
> index ..24099b300f5e
> --- /dev/null
> +++ b/drivers/nvdimm/ocxl/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +if LIBNVDIMM
> +
> +config OCXL_SCM
> + tristate "OpenCAPI Storage Class Memory"
> + depends on LIBNVDIMM && PPC_POWERNV && PCI && EEH
> + select ZONE_DEVICE
> + select OCXL
> + help
> +   Exposes devices that implement the OpenCAPI Storage Class Memory
> +   specification as persistent memory regions.
> +
> +   Select N if unsure.
> +
> +endif
> diff --git a/drivers/nvdimm/ocxl/Makefile b/drivers/nvdimm/ocxl/Makefile
> new file mode 100644
> index ..74a1bd98848e
> --- /dev/null
> +++ b/drivers/nvdimm/ocxl/Makefile
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +ccflags-$(CONFIG_PPC_WERROR) += -Werror
> +
> +obj-$(CONFIG_OCXL_SCM) += ocxlscm.o
> +
> +ocxlscm-y := scm.o
> diff --git a/drivers/nvdimm/ocxl/scm.c b/drivers/nvdimm/ocxl/scm.c
> new file mode 100644
> index ..571058a9e7b8
> --- /dev/null
> +++ b/drivers/nvdimm/ocxl/scm.c
> @@ -0,0 +1,519 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +// Copyright 2019 IBM Corp.
> +
> +/*
> + * A driver for Storage Class Memory, connected via OpenCAPI
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "scm_internal.h"
> +
> +
> +static const struct pci_device_id scm_pci_tbl[] = {
> + { PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0625), },
> + { }
> +};
> +
> +MODULE_DEVICE_TABLE(pci, scm_pci_tbl);
> +
> +#define SCM_NUM_MINORS 256 // Total to reserve
> +
> +static dev_t scm_dev;
> +static struct class *scm_class;
> +static struct mutex minors_idr_lock;
> +static struct idr minors_idr;
> +
> +static const struct attribute_group *scm_pmem_attribute_groups[] = {
> + &nvdimm_bus_attribute_group,
> + NULL,
> +};
> +
> +static const struct attribute_group *scm_pmem_region_attribute_groups[] = {
> + &nd_region_attribute_group,
> + &nd_device_attribute_group,
> + &nd_mapping_attribute_group,
> + &nd_numa_attribute_group,
> + NULL,
> +};
> +
> +/**
> + * scm_ndctl_config_write() - Handle a ND_CMD_SET_CONFIG_DATA command from 
> ndctl
> + * @scm_data: the SCM metadata
> + * @command: the incoming data to write
> + * Return: 0 on success, negative on failure
> + */
> +static int scm_ndctl_config_write(struct scm_data *scm_data,
> +   struct nd_cmd_set_config_hdr *command)
> +{
> + if (command->in_offset + command->in_length > SCM_LABEL_AREA_SIZE)
> + return -EINVAL;
> +
> + memcpy_flushcache(scm_data->metadata_addr + command->in_offset, 
> command->in_buf,
> +   command->in_length);
> +
> + return 0;
> +}
> +
> +/**
> + * scm_ndctl_config_read() - Handle a ND_CMD_GET_CONFIG_DATA command from 
> ndctl
> + * @scm_data: the SCM metadata
> + * @command: the read request
> + * Return: 0 on 

Re: [PATCH v2 08/27] ocxl: Save the device serial number in ocxl_fn

2020-02-03 Thread Jonathan Cameron
On Tue, 3 Dec 2019 14:46:36 +1100
Alastair D'Silva  wrote:

> From: Alastair D'Silva 
> 
> This patch retrieves the serial number of the card and makes it available
> to consumers of the ocxl driver via the ocxl_fn struct.
> 
> Signed-off-by: Alastair D'Silva 
> Acked-by: Frederic Barrat 
> Acked-by: Andrew Donnellan 
> ---
>  drivers/misc/ocxl/config.c | 46 ++
>  include/misc/ocxl.h|  1 +
>  2 files changed, 47 insertions(+)
> 
> diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
> index fb0c3b6f8312..a9203c309365 100644
> --- a/drivers/misc/ocxl/config.c
> +++ b/drivers/misc/ocxl/config.c
> @@ -71,6 +71,51 @@ static int find_dvsec_afu_ctrl(struct pci_dev *dev, u8 
> afu_idx)
>   return 0;
>  }
>  
> +/**

Make sure anything you mark as kernel doc with /** is valid
kernel-doc.
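
For reference, a sketch of the shape kernel-doc expects: the function name with "()" and a short description on the first line, one "@name:" line per parameter using the real parameter name, and a "Return:" section. Note that the comment quoted below documents @device while the parameter is called dev. The helper here is a hypothetical userspace stand-in so the example compiles on its own; it assumes the usual devfn encoding with the function number in the low three bits.

```c
/**
 * sketch_function_0_devfn() - Compute the devfn of function 0 in the same slot
 * @devfn: encoded slot/function number, function in bits 2..0
 *
 * Return: @devfn with the function bits cleared
 */
static unsigned int sketch_function_0_devfn(unsigned int devfn)
{
	return devfn & ~0x7u;
}
```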

> + * Find a related PCI device (function 0)
> + * @device: PCI device to match
> + *
> + * Returns a pointer to the related device, or null if not found
> + */
> +static struct pci_dev *get_function_0(struct pci_dev *dev)
> +{
> + unsigned int devfn = PCI_DEVFN(PCI_SLOT(dev->devfn), 0); // Look for 
> function 0

Not sure the trailing comment adds much.

I'd personally not bother with this wrapper at all and just call
the pci functions directly where needed.

> +
> + return pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
> + dev->bus->number, devfn);
> +}
> +
> +static void read_serial(struct pci_dev *dev, struct ocxl_fn_config *fn)
> +{
> + u32 low, high;
> + int pos;
> +
> + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DSN);
> + if (pos) {
> + pci_read_config_dword(dev, pos + 0x04, &low);
> + pci_read_config_dword(dev, pos + 0x08, &high);
> +
> + fn->serial = low | ((u64)high) << 32;
> +
> + return;
> + }
> +
> + if (PCI_FUNC(dev->devfn) != 0) {
> + struct pci_dev *related = get_function_0(dev);
> +
> + if (!related) {
> + fn->serial = 0;
> + return;
> + }
> +
> + read_serial(related, fn);
> + pci_dev_put(related);
> + return;
> + }
> +
> + fn->serial = 0;
> +}
> +
>  static void read_pasid(struct pci_dev *dev, struct ocxl_fn_config *fn)
>  {
>   u16 val;
> @@ -208,6 +253,7 @@ int ocxl_config_read_function(struct pci_dev *dev, struct 
> ocxl_fn_config *fn)
>   int rc;
>  
>   read_pasid(dev, fn);
> + read_serial(dev, fn);
>  
>   rc = read_dvsec_tl(dev, fn);
>   if (rc) {
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index 6f7c02f0d5e3..9843051c3c5b 100644
> --- a/include/misc/ocxl.h
> +++ b/include/misc/ocxl.h
> @@ -46,6 +46,7 @@ struct ocxl_fn_config {
>   int dvsec_afu_info_pos; /* offset of the AFU information DVSEC */
>   s8 max_pasid_log;
>   s8 max_afu_index;
> + u64 serial;
>  };
>  
>  enum ocxl_endian {




Re: [PATCH v2 07/27] ocxl: Add functions to map/unmap LPC memory

2020-02-03 Thread Jonathan Cameron
On Tue, 3 Dec 2019 14:46:35 +1100
Alastair D'Silva  wrote:

> From: Alastair D'Silva 
> 
> Add functions to map/unmap LPC memory
> 
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/misc/ocxl/config.c|  4 +++
>  drivers/misc/ocxl/core.c  | 50 +++
>  drivers/misc/ocxl/ocxl_internal.h |  3 ++
>  include/misc/ocxl.h   | 18 +++
>  4 files changed, 75 insertions(+)
> 
> diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
> index c8e19bfb5ef9..fb0c3b6f8312 100644
> --- a/drivers/misc/ocxl/config.c
> +++ b/drivers/misc/ocxl/config.c
> @@ -568,6 +568,10 @@ static int read_afu_lpc_memory_info(struct pci_dev *dev,
>   afu->special_purpose_mem_size =
>   total_mem_size - lpc_mem_size;
>   }
> +
> + dev_info(&dev->dev, "Probed LPC memory of %#llx bytes and special 
> purpose memory of %#llx bytes\n",
> + afu->lpc_mem_size, afu->special_purpose_mem_size);
> +

If we are being fussy, this block has nothing to do with the rest of the patch
so we shouldn't be seeing it here.

>   return 0;
>  }
>  
> diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
> index 2531c6cf19a0..98611faea219 100644
> --- a/drivers/misc/ocxl/core.c
> +++ b/drivers/misc/ocxl/core.c
> @@ -210,6 +210,55 @@ static void unmap_mmio_areas(struct ocxl_afu *afu)
>   release_fn_bar(afu->fn, afu->config.global_mmio_bar);
>  }
>  
> +int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu)
> +{
> + struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> +
> + if ((afu->config.lpc_mem_size + afu->config.special_purpose_mem_size) 
> == 0)
> + return 0;
> +
> + afu->lpc_base_addr = ocxl_link_lpc_map(afu->fn->link, dev);
> + if (afu->lpc_base_addr == 0)
> + return -EINVAL;
> +
> + if (afu->config.lpc_mem_size) {

I was happy with the explicit check on 0 above, but we should be consistent.
Either we make use of 0 == false, or we don't and explicitly check vs 0.

Hence

if (afu->config.lpc_mem_size != 0) {

here or

if (!(afu->config.lpc_mem_size + afu->config.special_purpose_mem_size))
return 0;

above.

> + afu->lpc_res.start = afu->lpc_base_addr + 
> afu->config.lpc_mem_offset;
> + afu->lpc_res.end = afu->lpc_res.start + 
> afu->config.lpc_mem_size - 1;
> + }
> +
> + if (afu->config.special_purpose_mem_size) {
> + afu->special_purpose_res.start = afu->lpc_base_addr +
> +  
> afu->config.special_purpose_mem_offset;
> + afu->special_purpose_res.end = afu->special_purpose_res.start +
> +
> afu->config.special_purpose_mem_size - 1;
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(ocxl_afu_map_lpc_mem);
> +
> +struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu)
> +{
> + return &afu->lpc_res;
> +}
> +EXPORT_SYMBOL_GPL(ocxl_afu_lpc_mem);
> +
> +static void unmap_lpc_mem(struct ocxl_afu *afu)
> +{
> + struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> +
> + if (afu->lpc_res.start || afu->special_purpose_res.start) {
> + void *link = afu->fn->link;
> +
> + ocxl_link_lpc_release(link, dev);
> +
> + afu->lpc_res.start = 0;
> + afu->lpc_res.end = 0;
> + afu->special_purpose_res.start = 0;
> + afu->special_purpose_res.end = 0;
> + }
> +}
> +
>  static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev 
> *dev)
>  {
>   int rc;
> @@ -251,6 +300,7 @@ static int configure_afu(struct ocxl_afu *afu, u8 
> afu_idx, struct pci_dev *dev)
>  
>  static void deconfigure_afu(struct ocxl_afu *afu)
>  {
> + unmap_lpc_mem(afu);

Hmm. This breaks the existing balance between configure_afu and deconfigure_afu.

Given comments below on why we don't do map_lpc_mem in the afu bring up
(as it's a shared operation) it seems to me that we should be doing this
outside of the afu deconfigure.  Perhaps ocxl_function_close is appropriate?
I don't know this infrastructure well enough to be sure.

If it does need to be here, then a comment to give more info on
why would be great!

>   unmap_mmio_areas(afu);
>   reclaim_afu_pasid(afu);
>   reclaim_afu_actag(afu);
> diff --git a/drivers/misc/ocxl/ocxl_internal.h 
> b/drivers/misc/ocxl/ocxl_internal.h
> index 20b417e00949..9f4b47900e62 100644
> --- a/drivers/misc/ocxl/ocxl_internal.h
> +++ b/drivers/misc/ocxl/ocxl_internal.h
> @@ -52,6 +52,9 @@ struct ocxl_afu {
>   void __iomem *global_mmio_ptr;
>   u64 pp_mmio_start;
>   void *private;
> + u64 lpc_base_addr; /* Covers both LPC & special purpose memory */
> + struct resource lpc_res;
> + struct resource special_purpose_res;
>  };
>  
>  enum ocxl_context_status {
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index 06dd5839e438..6f7c02f0d5e3 100644
> --- a/include/misc/ocxl.h
> +++ 

Re: [PATCH v2 06/27] ocxl: Tally up the LPC memory on a link & allow it to be mapped

2020-02-03 Thread Jonathan Cameron
On Tue, 3 Dec 2019 14:46:34 +1100
Alastair D'Silva  wrote:

> From: Alastair D'Silva 
> 
> Tally up the LPC memory on an OpenCAPI link & allow it to be mapped
> 
> Signed-off-by: Alastair D'Silva 
Hi Alastair,

A few trivial comments inline.

Jonathan

> ---
>  drivers/misc/ocxl/core.c  | 10 ++
>  drivers/misc/ocxl/link.c  | 60 +++
>  drivers/misc/ocxl/ocxl_internal.h | 33 +
>  3 files changed, 103 insertions(+)
> 
> diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
> index b7a09b21ab36..2531c6cf19a0 100644
> --- a/drivers/misc/ocxl/core.c
> +++ b/drivers/misc/ocxl/core.c
> @@ -230,8 +230,18 @@ static int configure_afu(struct ocxl_afu *afu, u8 
> afu_idx, struct pci_dev *dev)
>   if (rc)
>   goto err_free_pasid;
>  
> + if (afu->config.lpc_mem_size || afu->config.special_purpose_mem_size) {
> + rc = ocxl_link_add_lpc_mem(afu->fn->link, 
> afu->config.lpc_mem_offset,
> +afu->config.lpc_mem_size +
> +
> afu->config.special_purpose_mem_size);
> + if (rc)
> + goto err_free_mmio;
> + }
> +
>   return 0;
>  
> +err_free_mmio:
> + unmap_mmio_areas(afu);
>  err_free_pasid:
>   reclaim_afu_pasid(afu);
>  err_free_actag:
> diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
> index 58d111afd9f6..d8503f0dc6ec 100644
> --- a/drivers/misc/ocxl/link.c
> +++ b/drivers/misc/ocxl/link.c
> @@ -84,6 +84,11 @@ struct ocxl_link {
>   int dev;
>   atomic_t irq_available;
>   struct spa *spa;
> + struct mutex lpc_mem_lock;

Always a good idea to explicitly document what a lock is intended to protect.
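
Something along these lines, as a userspace sketch. The struct and helper names are made up, pthread_mutex_t stands in for struct mutex, and the set of protected fields is inferred from how the lock is used later in this patch, so treat it as an assumption.

```c
#include <pthread.h>
#include <stdint.h>

struct link_sketch {
	/* Protects lpc_mem_sz, lpc_mem and lpc_consumers */
	pthread_mutex_t lpc_mem_lock;
	uint64_t lpc_mem_sz;	/* Total LPC memory presented on the link */
	uint64_t lpc_mem;
	int lpc_consumers;
};

/* Grow the recorded LPC size to cover 'end', under the documented lock. */
static void link_sketch_grow(struct link_sketch *link, uint64_t end)
{
	pthread_mutex_lock(&link->lpc_mem_lock);
	if (end > link->lpc_mem_sz)
		link->lpc_mem_sz = end;
	pthread_mutex_unlock(&link->lpc_mem_lock);
}
```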

> + u64 lpc_mem_sz; /* Total amount of LPC memory presented on the link */
> + u64 lpc_mem;
> + int lpc_consumers;
> +
>   void *platform_data;
>  };
>  static struct list_head links_list = LIST_HEAD_INIT(links_list);
> @@ -396,6 +401,8 @@ static int alloc_link(struct pci_dev *dev, int PE_mask, 
> struct ocxl_link **out_l
>   if (rc)
>   goto err_spa;
>  
> + mutex_init(&link->lpc_mem_lock);
> +
>   /* platform specific hook */
>   rc = pnv_ocxl_spa_setup(dev, link->spa->spa_mem, PE_mask,
>   &link->platform_data);
> @@ -711,3 +718,56 @@ void ocxl_link_free_irq(void *link_handle, int hw_irq)
>   atomic_inc(&link->irq_available);
>  }
>  EXPORT_SYMBOL_GPL(ocxl_link_free_irq);
> +
> +int ocxl_link_add_lpc_mem(void *link_handle, u64 offset, u64 size)
> +{
> + struct ocxl_link *link = (struct ocxl_link *) link_handle;
> +
> + // Check for overflow

Stray c++ style comment.

> + if (offset > (offset + size))
> + return -EINVAL;
> +
> + mutex_lock(&link->lpc_mem_lock);
> + link->lpc_mem_sz = max(link->lpc_mem_sz, offset + size);
> +
> + mutex_unlock(&link->lpc_mem_lock);
> +
> + return 0;
> +}
> +
> +u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev)
> +{
> + struct ocxl_link *link = (struct ocxl_link *) link_handle;
> + u64 lpc_mem;
> +
> + mutex_lock(&link->lpc_mem_lock);
> + if (link->lpc_mem) {

If you don't modify this later in the series (I haven't read it all yet :),
it rather feels like it would be more compact and just as readable as
something like...

if (!link->lpc_mem)
link->lpc_mem = pnv_ocxl...

if (link->lpc_mem)
link->lpc_consumers++;
mutex_unlock(&link->lpc_mem_lock);

return link->lpc_mem;

> + lpc_mem = link->lpc_mem;
> +
> + link->lpc_consumers++;
> + mutex_unlock(&link->lpc_mem_lock);
> + return lpc_mem;
> + }
> +
> + link->lpc_mem = pnv_ocxl_platform_lpc_setup(pdev, link->lpc_mem_sz);
> + if (link->lpc_mem)
> + link->lpc_consumers++;
> + lpc_mem = link->lpc_mem;
> + mutex_unlock(&link->lpc_mem_lock);
> +
> + return lpc_mem;
> +}
> +
> +void ocxl_link_lpc_release(void *link_handle, struct pci_dev *pdev)
> +{
> + struct ocxl_link *link = (struct ocxl_link *) link_handle;
> +
> + mutex_lock(&link->lpc_mem_lock);
> + WARN_ON(--link->lpc_consumers < 0);
> + if (link->lpc_consumers == 0) {
> + pnv_ocxl_platform_lpc_release(pdev);
> + link->lpc_mem = 0;
> + }
> +
> + mutex_unlock(&link->lpc_mem_lock);
> +}
> diff --git a/drivers/misc/ocxl/ocxl_internal.h 
> b/drivers/misc/ocxl/ocxl_internal.h
> index 97415afd79f3..20b417e00949 100644
> --- a/drivers/misc/ocxl/ocxl_internal.h
> +++ b/drivers/misc/ocxl/ocxl_internal.h
> @@ -141,4 +141,37 @@ int ocxl_irq_offset_to_id(struct ocxl_context *ctx, u64 
> offset);
>  u64 ocxl_irq_id_to_offset(struct ocxl_context *ctx, int irq_id);
>  void ocxl_afu_irq_free_all(struct ocxl_context *ctx);
>  
> +/**
> + * ocxl_link_add_lpc_mem() - Increment the amount of memory required by an 
> OpenCAPI link
> + *
> + * @link_handle: The OpenCAPI link 

Re: [PATCH v4 13/15] docs: ABI: testing: make the files compatible with ReST output

2019-07-17 Thread Jonathan Cameron
On Wed, 17 Jul 2019 09:28:17 -0300
Mauro Carvalho Chehab  wrote:

> Some files over there won't parse well by Sphinx.
> 
> Fix them.
> 
> Signed-off-by: Mauro Carvalho Chehab 
Hi Mauro,

Does feel like this one should perhaps have been broken up a touch!

For the IIO ones I've eyeballed it rather than testing the results

Acked-by: Jonathan Cameron 




Re: [PATCH v10 00/18] Introduce the Counter subsystem

2019-04-27 Thread Jonathan Cameron
On Thu, 25 Apr 2019 21:36:24 +0200
Greg KH  wrote:

> On Sun, Apr 07, 2019 at 03:25:50PM +0100, Jonathan Cameron wrote:
> > On Tue,  2 Apr 2019 15:30:35 +0900
> > William Breathitt Gray  wrote:
> >   
> > > Changes in v10:
> > >   - Fix minor typographical errors in documentation
> > >   - Merge the FlexTimer Module Quadrature decoder counter driver patches
> > > 
> > > This revision is functionally identical to the last; changes in this
> > > version were made to fix minor typos in the documentation files and also
> > > to pull in the new FTM quadrature decoder counter driver.
> > > 
> > > The Generic Counter API has been and is still in a feature freeze until
> > > it is merged into the mainline. The following features will be
> > > investigated after the merge: interrupt support for counter devices, and
> > > a character device interface for low-latency applications.  
> > 
> > Hi William / al,
> > 
> > So the question is how to move this forwards?  I'm happy with how it turned
> > out and the existing drivers we had in IIO are a lot cleaner under
> > the counter subsystem (other than the backwards compatibility for those that
> > ever existed in IIO).  For those  not following closely the situation is:  
> 
> I've now sucked this into my staging-testing branch and if 0-day is fine
> with it, I'll merge it to staging-next in a day or so.  This way you can
> build on it for any iio drivers that might be coming.

Great thanks. 

> 
> I do have reservations about that one sysfs file that is multi-line, and
> I think it will come to bite you in the end over time, so I reserve the
> right to say "I told you so" when that happens...
> 
> But, I don't have a better answer for it now, so don't really worry
> about it :)
> 
> thanks,
> 
> greg k-h

Looks like a few late breaking comments came in, but nothing that can't
be fixed up before this reaches a release.

Thanks,

Jonathan




Re: [PATCH v10 00/18] Introduce the Counter subsystem

2019-04-07 Thread Jonathan Cameron
On Tue,  2 Apr 2019 15:30:35 +0900
William Breathitt Gray  wrote:

> Changes in v10:
>   - Fix minor typographical errors in documentation
>   - Merge the FlexTimer Module Quadrature decoder counter driver patches
> 
> This revision is functionally identical to the last; changes in this
> version were made to fix minor typos in the documentation files and also
> to pull in the new FTM quadrature decoder counter driver.
> 
> The Generic Counter API has been and is still in a feature freeze until
> it is merged into the mainline. The following features will be
> investigated after the merge: interrupt support for counter devices, and
> a character device interface for low-latency applications.

Hi William / al,

So the question is how to move this forwards?  I'm happy with how it turned
out and the existing drivers we had in IIO are a lot cleaner under
the counter subsystem (other than the backwards compatibility for those that
ever existed in IIO).  For those not following closely, the situation is:

1. Counter drivers never really fitted that well in IIO, because IIO is
focused on an abstraction of individual channels that just doesn't match
to these devices.  It's just the wrong model. 

2. William tried hard in earlier proposals to extend IIO to support these
devices well, but it became so convoluted and involved I advised him that
we were better off with a separate subsystem.  The amount of code overlap
between the core IIO support for counters and the rest of IIO had
become very small and it would have been a maintenance problem for both.
https://lwn.net/Articles/729363/ gives some of the history

3. The new subsystem introduced by this series is fairly simple, clean
and well aligned with the way these devices work. There are (I think)
4 initial drivers in this series from 4 different authors so it's got
some practical review that way!
There are a couple more drivers under development.  Right now, not
everyone is aware of this work and so we have had a few developers potentially
waste their time writing IIO drivers (which are then ported to this) rather
than starting with the counter subsystem.

So what we are after is more review, or agreement that we can move this
series forwards.  For now the intent is that the counter subsystem will
share the linux-iio mailing list etc but I don't think either William
or I have any particularly strong views on how we actually handle the
patches.  I'm more than happy to take them through the IIO tree, if that
works for everyone, particularly Greg as IIO goes through him after me.
Once it is in a release, the cross dependency is broken and we can think
about longer term approaches.

So Greg and others, how do we make progress here?  If there are any obvious
reviewers not on the CC list, please do draw their attention to this.

Thanks,

Jonathan

+CC linux-api as obviously one of the biggest areas for review is the new
userspace ABI.

> 
> Benjamin Gaignard (2):
>   counter: Add STM32 Timer quadrature encoder
>   dt-bindings: counter: Document stm32 quadrature encoder
> 
> Fabrice Gasnier (2):
>   counter: stm32-lptimer: add counter device
>   dt-bindings: counter: Adjust dt-bindings for STM32 lptimer move
> 
> Patrick Havelange (7):
>   include/fsl: add common FlexTimer #defines in a separate header.
>   drivers/pwm: pwm-fsl-ftm: use common header for FlexTimer #defines
>   drivers/clocksource: timer-fsl-ftm: use common header for FlexTimer
> #defines
>   dt-bindings: counter: ftm-quaddec
>   counter: add FlexTimer Module Quadrature decoder counter driver
>   counter: ftm-quaddec: Documentation: Add specific counter sysfs
> documentation
>   LS1021A: dtsi: add ftm quad decoder entries
> 
> William Breathitt Gray (7):
>   counter: Introduce the Generic Counter interface
>   counter: Documentation: Add Generic Counter sysfs documentation
>   docs: Add Generic Counter interface documentation
>   iio: 104-quad-8: Update license boilerplate
>   counter: 104-quad-8: Add Generic Counter interface support
>   counter: 104-quad-8: Documentation: Add Generic Counter sysfs
> documentation
>   iio: counter: Add deprecation markings for IIO Counter attributes
> 
>  Documentation/ABI/testing/sysfs-bus-counter   |  230 +++
>  .../ABI/testing/sysfs-bus-counter-104-quad-8  |   36 +
>  .../ABI/testing/sysfs-bus-counter-ftm-quaddec |   16 +
>  Documentation/ABI/testing/sysfs-bus-iio   |8 +
>  .../testing/sysfs-bus-iio-counter-104-quad-8  |   16 +
>  .../bindings/counter/ftm-quaddec.txt  |   18 +
>  .../{iio => }/counter/stm32-lptimer-cnt.txt   |0
>  .../bindings/counter/stm32-timer-cnt.txt  |   31 +
>  .../devicetree/bindings/mfd/stm32-lptimer.txt |2 +-
>  .../devicetree/bindings/mfd/stm32-timers.txt  |7 +
>  Documentation/driver-api/generic-counter.rst  |  342 
>  Documentation/driver-api/index.rst|1 +
>  MAINTAINERS   |   15 +-
>  arch/arm/boot/dts/ls1021a.dtsi|   28 +
>  

Re: [PATCH v2 4/7] dt-bindings: counter: ftm-quaddec

2019-03-16 Thread Jonathan Cameron
On Tue, 12 Mar 2019 14:09:52 -0500
Rob Herring  wrote:

> On Wed, Mar 06, 2019 at 12:12:05PM +0100, Patrick Havelange wrote:
> > FlexTimer quadrature decoder driver.
> > 
> > Signed-off-by: Patrick Havelange 
> > Reviewed-by: Esben Haabendal 
> > ---
> > Changes v2
> >  - None
> > ---
> >  .../bindings/counter/ftm-quaddec.txt   | 18 ++
> >  1 file changed, 18 insertions(+)
> >  create mode 100644 
> > Documentation/devicetree/bindings/counter/ftm-quaddec.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/counter/ftm-quaddec.txt 
> > b/Documentation/devicetree/bindings/counter/ftm-quaddec.txt
> > new file mode 100644
> > index ..4d18cd722074
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/counter/ftm-quaddec.txt
> > @@ -0,0 +1,18 @@
> > +FlexTimer Quadrature decoder counter
> > +
> > +This driver exposes a simple counter for the quadrature decoder mode.  
> 
> Seems like this is more a mode of a h/w block than describing a h/w 
> block. Bindings should do the latter.
The snag is that we need to dig ourselves out of the hole set by:
fsl,vf610-ftm-pwm etc.

Documentation/devicetree/bindings/pwm/pwm-fsl-ftm.txt
Documentation/devicetree/bindings/timer/fsl,ftm-timer.txt
(I'm assuming these are the same IP block).

Can probably be sorted out though.  One core driver binds against the
ftm and deals with instantiating the others depending on the configuration
(note that this mode for instance does make sense in DT as it's
really reflecting the fact there is a quadrature encoder
connected to the ftm).

Fiddly though :)

J
> 
> > +
> > +Required properties:
> > +- compatible:  Must be "fsl,ftm-quaddec".
> > +- reg: Must be set to the memory region of the 
> > flextimer.
> > +
> > +Optional property:
> > +- big-endian:  Access the device registers in big-endian mode.
> > +
> > +Example:
> > +   counter0: counter@29d {
> > +   compatible = "fsl,ftm-quaddec";
> > +   reg = <0x0 0x29d 0x0 0x1>;
> > +   big-endian;
> > +   status = "disabled";
> > +   };
> > -- 
> > 2.19.1
> >   



Re: [PATCH v2 5/7] counter: add FlexTimer Module Quadrature decoder counter driver

2019-03-11 Thread Jonathan Cameron
On Wed, 6 Mar 2019 12:12:06 +0100
Patrick Havelange  wrote:

> This driver exposes the counter for the quadrature decoder of the
> FlexTimer Module, present in the LS1021A soc.
> 
> Signed-off-by: Patrick Havelange 
A few really trivial bits inline to add to William's feedback.

Otherwise I'm happy enough,

Reviewed-by: Jonathan Cameron 

> ---
> Changes v2
>  - Rebased on new counter subsystem
>  - Cleaned up included headers
>  - Use devm_ioremap()
>  - Correct order of devm_ and unmanaged resources
> ---
>  drivers/counter/Kconfig   |   9 +
>  drivers/counter/Makefile  |   1 +
>  drivers/counter/ftm-quaddec.c | 356 ++
>  3 files changed, 366 insertions(+)
>  create mode 100644 drivers/counter/ftm-quaddec.c
> 
> diff --git a/drivers/counter/Kconfig b/drivers/counter/Kconfig
> index 87c491a19c63..233ac305d878 100644
> --- a/drivers/counter/Kconfig
> +++ b/drivers/counter/Kconfig
> @@ -48,4 +48,13 @@ config STM32_LPTIMER_CNT
> To compile this driver as a module, choose M here: the
> module will be called stm32-lptimer-cnt.
>  
> +config FTM_QUADDEC
> + tristate "Flex Timer Module Quadrature decoder driver"
> + help
> +   Select this option to enable the Flex Timer Quadrature decoder
> +   driver.
> +
> +   To compile this driver as a module, choose M here: the
> +   module will be called ftm-quaddec.
> +
>  endif # COUNTER
> diff --git a/drivers/counter/Makefile b/drivers/counter/Makefile
> index 5589976d37f8..0c9e622a6bea 100644
> --- a/drivers/counter/Makefile
> +++ b/drivers/counter/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_COUNTER) += counter.o
>  obj-$(CONFIG_104_QUAD_8) += 104-quad-8.o
>  obj-$(CONFIG_STM32_TIMER_CNT)+= stm32-timer-cnt.o
>  obj-$(CONFIG_STM32_LPTIMER_CNT)  += stm32-lptimer-cnt.o
> +obj-$(CONFIG_FTM_QUADDEC)+= ftm-quaddec.o
> diff --git a/drivers/counter/ftm-quaddec.c b/drivers/counter/ftm-quaddec.c
> new file mode 100644
> index ..1bc9e075a386
> --- /dev/null
> +++ b/drivers/counter/ftm-quaddec.c
> @@ -0,0 +1,356 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Flex Timer Module Quadrature decoder
> + *
> + * This module implements a driver for decoding the FTM quadrature
> + * of ex. a LS1021A
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct ftm_quaddec {
> + struct counter_device counter;
> + struct platform_device *pdev;
> + void __iomem *ftm_base;
> + bool big_endian;
> + struct mutex ftm_quaddec_mutex;
> +};
> +
> +static void ftm_read(struct ftm_quaddec *ftm, uint32_t offset, uint32_t 
> *data)
> +{
> + if (ftm->big_endian)
> + *data = ioread32be(ftm->ftm_base + offset);
> + else
> + *data = ioread32(ftm->ftm_base + offset);
> +}
> +
> +static void ftm_write(struct ftm_quaddec *ftm, uint32_t offset, uint32_t 
> data)
> +{
> + if (ftm->big_endian)
> + iowrite32be(data, ftm->ftm_base + offset);
> + else
> + iowrite32(data, ftm->ftm_base + offset);
> +}
> +
> +/*
> + * take mutex
> + * call ftm_clear_write_protection
> + * update settings
> + * call ftm_set_write_protection
> + * release mutex
> + */
> +static void ftm_clear_write_protection(struct ftm_quaddec *ftm)
> +{
> + uint32_t flag;
> +
> + /* First see if it is enabled */
> + ftm_read(ftm, FTM_FMS, &flag);
> +
> + if (flag & FTM_FMS_WPEN) {
> + ftm_read(ftm, FTM_MODE, &flag);
> + ftm_write(ftm, FTM_MODE, flag | FTM_MODE_WPDIS);
> + }
> +}
> +
> +static void ftm_set_write_protection(struct ftm_quaddec *ftm)
> +{
> + ftm_write(ftm, FTM_FMS, FTM_FMS_WPEN);
> +}
> +
> +static void ftm_reset_counter(struct ftm_quaddec *ftm)
> +{
> + /* Reset hardware counter to CNTIN */
> + ftm_write(ftm, FTM_CNT, 0x0);
> +}
> +
> +static void ftm_quaddec_init(struct ftm_quaddec *ftm)
> +{
> + ftm_clear_write_protection(ftm);
> +
> + /*
> +  * Do not write in the region from the CNTIN register through the
> +  * PWMLOAD register when FTMEN = 0.
> +  */
> + ftm_write(ftm, FTM_MODE, FTM_MODE_FTMEN);
> + ftm_write(ftm, FTM_CNTIN, 0x);
> + ftm_write(ftm, FTM_MOD, 0x);
> + ftm_write(ftm, FTM_CNT, 0x0);
> + ftm_write(ftm, FTM_SC, FTM_SC_PS_1);
> +
> + /* Select quad mode */
> + ftm_write(ftm, FTM_QDCTRL, FTM_QDCTRL_QUADEN);
> +
> + /* Unused features and reset to default section */
> + ftm_write(ftm, FTM_POL, 
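
The ftm_read()/ftm_write() helpers quoted above pick a big-endian or
little-endian accessor depending on the "big-endian" property. As a rough
userspace sketch of what that choice means for a 32-bit register value
(reg_read32() is a hypothetical name for illustration, not a kernel API, and
it works on a plain byte buffer rather than on MMIO):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: assemble a 32-bit value from four raw bytes,
 * interpreting them in the requested byte order.
 */
static uint32_t reg_read32(const uint8_t *p, bool big_endian)
{
	if (big_endian)
		/* Most significant byte is stored first. */
		return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
		       ((uint32_t)p[2] << 8) | (uint32_t)p[3];

	/* Least significant byte is stored first. */
	return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}
```

The same four bytes give two different register values, which is why the
driver has to know up front how the block is wired into the SoC.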

Re: [PATCH v2 6/7] counter: ftm-quaddec: Documentation: Add specific counter sysfs documentation

2019-03-11 Thread Jonathan Cameron
On Thu, 7 Mar 2019 20:42:16 +0900
William Breathitt Gray  wrote:

> On Wed, Mar 06, 2019 at 12:12:07PM +0100, Patrick Havelange wrote:
> > This adds documentation for the specific prescaler entry.
> > 
> > Signed-off-by: Patrick Havelange 
> > ---
> > Changes v2
> >  - Add doc for prescaler entry
> > ---
> >  .../ABI/testing/sysfs-bus-counter-ftm-quaddec| 16 
> >  1 file changed, 16 insertions(+)
> >  create mode 100644 Documentation/ABI/testing/sysfs-bus-counter-ftm-quaddec
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-bus-counter-ftm-quaddec 
> > b/Documentation/ABI/testing/sysfs-bus-counter-ftm-quaddec
> > new file mode 100644
> > index ..2da629d6d485
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-bus-counter-ftm-quaddec
> > @@ -0,0 +1,16 @@
> > +What:  
> > /sys/bus/counter/devices/counterX/countY/prescaler_available
> > +KernelVersion: 5.1
> > +Contact:   linux-...@vger.kernel.org
> > +Description:
> > +   Discrete set of available values for the respective Count Y
> > +   configuration are listed in this file. Values are delimited by
> > +   newline characters.
> > +
> > +What:  /sys/bus/counter/devices/counterX/countY/prescaler
> > +KernelVersion: 5.1
> > +Contact:   linux-...@vger.kernel.org
> > +Description:
> > +   Configure the prescaler value associated with Count Y.
> > +   On the FlexTimer, the counter clock source passes through a
> > +   prescaler that is a 7-bit counter. This acts like a clock
> > +   divider.
> > -- 
> > 2.19.1  
> 
> Hmm, prescalers seem common enough among counter devices to permit these
> attributes to be listed in the sysfs-bus-counter documentation file.
> However, I'd like to wait until we get another counter driver for a
> device with a prescaler before we make that move. From there, we'll have
> a better vantage point to determine a fitting standard prescaler
> attribute behavior.
> 
> So for now, we'll keep these attributes documented here in the
> sysfs-bus-counter-ftm-quaddec file, until the time comes to broach the
> discussion again.
Agreed. As long as the definition is sufficiently non-specific so it can be
moved later.  I'm not sure for example that the docs need to say that it is
a 7 bit counter. That should be apparent from prescaler_available - or at
least possible values should be which is all we need to know.

Jonathan
> 
> William Breathitt Gray




Re: [PATCH 8/8] iio/counter/ftm-quaddec: add handling of under/overflow of the counter.

2019-02-20 Thread Jonathan Cameron
On Mon, 18 Feb 2019 15:03:21 +0100
Patrick Havelange  wrote:

> This is implemented by polling the counter value. A new parameter
> "poll-interval" can be set in the device tree, or can be changed
> at runtime. The reason for the polling is to avoid interrupts flooding.
> If the quadrature input is going up and down around the overflow value
> (or around 0), the interrupt will be triggering all the time. Thus,
> polling is an easy way to handle overflow in a consistent way.
> Polling can still be disabled by setting poll-interval to 0.
> 
> Signed-off-by: Patrick Havelange 
> Reviewed-by: Esben Haabendal 
Comments inline.

Jonathan

> ---
>  drivers/iio/counter/ftm-quaddec.c | 199 +-
>  1 file changed, 193 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iio/counter/ftm-quaddec.c 
> b/drivers/iio/counter/ftm-quaddec.c
> index ca7e55a9ab3f..3a0395c3ef33 100644
> --- a/drivers/iio/counter/ftm-quaddec.c
> +++ b/drivers/iio/counter/ftm-quaddec.c
> @@ -25,11 +25,33 @@
>  
>  struct ftm_quaddec {
>   struct platform_device *pdev;
> + struct delayed_work delayedcounterwork;
>   void __iomem *ftm_base;
>   bool big_endian;
> +
> + /* Offset added to the counter to adjust for overflows of the
> +  * 16 bit HW counter. Only the 16 MSB are set.
Comment syntax.
> +  */
> + uint32_t counteroffset;
> +
> + /* Store the counter on each read, this is used to detect
> +  * if the counter readout if we over or underflow
> +  */
> + uint8_t lastregion;
> +
> + /* Poll-interval, in ms before delayed work must poll counter */
> + uint16_t poll_interval;
> +
>   struct mutex ftm_quaddec_mutex;
>  };
>  
> +struct counter_result {
> + /* 16 MSB are from the counteroffset
> +  * 16 LSB are from the hardware counter
> +  */
> + uint32_t value;
Why the structure?

> +};
> +
>  #define HASFLAGS(flag, bits) ((flag & bits) ? 1 : 0)
>  
>  #define DEFAULT_POLL_INTERVAL100 /* in msec */
> @@ -74,8 +96,75 @@ static void ftm_set_write_protection(struct ftm_quaddec 
> *ftm)
>   ftm_write(ftm, FTM_FMS, FTM_FMS_WPEN);
>  }
>  
> +/* must be called with mutex locked */
> +static void ftm_work_reschedule(struct ftm_quaddec *ftm)
> +{
> + cancel_delayed_work(&ftm->delayedcounterwork);
> + if (ftm->poll_interval > 0)
> + schedule_delayed_work(&ftm->delayedcounterwork,
> +msecs_to_jiffies(ftm->poll_interval));
> +}
> +
> +/* Reports the hardware counter added to the offset counter.
> + *
> + * The quadrature decoder does not use interrupts, because it cannot be
> + * guaranteed that the counter won't flip between 0xFFFF and 0x0000 at a high
> + * rate, causing Real Time performance degradation. Instead the counter must be
> + * read frequently enough - the assumption is 150 KHz input can be handled with
> + * 100 ms read cycles.
> + */
> +static void ftm_work_counter(struct ftm_quaddec *ftm,
> +  struct counter_result *returndata)
> +{
> + /* only 16bits filled in*/
> + uint32_t hwcounter;
> + uint8_t currentregion;
> +
> + mutex_lock(&ftm->ftm_quaddec_mutex);
> +
> + ftm_read(ftm, FTM_CNT, &hwcounter);
> +
> + /* Divide the counter in four regions:
> +  *   0x0000-0x4000-0x8000-0xC000-0xFFFF
> +  * When the hwcounter changes between region 0 and 3 there is an
> +  * over/underflow
> +  */
> + currentregion = hwcounter / 0x4000;
> +
> + if (ftm->lastregion == 3 && currentregion == 0)
> + ftm->counteroffset += 0x10000;
> +
> + if (ftm->lastregion == 0 && currentregion == 3)
> + ftm->counteroffset -= 0x10000;
> +
> + ftm->lastregion = currentregion;
> +
> + if (returndata)
> + returndata->value = ftm->counteroffset + hwcounter;
> +
> + ftm_work_reschedule(ftm);
> +
> + mutex_unlock(&ftm->ftm_quaddec_mutex);
> +}
> +
> +/* wrapper around the real function */
> +static void ftm_work_counter_delay(struct work_struct *workptr)
> +{
> + struct delayed_work *work;
> + struct ftm_quaddec *ftm;
> +
> + work = container_of(workptr, struct delayed_work, work);
> + ftm = container_of(work, struct ftm_quaddec, delayedcounterwork);
> +
> + ftm_work_counter(ftm, NULL);
> +}
> +
> +/* must be called with mutex locked */
>  static void ftm_reset_counter(struct ftm_quaddec *ftm)
>  {
> + ftm->counteroffset = 0;
> + ftm->lastregion = 0;
> +
>   /* Reset hardware counter to CNTIN */
>   ftm_write(ftm, FTM_CNT, 0x0);
>  }
> @@ -110,18 +199,91 @@ static int ftm_quaddec_read_raw(struct iio_dev 
> *indio_dev,
>   int *val, int *val2, long mask)
>  {
>   struct ftm_quaddec *ftm = iio_priv(indio_dev);
> - uint32_t counter;
> + struct counter_result counter;
>  
>   switch (mask) {
>   case IIO_CHAN_INFO_RAW:
> - ftm_read(ftm, FTM_CNT, &counter);
> - *val = counter;
> + case IIO_CHAN_INFO_PROCESSED:

> +  
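
The polling scheme in the quoted patch extends a 16-bit hardware counter to
32 bits by watching which quarter of the range each reading falls in. A
minimal standalone sketch of that idea follows; struct ext_counter and
ext_counter_update() are hypothetical names for illustration, and the sketch
assumes, as the patch does, that the counter is polled often enough that it
never moves more than one quarter of the range between two reads:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software state; not the driver's actual structure. */
struct ext_counter {
	uint32_t offset;      /* only the upper 16 bits are ever non-zero */
	uint8_t last_region;  /* quarter (0..3) seen at the previous poll */
};

/* Fold one 16-bit hardware reading into a 32-bit running count. */
static uint32_t ext_counter_update(struct ext_counter *c, uint16_t hw)
{
	/* 0x0000-0x3FFF -> 0, 0x4000-0x7FFF -> 1, ..., 0xC000-0xFFFF -> 3 */
	uint8_t region = hw / 0x4000;

	if (c->last_region == 3 && region == 0)
		c->offset += 0x10000;   /* counter wrapped upwards */
	else if (c->last_region == 0 && region == 3)
		c->offset -= 0x10000;   /* counter wrapped downwards */

	c->last_region = region;
	return c->offset + hw;
}
```

A jump straight from region 0 to region 3 is read as an underflow, so the
32-bit result wraps below zero in unsigned arithmetic; a caller that wants
negative positions would need to cast the result to a signed type.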

Re: [PATCH 5/8] iio/counter: add FlexTimer Module Quadrature decoder counter driver

2019-02-20 Thread Jonathan Cameron
On Mon, 18 Feb 2019 15:03:18 +0100
Patrick Havelange  wrote:

> This driver exposes the counter for the quadrature decoder of the
> FlexTimer Module, present in the LS1021A soc.
> 
> Signed-off-by: Patrick Havelange 
> Reviewed-by: Esben Haabendal 
Given you cc'd William, I'm guessing you know about the counter
subsystem effort.  I would really rather not take any drivers
into IIO if we have any hope of getting that upstreamed soon
(which I personally think we do and should!).  The reason is
we end up having to maintain old ABI just because someone might be using
it and it makes the drivers very messy.

I'll review as is though, as there may be some elements that will
cross over.

Comments inline.  William: Looks like a straightforward conversion if
it makes sense to get this lined up as part of your initial submission?
You have quite a few drivers so I wouldn't have said it needs to be there
at the start, but good to have it soon after.

Jonathan

> ---
>  drivers/iio/counter/Kconfig   |  10 +
>  drivers/iio/counter/Makefile  |   1 +
>  drivers/iio/counter/ftm-quaddec.c | 294 ++
>  3 files changed, 305 insertions(+)
>  create mode 100644 drivers/iio/counter/ftm-quaddec.c
> 
> diff --git a/drivers/iio/counter/Kconfig b/drivers/iio/counter/Kconfig
> index bf1e559ad7cd..4641cb2e752a 100644
> --- a/drivers/iio/counter/Kconfig
> +++ b/drivers/iio/counter/Kconfig
> @@ -31,4 +31,14 @@ config STM32_LPTIMER_CNT
>  
> To compile this driver as a module, choose M here: the
> module will be called stm32-lptimer-cnt.
> +
> +config FTM_QUADDEC
> + tristate "Flex Timer Module Quadrature decoder driver"
> + help
> +   Select this option to enable the Flex Timer Quadrature decoder
> +   driver.
> +
> +   To compile this driver as a module, choose M here: the
> +   module will be called ftm-quaddec.
> +
>  endmenu
> diff --git a/drivers/iio/counter/Makefile b/drivers/iio/counter/Makefile
> index 1b9a896eb488..757c1f4196af 100644
> --- a/drivers/iio/counter/Makefile
> +++ b/drivers/iio/counter/Makefile
> @@ -6,3 +6,4 @@
>  
>  obj-$(CONFIG_104_QUAD_8) += 104-quad-8.o
>  obj-$(CONFIG_STM32_LPTIMER_CNT)  += stm32-lptimer-cnt.o
> +obj-$(CONFIG_FTM_QUADDEC)+= ftm-quaddec.o
> diff --git a/drivers/iio/counter/ftm-quaddec.c 
> b/drivers/iio/counter/ftm-quaddec.c
> new file mode 100644
> index ..ca7e55a9ab3f
> --- /dev/null
> +++ b/drivers/iio/counter/ftm-quaddec.c
> @@ -0,0 +1,294 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Flex Timer Module Quadrature decoder
> + *
> + * This module implements a driver for decoding the FTM quadrature
> + * of ex. a LS1021A
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
Tidy these up. Not all are used.
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct ftm_quaddec {
> + struct platform_device *pdev;
> + void __iomem *ftm_base;
> + bool big_endian;

I'm curious. What is the benefit of running in big endian mode?

> + struct mutex ftm_quaddec_mutex;
> +};
> +
> +#define HASFLAGS(flag, bits) ((flag & bits) ? 1 : 0)
Not used.


> +
> +#define DEFAULT_POLL_INTERVAL100 /* in msec */
> +
> +static void ftm_read(struct ftm_quaddec *ftm, uint32_t offset, uint32_t 
> *data)
> +{
> + if (ftm->big_endian)
> + *data = ioread32be(ftm->ftm_base + offset);
> + else
> + *data = ioread32(ftm->ftm_base + offset);
> +}
> +
> +static void ftm_write(struct ftm_quaddec *ftm, uint32_t offset, uint32_t 
> data)
> +{
> + if (ftm->big_endian)
> + iowrite32be(data, ftm->ftm_base + offset);
> + else
> + iowrite32(data, ftm->ftm_base + offset);
> +}
> +
> +/* take mutex

Tidy this comment up.  I would have said the flow is fairly
obvious and the only thing needed here is to document that
the mutex must be held?

> + * call ftm_clear_write_protection
> + * update settings
> + * call ftm_set_write_protection
> + * release mutex
> + */
> +static void ftm_clear_write_protection(struct ftm_quaddec *ftm)
> +{
> + uint32_t flag;
> +
> + /* First see if it is enabled */
> + ftm_read(ftm, FTM_FMS, &flag);
> +
> + if (flag & FTM_FMS_WPEN) {
> + ftm_read(ftm, FTM_MODE, &flag);
> + ftm_write(ftm, FTM_MODE, flag | FTM_MODE_WPDIS);
> + }
> +}
> +
> +static void ftm_set_write_protection(struct ftm_quaddec *ftm)
> +{
> + ftm_write(ftm, FTM_FMS, FTM_FMS_WPEN);
> +}
> +
> +static void ftm_reset_counter(struct ftm_quaddec *ftm)
> +{
> + /* Reset hardware counter to CNTIN */
> + ftm_write(ftm, FTM_CNT, 0x0);
> +}
> +
> +static void ftm_quaddec_init(struct ftm_quaddec *ftm)
> +{
> + ftm_clear_write_protection(ftm);
> +
> + /* Do not write in the region from the CNTIN register through the
IIO multiline syntax is
/*
 * Do not write
 * PWM..
 */
> +  * PWMLOAD 

Re: [RFC PATCH 03/29] mm: remove CONFIG_HAVE_MEMBLOCK

2018-09-19 Thread Jonathan Cameron
On Wed, 19 Sep 2018 13:34:57 +0300
Mike Rapoport  wrote:

> Hi Jonathan,
> 
> On Wed, Sep 19, 2018 at 10:04:49AM +0100, Jonathan Cameron wrote:
> > On Wed, 5 Sep 2018 18:59:18 +0300
> > Mike Rapoport  wrote:
> >   
> > > All architectures use memblock for early memory management. There is no 
> > > need
> > > for the CONFIG_HAVE_MEMBLOCK configuration option.
> > > 
> > > Signed-off-by: Mike Rapoport   
> > 
> > Hi Mike,
> > 
> > A minor editing issue in here that is stopping boot on arm64 platforms with 
> > latest
> > version of the mm tree.  
> 
> Can you please try the following patch:
> 
> 
> From 079bd5d24a01df3df9500d0a33d89cb9f7da4588 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport 
> Date: Wed, 19 Sep 2018 13:29:27 +0300
> Subject: [PATCH] of/fdt: fixup #ifdefs after removal of HAVE_MEMBLOCK config
>  option
> 
> The removal of HAVE_MEMBLOCK configuration option, mistakenly dropped the
> wrong #endif. This patch restores that #endif and removes the part that
> should have been actually removed, starting from #else and up to the
> correct #endif
> 
> Reported-by: Jonathan Cameron 
> Signed-off-by: Mike Rapoport 

Hi Mike,

That's identical to the local patch I'm carrying to fix this so looks good
to me.

For what it's worth given you'll probably fold this into the larger patch.

Tested-by: Jonathan Cameron 

Thanks for the quick reply.

Jonathan

> ---
>  drivers/of/fdt.c | 21 +
>  1 file changed, 1 insertion(+), 20 deletions(-)
> 
> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> index 48314e9..bb532aa 100644
> --- a/drivers/of/fdt.c
> +++ b/drivers/of/fdt.c
> @@ -1119,6 +1119,7 @@ int __init early_init_dt_scan_chosen(unsigned long 
> node, const char *uname,
>  #endif
>  #ifndef MAX_MEMBLOCK_ADDR
>  #define MAX_MEMBLOCK_ADDR((phys_addr_t)~0)
> +#endif
>  
>  void __init __weak early_init_dt_add_memory_arch(u64 base, u64 size)
>  {
> @@ -1175,26 +1176,6 @@ int __init __weak 
> early_init_dt_reserve_memory_arch(phys_addr_t base,
>   return memblock_reserve(base, size);
>  }
>  
> -#else
> -void __init __weak early_init_dt_add_memory_arch(u64 base, u64 size)
> -{
> - WARN_ON(1);
> -}
> -
> -int __init __weak early_init_dt_mark_hotplug_memory_arch(u64 base, u64 size)
> -{
> - return -ENOSYS;
> -}
> -
> -int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
> - phys_addr_t size, bool nomap)
> -{
> - pr_err("Reserved memory not supported, ignoring range %pa - %pa%s\n",
> -   &base, &size, nomap ? " (nomap)" : "");
> - return -ENOSYS;
> -}
> -#endif
> -
>  static void * __init early_init_dt_alloc_memory_arch(u64 size, u64 align)
>  {
>   return memblock_alloc(size, align);




Re: [RFC PATCH 03/29] mm: remove CONFIG_HAVE_MEMBLOCK

2018-09-19 Thread Jonathan Cameron
On Wed, 5 Sep 2018 18:59:18 +0300
Mike Rapoport  wrote:

> All architectures use memblock for early memory management. There is no need
> for the CONFIG_HAVE_MEMBLOCK configuration option.
> 
> Signed-off-by: Mike Rapoport 

Hi Mike,

A minor editing issue in here that is stopping boot on arm64 platforms with
the latest version of the mm tree.

> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> index 76c83c1..bd841bb 100644
> --- a/drivers/of/fdt.c
> +++ b/drivers/of/fdt.c
> @@ -1115,13 +1115,11 @@ int __init early_init_dt_scan_chosen(unsigned long 
> node, const char *uname,
>   return 1;
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK
>  #ifndef MIN_MEMBLOCK_ADDR
>  #define MIN_MEMBLOCK_ADDR__pa(PAGE_OFFSET)
>  #endif
>  #ifndef MAX_MEMBLOCK_ADDR
>  #define MAX_MEMBLOCK_ADDR((phys_addr_t)~0)
> -#endif

This isn't the right #endif. It matches the #ifndef MAX_MEMBLOCK_ADDR,
not the intended #ifdef CONFIG_HAVE_MEMBLOCK.

Now I haven't chased through the exact reason this is causing my acpi
arm64 system not to boot on the basis it is obviously mismatched anyway
and I'm inherently lazy.  It's resulting in stubs replacing the following weak
functions.

early_init_dt_add_memory_arch
(this is defined elsewhere for some architectures but not arm)

early_init_dt_mark_hotplug_memory_arch
(there is only one definition of this in the kernel so it doesn't
 need to be weak or in the header etc).

early_init_dt_reserve_memory_arch
(defined on mips but nothing else)

Taking out the right endif also lets you drop an #else removing some stub
functions further down in here.

Nice cleanup in general btw.

Thanks,

Jonathan
>  
>  void __init __weak early_init_dt_add_memory_arch(u64 base, u64 size)
>  {
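
To illustrate the mismatched #endif described above, here is a reduced
sketch of the preprocessor structure left behind once the wrong #endif is
dropped. It assumes, purely for illustration, that an architecture header has
already defined MAX_MEMBLOCK_ADDR; add_memory() is a hypothetical stand-in
for the weak early_init_dt_* functions:

```c
#include <assert.h>

/* Assumption: the architecture already provides its own limit. */
#define MAX_MEMBLOCK_ADDR 0xffffffffUL

#ifndef MAX_MEMBLOCK_ADDR
#define MAX_MEMBLOCK_ADDR (~0UL)
/* The "#endif" that should close the #ifndef above is missing here... */

/* ...so the real implementation is skipped whenever the macro exists. */
static int add_memory(void)
{
	return 0;
}

#else /* this now pairs with "#ifndef MAX_MEMBLOCK_ADDR" */

/* Stub originally meant only for the !CONFIG_HAVE_MEMBLOCK case. */
static int add_memory(void)
{
	return -1;
}

#endif
```

With the macro predefined, only the stub is compiled, which matches the
symptom of stubs replacing the real functions. Restoring the #endif and
deleting the #else branch, as in the follow-up fix, leaves only the real
implementation.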



Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg

2018-09-17 Thread Jonathan Cameron
On Wed, 12 Sep 2018 17:08:52 +0200
Arnd Bergmann  wrote:

> The .ioctl and .compat_ioctl file operations have the same prototype so
> they can both point to the same function, which works great almost all
> the time when all the commands are compatible.
> 
> One exception is the s390 architecture, where a compat pointer is only
> 31 bit wide, and converting it into a 64-bit pointer requires calling
> compat_ptr(). Most drivers here will never run in s390, but since we now
> have a generic helper for it, it's easy enough to use it consistently.
> 
> I double-checked all these drivers to ensure that all ioctl arguments
> are used as pointers or are ignored, but are not interpreted as integer
> values.
> 
> Signed-off-by: Arnd Bergmann 
> ---

For IIO part.

Acked-by: Jonathan Cameron 

Thanks,
> diff --git a/drivers/iio/industrialio-core.c b/drivers/iio/industrialio-core.c
> index a062cfddc5af..22844b94b0e9 100644
> --- a/drivers/iio/industrialio-core.c
> +++ b/drivers/iio/industrialio-core.c
> @@ -1630,7 +1630,7 @@ static const struct file_operations iio_buffer_fileops 
> = {
>   .owner = THIS_MODULE,
>   .llseek = noop_llseek,
>   .unlocked_ioctl = iio_ioctl,
> - .compat_ioctl = iio_ioctl,
> + .compat_ioctl = generic_compat_ioctl_ptrarg,
>  };
>  
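
As a rough userspace sketch of why the pointer conversion matters on s390:
a 31-bit compat pointer has to lose its top bit when widened to 64 bits,
whereas plain zero-extension is enough elsewhere. The function names below
are hypothetical and the code only models the arithmetic; it is not the
kernel's compat_ptr() implementation:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t compat_uptr_t;

/* s390-style widening: keep the low 31 bits, drop bit 31. */
static uint64_t compat_ptr_31bit(compat_uptr_t uptr)
{
	return (uint64_t)(uptr & 0x7fffffffu);
}

/* Plain zero-extension of a full 32-bit user pointer. */
static uint64_t compat_ptr_zero_extend(compat_uptr_t uptr)
{
	return (uint64_t)uptr;
}
```

Passing the raw ioctl argument straight through, as happens when
.compat_ioctl points at the native handler, behaves like the second function,
which is only correct when bit 31 carries address information.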



Re: [PATCH 1/4] treewide: convert ISO_8859-1 text comments to utf-8

2018-07-24 Thread Jonathan Cameron
On Tue, 24 Jul 2018 13:13:25 +0200
Arnd Bergmann  wrote:

> Almost all files in the kernel are either plain text or UTF-8
> encoded. A couple however are ISO_8859-1, usually just a few
> characters in a C comments, for historic reasons.
> 
> This converts them all to UTF-8 for consistency.
> 
> Signed-off-by: Arnd Bergmann 
For IIO, Acked-by: Jonathan Cameron 

Thanks for tidying this up.

Jonathan

> ---
>  .../devicetree/bindings/net/nfc/pn544.txt |   2 +-
>  arch/arm/boot/dts/sun4i-a10-inet97fv2.dts |   2 +-
>  arch/arm/crypto/sha256_glue.c |   2 +-
>  arch/arm/crypto/sha256_neon_glue.c|   4 +-
>  drivers/crypto/vmx/ghashp8-ppc.pl |  12 +-
>  drivers/iio/dac/ltc2632.c |   2 +-
>  drivers/power/reset/ltc2952-poweroff.c|   4 +-
>  kernel/events/callchain.c |   2 +-
>  net/netfilter/ipvs/Kconfig|   8 +-
>  net/netfilter/ipvs/ip_vs_mh.c |   4 +-
>  tools/power/cpupower/po/de.po |  44 +++
>  tools/power/cpupower/po/fr.po | 120 +-
>  12 files changed, 103 insertions(+), 103 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/nfc/pn544.txt 
> b/Documentation/devicetree/bindings/net/nfc/pn544.txt
> index 538a86f7b2b0..72593f056b75 100644
> --- a/Documentation/devicetree/bindings/net/nfc/pn544.txt
> +++ b/Documentation/devicetree/bindings/net/nfc/pn544.txt
> @@ -2,7 +2,7 @@
>  
>  Required properties:
>  - compatible: Should be "nxp,pn544-i2c".
> -- clock-frequency: I_C work frequency.
> +- clock-frequency: I²C work frequency.
>  - reg: address on the bus
>  - interrupt-parent: phandle for the interrupt gpio controller
>  - interrupts: GPIO interrupt to which the chip is connected
> diff --git a/arch/arm/boot/dts/sun4i-a10-inet97fv2.dts 
> b/arch/arm/boot/dts/sun4i-a10-inet97fv2.dts
> index 5d096528e75a..71c27ea0b53e 100644
> --- a/arch/arm/boot/dts/sun4i-a10-inet97fv2.dts
> +++ b/arch/arm/boot/dts/sun4i-a10-inet97fv2.dts
> @@ -1,7 +1,7 @@
>  /*
>   * Copyright 2014 Open Source Support GmbH
>   *
> - * David Lanzend_rfer 
> + * David Lanzendörfer 
>   *
>   * This file is dual-licensed: you can use it either under the terms
>   * of the GPL or the X11 license, at your option. Note that this dual
> diff --git a/arch/arm/crypto/sha256_glue.c b/arch/arm/crypto/sha256_glue.c
> index bf8ccff2c9d0..0ae900e778f3 100644
> --- a/arch/arm/crypto/sha256_glue.c
> +++ b/arch/arm/crypto/sha256_glue.c
> @@ -2,7 +2,7 @@
>   * Glue code for the SHA256 Secure Hash Algorithm assembly implementation
>   * using optimized ARM assembler and NEON instructions.
>   *
> - * Copyright _ 2015 Google Inc.
> + * Copyright © 2015 Google Inc.
>   *
>   * This file is based on sha256_ssse3_glue.c:
>   *   Copyright (C) 2013 Intel Corporation
> diff --git a/arch/arm/crypto/sha256_neon_glue.c 
> b/arch/arm/crypto/sha256_neon_glue.c
> index 9bbee56fbdc8..1d82c6cd31a4 100644
> --- a/arch/arm/crypto/sha256_neon_glue.c
> +++ b/arch/arm/crypto/sha256_neon_glue.c
> @@ -2,10 +2,10 @@
>   * Glue code for the SHA256 Secure Hash Algorithm assembly implementation
>   * using NEON instructions.
>   *
> - * Copyright _ 2015 Google Inc.
> + * Copyright © 2015 Google Inc.
>   *
>   * This file is based on sha512_neon_glue.c:
> - *   Copyright _ 2014 Jussi Kivilinna 
> + *   Copyright © 2014 Jussi Kivilinna 
>   *
>   * This program is free software; you can redistribute it and/or modify it
>   * under the terms of the GNU General Public License as published by the Free
> diff --git a/drivers/crypto/vmx/ghashp8-ppc.pl 
> b/drivers/crypto/vmx/ghashp8-ppc.pl
> index f746af271460..38b06503ede0 100644
> --- a/drivers/crypto/vmx/ghashp8-ppc.pl
> +++ b/drivers/crypto/vmx/ghashp8-ppc.pl
> @@ -129,9 +129,9 @@ $code=<<___;
>le?vperm   $IN,$IN,$IN,$lemask
>   vxor$zero,$zero,$zero
>  
> - vpmsumd $Xl,$IN,$Hl # H.lo_Xi.lo
> - vpmsumd $Xm,$IN,$H  # H.hi_Xi.lo+H.lo_Xi.hi
> - vpmsumd $Xh,$IN,$Hh # H.hi_Xi.hi
> + vpmsumd $Xl,$IN,$Hl # H.lo·Xi.lo
> + vpmsumd $Xm,$IN,$H  # H.hi·Xi.lo+H.lo·Xi.hi
> + vpmsumd $Xh,$IN,$Hh # H.hi·Xi.hi
>  
>   vpmsumd $t2,$Xl,$xC2# 1st phase
>  
> @@ -187,11 +187,11 @@ $code=<<___;
>  .align   5
>  Loop:
>subic  $len,$len,16
> - vpmsumd $Xl,$IN,$Hl # H.lo_Xi.lo
> + vpmsumd $Xl,$IN,$Hl # H.lo·Xi.lo
>subfe. r0,r0,r0# borrow?-1

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-22 Thread Jonathan Cameron
On Mon, 21 Aug 2017 13:55:04 -0700
David Miller <da...@davemloft.net> wrote:

> From: Nicholas Piggin <npig...@gmail.com>
> Date: Tue, 22 Aug 2017 00:19:28 +1000
> 
> > Thanks here's an updated version with a couple more bugs fixed. If
> > you could try testing, that would be much appreciated.  
> 
> I'm not getting RCU stalls on sparc64 any longer with this patch.
> 
> I'm really happy you guys were able to figure out what was going
> wrong. :-)
> 
> Feel free to add my Tested-by:
> 

Likewise - 16 hours of clean running with the latest version.

Tested-by: Jonathan Cameron <jonathan.came...@huawei.com>

Thanks for all the hard work everyone put into this one, great to
cross it off the list!

Jonathan


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-21 Thread Jonathan Cameron
On Tue, 22 Aug 2017 00:19:28 +1000
Nicholas Piggin <npig...@gmail.com> wrote:

> On Mon, 21 Aug 2017 11:18:33 +0100
> Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> 
> > On Mon, 21 Aug 2017 16:06:05 +1000
> > Nicholas Piggin <npig...@gmail.com> wrote:
> >   
> > > On Mon, 21 Aug 2017 10:52:58 +1000
> > > Nicholas Piggin <npig...@gmail.com> wrote:
> > > 
> > > > On Sun, 20 Aug 2017 14:14:29 -0700
> > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > >   
> > > > > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote: 
> > > > >
> > > > > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:
> > > > > >   
> > > > > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > > > > Nicholas Piggin <npig...@gmail.com> wrote:
> > > > > > >   
> > > > > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:  
> > > > > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney 
> > > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > Thomas, John, am I misinterpreting the timer trace event 
> > > > > > > > > messages?
> > > > > > > > 
> > > > > > > > > So I did some digging, and what you find is that rcu_sched seems to do a
> > > > > > > > > simple schedule_timeout(1) and just goes out to lunch for many seconds.
> > > > > > > > > The process_timeout timer never fires (when it finally does wake after
> > > > > > > > > one of these events, it usually removes the timer with del_timer_sync).
> > > > > > > > 
> > > > > > > > So this patch seems to fix it. Testing, comments welcome.   
> > > > > > > >
> > > > > > > 
> > > > > > > Okay this had a problem of trying to forward the timer from a 
> > > > > > > timer
> > > > > > > callback function.
> > > > > > > 
> > > > > > > This was my other approach which also fixes the RCU warnings, but 
> > > > > > > it's
> > > > > > > a little more complex. I reworked it a bit so the mod_timer fast 
> > > > > > > path
> > > > > > > hopefully doesn't have much more overhead (actually by reading 
> > > > > > > jiffies
> > > > > > > only when needed, it probably saves a load).  
> > > > > > 
> > > > > > Giving this one a whirl!  
> > > > > 
> > > > > No joy here, but then again there are other reasons to believe that I
> > > > > am seeing a different bug than Dave and Jonathan are.
> > > > > 
> > > > > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > > > > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > > > > not statistically different from what I see without either patch.
> > > > > 
> > > > > But no statistical difference compared to without patch, and I still
> > > > > see the "rcu_sched kthread starved" messages.  For whatever it is 
> > > > > worth,
> > > > > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > > > > Hmmm...  I am also seeing that without any of your patches.  Might
> > > > > be hypervisor preemption, I guess.
> > > > 
> > > > Okay it makes the warnings go away for me, but I'm just booting then
> > > > leaving the system idle. You're doing some CPU hotplug activity?  
> > > 
> > > Okay found a bug in the patch (it was not forwarding properly before
> > > adding the first timer after an idle) and a few other concerns.
> > > 
> > > There's still a problem of a timer function doing a mod timer from
> > > within expire_timers. It can't forward the base, which might currently
> > > be quite a way behind. I *

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-21 Thread Jonathan Cameron
On Mon, 21 Aug 2017 16:06:05 +1000
Nicholas Piggin  wrote:

> On Mon, 21 Aug 2017 10:52:58 +1000
> Nicholas Piggin  wrote:
> 
> > On Sun, 20 Aug 2017 14:14:29 -0700
> > "Paul E. McKenney"  wrote:
> >   
> > > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:
> > > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:  
> > > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > > Nicholas Piggin  wrote:
> > > > >   
> > > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > > "Paul E. McKenney"  wrote:  
> > > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > > > > 
> > > > > > > Thomas, John, am I misinterpreting the timer trace event 
> > > > > > > messages?
> > > > > > 
> > > > > > > So I did some digging, and what you find is that rcu_sched seems to do a
> > > > > > > simple schedule_timeout(1) and just goes out to lunch for many seconds.
> > > > > > > The process_timeout timer never fires (when it finally does wake after
> > > > > > > one of these events, it usually removes the timer with del_timer_sync).
> > > > > > 
> > > > > > So this patch seems to fix it. Testing, comments welcome.  
> > > > > 
> > > > > Okay this had a problem of trying to forward the timer from a timer
> > > > > callback function.
> > > > > 
> > > > > This was my other approach which also fixes the RCU warnings, but it's
> > > > > a little more complex. I reworked it a bit so the mod_timer fast path
> > > > > hopefully doesn't have much more overhead (actually by reading jiffies
> > > > > only when needed, it probably saves a load).  
> > > > 
> > > > Giving this one a whirl!  
> > > 
> > > No joy here, but then again there are other reasons to believe that I
> > > am seeing a different bug than Dave and Jonathan are.
> > > 
> > > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > > not statistically different from what I see without either patch.
> > > 
> > > But no statistical difference compared to without patch, and I still
> > > see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> > > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > > Hmmm...  I am also seeing that without any of your patches.  Might
> > > be hypervisor preemption, I guess.
> > 
> > Okay it makes the warnings go away for me, but I'm just booting then
> > leaving the system idle. You're doing some CPU hotplug activity?  
> 
> Okay found a bug in the patch (it was not forwarding properly before
> adding the first timer after an idle) and a few other concerns.
> 
> There's still a problem of a timer function doing a mod timer from
> within expire_timers. It can't forward the base, which might currently
> be quite a way behind. I *think* after we close these gaps and get
> timely wakeups for timers on there, it should not get too far behind
> for standard timers.
> 
> Deferrable is a different story. Firstly it has no idle tracking so we
> never forward it. Even if we wanted to, we can't do it reliably because
> it could contain timers way behind the base. They are "deferrable", so
> you get what you pay for, but this still means there's a window where
> you can add a deferrable timer and get a far later expiry than you
> asked for despite the CPU never going idle after you added it.
> 
> All these problems would seem to go away if mod_timer just queued up
> the timer to a single list on the base then pushed them into the
> wheel during your wheel processing softirq... Although maybe you end
> up with excessive passes over big queue of timers. Anyway that
> wouldn't be suitable for 4.13 even if it could work.
> 
> I'll send out an updated minimal fix after some more testing...

Hi All,

I'm back in the office with hardware access on our D05 64 core ARM64
boards.

I think we still have by far the quickest test cases for this so
feel free to ping me anything you want tested quickly (we were
looking at an average of less than 10 minutes to trigger
with machine idling).

Nick, I'm currently running your previous version and we are over an
hour in without any instances of the issue, so it looks like a
considerable improvement.  I'll see if I can line a couple of boards
up for an overnight run if you have your updated version out by then.

Be great to finally put this one to bed.

Thanks,

Jonathan

> 
> Thanks,
> Nick



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-15 Thread Jonathan Cameron
On Tue, 15 Aug 2017 08:47:43 -0700
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> On Wed, Aug 02, 2017 at 05:25:55PM +0100, Jonathan Cameron wrote:
> > On Tue, 1 Aug 2017 11:46:46 -0700
> > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> >   
> > > On Mon, Jul 31, 2017 at 04:27:57PM +0100, Jonathan Cameron wrote:  
> > > > On Mon, 31 Jul 2017 08:04:11 -0700
> > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > On Mon, Jul 31, 2017 at 12:08:47PM +0100, Jonathan Cameron wrote:
> > > > > > On Fri, 28 Jul 2017 12:03:50 -0700
> > > > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > > >   
> > > > > > > On Fri, Jul 28, 2017 at 06:27:05PM +0100, Jonathan Cameron wrote: 
> > > > > > >  
> > > > > > > > On Fri, 28 Jul 2017 09:55:29 -0700
> > > > > > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > > > > > 
> > > > > > > > > On Fri, Jul 28, 2017 at 02:24:03PM +0100, Jonathan Cameron 
> > > > > > > > > wrote:
> > > > > > > > > > On Fri, 28 Jul 2017 08:44:11 +0100
> > > > > > > > > > Jonathan Cameron <jonathan.came...@huawei.com> wrote:   
> > > > > > > > > >
> > > > > > > > > 
> > > > > > > > > [ . . . ]
> > > > > > > > > 
> > > > > > > > > > > Ok.  Some info.  I disabled a few drivers (usb and SAS) in the interest of having
> > > > > > > > > > > fewer timer events.  Issue became much easier to trigger (on some runs before
> > > > > > > > > > > I could get tracing up and running).
> > > > > > > > > > >
> > > > > > > > > > > So logs are large enough that pastebin doesn't like them - please shout if
> > > > > > > > > > > another timer period is of interest.
> > > > > > > > > > 
> > > > > > > > > > https://pastebin.com/iUZDfQGM for the timer trace.
> > > > > > > > > > https://pastebin.com/3w1F7amH for dmesg.  
> > > > > > > > > > 
> > > > > > > > > > The relevant timeout on the RCU stall detector was 8 
> > > > > > > > > > seconds.  Event is
> > > > > > > > > > detected around 835.
> > > > > > > > > > 
> > > > > > > > > > It's a lot of logs, so I haven't identified a smoking gun 
> > > > > > > > > > yet but there
> > > > > > > > > > may well be one in there.  
> > > > > > > > > 
> > > > > > > > > The dmesg says:
> > > > > > > > > 
> > > > > > > > > rcu_preempt kthread starved for 2508 jiffies! g112 c111 f0x0 
> > > > > > > > > RCU_GP_WAIT_FQS(3) ->state=0x1
> > > > > > > > > 
> > > > > > > > > So I look for "rcu_preempt" timer events and find these:
> > > > > > > > > 
> > > > > > > > > rcu_preempt-9 [019]    827.579114: timer_init: 
> > > > > > > > > timer=8017d5fc7da0
> > > > > > > > > rcu_preempt-9 [019] d..1   827.579115: timer_start: 
> > > > > > > > > timer=8017d5fc7da0 function=process_timeout 
> > > > > > > > > 
> > > > > > > > > Next look for "8017d5fc7da0" and I don't find anything 
> > > > > > > > > else.
> > > > > > > > It does show up off the bottom of what would fit in pastebin...
> > > > > > > > 
> > > > > > > >  rcu_preempt-9 [001] d..1   837.681077: timer_cancel: 
> > > > > > > > timer=8017d5fc7da0
> > > > > > > >  rcu_preempt-9 [001]    837.681086: timer_init: 

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-02 Thread Jonathan Cameron
On Tue, 1 Aug 2017 11:46:46 -0700
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> On Mon, Jul 31, 2017 at 04:27:57PM +0100, Jonathan Cameron wrote:
> > On Mon, 31 Jul 2017 08:04:11 -0700
> > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> >   
> > > On Mon, Jul 31, 2017 at 12:08:47PM +0100, Jonathan Cameron wrote:  
> > > > On Fri, 28 Jul 2017 12:03:50 -0700
> > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > On Fri, Jul 28, 2017 at 06:27:05PM +0100, Jonathan Cameron wrote:
> > > > > > On Fri, 28 Jul 2017 09:55:29 -0700
> > > > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > > >   
> > > > > > > On Fri, Jul 28, 2017 at 02:24:03PM +0100, Jonathan Cameron wrote: 
> > > > > > >  
> > > > > > > > On Fri, 28 Jul 2017 08:44:11 +0100
> > > > > > > > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> > > > > > > 
> > > > > > > [ . . . ]
> > > > > > >   
> > > > > > > > > Ok.  Some info.  I disabled a few drivers (usb and SAS) in the interest of having
> > > > > > > > > fewer timer events.  Issue became much easier to trigger (on some runs before
> > > > > > > > > I could get tracing up and running).
> > > > > > > > >
> > > > > > > > > So logs are large enough that pastebin doesn't like them - please shout if
> > > > > > > > > another timer period is of interest.
> > > > > > > > 
> > > > > > > > https://pastebin.com/iUZDfQGM for the timer trace.
> > > > > > > > https://pastebin.com/3w1F7amH for dmesg.  
> > > > > > > > 
> > > > > > > > The relevant timeout on the RCU stall detector was 8 seconds.  
> > > > > > > > Event is
> > > > > > > > detected around 835.
> > > > > > > > 
> > > > > > > > It's a lot of logs, so I haven't identified a smoking gun yet 
> > > > > > > > but there
> > > > > > > > may well be one in there.
> > > > > > > 
> > > > > > > The dmesg says:
> > > > > > > 
> > > > > > > rcu_preempt kthread starved for 2508 jiffies! g112 c111 f0x0 
> > > > > > > RCU_GP_WAIT_FQS(3) ->state=0x1
> > > > > > > 
> > > > > > > So I look for "rcu_preempt" timer events and find these:
> > > > > > > 
> > > > > > > rcu_preempt-9 [019]    827.579114: timer_init: 
> > > > > > > timer=8017d5fc7da0
> > > > > > > rcu_preempt-9 [019] d..1   827.579115: timer_start: 
> > > > > > > timer=8017d5fc7da0 function=process_timeout 
> > > > > > > 
> > > > > > > Next look for "8017d5fc7da0" and I don't find anything else.  
> > > > > > > 
> > > > > > It does show up off the bottom of what would fit in pastebin...
> > > > > > 
> > > > > >  rcu_preempt-9 [001] d..1   837.681077: timer_cancel: 
> > > > > > timer=8017d5fc7da0
> > > > > >  rcu_preempt-9 [001]    837.681086: timer_init: 
> > > > > > timer=8017d5fc7da0
> > > > > >  rcu_preempt-9 [001] d..1   837.681087: timer_start: 
> > > > > > timer=8017d5fc7da0 function=process_timeout expires=4295101298 
> > > > > > [timeout=1] cpu=1 idx=0 flags=  
> > > > > 
> > > > > Odd.  I would expect an expiration...  And ten seconds is way longer
> > > > > than the requested one jiffy!
> > > > > 
> > > > > > > The timeout was one jiffy, and more than a second later, no 
> > > > > > > expiration.
> > > > > > > Is it possible that this event was lost?  I am not seeing any 
> > > > > > > sign of
> > > > > > > > this in the trace.
> > > > > > > 
> > > > >

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-01 Thread Jonathan Cameron

Sorry - accidental send.  No content!

Jonathan




Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-01 Thread Jonathan Cameron
On Mon, 31 Jul 2017 12:09:08 +0100
Jonathan Cameron <jonathan.came...@huawei.com> wrote:

> On Wed, 26 Jul 2017 16:15:05 -0700
> "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> 
> > On Wed, Jul 26, 2017 at 03:45:40PM -0700, David Miller wrote:  
> > > From: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
> > > Date: Wed, 26 Jul 2017 15:36:58 -0700
> > > 
> > > > And without CONFIG_SOFTLOCKUP_DETECTOR, I see five runs of 24 with RCU
> > > > CPU stall warnings.  So it seems likely that CONFIG_SOFTLOCKUP_DETECTOR
> > > > really is having an effect.
> > > 
> > > Thanks for all of the info Paul, I'll digest this and scan over the
> > > code myself.
> > > 
> > > Just out of curiosity, what x86 idle method is your machine using?
> > > The mwait one or the one which simply uses 'halt'?  The mwait variant
> > > might mask this bug, and halt would be a lot closer to how sparc64 and
> > > Jonathan's system operates.
> > 
> > My kernel builds with CONFIG_INTEL_IDLE=n, which I believe means that
> > I am not using the mwait one.  Here is a grep for IDLE in my .config:
> > 
> > CONFIG_NO_HZ_IDLE=y
> > CONFIG_GENERIC_SMP_IDLE_THREAD=y
> > # CONFIG_IDLE_PAGE_TRACKING is not set
> > CONFIG_ACPI_PROCESSOR_IDLE=y
> > CONFIG_CPU_IDLE=y
> > # CONFIG_CPU_IDLE_GOV_LADDER is not set
> > CONFIG_CPU_IDLE_GOV_MENU=y
> > # CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
> > # CONFIG_INTEL_IDLE is not set
> >   
> > > On sparc64 the cpu yield we do in the idle loop sleeps the cpu.  Its
> > > local TICK register keeps advancing, and the local timer therefore
> > > will still trigger.  Also, any externally generated interrupts
> > > (including cross calls) will wake up the cpu as well.
> > > 
> > > The tick-sched code is really tricky wrt. NO_HZ even in the NO_HZ_IDLE
> > > case.  One of my running theories is that we miss scheduling a tick
> > > due to a race.  That would be consistent with the behavior we see
> > > in the RCU dumps, I think.
> > 
> > But wouldn't you have to miss a -lot- of ticks to get an RCU CPU stall
> > warning?  By default, your grace period needs to extend for more than
> > 21 seconds (more than one-third of a -minute-) to get one.  Or do
> > you mean that the ticks get shut off now and forever, as opposed to
> > just losing one of them?
> >   
> > > Anyways, just a theory, and that's why I keep mentioning that commit
> > > about the revert of the revert (specifically
> > > 411fe24e6b7c283c3a1911450cdba6dd3aaea56e).
> > > 
> > > :-)
> > 
> > I am running an overnight test in preparation for attempting to push
> > some fixes for regressions into 4.12, but will try reverting this
> > and enabling CONFIG_HZ_PERIODIC tomorrow.
> > 
> > Jonathan, might the commit that Dave points out above be what reduces
> > the probability of occurrence as you test older releases?  
> I just got around to trying this out of curiosity.  Superficially it did
> appear to make the issue harder to hit (it took over 30 minutes),
> but the issue otherwise looks much the same with or without that patch.
> 
> Just out of curiosity, next thing on my list is to disable hrtimers entirely
> and see what happens.
> 
> Jonathan
> > 
> > Thanx, Paul
> >   
> 
> ___
> linuxarm mailing list
> linux...@huawei.com
> http://rnd-openeuler.huawei.com/mailman/listinfo/linuxarm



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-31 Thread Jonathan Cameron
On Mon, 31 Jul 2017 08:04:11 -0700
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> On Mon, Jul 31, 2017 at 12:08:47PM +0100, Jonathan Cameron wrote:
> > On Fri, 28 Jul 2017 12:03:50 -0700
> > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> >   
> > > On Fri, Jul 28, 2017 at 06:27:05PM +0100, Jonathan Cameron wrote:  
> > > > On Fri, 28 Jul 2017 09:55:29 -0700
> > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > On Fri, Jul 28, 2017 at 02:24:03PM +0100, Jonathan Cameron wrote:
> > > > > > On Fri, 28 Jul 2017 08:44:11 +0100
> > > > > > Jonathan Cameron <jonathan.came...@huawei.com> wrote:  
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > Ok.  Some info.  I disabled a few drivers (usb and SAS) in the interest of having
> > > > > > > fewer timer events.  Issue became much easier to trigger (on some runs before
> > > > > > > I could get tracing up and running).
> > > > > > >
> > > > > > > So logs are large enough that pastebin doesn't like them - please shout if
> > > > > > > another timer period is of interest.
> > > > > > 
> > > > > > https://pastebin.com/iUZDfQGM for the timer trace.
> > > > > > https://pastebin.com/3w1F7amH for dmesg.  
> > > > > > 
> > > > > > The relevant timeout on the RCU stall detector was 8 seconds.  
> > > > > > Event is
> > > > > > detected around 835.
> > > > > > 
> > > > > > It's a lot of logs, so I haven't identified a smoking gun yet but 
> > > > > > there
> > > > > > may well be one in there.  
> > > > > 
> > > > > The dmesg says:
> > > > > 
> > > > > rcu_preempt kthread starved for 2508 jiffies! g112 c111 f0x0 
> > > > > RCU_GP_WAIT_FQS(3) ->state=0x1
> > > > > 
> > > > > So I look for "rcu_preempt" timer events and find these:
> > > > > 
> > > > > rcu_preempt-9 [019]    827.579114: timer_init: 
> > > > > timer=8017d5fc7da0
> > > > > rcu_preempt-9 [019] d..1   827.579115: timer_start: 
> > > > > timer=8017d5fc7da0 function=process_timeout 
> > > > > 
> > > > > Next look for "8017d5fc7da0" and I don't find anything else.
> > > > It does show up off the bottom of what would fit in pastebin...
> > > > 
> > > >  rcu_preempt-9 [001] d..1   837.681077: timer_cancel: 
> > > > timer=8017d5fc7da0
> > > >  rcu_preempt-9 [001]    837.681086: timer_init: 
> > > > timer=8017d5fc7da0
> > > >  rcu_preempt-9 [001] d..1   837.681087: timer_start: 
> > > > timer=8017d5fc7da0 function=process_timeout expires=4295101298 
> > > > [timeout=1] cpu=1 idx=0 flags=
> > > 
> > > Odd.  I would expect an expiration...  And ten seconds is way longer
> > > than the requested one jiffy!
> > >   
> > > > > The timeout was one jiffy, and more than a second later, no 
> > > > > expiration.
> > > > > Is it possible that this event was lost?  I am not seeing any sign of
> > > > > > this in the trace.
> > > > > 
> > > > > I don't see any sign of CPU hotplug (and I test with lots of that in
> > > > > any case).
> > > > > 
> > > > > The last time we saw something like this it was a timer HW/driver 
> > > > > problem,
> > > > > but it is a bit hard to imagine such a problem affecting both ARM64
> > > > > and SPARC.  ;-)
> > > > Could be different issues, both of which were hidden by that lockup 
> > > > detector.
> > > > 
> > > > There is an errata work around for the timers on this particular board.
> > > > I'm only vaguely aware of it, so may be unconnected.
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/clocksource/arm_arch_timer.c?h=v4.13-rc2&id=bb42ca47401010fc02901b5e8f79e40a26f208cb
> > > > 
> > > > Seems unlikely though! + we've not yet seen it on the other chips that
> > > > the errata affects (not that that means much).
> > > 
> > > If you can reproduce quickly, might be worth trying anyway...
> > > 
> > >   Thanx, Paul  
> > Errata fix is running already and was for all those tests.  
> 
> I was afraid of that...  ;-)
It's a pretty rare erratum, it seems.  I've not actually managed to catch
one yet.
> 
> > I'll have a dig into the timers today and see where I get to.  
> 
> Look forward to seeing what you find!
Nothing obvious turning up, other than that we don't seem to have the
issue when we aren't running hrtimers.

On the plus side, I just got a report that it is affecting our d03
boards, which is good on the basis that I couldn't tell what the
difference could be wrt this issue!

It indeed looks like we are consistently missing a timer before
the rcu splat occurs.

J
> 
>   Thanx, Paul
> 
> > Jonathan  
> > >   
> > > > Jonathan
> > > > 
> > > > > 
> > > > > Thomas, any debugging suggestions?
> > > > > 
> > > > >   Thanx, Paul
> > > > > 
> > > > 
> > >   
> >   
> 
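
For readers following the trace excerpts quoted above, here is a minimal
sketch (not part of the original thread) of how the gap between a
timer_start and the next event for the same timer could be measured.  The
sample lines are copied from the message with the mail wrapping undone;
the regular expression and the helper names timer_events() and
start_to_next_gap() are assumptions made for illustration, not an
existing tool.

```python
import re

# Trace excerpts copied from the thread, with wrapped lines rejoined.
TRACE = """\
rcu_preempt-9 [019]    827.579114: timer_init: timer=8017d5fc7da0
rcu_preempt-9 [019] d..1   827.579115: timer_start: timer=8017d5fc7da0 function=process_timeout
rcu_preempt-9 [001] d..1   837.681077: timer_cancel: timer=8017d5fc7da0
rcu_preempt-9 [001]    837.681086: timer_init: timer=8017d5fc7da0
"""

# Hypothetical pattern: timestamp, event name, timer address.
EVENT = re.compile(
    r"(?P<ts>\d+\.\d+): (?P<event>timer_\w+): timer=(?P<timer>[0-9a-f]+)"
)


def timer_events(trace, timer):
    """Return (timestamp, event) pairs for one timer, in trace order."""
    out = []
    for line in trace.splitlines():
        m = EVENT.search(line)
        if m and m.group("timer") == timer:
            out.append((float(m.group("ts")), m.group("event")))
    return out


def start_to_next_gap(events):
    """Seconds from the first timer_start to the next event for that timer."""
    for i, (ts, event) in enumerate(events):
        if event == "timer_start" and i + 1 < len(events):
            return events[i + 1][0] - ts
    return None


events = timer_events(TRACE, "8017d5fc7da0")
gap = start_to_next_gap(events)
print(f"{gap:.6f} s between timer_start and the next event")
```

On these four lines the gap comes out at roughly ten seconds, which is
the "way longer than the requested one jiffy" delay discussed above; no
expiry event appears in between.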



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-31 Thread Jonathan Cameron
On Wed, 26 Jul 2017 16:15:05 -0700
"Paul E. McKenney"  wrote:

> On Wed, Jul 26, 2017 at 03:45:40PM -0700, David Miller wrote:
> > From: "Paul E. McKenney" 
> > Date: Wed, 26 Jul 2017 15:36:58 -0700
> >   
> > > And without CONFIG_SOFTLOCKUP_DETECTOR, I see five runs of 24 with RCU
> > > CPU stall warnings.  So it seems likely that CONFIG_SOFTLOCKUP_DETECTOR
> > > really is having an effect.  
> > 
> > Thanks for all of the info Paul, I'll digest this and scan over the
> > code myself.
> > 
> > Just out of curiosity, what x86 idle method is your machine using?
> > The mwait one or the one which simply uses 'halt'?  The mwait variant
> > might mask this bug, and halt would be a lot closer to how sparc64 and
> > Jonathan's system operates.  
> 
> My kernel builds with CONFIG_INTEL_IDLE=n, which I believe means that
> I am not using the mwait one.  Here is a grep for IDLE in my .config:
> 
>   CONFIG_NO_HZ_IDLE=y
>   CONFIG_GENERIC_SMP_IDLE_THREAD=y
>   # CONFIG_IDLE_PAGE_TRACKING is not set
>   CONFIG_ACPI_PROCESSOR_IDLE=y
>   CONFIG_CPU_IDLE=y
>   # CONFIG_CPU_IDLE_GOV_LADDER is not set
>   CONFIG_CPU_IDLE_GOV_MENU=y
>   # CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
>   # CONFIG_INTEL_IDLE is not set
> 
> > On sparc64 the cpu yield we do in the idle loop sleeps the cpu.  Its
> > local TICK register keeps advancing, and the local timer therefore
> > will still trigger.  Also, any externally generated interrupts
> > (including cross calls) will wake up the cpu as well.
> > 
> > The tick-sched code is really tricky wrt. NO_HZ even in the NO_HZ_IDLE
> > case.  One of my running theories is that we miss scheduling a tick
> > due to a race.  That would be consistent with the behavior we see
> > in the RCU dumps, I think.  
> 
> But wouldn't you have to miss a -lot- of ticks to get an RCU CPU stall
> warning?  By default, your grace period needs to extend for more than
> 21 seconds (more than one-third of a -minute-) to get one.  Or do
> you mean that the ticks get shut off now and forever, as opposed to
> just losing one of them?
> 
> > Anyways, just a theory, and that's why I keep mentioning that commit
> > about the revert of the revert (specifically
> > 411fe24e6b7c283c3a1911450cdba6dd3aaea56e).
> > 
> > :-)  
> 
> I am running an overnight test in preparation for attempting to push
> some fixes for regressions into 4.12, but will try reverting this
> and enabling CONFIG_HZ_PERIODIC tomorrow.
> 
> Jonathan, might the commit that Dave points out above be what reduces
> the probability of occurrence as you test older releases?
I just got around to trying this out of curiosity.  Superficially it did
appear to make the issue harder to hit (it took over 30 minutes), but the
issue otherwise looks much the same with or without that patch.

Just out of curiosity, next thing on my list is to disable hrtimers entirely
and see what happens.

Jonathan
> 
>   Thanx, Paul
> 



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-31 Thread Jonathan Cameron
On Fri, 28 Jul 2017 12:03:50 -0700
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> On Fri, Jul 28, 2017 at 06:27:05PM +0100, Jonathan Cameron wrote:
> > On Fri, 28 Jul 2017 09:55:29 -0700
> > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> >   
> > > On Fri, Jul 28, 2017 at 02:24:03PM +0100, Jonathan Cameron wrote:  
> > > > On Fri, 28 Jul 2017 08:44:11 +0100
> > > > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> > > 
> > > [ . . . ]
> > >   
> > > > > Ok.  Some info.  I disabled a few drivers (usb and SAS) in the
> > > > > interest of having fewer timer events.  Issue became much easier to
> > > > > trigger (on some runs before I could get tracing up and running)
> > > > > 
> > > > > So logs are large enough that pastebin doesn't like them - please
> > > > > shout if another timer period is of interest.
> > > > 
> > > > https://pastebin.com/iUZDfQGM for the timer trace.
> > > > https://pastebin.com/3w1F7amH for dmesg.  
> > > > 
> > > > The relevant timeout on the RCU stall detector was 8 seconds.  Event is
> > > > detected around 835.
> > > > 
> > > > It's a lot of logs, so I haven't identified a smoking gun yet but there
> > > > may well be one in there.
> > > 
> > > The dmesg says:
> > > 
> > > rcu_preempt kthread starved for 2508 jiffies! g112 c111 f0x0 
> > > RCU_GP_WAIT_FQS(3) ->state=0x1
> > > 
> > > So I look for "rcu_preempt" timer events and find these:
> > > 
> > > rcu_preempt-9 [019]    827.579114: timer_init: 
> > > timer=8017d5fc7da0
> > > rcu_preempt-9 [019] d..1   827.579115: timer_start: 
> > > timer=8017d5fc7da0 function=process_timeout 
> > > 
> > > Next look for "8017d5fc7da0" and I don't find anything else.  
> > It does show up off the bottom of what would fit in pastebin...
> > 
> >  rcu_preempt-9 [001] d..1   837.681077: timer_cancel: 
> > timer=8017d5fc7da0
> >  rcu_preempt-9 [001]    837.681086: timer_init: 
> > timer=8017d5fc7da0
> >  rcu_preempt-9 [001] d..1   837.681087: timer_start: 
> > timer=8017d5fc7da0 function=process_timeout expires=4295101298 
> > [timeout=1] cpu=1 idx=0 flags=  
> 
> Odd.  I would expect an expiration...  And ten seconds is way longer
> than the requested one jiffy!
> 
> > > The timeout was one jiffy, and more than a second later, no expiration.
> > > Is it possible that this event was lost?  I am not seeing any sign of
> > > > this in the trace.
> > > 
> > > I don't see any sign of CPU hotplug (and I test with lots of that in
> > > any case).
> > > 
> > > The last time we saw something like this it was a timer HW/driver problem,
> > > but it is a bit hard to imagine such a problem affecting both ARM64
> > > and SPARC.  ;-)  
> > Could be different issues, both of which were hidden by that lockup 
> > detector.
> > 
> > There is an errata work around for the timers on this particular board.
> > I'm only vaguely aware of it, so may be unconnected.
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/clocksource/arm_arch_timer.c?h=v4.13-rc2=bb42ca47401010fc02901b5e8f79e40a26f208cb
> > 
> > Seems unlikely though! + we've not yet seen it on the other chips that
> > > errata affects (not that that means much).
> 
> If you can reproduce quickly, might be worth trying anyway...
> 
>   Thanx, Paul
Errata fix is running already and was for all those tests.

I'll have a dig into the timers today and see where I get to.

Jonathan
> 
> > Jonathan
> >   
> > > 
> > > Thomas, any debugging suggestions?
> > > 
> > >   Thanx, Paul
> > >   
> >   
> 



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-28 Thread Jonathan Cameron
On Fri, 28 Jul 2017 09:55:29 -0700
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> On Fri, Jul 28, 2017 at 02:24:03PM +0100, Jonathan Cameron wrote:
> > On Fri, 28 Jul 2017 08:44:11 +0100
> > Jonathan Cameron <jonathan.came...@huawei.com> wrote:  
> 
> [ . . . ]
> 
> > > Ok.  Some info.  I disabled a few drivers (usb and SAS) in the interest
> > > of having fewer timer events.  Issue became much easier to trigger (on
> > > some runs before I could get tracing up and running)
> > > 
> > > So logs are large enough that pastebin doesn't like them - please shout
> > > if another timer period is of interest.
> > 
> > https://pastebin.com/iUZDfQGM for the timer trace.
> > https://pastebin.com/3w1F7amH for dmesg.  
> > 
> > The relevant timeout on the RCU stall detector was 8 seconds.  Event is
> > detected around 835.
> > 
> > It's a lot of logs, so I haven't identified a smoking gun yet but there
> > may well be one in there.  
> 
> The dmesg says:
> 
> rcu_preempt kthread starved for 2508 jiffies! g112 c111 f0x0 
> RCU_GP_WAIT_FQS(3) ->state=0x1
> 
> So I look for "rcu_preempt" timer events and find these:
> 
> rcu_preempt-9 [019]    827.579114: timer_init: timer=8017d5fc7da0
> rcu_preempt-9 [019] d..1   827.579115: timer_start: 
> timer=8017d5fc7da0 function=process_timeout 
> 
> Next look for "8017d5fc7da0" and I don't find anything else.
It does show up off the bottom of what would fit in pastebin...

 rcu_preempt-9 [001] d..1   837.681077: timer_cancel: 
timer=8017d5fc7da0
 rcu_preempt-9 [001]    837.681086: timer_init: 
timer=8017d5fc7da0
 rcu_preempt-9 [001] d..1   837.681087: timer_start: 
timer=8017d5fc7da0 function=process_timeout expires=4295101298 [timeout=1] 
cpu=1 idx=0 flags=
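
As a rough cross-check of that gap, here is a small C sketch that turns the
two ftrace timestamps quoted in this mail (timer_start at 827.579115 and the
next event for the same timer at 837.681077) into an elapsed time.  The helper
names are made up for illustration, and HZ=250 is only an assumption, since
the tick rate is not stated anywhere in this thread.

```c
#include <stdint.h>

/* Convert an ftrace "seconds.microseconds" timestamp to microseconds. */
uint64_t ts_to_us(uint64_t sec, uint32_t usec)
{
	return sec * 1000000ULL + usec;
}

/*
 * Whole jiffies that fit into `us` microseconds at tick rate `hz`.
 * `hz` is supplied by the caller; 250 is an assumed value here.
 */
uint64_t us_to_jiffies(uint64_t us, unsigned int hz)
{
	return us * hz / 1000000ULL;
}

/* Gap between timer_start (827.579115) and timer_cancel (837.681077). */
uint64_t observed_gap_us(void)
{
	return ts_to_us(837, 681077) - ts_to_us(827, 579115);
}
```

With those inputs the gap is 10101962 us, i.e. roughly 10.1 seconds for a
timer that was requested with a one jiffy timeout.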

> The timeout was one jiffy, and more than a second later, no expiration.
> Is it possible that this event was lost?  I am not seeing any sign of
> this in the trace.
> 
> I don't see any sign of CPU hotplug (and I test with lots of that in
> any case).
> 
> The last time we saw something like this it was a timer HW/driver problem,
> but it is a bit hard to imagine such a problem affecting both ARM64
> and SPARC.  ;-)
Could be different issues, both of which were hidden by that lockup detector.

There is an errata work around for the timers on this particular board.
I'm only vaguely aware of it, so may be unconnected.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/clocksource/arm_arch_timer.c?h=v4.13-rc2=bb42ca47401010fc02901b5e8f79e40a26f208cb

Seems unlikely though! + we've not yet seen it on the other chips that
errata affects (not that that means much).

Jonathan

> 
> Thomas, any debugging suggestions?
> 
>   Thanx, Paul
> 



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-28 Thread Jonathan Cameron
On Fri, 28 Jul 2017 08:44:11 +0100
Jonathan Cameron <jonathan.came...@huawei.com> wrote:

> On Thu, 27 Jul 2017 09:52:45 -0700
> "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Jul 27, 2017 at 05:39:23PM +0100, Jonathan Cameron wrote:  
> > > On Thu, 27 Jul 2017 14:49:03 +0100
> > > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> > > 
> > > > On Thu, 27 Jul 2017 05:49:13 -0700
> > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > On Thu, Jul 27, 2017 at 02:34:00PM +1000, Nicholas Piggin wrote:  
> > > > > > On Wed, 26 Jul 2017 18:42:14 -0700
> > > > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > > > 
> > > > > > > On Wed, Jul 26, 2017 at 04:22:00PM -0700, David Miller wrote: 
> > > > > > >
> > > > > > 
> > > > > > > > Indeed, that really wouldn't explain how we end up with a RCU 
> > > > > > > > stall
> > > > > > > > dump listing almost all of the cpus as having missed a grace 
> > > > > > > > period.  
> > > > > > > 
> > > > > > > I have seen stranger things, but admittedly not often.
> > > > > > 
> > > > > > So the backtraces show the RCU gp thread in schedule_timeout.
> > > > > > 
> > > > > > > Are you sure that its timeout has expired and it's not being
> > > > > > scheduled,
> > > > > > or could it be a bad (large) timeout (looks unlikely) or that it's 
> > > > > > being
> > > > > > scheduled but not correctly noting gps on other CPUs?
> > > > > > 
> > > > > > It's not in R state, so if it's not being scheduled at all, then 
> > > > > > it's
> > > > > > because the timer has not fired:
> > > > > 
> > > > > Good point, Nick!
> > > > > 
> > > > > Jonathan, could you please reproduce collecting timer event tracing?  
> > > > > 
> > > > I'm a little new to tracing (only started playing with it last week)
> > > > so fingers crossed I've set it up right.  No splats yet.  Was getting
> > > > splats on reading out the trace when running with the RCU stall timer
> > > > set to 4 so have increased that back to the default and am rerunning.
> > > > 
> > > > This may take a while.  Correct me if I've gotten this wrong to save 
> > > > time
> > > > 
> > > > echo "timer:*" > /sys/kernel/debug/tracing/set_event
> > > > 
> > > > when it dumps, just send you the relevant part of what is in
> > > > /sys/kernel/debug/tracing/trace?
> > > 
> > > > Interestingly, the only thing that can make it trip for me with
> > > > tracing on is peeking in the tracing buffers.  Not sure whether this
> > > > is a valid case or not.
> > > 
> > > Anyhow all timer activity seems to stop around the area of interest.
> > > 
> > > 

Firstly sorry to those who got the rather silly length email a minute ago.
It bounced on the list (fair enough - I was just being lazy on getting
data past our firewalls).

Ok.  Some info.  I disabled a few drivers (usb and SAS) in the interest of having
fewer timer events.  Issue became much easier to trigger (on some runs before
I could get tracing up and running)

So logs are large enough that pastebin doesn't like them - please shout if
another timer period is of interest.

https://pastebin.com/iUZDfQGM for the timer trace.
https://pastebin.com/3w1F7amH for dmesg.  

The relevant timeout on the RCU stall detector was 8 seconds.  Event is
detected around 835.

It's a lot of logs, so I haven't identified a smoking gun yet but there
may well be one in there.

Jonathan


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-28 Thread Jonathan Cameron
On Fri, 28 Jul 2017 20:54:16 +0800
Boqun Feng  wrote:

> Hi Jonathan,
> 
> FWIW, there is a wakeup-missing issue in swake_up() and swake_up_all():
> 
>   https://marc.info/?l=linux-kernel=149750022019663
> 
> and RCU began to use swait/wake last year, so I thought this could be
> relevant.
> 
> Could you try the following patch and see if it works? Thanks.
Sadly seems to be a no...  Just splatted before I could even get
the tracing set up. Back to staring at logs and hoping something
will stand out!

Jonathan
> 
> Regards,
> Boqun
> 
> -->8  
> Subject: [PATCH] swait: Remove the lockless swait_active() check in
>  swake_up*()
> 
> Steven Rostedt reported a potential race in RCU core because of
> swake_up():
> 
>  CPU0                               CPU1
> 
>                                     __call_rcu_core() {
> 
>                                      spin_lock(rnp_root)
>                                      need_wake = __rcu_start_gp() {
>                                       rcu_start_gp_advanced() {
>                                        gp_flags = FLAG_INIT
>                                       }
>                                      }
> 
>  rcu_gp_kthread() {
>    swait_event_interruptible(wq,
>         gp_flags & FLAG_INIT) {
>    spin_lock(q->lock)
> 
>                                     *fetch wq->task_list here! *
> 
>    list_add(wq->task_list, q->task_list)
>    spin_unlock(q->lock);
> 
>    *fetch old value of gp_flags here *
> 
>                                      spin_unlock(rnp_root)
> 
>                                      rcu_gp_kthread_wake() {
>                                       swake_up(wq) {
>                                        swait_active(wq) {
>                                         list_empty(wq->task_list)
> 
>                                        } * return false *
> 
>   if (condition) * false *
>     schedule();
> 
> In this case, a wakeup is missed, which could cause the rcu_gp_kthread
> to wait for a long time.
> 
> The reason for this is that we do a lockless swait_active() check in
> swake_up(). To fix this, we can either 1) add a smp_mb() in swake_up()
> before swait_active() to provide the proper order or 2) simply remove
> the swait_active() in swake_up().
> 
> The solution 2 not only fixes this problem but also keeps the swait and
> wait API as close as possible, as wake_up() doesn't provide a full
> barrier and doesn't do a lockless check of the wait queue either.
> Moreover, there are users already using swait_active() to do their quick
> checks for the wait queues, so it makes less sense that swake_up() and
> swake_up_all() do this on their own.
> 
> This patch then removes the lockless swait_active() check in swake_up()
> and swake_up_all().
> 
> Reported-by: Steven Rostedt 
> Signed-off-by: Boqun Feng 
> ---
>  kernel/sched/swait.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> index 3d5610dcce11..2227e183e202 100644
> --- a/kernel/sched/swait.c
> +++ b/kernel/sched/swait.c
> @@ -33,9 +33,6 @@ void swake_up(struct swait_queue_head *q)
>  {
>  	unsigned long flags;
>  
> -	if (!swait_active(q))
> -		return;
> -
>  	raw_spin_lock_irqsave(&q->lock, flags);
>  	swake_up_locked(q);
>  	raw_spin_unlock_irqrestore(&q->lock, flags);
> @@ -51,9 +48,6 @@ void swake_up_all(struct swait_queue_head *q)
>  	struct swait_queue *curr;
>  	LIST_HEAD(tmp);
>  
> -	if (!swait_active(q))
> -		return;
> -
>  	raw_spin_lock_irq(&q->lock);
>  	list_splice_init(&q->task_list, &tmp);
>  	while (!list_empty(&tmp)) {
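
To illustrate option 2 from the commit message quoted above, here is a
user-space toy in C.  It is not kernel code: struct toy_waitq and the toy_*
helpers are hypothetical names, a C11 atomic_flag spinlock stands in for
q->lock, and a plain counter stands in for task_list.  The only point is that
the waker looks at the queue while holding the lock, so there is no lockless
check that could observe a stale "empty" queue.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_waitq {
	atomic_flag lock;	/* tiny spinlock standing in for q->lock */
	int nr_waiters;		/* stands in for the kernel's task_list */
};

void toy_waitq_init(struct toy_waitq *q)
{
	atomic_flag_clear(&q->lock);
	q->nr_waiters = 0;
}

void toy_lock(struct toy_waitq *q)
{
	/* Spin until the flag was previously clear (acquire ordering). */
	while (atomic_flag_test_and_set_explicit(&q->lock,
						 memory_order_acquire))
		;
}

void toy_unlock(struct toy_waitq *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Waiter side: enqueue under the lock before re-checking the condition. */
void toy_prepare_to_wait(struct toy_waitq *q)
{
	toy_lock(q);
	q->nr_waiters++;
	toy_unlock(q);
}

/*
 * Waker side: no lockless "is anyone waiting?" shortcut.  The queue is
 * only inspected with the lock held.  Returns true if a waiter was
 * dequeued.
 */
bool toy_wake_up(struct toy_waitq *q)
{
	bool woke = false;

	toy_lock(q);
	if (q->nr_waiters > 0) {
		q->nr_waiters--;
		woke = true;
	}
	toy_unlock(q);
	return woke;
}
```

The cost is that every wake takes the lock even when nobody is waiting, which
is the trade-off the commit message accepts in exchange for not needing a
barrier before a lockless check.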



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-28 Thread Jonathan Cameron
On Thu, 27 Jul 2017 09:52:45 -0700
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> On Thu, Jul 27, 2017 at 05:39:23PM +0100, Jonathan Cameron wrote:
> > On Thu, 27 Jul 2017 14:49:03 +0100
> > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> >   
> > > On Thu, 27 Jul 2017 05:49:13 -0700
> > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > >   
> > > > On Thu, Jul 27, 2017 at 02:34:00PM +1000, Nicholas Piggin wrote:
> > > > > On Wed, 26 Jul 2017 18:42:14 -0700
> > > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > >   
> > > > > > On Wed, Jul 26, 2017 at 04:22:00PM -0700, David Miller wrote:  
> > > > >   
> > > > > > > Indeed, that really wouldn't explain how we end up with a RCU 
> > > > > > > stall
> > > > > > > dump listing almost all of the cpus as having missed a grace 
> > > > > > > period.
> > > > > > 
> > > > > > I have seen stranger things, but admittedly not often.  
> > > > > 
> > > > > So the backtraces show the RCU gp thread in schedule_timeout.
> > > > > 
> > > > > > Are you sure that its timeout has expired and it's not being
> > > > > scheduled,
> > > > > or could it be a bad (large) timeout (looks unlikely) or that it's 
> > > > > being
> > > > > scheduled but not correctly noting gps on other CPUs?
> > > > > 
> > > > > It's not in R state, so if it's not being scheduled at all, then it's
> > > > > because the timer has not fired:  
> > > > 
> > > > Good point, Nick!
> > > > 
> > > > Jonathan, could you please reproduce collecting timer event tracing?
> > > I'm a little new to tracing (only started playing with it last week)
> > > so fingers crossed I've set it up right.  No splats yet.  Was getting
> > > splats on reading out the trace when running with the RCU stall timer
> > > set to 4 so have increased that back to the default and am rerunning.
> > > 
> > > This may take a while.  Correct me if I've gotten this wrong to save time
> > > 
> > > echo "timer:*" > /sys/kernel/debug/tracing/set_event
> > > 
> > > when it dumps, just send you the relevant part of what is in
> > > /sys/kernel/debug/tracing/trace?  
> > 
> > > Interestingly, the only thing that can make it trip for me with tracing
> > > on is peeking in the tracing buffers.  Not sure whether this is a valid
> > > case or not.
> > 
> > Anyhow all timer activity seems to stop around the area of interest.
> > 
> > 
> > [ 9442.413624] INFO: rcu_sched detected stalls on CPUs/tasks:
> > [ 9442.419107]  1-...: (1 GPs behind) idle=844/0/0 softirq=27747/27755 
> > fqs=0 last_accelerate: dd6a/de80, nonlazy_posted: 0, L.
> > [ 9442.430224]  3-...: (2 GPs behind) idle=8f8/0/0 softirq=32197/32198 
> > fqs=0 last_accelerate: 29b1/de80, nonlazy_posted: 0, L.
> > [ 9442.441340]  4-...: (7 GPs behind) idle=740/0/0 softirq=22351/22352 
> > fqs=0 last_accelerate: ca88/de80, nonlazy_posted: 0, L.
> > [ 9442.452456]  5-...: (2 GPs behind) idle=9b0/0/0 softirq=21315/21319 
> > fqs=0 last_accelerate: b280/de88, nonlazy_posted: 0, L.
> > [ 9442.463572]  6-...: (2 GPs behind) idle=794/0/0 softirq=19699/19707 
> > fqs=0 last_accelerate: ba62/de88, nonlazy_posted: 0, L.
> > [ 9442.474688]  7-...: (2 GPs behind) idle=ac4/0/0 softirq=22547/22554 
> > fqs=0 last_accelerate: b280/de88, nonlazy_posted: 0, L.
> > [ 9442.485803]  8-...: (9 GPs behind) idle=118/0/0 softirq=281/291 
> > fqs=0 last_accelerate: c3fe/de88, nonlazy_posted: 0, L.
> > [ 9442.496571]  9-...: (9 GPs behind) idle=8fc/0/0 softirq=284/292 
> > fqs=0 last_accelerate: 6030/de88, nonlazy_posted: 0, L.
> > [ 9442.507339]  10-...: (14 GPs behind) idle=f78/0/0 softirq=254/254 
> > fqs=0 last_accelerate: 5487/de88, nonlazy_posted: 0, L.
> > [ 9442.518281]  11-...: (9 GPs behind) idle=c9c/0/0 softirq=301/308 
> > fqs=0 last_accelerate: 3d3e/de99, nonlazy_posted: 0, L.
> > [ 9442.529136]  12-...: (9 GPs behind) idle=4a4/0/0 softirq=735/737 
> > fqs=0 last_accelerate: 6010/de99, nonlazy_posted: 0, L.
> > [ 9442.539992]  13-...: (9 GPs behind) idle=34c/0/0 softirq=1121/1131 
> > fqs=0 last_accelerate: b280/de99, nonlazy_posted: 0, L.
> > [ 9442.551020]  14

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-27 Thread Jonathan Cameron
On Thu, 27 Jul 2017 14:49:03 +0100
Jonathan Cameron <jonathan.came...@huawei.com> wrote:

> On Thu, 27 Jul 2017 05:49:13 -0700
> "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Jul 27, 2017 at 02:34:00PM +1000, Nicholas Piggin wrote:  
> > > On Wed, 26 Jul 2017 18:42:14 -0700
> > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > 
> > > > On Wed, Jul 26, 2017 at 04:22:00PM -0700, David Miller wrote:
> > > 
> > > > > Indeed, that really wouldn't explain how we end up with a RCU stall
> > > > > dump listing almost all of the cpus as having missed a grace period.  
> > > > > 
> > > > 
> > > > I have seen stranger things, but admittedly not often.
> > > 
> > > So the backtraces show the RCU gp thread in schedule_timeout.
> > > 
> > > > Are you sure that its timeout has expired and it's not being scheduled,
> > > or could it be a bad (large) timeout (looks unlikely) or that it's being
> > > scheduled but not correctly noting gps on other CPUs?
> > > 
> > > It's not in R state, so if it's not being scheduled at all, then it's
> > > because the timer has not fired:
> > 
> > Good point, Nick!
> > 
> > Jonathan, could you please reproduce collecting timer event tracing?  
> I'm a little new to tracing (only started playing with it last week)
> so fingers crossed I've set it up right.  No splats yet.  Was getting
> splats on reading out the trace when running with the RCU stall timer
> set to 4 so have increased that back to the default and am rerunning.
> 
> This may take a while.  Correct me if I've gotten this wrong to save time
> 
> echo "timer:*" > /sys/kernel/debug/tracing/set_event
> 
> when it dumps, just send you the relevant part of what is in
> /sys/kernel/debug/tracing/trace?

Interestingly, the only thing that can make it trip for me with tracing on
is peeking in the tracing buffers.  Not sure whether this is a valid case
or not.

Anyhow all timer activity seems to stop around the area of interest.


[ 9442.413624] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 9442.419107]  1-...: (1 GPs behind) idle=844/0/0 softirq=27747/27755 fqs=0 
last_accelerate: dd6a/de80, nonlazy_posted: 0, L.
[ 9442.430224]  3-...: (2 GPs behind) idle=8f8/0/0 softirq=32197/32198 fqs=0 
last_accelerate: 29b1/de80, nonlazy_posted: 0, L.
[ 9442.441340]  4-...: (7 GPs behind) idle=740/0/0 softirq=22351/22352 fqs=0 
last_accelerate: ca88/de80, nonlazy_posted: 0, L.
[ 9442.452456]  5-...: (2 GPs behind) idle=9b0/0/0 softirq=21315/21319 fqs=0 
last_accelerate: b280/de88, nonlazy_posted: 0, L.
[ 9442.463572]  6-...: (2 GPs behind) idle=794/0/0 softirq=19699/19707 fqs=0 
last_accelerate: ba62/de88, nonlazy_posted: 0, L.
[ 9442.474688]  7-...: (2 GPs behind) idle=ac4/0/0 softirq=22547/22554 fqs=0 
last_accelerate: b280/de88, nonlazy_posted: 0, L.
[ 9442.485803]  8-...: (9 GPs behind) idle=118/0/0 softirq=281/291 fqs=0 
last_accelerate: c3fe/de88, nonlazy_posted: 0, L.
[ 9442.496571]  9-...: (9 GPs behind) idle=8fc/0/0 softirq=284/292 fqs=0 
last_accelerate: 6030/de88, nonlazy_posted: 0, L.
[ 9442.507339]  10-...: (14 GPs behind) idle=f78/0/0 softirq=254/254 fqs=0 
last_accelerate: 5487/de88, nonlazy_posted: 0, L.
[ 9442.518281]  11-...: (9 GPs behind) idle=c9c/0/0 softirq=301/308 fqs=0 
last_accelerate: 3d3e/de99, nonlazy_posted: 0, L.
[ 9442.529136]  12-...: (9 GPs behind) idle=4a4/0/0 softirq=735/737 fqs=0 
last_accelerate: 6010/de99, nonlazy_posted: 0, L.
[ 9442.539992]  13-...: (9 GPs behind) idle=34c/0/0 softirq=1121/1131 fqs=0 
last_accelerate: b280/de99, nonlazy_posted: 0, L.
[ 9442.551020]  14-...: (9 GPs behind) idle=2f4/0/0 softirq=707/713 fqs=0 
last_accelerate: 6030/de99, nonlazy_posted: 0, L.
[ 9442.561875]  15-...: (2 GPs behind) idle=b30/0/0 softirq=821/976 fqs=0 
last_accelerate: c208/de99, nonlazy_posted: 0, L.
[ 9442.572730]  17-...: (2 GPs behind) idle=5a8/0/0 softirq=1456/1565 fqs=0 
last_accelerate: ca88/de99, nonlazy_posted: 0, L.
[ 9442.583759]  18-...: (2 GPs behind) idle=2e4/0/0 softirq=1923/1936 fqs=0 
last_accelerate: ca88/dea7, nonlazy_posted: 0, L.
[ 9442.594787]  19-...: (2 GPs behind) idle=138/0/0 softirq=1421/1432 fqs=0 
last_accelerate: b280/dea7, nonlazy_posted: 0, L.
[ 9442.605816]  20-...: (50 GPs behind) idle=634/0/0 softirq=217/219 fqs=0 
last_accelerate: c96f/dea7, nonlazy_posted: 0, L.
[ 9442.616758]  21-...: (2 GPs behind) idle=eb8/0/0 softirq=1368/1369 fqs=0 
last_accelerate: b599/deb2, nonlazy_posted: 0, L.
[ 9442.627786]  22-...: (1 GPs behind) idle=aa8/0/0 softirq=229/232 fqs=0 
last_accelerate: c604/deb2, nonlazy_posted: 0, L.
[ 9442.638641]  23-...: (1 GPs behind) idle=488/0/0 softirq=247/248 fqs=0 
last_accelerate: c600/deb2, no

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-27 Thread Jonathan Cameron
On Thu, 27 Jul 2017 05:49:13 -0700
"Paul E. McKenney"  wrote:

> On Thu, Jul 27, 2017 at 02:34:00PM +1000, Nicholas Piggin wrote:
> > On Wed, 26 Jul 2017 18:42:14 -0700
> > "Paul E. McKenney"  wrote:
> >   
> > > On Wed, Jul 26, 2017 at 04:22:00PM -0700, David Miller wrote:  
> >   
> > > > Indeed, that really wouldn't explain how we end up with a RCU stall
> > > > dump listing almost all of the cpus as having missed a grace period.
> > > 
> > > I have seen stranger things, but admittedly not often.  
> > 
> > So the backtraces show the RCU gp thread in schedule_timeout.
> > 
> > > Are you sure that its timeout has expired and it's not being scheduled,
> > or could it be a bad (large) timeout (looks unlikely) or that it's being
> > scheduled but not correctly noting gps on other CPUs?
> > 
> > It's not in R state, so if it's not being scheduled at all, then it's
> > because the timer has not fired:  
> 
> Good point, Nick!
> 
> Jonathan, could you please reproduce collecting timer event tracing?
I'm a little new to tracing (only started playing with it last week)
so fingers crossed I've set it up right.  No splats yet.  Was getting
splats on reading out the trace when running with the RCU stall timer
set to 4 so have increased that back to the default and am rerunning.

This may take a while.  Correct me if I've gotten this wrong to save time

echo "timer:*" > /sys/kernel/debug/tracing/set_event

when it dumps, just send you the relevant part of what is in
/sys/kernel/debug/tracing/trace?
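
For reference, the same setup can be scripted from C.  This is only a sketch:
write_setting() and enable_timer_events() are hypothetical helper names, and
the tracing directory is passed in by the caller (here it would be
/sys/kernel/debug/tracing, and writing there needs root).

```c
#include <stdio.h>

/* Write `value` plus a newline to `path`.  Returns 0 on success, -1 on error. */
int write_setting(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(value, f) >= 0 && fputc('\n', f) != EOF;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Equivalent of: echo "timer:*" > <tracing_dir>/set_event
 * `tracing_dir` is whatever directory the caller's tracefs lives in.
 */
int enable_timer_events(const char *tracing_dir)
{
	char path[256];
	int n = snprintf(path, sizeof(path), "%s/set_event", tracing_dir);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return write_setting(path, "timer:*");
}
```

The captured events are then read back from the trace file in the same
directory, as in the question above.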

Thanks,

Jonathan
> 
>   Thanx, Paul
> 
> > [ 1984.628602] rcu_preempt kthread starved for 5663 jiffies! g1566 c1565 
> > f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> > [ 1984.638153] rcu_preempt S0 9  2 0x
> > [ 1984.643626] Call trace:
> > [ 1984.646059] [] __switch_to+0x90/0xa8
> > [ 1984.651189] [] __schedule+0x19c/0x5d8
> > [ 1984.656400] [] schedule+0x38/0xa0
> > [ 1984.661266] [] schedule_timeout+0x124/0x218
> > [ 1984.667002] [] rcu_gp_kthread+0x4fc/0x748
> > [ 1984.672564] [] kthread+0xfc/0x128
> > [ 1984.677429] [] ret_from_fork+0x10/0x50
> >   
> 



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-27 Thread Jonathan Cameron
On Wed, 26 Jul 2017 18:13:12 +0100
Jonathan Cameron <jonathan.came...@huawei.com> wrote:

> On Wed, 26 Jul 2017 09:54:32 -0700
> David Miller <da...@davemloft.net> wrote:
> 
> > From: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
> > Date: Wed, 26 Jul 2017 08:49:00 -0700
> >   
> > > On Wed, Jul 26, 2017 at 04:33:40PM +0100, Jonathan Cameron wrote:
> > > >> Didn't leave it long enough. Still bad on 4.10-rc7; it just took
> > > >> over an hour to occur.
> > > 
> > > And it is quite possible that SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y
> > > are just greatly reducing the probability of the problem rather than
> > > completely preventing it.
> > > 
> > > Still, hopefully useful information, thank you for the testing!
> 
> Not sure it actually gives us much information, but no issues yet
> with a simple program running on every cpu that wakes up every 3 seconds.
> 
> Will leave it running overnight and report back in the morning.
Perhaps unsurprisingly the above test didn't show any splats.

So it appears a userspace wakeup is enough to stop the issue happening
(or at least make it a lot less likely).
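
The test program itself is not included in this thread; a minimal sketch of
that kind of periodic waker might look like the following.
periodic_wakeups() is a hypothetical name, and pinning one copy to each cpu
is assumed to be done externally (for example with taskset).

```c
#include <errno.h>
#include <time.h>

/*
 * Sleep for `period_sec` seconds, `count` times over, and return the
 * number of wakeups completed.  Each return from nanosleep() is one
 * userspace wakeup on whatever cpu the task is running on.
 */
unsigned int periodic_wakeups(unsigned int period_sec, unsigned int count)
{
	unsigned int done = 0;

	while (done < count) {
		struct timespec ts = { .tv_sec = period_sec, .tv_nsec = 0 };

		/* If a signal cuts the sleep short, sleep the remainder. */
		while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
			;
		done++;
	}
	return done;
}
```

Calling periodic_wakeups(3, n) from one pinned copy per cpu gives each cpu a
wakeup every 3 seconds, which is the pattern described above.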

Jonathan
> 
> > 
> > I guess that invalidates my idea to test reverting recent changes to
> > the tick-sched.c code... :-/
> > 
> > In NO_HZ_IDLE mode, what is really supposed to happen on a completely
> > idle system?
> > 
> > All the cpus enter the idle loop, have no timers programmed, and they
> > all just go to sleep until an external event happens.
> > 
> > What ensures that grace periods get processed in this regime?  
> ___
> linuxarm mailing list
> linux...@huawei.com
> http://rnd-openeuler.huawei.com/mailman/listinfo/linuxarm



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 09:54:32 -0700
David Miller <da...@davemloft.net> wrote:

> From: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
> Date: Wed, 26 Jul 2017 08:49:00 -0700
> 
> > On Wed, Jul 26, 2017 at 04:33:40PM +0100, Jonathan Cameron wrote:  
> >> Didn't leave it long enough. Still bad on 4.10-rc7; it just took over
> >> an hour to occur.
> > 
> > And it is quite possible that SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y
> > are just greatly reducing the probability of the problem rather than
> > completely preventing it.
> > 
> > Still, hopefully useful information, thank you for the testing!  

Not sure it actually gives us much information, but no issues yet
with a simple program running on every cpu that wakes up every 3 seconds.

Will leave it running overnight and report back in the morning.

> 
> I guess that invalidates my idea to test reverting recent changes to
> the tick-sched.c code... :-/
> 
> In NO_HZ_IDLE mode, what is really supposed to happen on a completely
> idle system?
> 
> All the cpus enter the idle loop, have no timers programmed, and they
> all just go to sleep until an external event happens.
> 
> What ensures that grace periods get processed in this regime?


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 15:23:15 +0100
Jonathan Cameron <jonathan.came...@huawei.com> wrote:

> On Wed, 26 Jul 2017 07:14:17 -0700
> "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> 
> > On Wed, Jul 26, 2017 at 01:28:01PM +0100, Jonathan Cameron wrote:  
> > > On Wed, 26 Jul 2017 10:32:32 +0100
> > > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> > >     
> > > > On Wed, 26 Jul 2017 09:16:23 +0100
> > > > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> > > > 
> > > > > On Tue, 25 Jul 2017 21:12:17 -0700
> > > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > >   
> > > > > > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:   
> > > > > >  
> > > > > > > From: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
> > > > > > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > > > > >   
> > > > > > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:   
> > > > > > > >
> > > > > > > >> Just to report, turning softlockup back on fixes things for me 
> > > > > > > >> on
> > > > > > > >> sparc64 too.  
> > > > > > > > 
> > > > > > > > Very good!
> > > > > > > >   
> > > > > > > >> The thing about softlockup is it runs an hrtimer, which seems 
> > > > > > > >> to run
> > > > > > > >> about every 4 seconds.  
> > > > > > > > 
> > > > > > > > I could see where that could shake things loose, but I am 
> > > > > > > > surprised that
> > > > > > > > it would be needed.  I ran a short run with 
> > > > > > > > CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > > > > > with no trouble, but I will be running a longer test later on.
> > > > > > > >   
> > > > > > > >> So I wonder if this is a NO_HZ problem.  
> > > > > > > > 
> > > > > > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  
> > > > > > > > What are
> > > > > > > > you running?  (Again, my symptoms are slightly different, so I 
> > > > > > > > might
> > > > > > > > be seeing a different bug.)  
> > > > > > > 
> > > > > > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > > > > > 
> > > > > > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR 
> > > > > > > disabled.  
> > > > > > 
> > > > > > Same here -- but my failure case happens fairly rarely, so it will 
> > > > > > take
> > > > > > some time to gain reasonable confidence that enabling 
> > > > > > SOFTLOCKUP_DETECTOR
> > > > > > had effect.
> > > > > > 
> > > > > > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > > > > > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > > > > > 
> > > > > > Thanx, Paul
> > > > > > 
> > > > > I'll be the headless chicken running around and trying as many tests
> > > > > as I can fit in.  Typical time to see the failure for us is sub 10
> > > > > minutes so we'll see how far we get.
> > > > > 
> > > > > Make me a list to run if you like ;)
> > > > > 
> > > > > NO_HZ_PERIODIC=y running now.  
> > > > By which I mean CONFIG_HZ_PERIODIC=y
> > 
> > I did get that messed up, didn't I?  Sorry for my confusion!
> >   
> > > > > Anyhow, run for 40 minutes without seeing a splat but my sanity check
> > > > > on the NO_HZ_FULL=n and NO_HZ_IDLE=y this morning took 20 minutes so
> > > > I won't have much confidence until we are a few hours in on this.
> > > > 
> > > > Anyhow, certainly looking like a promising direction for investigation!
> > > > 
> > > Well it's done over 3 hours without a splat so I think it is fine with
> > > CONFIG_HZ_PERIODIC=y
> > 
> > Thank you!
> > 
> > If you run with SOFTLOCKUP_DETECTOR=n and NO_HZ_IDLE=y, but have a normal
> > user task waking up every few seconds on each CPU, does the problem occur?
> > (The question is whether any disturbance gets things going, or whether there
> > is something special about SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y.)
> > 
> > Dave, any other ideas on what might be causing this or what might be
> > tested?
> > 
> > Thanx, Paul
> >   
> 
> Although it's still early days (40 mins in) it looks like the issue first
> occurred between 4.10-rc7 and 4.11-rc1 (don't ask why those particular RCs)
> 
> Bad as with current kernel on 4.11-rc1 and good on 4.10-rc7.
Didn't leave it long enough.  Still bad on 4.10-rc7, it just took over
an hour to occur.
> 
> Could be something different was hiding it in 4.10 though.  We have a fair
> delta from mainline back then unfortunately so bisecting will be
> 'interesting'.
> 
> I'll see if I can get the test you suggest running.
> 
> Jonathan
> ___
> linuxarm mailing list
> linux...@huawei.com
> http://rnd-openeuler.huawei.com/mailman/listinfo/linuxarm



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 07:14:17 -0700
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> On Wed, Jul 26, 2017 at 01:28:01PM +0100, Jonathan Cameron wrote:
> > On Wed, 26 Jul 2017 10:32:32 +0100
> > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> >   
> > > On Wed, 26 Jul 2017 09:16:23 +0100
> > > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> > >   
> > > > On Tue, 25 Jul 2017 21:12:17 -0700
> > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:  
> > > > > > From: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
> > > > > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > > > > 
> > > > > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote: 
> > > > > > >
> > > > > > >> Just to report, turning softlockup back on fixes things for me on
> > > > > > >> sparc64 too.
> > > > > > > 
> > > > > > > Very good!
> > > > > > > 
> > > > > > > >> The thing about softlockup is it runs an hrtimer, which seems to run
> > > > > > > >> about every 4 seconds.
> > > > > > > 
> > > > > > > > I could see where that could shake things loose, but I am surprised that
> > > > > > > > it would be needed.  I ran a short run with CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > > > > > with no trouble, but I will be running a longer test later on.
> > > > > > > 
> > > > > > >> So I wonder if this is a NO_HZ problem.
> > > > > > > 
> > > > > > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What are
> > > > > > > > you running?  (Again, my symptoms are slightly different, so I might
> > > > > > > > be seeing a different bug.)
> > > > > > 
> > > > > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > > > > 
> > > > > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled. 
> > > > > >
> > > > > 
> > > > > > Same here -- but my failure case happens fairly rarely, so it will take
> > > > > > some time to gain reasonable confidence that enabling SOFTLOCKUP_DETECTOR
> > > > > > had effect.
> > > > > 
> > > > > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > > > > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > > > > 
> > > > >   Thanx, Paul
> > > > >   
> > > > I'll be the headless chicken running around and trying as many tests
> > > > as I can fit in.  Typical time to see the failure for us is sub 10
> > > > minutes so we'll see how far we get.
> > > > 
> > > > Make me a list to run if you like ;)
> > > > 
> > > > NO_HZ_PERIODIC=y running now.
> > > By which I mean CONFIG_HZ_PERIODIC=y  
> 
> I did get that messed up, didn't I?  Sorry for my confusion!
> 
> > > > Anyhow, run for 40 minutes without seeing a splat but my sanity check
> > > > on the NO_HZ_FULL=n and NO_HZ_IDLE=y this morning took 20 minutes so
> > > I won't have much confidence until we are a few hours in on this.
> > > 
> > > Anyhow, certainly looking like a promising direction for investigation!
> > >   
> > Well it's done over 3 hours without a splat so I think it is fine with
> > CONFIG_HZ_PERIODIC=y  
> 
> Thank you!
> 
> If you run with SOFTLOCKUP_DETECTOR=n and NO_HZ_IDLE=y, but have a normal
> user task waking up every few seconds on each CPU, does the problem occur?
> (The question is whether any disturbance gets things going, or whether there
> is something special about SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y.)
> 
> Dave, any other ideas on what might be causing this or what might be
> tested?
> 
>   Thanx, Paul
> 

Although it's still early days (40 mins in) it looks like the issue first
occurred between 4.10-rc7 and 4.11-rc1 (don't ask why those particular RCs)

Bad as with current kernel on 4.11-rc1 and good on 4.10-rc7.

Could be something different was hiding it in 4.10 though.  We have a fair
delta from mainline back then unfortunately so bisecting will be
'interesting'.

I'll see if I can get the test you suggest running.
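
A minimal sketch of that test, assuming Python 3 on Linux: one thread per
allowed CPU pins itself with os.sched_setaffinity() and then sleeps in a
loop, so each CPU gets a plain user-space wakeup every few seconds.  The
names waker/run_wakers and the 4 second default (picked to roughly match
the softlockup hrtimer period mentioned above) are illustrative assumptions,
not something anyone in the thread has run.

```python
import os
import threading
import time


def waker(cpu, interval_s, wakeups, counts):
    # On Linux, pid 0 means "the calling thread", so only this thread is pinned.
    os.sched_setaffinity(0, {cpu})
    for _ in range(wakeups):
        time.sleep(interval_s)  # the thread can only be woken on its pinned CPU
        counts[cpu] += 1        # each thread touches only its own key


def run_wakers(interval_s=4.0, wakeups=3):
    """Start one periodic waker per CPU this process may run on."""
    cpus = sorted(os.sched_getaffinity(0))
    counts = {cpu: 0 for cpu in cpus}
    threads = [
        threading.Thread(target=waker, args=(cpu, interval_s, wakeups, counts))
        for cpu in cpus
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Something like run_wakers(interval_s=4.0, wakeups=900) would keep every CPU
waking up for about an hour while a SOFTLOCKUP_DETECTOR=n, NO_HZ_IDLE=y
kernel is under test.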

Jonathan


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 13:28:01 +0100
Jonathan Cameron <jonathan.came...@huawei.com> wrote:

> On Wed, 26 Jul 2017 10:32:32 +0100
> Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> 
> > On Wed, 26 Jul 2017 09:16:23 +0100
> > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> >   
> > > On Tue, 25 Jul 2017 21:12:17 -0700
> > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > 
> > > > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:  
> > > > > From: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
> > > > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > > > 
> > > > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:   
> > > > > >  
> > > > > >> Just to report, turning softlockup back on fixes things for me on
> > > > > >> sparc64 too.
> > > > > > 
> > > > > > Very good!
> > > > > > 
> > > > > > >> The thing about softlockup is it runs an hrtimer, which seems to run
> > > > > > >> about every 4 seconds.
> > > > > > 
> > > > > > > I could see where that could shake things loose, but I am surprised that
> > > > > > > it would be needed.  I ran a short run with CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > > > > with no trouble, but I will be running a longer test later on.
> > > > > > 
> > > > > >> So I wonder if this is a NO_HZ problem.
> > > > > > 
> > > > > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What are
> > > > > > you running?  (Again, my symptoms are slightly different, so I might
> > > > > > be seeing a different bug.)
> > > > > 
> > > > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > > > 
> > > > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled.   
> > > > >  
> > > > 
> > > > Same here -- but my failure case happens fairly rarely, so it will take
> > > > > some time to gain reasonable confidence that enabling SOFTLOCKUP_DETECTOR
> > > > had effect.
> > > > 
> > > > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > > > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > > > 
> > > > Thanx, Paul
> > > >   
> > > I'll be the headless chicken running around and trying as many tests
> > > as I can fit in.  Typical time to see the failure for us is sub 10
> > > minutes so we'll see how far we get.
> > > 
> > > Make me a list to run if you like ;)
> > > 
> > > NO_HZ_PERIODIC=y running now.
> > By which I mean CONFIG_HZ_PERIODIC=y
> > 
> > Anyhow, run for 40 minutes without seeing a splat but my sanity check
> > on the NO_HZ_FULL=n and NO_HZ_IDLE=y this morning took 20 minutes so
> > I won't have much confidence until we are a few hours in on this.
> > 
> > Anyhow, certainly looking like a promising direction for investigation!
> >   
> Well it's done over 3 hours without a splat so I think it is fine with
> CONFIG_HZ_PERIODIC=y
> 
As I think we expected, the problem occurs with NO_HZ_FULL.
Happened pretty quickly but given the somewhat random nature,
might just be coincidence.

Jonathan
> 
> > Jonathan
> >   
> > > 
> > > Jonathan
> > > 




Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 10:32:32 +0100
Jonathan Cameron <jonathan.came...@huawei.com> wrote:

> On Wed, 26 Jul 2017 09:16:23 +0100
> Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> 
> > On Tue, 25 Jul 2017 21:12:17 -0700
> > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> >   
> > > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:
> > > > From: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
> > > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > >   
> > > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:  
> > > > >> Just to report, turning softlockup back on fixes things for me on
> > > > >> sparc64 too.  
> > > > > 
> > > > > Very good!
> > > > >   
> > > > >> The thing about softlockup is it runs an hrtimer, which seems to run
> > > > >> about every 4 seconds.  
> > > > > 
> > > > > > I could see where that could shake things loose, but I am surprised that
> > > > > > it would be needed.  I ran a short run with CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > > > with no trouble, but I will be running a longer test later on.
> > > > >   
> > > > >> So I wonder if this is a NO_HZ problem.  
> > > > > 
> > > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What are
> > > > > you running?  (Again, my symptoms are slightly different, so I might
> > > > > be seeing a different bug.)  
> > > > 
> > > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > > 
> > > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled. 
> > > >  
> > > 
> > > Same here -- but my failure case happens fairly rarely, so it will take
> > > some time to gain reasonable confidence that enabling SOFTLOCKUP_DETECTOR
> > > had effect.
> > > 
> > > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > > 
> > >   Thanx, Paul
> > > 
> > I'll be the headless chicken running around and trying as many tests
> > as I can fit in.  Typical time to see the failure for us is sub 10
> > minutes so we'll see how far we get.
> > 
> > Make me a list to run if you like ;)
> > 
> > NO_HZ_PERIODIC=y running now.  
> By which I mean CONFIG_HZ_PERIODIC=y
> 
> Anyhow, run for 40 minutes without seeing a splat but my sanity check
> on the NO_HZ_FULL=n and NO_HZ_IDLE=y this morning took 20 minutes so
> I won't have much confidence until we are a few hours in on this.
> 
> Anyhow, certainly looking like a promising direction for investigation!
> 
Well it's done over 3 hours without a splat so I think it is fine with
CONFIG_HZ_PERIODIC=y


> Jonathan
> 
> > 
> > Jonathan
> > 




Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 09:16:23 +0100
Jonathan Cameron <jonathan.came...@huawei.com> wrote:

> On Tue, 25 Jul 2017 21:12:17 -0700
> "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> 
> > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:  
> > > From: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
> > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > 
> > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:
> > > >> Just to report, turning softlockup back on fixes things for me on
> > > >> sparc64 too.
> > > > 
> > > > Very good!
> > > > 
> > > >> The thing about softlockup is it runs an hrtimer, which seems to run
> > > >> about every 4 seconds.
> > > > 
> > > > I could see where that could shake things loose, but I am surprised that
> > > > it would be needed.  I ran a short run with CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > with no trouble, but I will be running a longer test later on.
> > > > 
> > > >> So I wonder if this is a NO_HZ problem.
> > > > 
> > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What are
> > > > you running?  (Again, my symptoms are slightly different, so I might
> > > > be seeing a different bug.)
> > > 
> > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > 
> > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled.
> > 
> > Same here -- but my failure case happens fairly rarely, so it will take
> > some time to gain reasonable confidence that enabling SOFTLOCKUP_DETECTOR
> > had effect.
> > 
> > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > 
> > Thanx, Paul
> >   
> I'll be the headless chicken running around and trying as many tests
> as I can fit in.  Typical time to see the failure for us is sub 10
> minutes so we'll see how far we get.
> 
> Make me a list to run if you like ;)
> 
> NO_HZ_PERIODIC=y running now.
By which I mean CONFIG_HZ_PERIODIC=y
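
Since the option names are easy to mix up, here is a small sketch (Python;
the helper name tick_settings is made up for illustration) that reports
which of the tick-related options discussed in this thread are set to y in
a kernel .config.  It only looks at the four symbols named above and treats
anything not set to y as off.

```python
def tick_settings(config_text):
    """Return {option: bool} for the tick/watchdog options in a .config."""
    wanted = (
        "CONFIG_HZ_PERIODIC",
        "CONFIG_NO_HZ_IDLE",
        "CONFIG_NO_HZ_FULL",
        "CONFIG_SOFTLOCKUP_DETECTOR",
    )
    settings = {name: False for name in wanted}
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" comments and blank lines leave the default.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name in settings:
            settings[name] = (value == "y")
    return settings
```

Feeding it the text of the .config used for each run makes it easy to log
which tick mode a given result belongs to.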

Anyhow, run for 40 minutes without seeing a splat but my sanity check
on the NO_HZ_FULL=n and NO_HZ_IDLE=y this morning took 20 minutes so
I won't have much confidence until we are a few hours in on this.

Anyhow, certainly looking like a promising direction for investigation!

Jonathan

> 
> Jonathan
> 




Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Tue, 25 Jul 2017 21:12:17 -0700
"Paul E. McKenney"  wrote:

> On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:
> > From: "Paul E. McKenney" 
> > Date: Tue, 25 Jul 2017 20:55:45 -0700
> >   
> > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:  
> > >> Just to report, turning softlockup back on fixes things for me on
> > >> sparc64 too.  
> > > 
> > > Very good!
> > >   
> > >> The thing about softlockup is it runs an hrtimer, which seems to run
> > >> about every 4 seconds.  
> > > 
> > > I could see where that could shake things loose, but I am surprised that
> > > it would be needed.  I ran a short run with CONFIG_SOFTLOCKUP_DETECTOR=y
> > > with no trouble, but I will be running a longer test later on.
> > >   
> > >> So I wonder if this is a NO_HZ problem.  
> > > 
> > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What are
> > > you running?  (Again, my symptoms are slightly different, so I might
> > > be seeing a different bug.)  
> > 
> > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > 
> > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled.  
> 
> Same here -- but my failure case happens fairly rarely, so it will take
> some time to gain reasonable confidence that enabling SOFTLOCKUP_DETECTOR
> had effect.
> 
> But you are right, might be interesting to try NO_HZ_PERIODIC=y
> or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> 
>   Thanx, Paul
> 
I'll be the headless chicken running around and trying as many tests
as I can fit in.  Typical time to see the failure for us is sub 10
minutes so we'll see how far we get.

Make me a list to run if you like ;)

NO_HZ_PERIODIC=y running now.

Jonathan



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Tue, 25 Jul 2017 20:53:06 -0700
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> On Wed, Jul 26, 2017 at 12:52:07AM +0800, Jonathan Cameron wrote:
> > On Tue, 25 Jul 2017 08:12:45 -0700
> > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> >   
> > > On Tue, Jul 25, 2017 at 10:42:45PM +0800, Jonathan Cameron wrote:  
> > > > On Tue, 25 Jul 2017 06:46:26 -0700
> > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > On Tue, Jul 25, 2017 at 10:26:54PM +1000, Nicholas Piggin wrote:
> > > > > > On Tue, 25 Jul 2017 19:32:10 +0800
> > > > > > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> > > > > >   
> > > > > > > Hi All,
> > > > > > > 
> > > > > > > We observed a regression on our d05 boards (but curiously not
> > > > > > > the fairly similar but single socket / smaller core count
> > > > > > > d03), initially seen with linux-next prior to the merge window
> > > > > > > and still present in v4.13-rc2.
> > > > > > > 
> > > > > > > The symptom is:  
> > > > > 
> > > > > Adding Dave Miller and the sparcli...@vger.kernel.org email on CC, as
> > > > > they have been seeing something similar, and you might well have saved
> > > > > them the trouble of bisecting.
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > [ 1984.628602] rcu_preempt kthread starved for 5663 jiffies! 
> > > > > > > g1566 c1565 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1  
> > > > > 
> > > > > > This is the cause from an RCU perspective.  You had a lot of idle CPUs,
> > > > > > and RCU is not permitted to disturb them -- the battery-powered embedded
> > > > > > guys get very annoyed by that sort of thing.  What happens instead is
> > > > > > that each CPU updates a per-CPU state variable when entering or exiting
> > > > > > idle, and the grace-period kthread ("rcu_preempt kthread" in the above
> > > > > > message) checks these state variables, and if it sees an idle CPU,
> > > > > > it reports a quiescent state on that CPU's behalf.
> > > > > > 
> > > > > > But the grace-period kthread can only do this work if it gets a chance
> > > > > > to run.  And the message above says that this kthread hasn't had a chance
> > > > > > to run for a full 5,663 jiffies.  For completeness, the "g1566 c1565"
> > > > > > says that grace period #1566 is in progress, the "f0x0" says that no one
> > > > > > is needing another grace period #1567.  The "RCU_GP_WAIT_FQS(3)" says
> > > > > > that the grace-period kthread has fully initialized the current grace
> > > > > > period and is sleeping for a few jiffies waiting to scan for idle tasks.
> > > > > > Finally, the "->state=0x1" says that the grace-period kthread is in
> > > > > > TASK_INTERRUPTIBLE state, in other words, still sleeping.
> > > > 
> > > > Thanks for the explanation!
> > > > > 
> > > > > So my first question is "What did commit 05a4a9527 (kernel/watchdog:
> > > > > split up config options) do to prevent the grace-period kthread from
> > > > > getting a chance to run?" 
> > > > 
> > > > As far as we can tell it was a side effect of that patch.
> > > > 
> > > > > The real cause is that patch changed the result of defconfigs to stop running
> > > > > the softlockup detector - now CONFIG_SOFTLOCKUP_DETECTOR
> > > > 
> > > > Enabling that on 4.13-rc2 (and presumably everything in between)
> > > > means we don't see the problem any more.
> > > > 
> > > > > I must confess that I don't see anything
> > > > > obvious in that commit, so my second question is "Are we sure that
> > > > > reverting this commit makes the problem go away?"
> > > > 
> > > > Simply enabling CONFIG_SOFTLOCKUP_DETECTOR seems to make it go away.
> > > > That detector fires up a thread on every cpu, which m

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-25 Thread Jonathan Cameron
On Tue, 25 Jul 2017 08:12:45 -0700
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> On Tue, Jul 25, 2017 at 10:42:45PM +0800, Jonathan Cameron wrote:
> > On Tue, 25 Jul 2017 06:46:26 -0700
> > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> >   
> > > On Tue, Jul 25, 2017 at 10:26:54PM +1000, Nicholas Piggin wrote:  
> > > > On Tue, 25 Jul 2017 19:32:10 +0800
> > > > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> > > > 
> > > > > Hi All,
> > > > > 
> > > > > We observed a regression on our d05 boards (but curiously not
> > > > > the fairly similar but single socket / smaller core count
> > > > > d03), initially seen with linux-next prior to the merge window
> > > > > and still present in v4.13-rc2.
> > > > > 
> > > > > The symptom is:
> > > 
> > > Adding Dave Miller and the sparcli...@vger.kernel.org email on CC, as
> > > they have been seeing something similar, and you might well have saved
> > > them the trouble of bisecting.
> > > 
> > > [ . . . ]
> > >   
> > > > > [ 1984.628602] rcu_preempt kthread starved for 5663 jiffies! g1566 
> > > > > c1565 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> > > 
> > > This is the cause from an RCU perspective.  You had a lot of idle CPUs,
> > > and RCU is not permitted to disturb them -- the battery-powered embedded
> > > guys get very annoyed by that sort of thing.  What happens instead is
> > > that each CPU updates a per-CPU state variable when entering or exiting
> > > idle, and the grace-period kthread ("rcu_preempt kthread" in the above
> > > > message) checks these state variables, and if it sees an idle CPU,
> > > it reports a quiescent state on that CPU's behalf.
> > > 
> > > But the grace-period kthread can only do this work if it gets a chance
> > > to run.  And the message above says that this kthread hasn't had a chance
> > > to run for a full 5,663 jiffies.  For completeness, the "g1566 c1565"
> > > says that grace period #1566 is in progress, the "f0x0" says that no one
> > > is needing another grace period #1567.  The "RCU_GP_WAIT_FQS(3)" says
> > > that the grace-period kthread has fully initialized the current grace
> > > period and is sleeping for a few jiffies waiting to scan for idle tasks.
> > > Finally, the "->state=0x1" says that the grace-period kthread is in
> > > TASK_INTERRUPTIBLE state, in other words, still sleeping.  
> > 
> > Thanks for the explanation!  
> > > 
> > > So my first question is "What did commit 05a4a9527 (kernel/watchdog:
> > > split up config options) do to prevent the grace-period kthread from
> > > getting a chance to run?"   
> > 
> > As far as we can tell it was a side effect of that patch.
> > 
> > > The real cause is that patch changed the result of defconfigs to stop running
> > > the softlockup detector - now CONFIG_SOFTLOCKUP_DETECTOR
> > 
> > Enabling that on 4.13-rc2 (and presumably everything in between)
> > means we don't see the problem any more.
> >   
> > > I must confess that I don't see anything
> > > obvious in that commit, so my second question is "Are we sure that
> > > reverting this commit makes the problem go away?"  
> > 
> > Simply enabling CONFIG_SOFTLOCKUP_DETECTOR seems to make it go away.
> > That detector fires up a thread on every cpu, which may be relevant.  
> 
> Interesting...  Why should it be necessary to fire up a thread on every
> CPU in order to make sure that RCU's grace-period kthreads get some
> CPU time?  Especially given how many idle CPUs you had on your system.
> 
> So I have to ask if there is some other bug that the softlockup detector
> is masking.
I am thinking the same.  We can try going back further than 4.12 tomorrow
(we think we can realistically go back to 4.8 and possibly 4.6
with this board)
> 
> > > and my third is "Is
> > > this an intermittent problem that led to a false bisection?"  
> > 
> > Whilst it is a bit slow to occur, we verified with long runs on either
> > side of that patch and since with the option enabled on latest mainline.
> > 
> > Also can cause the issue before that patch by disabling the previous
> > relevant option on 4.12.  
> 
> OK, thank you -- hard to argue with that!  ;-)
We thought it was a pretty unlikely a bisec

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-25 Thread Jonathan Cameron
On Tue, 25 Jul 2017 06:46:26 -0700
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> On Tue, Jul 25, 2017 at 10:26:54PM +1000, Nicholas Piggin wrote:
> > On Tue, 25 Jul 2017 19:32:10 +0800
> > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> >   
> > > Hi All,
> > > 
> > > We observed a regression on our d05 boards (but curiously not
> > > the fairly similar but single socket / smaller core count
> > > d03), initially seen with linux-next prior to the merge window
> > > and still present in v4.13-rc2.
> > > 
> > > The symptom is:  
> 
> Adding Dave Miller and the sparcli...@vger.kernel.org email on CC, as
> they have been seeing something similar, and you might well have saved
> them the trouble of bisecting.
> 
> [ . . . ]
> 
> > > [ 1984.628602] rcu_preempt kthread starved for 5663 jiffies! g1566 c1565 
> > > f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1  
> 
> This is the cause from an RCU perspective.  You had a lot of idle CPUs,
> and RCU is not permitted to disturb them -- the battery-powered embedded
> guys get very annoyed by that sort of thing.  What happens instead is
> that each CPU updates a per-CPU state variable when entering or exiting
> idle, and the grace-period kthread ("rcu_preempt kthread" in the above
> message) checks these state variables, and if it sees an idle CPU,
> it reports a quiescent state on that CPU's behalf.
> 
> But the grace-period kthread can only do this work if it gets a chance
> to run.  And the message above says that this kthread hasn't had a chance
> to run for a full 5,663 jiffies.  For completeness, the "g1566 c1565"
> says that grace period #1566 is in progress, the "f0x0" says that no one
> is needing another grace period #1567.  The "RCU_GP_WAIT_FQS(3)" says
> that the grace-period kthread has fully initialized the current grace
> period and is sleeping for a few jiffies waiting to scan for idle tasks.
> Finally, the "->state=0x1" says that the grace-period kthread is in
> TASK_INTERRUPTIBLE state, in other words, still sleeping.
Thanks for the explanation!
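
To make the field breakdown above concrete, a rough sketch in Python that
pulls the same fields out of the stall line.  The regular expression only
targets the format quoted in this thread, the function name decode_stall is
illustrative, and the HZ value used to convert jiffies to seconds is an
assumption (250 here), since CONFIG_HZ is not stated anywhere above.  Reading
cNNNN as the last completed grace period is also an assumption on my part.

```python
import re

# Matches the stall line as quoted in this thread; other kernel versions
# may print a different layout.
STALL_RE = re.compile(
    r"(?P<kthread>\S+) kthread starved for (?P<jiffies>\d+) jiffies!\s+"
    r"g(?P<gp>\d+)\s+c(?P<completed>\d+)\s+f(?P<flags>0x[0-9a-fA-F]+)\s+"
    r"(?P<gp_state>\w+)\((?P<gp_state_num>\d+)\)\s+"
    r"->state=(?P<task_state>0x[0-9a-fA-F]+)"
)


def decode_stall(line, hz=250):
    """Decode one 'kthread starved' line; hz is an assumed CONFIG_HZ."""
    m = STALL_RE.search(line)
    if m is None:
        return None
    jiffies = int(m["jiffies"])
    return {
        "kthread": m["kthread"],
        "starved_jiffies": jiffies,
        "starved_seconds": jiffies / hz,         # depends on the assumed hz
        "gp_in_progress": int(m["gp"]),          # "g1566"
        "gp_completed": int(m["completed"]),     # "c1565" (assumed meaning)
        "gp_flags": int(m["flags"], 16),         # "f0x0": no further GP needed
        "gp_state": m["gp_state"],               # "RCU_GP_WAIT_FQS"
        "task_state": int(m["task_state"], 16),  # 0x1 == TASK_INTERRUPTIBLE
    }
```

With hz=250 the 5663 jiffies above come out at a bit over 22 seconds; pass
the real CONFIG_HZ of the kernel under test to get a meaningful number.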
> 
> So my first question is "What did commit 05a4a9527 (kernel/watchdog:
> split up config options) do to prevent the grace-period kthread from
> getting a chance to run?" 

As far as we can tell it was a side effect of that patch.

The real cause is that patch changed the result of defconfigs to stop running
the softlockup detector - now CONFIG_SOFTLOCKUP_DETECTOR

Enabling that on 4.13-rc2 (and presumably everything in between)
means we don't see the problem any more.

> I must confess that I don't see anything
> obvious in that commit, so my second question is "Are we sure that
> reverting this commit makes the problem go away?"
Simply enabling CONFIG_SOFTLOCKUP_DETECTOR seems to make it go away.
That detector fires up a thread on every cpu, which may be relevant.

> and my third is "Is
> this an intermittent problem that led to a false bisection?"
Whilst it is a bit slow to occur, we verified with long runs on either
side of that patch and since with the option enabled on latest mainline.

Also can cause the issue before that patch by disabling the previous
relevant option on 4.12.

> 
> [ . . . ]
> 
> > > Reducing the RCU CPU stall timeout makes it happen more often,
> > > > but we are seeing it even with the default value of 24 seconds.
> > > 
> > > > Tends to occur after a period of relatively low usage, but has
> > > also been seen mid way through performance tests.
> > > 
> > > > This was not seen with v4.12 so a bisection run later led to
> > > commit 05a4a9527 (kernel/watchdog: split up config options).
> > > 
> > > Which was odd until we discovered that a side effect of this patch
> > > was to change whether the softlockup detector was enabled or not in
> > > the arm64 defconfig.
> > > 
> > > On 4.13-rc2 enabling the softlockup detector indeed stopped us
> > > seeing the rcu issue. Disabling the equivalent on 4.12 made the
> > > issue occur there as well.
> > > 
> > > Clearly the softlockup detector results in a thread on every cpu,
> > > which might be related but beyond that we are still looking into
> > > the issue.
> > > 
> > > So the obvious question is whether anyone else is seeing this as
> > > it might help us to focus in on where to look!  
> > 
> > Huh. Something similar has been seen very intermittently on powerpc
> > as well. We couldn't reproduce it reliably to bisect it already, so
> > this is a good help.
> > 
> > 

Re: [PATCH 1/3] ABI: fix some syntax issues at the ABI database

2016-10-30 Thread Jonathan Cameron
On 28/10/16 13:19, Mauro Carvalho Chehab wrote:
> On those three files, the ABI representation described at
> README is violated.
> 
> - at sysfs-bus-iio-proximity-as3935:
>   a ':' character is missing after "What"
> 
> - at sysfs-class-devfreq:
>   there's a typo at Description
> 
> - at sysfs-class-cxl, it is using the ":" character at a
>   file preamble, causing it to be misinterpreted as a
>   tag.
> 
> - On the other files, instead of "What", they use "Where".
> 
> Signed-off-by: Mauro Carvalho Chehab <mche...@s-opensource.com>
Acked-by: Jonathan Cameron <ji...@kernel.org> for the iio one.

As an aside, I think that hm6352 is probably the docs for the hmc6352 driver in
misc.  Hence wrong filename perhaps?

Thanks,
Jonathan
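
As an illustration of the two tag problems the patch fixes ("Where:" in
place of "What:", and "What" without the ':'), a minimal Python sketch of a
checker.  It is a hypothetical helper, not a tool from the kernel tree, and
it deliberately ignores the other issues listed in the commit message.

```python
import re


def abi_tag_problems(text):
    """Return (line_number, message) pairs for malformed What tags."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if re.match(r"Where:", line):
            problems.append((lineno, 'uses "Where:" instead of "What:"'))
        elif re.match(r"What\s+/", line):
            problems.append((lineno, 'missing ":" after "What"'))
    return problems
```

Run over the text of a file under Documentation/ABI/, it returns an empty
list when neither problem is present.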

> ---
>  Documentation/ABI/testing/pstore   |  2 +-
>  .../testing/sysfs-bus-event_source-devices-format  |  2 +-
>  .../ABI/testing/sysfs-bus-i2c-devices-hm6352   |  6 +++---
>  .../ABI/testing/sysfs-bus-iio-proximity-as3935 |  4 ++--
>  .../ABI/testing/sysfs-bus-pci-devices-cciss| 22 +++---
>  .../ABI/testing/sysfs-bus-usb-devices-usbsevseg| 12 ++--
>  Documentation/ABI/testing/sysfs-class-cxl  |  6 +++---
>  Documentation/ABI/testing/sysfs-class-devfreq  |  2 +-
>  8 files changed, 28 insertions(+), 28 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/pstore b/Documentation/ABI/testing/pstore
> index 5fca9f5e10a3..8d6e48f4e8ef 100644
> --- a/Documentation/ABI/testing/pstore
> +++ b/Documentation/ABI/testing/pstore
> @@ -1,4 +1,4 @@
> -Where:   /sys/fs/pstore/... (or /dev/pstore/...)
> +What:/sys/fs/pstore/... (or /dev/pstore/...)
>  Date:March 2011
>  Kernel Version: 2.6.39
>  Contact: tony.l...@intel.com
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
> index 77f47ff5ee02..b6f8748e0200 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
> @@ -1,4 +1,4 @@
> -Where:   /sys/bus/event_source/devices//format
> +What:/sys/bus/event_source/devices//format
>  Date:January 2012
>  Kernel Version: 3.3
>  Contact: Jiri Olsa <jo...@redhat.com>
> diff --git a/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352 b/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352
> index feb2e4a87075..29bd447e50a0 100644
> --- a/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352
> +++ b/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352
> @@ -1,18 +1,18 @@
> -Where:   /sys/bus/i2c/devices/.../heading0_input
> +What:/sys/bus/i2c/devices/.../heading0_input
>  Date:April 2010
>  Kernel Version: 2.6.36?
>  Contact: alan@intel.com
>  Description: Reports the current heading from the compass as a floating
>   point value in degrees.
>  
> -Where:   /sys/bus/i2c/devices/.../power_state
> +What:/sys/bus/i2c/devices/.../power_state
>  Date:April 2010
>  Kernel Version: 2.6.36?
>  Contact: alan@intel.com
>  Description: Sets the power state of the device. 0 sets the device into
>   sleep mode, 1 wakes it up.
>  
> -Where:   /sys/bus/i2c/devices/.../calibration
> +What:/sys/bus/i2c/devices/.../calibration
>  Date:April 2010
>  Kernel Version: 2.6.36?
>  Contact: alan@intel.com
> diff --git a/Documentation/ABI/testing/sysfs-bus-iio-proximity-as3935 b/Documentation/ABI/testing/sysfs-bus-iio-proximity-as3935
> index 33e96f740639..61a3c9fed07d 100644
> --- a/Documentation/ABI/testing/sysfs-bus-iio-proximity-as3935
> +++ b/Documentation/ABI/testing/sysfs-bus-iio-proximity-as3935
> @@ -1,4 +1,4 @@
> -What /sys/bus/iio/devices/iio:deviceX/in_proximity_input
> +What:/sys/bus/iio/devices/iio:deviceX/in_proximity_input
>  Date:March 2014
>  KernelVersion:   3.15
>  Contact: Matt Ranostay <mranos...@gmail.com>
> @@ -6,7 +6,7 @@ Description:
>   Get the current distance in meters of storm (1km steps)
>   1000-4 = distance in meters
>  
> -What /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity
> +What:/sys/bus/iio/devices/iio:deviceX/sensor_sensitivity
>  Date:March 2014
>  KernelVersion:   3.15
>  Contact: Matt Ranostay <mranos...@gmail.com>
> diff --git